LINGUIST List 35.936

Thu Mar 14 2024

Review: The Language of Fake News: Grieve & Woodfield (2023)

Editor for this issue: Justin Fuller <justin@linguistlist.org>

LINGUIST List is hosted by Indiana University College of Arts and Sciences.



Date: 15-Mar-2024
From: Elizabeth Craig <eccraig@uga.edu>
Subject: Forensic Linguistics: Grieve & Woodfield (2023)

Book announced at https://linguistlist.org/issues/34.3166

AUTHOR: Jack Grieve
AUTHOR: Helena Woodfield
TITLE: The Language of Fake News
SERIES TITLE: Elements in Forensic Linguistics
PUBLISHER: Cambridge University Press
YEAR: 2023

REVIEWER: Elizabeth Craig

SUMMARY

The Language of Fake News was authored by Jack Grieve (University of Birmingham, UK), a quantitative corpus linguist broadly interested in dialectology, author identification, and language change, and Helena Woodfield, a doctoral researcher specializing in fake news. They present a concise outline of the linguistic characteristics of fake news from the perspective of register variation. Such a study falls under the purview of forensic linguistics in that it investigates the distinctive grammatical features of the news articles of one author, Jayson Blair, who deliberately produced both true and false reports for The New York Times (NYT) in the early 2000s. The celebrated journalist was forced to resign after questions arose about his ‘factual’ accounts and colleagues at The Washington Post, a major rival to his publication, raised suspicions of plagiarism. The discovery of the extent of his deceit eventually led to the departure of Blair and two of his editors.

In the introductory chapter, the authors offer a clarifying definition: fake news is not merely false information; it must also be intentionally deceptive. They further contend that the “distinctive communicative functions” of real as opposed to fake news involve the use of differing linguistic structures. Because one is meant to inform and the other to deceive, we should expect them to employ disparate grammatical forms. The hope is that, given these differences, a purely linguistic analysis could aid in detecting intentional deception.

The term ‘fake news’ became a popular accusation during the Hillary Clinton/Donald Trump campaigns for the US presidency in 2016, but the authors submit that fake news is as old as the news itself. The advent of the internet, a 24/7 news cycle, and news-as-entertainment (directed at selling ads) have made it far more pervasive. As a result, distrust of both government and media has become widespread, and everyone can choose where they get their news, be it true or false. The authors feel that these three points--the widespread dissemination of fake news, its social impact, and its distinctive linguistic characteristics--are what make such a study so important now.

They further wish to differentiate the present study from earlier work on fake news: previous studies in natural language processing (NLP) tended to focus on language content, meaning, and topic, rather than on linguistic structure. The present study potentially offers more promise for “automatically classifying” news as genuine or fake because it relies on abstract, objective categorizations (parts of speech as “principled sets of linguistic features”), rather than on superficial word or phrase choices, as in the earlier studies. The authors suggest that each methodology should serve to substantiate rather than replace the other.

The second chapter presents a critical review of past research on the language of fake news, with a focus on its shortcomings. The authors begin by describing the limitations of veracity-based studies utilizing NLP methods. They feel that machine learning systems can determine only whether a news story is false, not whether it is fake, because the method analyzes language content without accounting for register variation or for disinformation, i.e. intentional deceit. Grieve and Woodfield propose that their framework, by focusing on linguistic structure, takes these factors into account through a comparison of the linguistic structures in two parallel corpora from the same author, and further that this methodology offers the kind of explanation required to distinguish disinformation from mere misinformation, i.e. fake news from false news. The authors take issue with treating disinformation as a subcategory of misinformation, arguing that it should be viewed as qualitatively distinct: “(P)eople can inadvertently communicate falsehoods when they intend to share accurate information, and this should not be confused with lying…people can also state the truth when they intend to deceive if they are misinformed themselves” (pp. 12-13).

The problem with NLP methods is that texts are judged along only one dimension, whether they are true, while the author’s honesty is a second dimension that must also be considered. Another issue with the veracity-based approaches that dominate current research on fake news is that whole articles may be classified as either fake or true, a binary distinction, even though they may contain a mix of true and false statements. Furthermore, such judgments are highly subjective. These researchers set out to focus on untrue news by a single author that was intended to deceive. By analyzing the language usage of only one author, they can be certain they are not dealing with differences in register and/or dialect: the only difference between the two corpora compared is that one is true and honest (genuine), and the other is false and dishonest (fake). All other differences are controlled for; this is what is meant by a corpus being a principled collection of texts. A principle guiding this research is that studies on register variation have demonstrated clear and systematic grammatical differences depending on contexts of use and function (Biber 1988).

Chapter Three recounts Jayson Blair’s stint at the NYT: his rapid climb and how his downfall resulted from a series of discrepancies noted by several of his colleagues. Blair experienced a relatively swift rise in his career as a newspaper journalist, which some attributed to his race (African American), and he was soon promoted to the National Desk in 2002, where he covered the DC Sniper case and the Iraq War.

Chapter Four describes the building of the two parallel corpora for comparing the language structures used in the genuine versus the fake news stories, a classification determined by the newspaper itself. The authors decided to include only articles written by Blair during the six-month period under scrutiny by his employer and removed any articles that were co-authored. Only the main text of the articles appears in the two corpora, with no titles, captions, etc., which would not be in complete sentences and would make grammatical categorization by an automatic tagger problematic. Also, short articles (under 300 words) were discarded, leaving only those between 321 and 1,825 words in length. The total corpus comes to just under 57,000 words, with roughly a 60%/40% split between fake and true stories (36 and 28 articles, respectively). Only about half of Blair’s articles on the DC Sniper were fake, whereas all his articles on the Iraq War, which came later, were fake. Graphs show that Blair became more prolific and more dishonest over time.
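
To make the selection criteria concrete, here is a minimal sketch in Python of the kind of filtering pipeline the chapter describes; the record fields (text, coauthored, label) and the function itself are my illustrative assumptions, not the authors’ actual code.

    # Hypothetical sketch of the corpus-construction filters described above.
    def build_corpus(articles, min_words=300):
        """Keep the main text of single-authored articles of at least min_words words."""
        corpus = {"fake": [], "genuine": []}
        for article in articles:
            if article["coauthored"]:
                continue  # co-authored pieces are excluded
            body = article["text"]  # main text only: no titles, captions, etc.
            if len(body.split()) < min_words:
                continue  # short articles are discarded
            corpus[article["label"]].append(body)
        return corpus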

The authors admit here to three shortcomings of this corpus: 1) it is extremely small and therefore not conducive to rigorous statistical analysis; 2) there is a large difference in the rate of real vs. fake news depending on topic, which could point to register variation as a factor; and 3) it represents the writings of only one author, though this is by design and is meant to lend credence to the findings.

Chapter Five quantifies the main grammatical features of each of the two corpora and seeks to explain the differences between them. The authors examine the relative frequencies of 49 grammatical features (each measured per 100 words) and establish 28 that represent ‘non-negligible’ differences. In general, “when Blair is telling the truth, he tends to write more densely and with greater conviction” (p. 38). An automated multidimensional analysis tagger with a claimed accuracy rate of 90% is applied so that prior insights from register analysis (Biber 1988) can be exploited, but the authors submit that “any sufficiently accurate part-of-speech tagger would allow for similar patterns to be broadly observed” (p. 38).
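
For readers unfamiliar with the normalization, a feature’s relative frequency per 100 words is simply its raw count scaled by text length; the following Python function is my illustration of the arithmetic, not the authors’ code, and assumes the tagging has already been done.

    # Relative frequency of a tagged feature, normalized per 100 words.
    def per_100_words(feature_count, total_words):
        return 100 * feature_count / total_words

    # e.g., 42 nominalizations in a 1,200-word article:
    # per_100_words(42, 1200) yields 3.5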

The degree of difference in the usage of each grammatical structure between the two corpora is measured with Cliff’s delta, a non-parametric effect size statistic for ordinal data. Blair’s genuine news articles are found to have longer average word lengths and more nouns and nominalizations, time adverbials, gerunds, and participial adjectives; the fake news articles include more emphatics, present tense verbs, perfect aspect verbs, adverbs, copula be, predicative and attributive adjectives, subordinators, and five types of pronouns. This finding aligns with established distinctions in the grammatical traits of informational as opposed to interactional registers: “A dense style is the standard for newspaper writing because it allows for detailed information to be conveyed in a limited space” (Biber 1988). Nominal density is a basic characteristic of informational prose. The authors offer two reasons for the lack of information density in Blair’s false reporting: he was under great pressure in his job to be productive, and he did not have time to produce articles of appropriate conciseness.
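
Since the statistic may be unfamiliar, here is a minimal Python implementation of Cliff’s delta, applied to hypothetical per-article frequencies (my sketch, not the authors’ code): the statistic compares every value in one sample with every value in the other, yielding an effect size between -1 and 1.

    # Cliff's delta: d = (#{x > y} - #{x < y}) / (m * n), ranging from -1 to 1.
    def cliffs_delta(xs, ys):
        greater = sum(1 for x in xs for y in ys if x > y)
        less = sum(1 for x in xs for y in ys if x < y)
        return (greater - less) / (len(xs) * len(ys))

    # e.g., made-up per-100-word noun frequencies, genuine vs. fake articles:
    # cliffs_delta([31.2, 29.8, 33.0], [26.5, 27.1, 30.4]) yields about 0.78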

Four verb features that the authors found to be “highly marked” in Blair’s genuine articles, because they would not be expected there, are suasive verbs, possibility modals, by-passives, and public verbs. Wh-relatives, which involve nouns, were also more common in the fake news. The authors attribute this anomaly to stance, i.e. the author’s conviction about the information being conveyed. For example, looking at the public verb ‘to say,’ they find that Blair uses it in the present tense almost exclusively in the articles in which he is lying and only in the past tense in the true articles.

Chapter Six concludes with a summary of the results and the implications of this study regarding the identifying attributes of fake news. Grieve and Woodfield propose to explain why certain grammatical patterns serve to distinguish fake news from real reporting. They contend that the twenty-eight structural features identified as quantitatively significant point to stylistic variation between the two text types in information density and conviction. Further, they maintain that Blair’s fake news reports were less nominally dense because of the pressure to publish a large quantity in a short period of time; this relative paucity of nouns in the intentionally falsified texts also marks them as more uncertain.

Past NLP models are again criticized for focusing only on the veracity aspect (content) of untrue news reports, while these authors wish to draw attention to the intent to deceive (dishonesty), which they contend is revealed by subconscious linguistic choices and is, after all, of greater social import. The authors hope this study contributes to the future development of large-scale fake news detection.

EVALUATION

In this Element, the authors introduce and apply a framework for the linguistic analysis of fake news. They define fake news as false information that is meant to deceive, and they argue that there are systematic differences between real and fake news that reflect this basic difference in communicative purpose. The authors consider one famous case of fake news involving Jayson Blair of The New York Times, which provides them with the opportunity to conduct a controlled study of the effect of deception on the language usage of a single reporter within this framework. Through a detailed grammatical analysis of a corpus of Blair's real and fake articles, they demonstrate that there are clear differences in his writing style, with his real news exhibiting greater information density and conviction than his fake news. While information density can be determined by a preponderance of nouns and their cohorts (adjectives, prepositions, etc.), I feel conviction is a more subjective measure that can only be determined with some consideration given to word choice. I find it difficult to make the leap from information density to conviction in the absence of a semantic analysis, which the authors do provide in their discussion of stance in Chapter Five.

One weakness admitted by the authors is that the corpus is small. Indeed, by today’s standards even a corpus of half a million words is considered small, given the power of machine processing. But this size limitation was an outcome of adhering to the principle of using a single author, which leads to another issue discussed below.

I also find here the same issue these authors raise in Chapter Two as a problem with NLP methods: not every sentence in the fake news articles is false. In other words, we are still applying a binary distinction to whole articles as true or fake, even though there are certainly true statements within each article, which could affect the numbers.

Another lingering question is that of authorship. The reason Blair was eventually outed was his rampant plagiarism, which leads us to wonder just how much of his later writing was his own. Is the corpus under examination here really a single-author text, a principle that was put forth as a basic criterion for the corpus construction? It would seem there would need to be a comparison of Blair’s fake news reports to the stolen reports. Were they merely copied and presented as his own, or did he make any attempt to disguise his submissions for publication? These researchers say that they eliminated from this study any articles that were co-authored. But to what extent were Blair’s fake news articles plagiarized, and were those articles included in the study? Certainly, if he plagiarized whole articles, they may have been true, but there was still deception. The researchers do not discuss how the plagiarism was handled other than noting that it was Blair’s undoing. The plagiarized articles may not represent Blair’s writing style at all. Some comparison of the known-to-be-plagiarized articles to what we know to be Blair’s authentic writing would have served well here.

This entire scenario recalls the career of another fallen-star reporter, Stephen Glass, former associate editor for The New Republic until he was discovered in 1998 to have concocted stories from whole cloth. Glass was known as a meticulous fact-checker who had “provided copious notes and letters, business cards, e-mail addresses--much of which is now believed to have been fabricated” (St. John 1998). Therefore, one would expect his submissions to include names, companies, places, and dates, i.e. proper nouns, which would contribute to nominal density in his fabricated stories. In the cinematic portrayal of his career trajectory, ‘Shattered Glass’ (2003), the prodigious writer eventually admits to falsifying 27 of 41 stories. It would be interesting to determine whether the linguistic patterns distinguishing fact from fiction in Blair’s journalism hold for Glass’ prose as well, since both relate to the same register of language usage, newspaper reporting. I am surprised that Grieve and Woodfield make no mention of this extremely similar case, which occurred just prior and about which a rather famous movie was made, especially since Stephen Glass is mentioned in the title of one of their own references (Spurlock 2016).

This work constitutes a very interesting, timely, and relevant contribution to the field of deception detection in news reporting through forensic linguistics. The ability to determine fakeness by a preponderance of certain grammatical patterns would be a useful tool indeed for discerning deception in journalistic writing in general. Statements contrary to fact can be challenging to prove, but it can be done; what is more difficult to verify with certainty is the intention to deceive through a quantitative analysis of grammatical structure (no matter how much we may want it to work). I find it hard to make this leap quite yet. It is imperative that we move swiftly forward with this kind of research in a world of machine-generated information and rapidly growing artificial intelligence capabilities. My fear is that both Blair and Glass might still be flourishing in news reporting today if they had had access to such a facilitator as ChatGPT!

REFERENCES

Biber, D. 1988. Variation across speech and writing. Cambridge: Cambridge University Press.

Ray, Billy. 2003. Shattered Glass. Lionsgate Films. Retrieved January 12, 2024, from https://www.youtube.com/watch?v=LdtWcXAQ2Q0

St. John, Warren. 1998. How journalism’s new golden boy got thrown out of New Republic. Observer. Retrieved December 04, 2023, from https://observer.com/1998/05/how-journalisms-new-golden-boy-got-thrown-out-of-new-republic/

Spurlock, J. 2016. Why journalists lie: The troublesome times for Janet Cooke, Stephen Glass, Jayson Blair, & Brian Williams. ETC: A Review of General Semantics, 73(1), 71–76.

ABOUT THE REVIEWER

Dr. Elizabeth Craig is a freelance editor and ESL Instructor. She holds a master’s degree in TESOL and a doctorate in linguistics. [email protected]





