Predicting the Veracity of Rumors in Social
Networks: Computational Explorations
by Soroush Vosoughi
Ph.D. Thesis Proposal, Media Arts and Sciences
Massachusetts Institute of Technology
Prof. Deb Roy
Associate Professor of Media Arts and Sciences
Massachusetts Institute of Technology
Dr. Allen Gorin
Research Associate
JHU Center of Excellence for Human Language Technology
MIT Laboratory for Social Machines
Prof. Sinan Aral
Associate Professor of Information Technology and Marketing
Massachusetts Institute of Technology

Contents

1 Introduction
2 Background
3 Thesis Summary and Methodology
3.1 Anatomy of an Assertion in Social Media
3.2 Quantifying and Operationalizing Assertions
3.2.1 The Form/Style
3.2.2 The Function/Content
3.2.3 The Agents/Users
3.2.4 The Propagation/Cascade Dynamics
3.3 A Computational Model of Rumors
3.3.1 What is a Rumor?
3.3.2 Predictive Model and Real-Time Rumor Verification
4 Evaluation
5 Conclusion
6 Research Plan
6.1 Completed Work
6.2 Timeline
6.3 Required Resources
Abstract

The spread of malicious or accidental misinformation in social media, especially in time-sensitive situations such as real-world emergencies, can have harmful effects on individuals and society. Using computational methods, this thesis investigates the nature of rumors surrounding real-world events on Twitter and Reddit, using the April 2013 Boston Marathon bombings as a case study. With the perspective that in social media both the linguistic and the network dynamics of messages need to be taken into consideration, we propose a set of linguistic and graph-theoretic features that make up the anatomy of rumors. The key idea is that there are measurable differences in the makeup of false and true rumors. We extract these features using novel natural language processing and network analytic algorithms that we have developed. In this thesis, we propose a dynamic computational model of rumors composed of these features. The model will be evaluated on the rumors surrounding the August 2014 Ferguson unrest. Once fully evaluated, the model will be used to build a real-time rumor verification system for Twitter and Reddit that can be used during real-world emergencies. This system will have immediate real-world applications for consumers of news, journalists and emergency services, and can help minimize and dampen the impact of misinformation.
1 Introduction

In the last decade, the Internet has become a major source of news. In fact, a study by the Pew Research Center identified the Internet as the most important news resource for people under the age of 30 in the US, and the second most important overall after television. More recently, the emergence and rising popularity of social media and networking services such as Twitter, Facebook and Reddit have greatly affected the news reporting and journalism landscapes. While social media is mostly used for everyday chatter, it is also used to share news and other important information [11, 18]. Now more than ever, people turn to social media as their source of news [15, 24, 14]; this is especially true for breaking news, where people crave rapid updates on developing events in real time. As Kwak et al. (2010) have shown, over 85% of all trending topics on Twitter (i.e., topics being discussed more than others) are headline or persistent news. Moreover, the ubiquity,
accessibility, speed and ease of use of social media have made it an invaluable source of firsthand information. Twitter, for example, has proven to be very useful in emergency situations, particularly for response and recovery. However, the same factors that make social media a great resource for the dissemination of breaking news, combined with the relative lack of oversight of such services, make it fertile ground for the creation and spread of unsubstantiated and unverified information about events happening in the world.
This unprecedented shift from traditional news media, where there is a clear distinction between journalists and news consumers, to social media, where news is crowd-sourced and anyone can be a reporter, has presented many challenges for various sectors of society, including journalists, emergency services and news consumers. Journalists now have to compete with millions of people online for breaking news. Often this leads journalists to fail to strike a balance between the need to be first and the need to be correct, resulting in an increasing number of traditional news sources reporting unsubstantiated information in the rush to be first [6, 7]. Emergency services have to deal with the consequences and the fallout of rumors and witch-hunts on social media, and finally, news consumers face the incredibly hard task of sifting through posts in order to separate substantiated and trustworthy posts from rumors and unjustified assumptions. A case in point is social media's response to the Boston Marathon bombings. As the events of the bombings unfolded, people turned to social media services like Twitter and Reddit to learn about the situation on the ground as it was happening. Many people tuned into police scanners and posted transcripts of police conversations on these sites. As much as this was a great resource for people living in the greater Boston area, enabling them to stay up to date on the situation as it unfolded, it led to several unfortunate instances of false rumors being spread and innocent people being implicated in witch-hunts [13, 16, 25]. Another example of this phenomenon is the 2010 earthquake in Chile, where rumors propagated on social media created chaos and confusion among news consumers.
In this thesis, we plan to develop and combine a set of natural language processing and complex network analysis tools and algorithms that enable the study and analysis of the underlying processes that develop on social media in emergency situations. More generally, we are interested in using social media as an experimental ground for studying and quantifying the nature of communicative discourse in highly connected, complex and massive communication networks (such as social media), in order to better understand and model the dynamic processes that evolve on these networks and the underlying signals driving them. Through modeling these signals and processes, we attempt to explain, predict and modify how these systems behave under different conditions. As mentioned, one such behavior we are interested in modeling is how these systems behave during real-world emergencies (e.g., natural disasters, terrorist attacks, plane crashes, etc.). Specifically, we want to model the emergence, evolution, propagation and impact of unverified assertions (or rumors) on social media during emergency situations. We then plan to use these models to predict the veracity of assertions made about such events on social media, with the goal of creating a rumor verification tool for use in emergencies. Finally, we plan to study and experiment with possible approaches for intervening in and minimizing the impact and spread of false information in these networks.
2 Background

Although there has been extensive work on measuring and quantifying information credibility and on modeling the spread of information in networks, most of it has approached the problem through either a text and language processing framework or a network science and complex systems analytics framework. The research in the network science domain has mainly focused on modeling various diffusion and cascade structures [8, 10], the spread of "epidemics" [20, 19, 9], knowledge and information, and propaganda. Work has also been done on identifying influential players in spreading information through a network [28, 1] and on identifying sources of information. In work more directly related to our research direction, Mendoza et al. have looked at the difference in propagation behavior between false rumors and true news on Twitter. In all of these cases, the properties of the actual entity being spread (be it a message, knowledge, or a virus) are never analyzed or taken into consideration in the models. In contrast, our work will look at the content of the messages being spread, in addition to their propagation behavior and any information that might be available about the agents involved in the propagation.
Relevant research in the text and language processing domain primarily falls under either information retrieval and comparison or semantic and sentiment analysis. The former involves using various NLP techniques to retrieve relevant information from text (or speech) and then comparing that information against a database of known facts. The Washington Post's TruthTeller, which attempts to fact-check political speech in real time, is a good example of such work. The latter line of research attempts to detect non-literal text (text that is not meant to be taken at face value), such as sarcasm, satire and hostility (flames), through a combination of semantic and sentiment analytic techniques.
As far as we can tell, very few studies take all of these factors into consideration. Most relevant is the work of Castillo et al., in which the authors looked at a combination of linguistic and propagation factors that can be used to approximate users' subjective perceptions of credibility on Twitter (i.e., whether users believe the tweets they are reading); however, they do not focus on the objective credibility of messages.
3 Thesis Summary and Methodology

This work uses Twitter's and Reddit's response to the April 2013 Boston Marathon bombings as a case study to analyze and model the genesis, evolution and propagation of rumors. The work starts by annotating more than 20 rumors that spread about the events surrounding the bombings, followed by processing and parsing raw tweets and posts using various NLP and network analytic tools which we have developed. This leads to a computational analysis of rumors and predictive models for estimating the veracity of assertions in these media, and finally an evaluation of these models on tweets and Reddit posts about other real-world events and emergencies, such as the August 2014 unrest in Ferguson.
This section will explain the following:
• The definition of rumors.
• The process through which messages on Twitter and Reddit are operationalized as a collection of computationally measurable and quantifiable features.
• The tools that have been built and need to be built to extract these features.
• The creation of a computational model of rumors using these features.
3.1 Anatomy of an Assertion in Social Media

At any given time, an assertion on social media can potentially be broken down into the following dimensions:
• The Form/Style: How is the message presented? Is it well polished? Grammatical?
Does it use slang?
• The Function/Content: What is the message about? What is it intended to achieve?
• The Agents/Users: Who is presenting the message? Which platform is being used?
Which social group does the author belong to? What is the history of the author?
• The Propagation/Cascade Dynamics: What was the speed at which the message spread? What did the propagation tree look like? How many "influential nodes" did it pass through? How fast did its spread decay?
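The cascade questions above correspond to simple graph measurements. As an illustrative sketch only (the proposal does not specify these algorithms, and the function names and edge representation here are our own assumptions), the depth of a propagation tree and the average spread speed could be computed from a retweet cascade as follows:

```python
from collections import defaultdict

def cascade_depth(edges, root):
    """Depth of a propagation tree, given (parent, child) retweet edges.

    Performs a breadth-first traversal from the root, counting levels.
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    depth, frontier = 0, [root]
    while frontier:
        # Advance one level down the tree.
        frontier = [c for node in frontier for c in children[node]]
        if frontier:
            depth += 1
    return depth

def spread_speed(timestamps):
    """Messages per minute over the cascade's lifetime (timestamps in seconds)."""
    span = max(timestamps) - min(timestamps)
    return len(timestamps) / (span / 60.0) if span > 0 else float(len(timestamps))

# Hypothetical cascade: user "a" posts, "b" and "c" retweet, and so on.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("d", "e")]
print(cascade_depth(edges, "a"))            # 3
print(spread_speed([0, 30, 60, 90, 120]))   # 2.5
```

In practice, one would compute such features repeatedly over sliding time windows to capture how a cascade's shape and speed evolve.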
By breaking down assertions along these dimensions over time, we can create a dynamic fingerprint for each assertion. We can then group false and true assertions together and look for common structural properties among assertions in each group. In addition, we can look for possible signals that differentiate between false and true assertions. Even though our work focuses on rumors, similar characterization techniques can be used to analyze different phenomena in social networks.
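One way such a dynamic fingerprint might be represented is as a time series of feature snapshots, one per dimension. The following Python sketch is purely illustrative: the class and field names are our own assumptions, not a schema from the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AssertionSnapshot:
    """One time-stamped measurement of an assertion's four dimensions."""
    timestamp: float       # seconds since the assertion first appeared
    form: Dict[str, float]         # style features, e.g. {"type_token_ratio": 0.7}
    function: Dict[str, object]    # content features, e.g. {"topic": "suspect"}
    agents: Dict[str, float]       # user features, e.g. {"follower_count": 50}
    propagation: Dict[str, float]  # cascade features, e.g. {"retweets_per_min": 0.2}

@dataclass
class AssertionFingerprint:
    """A dynamic fingerprint: the snapshots of one assertion over time."""
    assertion_id: str
    snapshots: List[AssertionSnapshot] = field(default_factory=list)

    def add(self, snap: AssertionSnapshot) -> None:
        self.snapshots.append(snap)

# Example: two snapshots of a hypothetical rumor, one minute apart.
fp = AssertionFingerprint("rumor-001")
fp.add(AssertionSnapshot(0.0, {"type_token_ratio": 0.70}, {"topic": "suspect"},
                         {"follower_count": 50}, {"retweets_per_min": 0.2}))
fp.add(AssertionSnapshot(60.0, {"type_token_ratio": 0.65}, {"topic": "suspect"},
                         {"follower_count": 50}, {"retweets_per_min": 5.0}))
print(len(fp.snapshots))  # 2
```

Grouping fingerprints of known-false and known-true assertions would then amount to comparing these feature trajectories across the two groups.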
3.2 Quantifying and Operationalizing Assertions

In the section above, we briefly discussed how assertions in social media can be characterized by a combination of their form, function, agents and propagation dynamics. In this section, we explain in greater detail the nature of these four dimensions and describe how they are quantified.
3.2.1 The Form/Style

The form of a message captures how it is presented and is assumed to be independent of its information content. There are many ways to encode the form of a message; however, we have found two aspects of form to carry the most information (and thus be most predictive) about the nature of a signal in social media. These two aspects are the sophistication and the formality of a message. The sophistication of a message is captured through the following features:
• Type/token ratio. (E.g., number of adjectives, etc.)
• Complexity of the sentences. (E.g. embedded clauses, etc.)
• Complexity of words. (E.g., number of syllables, rarity.)
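As a rough illustration of the first and third features, the type/token ratio and a word-complexity measure could be computed as follows. This is a sketch under our own assumptions: the vowel-group syllable counter is a crude heuristic of ours, not a method specified in the proposal.

```python
import re

def type_token_ratio(text: str) -> float:
    """Ratio of unique word types to total word tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_syllables(text: str) -> float:
    """Average syllables per word, using vowel groups as a crude proxy."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = [max(1, len(re.findall(r"[aeiouy]+", w))) for w in tokens]
    return sum(counts) / len(counts)

msg = "the suspect the police say the suspect is in custody"
print(round(type_token_ratio(msg), 2))  # 0.7  (7 unique words / 10 tokens)
print(round(mean_syllables(msg), 2))    # 1.6
```

Word rarity, the other component of word complexity, would additionally require a reference frequency list (e.g., unigram counts from a large corpus).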
The formality of a message is captured through these features:
• Grammatical correctness of the message.
• Usage of emoticons.