Twitter Data Mining and Computational Social Science

Originally published in The Warwick Engineer Easter Issue 2016

Computational social science, the academic sub-discipline focused on computational approaches to the social sciences, has been on the rise since the advent of social media giants like Facebook and Twitter. In this discipline, computers are used to simulate, model and analyse social and behavioural dynamics.

Facebook and Twitter are two of the most popular online social networks, where people interact on a daily basis. However, one big distinction between the two is that Twitter data is publicly available through its Application Programming Interface (API), whereas most Facebook content is private. As a consequence, Twitter has established itself as “the single most powerful socioscope available to social scientists for collecting […] records of human behaviour and interaction.”

The 320M monthly active Twitter users leave billions of time-stamped digital trails of their social interactions. This unprecedented quantity of data offers researchers the opportunity to observe changes not only in opinion, via so-called sentiment analysis, but also in behaviour, from both a macroscopic and a microscopic perspective. To illustrate the variety and usefulness of this data, the sections below present a selection of results from research articles ranging from public health to politics, by way of economics and finance.

Public Health

Twitter data has been used successfully to track, study and manage the spread of disease. For instance, Kostkova et al. (2014) used data from the microblogging site to show that Twitter could have offered an early warning signal of the 2009 swine flu pandemic. By collecting data from May to December 2009 and analysing tweets that contained self-reported diagnoses such as “have flu” and “have swine flu,” they demonstrated that Twitter could have predicted the spike in the UK epidemic up to two weeks in advance.
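To make the idea concrete, the following is a minimal Python sketch of counting self-reported flu tweets per week. It is not the method of Kostkova et al.; the function name `weekly_flu_counts` and the sample tweets are invented for illustration, and only the two phrases quoted above are used as keywords.

```python
from collections import Counter
from datetime import date

# Phrases quoted in the article; a real study would use a broader, validated list.
FLU_PHRASES = ("have flu", "have swine flu")


def weekly_flu_counts(tweets):
    """Count tweets per ISO (year, week) whose text contains a self-report phrase.

    `tweets` is an iterable of (date, text) pairs. Hypothetical helper.
    """
    counts = Counter()
    for day, text in tweets:
        lowered = text.lower()
        if any(phrase in lowered for phrase in FLU_PHRASES):
            iso = day.isocalendar()
            counts[(iso[0], iso[1])] += 1  # key: (ISO year, ISO week)
    return counts


# Made-up sample tweets, for illustration only.
sample = [
    (date(2009, 7, 6), "I think I have swine flu, staying home"),
    (date(2009, 7, 7), "Flu season is coming"),  # no self-report phrase
    (date(2009, 7, 8), "Ugh, I have flu again"),
    (date(2009, 7, 14), "Doctor says I have swine flu"),
]
weekly = weekly_flu_counts(sample)
```

A rising weekly count of this kind is the sort of signal that could be compared against official case numbers reported later.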


Economics

Twitter provides an extensive amount of information that has been used for socioeconomic measurement. For example, Antenucci et al. (2014) created the “University of Michigan Social Media Job Loss Index” based on the number of tweets containing phrases such as “fired” or “lost my job.” This index has been argued to nowcast unemployment more precisely than the index based on unemployment claims. Turning to a different socioeconomic indicator, O’Connor et al. (2010) used sentiment analysis to categorise tweets containing words such as “job” and “economy” as positive or negative. In so doing, they created a consumer confidence index that has been shown to correlate strongly with several US consumer confidence surveys.
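A toy version of lexicon-based sentiment scoring might look like the Python sketch below. The word lists, the helper names and the net-sentiment formula are assumptions chosen for illustration; they are not the lexicon or the index definition used by O’Connor et al.

```python
import re

# Toy word lists, assumed for illustration; not the lexicon used in the study.
POSITIVE = {"good", "great", "hiring", "growth", "improving"}
NEGATIVE = {"bad", "fired", "worse", "crisis", "layoffs"}
TOPIC_WORDS = {"job", "jobs", "economy"}


def classify(text):
    """Return 'positive', 'negative', or None (off-topic or undecided)."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    if not words & TOPIC_WORDS:
        return None  # tweet does not mention the topic at all
    pos = len(words & POSITIVE)
    neg = len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return None


def net_sentiment(texts):
    """Net score in [-1, 1]: (positive - negative) / (positive + negative)."""
    labels = [classify(t) for t in texts]
    pos = labels.count("positive")
    neg = labels.count("negative")
    if pos + neg == 0:
        return None  # nothing could be labelled
    return (pos - neg) / (pos + neg)


# Made-up sample tweets, for illustration only.
sample = [
    "The economy is improving, great news",
    "Lost my job today, this is bad",
    "New job, things are good",
    "Great weather today",  # off-topic, ignored
]
score = net_sentiment(sample)
```

Computed day by day, a score like this yields a time series that could then be compared with survey-based confidence measures.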


Politics

Twitter content has recently been considered as a source of public and, more specifically, political opinion. In this regard, Tumasjan et al. (2010) were the first to create a model aimed at forecasting electoral results from Twitter data by mining selected keywords corresponding to the parties and candidates running for election. However, their methodology was limited to using each party’s or candidate’s share of tweets to forecast its share of votes. Several researchers later addressed this limitation by taking into account not only the presence or absence of certain keywords but also the sentiment, negative or positive, of the tweet as a whole.
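The share-of-tweets approach can be sketched in a few lines of Python. The party names, keyword lists and sample tweets here are hypothetical, and the code is a simplified illustration rather than the model of Tumasjan et al.

```python
from collections import Counter

# Hypothetical parties and keyword lists, for illustration only.
PARTY_KEYWORDS = {
    "Party A": ("party a", "#partya"),
    "Party B": ("party b", "#partyb"),
}


def mention_shares(tweets):
    """Return each party's share of all party mentions across the tweets."""
    counts = Counter()
    for text in tweets:
        lowered = text.lower()
        for party, keywords in PARTY_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                counts[party] += 1  # one mention per tweet per party
    total = sum(counts.values())
    if total == 0:
        return {}
    return {party: counts[party] / total for party in PARTY_KEYWORDS}


# Made-up sample tweets, for illustration only.
sample = [
    "Voting for #PartyA tomorrow",
    "Party A had a strong debate",
    "#partyb all the way",
    "Not impressed by Party A",  # negative, but still counted as a mention
    "Nice weather",
]
shares = mention_shares(sample)
```

The fourth sample tweet shows the limitation described above: a critical tweet still adds to a party's share, which is why later work also considered the sentiment of each tweet.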


Finance

Inferring investor sentiment and categorising it as bullish or bearish is one of several applications of Twitter data in the financial sector. Mao et al. (2014) applied a technique that quantifies the interdependences among multivariate time series to shed light on the predictive power of the so-called “Twitter bullishness” with respect to stock market returns. They found a statistically significant relationship between the Twitter bullishness level and the stock market return on the following day. The results of this research were so promising that in May of last year the startup Market Prophit launched its “Social Media Sentiment Index”. The index tracks tweets that use a ticker with a dollar sign in front of it, such as $AAPL.
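As a rough illustration, the Python sketch below extracts dollar-prefixed tickers with a regular expression and computes a simple bullishness score. The word lists and the log-ratio formula are assumptions: the log ratio is one formulation found in the sentiment literature, and it is not confirmed to be the measure used by Mao et al. or by Market Prophit.

```python
import math
import re

# A dollar sign followed by 1 to 5 letters, not preceded by a word character.
CASHTAG = re.compile(r"(?<!\w)\$([A-Za-z]{1,5})\b")

# Toy word lists, assumed for illustration.
BULLISH = {"bullish", "buy", "long"}
BEARISH = {"bearish", "sell", "short"}


def cashtags(text):
    """Return the tickers mentioned in a tweet, upper-cased."""
    return [ticker.upper() for ticker in CASHTAG.findall(text)]


def bullishness(texts):
    """Log ratio of bullish to bearish tweets, with add-one smoothing."""
    bull = bear = 0
    for text in texts:
        words = set(re.findall(r"[a-z]+", text.lower()))
        bull += bool(words & BULLISH)
        bear += bool(words & BEARISH)
    return math.log((1 + bull) / (1 + bear))


# Made-up sample tweets, for illustration only.
sample = [
    "$AAPL looks bullish",
    "Going long $AAPL",
    "$AAPL is bearish to me",
]
tickers = [ticker for text in sample for ticker in cashtags(text)]
score = bullishness(sample)
```

A positive score means bullish tweets outnumber bearish ones in the sample, a negative score the reverse. A price such as "$20" is not matched as a ticker because the pattern requires letters after the dollar sign.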

Conclusion – Obstacles and Limitations

However diverse and versatile Twitter data might be, there are still limitations and challenges surrounding its use. It has been pointed out that generalising from online to offline behaviour can produce unrealistic results, primarily because the demographic profile of Twitter users differs fundamentally from that of the wider population. Further, retrieving, storing, analysing and validating the massive quantity of data that Twitter provides requires advanced technical skills, which social scientists might lack.

Finally, the above-mentioned studies should not be taken as an exhaustive outline of the areas in which Twitter data is applied. Instead, they illustrate a few of the many contributions that this microblogging platform is making to the study of the kaleidoscopic relationship between online human behaviour and society at large.


