Can Twitter Predict Quality of Life?

Twitter is a pulse on the human race. Each day, half a billion tweets are posted - some discuss the stock market while others fuel revolutions. As a result, hundreds of academic papers have been published describing how to use Twitter to predict everything from the spread of the flu to the outcome of political elections. After reading so many of these articles, I began to wonder: can Twitter sentiment predict our quality of life?

Defining ‘Quality of Life’

Before attempting to answer the question, it’s important to define what exactly “quality of life” means. It also needs to be quantified before we can try to predict it. Fortunately, there is a national government organization whose job is to measure Americans’ well-being.

Every year, the Census Bureau reports statistics such as income per capita, percent of people living in poverty, and many more for every county in the US. I used the variables below as proxies for quality of life. Some variables are more relevant than others, but I think all of them are interesting metrics.

  1. Population Size
  2. Average Home Value
  3. Income Per Capita
  4. Percent of People in Poverty
  5. Population per Square Mile

Collecting Data And Analyzing Sentiment

First, we need a large dataset of tweets to analyze. I wrote a few programs to connect to Twitter’s streaming API, and over the course of two weeks I downloaded over a million tweets from around the world. I will compare each tweet’s sentiment rating to the census statistics for the area it was tweeted from.

Second, we need some way to analyze them for sentiment. Sentiment analysis is a subfield of Natural Language Processing that attempts to identify how positive or negative a group of words is. I used a tool called Sentiment 140 to analyze the sentiment of each tweet that I scraped - the tool classified each tweet as either positive or negative (4 or 0). So a tweet saying “I hate this place” would receive a 0, while a tweet saying “I’m so happy” would receive a 4. While I would have liked more granularity, Sentiment 140 was the only free option available.

Below, you can see what the dataset looked like in an interactive Google Map. The color indicates whether a tweet was classified as positive or negative, and clicking on a tweet will reveal its contents. Feel free to zoom out - the tweets are from all over the world!

Quantifying Patterns Between Twitter And People’s Well-Being

After compiling all of the data I had collected, I not only had each county’s Census statistics, but also its average sentiment rating based on tweets sent from that location. Below I outline two major relationships I discovered. You can see the output of the regressions yourself here.
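The aggregation step described above can be sketched roughly as follows. This is not the code I used for the analysis, just a minimal illustration: the county identifiers, the `income_per_capita` field, and both function names are hypothetical, and the tweets and census values are made up.

```python
from collections import defaultdict


def average_sentiment_by_county(tweets):
    """Average the 0/4 sentiment labels of all tweets sent from each county.

    `tweets` is an iterable of (county_id, score) pairs, where score is
    0 (negative) or 4 (positive).
    """
    totals = defaultdict(lambda: [0, 0])  # county_id -> [sum of scores, tweet count]
    for county_id, score in tweets:
        totals[county_id][0] += score
        totals[county_id][1] += 1
    return {county: total / count for county, (total, count) in totals.items()}


def join_with_census(avg_sentiment, census):
    """Keep only counties that have both tweets and census statistics."""
    return {
        county: {"sentiment": avg_sentiment[county], **census[county]}
        for county in avg_sentiment
        if county in census
    }


# Made-up example data (hypothetical county IDs and values).
tweets = [
    ("county_a", 4), ("county_a", 0), ("county_a", 4),
    ("county_b", 0), ("county_b", 0),
    ("county_c", 4),  # no census row, so it is dropped by the join
]
census = {
    "county_a": {"income_per_capita": 31000},
    "county_b": {"income_per_capita": 24000},
}

avg = average_sentiment_by_county(tweets)
merged = join_with_census(avg, census)
print(merged)
```

Each county ends up with an average between 0 and 4, which is the "average sentiment rating" used in the regressions.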

Happier People Earn More

A simple regression showed that for every one-point increase in a county’s average sentiment rating, the average citizen earned almost $1,000 more. Does money buy happiness, or do happy people make more money? These results were in line with regressions I ran for sentiment vs average home value (positive correlation) and sentiment vs percent of people in poverty (negative correlation).
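For readers who want to see what "a simple regression" means here, below is a minimal sketch of an ordinary least squares fit with one predictor. It is an illustration only, not my original analysis: the function name is hypothetical, and the four data points are made up so that the slope comes out to exactly 1,000.

```python
def simple_regression(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: average sentiment (0 to 4) and income per capita for four counties.
sentiment = [1.0, 2.0, 3.0, 4.0]
income = [25000, 26000, 27000, 28000]

slope, intercept = simple_regression(sentiment, income)
print(f"Each extra sentiment point is associated with ${slope:.0f} more income")
```

The slope is the number quoted above: the change in income associated with a one-point change in average sentiment. It describes an association and says nothing about which way the causation runs.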

People Who Live Closer Together Are Happier

The data revealed that for every one-point increase in a county’s average sentiment, the population per square mile increased by over 400 people. I smiled after I discovered this relationship. I can’t say I expected it, but it makes logical sense. However, I don’t think that this relationship always holds. Poverty can be highly correlated with a dense population (think urban areas), but I think that at a national scale, this is a valid relationship.

So Does Twitter Predict Quality of Life?

While I identified several statistically significant correlations, that doesn’t mean the answer is yes. This is a good example of the correlation vs causation debate. All of the relationships I quantified here are statistically significant (p-value < 0.05), but it is the R-squared value that indicates how much predictive power a variable has. The R-squared values for all the regressions I ran are less than 0.01, meaning less than 1% of the variation in the quality-of-life metrics can be explained by Twitter sentiment. You can see how noisy the data is in the pictures below.
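The gap between "statistically significant" and "predictive" can be reproduced with synthetic data. The sketch below is not my dataset: the 20,000 points, the true slope of 1,000, and the noise level are all assumptions chosen to mimic a weak but real relationship, and the function name is hypothetical. For a sample this large, a t-statistic above roughly 1.96 in absolute value corresponds to p < 0.05.

```python
import math
import random


def regression_diagnostics(x, y):
    """Return slope, R-squared, and the slope's t-statistic for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    ss_res = syy - slope * sxy  # residual sum of squares of the fitted line
    r_squared = 1 - ss_res / syy  # share of the variance in y explained by x
    std_err = math.sqrt(ss_res / (n - 2) / sxx)  # standard error of the slope
    return slope, r_squared, slope / std_err


# Synthetic data: a true slope of 1,000 buried under a lot of noise.
rng = random.Random(0)
sentiment = [rng.uniform(0, 4) for _ in range(20000)]
income = [25000 + 1000 * s + rng.gauss(0, 20000) for s in sentiment]

slope, r_squared, t_stat = regression_diagnostics(sentiment, income)
print(f"slope={slope:.0f}  R^2={r_squared:.4f}  t={t_stat:.1f}")
```

With enough data points, even a relationship this noisy clears the significance threshold, while R-squared stays below 0.01. That is the same pattern I saw in my regressions: the relationship is unlikely to be zero, but knowing the sentiment tells you very little about any single county.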

So Why Is This Interesting?

Each day, we log on to Twitter and send extremely short messages to friends and family, only to forget about them minutes after clicking “Tweet”. Yet at the scale of a million tweets, there are underlying patterns in the way we communicate as a human race. These patterns aren’t hiding behind some sophisticated algorithm; they can be discovered with the most basic tools - a few simple regressions, a free sentiment analyzer, and tweets randomly chosen from around the country. On a microscopic level, our actions seem individualistic and arbitrary, but when we zoom out, we see that there is a rhyme and reason in the way we humans co-exist.