You are what you tweet?
The pitfalls of Twitter research
Twitter research can be a valuable tool when it is done properly. Twitter users are not a random sample of society, so every result carries a bias. As long as we take that into account, we can benefit from the enormous amount of data Twitter collects. A random sample of tweets could tell us something about the prevailing discourse. What subjects do people talk about nowadays? Are tweets of any value, or are they just idle chatter?
The company Textwise did a small piece of Twitter research (Crawford, 2010). Although I cannot recall whether the source is trustworthy, it gives a good example of what is possible with Twitter research. Textwise took a sample of 8.9 million tweets and analysed it using Semantic Signature. First, they determined which languages were used in the sample. The following graph can be derived from the findings:
Then the company took a sample of 1,000 tweets to find out what kinds of messages are shared on Twitter. Are the messages only about the user's current status? (“I’m having the best f*cking latte macchiato of my life”) Or do most tweets carry more weight and address political or cultural matters? (“I can not believe Geert Wilders still isn’t kicked out of the Dutch parliament”). Textwise sorted the tweets into the following categories:
- User’s current status — where the user is right now, what they’re doing, etc.
- Private conversations — some twitterers seem to use the service as if it were a giant internet chatroom
- Links to web content — a URL with an article title and/or some commentary on its content. Further broken down into: links to blog and news articles; links to images and videos; and other links.
- Politics, sports, current events — discussion of these topics
- Product recommendations/complaints — recommendations or complaints about specific TV shows, movies, techie gadgets, etc.
- Advertising — posted from a company’s twitter account
- Spam — a strange phenomenon, given that an account has to be followed for anyone to see its tweets, but it does exist
- Other messages — messages that don’t quite fit under any of the above categories. Fan messages to celebrities, shoutouts to other users, web-based polls and quizzes, and so on.
According to the results, only 6% of the tweets are about politics, sports or current events. This share, plus the tweets that link to a blog or news article, could be considered relevant to a broader audience. Most tweets are about current statuses or are part of private conversations. Spam and advertising are also well represented in the results.

One conclusion is that Twitter research is a complex process in which junk data plays a large role. It is still difficult to filter data semantically so that the content can be categorized. If you want to find out the current attitude towards a political development, you first have to filter out all the user statuses and private conversations, as well as all the spam and advertising.

Apart from this, the data should be analyzed with a statistical test. Otherwise one can never claim to have found a significant outcome. A sample of 1,000 tweets is not enough for a reliable verdict about a population of over a billion. Moreover, the pitfall of Twitter research, or of social network research in general, is that the medium is built as a network. That makes it difficult to collect a truly random sample, because one subject is not independent of another.
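To make the point about statistical testing a little more concrete, here is a minimal sketch of the simplest calculation one could run on the 6% figure: a 95% confidence interval for a proportion. The function name `proportion_interval` is made up for this illustration, and the count of 60 tweets is just 6% of the 1,000-tweet sample. The formula assumes that every tweet is an independent random draw, which is exactly what is in doubt for Twitter data, so the interval it gives should be read as optimistic.

```python
import math

def proportion_interval(successes, n, z=1.96):
    """Normal-approximation confidence interval for a sample proportion.

    Assumes n independent random draws (an assumption that tweets
    from a social network may well violate).
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# 6% of a 1,000-tweet sample, i.e. 60 tweets on politics, sports or current events
low, high = proportion_interval(60, 1000)
print(f"95% interval: {low:.1%} to {high:.1%}")
```

If the tweets in the sample are clustered around a few authors, the effective sample size is smaller than 1,000 and the true uncertainty is larger than this interval suggests.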
Another problem is that the top 10% of Twitter users are responsible for 90% of all tweets (Heil & Piskorski, 2009). This means that 90% of the tweets are not independent of each other at all. Note that this makes Twitter a one-to-many medium rather than the peer-to-peer medium it is often claimed to be. If such a small number of users is responsible for almost all tweets, the chance increases that tweets from the same person end up in one sample. This influences the whole outcome of a study. Traditional statistics might not be applicable to Twitter and other social networks, because they are based on human populations rather than digital ones. In physical life it is unthinkable that social networks would be as extensive as digital ones, so we may need to adopt an entirely new approach to analyze digital data.
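The effect of this concentration on a sample can be illustrated with a small simulation. This is only a sketch under made-up assumptions: a hypothetical population of 10,000 users, of whom the top 10% write 90% of the tweets, with tweets spread evenly within each group. The function name and all parameter values are invented for the example. With a much larger user base, repeated authors would be rarer, but the share of the sample written by heavy users would stay the same.

```python
import random
from collections import Counter

def sample_tweet_authors(n_users=10_000, heavy_share=0.10,
                         heavy_tweet_share=0.90, sample_size=1_000, seed=0):
    """Draw the authors of a random sample of tweets.

    Hypothetical model: users 0..n_heavy-1 are the heavy users and
    together write `heavy_tweet_share` of all tweets.
    """
    rng = random.Random(seed)
    n_heavy = int(n_users * heavy_share)
    authors = []
    for _ in range(sample_size):
        if rng.random() < heavy_tweet_share:
            authors.append(rng.randrange(n_heavy))            # a heavy user
        else:
            authors.append(rng.randrange(n_heavy, n_users))   # a light user
    return authors, n_heavy

authors, n_heavy = sample_tweet_authors()
counts = Counter(authors)

# Share of sampled tweets written by the heavy users
from_heavy = sum(1 for a in authors if a < n_heavy) / len(authors)
# Share of sampled tweets whose author appears more than once in the sample
repeated = sum(c for c in counts.values() if c > 1) / len(authors)

print(f"distinct authors: {len(counts)} of {len(authors)} tweets")
print(f"tweets from heavy users: {from_heavy:.0%}")
print(f"tweets by repeated authors: {repeated:.0%}")
```

The sketch makes one thing visible: a random sample of tweets is not a random sample of users, so counting tweets mostly measures what the most active accounts are saying.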
Heil, Bill; Piskorski, Mikolaj. ‘New Twitter Research: Men Follow Men and Nobody Tweets’. June 2009.
Crawford, Cliff. ‘How informative is Twitter?’. August 2010.