Should you ever want to visualize the definition of “unstructured data”, there is no need to look beyond the beautiful chaos that is 503,000 ESPN tweets all harvested using DiscoverText. It would be an understatement to call an archive of this nature diverse. Posts appear in Spanish, Chinese, Korean, and Turkish. There are references to the obvious like LeBron James and his ego, and to the obscure, like European soccer club Olympique Marseille losing their best striker. In addition, there are numerous re-tweets, and the occasional post which is simply incomprehensible.
Making sense of an archive this massive seems daunting, since viewing each individual tweet is nearly impossible. However, by using DiscoverText to code, train, and classify the data, it is possible to develop a better understanding of the nature of the tweets. Have a hunch that most of the tweets are about LeBron James? By coding, training, and classifying the data, you can form a hypothesis about your data and then see how accurate that hypothesis was.
In my first post, I will detail the progress I made when coding and classifying the tweets, and how the accuracy improved as I coded more data. By continuing to code your data, you can improve the accuracy of your classifier over time and gain a better understanding of your data by studying the “classifier report”.
I began by creating my classifier with a manageable number of codes: Baseball, Basketball, American Football, Soccer, Hockey, ESPN, and Other.
To see the gradual improvement of the classifier, I began by coding a modest 200 data units. While this is only a fraction of the entire dataset, it is far more manageable than the 503,000 tweets that I started with. When finished, I trained the classifier using the data I had just coded. I then classified just 100 data units and checked them for accuracy.
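To make the code, train, and classify cycle concrete, here is a minimal Python sketch. It is not DiscoverText's classifier, whose internal algorithm is not described here; it swaps in a tiny naive Bayes model as a stand-in, and the function names and sample tweets are invented purely for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


def train(coded_units):
    """Build per-code word counts from hand-coded (text, code) pairs."""
    word_counts = defaultdict(Counter)
    code_counts = Counter()
    for text, code in coded_units:
        code_counts[code] += 1
        word_counts[code].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, code_counts, vocabulary


def classify(text, model):
    """Return the code with the highest naive Bayes log score."""
    word_counts, code_counts, vocabulary = model
    total_units = sum(code_counts.values())
    best_code, best_score = None, -math.inf
    for code in code_counts:
        # Start from the share of hand-coded units carrying this code.
        score = math.log(code_counts[code] / total_units)
        total_words = sum(word_counts[code].values())
        for word in tokenize(text):
            # Add-one smoothing so an unseen word does not rule out a code.
            score += math.log(
                (word_counts[code][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_code, best_score = code, score
    return best_code


# Hypothetical hand-coded tweets, invented for illustration only.
coded = [
    ("lebron james and the heat lose again", "Basketball"),
    ("mavericks win the nba finals", "Basketball"),
    ("espn coverage of the finals is terrible", "ESPN"),
    ("espn announcers never stop talking", "ESPN"),
    ("stanley cup game seven tonight", "Hockey"),
]
model = train(coded)
label = classify("espn coverage is terrible tonight", model)
```

With only five coded examples the model is fragile, which is the same reason a classifier trained on 200 coded units leaves room to improve.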
When I checked, the classifier had an accuracy of 60%. While this is certainly far from perfect, it is still impressive for just an hour’s work.
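The accuracy figure is simply the share of checked units where the classifier's code matches the code a human would assign. A minimal Python sketch of that arithmetic follows; the `accuracy` helper and the two lists are hypothetical placeholders built to reproduce a 60-out-of-100 result, not the actual tweets or codes from this project.

```python
def accuracy(machine_codes, human_codes):
    """Fraction of units where the machine's code matches the human's code."""
    if len(machine_codes) != len(human_codes):
        raise ValueError("both lists must cover the same units")
    if not machine_codes:
        raise ValueError("nothing to check")
    matches = sum(m == h for m, h in zip(machine_codes, human_codes))
    return matches / len(machine_codes)


# Placeholder data: 100 checked units, 60 agreements and 40 disagreements.
machine = ["Basketball"] * 60 + ["ESPN"] * 40
human = ["Basketball"] * 60 + ["Other"] * 40
result = accuracy(machine, human)
```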
My next step was to code far more data units, and then put a little more pressure on the classifier. By coding 500 data units, and classifying 10,000 data units, I could discover more about the nature of my tweets. To do this, after coding, training, and classifying, I checked the “Classification Report”, which gives a breakdown of the tweets.
What I found in the breakdown was quite similar to what I saw when coding. From my observation, many of the tweets were specifically about ESPN’s coverage of the NBA Finals, all of which I coded as “ESPN”. There were numerous foreign-language posts, none of which I could read, with the exception of a lone post in French, so I coded them as “Other”. Basketball also took up a large percentage of the tweets, since they were collected around the time of the Finals. These percentages all made sense; however, I still had to check the accuracy of the classifier.
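A breakdown of this kind amounts to counting how many classified units landed under each code and turning the counts into percentages. The Python sketch below shows the idea; the `breakdown` function is a hypothetical helper, and the ten sample codes are made up rather than taken from the real report.

```python
from collections import Counter


def breakdown(assigned_codes):
    """Map each code to its percentage of all classified units, largest first."""
    total = len(assigned_codes)
    if total == 0:
        return {}
    counts = Counter(assigned_codes)
    return {code: 100 * n / total for code, n in counts.most_common()}


# Made-up classifier output for ten units, for illustration only.
sample_codes = ["ESPN"] * 5 + ["Basketball"] * 3 + ["Other"] * 2
report = breakdown(sample_codes)
for code, percent in report.items():
    print(f"{code}: {percent:.1f}%")
```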
This time, I would again check 100 of the 10,000 tweets I had classified. However, they would not be consecutive; instead, I used a simple random sample of the classified tweets. What I found was much to my liking and made me more confident in the classification report. The classifier had an accuracy of 70%. Much improved, but why? The easiest explanation is more training data. I quickly saw LeBron James classified as “Basketball”, Spanish posts as “Other”, and all those people ranting about ESPN’s coverage of the NBA Finals as “ESPN”. References to the Miami Heat or the Dallas Mavericks were classified as “Basketball”, and the few and far between tweets regarding hockey were, for the most part, classified correctly.
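A simple random sample means every classified tweet has the same chance of being picked and none is picked twice. Outside of DiscoverText, the same draw can be sketched in Python with the standard library; the `draw_sample` helper, the integer IDs, and the seed value are assumptions made for this example.

```python
import random


def draw_sample(unit_ids, sample_size, seed=None):
    """Pick sample_size distinct units uniformly at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(unit_ids, sample_size)


# Stand-in IDs for the 10,000 classified tweets.
classified_ids = list(range(10_000))
to_check = draw_sample(classified_ids, 100, seed=42)
```

Checking a random 100 rather than the first 100 guards against the sample being dominated by whatever happened to be tweeted in one short stretch of time.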
What I did with the ESPN tweets is easily replicable using DiscoverText. You may import not just ESPN tweets, but tweets from anyone. It does not have to be the “worldwide leader in sports”. If you ever have a hunch to scrape all of LeBron’s tweets, it is possible. Or, as you see in the post by my colleague Josh, the entire Arab Spring can be captured using DiscoverText.
A dataset of this size is great proof of the power of DiscoverText, and there is far more data which can be analyzed. When I began, I had no idea what my results would yield. For example, I had no idea the majority of ESPN tweets weren’t actually commenting on the sports, but on ESPN itself. This is a testament to what can be unlocked by using DiscoverText.