Immediately following President Obama’s speech on the evening of May 1 confirming the death of Osama bin Laden, the DiscoverText team began collecting all Twitter posts that contained the keywords “Osama” and “Bin Laden.” The bin Laden project collected more than 4.7 million tweets, 1,500 at a time over the public API, thereby archiving a slice of the period Twitter described as the “highest sustained rate of Tweets ever.” However, this episode has not been without controversy. The good folks at Twitter reminded Texifter personnel not to share the tweets.
That reminder still left the door wide open for us to describe what we have collected. While this might seem like a straightforward task, 4.7 million tweets is a perfect modern-day example of “information overload,” and determining how to mine the data is a challenge. A few weeks ago, we looked at the bin Laden “re-tweet champion,” finding the individual Twitter accounts that had been the most active on this topic since May 1.
Even after I used DiscoverText to de-duplicate the massive archive, about 1 million unique posts remained. There are many different dimensions that could be analyzed; I even opened a discussion on LinkedIn, asking what people would like to see come from the data.
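For readers curious about what de-duplication involves, here is a minimal sketch in Python. It is not DiscoverText’s actual method, which is not described in this post. It simply assumes that two posts count as duplicates when their text matches after lowercasing and collapsing whitespace. The function names and the sample posts are made up for illustration.

```python
import re

def normalize(text):
    """Lowercase the text and collapse runs of whitespace (assumed duplicate rule)."""
    return re.sub(r"\s+", " ", text.lower().strip())

def deduplicate(posts):
    """Keep the first occurrence of each normalized text, preserving order."""
    seen = set()
    unique = []
    for post in posts:
        key = normalize(post)
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique

# Made-up sample posts, for illustration only.
sample = [
    "Gas prices will drop now",
    "gas prices   will drop now",
    "Foreign policy just changed",
    "Gas prices will drop now",
]
unique_sample = deduplicate(sample)
print(len(sample), "posts in,", len(unique_sample), "unique posts out")
```

A stricter or looser rule (for example, treating retweets as duplicates of the original) would change how many posts survive, so the rule is worth stating alongside any counts.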
My first foray into the bin Laden data, using the tools DiscoverText offers, involved a good deal of parsing. In the end, I settled on two terms for study: “Gas/Oil” and “Foreign Policy.” While there are numerous terms to search and analyze, these two are quite relevant to many Americans in this context. Each search returned a slice of data that proved much more manageable than the 4.7 million posts. I chose a different approach for analyzing each term. In the remainder of this blog post, I will discuss my methodology and detail my findings for each term.
When I searched my de-duplicated bucket for “gas,” 2,525 pieces of data were returned, or 0.26% of the de-duplicated data. I formed a dataset from these results, and my objective was to find what percentage of people believed gasoline prices would rise or fall in the near future following the death of bin Laden. I also added “Complaint” and “Other” categories in order to categorize errant tweets.
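The search step can be approximated outside DiscoverText with a few lines of Python. This is only a sketch: it assumes a case-insensitive, whole-word match, which may not be how DiscoverText’s own search behaves, and the helper names and sample posts are hypothetical.

```python
import re

def search(posts, term):
    """Return posts containing `term` as a whole word, ignoring case (assumed rule)."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [post for post in posts if pattern.search(post)]

def share_percent(hits, total):
    """Express `hits` as a percentage of `total`."""
    return 100.0 * hits / total

# Made-up sample posts, for illustration only.
posts = [
    "Will gas get cheaper now?",
    "Bin Laden is dead",
    "GAS prices again...",
    "Las Vegas celebrates",  # contains "gas" inside a word, so it is not a hit
]
hits = search(posts, "gas")
print(len(hits), "hits,", share_percent(len(hits), len(posts)), "% of the sample")
```

The whole-word rule matters for a short term like “gas,” since a plain substring search would also pull in words such as “Vegas.”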
I manually coded 10% of the dataset, trained the classifier, and classified the dataset. After checking for accuracy, I found my newly established classifier had an accuracy of 0.78, an excellent number for a new classifier with little training data.
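To make the code, train, classify, and check loop concrete, here is a small Python sketch. It swaps in a simple naive Bayes classifier with add-one smoothing as a stand-in, since this post does not say what algorithm DiscoverText uses. It also assumes “accuracy” means the fraction of held-out, human-coded posts that the classifier labels the same way. All names and the tiny training set are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(coded):
    """Build naive Bayes counts from (text, label) pairs coded by a human."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in coded:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def classify(text, model):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, held_out):
    """Fraction of held-out coded posts where the classifier agrees with the human."""
    correct = sum(1 for text, label in held_out if classify(text, model) == label)
    return correct / len(held_out)

# Made-up coded examples, for illustration only.
coded = [
    ("gas prices will fall now", "Fall"),
    ("cheaper gas is coming", "Fall"),
    ("gas prices will rise again", "Rise"),
    ("expect gas to rise soon", "Rise"),
]
held_out = [("prices will fall", "Fall"), ("gas will rise", "Rise")]

model = train(coded)
print("accuracy on held-out sample:", accuracy(model, held_out))
```

With a real dataset, the held-out sample should be coded separately from the training sample, otherwise the accuracy figure will look better than it is.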
As we now know, the national average gas price has fallen from $3.95 on May 1 to where it currently sits at $3.64, marking a 31-cent drop since the death of bin Laden. According to the Classification Reports, 52% of people tweeting about “gas” believed the price would fall, while only 10% of tweeters believed gas prices would rise. These numbers are much in line with my personal observations. While the fall in gas prices may or may not have anything to do with the death of bin Laden, the 52% who expected a drop turned out to be correct. Finally, much to my surprise, the number of tweets regarding the price of oil drastically outnumbered those in my “Policy” dataset.
The “Policy” dataset was small, with only 499 data units to work with. My objective here was to code for sentiment surrounding U.S. Foreign Policy. I used the codes “Positive,” “Negative,” and “Constructive Criticism.”
Of those tweeting about U.S. Foreign Policy, 52% offered “Constructive Criticism.” On the more specific side of sentiment, negatives clearly outweighed positives: nearly 36% of comments about Foreign Policy were negative, while only 11% were positive. From these numbers, a few conclusions can be drawn. One, with only 499 tweets, a very small number of people discussed policy. Two, those who did discuss policy were more interested in adding to the discussion than in declaring Obama’s foreign policy a positive or negative step for the country.
Scouring the bin Laden data has certainly not come to an end. In this first search, I pulled a couple of odd terms. There are endless opportunities when working with this data: DiscoverText has captured the highest-volume period in Twitter history, meaning there is nearly endless information to expose. If you have any suggestions for what you would like to see out of the data, please contact any of the DiscoverText support specialists. Keep checking the Texifter Blog for further case studies done with the bin Laden data.