Rate Limits on Twitter

The evolution of APIs opens the door for third-party developers to access information on social media networks. In the best case, this provides a healthy, democratic flow of information. Yesterday, however, DiscoverText ran into “rate limits” on its access to Twitter data. As written, the Twitter API allows 150 unauthenticated calls per hour, per IP address.

Authenticated calls (users logged in with their Twitter credentials via OAuth) allow up to 350 calls per hour, per user. In addition, the Twitter Search API has internal rate-limiting mechanisms, but Twitter does not publish those specific limits for fear of abuse.

Going over any of these limits results in the user being presented with “Error 420,” which simply means the user is being rate limited. This hampers the ability to harvest Twitter feeds within DiscoverText. We had never had rate-limit problems before this, but judging by the timestamps on articles posted at https://dev.twitter.com/, Twitter may have become more cognizant of those harvesting large amounts of data (not just us) and, as a result, is cracking down on heavy users.

At Texifter, we fully respect the rules of the Twitter API and in no way seek to disobey or bend them in our flagship software product, DiscoverText. On August 18, 2011, the same day we learned of the 420 errors, we performed emergency maintenance to better cope with Twitter’s rate limitations, to handle rate-limit errors more gracefully, and to ensure we abide by the Twitter Terms of Service. That said, in order to continue harvesting information from Twitter and performing our cutting-edge research, we are currently exploring easier and more reliable ways to harvest data.

The maintenance performed on DiscoverText still allows 1,500 items per fetch, as determined by the architecture of Twitter’s public API. In addition, no extraneous error messages should appear when DiscoverText is being rate limited. Some searches might be silently delayed for five minutes; however, these fetches will catch up as soon as they can.
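A minimal sketch of this delayed-retry behavior (the function names and the `RateLimited` exception are illustrative, not DiscoverText’s actual internals):

```python
import time

RATE_LIMIT_DELAY = 300  # the silent 5-minute pause described above

class RateLimited(Exception):
    """Raised when the API answers with HTTP 420 (rate limited)."""

def fetch_with_retry(fetch_page, retries=3, delay=RATE_LIMIT_DELAY):
    """Call fetch_page(); on a 420, pause silently and try again."""
    for attempt in range(retries):
        try:
            return fetch_page()
        except RateLimited:
            if attempt == retries - 1:
                raise  # give up only after the final retry
            time.sleep(delay)  # no error shown; the fetch catches up later
```

Any fetcher that raises `RateLimited` on a 420 response can be wrapped this way, so the user never sees a spurious error.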

In the near future, look for new developments in DiscoverText. We’ve got big plans for our social media API fetching that will greatly enhance our users’ ability to receive timely and actionable social media feeds. We don’t want to reveal too much right this moment, but we’re sure you’ll like what we have in store, and in traditional Texifter style, we’ll plan a large announcement when the time is right.

Posted in general | 4 Comments

Presenting Actionable Insights

Every day, users of social media miss out on potentially important pieces of information they are seeking. This problem, which everyone encounters, illustrates a larger conundrum confronting users of social media: information overload. For those who are not highly focused on a niche group, Twitter, Facebook, and other social media create a scene of many words, ideas, and pictures. Instead of stimulating the brain with knowledge, it slightly confuses the senses.

To combat this problem, it is necessary to re-focus the common question: how does one “find actionable insights from their followers?” Solving this social media puzzle requires knowledge of key individuals with many followers who can be used to spread ideas, but in the end, it is the relevant content that matters most.

DiscoverText is engineered to scour Twitter and Facebook for key content and the users who produce it. For grassroots campaigns, this type of analysis is crucial. While Hootsuite and Twitalyzer let users see the stats of previously identified important individuals, DiscoverText reverses the process, surfacing the relevant content first and then the user who created it.

To illustrate this process, I will use some of the politically charged tweets I have been harvesting for the past week amid the debt crisis, market turmoil, and the beginning of the 2012 election campaign. This type of tweet harvest is perfect for grassroots political organizations, as it provides the perfect opportunity to view both the content and its producers. Organizations can not only contact content producers directly, but also see trends in the data on which they can take action.

Having harvested over 58,000 Tea Party tweets over a couple of days, DiscoverText allows me to navigate this large amount of text data, find key trends and individuals, and plan a course of action. Immediately, the Cloud Explorer can generate an adaptable word cloud to give users a better idea of the general trends in the data.

One might notice that “Biden” is used more frequently than “Palin,” a Tea Party juggernaut, or that “terrorist” is used more than 4,000 times. To get the most out of the Cloud Explorer, it is best to periodically re-generate the visualization, incorporating new items as more data is harvested. This allows users to observe and record trends in the data.

While the Cloud Explorer lets users view trending content, DiscoverText’s Advanced Filters open the data up to a whole new level of interactivity. The top-meta feature identifies the most vocal individuals, so users find not only relevant content but key users. Among the Tea Party tweets, the most vocal user, who tweeted about the Tea Party 77 times, happens to be “emilsoncosta.” A quick search of Twitter shows he is a well-educated, politically interested individual with a tightly concentrated network: exactly the right profile for grassroots organizations to contact in hopes of spreading their ideas to niche populations, or even gaining further insight from.

Finally, these Advanced Filter searches are not limited to the whole archive; they can also be performed on subsets of the data, such as Cloud Tag searches and general searches of the archive. So, for instance, suppose I wanted to see who was the most vocal in using the terms “terrorist” and “Tea Party.” I would simply apply the same filter to the search, and I would find my most vocal user once again. The Advanced Filters’ organizational possibilities do not end there: filters can also organize data by creation date and time, allowing it to be ordered chronologically.

With DiscoverText, it is possible to mitigate the worst effects of information overload. By looking at specific content and its specific source, DiscoverText allows users to eliminate clutter and focus directly on key words, interests, and users. While it is impossible to filter all social media, it is nice to know that when I need specific information and specific users, there is an outlet that makes the constant flow of information manageable, sharpening the senses and making the social media experience more productive.

Posted in general | 3 Comments

ACUS Calls for ‘Reliable Comment Analysis Software’

In a recent series of recommendations, the Administrative Conference of the United States (ACUS) announced findings under the auspices of “Legal Considerations in e-Rulemaking,” from the Committee on Rulemaking. Having spent more than a decade working on e-Rulemaking, I was curious to see what was at the top of their list. It was a relief to find that in the Final Recommendations, Item 1, Section A reads:

Consider whether, in light of their comment volume, they could save substantial time and effort by using reliable comment analysis software to organize and review public comments.

The ACUS report continues:

(1) While 5 U.S.C. § 553 requires agencies to consider all comments received, it does not require agencies to ensure that a person reads each one of multiple identical or nearly identical comments.

(2) Agencies should also work together and with the eRulemaking program management office (PMO), to share experiences and best practices with regard to the use of such software. [emphasis added]

At Texifter, we know quite a bit about best practices for sorting duplicate and near-duplicate public comments. We have supported and trained Public Comment Analysis Toolkit (PCAT) and DiscoverText users at the USDA, NOAA, FCC, NLRB, SBA, USFWS, and the Treasury Department. Our duplicate detection and near-duplicate clustering save agencies the expense of manually sorting non-substantive modified form letters. DiscoverText is now used in Europe by aviation regulators.
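As an illustration of the general idea (a simplified sketch, not Texifter’s proprietary algorithm), near-duplicate form letters can be grouped by comparing word-shingle overlap between comments:

```python
def shingles(text, n=3):
    """Lower-cased word n-grams used as a fingerprint of a comment."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    """Overlap between two shingle sets, from 0.0 (disjoint) to 1.0 (equal)."""
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_near_duplicates(comments, threshold=0.8):
    """Greedy single-pass clustering: each comment joins the first
    cluster whose representative it resembles closely enough."""
    clusters = []  # list of (representative shingles, [comments])
    for text in comments:
        sig = shingles(text)
        for rep, members in clusters:
            if jaccard(sig, rep) >= threshold:
                members.append(text)
                break
        else:
            clusters.append((sig, [text]))
    return [members for _, members in clusters]

form_letters = [
    "Please protect the wetlands from development",
    "Please protect the wetlands from development now",  # modified form letter
    "Raise the fee schedule for permits",                # substantively different
]
print([len(g) for g in cluster_near_duplicates(form_letters)])  # [2, 1]
```

The two lightly edited form letters land in one cluster; the substantive comment stands alone, which is exactly the sorting that saves agency reviewers time.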

How did we get here? More than 300 agency officials attended workshops, focus groups and interviews over a 10-year period. Algorithms were developed and tested. Interfaces were designed, built, tested and re-built.  Agencies shared millions of public comments and guided us as we tailored a system to work with the bulk downloads from their email servers and the Federal Docket Management System, which gathers the nation’s public comments at Regulations.gov. If “reliable comment analysis software” is needed, Texifter’s flagship product DiscoverText has to be considered a guiding light for some of the key ACUS findings.

Posted in general | Comments Off

DiscoverText Webinar

Learn to Use DiscoverText – Free Tutorial Webinar
Tuesday August 16 at 1:00 PM EST
Webinar Registration: https://www1.gotomeeting.com/register/750225288

This free, live Webinar introduces DiscoverText and key features used to ingest, filter, search & code text. We take your questions and demonstrate the newest tools, including a Do-It-Yourself (DIY) machine-learning classifier. You can create a classification scheme, train the system, and run the classifier in less than 20 minutes.

DiscoverText’s latest feature additions can be easily trained to perform customized mood, sentiment and topic classification. Any custom classification scheme or topic model can be created and implemented by the user. Once a classification scheme is created, you can then use advanced, threshold-sensitive filters to look at just the documents you want.

You can also generate interactive, custom, salient word clouds using the “Cloud Explorer” and drill into the most frequently occurring terms or use advanced search and filters to create “buckets” of text.

The system makes it possible to capture, share and crowd source text data analysis in novel ways. For example, you can collect text content off Facebook, Twitter & YouTube, as well as other social media or RSS feeds.

Dataset owners can assign their “peers” to coding tasks. It is simple to measure the reliability of two or more coders’ choices. A distinctive feature is the ability to adjudicate coder choices for training purposes or to report validity by code, coder, or project.

So please join us Tuesday August 16 at 1:00 PM EST for an interactive Webinar. Find out why sorting thousands of items from social media, email and electronic document repositories is easier than ever.

Webinar Registration: https://www1.gotomeeting.com/register/750225288
Recorded Webinar from June 2011: http://www.screencast.com/t/Z8ilwJSnxf

Posted in general | Comments Off

Klout Marketing by Spotify

In Q1 of this year, serious internet chatter surfaced about the possibility of Spotify, the popular Swedish online streaming music service, launching in the U.S. Spotify is a tantalizing proposition for the American media consumer. While there are costs incurred, the service most resembles the glory days of Napster, when someone could find ANY song, ANY time. Spotify launched in the U.S. on Thursday, July 14th; however, it did not open the floodgates for the masses. Much as Google+ did two weeks earlier, Spotify launched via “invites.” First access was granted to those with high Klout scores, and the hope was that there would be a “trickle down” effect.

I secured an invite early Thursday morning, and I’ll admit I was quickly hooked. Simultaneously, I began to use DiscoverText to import Twitter feeds and analyze chatter about this captivating new service. I certainly found my invite, but how well did Spotify execute its plan? Was this attempt at social media marketing successful? In this blog post, I will discuss my goals in trying to gauge how effective Spotify’s launch was, and how other services closely related to the launch, such as Pandora, iTunes, and Klout, were discussed in relation to Spotify. I will also demonstrate how effective DiscoverText is for measuring the success of social media marketing, and how simple it is to search for, identify, and contact vocal users.

I decided to pull Twitter feeds for only two days after the launch. I wanted to see the immediate indicators of Spotify’s success, not the lagging ones. My Spotify Twitter archive pulled in 113,000 tweets, which I de-duplicated, leaving approximately 100,000. Employing the Cloud Explorer, I immediately saw that the words “invitation,” “invites,” and “invite” were used heavily, with 16,000 tweets using these terms. I then used the Advanced Filters to find the most vocal Twitter users. Not surprisingly, official Spotify accounts were the instigators, tweeting about invites the most; still, Spotify did a phenomenal job getting many diverse users to invite people, with 7,200 users tweeting about invites. So, which person was the most vocal? A user named DevinDTA, from California, tweeted exactly 99 times about Spotify invites.
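The two preparatory steps described here, exact de-duplication and counting invite-related mentions, can be sketched in a few lines (a simplified stand-in for DiscoverText’s built-in features, with toy example tweets):

```python
from collections import Counter

def dedupe(tweets):
    """Drop exact duplicates (e.g. verbatim retweets), keeping first occurrences."""
    seen, unique = set(), []
    for t in tweets:
        key = t.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique

def term_counts(tweets, terms):
    """Count how many tweets mention each term (case-insensitive)."""
    counts = Counter()
    for t in tweets:
        low = t.lower()
        for term in terms:
            if term in low:
                counts[term] += 1
    return counts

tweets = [
    "Got my Spotify invite!",
    "got my spotify invite!",          # exact duplicate after normalizing
    "Spotify launches in the US today",
]
unique = dedupe(tweets)
print(len(unique), term_counts(unique, ["invite"]))  # 2 unique, 1 mentions an invite
```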

It is then possible to see exactly what this user tweeted, and in doing so, I found that the user at first heavily lobbied for an invite and then, upon receiving one, returned the favor by guiding other users to invites. This user did exactly as Spotify wished, becoming both a customer and a free marketing source. Using the information provided by DiscoverText, it would now be possible to contact DevinDTA and other highly active users, or to study their behavior specifically to figure out why they became prolific Spotify promoters.

My second goal was to see how disruptive Spotify became to Pandora and iTunes, and how it aided the start-up Klout, with which Spotify teamed to launch its service. Again using the Cloud Explorer, I found “Pandora” and “iTunes” to be the topic of much discussion, with over 2,000 occurrences in the archive. I parsed this collection and created several buckets.

I found the majority of the posts proclaimed Spotify’s coming dominance over Pandora, as in one user’s statements, “Spotify will be a Pandora killer” and “Spotify is my new Pandora.” While this represents a very small proportion of the tweets, a large social media backlash is something Pandora would certainly want to know about. Using DiscoverText, it is possible to identify these users, contact them, and discover their motives in degrading the Pandora brand in favor of the Spotify service.

Finally, “Klout” was used 13,000 times in the archive. The experiment proved a great success for Klout, whose website crashed under the heavy traffic. This was certainly noted in posts by people complaining about their inability to access the Klout website to grab an invite. At this point, however, any publicity is positive for Klout as it tries to gain followers and users. Using the Advanced Filters once again, I was able to find the most vocal users and the number of times “invite” was used together with “Klout,” which was nearly every time. You can draw the conclusion that the vast majority of “Klout” mentions concerned invites, not the service itself. Completing this exercise, it is easy to see that this partnership was smart for Klout: they no doubt aided Spotify and maybe got a little clout themselves.

It is increasingly likely that brands will continue to use social media to expand their reach. As seen with Spotify and Google+, it is now conceivable to launch a site only to the well connected and let it proliferate throughout the web. This marketing style can be extremely fruitful for a brand; DiscoverText, in turn, lets brands manage their social media strategy more effectively by allowing them to interact with their marketing campaigns, rendering static social media analysis a thing of the past.


Posted in general | Comments Off

Twitter Opinion on Debt Ceiling

Latest Twitter opinion polls regarding the debt ceiling (consistent with Zogby poll numbers):


Here’s how we got there….

Since the debt ceiling fiasco began, nationwide outrage toward Washington has been at an all-time high. Naturally, here at Texifter, we decided to put our software to work, using DiscoverText to collect politically charged tweets around the time of the Senate vote. And while we know where most politicians stand on the issue, we decided to use DiscoverText’s sentiment classification to better understand where the majority of Americans stand.

We collected all tweets with the key words “Democrat,” “Republican,” “Tea Party,” “Obama,” and, most importantly, “Debt Ceiling.” “Tea Party” proved to be the most mentioned of the three parties, with over 11,000 references at the time of the vote, while “Obama” trumped all parties with over 38,000 tweets. The term of the hour, “Debt Ceiling,” which many had not even heard of until last week, registered an impressive 15,000 tweets.

To gauge the sentiment of the 15,000 “Debt Ceiling” tweets, we used a very straightforward coding scheme consisting of the codes “Approval,” “Disapproval,” and “Other.” Using the Cloud Explorer, we were able to view the most popular terms used in conjunction with “Debt Ceiling.” While we expected “Tea Party” and “Obama,” we did not expect to see “Giffords” and “Gabrielle.” The returning representative seemed to have garnered celebrity status on Twitter: she was mentioned in over 1,500 tweets, and when coding, I found her to be the main subject of many of them, with the debt ceiling secondary. Interestingly, these tweets carried overwhelmingly positive sentiment, while other members of Congress and President Obama were almost always associated with negative sentiment.

Our classification report revealed that 40% (or 6,018) of the tweets expressed disapproval of raising the debt ceiling, while 31% (or 4,650) expressed approval.


We also utilized an “Other” category, in which tweets neither approved nor disapproved, but simply commented on the matter; this characterized 28% of the 15,000 classified tweets.

Therefore, of the 10,668 tweets that expressed a specific opinion on this contentious issue, we concluded that 44% approved while 56% disapproved of raising the debt ceiling (see the pie chart above).
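These re-based figures can be checked with a few lines of arithmetic on the counts reported above:

```python
approve, disapprove, total = 4650, 6018, 15000
opinionated = approve + disapprove  # tweets taking a clear stance

def pct(part, whole):
    """Percentage rounded to the nearest whole point."""
    return round(100 * part / whole)

print(pct(disapprove, total))        # 40: share of all tweets disapproving
print(pct(approve, total))           # 31: share of all tweets approving
print(pct(approve, opinionated))     # 44: approval among opinionated tweets
print(pct(disapprove, opinionated))  # 56: disapproval among opinionated tweets
```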

Now let’s compare our numbers with Zogby’s…. When asked, “Do you agree or disagree that the U.S. Congress should raise the debt ceiling?” 42% Agreed, 50% disagreed, 8% were unsure. (with a margin of error of +/-2.1%). Take that Zogby!

In the coming weeks, we will continue to harvest tweets pertinent to the political and cultural clashes taking place in America. As new metrics are calculated, we will continue to report them. If you have any questions or requests for specific metrics, do not hesitate to contact us.


Posted in general | 7 Comments

Hurtigruten: A Norwegian Social Media Phenomenon

Here at Texifter, we are regularly impressed by the many ways our growing number of end-users discover new applications for DiscoverText’s capabilities in their own research. With each new success story, we are once again reminded of the theories of MIT professor Eric von Hippel, who argues that end-users are often responsible for substantial innovations. For that reason, we will occasionally take time to highlight some of those innovative DiscoverText users on our blog, in hopes that you, too, will be inspired. If you have a DiscoverText success story of your own that you’d like to share, feel free to write up your own post or e-mail me at josh@discovertext.com.

On June 16 the “Nordnorge,” a Norwegian Coastal Express ship (or “Hurtigruten”), set sail from Bergen to Kirkenes on a historic journey around the Norwegian coastline. For the 134-hour voyage, millions of Norwegians (as well as viewers from around the globe) tuned in to the channel NRK2 to watch a 24-hour feed from the ship as it sailed past stunning vistas, extraordinary wildlife, and mesmerizing landscapes. A Norwegian friend of mine told me it evoked a sense of wonder and near-hypnosis.

Throughout Norway, “Hurtigruten – minutt for minutt” was dubbed a new media phenomenon in which social media supposedly played a significant role in its vast popularity. But one Norwegian blogger, Jacob Christian Prebensen (of the technology and new media blog NRKbeta), was particularly skeptical about the way Hurtigruten’s popularity supposedly spread. He suspected that while over half of Norway uses social media, only a fraction of Hurtigruten’s 3 million viewers were actually inspired by it. The show’s Facebook fan page had been “liked” over 60,000 times, but that number alone did not shed any light on how much communication had actually occurred on the page itself; so Jacob used DiscoverText to ingest every comment on the fan page, roughly 10,000 comments in all.

Next, Jacob used a website called hashtracking.com to count the tweets that included the hashtag #hurtigruten, and to calculate what percentage of them were unique tweets rather than @-messages or retweets. Little did Jacob know that he could have easily accomplished this task in DiscoverText using the de-duplication and clustering features; in any case, he ultimately concluded that the 30,000+ tweets mentioning #hurtigruten originated from only about 6,000 users.
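Counting distinct authors behind a hashtag’s volume, the heart of Jacob’s calculation, takes only a frequency count over tweet authors. A toy sketch (the dictionary shape of the tweets is illustrative):

```python
from collections import Counter

def author_counts(tweets):
    """Map each screen name to how many of the given tweets it produced."""
    return Counter(t["user"] for t in tweets)

tweets = [
    {"user": "alice", "text": "#hurtigruten is mesmerizing"},
    {"user": "bob",   "text": "Watching #hurtigruten all night"},
    {"user": "alice", "text": "Still watching #hurtigruten"},
]
by_author = author_counts(tweets)
print(len(by_author))  # 2 distinct users behind 3 tweets
```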

Jacob’s Twitter calculation, taken together with his DiscoverText Facebook calculation, ultimately helped to disprove the popular Norwegian misconception that the popularity of “Hurtigruten – minutt for minutt” was solely a social media phenomenon: only a fraction of the program’s Norwegian viewership actually used social media to spread the word.

(Special thanks to Mohammed Awed for his kind help in translation)

Posted in general | Comments Off

Updates to the DiscoverText Terms of Service

Effective July 25th, 2011, Texifter has modified the Terms of Service governing its DiscoverText product and the use of Texifter’s services and sites in general.

The primary changes of importance include:

  • Paragraph 1.1: we’ve removed the phrase “and excluding any services provided to you by Texifter, LLC. under a separate written agreement.” If there is a separate written agreement between you and Texifter, the amended terms will be spelled out in the separate written agreement.
  • Paragraph 1.7: this paragraph has been deleted – it dealt with additional terms as previously laid out in 1.1 that are no longer applicable.
  • Paragraph 9.1: regarding information disclosure, we’ve added the clause at the end “without Texifter’s prior written consent except as required by applicable federal records or freedom of information law.”
  • Added paragraph 15.3: regarding Texifter’s legal obligations when contracting with a Federal Agency.
  • Paragraph 19.1: Added pledge that Texifter will notify its users via email at least three days in advance of incoming changes to the Terms of Service.

In addition, the following Terms of Service changes are applicable to any agreements with Federal Agencies as defined in paragraph 1.5 of the Terms:

  • Paragraph 20.8: new text regarding indemnification and damages.
  • Added paragraph 20.9: use of Federal Agency logos and other marks.
  • Added paragraph 20.10: assignation of rights to third parties.
  • Added paragraph 20.11: System and data security obligations.
  • Added paragraph 20.12: Federal records obligations.

As always, the Texifter Terms of Service can be found at http://texifter.com/home/terms, with a mirrored copy on DiscoverText at http://discovertext.com/terms.aspx.

We want to thank the various FCC personnel, including those at the OGC, for helping us to craft a Terms of Service that should work for any US federal agency.

Posted in general | Comments Off

Bin Laden, Oil & Foreign Policy

Immediately following President Obama’s speech on the evening of May 1 confirming the death of Osama bin Laden, the DiscoverText team began collecting all Twitter posts containing the key words “Osama” and “Bin Laden.” The bin Laden project collected more than 4.7 million tweets, 1,500 at a time over the public API, thereby archiving a slice of the period Twitter described as the “highest sustained rate of Tweets ever.” However, this episode has not been without controversy: the good folks at Twitter reminded Texifter personnel not to share the tweets.

This did leave the door wide open for us to describe what we have collected. While that might seem like a straightforward task, 4.7 million tweets is a perfect modern-day example of “information overload,” and determining how to mine the data is a challenge. A few weeks ago, we looked at the bin Laden “re-tweet champion,” finding the individual Twitter accounts that had been most active on this topic since May 1st.

After using DiscoverText to de-duplicate the massive archive, we were still left with 1 million unique posts. There are many different dimensions that could be analyzed; I even opened a discussion on LinkedIn asking what people would like to see come from the data.

Using the tools DiscoverText offers, my first foray into the bin Laden data involved much parsing; in the end, I settled on two terms for study, “Gas/Oil” and “Foreign Policy.” While there are numerous terms one could search and analyze, for many Americans these two are quite relevant in this context. The searches returned a slice of data for each term, which proved much more manageable than the 4.7 million posts. I chose to analyze each term using a distinct approach. In the remainder of this blog post, I will discuss my methodology and detail the findings for each term.

“Gas/Oil”

When searching my de-duplicated bucket for “gas,” 2,525 items were returned, or 0.26% of the de-duplicated data. I formed a dataset, and my objective was to find what percentage of people believed gasoline prices would rise or fall in the near future, following the death of bin Laden. I also added “Complaint” and “Other” categories in order to categorize errant tweets.

I manually coded 10% of the dataset, trained the classifier, and then classified the rest of the dataset. After checking for accuracy, I found my newly established classifier scored 0.78, an excellent number for a new classifier with little training data.
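The accuracy check amounts to comparing the classifier’s labels against held-out human codes. A hypothetical five-item example (the labels here are made up for illustration):

```python
def accuracy(gold, predicted):
    """Fraction of items where the classifier agrees with the human coder."""
    assert len(gold) == len(predicted)
    hits = sum(g == p for g, p in zip(gold, predicted))
    return hits / len(gold)

gold      = ["Fall", "Fall", "Rise", "Other", "Fall"]  # human codes
predicted = ["Fall", "Rise", "Rise", "Other", "Fall"]  # machine labels
print(accuracy(gold, predicted))  # 0.8
```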

As we now know, gas prices have fallen from a national average of $3.95 on May 1 to where they currently sit at $3.64, a 31-cent drop since the death of bin Laden. Reading the Classification Reports, 52% of people tweeting about “gas” believed the price would fall, while only 10% believed gas prices would rise. These numbers are much in line with personal observations. While the fall in gas prices may or may not have anything to do with the death of bin Laden, 52% of tweeters correctly believed that gas prices would fall after his death. Finally, much to my surprise, the number of tweets regarding the price of gas drastically outnumbered those in my “Policy” dataset.

Foreign Policy

The “Policy” dataset was small, with only 499 data units to work with. My objective here was to code for sentiment surrounding U.S. foreign policy, using the codes “Positive,” “Negative,” and “Constructive Criticism.”


I used the same methodology as in the “gas” study. I created a classifier that returned 75% accuracy and received results that mirrored my observations.


52% of those tweeting about U.S. foreign policy offered “Constructive Criticism.” On the more specific side of sentiment, negatives clearly outweighed positives: nearly 36% of comments about foreign policy were negative, leaving only 11% positive. From these numbers, a few conclusions can be drawn. One, with only 499 tweets, very few people discussed policy at all. And two, those who did were more interested in adding to the discussion than in declaring Obama’s foreign policy a positive or negative step for the country.

Scouring the bin Laden data has certainly not come to an end; in this first pass, I pulled only a couple of odd terms. There are endless opportunities when working with this data: DiscoverText has captured the highest-volume period in Twitter history, meaning there is near-endless information to explore. If there is anything you would like to see drawn out of the data, please contact any of the DiscoverText support specialists, and keep checking the Texifter Blog for further case studies done with the bin Laden data.

Posted in general | Comments Off

FB Graph API Revisited

Following the blog post about oddities Stu was experiencing with Facebook importing while visiting the DMI summer school, I put on my detective hat and went looking for possible reasons why things that should be available from the Facebook Graph API were not being collected by DiscoverText. This proved to be a somewhat complicated task, since much of the Facebook realm is less than fully documented. As software developers, we share that pain point.

So, perhaps it is permissions? My first thought was that it might be a permissions issue, with variation between different Facebook accounts making a huge difference in what gets imported. I have two different DiscoverText accounts attached to two separate Facebook accounts just for testing, so I set both of them to import a set of test archives. Not knowing exactly where to start, I did a public-post import of the search terms “Casey Anthony Trial” and then grabbed the public page for the “Cooking Channel,” a friend’s personal wall (to ensure that the test account couldn’t import it), and Mark Zuckerberg’s public page.

Importing all public comments on the Casey Anthony Trial

The Notifications page in DiscoverText contains information that can be a little cryptic. When your import first finishes, you will get a Notification that starts with the tag [Ingestion] or [Scheduled Ingestion]. This line reports which of the items DT tried to import were or weren’t added to your archive.

Looking at the three sets of items I recently tried to bring into DiscoverText, you will see that the number of items written is sometimes less than the number of items found. This occurs in Facebook imports because DiscoverText only imports text, and not all Facebook posts contain text; sometimes people just post pictures, video, or links without any text to explain them. Such posts show up in DT as empty items, so they are skipped. For Scheduled Ingestions (imports put on a repeating schedule so DT automatically gathers more from a source), you also have the option of gathering only new items, which spares the user duplicate entries.
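The skip-empty-items rule can be expressed as a simple filter over Graph API post objects; a minimal sketch, assuming the post’s text lives in the `message` field as the Graph API returns it:

```python
def text_items(posts):
    """Keep only posts that carry text; photo/video/link-only posts are skipped."""
    kept = []
    for post in posts:
        message = (post.get("message") or "").strip()
        if message:
            kept.append(message)
    return kept

posts = [
    {"message": "Great recipe tonight!"},
    {"picture": "http://example.com/p.jpg"},  # no text: skipped
    {"message": "   "},                       # whitespace only: skipped
]
print(len(text_items(posts)))  # 1 item written out of 3 items found
```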

Except for the expected failure when the account with no “friends” tried to scrape my friend’s personal wall, I didn’t see any of the problems Stu reported when I ran my initial tests. However, an undocumented aspect of the DiscoverText architecture came to light. After running several tests, I noticed that the 30-day trial account was importing far fewer items than my main Enterprise account in specific situations. For the Casey Anthony Trial search, which ran across all public posts, both accounts had very similar returns. But when I imported the Cooking Channel page (Archive: Graph Test Page above), the trial account imported a tenth as many items as my Enterprise account.

Further investigation revealed that a trial account will only retrieve either a set number of items or a set window of recent posts and comments, whichever is larger. I don’t know the exact limits and I haven’t heard back from the programmer, but I believe they are around 500 items, or roughly three months back in history.


However, this still isn’t one of the problems Stu and his class were seeing. I asked what they were searching on and he answered “Aruba”. If you are going to search on a topic, it might as well be a tropical island. To be more specific, they were trying to import information from Facebook Groups having to do with Aruba. For anyone who hasn’t played with Facebook enough, it has several types of pages, including People, Pages, and Groups.

“People” pages are everyday users’ personal pages. “Pages” are usually devoted to businesses or public figures and act as forums where creators can put information out, often while allowing feedback from the public. “Groups” let users create discussion areas where people with common interests can gather to talk, share, and be social about that interest.

FB Archiving Off Old Groups

I found that Facebook changed its Groups format and has been prepping to archive all the old groups. This may be part of the problem, since Facebook makes little distinction in its Graph API between currently active Groups and ones that have been closed for archiving. This means there are a bunch of possible groups in the import stream that will return nothing of value, which makes it harder to get good data quickly. Of the top 3 Groups returned by a search for Aruba, the first two are being archived and return fewer than 5 posts between them. The third group, on the other hand, finally got us somewhere.

Top 3 Groups found Searching 'Aruba'

The group Aruba! has almost 2,000 members and is an Open Group according to Facebook. It also returned a notification of “1 item was written out of 1139 items found”. This was what Stu was running into. Without getting into the messy business of Access Tokens, let’s just say that even though this group is “Open” and can be viewed by anyone, it isn’t completely open and requires permission to join.
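For readers curious what such a request looks like, here is a hedged sketch of the kind of Graph API call involved. The endpoint shape follows the 2011-era Graph API; the group ID and token are placeholders:

```python
from urllib.parse import urlencode

GRAPH = "https://graph.facebook.com"

def group_feed_url(group_id, access_token=None, limit=100):
    """Build the URL a tool like DiscoverText would fetch for a group's feed."""
    params = {"limit": limit}
    if access_token:
        # Some "Open" groups still gate their feed: without a token the
        # request returns an OAuth error instead of posts.
        params["access_token"] = access_token
    return f"{GRAPH}/{group_id}/feed?{urlencode(params)}"

print(group_feed_url("ARUBA_GROUP_ID", limit=50))
```

When the token lacks the right permission for a gated group, the feed comes back nearly empty, which is consistent with “1 item was written out of 1139 items found.”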

It seems that the Graph API permissions we are using to get all of the other pages and groups don’t work for this specific setup for a group. An update to DiscoverText in the near future should clear things up.

Posted in general | Comments Off

Capturing Social Media Dissent

Immediately after the judge read the not guilty verdict in the high-profile Casey Anthony criminal case, total internet traffic to major news outlets doubled, and, most importantly, onlookers took to social media outlets to express their emotions and opinions. When this occurred, I used DiscoverText to immediately begin scraping Facebook and Twitter feeds to archive public responses to what may be the most shocking verdict since the O.J. Simpson trial.

My goal in importing the feeds was to capture the data, code for sentiment, train a classifier, classify the data, and calculate the percentage of people using social media to agree, disagree, or simply joke about the verdict. This situation, much like the death of Osama Bin Laden, was perfect for displaying the power of DiscoverText over unstructured social media data.

I set the feed scheduler to import public Facebook and Twitter feeds containing the keywords “Anthony” and “Legal System.” After my first feed ingestion, I had nearly 50,000 posts to analyze. Within the next hour, I had doubled that. This feed will update every hour for the next 4 days, so total content will grow consistently. Similar to my experiment using data collected from ESPN tweets, in this blog post I will discuss the amount of data harvested in the first day, the diverse content of that data, the sentiment coding process, and the results I found after coding and classifying the data.

To get a better understanding of the general content, I first generated a Tag Cloud from the 216,000 Anthony Twitter posts I had imported, using the newly improved Cloud Explorer feature, which displays the most used terms in an archive.
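Under the hood, a top-terms view like this boils down to a term-frequency count over the archive. A minimal stdlib sketch (the stopword list and tokenizer are illustrative, not Cloud Explorer internals):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "and", "of", "to", "in", "is"}

def top_terms(posts, n=100):
    """Count word frequencies across posts, skipping common stopwords."""
    counts = Counter()
    for post in posts:
        for word in re.findall(r"[a-z']+", post.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)

posts = ["Casey Anthony verdict", "the verdict shocked everyone",
         "Casey Anthony walks"]
print(top_terms(posts, n=3))
```

The real feature adds weighting and rendering, but the ranking itself is just this kind of count.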

When studying the top 100 terms, the high-volume usage of “Casey” and “Anthony” certainly was not shocking. However, giving more of an idea of where people’s minds were focused, over 10,000 people made a connection with the O.J. Simpson trial. Nearly 4,000 people referenced Kim Kardashian, who had expressed a dissenting opinion of the verdict; ironically, her father helped defend O.J. Finally, adding to the many pop-culture references, people compared Ms. Anthony with Showtime serial killer Dexter Morgan, who was mentioned in over 3,000 posts.

Analyzing the Tag Cloud revealed many of the posts to be jokes about the verdict. These posts could be interpreted as disagreement with the verdict; however, in the process of coding the data, I used 4 codes: “Agree,” “Disagree,” “Jokes,” and “Other.” Looking at the Tag Cloud, one can hypothesize that the Twitter community is strongly against this verdict and willing to joke about it, something even Jay Leno can’t get away with.

To train the classifier, I coded 500 tweets and then classified a sample of 1,000. I then validated 200 randomly selected tweets and found the accuracy of the classifier to be 75%. An accuracy of this rate is superb considering the wide array of Twitter data that had been imported.
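The validation step above amounts to drawing a random sample of machine-classified items and comparing them against human codes. A sketch with made-up data (the toy label assignments and function name are hypothetical, not the actual classifier):

```python
import random

def estimate_accuracy(machine_codes, human_codes, sample_size, seed=0):
    """Estimate classifier accuracy from a random validation sample.

    machine_codes / human_codes: dicts mapping item id -> assigned code.
    """
    rng = random.Random(seed)
    ids = rng.sample(sorted(machine_codes), sample_size)
    agreed = sum(machine_codes[i] == human_codes[i] for i in ids)
    return agreed / sample_size

# Toy data where machine and human agree on roughly 3 of every 4 items.
machine = {i: ("Disagree" if i % 4 else "Agree") for i in range(1000)}
human = {i: ("Disagree" if i % 4 else "Jokes") for i in range(1000)}
print(estimate_accuracy(machine, human, sample_size=200))
```

Sampling 200 of 1,000 keeps the validation effort small while still giving a usable accuracy estimate.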

To visualize the breakdown by percentage, I used the Classifier Report, which gave me results very similar to what I suspected while coding: 46% of tweets disagreed with the verdict, 27% joked about the verdict, 21% tweeted something unrelated to the verdict, and 4% agreed with the verdict.

The amount of data collected and the passionate responses illustrate that a small but vocal segment of the population has been actively tweeting about this case. Based on the tweets harvested by DiscoverText and the sample I experimented with, it can be concluded that the majority certainly do not agree with the jury’s verdict.

Posted in general | Comments Off

Facebook Graph Mysteries

During my recent visit to the Digital Methods Initiative (DMI) summer school, hosted by my good friend Richard Rogers, I had the pleasure of spending two days teaching and working with 35 exceptionally bright students who were new to the tools and techniques that are part of DiscoverText.

They were an excellent group, highly motivated and digitally fluent. As part of the class, students put forward project ideas and formed small teams to hack out a solution to some research problem. Many of these ideas involved scraping content off Facebook via the Graph API. I watched eagerly as teams of students furiously tested out many of the “shiny new toy” functionalities they found in DiscoverText. Very quickly, they helped to articulate some of the key mysteries of the permissions managed via the social Graph.

Some data collection questions were immediately raised. For example:

  • Why is there a numerical discrepancy between what appears on the actual public Facebook pages and groups and what is delivered via the Graph?
  • By what combination of criteria do different users get slightly (or vastly) different results for the same query?
  • Why is there often a substantial gap between the number of items the API delivers and the number of items a user of DiscoverText actually gets in the downloaded archive?

As the experiments at the DMI continue, and users of DiscoverText all over the world start asking some of the same questions, we hope to better document here on the blog the precise way in which your credentials, and the settings of diverse Facebook users, impact the data collection made possible using the DiscoverText-Facebook API.

In the meantime, I am home, but the DMI students are still pounding away on the Graph and DiscoverText, raising excellent questions and generating new feature ideas we will surely use.

Posted in research | 1 Comment

Making Sense of ESPN Tweets

Should you ever want to visualize the definition of “unstructured data”, look no further than the beautiful chaos that is 503,000 ESPN tweets, all harvested using DiscoverText. It would be an understatement to call an archive of this nature diverse. Posts appear in Spanish, Chinese, Korean, and Turkish. There are references to the obvious, like LeBron James and his ego, and to the obscure, like European soccer club Olympique Marseille losing their best striker. In addition, there are numerous re-tweets, and the occasional post that is simply incomprehensible.

To even begin to make sense of an archive this massive seems daunting, as viewing each individual tweet is nearly impossible. However, using DiscoverText to code, train, and classify the data, it is possible to develop a better understanding of the nature of the tweets. Have a hunch that most of the tweets are about LeBron James? By coding, training, and classifying the data, you can form a hypothesis about your data and then see how accurate that hypothesis was.

In this, my first post, I will detail the progress I made when coding and classifying the tweets, and how the accuracy improved the more data I coded. By continuing to code your data you can improve the accuracy of your classifier over time, and gain a better understanding of your data by studying the “classifier report”.

I began by creating my classifier with a manageable set of codes: Baseball, Basketball, American Football, Soccer, Hockey, ESPN, and Other.

To see the gradual improvement of the classifier, I began by coding a modest 200 data units. While this is only a fraction of the entire dataset, it is far more manageable than the 503,000 tweets I started with. When finished, I trained the classifier using the data I had just coded. I then classified just 100 data units and checked them for accuracy.

After checking the accuracy, the classifier had a reliability of 60%. While certainly far from perfect, this is still impressive classifier accuracy for just an hour’s work.

My next step was to code far more data units and then put a little more pressure on the classifier. By coding 500 data units and classifying 10,000, I could discover more about the nature of my tweets. To do this, after coding, training, and classifying, I checked the “Classification Report”, which gives a breakdown of the tweets.
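A breakdown report of this kind is essentially a percentage count over the labels the classifier assigned. A minimal sketch (label names and counts here are illustrative, not the actual report):

```python
from collections import Counter

def classification_report(labels):
    """Return each code's share of the classified items, as whole percentages."""
    counts = Counter(labels)
    total = len(labels)
    return {code: round(100 * n / total) for code, n in counts.most_common()}

labels = ["ESPN"] * 40 + ["Other"] * 30 + ["Basketball"] * 25 + ["Hockey"] * 5
print(classification_report(labels))
```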

What I found in the breakdown was quite similar to what I saw when coding. From my observation, many of the tweets were specifically about ESPN’s coverage of the NBA Finals, all of which I coded as ESPN. There were numerous foreign-language posts, none of which I could read (with the exception of a lone post in French), so I coded them as “Other”. Basketball, this being around the time of the Finals, also accounted for a large percentage of the tweets. These percentages all made sense; however, I still had to check the accuracy of the classifier.

This time, I would still check 100 of the 10,000 I classified; however, instead of checking consecutive items, I used a simple random sample of the classified tweets. What I found was much to my liking, making me more confident in the classification report: the classifier had an accuracy of 70%. Much improved, but why? The easiest explanation is more training data. I quickly saw LeBron James classified as “Basketball”, Spanish posts as “Other”, and all those people ranting about ESPN’s coverage of the NBA Finals as “ESPN”. References to the Miami Heat or the Dallas Mavericks were classified as “Basketball”, and the few and far between tweets regarding Hockey were, for the most part, classified correctly.

What I did with the ESPN tweets is easily replicable using DiscoverText. You may import not just ESPN tweets, but tweets from anyone; it does not have to be the “worldwide leader in sports”. Ever have a hunch to scrape all of LeBron’s tweets? It is possible. Or, as you can see in the post by my colleague Josh, the entire Arab Spring can be captured using DiscoverText.

A dataset of this size is great proof of the power of DiscoverText, and there is far more data that can be analyzed. When I began, I had no idea what my results would yield; for example, I had no idea that the majority of ESPN tweets weren’t actually commenting on the sports themselves, but on ESPN. This is a testimony to what can be unlocked using DiscoverText.

Posted in general | Comments Off

Top #GameofThrones Tweeters

Below is another in our series of training videos. In this episode, we introduce you to a great new feature for peeking into the list of top values in a particular metadata field. In the example here, I show how to find the top tweeters in a collection of more than 150,000 tweets about the HBO show Game of Thrones. Hats off to “The Rabbit01”, “Fan of ThronesApp”, and “WiCnet” for their prodigious tweeting.
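Finding the top values of a metadata field is a most-common count over that field. A sketch with hypothetical tweet dicts (the `author` key and counts are made up for illustration):

```python
from collections import Counter

def top_field_values(items, field, n=3):
    """Return the n most frequent values of a metadata field."""
    return Counter(item[field] for item in items).most_common(n)

tweets = ([{"author": "The Rabbit01"}] * 5 +
          [{"author": "WiCnet"}] * 3 +
          [{"author": "Fan of ThronesApp"}] * 4)
print(top_field_values(tweets, "author"))
```

The same counting works for any field, such as hashtag, language, or source client.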

Posted in general | 2 Comments

Monitor Middle East Protests

Dear faithful users and intrigued future users of DiscoverText,

My name’s Josh and I’m one of 3 user support specialists at Texifter LLC. For my first Texifter blog entry, I’m going to demonstrate how I’ve been using DiscoverText to capture minute-to-minute protest tweets in the Middle East, ever since the beginning of the Arab Spring. Then I’m going to show off some of the awesome functions that DiscoverText lets you perform with that data.

To get started, go to your dashboard and click “Start a new project”. Then name your project, and you’re ready to go. (see below)

Now you’ve got a completely empty project that you want to fill with lots of data. In DiscoverText you can import your own data from your computer, or you can pull it offline from Facebook, Twitter, YouTube, and other places. For this blog entry, I’ll be sticking with Twitter, only because that’s where so much exciting stuff is happening.

So, to import a Twitter feed, click “Import data” under Project Options (see below).

Next, you’ll see a whole bunch of ways to bring data into DiscoverText. Click the Twitter icon.

Next, you’ll need to name the archive where your tweets are going to be stored. Below, you can see that I named this archive after what I’m initially interested in getting information about: the city of Hama.

Now, type in your Twitter search term, click the Twitter sign-in button, and then click next.

The last step before you import tweets is something called the “Live Feed Scheduler”. This feature allows you to continuously or periodically pull tweets into your account, even when you are not online. If you’d like to just get as many tweets as you can (there is a maximum of 1,500 for each import), as fast as you can, leave almost everything as is, but click the drop-down menu where it says 1 hour and change it to 5 minutes. Never fear, you can always run multiple feeds if need be. (And you might want to if each import is producing over 1,500 tweets per 5 minutes; using this technique, some users have collected millions of tweets at one time! Are you up for the task?)
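Conceptually, a live feed scheduler just polls a search on an interval, keeps only unseen tweets, and respects the per-import cap. A hedged sketch, with an injectable `fetch` standing in for the real Twitter search call (the function names and dict shape are hypothetical):

```python
import time

def run_feed(fetch, archive, interval_sec=300, rounds=3, cap=1500):
    """Poll `fetch` every `interval_sec` seconds, appending only unseen tweets.

    `cap` mirrors the 1,500-item maximum per import mentioned above.
    """
    seen = {t["id"] for t in archive}
    for _ in range(rounds):
        batch = fetch(limit=cap)
        archive.extend(t for t in batch if t["id"] not in seen)
        seen.update(t["id"] for t in batch)
        time.sleep(interval_sec)  # e.g. 300 s for the 5-minute schedule
    return archive

def fake_fetch(limit):
    # Stand-in returning the same five tweets every round.
    return [{"id": i, "text": f"#Hama tweet {i}"} for i in range(5)]

archive = run_feed(fake_fetch, [], interval_sec=0, rounds=2)
print(len(archive))  # duplicates fetched in the second round are skipped
```

Running multiple feeds, as suggested above, amounts to running several of these loops with different search terms.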

At last, your tweets are on their way. Grab a quick cup of coffee and your data will be ready in a couple of minutes. You can also follow the progress of the data import beneath the notifications link if you’re feeling impatient. (see below)

Now for the cool part: you should see the name of your archive in the navigation tree on the left side.

Usually, the first thing I’ll do before I start playing with a Twitter archive is have a quick glance at the comments. So I click the name of the new archive in the navigation tree and then click the listing options button at the center-top. Select 100 items per page and click save.

Now you can browse the tweets easily.

Now let’s say you want to start organizing the tweets according to content. In the example above, we can see every mention of the city of Hama, but now you would like to see every mention of, say, the army, the police, and the secret service within Hama. DiscoverText makes it super easy to do this. At the top of your document list, type your first search term and press enter.

If you’d like to keep your new search results, select the checkboxes of the tweets you want to sort and then click Add to bucket: “Selected.” (Buckets are your saved searches.) Create a bucket name and perform the remainder of your searches. (see below)

Here are the results on a search for “Secret Service”….

and the search results for the Army…

Just like that, you can analyze what tweets are saying about the government’s behavior in one particular city. (For example, it took me just a few minutes to figure out that police officers in Hama have (supposedly) been walking around in civilian clothing!)

Now, let’s expand the search! Instead of just pulling in tweets about Hama, let’s also pull in tweets about Aleppo, Homs, Damascus, and Deir el-Zur. All you have to do is right click the name of the project, click import data, and repeat the process above.

Now, I let DiscoverText import several rounds of tweets, and as you can see from the picture below, I’m now looking at 5 different archives and over 19,000 tweets! (To learn how to remove duplicate tweets, click here)
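Removing duplicate tweets, as linked above, typically comes down to keying on normalized text, since the same retweet can land in several archives. A minimal sketch (the dict shape is hypothetical):

```python
def dedupe(tweets):
    """Drop tweets whose normalized text has already been seen."""
    seen, unique = set(), []
    for t in tweets:
        key = " ".join(t["text"].lower().split())  # fold case and whitespace
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique

tweets = [{"text": "Protests in Hama today"},
          {"text": "protests in  hama today"},  # same text, different spacing
          {"text": "Army enters Homs"}]
print(len(dedupe(tweets)))  # 2
```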

To search all of those archives at once, click the name of the project (in large letters) at the top of the navigation tree.

Now, if we search for “Police,” we can monitor police behavior in all five cities at once.

We can see 72 mentions of police…

and 153 mentions of the Army…

and 44 mentions of the secret service.

Clearly, there is a lot of chatter on Twitter about the Syrian army in those 5 cities. Now let’s say you want to organize, categorize, and/or code what is being said about the Army.
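Searching a whole project like this is just filtering the union of its archives. A sketch of the mention counts shown above, with made-up sample tweets:

```python
def count_mentions(archives, keyword):
    """Count tweets containing a keyword across several archives."""
    kw = keyword.lower()
    return sum(kw in t["text"].lower() for arc in archives for t in arc)

hama = [{"text": "Police in civilian clothing in Hama"}]
homs = [{"text": "Army enters Homs"}, {"text": "police checkpoint reported"}]
print(count_mentions([hama, homs], "police"))  # 2
```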

The first thing you’ll want to do is create a new bucket, just like you did before.

Next, right click that bucket in the navigation tree and click “create dataset.”

On the next page, click create dataset (or click here to learn more about the different kinds of datasets you can design). Next, pick the categories and coding scheme you’d like to use. As you can see below, I used three different codes, but you can use as few or as many as you want. When you’re all set, click finished.

The next thing to do is decide whom you want to code the dataset you just created. You can assign it to yourself or any of your peers in DiscoverText. (For more on Peers, click here.) When you’re finished assigning coders, click “set chosen coders.”

To start coding right away, click “Code Dataset.” (see above).

This is what coding might look like for you:

When you’re done coding, click the stop icon.

To get a full coding report, all you have to do is click the “Analytics / Export” button on the left, and click “reports”.

Click “Dataset Summary Report,” customize your report, and a minute later you will be looking at a full report of everything that has been coded, with great visuals that will look something like this:


That’s about all for now. This has been just a glimpse at some of the ways I’ve been playing with DiscoverText. If you like what you’ve seen, sign-up now and “like” us on Facebook and LinkedIn. And, of course, if you have any questions, feel free to e-mail me anytime at josh@discovertext.com. I’m always happy to help!

Enjoy!

Posted in general | 3 Comments