The Apple Shake-Up: Part I

For Apple, last week will be remembered as one of its toughest and finest on record, not just for its tumultuous stock price, but for the staggering range of emotions the company evoked throughout the week.

The week began with news that frustrated the Apple community: after months of speculation, an iPhone 5 would not be hitting the streets as expected. Whether the result of a disinformation campaign or just bad P.R., the iPhone 4S launch was supposedly a disappointment to many. But how can we be so sure? Bloggers and journalists were clearly saddened, but for all we know, their unhappiness stems from the thousands of hits their blogs won’t receive and the millions of copies their papers won’t sell. If journalists and bloggers want to gauge sentiment, they should probably start with social media analysis, like we do.

Using the new GNIP PowerTrack in DiscoverText, we tracked and recorded every use of the word “iPhone” on Twitter from about 3:00am on October 3rd to about 10:30pm on October 6th. All told, we collected about 3.1 million tweets. Prior to mid-morning on the 4th, we were ingesting about 4 iPhone tweets per second, but by late afternoon we were collecting about 130 tweets per second. By the morning of the 6th, social media chatter had settled down a bit and leveled out to about 7 tweets per second. Our question was: what sentiment lay beneath this noise of jokes, links, advertisements, announcements, and irrelevance? Only one way to find out: DiscoverText classification.

We split our data into two subsets: the first contained only tweets mentioning “iPhone” posted prior to the launch, while the second contained tweets from a few hours after the launch. (We waited a few hours because, immediately after the launch, some people still did not know that an iPhone 5 was not coming.) By quickly training a sentiment classifier, we concluded that prior to the iPhone 4S launch, 65% of the iPhone comments were positive, optimistic, or enthusiastic; just a few hours later, only 26% of the iPhone comments were positive and 40% were negative.
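The classifier itself runs inside DiscoverText, but for readers curious about the mechanics, the gist of a supervised sentiment classifier can be sketched in a few lines of Python. This is a minimal, hypothetical multinomial Naive Bayes, not DiscoverText’s actual implementation, and the training tweets and labels below are made up purely for illustration:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSentiment:
    """Minimal multinomial Naive Bayes for labeling short texts."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequency
        self.label_counts = Counter()            # label -> document count
        self.vocab = set()

    def train(self, texts, labels):
        for text, label in zip(texts, labels):
            self.label_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.label_counts:
            # log prior + log likelihoods with add-one smoothing
            score = math.log(self.label_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical hand-coded tweets, not items from the actual archive
clf = NaiveBayesSentiment()
clf.train(
    ["love the new iphone", "so excited for iphone 5",
     "no iphone 5 what a letdown", "disappointed in apple"],
    ["positive", "positive", "negative", "negative"],
)
print(clf.classify("really excited about the iphone"))  # prints "positive"
```

In practice you would hand-code a sample, train on it, and let the classifier label the remaining millions of tweets, which is the workflow described above.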

Bloggers and journalists were indeed correct, and these numbers demonstrate it. Check back soon for a sentiment analysis on the passing of Steve Jobs.


The OBL Script

#ir12 – IGNITE TALK SCRIPT

Slide #1 – Hi I’m Stu and I’m going to talk today about the power of crowds for performing certain tasks. I want to thank Josh Sowalsky for managing this project and drafting the slides.

Slide #2 – When you study politics, you are really studying how people organize each other, like how the early Romans organized their form of democracy or how citizens spam their government.

Slide #3 – You might study why a society (which is really just a big crowd) accepts certain leaders’ positions even when they say and do things that huge crowds don’t like and do not believe.

Slide #4 – Over the last year, we’ve seen crowds come together to accomplish incredible things that nobody, especially political scientists, saw coming. This is happening all over the world.

Slide #5 – We’ve heard hype about “the power of Twitter & Facebook to bring about political change.” We get it. The Internet is changing politics, but what else can crowds do together online?

Slide #6 – A few years ago, a fellow named Jeff Howe coined the term “Crowdsourcing” arguing the Internet will empower groups to gather online to get things done in ways previously unimaginable.

Slide #7 – Wikipedia is a remarkably social source of information, potentially edited by anyone. It catalogues human knowledge and, with the exception of prankster gags, the project is working.

Slide #8 – Wikipedia is a fun project that most people don’t contribute to. What about not so fun tasks? There’s lots of interesting research people want to do, but it requires many eyeballs to sift data.

Slide #9 – It used to be expensive to hire encyclopedia writers. Our Ignite research question is: Can we harness the power of a crowd for doing large-scale text research in academia?

Slide #10 – Social science research can cost thousands or hundreds of thousands of dollars. Thank-you NSF! You train & pay students, cross your fingers, and hope the results are valid.

Slide #11 – We decided to take the OBL tweets to an online crowd for an experiment. Over 4 million tweets that mentioned Osama or bin Laden were collected right after he was assassinated.

Slide #12 – A group out of Harvard did their own experiment using the Crimson Hexagon. With unsupervised classification methods, they found that 27% of the bin Laden tweets contained humor.

Slide #13 – We offered a $25 Amazon.com gift card to volunteer comedic souls willing to classify 750 tweets. Whoever found the funniest tweet was promised an additional $100 gift card.

Slide #14 – About 60 people volunteered and after two days, more than 22,000 tweets had been coded by 26 self-proclaimed funny coders. Some even asserted comedian status. Total project cost: $750.

Slide #15 – Our crowd found only 23% of the bin Laden tweets were attempts at humor. Most were not funny. Only 3% of the tweets were coded as either hilarious or very funny.

Slide #16 – Measuring humor is hardly scientific. Just as a crowd on Twitter may re-tweet and mutate semi-humorous memes, a crowd analyzing Twitter may find the same items offensive or lame.

Slide #17 – The audience today is going to select the winner. I’ll show you six funny tweets. To vote, tweet the number of the best joke using the hashtag OBL colon hyphen close parenthesis.

Slide #18 – If you like “Brad Pitt leads a group of Jewish soldiers,” tweet #OBLemoticon1. If you like “go through airport security for the rest of his life” tweet #OBLemoticon2, if you like “Only a democrat” tweet #OBLemoticon3.

Slide #19 – If you like “Well done gay people” tweet #OBLemoticon4. If you like “Reaganomics works” tweet #OBLemoticon5 and if you like “blood and rage cocktail” tweet #OBLemoticon6.

Slide #20 – There is very little funny about the legacy of Osama bin Laden. Humor has always been a part of how humanity deals with brutality. This experiment was not about bin Laden or humor. It is about the Internet and the future of research methods. Thanks!


The OBL:-) Voting Options

Note: Voting will remain open until the end of #IR12. Thanks for the idea of posting the ballot, Sarah! Tweet #OBL:-) plus the number of the Tweet you think is funniest.



Crowdsourcing: the #OBL:-) Project

Tomorrow morning I will present this 5-minute talk at #IR12. If you want the sneak peek, or if you cannot make it to Seattle for the talk, here is your chance to see the results and, more importantly, be a part of the voting for the funniest bin Laden post-mortem Tweet.


Bin Laden Tweets Take Center Stage at AoIR

On Tuesday, October 11th, Texifter Founder and CEO Stu Shulman will be attending the annual conference of the Association of Internet Researchers in Seattle, Washington. The Twelfth Annual AoIR Conference (you can follow it on Twitter at #IR12) brings together approximately 400 dedicated researchers in one location to discuss and display their work amongst their peers. The AoIR conference is a 12-year-old interdisciplinary venue for inspiring new ideas, presenting cutting-edge studies, and encouraging collaborations between scholars in the area of internet studies. The schedule is packed across three days and brings together researchers from disciplines and areas like information, communication, sociology, business, political science, design, engineering, art, and more.

Stu will be participating in the white-knuckle opening event of the conference, giving an Ignite Talk on Texifter’s latest crowdsourcing project using DiscoverText: a post mortem drawing on a sample from over 4 million collected Osama bin Laden Tweets. Two weeks ago we reported that we had coded over 23,000 of the Tweets. Now it is time for Texifter to display some more of the funniest Tweets live at the conference, discuss some of the methodology behind the project, and explain why we chose to research the topic. However, this is no TED Talk; ideas move faster in the Internet sphere. Time will be short and the heat will be on: Ignite talks are only 5 minutes and 20 slides long, meaning speakers have 15 seconds per slide to articulate their point to the audience. For those not going to the conference, you can still see the talk: simply check back for video!


Power Track for Twitter

It works!

In less than 48 hours, the GNIP-enabled PowerTrack for Twitter has pulled in more than a quarter-million Tweets for the rule “Google”:

257,284 to be precise, but that number changes faster than I can keep up with it. Along with the full firehose of Tweets, we also get some very nice metadata: when available, we are getting Klout scores, location data, and other useful fields for filtering. The beta is only a few days old, but we already see huge potential.


DiscoverText + GNIP + Klout = Analytic Power

On October 1st, Texifter staff will begin testing the GNIP Firehose for Twitter, which delivers 100% of the Tweets you want based on criteria you provide. This is a remarkable tool and will greatly contribute to the evolution of DiscoverText as a major social media text analytics toolkit.

Currently, Twitter restricts its public API to 150 unauthenticated calls per hour, per IP address. Going over this limit results in the user being presented with “Error 420,” which simply means the user is being rate limited; the feeds will resume harvesting after a break. This most recently hampered DiscoverText users back in August, when they began seeing near-constant rate limitations.
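For a sense of how a harvester copes with this, here is a rough Python sketch of an exponential-backoff retry loop around a rate-limited call. The `RateLimitedError` class and `flaky_fetch` function are stand-ins invented for this example, not real Twitter client code or DiscoverText’s actual harvester:

```python
import time

class RateLimitedError(Exception):
    """Stands in for an HTTP 420 (rate limited) response from the API."""

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on RateLimitedError, wait base_delay * 2**attempt
    seconds and retry, giving up after max_retries attempts."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimitedError:
            if attempt == max_retries - 1:
                raise  # out of retries, give up
            sleep(base_delay * (2 ** attempt))

# Demo: a fake fetch that is rate limited twice, then succeeds
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitedError("enhance your calm")
    return ["tweet one", "tweet two"]

waits = []  # record the delays instead of actually sleeping
result = fetch_with_backoff(flaky_fetch, sleep=waits.append)
print(result, waits)  # ['tweet one', 'tweet two'] [1.0, 2.0]
```

The point of the sketch is simply that a rate-limited feed does not fail; it pauses and resumes, which is why heavy users see long gaps rather than lost data.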

While Twitter has always included these regulations in its public API rules, the company may have become more cognizant of those harvesting large amounts of data (not just us) and, as a result, is cracking down on heavy users.

The GNIP PowerTrack Firehose will eliminate all the rate limits of the Twitter public API and allow DiscoverText users to harvest data, with extremely robust metadata, free of those restrictions. The Firehose allows users to run a search where GNIP guarantees ALL matching Tweets will be harvested, an improvement upon the current 1,500-Tweet limit imposed by Twitter. In addition, searches can use operators to restrict results to exactly what you want.

For example, Twitter searches on DiscoverText will allow users to specify a Klout score (or range of scores) as a filter; only the Tweets with that score, or falling within that range, will be archived. To ensure that searches are unhindered by bandwidth and processing constraints, Texifter staff has been working to make sure data streams unhindered into DiscoverText databases.

The metadata harvested via GNIP is extremely robust, and it will allow users to make deeper, more insightful inferences about their data. Currently, the public API only provides basic information, such as user, time, and date, along with the Tweets. The GNIP-enabled Full Firehose opens the door to more metadata, allowing for more complete and insightful analysis. Here are some of the new metadata features:

  • Filter on Klout scores. Data can be analyzed according to a person’s internet presence, or influence, as determined by their Klout score.
  • All hashtags will be harvested, allowing users to find associations between hashtags, specific users, text, and re-tweet patterns.
  • Find the actual number of re-tweets in a collection.
  • Specify the language of posts. Want only English posts? Use the operator “en”.
  • Leverage location data, which may include country, city, and coordinates. Soon, using the Google Maps API, DiscoverText will be able to map Tweets, allowing users to find hotspots.
  • Determine the number of tweets per user.

These are just a few of the powerful options. Advanced metadata filters within DiscoverText will be modified so users can search on all of these within the Advanced Search. All of these features will be available on DiscoverText in private beta beginning October 8, 2011. Starting October 1, Texifter personnel will internally test the Firehose. On the 8th, a small number of users will be granted first access to the Firehose through DiscoverText. Depending on how smoothly the beta proceeds, additional users will be granted access each day. We are currently holding a sign-up for the beta and are still taking applications.
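To illustrate what metadata filtering of this kind amounts to, here is a hypothetical Python sketch that filters a handful of made-up tweet records by language and Klout range. The field names and records are invented for illustration; real GNIP payloads and DiscoverText’s filters differ:

```python
# Hypothetical tweet records carrying the kinds of metadata described above
tweets = [
    {"text": "new phone who dis", "klout": 20, "lang": "en", "retweets": 3},
    {"text": "iPhone review up on the blog", "klout": 62, "lang": "en", "retweets": 41},
    {"text": "bonjour twitter", "klout": 55, "lang": "fr", "retweets": 0},
]

def filter_tweets(tweets, lang=None, min_klout=None, max_klout=None):
    """Keep only tweets matching the given language and Klout range."""
    out = []
    for t in tweets:
        if lang is not None and t["lang"] != lang:
            continue
        if min_klout is not None and t["klout"] < min_klout:
            continue
        if max_klout is not None and t["klout"] > max_klout:
            continue
        out.append(t)
    return out

influential_english = filter_tweets(tweets, lang="en", min_klout=50)
print([t["text"] for t in influential_english])  # only the high-Klout English tweet
```

The same idea, applied server-side to the full firehose, is what lets a researcher archive only influential English-language posts instead of everything.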


22,348 bin Laden Tweets: Coded!

It took all of three days for 27 individuals from around the country (and maybe the world) to collectively code 22,348 Tweets that mentioned Osama bin Laden following his assassination. This crowdsourcing feat, sponsored by the Qualitative Data Analysis Program at UMass Amherst, was the first of its kind. Tasked with assessing the humor of each tweet, our participants included students, out-of-work comedians, former comedians, couch potatoes, professors, and even a few productive elements of society.

Given our objective to find the funniest bin Laden tweet out there, we are now tasked with wading through the mass of human decision-making completed in DiscoverText. And thanks to the phenomenal crowd of coders that made this project possible, we can easily discard unpleasant tweets such as:

“Bin Laden was a coward and a piece of sh*t. He deserves no respect. Feel free to spit on his f*cking grave. #obl”

and

“With all due respect @BarackObama we could care less about Bin Laden.. Why is gas still $4 a gallon?” hahaha omg

…And start laughing at tweets like:

“Yes Bin Laden is killed by American forces but more importantly when will the Playstation Network be up?”

and

“R.I.P Osama Bin Laden – World Hide And Go Seek Champion (2001 – 2011)” champion tweet .. Omg xD

Stay tuned for more information on this project or sign up for the QDAP crowdsourcing listserv at https://list.umass.edu/mailman/listinfo/crowdsource.


Crowdsourcing bin Laden Humor

This Saturday, Texifter will begin its first foray into crowdsourcing, the relatively new phenomenon made possible by complex and fast information networks. Crowdsourcing seeks to tap the collective intellect of a large group to find a result, instead of relying on a specialized few. Back in May, we collected an archive of over 4 million tweets that mention Osama bin Laden. Bin Laden Tweets have been analyzed before; however, those analyses have been devoid of human interaction. With the need for human judgment and such a large dataset, the situation is perfect for a crowdsourcing project.

The objective of the project is simple: participants will be asked to “code,” or rate, at least 750 tweets using a simple scheme designed to find the funniest Tweet in a large sample collected after the death of Osama bin Laden. We estimate the task will take no more than 2.5 hours. Our team will review a sample of each person’s coding to ensure basic compliance with the coding manual. After the task is completed and validated, coders will receive a $25 Amazon.com gift card.

The crowdsourcing does not end with the coding. In October, Texifter CEO Stu Shulman will deliver a brief presentation on the bin Laden Tweets at the Association of Internet Researchers conference. The funniest Tweet will not be determined by Texifter staff, but instead through face-to-face deliberation in an audience of 300+ people who will vote via Twitter. There is a special bonus for the coder who spots the winning Tweet: that coder will receive an extra $100 Amazon.com gift card, in addition to the original $25. The winner and the 4 runners-up will each receive 6 months of free Professional DiscoverText service. If you are interested, do not hesitate to sign up, but the clock is ticking!


Major Functionality Updates

Thanks to our superb development staff, Texifter has launched new updates to DiscoverText on both the front end and back end. The updates unlock a new level of speed, interactivity, and functionality. Overall, they enhance users’ ability to mine data with more precision and more granular detail. Here are some major highlights of the update:

Major strides have been made in improving the coding and classification process by significantly updating the ability to mine data directly from reports. When coding, users can now specify a color for each code, and can edit dataset codes (text/color) when a classifier is applied to a dataset.

The colors are displayed on the classification chart, which can now be drilled into directly, allowing users to bypass advanced filters for a quick view of their classified data. Simply clicking a section of a chart lets users view all items classified with that category. The same process works when viewing the metadata breakdown. Advanced filters have not been left out: as of today, users can search for “MAX” when using classification filters.

In addition to new mining tools, visual changes have been made. Charts can now appear in pie, pyramid, or funnel form, display the number of units, and can appear in large form in a separate window. Visual customization is available via the “report options” tool, which gives users several different options when displaying classification reports.

On the back end, SIFTER™ Version 3.0 is now running on the system, which includes the Sifter Controller Service, Sifter Stack Service, Sifter Administration Application, and Installer Packaging. This adds an entirely new slate of functionality. Those using DiscoverText to mine Twitter feeds will be delighted to know that you can now reply to Tweets directly through DiscoverText, a major function for those looking to monitor social media. Finally, the update adds more robust .docx handling for ingestion, and more robust metadata can now be collected when harvesting Twitter feeds.

This major update is the first of many exciting additions to the DiscoverText system. On October 1st, DiscoverText will launch a beta test of the GNIP PowerTrack, allowing users to avoid Twitter API rate limits and harvest 50-100 times more Tweets. Also in Q4, Texifter will launch the LDA topic classification platform, a revolutionary way to accurately classify data without human interaction. Visit the Texifter Blog often for more updates. If there are any questions or inquiries, please contact any Texifter team member.


100 Million Documents

A few days ago, DiscoverText ingested document 100,000,000. It was probably a Tweet. Researchers woke up today to a newly optimized database of more than 300,000,000 metadata values. Over the coming days and weeks, we will be rolling out a series of new feature announcements, including powerful new drill-down tools, improved reporting capabilities, and the beta for our LDA unsupervised and semi-supervised topic model tools. We have been getting great submissions for the October beta test of the GNIP “Full Firehose for Twitter” as well. There is still time to apply for this excellent opportunity.

Over the last two weeks, Texifter has held highly productive meetings with JetBlue, GeoEye, Forrester, ESPN, VisionCritical & Google. The enthusiasm at these meetings has left our product development team fired up for some championship start-up play.


GOP Twitter Benchmarks

PRE-DEBATE (September 1st – September 7th)

It’s September again: summer vacations have ended, schools are back in session, and the presidential primary election campaign is in full swing. While Zogby International and Gallup polling may tell you that Perry and Romney are the favored Republicans, you shouldn’t be so sure. Results generated from Twitter data via DiscoverText are in for the first week of September: Rick Perry was the talk of the town, and Romney came nowhere close, at least on Twitter.

Of the eight Republican candidates (at Wednesday’s debate), Rick Perry attracted more attention on Twitter in the last week than any other candidate, by far. Sure, any press is good press, but in this case it’s not exactly press. These are real people with real sentiments: positive, negative, and neutral. Either way, people are talking, and Rick Perry and Ron Paul are who they’re talking about.

Despite low polling numbers and sparse media attention, Dr. Ron Paul is neck and neck with Perry, and three times (on 9/3, 9/5, and 9/6) even surpassed Perry’s Twitter mention count. Battling for third place in the mention count were former Massachusetts governor Mitt Romney and Minnesota Congresswoman Michele Bachmann. Businessman Herman Cain held fifth. Rounding out the field at the bottom of the pack were Jon Huntsman, Newt Gingrich, and Rick Santorum.

This Twitter poll conducted by Texifter analysts accounts for all mentions of candidates, regardless of sentiment. It is no surprise that prior to the GOP debate, Rick Perry held the lead for mentions on Twitter. Since entering the race two weeks ago, Perry has been the center of Republican attention. The Texan began his campaign by insulting Federal Reserve Chairman Ben Bernanke, and he has not relented in his unique political candor, which has carried him to the top of many national polls. Perry has drawn comparisons to former President George W. Bush because of his Texas roots, lackluster academic career, and “good ol’ boy” looks and charm.

But it is Ron Paul’s second-place status that leaves room for debate. Ron Paul is a candidate who toes neither the Democratic nor the Republican party line. He advocates both an isolationist foreign policy and an arch-conservative fiscal policy, and he receives a fraction of the media attention his fellow candidates receive. His polling numbers don’t account for all of this Twitter attention, which may ultimately make a political impact.

POST-DEBATE (September 8th)

The day after the debate, Twitter mentions of Republican presidential candidates soared, as the energy of the debate brought about vibrant online discussions. While many pundits and newspapers reported on the battle between Mitt Romney and Rick Perry, it was Rick Perry and Ron Paul who traded places for top mentions. On Thursday, September 8, Ron Paul re-captured the lead, taking 23% of the Twitter mentions. Rick Perry, having captured slightly over 40% of the mentions on the 7th, captured roughly 18% of the mentions.

In addition, the spread of Twitter mentions substantially decreased following the debate. This is apparent from the range between the most-mentioned and least-mentioned candidates: on September 7, that range was 38%, but on September 8 it fell to only 15%.

In the contest for third place, more changed following the debate. Mitt Romney, who held 3rd place on September 7, lost his position to Michele Bachmann, who ended the day with 13% of the day’s mentions. Mitt Romney slipped to 5th place with roughly 10% of mentions, virtually deadlocked with Jon Huntsman, who also held 10%. While this may be alarming for the Romney campaign, he has seen drastic fluctuations in the past week, ranging between 5% and 20% of mentions; this 15% spread is second only to Rick Perry, who has held between 20% and 40% of the candidate mentions throughout the week. As for Jon Huntsman, he experienced a higher percentage of mentions after the debate than he had all week: he consistently received less than 10% of all mentions and, on multiple days, generated only roughly 3% of candidate mentions. Jon Huntsman’s gain was Herman Cain’s loss, as Cain slipped into the band of bottom feeders along with Newt Gingrich and Rick Santorum.

Texifter analysts have been collecting Twitter feeds mentioning the Republican candidates vying for the party’s nomination. Feed collection began in late August and will continue throughout the fall, which will undoubtedly give Texifter analysts new insights into the campaign. For the next 3 months, Texifter will continue to update the blog with posts regarding these feeds. Within the next two weeks, analysts will monitor sentiment and the topics discussed around each candidate. Check back with the Texifter Blog soon to see the results of this next experiment. If there are any questions or comments, or something you might like to see, do not hesitate to email the Texifter staff.


PowerTrack for Twitter

DiscoverText is preparing to launch a short, exclusive beta test of “PowerTrack for Twitter Firehose Filtering,” a service provided by GNIP. Compared to the rate-limited service offered by DiscoverText through the public Twitter API, the Full Firehose delivers 50-100 times the volume, with powerful Klout, language, and keyword filters.

If you would like to participate in this trial, please leave us your contact information and tell us a little bit about your work. We will not be able to offer this trial service to everyone, so please make the case for the value you or your organization will add as beta testers.


Topic Modeling Using LDA

Peter Gustav Lejeune Dirichlet

Prior to the groundbreaking research of Blei, Ng, and Jordan, delivered in a 2003 paper, the world of latent Dirichlet allocation (LDA) was underdeveloped and far from commercial use. LDA, a powerful statistical learning algorithm, is a generative model that allows sets of observations to be explained by unobserved groups which explain why some parts of the data are similar. Recently, the DiscoverText developers engineered a topic modeling and clustering system using LDA techniques. Developing and adapting this exciting technology for expanded use is an integral part of the future of the DiscoverText text analysis toolkit.

An example of an LDA model is this: a user might specify the creation of two topics that can be interpreted as HOT and COLD. The labels are arbitrary, because the topic that encompasses a set of words cannot be named by the algorithm itself. A topic has probabilities of generating various words; words such as sun, summer, and Florida can be interpreted by the viewer as belonging to “HOT.” Naturally, hot itself will have high probability given this topic. The “COLD” topic likewise has probabilities of generating each word: snow and blizzard might have high probability. Unlike the collection, coding, and classification of data typically undertaken by the staff at DiscoverText, developing a topic model using the LDA algorithm within DiscoverText requires no human interaction except to specify the number of topics the algorithm should develop. Every document in an archive is assigned a score for how well it fits each topic category. From there, DiscoverText and the SIFTER™ Natural Language Processing modules group the documents into a set of clusters based on how similar each one is to the other documents in the same cluster.

The topic modeling and clustering algorithms being engineered by Texifter personnel were inspired by a client that wanted to build a comprehensive topic model but did not know where to start. After successfully building the customer a topic model by hand, we began experimenting with DiscoverText’s (currently in alpha) LDA modeling and clustering. From this automated processing, an 8-topic model was engineered, each topic giving significant insight into the customer’s business. For example, one of the clusters returned the keywords: people, crew, enjoy, fun, culture, environment, coworkers, time, meet, and team. All of these words, brought together without the intervention of human coding, are noticeably similar and fit perfectly into what we called Topic 2, which could be named “Culture, Environment and Coworkers.”

Following the naming of topics, DiscoverText allows a classifier to be built around the topics and an entire dataset to be classified according to the topic model developed by the LDA-based clustering. DiscoverText’s Automated Topic Clustering tries to find the best fit for even coverage across all topics found in the corpus. So far, the topic model classifier has yielded promising results. In the future, this will allow the user to re-assign documents to topics and update the underlying model. The generative model can add new documents and infer their topics based on the existing model, or update the underlying model with the new data. Look for the LDA modeling and clustering processes to be in beta by the beginning of fall. If there are any questions or comments regarding DiscoverText or its LDA Topic Model Platform, please email any of the knowledgeable DiscoverText staff.
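DiscoverText’s implementation is proprietary, but the flavor of LDA inference can be conveyed with a tiny collapsed Gibbs sampler run over the HOT/COLD toy example described above. This is an illustrative sketch under simplifying assumptions (toy corpus, fixed hyperparameters), not the SIFTER™ code:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=42):
    """Tiny collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = {w for doc in docs for w in doc}
    V = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]                # topic counts per doc
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics
    z = []  # current topic assignment for every token
    for d, doc in enumerate(docs):  # random initialization
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        z.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove this token from its current topic
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # resample proportional to P(topic | doc) * P(word | topic)
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * V)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# The HOT/COLD toy example from the text
docs = [
    ["sun", "summer", "florida", "hot", "sun"],
    ["snow", "blizzard", "cold", "snow", "ice"],
    ["hot", "summer", "sun", "florida"],
    ["cold", "blizzard", "ice", "snow"],
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
for t, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print(f"topic {t}:", top)
```

With enough iterations the weather words separate into the two topics, and `doc_topic` gives each document a score for how well it fits each topic, which is the clustering step described above.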


Shaky News Coverage of Libya

This weekend, opposition forces in Libya fought a ground war to oust long-standing dictator Col. Gaddafi. Simultaneously, a global news war raged, as news sources sought to be the shining star in broadcasting the latest from Tripoli. Some of the best news came directly from Twitter, which, to no surprise, exploded with information as the opposition forces advanced. With this surge in information on Libya, I began downloading Twitter feeds in DiscoverText on Sunday evening, using the keywords “Libya” and “Tripoli.”

By 6AM the following morning, about 15 hours later, I had collected about 25,000 posts across both keywords. After lurking on Facebook and Twitter, and using the Tag Cloud to see trends in the data, I noticed many people seemed upset with the quality of the news coverage, especially the coverage from U.S. sources. Josh, my colleague, found that CNN is in need of a geography lesson. Using the collected posts, my goal was to find which global news source had the highest public satisfaction. To calculate this, I searched my “Libya” archive for major world news outlets, including “CNN,” “Fox News,” “MSNBC,” “Al-Jazeera,” “Sky,” and “BBC.” I created individual buckets and datasets for each news source and coded them for sentiment using the scheme “Positive Mention,” “Negative Mention,” “Re-Tweet/Tag (includes links),” and “Other (includes general references, etc.).” To perform this experiment, I would code 25% of each dataset and then leave the rest up to the powerful Bayesian classifier built into DiscoverText.

After coding 137 of the 550 “Al-Jazeera” tweets, I found the Doha, Qatar-based news service to be heavily praised. With comments such as “If Americans want to watch actual live coverage of Libya, they should download Al Jazeera English to their IPhones” and “The revolution is being televised on Al-Jazeera English. Even better than Sky. BBC nowhere,” Al-Jazeera clearly seemed to do an excellent job covering Libya. After classifying, the report held true to my experience in coding, showing that most tweets either re-tweeted Al-Jazeera or were positive about the news service.

Only slightly over 1% of the tweets were negative about the Al-Jazeera coverage, making it the news service to beat. Another positive for Al-Jazeera was the number of people who re-tweeted or simply mentioned the news service. This outpouring of support and recognition by a mostly American audience should come as welcome news: the service is aggressively expanding its English offerings, and it will soon be on television sets throughout the United States under the name Al-Jazeera English.

The usually highly regarded British news service, the BBC, seemed to be heavily ridiculed not for its coverage, but for its lack of urgency in covering the story. When coding, I noticed this in statements such as “BBC America still running with Top Gear, Al Jazeera live from Benghazi,” and “More #BBC Journalists at Glastonbury than in #Libya? I want my money back.”

In the classification report, my impressions were essentially mirrored, as over 21% of the tweets were negative. One positive for the BBC was that nearly 50% of the tweets were re-tweets, meaning a fair number of people use the BBC as their news source and are happy with the coverage. The moral for the BBC, thanks to DiscoverText: when the next Middle East uprising comes to fruition (Syria?), it would be wise to turn off Top Gear and switch immediately to covering once-in-a-lifetime events.

I also collected tweets mentioning the other British news outlet, Sky Broadcasting. The outlet is not as well known as the BBC, but according to the posts, it made its mark reporting on Libya. The chatter about Sky was overwhelmingly positive, and Sky was often compared favorably with other world news outlets, with posts like, “Sky News and al Jazeera are making US cable coverage of #Libya look like Wayne’s World.”

The classifier report revealed that an astonishing 65% of mentions were positive, a number rivaled only by Al-Jazeera. One tweet that best described the relationship between the two noted the crisp, riveting images coming from Sky and the superb analysis from Al-Jazeera, suggesting the ultimate combination of images and analysis.

The CNN data contained the most mixed sentiment of any outlet I coded. From the posts, CNN’s coverage appeared to be thrown together at the last minute and factually inaccurate. Many people re-tweeted the post, “#Libya is not #Iraq..not begun by NATO, this; begun by the will of masses of Libyans. Somebody tell StevenCook on CNN to stop comparing to Bagdad.” Another example of discontent with the CNN coverage was statements such as, “Sky News web coverage of Libya approximately 1,000% better than CNN’s.”

According to the classification report, over 30% of tweets regarding CNN were negative. This high number could have a few explanations: CNN is commonly regarded as the best source for news in the U.S., and is watched by viewers who are more critical of, and knowledgeable about, the subject. In the future, CNN might be wise to employ a Middle East expert to discuss the issues, instead of relying on the “janitor and the cleaning lady,” as one tweet described the Sunday late-night coverage of the events unfolding in Tripoli.

The other two U.S. news outlets, Fox News and MSNBC, provided considerably less material to analyze, with only 150 tweets between the two. When coding MSNBC, it seemed the only reason anyone mentioned the network was to rake it over the coals with dubious honors such as, “In fairness, though, I would respect the analysis of the bear on MSNBC more than John Bolton’s on Fox News.” The network mainstay apparently decided to show re-runs of Lock-Up and other reality shows instead of covering Libya.

The classification report revealed that over 60% of the tweets about MSNBC were negative, much in line with what I witnessed. Much the same occurred with Fox News, with the majority of the posts complaining about the network’s often-maligned conservative bias. This was displayed in tweets like, “So how do you think this plays out on Fox News and WSJ? Obama let al-Qaeda take over Libya?”

The anti-Fox stance became even more evident when the classifier report showed that over 65% of the tweets about Fox’s Libya coverage were negative. Using the DiscoverText tools revealed the winners and losers in the Libya coverage: Al-Jazeera and, somewhat surprisingly, Sky Broadcasting drew high marks from the tweeting public.
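The per-outlet percentages quoted throughout can be tallied in a few lines once every tweet in a bucket carries a category label. A small sketch, using made-up label counts rather than the real classified data:

```python
# Tally each category's share of an outlet's classified tweets.
# The label list below is a hypothetical example, not the real archive.
from collections import Counter

def sentiment_breakdown(labels):
    """Return each category's share of the total, as a percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: 100 * n / total for cat, n in counts.items()}

# e.g. 13 positive, 4 negative, 3 re-tweets out of 20 classified tweets
labels = (["Positive Mention"] * 13
          + ["Negative Mention"] * 4
          + ["Re-Tweet/Tag"] * 3)
breakdown = sentiment_breakdown(labels)
print(breakdown["Positive Mention"])  # 65.0
```

Running this per outlet bucket and sorting by the positive share gives the winners-and-losers ranking described above.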
