Texifter, LLC. Blog | Search, Filter, Cluster, Code & Classify Text | Page 7

200 Million Document Milestone

Posted on November 27, 2011 by Josh Sowalsky

Only a few months ago, DiscoverText’s data ingest counter was steadily moving toward 100 million documents. This week, the counter will cross the 200 million document line. You can see this for yourself on our homepage. With the weekly addition of powerful new analytical tools, DiscoverText is reinventing the field of online text analytics for researchers, marketers, lawyers, and Fortune 500 companies in the U.S. and increasingly around the world. As more researchers, managers, directors and VPs at companies like JetBlue, ESPN, Google, Merck, Nike, Nokia & Facebook begin to experience the power of DiscoverText, our users’ data use and computational demands are growing rapidly.

We are steadily increasing our data throughput capacity to preserve speed and agility inside our system. Recently, we acquired a new disk array system to provide users with top of the line hardware for performance and reliability.

These hardware upgrades, alongside continual database optimization efforts and the steady addition of new memory, support the SIFTER processing core of DiscoverText. SIFTER web services allow our users to better leverage the system features. As a result, certain computationally intensive features have become real time offerings. If you have not yet done so, be sure to sign up for a free trial account at DiscoverText.com and email us if you have any questions. We are here to help you get started.

Posted in general | Tagged big data, DiscoverText, Document Search, Document Storage, E-Discovery, Machine Learning, Milestone, Texifter, Twitter API, Twitter Mining | Comments Off

Holiday Shopping Update

Posted on November 24, 2011 by Joseph Delfino

Week 3
Due to a busy week in California, and preparations for Amsterdam, I am going to skip right to Week 3 of the Holiday Shopping Updates, meaning analysis for the dates November 13-20. In our first analysis, we captured nearly 35,000 Tweets using the Twitter API which mentioned “Christmas Shopping.” This week, that number nearly doubled, with the ingestion of 60,000 Tweets. In just two weeks, it is clear that with Thanksgiving 2011 nearly a memory, people are cognizant of the fact that Christmas is just 5 weeks away. Comparing the rule “mall” also reveals a steady tick upward. For the dates November 13-20, 235,000 Tweets mentioned the word “mall”, a 35,000 Tweet increase from 2 weeks ago.

Using our Holiday Shopping Classifier on Tweets from this past week reveals a slight uptick in people who have started Christmas shopping, but not enough to make a serious dent. The largest gain was in the “other” category. After some digging, I found the “other” category to be overloaded with advertisements for Black Friday sales. This prompted me to start a “Black Friday” archive, which I will analyze after the event. Interestingly, the majority of the advertisements were not from large outlets, but from small, local businesses attempting to use Twitter to drive traffic to their stores on a day usually reserved for the major retail players. Taken together, the information leads me to believe that the majority of people are waiting until Black Friday to commence their shopping, a theory which can of course be tested as my “Black Friday” archive gets bigger.

Moving to the more granular picture of individual retailers reveals some interesting trends. For the past 3 weeks, I have been collecting “Nordstrom” Tweets, which I classified using the codes “Shopping at Nordstrom,” “Official Advertisement,” “Unofficial Advertisement,” and “Other.” The results turn out to be extremely positive for Nordstrom, as the Tweets reveal some excellent free advertising: 50% of Tweets discuss a shopping trip to Nordstrom, and another 26% fall under Unofficial Advertisement. At this time of year, these are exactly the type of Tweets retailers like to see. Personally, I learned from the Tweets that there is a sale on shoes and that the service in the shoe department is extremely helpful, the perfect type of message to proliferate throughout the Twittersphere.

Check back next week for a special Black Friday edition of the Holiday Shopping Update. If you have any questions, or want to see a specific metric investigated, you can email me at joe@discovertext.com.

Posted in DiscoverText, product, Twitter | Tagged Holiday Shopping, Market Research, Tweets, Twitter Analysis | Comments Off

Predictive Analytics, Amsterdam

Posted on November 18, 2011 by Stuart Shulman

Texifter personnel travel next to Amsterdam to present and exhibit at the “Predictive Analytics Innovation Summit” on November 22 & 23. On Day 1, I will give a talk titled “Humans and Machines Working Together,” that details our work on custom machine classifiers. Use cases will be drawn from HR analytics pilot projects involving JetBlue and ESPN internal survey data. On Day 2, I will be serving as Chairperson and moderator. The full program is available online and we look forward to sharing what we learn on this blog, and through our tool development, in the weeks and months to come.

Posted in general | 3 Comments

Visiting Google & Facebook

Posted on November 17, 2011 by Stuart Shulman

Thanks to some excellent ground work by Joe Delfino and Sean Kelleher, Joe, Sean & I were able to make a pilgrimage to Google, Facebook & Reputation.com for a wildly exciting day of briefings with Q&A. While I’d love to share the details, I can’t! Big secret ;-) However, I can share a few pictures and stories from our day in Silicon Valley…

Stu at Google – Take away message: “This was a great meeting!”

Sean at Google – “I could move to California.”

Joe at Google, after spending the week in the Bay Area attending the 2011 Sentiment Symposium and the Text Analytics News Conference: “I am already (in my mind) living in California and running the west coast operation.”

Stu and his well-used Camaro. While running a bit behind schedule on the way to Reputation.com, it is alleged the driver took advantage of the fast-moving California 101 freeway, the state’s liberal u-turn policy, certain optional passing strategies based on scenes from action and/or science fiction films, and his passengers’ stomachs.

Joe at Facebook – Joe Delfino got us this meeting. Joe gets meetings. Joe is a meeting-getting animal. We like Joe.

When my son saw this picture of his Dad at Facebook on Facebook, he said: “Wow Dad; you look really happy!” I sure was happy. We had come from Google feeling deeply engaged by one of the greatest companies in the history of capitalism and we were sitting in the lobby of another. We had lunch with a gracious host at the company cafeteria and a demo with a diverse group of Facebook sentiment analysts. After years of academic presentations, the freedom to present in jeans and a QDAP t-shirt was a perk that I could probably get used to. The meme ‘west coast office’ was heard frequently as we blazed out of Palo Alto and headed for Redwood City.

After the long day in Silicon Valley, the team got stuck in 101 rush-hour traffic, slightly grouchy and despondent, but made it to a wonderful restaurant, Burma Superstar, in the Pacific Heights neighborhood for beer, food, and good company near a place where a Hobbit had been spied. By the time we had returned the Camaro and made it to the train to the SFO terminal for our red eye, we all realized the magnitude of the day we had. It was a huge lift for our confidence and an exciting glimpse into where Texifter is going. It is nearly certain that Texifter will be back on the West Coast soon.

Posted in DiscoverText, Facebook, general | Tagged DiscoverText, Entrpreurship, Facebook, Google, Silicon Valley, Startup, Texifter | 2 Comments

Holiday Shopping Update

Posted on November 8, 2011 by Joseph Delfino

From now until the New Year signals the close of the holiday season, Texifter analysts will focus DiscoverText’s tools on holiday shoppers. We will take a close look at their behavior and sentiment as they hit the stores during the “most wonderful time of the year.” With November now a week old – and Christmas just 7 weeks away – it’s a great time to see if shoppers have begun to loosen their wallets.

Texifter analysts used the Twitter API in the first week of November to capture nearly 35,000 Tweets mentioning “Christmas Shopping,” ingested nearly 200,000 Tweets into DiscoverText using the rule “mall,” and ingested nearly 100,000 Tweets using the names of popular Christmas retailers, such as Saks, Nordstrom, and Macy’s. As examples of holiday season shopping indicators, these are only the beginning of archives we will use to continually harvest data throughout the holiday season.

Taking a random sample of 5,000 Tweets, we created a dataset as the base for a custom classifier built around holiday shopping comments. Over the full shopping season we will continuously refine the classifier using our unique learning engine. It will help us monitor the pace at which people are entering and exiting the Christmas Shopping season.

Using the categories of “Not Started,” “In Progress,” and “Finished,” the numbers indicate most people might still be recovering from a Halloween hangover. Only 34% have started their Christmas shopping and a mere 7% have finished. With 7 weeks to go and Black Friday sales still to come, we will see how fast this minority grows. It’s possible that consumers in the spending powerhouses of the New York and Boston regions were homebound due to power outages from a freak October snowstorm, putting a damper on the “In Progress” number. However, on November 1st, a traditional phone survey by national consumer group Valpak found 29% of people had already started Christmas shopping. Given timing differences between this survey and our less costly analysis of tweets, the results are close. Over the past week, other surveys have found consumers starting a bit late. The majority of Tweeting shoppers have not started their shopping. Although they very often Tweeted about the need to start, they lacked any urgency to do so. This might be a signal of a December rush.

To get a better handle on consumer sentiment, we studied the same sample of 5,000 tweets. Given what we found, it’s hard to believe only 34% have begun shopping. As part of our sentiment classifying process, we coded any Tweet as Positive if it mentioned a successful shopping experience. Neutral Tweets simply mentioned the need to shop or that someone had made a shopping trip but didn’t provide any positive or negative description. Negative Tweets mentioned a barrier preventing a person from shopping, such as money, or weather.
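The coding scheme above is the kind of human-labeled input a Bayesian text classifier learns from. The sketch below is a minimal illustration only, not DiscoverText's implementation: it is a plain multinomial Naive Bayes with add-one smoothing, it leaves out any active-learning loop, and the six training Tweets are invented examples that follow the Positive, Neutral, and Negative rules described above. The class name `NaiveBayes` and the helper `tokenize` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real tweets would need more cleanup."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # coded documents per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()

    def train(self, text, label):
        self.label_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus the sum of smoothed log likelihoods per word.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training examples, hand-coded with the rules from the post.
coded_tweets = [
    ("found great gifts at the mall today", "Positive"),
    ("christmas shopping done and it went great", "Positive"),
    ("need to start my christmas shopping", "Neutral"),
    ("went christmas shopping this weekend", "Neutral"),
    ("no money for christmas shopping this year", "Negative"),
    ("snow storm kept me from the mall", "Negative"),
]

classifier = NaiveBayes()
for text, label in coded_tweets:
    classifier.train(text, label)

print(classifier.classify("no money for gifts"))
```

With so few examples the model is only a toy. The post describes training on a coded sample of 5,000 Tweets and refining the classifier over the season.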

The classifier revealed that sentiment is overwhelmingly positive. Many people were excited by the upcoming holiday season and the prospect of heading out to begin their shopping. This positive outlook was accompanied by the oft-repeated mention of “savings” and the “need to save” for Christmas. While this was not coded for, it might be an indication that consumers will temper their enthusiasm with some degree of concern for cost. While people are excited about the season, they may be holding tight to their wallets as they shop for gifts this year.

Next week, we will dive into the details and start digging into whether consumers’ Tweets about individual retailers may affect the holiday season.

Posted in general | Tagged Active Learning, analytics, API, bayesian classifier, Custom Classifiers, Data Mining Text Mining, DiscoverText, Holiday Shopping, Human Coding, Machine Learning, Public Opinion, Sentiment Analysis, Texifter, Twitter API, Twitter Mining | 2 Comments

DiscoverText Introduces Tools for Random Sampling

Posted on November 1, 2011 by Josh Sowalsky

DiscoverText is rolling out an addition to its analytical toolkit: random sampling. The Web service already offers an array of tools for text analytics and rigorous, team-based qualitative data analysis. These functions include the ability to code and annotate text, measure inter-rater reliability, adjudicate coder validity, attach memos to text, cluster duplicate and near-duplicate documents, share documents, and classify text using an active-learning Naive Bayesian classifier. While still in beta, random sampling is a key new addition.

After DiscoverText users amass extraordinary amounts of social media data (for example via the Public Twitter API, the GNIP Powertrack, or the Facebook Social Graph), they can now more easily extract a random sample for analysis. The user decides the size of the sample, which accommodates iteration, experimentation and other scientific methods. The option is built into the dataset creation process: on the new dataset creation page, you see a sample size prompt.
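To make the idea concrete, here is a minimal sketch of drawing a user-sized random sample from a larger archive. It is an illustration in Python's standard library, not DiscoverText's code. The function name `draw_random_sample`, the placeholder archive, and the seed are assumptions made for the example.

```python
import random


def draw_random_sample(documents, sample_size, seed=None):
    """Return `sample_size` documents chosen uniformly at random, without replacement."""
    if sample_size > len(documents):
        raise ValueError("sample size exceeds the number of documents")
    rng = random.Random(seed)  # a private generator makes the draw reproducible
    return rng.sample(documents, sample_size)


# Hypothetical archive: placeholder IDs standing in for ingested tweets.
archive = [f"tweet-{i}" for i in range(200_000)]

sample = draw_random_sample(archive, 5_000, seed=42)
print(len(sample), "documents sampled;", len(set(sample)), "are distinct")
```

Sampling without replacement means no document appears twice in the dataset. Fixing a seed lets a team re-create the same sample when comparing analytic methods.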

This additional method for data prep and analysis augments current information retrieval techniques, such as search with advanced filtering. It also builds up our framework for expanding available NLP methods from straightforward Bayesian classification, which aims to analyze substantial quantities of data in their original bulk-form, to a menu of computationally intensive methods that can iterate more quickly and effectively against random data samples. For example, the LDA topic model tool we are releasing will be faster and more effective against smaller random samples.

This new feature accommodates an additional analytical approach and offers the opportunity to easily compare results between competing (or complementary) analytic methods. We look forward to experimenting with this new tool and hearing about how random sampling will enhance the research of our current and future users.

Special Note to DT Users: We need to turn this feature on one account at a time while we are testing it. Drop us a line if you want to try the tool.

We’ll keep you posted on the launch as more dataset modifications are pushed live. As always, if you have any questions, feel free to email us anytime at help@discovertext.com. Your feedback is crucial. Sign up and try it out for yourself at discovertext.com.

Posted in DiscoverText, Facebook, general, GNIP, product, research, Twitter | Tagged analytics, API Graph, Bayesian, Bayesian classification, classification, Code Text, coding, Data Mining, DiscoverText, Firehose, Gary King, GNIP, Machine Classifiers, Machine Learning, methodology, Natural Language Processing, NLP, PowerTrack, Qualitative Data Analysis, random, Random Sample, Random Sampling, Research, research methodology, sample, Sampling, Social Graph, Social Media, social media monitoring, Texifter, Text Analysis, Text Analytics, twitter, Twitter API | 2 Comments

Archiving Public Facebook Content: Technical & Legal Issues

Posted on October 27, 2011 by Stuart Shulman

This is an updated version of a very popular early video. It describes some of the technical and legal issues when you “Connect” with Facebook credentials to DiscoverText. The video offers instructions about privacy and demonstrates how you can use a free 14-day trial to start archiving public Facebook content.

Posted in DiscoverText, Facebook, product | Tagged API, archiving, DiscoverText, Facebook, Social Media, social media monitoring | 3 Comments

Why People Want GNIP’s Power Track Twitter Firehose

Posted on October 25, 2011 by Stuart Shulman

We have been delighted with the response to our call for beta testers to try the GNIP-enabled PowerTrack for Twitter. You can still sign up. Round 1 of the beta test concludes on October 31, 2011. Even just testing the system’s data filtering and collecting capabilities for 1 or 2 days, or as few as 1-2 hours, may convert you to a devoted GNIP via DiscoverText user. As part of taking beta tester applications, we asked folks to tell us something about how they planned to use the beta test opportunity. Thanks to “Wordle” we can visualize an answer to the question: “Why do people want to take part in the GNIP beta test via DiscoverText?”

Posted in DiscoverText, GNIP, product, Twitter | Tagged beta test, data, Data Mining, DiscoverText, filter, Firehose, GNIP, Power Track, Research, search, Social Media, twitter | Comments Off

Apple Shake-Up, Part II: Tweets in Memoriam

Posted on October 25, 2011 by Josh Sowalsky

On October 5th, the passing of Steve Jobs rocked the world. Millions were touched by the loss of one of the world’s great innovators, who has firmly joined the ranks of Gutenberg, Franklin, and Edison. Over the 24 hours following his passing, an outpouring of grief was expressed online, as millions updated their Facebook profiles and tweeted their grief, inspiration, and cathartic woes. But what were those millions saying? How many were sad, inspired, remorseful, emotionless, or expressing some other sentiment? We were determined to learn more about them.

As soon as news of his passing was announced, we immediately began importing a GNIP feed in DiscoverText. For the first hour of that feed, tweets were ingesting at a rate of over 230 tweets per second, until that influx of Steve Jobs tweets actually crashed the Twitter server! (Thus, the apparent drop-off in the graph.) Luckily, Twitter was soon back online and after 24 hours, over 4.4 million tweets had mentioned Steve Jobs… and we had collected all of them.
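For a sense of scale, the two figures quoted above can be turned into comparable rates. This is back-of-the-envelope arithmetic only. It treats “over 230 tweets per second” as a floor and assumes, purely for illustration, that the rate held for a full hour, which the post does not state.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

peak_rate = 230            # tweets per second, quoted as "over 230"
total_tweets = 4_400_000   # tweets mentioning Steve Jobs in 24 hours

# Assumption: the peak rate is sustained for the entire first hour.
first_hour_floor = peak_rate * SECONDS_PER_HOUR

# Average rate implied by the 24-hour total.
average_rate = total_tweets / SECONDS_PER_DAY

print(f"First hour at peak rate: at least {first_hour_floor:,} tweets")
print(f"24-hour average: about {average_rate:.0f} tweets per second")
```

Under that assumption, the opening hour ran at several times the day's average rate.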

This is an astounding amount of data, and far more than any person (or team of people) can sort, so we used a topic-classifier in DiscoverText to organize this trove of data. Classifying a sample of 100,000 tweets, this is what we found: Far more individuals expressed admiration for Steve Jobs and his legacy rather than heartfelt remorse. And while there were a few jokesters out there, far more tweets were inspiring.

In a sample of about 25K tweets, about 40% expressed admiration, 25% expressed sadness, 24% stated the fact that he had passed away, 5% were miscellaneous, and the final 5% were humorous, while not, in fact, particularly funny.

Steve Jobs left his mark on the world, not only technologically but emotionally as well; and these numbers demonstrate the extent to which that is true.

Related and coming soon: Twitter Eulogies: Social Media’s Response to the deaths of dictators and innovators.

Posted in general | Tagged analytics, Apple, Code Text, Data Mining, datamining, death, DiscoverText, GNIP, Jobs, Machine Classifiers, Machine Learning, Social Media, Steve, Steve Jobs, Texifter, Text Analysis, Text Analytics, Tweets, twitter, Twitter API | 1 Comment

Perry Loses Ground Post Debate

Posted on October 24, 2011 by Joseph Delfino

Prior to the CNN GOP Primary Debate last Tuesday, October 18, Texifter analysts had been collecting Tweets relating to every major Republican contender since late August. Earlier in the fall, we reported on the number of Tweets per candidate, in which dark horse Ron Paul and Texas Governor Rick Perry fought for the top spot. Now, with the debate fresh in people’s minds and the Iowa caucuses less than 3 months away, we have decided to take a look at candidate sentiment before and after the debate for the top 3 candidates in the Republican race, which at the time were Mitt Romney, Herman Cain, and Rick Perry.

To gauge sentiment before and after the debate, we only considered Tweets created the day before and the day after the debate, meaning no Tweets from October 18 were considered in the analysis. Both the day before and the day after the debate, Herman Cain was the most mentioned candidate. Over the two days, Cain registered nearly 30,000 Tweets, while his competitors, Romney and Perry, did not register that amount combined. The number of Cain Tweets was probably inflated by his ultra-popular but heavily ridiculed 9-9-9 Tax Plan, which was rumored to be taken from the popular computer game Sim City and thus caused a major spike in Tweets. To analyze the Tweets, we built an individual classifier for each candidate and trained it on data regardless of whether the Tweets were from before or after the debate.

Prior to the debate, no candidate had gained the support of a major audience on Twitter. The closest a candidate came to Twitter approval was Mitt Romney, whose 19% pro-candidate mentions made him the most popular of the Republican candidates. Prior to the debate, Romney registered many headlines and, in general, far less scathing statements than his rivals.

The once high-and-mighty Rick Perry had very tough luck, garnering only 12% pro-candidate mentions prior to the debate. His problems do not end there: Perry also racked up the highest number of anti-candidate mentions prior to the debate. These Tweets often mentioned his less than stellar performance in previous debates, cryptic messages on the economy, and the cruel irony of his similarities to former President George W. Bush. Herman Cain is the middle man and, unlike Perry, is commonly mentioned in positive Tweets which highlight his great “All-American story” and his seeming clarity on tough issues.

However, this is dwarfed by comic mentions of his 9-9-9 Tax Plan, his tenure as the CEO of the Godfather’s Pizza Chain, and some of his off-color statements during the campaign. In the dubious honor category, Tweets relating to Herman Cain were by far the funniest of the bunch, with statements such as, “when people realize that Herman Cain SOLD pizza, and IS NOT actually pizza, his poll numbers will plummet.” It would be fun to conduct another crowdsourcing project to look for the funniest of the bunch.

Post-debate, while overall numbers did not fluctuate greatly, Rick Perry was certainly the big loser. Perry, who had the smallest share of pro-candidate Tweets to begin with, dropped from 12% to 4%. To make matters worse, the share of anti-candidate Tweets increased by 11 percentage points, meaning 65% of Perry Tweets were negative. Criticism of Perry came from many angles; most visible were his “showdown” with Mitt Romney, his attempts to paint Mitt Romney’s religion as bizarre, and his addressing Herman Cain as “brother” during the debate. None of this bodes well for Perry, who has been plummeting in the polling for the New Hampshire primary and has now fallen well behind both Cain and Romney.

Mitt Romney improves a fair amount post-debate, as the percentage of anti-Romney Tweets decreased by 7 percentage points. However, these 7 points went to the Neutral Tweet category, as the share of Pro-Romney Tweets remained stagnant at 19%. Neutral Tweets can be the best devices for publicity, and could be a goal of the Romney campaign. Most of these Tweets were simply headlines about a Romney current event, which can proliferate online.
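Because these shifts are reported in percentage points, it helps to separate a point change from a relative change. The sketch below works through Perry's figures from the preceding paragraph. It assumes every Tweet falls into exactly one of three categories (pro, anti, neutral), so the neutral share is whatever remains. The post does not report the neutral figures directly, and the pre-debate anti share of 54% is inferred from the 11-point rise to 65%.

```python
def point_change(before, after):
    """Change in percentage points: a simple difference of two shares."""
    return after - before


def relative_change(before, after):
    """Change relative to the starting share, expressed in percent."""
    return (after - before) / before * 100


# Shares in percent. The pre-debate anti share is inferred (65 - 11).
perry_before = {"pro": 12, "anti": 65 - 11}
perry_after = {"pro": 4, "anti": 65}

# Assumption: pro + anti + neutral = 100, so neutral is the remainder.
for shares in (perry_before, perry_after):
    shares["neutral"] = 100 - shares["pro"] - shares["anti"]

for category in ("pro", "anti", "neutral"):
    before, after = perry_before[category], perry_after[category]
    print(f"{category:8s} {before:3d}% -> {after:3d}%  "
          f"{point_change(before, after):+d} points, "
          f"{relative_change(before, after):+.0f}% relative")
```

Under these assumptions, Perry's pro share fell 8 points, which is a drop of about two thirds relative to where it started. The point changes across the three categories sum to zero, which is why Romney's 7-point fall in anti Tweets, with pro flat at 19%, shows up as a 7-point rise in neutral.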

Post-debate, there are numerous Tweets which re-Tweeted headlines directly pertaining to the debate, the vast majority of which include the name “Romney” in the headline. Many of these highlighted his “showdown” with opposing candidate Rick Perry, in which most news outlets declared Romney the big winner. There are many positive mentions of how Romney beat Rick Perry in the debate; however, there is also much negative commentary on how Romney is the “banksters” choice, who will continue to hold Wall St. in higher regard than the average person. There is also extensive discussion of Romney’s time as Massachusetts governor and the health plan he created. One Tweet reads, “mitt Romney saying get rid of “Obama-care” but his healthcare plan in Massachusetts was more or less the same.” In other words, people realize Romney is changing his view on an issue he once fought for as governor, an image that a Romney campaign manager would surely like to erase from the Twittersphere.

Finally, Herman Cain’s numbers stayed the steadiest of any of the major candidates. Post-debate, the share of positive Tweets increased nominally, going from 14% to 15%, with the difference stemming from a one-point decrease in negative Tweets. While this may look positive for Cain, statistically, this change is meaningless.

The majority of the post-debate Tweets continued themes from before the debate: jokes relating to his status as a former CEO, and his stance toward social welfare and Occupy Wall St. As the former CEO of Godfather’s Pizza, he is the butt of many pizza-related jokes, including this one: “when people realize that Herman Cain sold pizza, and IS NOT pizza, his poll numbers will plummet.” This gets my vote for funniest Cain Tweet. Being the only African American candidate in the Republican primary contest, Cain does garner some racial comments, although the vast majority do not seek to be offensive. The overwhelming majority pertain to a debate incident in which Mr. Perry referred to Mr. Cain as “brother,” while calling Mr. Romney “sir.” The name-calling adds more problems for the Perry campaign: the Tweets joke that Rick Perry was displaying a “southern good ‘ole boy” racial bias, and that Cain should have been more aggressive in identifying this. With this exception, there is very little mention of his standing as the sole African American candidate in the primary contest.

While no candidate improved greatly after the debate, the post-debate sentiment certainly does not bode well for Rick Perry, who is the butt of jokes from all parties. Herman Cain continues his charge forward, but his momentum seems to be stalling as people begin to examine him more closely and take him as a serious candidate. Finally, Mitt Romney continues to lead the pack. There is certainly a long road ahead, and New Hampshire and South Carolina can change the game, but as it stands now, Romney is the man with the target on his back.

Posted in general | Tagged Bayesian Classifiers, Data Mining, DiscoverText, GOP, Herman Cain, Machine Learning, Mitt Romney, Republican Party, Republican Primary, Rick Perry, Social Media, Texifter, Text Analysis, twitter, Twitting Monitoring | Comments Off

GNIP Power Track Tutorial: Getting Started via DiscoverText

Posted on October 23, 2011 by Stuart Shulman

This is an 11-minute tutorial covering how you get started using the GNIP Power Track for Twitter (the “full firehose”) to capture large numbers of Tweets for analysis.

Posted in DiscoverText, product | Tagged analytics, DiscoverText, GNIP, Power Track, Tutorial, Tweets, twitter | 2 Comments

GNIP’s Power Track

Posted on October 20, 2011 by Stuart Shulman

This short video talks about some of the advantages when using the GNIP-enabled Power Track for gathering Tweets via DiscoverText.

Posted in DiscoverText, product | Tagged datamining, DiscoverText, GNIP, PowerTrack, twitter | 4 Comments

DiscoverText on the Road

Posted on October 18, 2011 by Joseph Delfino

For the past month, Texifter technical and business development staff have been occupied with multiple large projects, including the cultivation of the GNIP Beta, updating system performance, experimenting with lead generation, and digging through the #OBL:-) Funniest Tweets. However, this has not stopped Texifter from taking DiscoverText on the road, allowing us to display the software to new audiences and giving new faces the opportunity to interact with and witness the power of this cutting-edge tool by participating in demos and attending talks highlighting the software.

At the Conference on Social Informatics in Singapore, Founder and CEO Stu Shulman was one of two researchers who conducted tutorials with other academics and researchers, which allowed SocInfo attendees to demo DiscoverText and work directly with Stu for information and insights. Following SocInfo, Texifter made a major appearance at the Association of Internet Researchers Conference on the other side of the Pacific, in Seattle. There Stu gave an Ignite Talk on the #OBL:-) Tweets, which eventually led to a crowdsourced winner. The presentation led to some great talk on Twitter, with such praise as,

“#StuartShulman : Political scientists did not see this coming – they never do! Yes and amen, that’s why I moved to media studies :-) #IR12,”
and
“#OBL:-)1-6 Great talk!.”

Taking DiscoverText on the road does not end with AoIR. In fact, the 2011 calendar gets busier as the year comes to a close. In the upcoming months, there will be opportunities to see DiscoverText and talk to a member of the development team on-site at the Sentiment Symposium in San Francisco, Text Analytics West in San Jose and the IE Group Predictive Analytics Conference in Amsterdam. At Text Analytics West, I will be introducing DiscoverText to the masses at North America’s premier text analytics conference, spending 2 days on-site in San Jose. In Amsterdam, the Predictive Analytics conference gives people another opportunity to hear Stu speak, when he takes the stage alongside major names from the European business community, including Banco Santander, Telefonica, and AstraZeneca, discussing the benefits of using analytics in business.

Both of these conferences will be great opportunities to see and hear about the power of DiscoverText. If you have any questions about DiscoverText, or you are interested in meeting at one of these conferences, please contact us.

Posted in general | Tagged #ir12, #OBL, Demo, DiscoverText, GNIP, ieGroup, Presentation, SocInfo, Stu Shulman, Texifter, TextWest, Travel | Comments Off

The Crazy Life of a Tweet

Posted on October 15, 2011 by Josh Sowalsky

On the evening of May 1st, Osama bin Laden was killed in Pakistan. Over the following hours (and days), bin Laden was mentioned by millions across the twitterverse. Among the first of those tweets was one from a young, gay, homeschooled man named Josh from a conservative, Christian, military family in Indianapolis.

He used and continues to use the twitter name “Calebressas,” and currently has a moderately low klout score of 31. Nevertheless, on the evening of the 1st, he tweeted, “2011: US allows gays into the army. Later in 2011: US army kills bin laden. WAY TO GO GAYS.” Moments later his tweet went viral and, in his own words, his twitter was “Blowing up!!!!!”

Thousands of twitter users began retweeting his words, sometimes citing him and sometimes not. It is not possible to identify if anyone actually tweeted this prior to Josh. Josh’s tweet is our best guess. Additionally, it is difficult to tell how many retweeted his words, as many retweets altered his original form.

One way in which this tweet changed was when one user switched the tweet’s final word from “Gays” to the words “Gay People.” We are not sure exactly when this happened and who made this change, but we can say – for certain – that it was a man named Bill Taylor (@billytwitty, klout=39) whose version of this tweet also went viral. Thousands more saw this version, though Josh likely didn’t know it.

Soon after this, as bin Laden’s name was mentioned like rapid-fire, a girl named Desi (@desilove, klout=38) retweeted the same viral tweet that Bill Taylor sent across the twitterverse. (Whether Desi saw it from Bill Taylor is also unknown.) And finally, at 12:48am EST on May 2nd, a fellow named Alan Yanuard using the twitter name “Aquayers,” having seen Desi’s tweet, tweeted this: “RT @desilove: 2011: US allows gays into the army. Later in 2011: US army kills bin laden. WELL DONE GAY PEOPLE!”
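Altered retweets like these are why counting exact copies undercounts a viral tweet. One simple way to catch near-duplicates is to strip the retweet prefix and compare the remaining text with a similarity ratio. The sketch below uses `difflib.SequenceMatcher` from Python's standard library on the two versions quoted above. It is an illustration, not the method DiscoverText uses for its near-duplicate clustering, and the 0.8 threshold and the helper names are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

NEAR_DUPLICATE_THRESHOLD = 0.8  # hypothetical cut-off, chosen for illustration


def normalize(tweet):
    """Lowercase the text and drop a leading 'RT @user:' prefix, if present."""
    text = tweet.lower().strip()
    if text.startswith("rt @"):
        _, separator, rest = text.partition(":")
        if separator:
            text = rest.strip()
    return text


def similarity(tweet_a, tweet_b):
    """Similarity between 0 and 1 after normalization."""
    return SequenceMatcher(None, normalize(tweet_a), normalize(tweet_b)).ratio()


def is_near_duplicate(tweet_a, tweet_b):
    return similarity(tweet_a, tweet_b) >= NEAR_DUPLICATE_THRESHOLD


original = ("2011: US allows gays into the army. "
            "Later in 2011: US army kills bin laden. WAY TO GO GAYS.")
variant = ("RT @desilove: 2011: US allows gays into the army. "
           "Later in 2011: US army kills bin laden. WELL DONE GAY PEOPLE!")

print(f"similarity: {similarity(original, variant):.2f}")
print("near-duplicate:", is_near_duplicate(original, variant))
```

Pairwise comparison like this does not scale to millions of tweets on its own. It only shows why a changed ending still leaves the two versions recognizably the same text.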

DiscoverText ingested millions of bin Laden tweets that night using the Twitter API, and five months later, the Qualitative Data Analysis Program at UMASS recruited nearly 30 “crowdsourcers” to find the funniest tweet. Over 22,000 tweets were sorted and Alan Yanuard’s tweet made it into the top 6. From there, Stu Shulman presented those six bin Laden tweets to the Association of Internet Researchers, where the audience members tweeted their favorite. The votes were entered and Alan Yanuard’s tweet was the winner.

And so ends the life of this bin Laden tweet. Congratulations to Amanda Crosby who uncovered this tweet (#18542) and will be awarded $100; and thank you to all of our tweeters, coders, voters, and dissenters, without whom this project would not have been possible.

Posted in general | Tagged #OBL, analytics, API, Bin Laden, Code Text, coding, Crowdsource, crowdsourcing, DiscoverText, Research, Social Media, Texifter, Text Analysis, Text Analytics, Tweets, twitter, Twitter API | Comments Off

Mining for Leads

Posted on October 14, 2011 by Joseph Delfino

Texifter is pioneering the use of machine-learning methods to harvest essential information from unstructured social media data. For example, Twitter feeds can generate top line and bottom line growth. This requires a text analysis tool that moves beyond simply displaying information. To do this right, the tools need to become more intelligent as users interact with data. DiscoverText is engineered to harvest large amounts of unstructured social media data to gain insights into potential business, which fosters the creation of new strategies to drive value.

Social media proliferation means that people who do not use Facebook, Twitter, or LinkedIn are now in the minority. According to “Socialnomics” author Erik Qualman, using social media is no longer a question of yes or no, but of how well it is used. Text analysis platforms are becoming essential to business strategy, and businesses increasingly look to them as a way to generate new revenue.

Recently, Texifter analysts have started to use DiscoverText as a lead generation tool, attempting to find potential customers on social media channels. In preliminary research, Texifter has been able to engineer 3 custom lead generation classifiers and 2 business insight classifiers. These classifiers were formed not only around large, visible corporations, but also around smaller industries with less of a social media presence, such as legal services and survey generation.

Using Twitter, analysts harvested information on one general field, the legal profession, and 2 specific businesses, Starbucks and McDonald’s. With these archives, the goal was to create a custom lead generation classifier that could be continuously refined and used over time to identify potential clients and business segments for further study and improvement. Whether data is big or small, DiscoverText is suited to handle the harvested text.

Legal Services
Formation of the Legal Services classifier began by harvesting Twitter for a group of “law-centric” Tweets. This yielded over 4,000 Tweets, easily enough to begin working on a classifier. Using the coding scheme “Spam,” “Random Tweet,” and “Potential Customer,” about 400 Tweets were coded, and the remainder of the set was classified using the new classifier. This showed that 21% of the Tweets came from potential legal customers, the majority of whom were people looking for a “divorce lawyer.” This type of insight can be found by selecting the corresponding section of DiscoverText’s interactive graphs. Using the reply tool within DiscoverText, it is possible to respond to these Tweets immediately after discovering them. It was not shocking to find that the majority of the overall set was marred by spam, most often lawyers advertising their services; however, the classifier was very effective in segregating those Tweets. With this much spam, lawyers’ strategy needs to change. Instead of adding to the already noisy Twittersphere, lawyers should employ social media text analytics to search for their clients.
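
The code-a-sample-then-classify-the-rest workflow described above can be sketched in a few lines of Python. This is only an illustration, not DiscoverText's implementation: the post's tags mention a Bayesian classifier, so the sketch assumes a simple multinomial naive Bayes model with add-one smoothing, and the `NaiveBayes` class, the tokenizer, and the example Tweets are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a Tweet and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # coded Tweets per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def train(self, coded_tweets):
        """Learn from (text, label) pairs produced by human coders."""
        for text, label in coded_tweets:
            self.doc_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocab.add(token)

    def classify(self, text):
        """Return the label with the highest log-probability for this text."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokenize(text):
                score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-coded sample (the post used about 400 coded Tweets).
coded = [
    ("Need a good divorce lawyer in Boston, any recommendations?", "Potential Customer"),
    ("Looking for a lawyer to help with my lease dispute", "Potential Customer"),
    ("Call our law firm today for a free consultation! Best attorneys", "Spam"),
    ("Free consultation! Top rated attorneys, call now", "Spam"),
    ("My brother wants to be a lawyer when he grows up", "Random Tweet"),
    ("Watching a law show on tv tonight", "Random Tweet"),
]

# Hypothetical uncoded remainder of the archive.
uncoded = [
    "Anyone know a divorce lawyer? need recommendations",
    "Call now for a free consultation with our attorneys",
]

model = NaiveBayes()
model.train(coded)
predictions = [model.classify(text) for text in uncoded]
share = Counter(predictions)  # label counts, the basis for figures like "21%"
```

With a real archive, the counts in `share` divided by the number of classified Tweets would give the per-category percentages reported above.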

Real Big Data - Highly Visible, Often Mentioned Mega-Corporations
The McDonald’s and Starbucks archives combined included nearly 70,000 Tweets. These were collected over a couple of days using the normal Twitter API, which harvests 1-2% of all Tweets, meaning these corporations are mentioned over 5 million times a day. Once the archives were broken into manageable datasets, it was possible to form multiple lead generation and business insight classifiers, all specifically tailored to these large corporations’ businesses. A Starbucks dataset of over 1,000 Tweets was coded using the scheme “Potential Customer,” “Location Insight,” “Random Tweet,” and “Spam.” The classifier revealed that 46% of the Tweets were from potential customers and 18% provided a location insight, meaning that over 60% of the data was valuable information. The potential customers often Tweeted about their upcoming Starbucks visit, often posing the question of what to purchase, opening the door for suggestions and giving Starbucks the opportunity to promote a drink. Location insights often showed that many people do not have the luxury of a Starbucks-across-from-a-Starbucks, and that quality of service can sometimes differ depending on location. Both are important pieces of knowledge for any business, which can now be acted on.
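
For readers who want to see how a sampled archive scales up to a daily mention count, here is a back-of-the-envelope sketch. It assumes the sample is a uniform 1-2% of all Tweets and that "a couple of days" means one to two days; the function name and both assumptions are ours, and the result is quite sensitive to them.

```python
def estimate_daily_mentions(sampled_tweets, sample_percent, days):
    """Scale a sampled Tweet count up to an estimated number of mentions per day.

    sample_percent is the assumed share of all Tweets the API returns (e.g. 1 or 2),
    and days is the assumed length of the collection window.
    """
    total_mentions = sampled_tweets * 100 // sample_percent
    return total_mentions // days


# Try each combination of assumed window length and sampling rate.
for days in (1, 2):
    for percent in (1, 2):
        estimate = estimate_daily_mentions(70_000, percent, days)
        print(f"{days} day(s) at {percent}% sampling: about {estimate:,} mentions per day")
```

Under these assumptions the estimate ranges from under 2 million to 7 million mentions per day, so the headline figure depends heavily on the sampling rate and collection window.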

Using the McDonald’s data, a slightly different “business segment” classifier was created, with the goal of developing a multi-layered classifier, one which could classify data on two levels. Instead of only looking for potential leads, the coding scheme sought to identify Tweets based on different aspects of McDonald’s business, specifically with the goal of finding areas which could be improved.

Interestingly, for a “restaurant,” only 15% of the Tweets mentioned the food at McDonald’s. Sure, it is great to know that 15% of the Tweets are about food, but what type? Using the multi-layered classifier approach, we can create another classifier specifically tailored to just food comments, which will answer that question. When this is done, it reveals that the vast majority of Tweets do not specify a particular item at McDonald’s.

However, when a particular food item was mentioned, the fries took first place. “America’s Favorite Fries” might be working; however, other brand names such as “Big Mac” and “McFlurry” might not be. Additional steps in the business insights process are endless with DiscoverText. From here, it is possible to continue classifying by sentiment, or to take the individual categories and again create a new classifier.
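
The multi-layered idea described above, where a second classifier runs only on the Tweets that the first layer placed in one category, can be sketched as follows. This is a minimal illustration under our own assumptions: the segment labels, keyword lists, and helper names are hypothetical, and simple keyword rules stand in for trained classifiers.

```python
def keyword_classifier(rules, default):
    """Build a stand-in classifier: the first label with a matching keyword wins."""
    def classify(text):
        lowered = text.lower()
        for label, keywords in rules.items():
            if any(keyword in lowered for keyword in keywords):
                return label
        return default
    return classify


# First layer: hypothetical business segments.
segment_layer = keyword_classifier(
    {
        "Food": ["food", "lunch", "fries", "big mac", "mcflurry"],
        "Service": ["drive thru", "cashier", "service"],
    },
    default="Other",
)

# Second layer: applied only to Tweets the first layer labeled "Food".
food_layer = keyword_classifier(
    {"Fries": ["fries"], "Big Mac": ["big mac"], "McFlurry": ["mcflurry"]},
    default="Unspecified Item",
)


def classify_layered(text, first_layer, second_layers):
    """Return (top_label, sub_label); sub_label is None when no second layer applies."""
    top_label = first_layer(text)
    refine = second_layers.get(top_label)
    return top_label, (refine(text) if refine else None)


tweets = [
    "Those McDonald's fries hit the spot",
    "Grabbing lunch at McDonald's",
    "The drive thru at McDonald's took forever",
]
results = [classify_layered(t, segment_layer, {"Food": food_layer}) for t in tweets]
```

Adding a sentiment layer, or a second layer for another segment, would only require another entry in the `second_layers` dictionary.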

Small Data - Survey Generation
DiscoverText recently began using the GNIP PowerTrack, taking social media lead generation and business development to another level by giving the system the ability to ingest 50-100 times more Tweets than the regular Twitter API allowed, along with much more robust metadata. This will only increase the amount of data which DiscoverText can ingest, allowing businesses to gain even more exact insight into their data, and to continuously monitor their brands on Twitter. Aiding a small survey company, we began feeds which pertained to the creation of online surveys. Using the PowerTrack, more than 175,000 Tweets were harvested. But how do we find such a specific request for survey creation help in such a large pile of Tweets?

Using just keyword searches and DiscoverText’s Cloud Explorer, we were able to identify numerous customers who needed help generating survey participants, as well as a handful of people who needed help creating their surveys, all of whom could be pursued by the survey creation company. Going out on a limb, using our built-in response tool, I contacted a Twitter user who needed help creating a survey. Using the metadata which had been harvested, I knew that his Tweet was fresh and ripe to answer. The immediate response worked, as the user acknowledged my Tweet, and was thus persuaded to check into the new survey generation site.
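
A keyword search combined with a freshness check on the harvested metadata might look something like the sketch below. It is not DiscoverText's search or the PowerTrack format: the Tweet fields (`user`, `text`, `created_at`), the keyword phrases, and the two-hour freshness window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def find_fresh_leads(tweets, keywords, now, max_age):
    """Return Tweets that match any keyword phrase and are recent, newest first."""
    leads = [
        tweet
        for tweet in tweets
        if now - tweet["created_at"] <= max_age
        and any(keyword in tweet["text"].lower() for keyword in keywords)
    ]
    return sorted(leads, key=lambda tweet: tweet["created_at"], reverse=True)


# Hypothetical harvested Tweets with a timestamp taken from their metadata.
harvested = [
    {"user": "a", "text": "Can anyone help me create a survey for my class?",
     "created_at": datetime(2011, 10, 10, 11, 30)},
    {"user": "b", "text": "Need help creating a survey ASAP",
     "created_at": datetime(2011, 10, 8, 9, 0)},
    {"user": "c", "text": "Just took a survey about coffee",
     "created_at": datetime(2011, 10, 10, 11, 45)},
]

leads = find_fresh_leads(
    harvested,
    keywords=["create a survey", "creating a survey"],
    now=datetime(2011, 10, 10, 12, 0),
    max_age=timedelta(hours=2),
)
lead_users = [tweet["user"] for tweet in leads]
```

Here only the first Tweet qualifies: the second matches a keyword but is two days old, and the third is recent but is not a request for help.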

DiscoverText as a lead generation tool is the perfect synthesis of analysis and monitoring tools, giving businesses a great advantage in driving value and growth online. In the future, Texifter will be posting more material on how to use DiscoverText for lead generation. If there is something you would like to see, contact us. Please visit the Texifter website to view our new lead generation product sheet.

Posted in general | Tagged bayesian classifier, custom classifier, Data Mining, DiscoverText, lead generation, Social Media, social media monitoring, Texifter, text analyrics | 1 Comment