Five Pillars of Text Analytics

Document relevance is a key challenge for social media research. The specific problem of “word sense disambiguation” is widespread. If I am interested in “banks” where money is stored, I want to exclude mentions of river banks. If I am “Delta” airlines, I do not want to see social data about Delta faucets, Delta Force, or those pesky river deltas. If I run a sports team like the Pittsburgh Penguins, the massive numbers of Facebook posts and tweets about flightless but adorable birds are equally problematic. Very few social media analytics projects can avoid the challenge of sorting relevant from irrelevant documents.

At Texifter, we have refined a powerful set of tools and techniques for doing word sense disambiguation. This 5-minute video uses the example of Governor Chris Christie to illustrate how the five pillars of text analytics can help anyone to identify and remove irrelevant documents from an ambiguous social data collection. The principles are very similar to spam filtering in email; we use the same mathematics. Using DiscoverText, we argue an individual or small collaborative team can create a custom machine classifier for the task in just a few hours. Someday, we hope to get this down to a few minutes.
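
The spam-filter analogy can be made concrete with a small sketch. The post does not say which model DiscoverText uses, so the example below assumes a multinomial naive Bayes classifier, a common choice for spam filtering. The training sentences, labels, and function names are all hypothetical and stand in for documents a human coder has already marked as relevant or irrelevant.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would do more."""
    return text.lower().split()


def train(labeled_docs):
    """Count words per label from (text, label) pairs coded by a human."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the highest log posterior (Laplace smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Log prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one smoothing keeps unseen words from zeroing the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-coded examples for an ambiguous "Penguins" collection.
TRAINING = [
    ("penguins win in overtime at home", "relevant"),
    ("crosby scores twice for the penguins", "relevant"),
    ("penguins goalie makes huge save", "relevant"),
    ("adorable penguins waddle at the zoo", "irrelevant"),
    ("emperor penguins huddle on antarctic ice", "irrelevant"),
    ("baby penguins are flightless birds", "irrelevant"),
]

word_counts, doc_counts = train(TRAINING)
print(classify("penguins score in overtime", word_counts, doc_counts))
print(classify("cute penguins at the zoo", word_counts, doc_counts))
```

With only six training sentences this is a toy. The point is that the ambiguous word itself carries no signal, so the classifier leans on the surrounding vocabulary, which is why a few hours of human coding can go a long way.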

Big Data TechCon


DiscoverText: A Vital Research Tool for Social Media

Longtime DiscoverText User Jacob Groshek

I’ve been using DiscoverText for several years, primarily in an academic research capacity but also working with journalists to help them reach broader audiences through social media. From an academic standpoint, DiscoverText was the backbone of collecting Facebook and Twitter data for a study on the 2012 Presidential election that was published in Social Science Computer Review. When working with the New England Center for Investigative Reporting, we use DiscoverText to collect social data and mine it to find users interested in topics the center covers and to share stories with them. Raw data can be exported for use in third-party software, as in the case of this work on co-mentions about flooding.

Altogether, DiscoverText is a vital tool not only for collecting data but also for coding and analyzing it. It is simply the best place to begin with social data, and offers utilities many other entities do not, including the ability to clean data and minimize redundancies such as those created by bots. DiscoverText and Texifter personnel have my highest endorsement. It is a model enterprise for users at all levels who are looking to engage in a rich and thorough analysis of social media data.

DiscoverText as a Teaching and Research Tool

Conducting research on the impact of large projects and events is difficult as each undertaking is unique. Traditional quantitative techniques face limitations of internal validity while qualitative research faces challenges of external validity. However, projects and events generate a massive amount of social media traffic that can be used to understand stakeholder interactions before, during, and after delivery. In addition to research, this traffic also provides an avenue to enhance teaching and learning activities, as students can collect social media data to apply new research techniques such as text mining. At Bournemouth University, we’ve launched a project called Festim that aims to develop research and teaching using data from social media networks.

For research, the initial objective is to enable the evaluation of social impacts, an area that is difficult to assess using conventional qualitative and quantitative approaches. In the teaching domain, we wish to develop Reusable Learning Objects that can guide future graduate researchers seeking to apply social media data. We also wish to widen the range of research options available to undergraduate students to include social media analysis.

We were fortunate to get a trial enterprise subscription to DiscoverText, which we used to support all of these activities. For research, DiscoverText enables us to understand the online narratives around events on Facebook, Google+, and Twitter. So far, we have been able to create a taxonomy that compares festivals by online stakeholder engagement. Our team is also exploring the nature of discussions that generate engagement across multiple platforms. We’ve used DiscoverText to uncover the nature of the temporary communities of interest that form on social media around discussions of festivals.

Undergraduate researchers have also deployed DiscoverText. One student has used the platform to compare the impact of music events while another has explored how social media is used to recruit volunteers. For teaching, our students have been using DiscoverText to understand the content of discussions on Facebook pages of case study companies as a way of illuminating current issues.

Tools for Text – Lecture at Northeastern University Monday March 10, 2014

Tools for Text

Dr. Stuart W. Shulman
Founder & CEO of Texifter
Research Associate Professor of Political Science
University of Massachusetts Amherst

12pm – 1:15pm, Monday, March 10
Center for Complex Network Research
5th floor Dana Building, Northeastern University (take elevator on left)

Tools for reviewing, coding, and retrieving text found in qualitative data analysis packages carry with them no particular attributes for ensuring the reliability or accuracy of the recorded observations. Based on 13 years of multidisciplinary experience, this presentation guides researchers through key aspects of measuring coder validity and reliability as part of building custom machine classifiers. The presentation demonstrates how text mining and related analytic tools focus attention on unexpected or difficult to code concepts, which in many cases will constitute the most interesting terrain for deeper investigation.
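
One common way to put a number on coder reliability is Cohen's kappa, which corrects raw agreement between two coders for the agreement expected by chance. The abstract does not name the statistic used in the talk, so this is only an illustrative sketch; the function name and the two hypothetical coders' labels are assumptions.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders on the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(coder_a)
    # Observed agreement: share of items where the two codes match.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement: chance of a match given each coder's label rates.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes from two people reviewing the same eight documents.
coder_a = ["rel", "rel", "rel", "irr", "irr", "rel", "irr", "irr"]
coder_b = ["rel", "rel", "irr", "irr", "irr", "rel", "rel", "irr"]

kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 2))
```

Here the coders agree on six of eight items (0.75), but half of that agreement would be expected by chance, so kappa is noticeably lower than the raw figure. The items where coders disagree are often the difficult concepts worth a closer look.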

Texifter News: Migration to Azure and the Big Boulder Initiative

A brief follow-up on Texifter. We successfully migrated “DiscoverText” to Microsoft’s Azure. The migration was very smooth, though we are going through a period of diminished search and filtering capabilities while the data re-indexes. All other capabilities appear stable.

We also launched a new beta product on Azure that lets users get free, self-serve estimates for data from the full history of Twitter, and then buy that data. The live prototype is “Sifter”.

Finally, I have been elected a board member and Treasurer for the Big Boulder Initiative. In that capacity, I will help organize the social data industry association that will launch in June at Big Boulder.

2014 is looking good for Texifter. On January 31, 2014, the company re-acquired all assets and intellectual property related to DiscoverText, including the Sifter stack of language technologies for de-duplication, clustering, coding, and machine-learning, as well as the “CoderRank” patent. Going forward, we believe these tools can make a significant impact on the history of information.

Collecting Facebook & Twitter Data

This is an updated 4-minute tutorial on how to collect public Facebook data via the Open Graph API using DiscoverText.

This is an even shorter 75-second tutorial on how to collect Twitter data via the public API.

New Product Testing –

Update 2.12.2014
The beta has been renamed Sifter and moved to


Digital Methods Initiative Winter 2014 Slides

It was a great joy to return to the University of Amsterdam and give this talk to my old friend Richard Rogers and his 100+ attentive workshop attendees.

Free Gnip-enabled Historical Twitter Estimates

Use search and powerful @Gnip Power Track operators to find the exact slice of Twitter history that you need.

Search every tweet in history via the Gnip-enabled Power Track for Twitter

Win Historical Twitter Datasets

Just about six hours left to win valuable historical Twitter datasets and powerful text analytics software. This is by far our best Facebook raffle yet. To enter:

  1. Log in to Facebook
  2. Visit this URL:
  3. Tweet about the raffle, follow DiscoverText on Twitter, or like us on Facebook.
  4. Do all three to increase your chances.
  5. Refer friends to do better still.

The winner will get three 10-day historical Twitter datasets, with Power Track search operators enabled by our friends @gnip, as well as gratis use of the DiscoverText software platform. Runners-up will also get valuable software prizes for a full year.

DiscoverText Sweepstakes

SIOP 2013 DiscoverText Sweepstakes 
Win One Free Year of DiscoverText Enterprise Individual Access

I would like to invite you to enter the SIOP 2013 DiscoverText Sweepstakes. All you need to do is sign up online for a free, 30-day, no obligation trial:

It should only take 60-90 seconds to sign up. To be entered in this round of the sweepstakes, sign up by April 19th, 2013. This drawing is not limited to SIOP 2013 booth visitors. You can tell friends, colleagues, professors and students, family, everyone you work with, and anyone else you like about the trial and sweepstakes.

Users report that they love DiscoverText, and the sweepstakes winner will get a valuable prize.

If you have any questions once you are in the free trial, or about text analytics more generally, I would be delighted to hear from you.

To watch a brief DiscoverText customer testimonial, please visit:

Joining Vision Critical: Reflections of an Inventor

As of today, DiscoverText is part of a larger company: Vision Critical, a market research technology provider that works with more than a third of the world’s top 100 brands. The thrill of joining a successful and growing firm headquartered in Vancouver is amplified by my pride in what DiscoverText can bring to Vision Critical’s customers, and by my excitement about what we’ll be able to offer our existing customers now that we are part of Vision Critical.

I am personally joining Vision Critical as Vice President for Text Analytics, and while I will still be based in western Massachusetts, I’ll have a chance to work with Vision Critical staff and clients at the company’s offices across North America and around the world. My task at Vision Critical is to work with every colleague to add a new analytic dimension to the integrated product suite. We will further develop DiscoverText so that it becomes a seamless, world class text analytics solution for Vision Critical customers and research personnel.

To that end, we have started drawing on the software engineering expertise and market research experience of the Vision Critical team. As we move deeper into 2013, current DiscoverText and Vision Critical customers will benefit from a growing array of powerful tools, scientifically-informed methods, and access to new data types, all backed by a robust IT infrastructure. Whether you are working with panel survey data, emails, customer service data, or one of the many Gnip-enabled premium social media feeds, my job is to shorten the time it takes you to reach valid and reliable, data-driven insights. “Better insights faster” is the operative theme.

I am honored and deeply grateful to have this opportunity to join Vision Critical. On a personal note, as someone who grew up in Vancouver (and has fond memories of tossing Frisbees with my family on Spanish Banks), it’s wonderful to be joining a company that is one of Vancouver’s great success stories. While it’s now truly a global company, with half its employees based in offices as far-flung as New York, London and Hong Kong, I look forward to regular visits to Vancouver HQ.

The top priority now is to bring a rigorous and innovative approach to the analysis of text into an elegant and ever more useful software framework. I am confident that DiscoverText will continue to grow more powerful in many interesting and unexpected ways. On behalf of my colleague and trusted Chief Technology Officer Mark Hoy, I can say unreservedly we are pumped up to be a part of a vibrant organization like Vision Critical.

Gnip Power Track Expansion

It’s official. Starting in January 2013, DiscoverText customers will be able to purchase monthly access to four vibrant Gnip-enabled Power Track data feeds. Building on current successes with Twitter, we are pleased to offer unprecedented federated Power Track access to WordPress, Disqus, and Tumblr as part of our social #bigdata offering. Keep an eye on the blog for the launch in early January.
