Five Pillars of Text Analytics

Document relevance is a key challenge for social media research. The specific problem of “word sense disambiguation” is widespread. If I am interested in “banks” where money is stored, I want to exclude mentions of river banks. If I am “Delta” airlines, I do not want to see social data about Delta faucets, Delta force, or those pesky river deltas. If I run a sports team like the Pittsburgh Penguins, the massive numbers of Facebook posts and Tweets about flightless but adorable birds are equally problematic. There are very few social media analytics projects that can easily avoid the challenge of sorting relevant and irrelevant documents.

At Texifter, we have refined a powerful set of tools and techniques for doing word sense disambiguation. This 5-minute video uses the example of Governor Chris Christie to illustrate how the five pillars of text analytics can help anyone to identify and remove irrelevant documents from an ambiguous social data collection. The principles are very similar to spam filtering in email; we use the same mathematics. Using DiscoverText, we argue an individual or small collaborative team can create a custom machine classifier for the task in just a few hours. Someday, we hope to get this down to a few minutes.
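To make the spam-filter analogy concrete, here is a minimal sketch of a bag-of-words naive Bayes classifier that sorts relevant from irrelevant documents, using the Penguins example from above. It is not DiscoverText's implementation: the choice of multinomial naive Bayes with add-one smoothing, the `NaiveBayes` class, the `tokenize` helper, and the six training sentences are all assumptions made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would do more."""
    return text.lower().split()


class NaiveBayes:
    """Hypothetical multinomial naive Bayes over a bag of words."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # every word seen during training

    def train(self, labeled_docs):
        for text, label in labeled_docs:
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for word in tokenize(text):
                counts[word] += 1
                self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Start from the log prior for this label.
            score = math.log(self.doc_counts[label] / total_docs)
            # Add-one (Laplace) smoothing over the shared vocabulary.
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented, hand-labeled examples: hockey team versus the birds.
TRAINING = [
    ("penguins win hockey game in pittsburgh", "relevant"),
    ("penguins score late goal in overtime", "relevant"),
    ("crosby leads penguins past rangers", "relevant"),
    ("penguins waddle across antarctic ice", "irrelevant"),
    ("zoo penguins hatch adorable chick", "irrelevant"),
    ("emperor penguins dive for fish", "irrelevant"),
]

classifier = NaiveBayes()
classifier.train(TRAINING)

print(classifier.classify("penguins goal in overtime"))
print(classifier.classify("adorable penguins dive on ice"))
```

Both test sentences contain the ambiguous word "penguins"; the label is decided by the other words in the document, which is why a few hundred hand-coded examples can stand in for a keyword filter.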

About Stuart Shulman

Stuart Shulman is a political science professor, software inventor, entrepreneur, and garlic growing enthusiast who coaches U13 boys club soccer and in the Olympic Development Program with a national D-license. He is Founder & CEO of Texifter, LLC, Director of QDAP-UMass, and Editor Emeritus of the Journal of Information Technology & Politics. Stu is the proud owner of a Bernese/Shepherd named "Colbert" who is much better known as 'Bert. You can follow his exploits @stuartwshulman.
This entry was posted in DiscoverText, general, product, research, Social Media.
  • Indeprensus Blog

    Nice video. But I do have questions. Everyone uses their cellphones and tablets these days. Spelling mistakes are bound to happen. How are you taking care of spelling mistakes during analysis?
    Example: someone wants to write “I want to ride horse” but types “I want 2 ride hose”. Here the word “hose” is a legal dictionary word but is out of context.
    I do not see many tools caring about spelling mistakes or the context of words.
    So for me, search itself is still a big task!!!

    Found this https://sourceforge.net/projects/falcontextsearch/
    It's heavy.. slow.. amateur, but still a start.

    Or Google will come up with a document search tool like Google search queries. :D

    • DiscoverText

      There is a technology called ‘fuzzy’ search. We do not happen to employ it. Bayesian classification does offer a way around misspelled keywords, as it looks at the whole bag of words and not individual keywords.