DiscoverText took a leap forward a few weeks ago with the addition of a beta text classifier from the developers over at uClassify (www.uclassify.com). Integration of this tool into a one-of-a-kind active-learning system inside DiscoverText allows users to create and use topic, mood and sentiment classifiers on the fly. The need to make this kind of human language technology widely available was recognized years ago by uClassify founder Jon Kågström.
“We recognized that classifiers are mostly present at universities research departments and expensive commercial companies. We want to change that. We want everyone to have the possibility to use a top notch classifier.”
The joining of these two technologies, DiscoverText & uClassify, makes it possible for anyone to automate the tagging of very large datasets after coding only a fraction of the items. This machine-learning classifier for the masses greatly speeds up initial discovery and analysis of very large text datasets, including social media comments and open-ended survey answers. At uClassify, there is an index of user-created classifiers in addition to the many developed in house.
What is a classifier and how does it work?
According to the FAQ at uClassify,
A text classifier answers the question ‘To which predefined category is this text most likely to belong?’ For example, a classifier trained on web categories can answer “I am 99% certain that [this] web page belongs to the category food.”
A classifier works by applying Bayes’ theorem, named for Thomas Bayes, which tells us how to update the probability of a hypothesis after observing evidence, even when two possibilities look very similar and you can never be 100% sure which one you have. A good example: you have two boxes that look exactly alike but have slightly different contents. Suppose Box A holds 10 red balls, 6 green balls, and 4 blue balls, whereas Box B holds 5 red balls, 4 green balls, and 11 blue balls. Without taking all the balls out and counting the colors, you can’t easily know which box is A and which is B. Using Bayes’ theorem, however, you can pull a couple of balls from one box and, based on the colors you drew, estimate the probability that the box is A or B. Draw a blue ball, for instance, and Box B becomes the more likely candidate, because blue balls are far more common there.
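To make the box example concrete, here is a minimal Python sketch of the calculation. It assumes two things the example does not state: that each box is equally likely before you draw anything, and that each ball is put back before the next draw. The function name `posterior` is only an illustration.

```python
from fractions import Fraction

# Ball counts taken from the example above.
BOXES = {
    "A": {"red": 10, "green": 6, "blue": 4},
    "B": {"red": 5, "green": 4, "blue": 11},
}


def posterior(draws, boxes=BOXES):
    """Return P(box | draws) for each box.

    Assumptions (not from the article): equal prior probability for
    each box, and draws made with replacement.
    """
    prior = Fraction(1, len(boxes))
    joint = {}
    for name, counts in boxes.items():
        total = sum(counts.values())
        likelihood = Fraction(1)
        for color in draws:
            # Chance of drawing this color from this box.
            likelihood *= Fraction(counts[color], total)
        joint[name] = prior * likelihood
    # Bayes' theorem: divide by the total probability of the evidence.
    evidence = sum(joint.values())
    return {name: p / evidence for name, p in joint.items()}


print(posterior(["red"]))           # A: 2/3, B: 1/3
print(posterior(["blue", "blue"]))  # A: 16/137, B: 121/137
```

One red ball already makes Box A twice as likely as Box B, and two blue balls push the estimate strongly toward Box B. Each extra draw adds evidence, which is the same reason a text classifier grows more confident as it sees more words.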
With a text classifier, instead of boxes you have categories, and in place of the colored balls you have the words of the text. You train a classifier by giving it large amounts of text that you have assigned to specific categories and letting the classifier work out which components of the text set each category apart.
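The same idea carries over to text in the sketch below: a small naive Bayes classifier that counts words per category during training and then scores new text against those counts. This is an illustration only, not the uClassify implementation or API. The class name `NaiveBayesText`, the whitespace tokenizer, and the add-one smoothing are all assumptions made to keep the example short.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Toy naive Bayes text classifier (illustrative, not uClassify's code)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> training texts
        self.vocab = set()

    def train(self, category, text):
        words = text.lower().split()  # naive whitespace tokenizer
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return a probability for each trained category."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            # Add-one smoothing so unseen words never give zero probability.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # prior for the category
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            log_scores[category] = score
        # Convert log scores back to probabilities that sum to 1.
        top = max(log_scores.values())
        scaled = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(scaled.values())
        return {c: v / norm for c, v in scaled.items()}


clf = NaiveBayesText()
clf.train("food", "pasta recipe with tomato sauce")
clf.train("food", "bake bread with flour")
clf.train("sports", "the team won the hockey game")
clf.train("sports", "hockey players score goals")

result = clf.classify("tomato bread recipe")
print(max(result, key=result.get))  # food
```

The categories play the role of the boxes and the words play the role of the colored balls. Four training sentences are far too few for real work, so treat the output as a demonstration of the mechanics only.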
What can a classifier do?
Classifiers look for patterns in text that allow them to put the text into user-defined categories. The user tells the program which category each section of training text falls into, and the classifier then identifies similarities and differences. When the classifier is used on a new piece of text, it looks for these same similarities. This open method allows for a wide variety of classifiers. One of the first commercial uses of the uClassify classifiers was for spam filtering. Radian6 (www.radian6.com), a social media monitoring company, teamed up with the guys at uClassify and trained a classifier to weed out spam blogs (blog.uclassify.com). Other text classifiers that have been built and trained since then include sentiment, mood, topic/category, gender identifier, language identifier, age analyzer, and many more (www.uclassify.com/browse).
My own projects have included creating one classifier to sort expressions of group support among three separate universities in an online contest and another to classify likes and dislikes of some national store chains in Facebook comments. In light of the open nature of uClassify technology and the DiscoverText platform, we hope our users find this statement by the uClassify team to be true:
“We find it enormously exciting to see what happens when a tool for creativity is given to the community. We hope to see all kinds of beyond-our-imagination classifiers and incredible web applications being built.”
If you have an idea for a classifier, go for it. We think great things are possible. If you get stuck at any point in the process, just drop us a line here at DiscoverText so that we can help you out.
Who are the uClassify team?
Much praise must go out to Jon Kågström, Roger Karlsson, and Emil Kågström, the three-member team from Sweden that makes up uClassify. Jon has been working with text classification since 2004 and wrote a master’s thesis on “Improving Naive Bayesian Spam Filtering” (uClassify.com/About). Jon is a prolific programmer and has developed an assortment of free applications, available on his homepage Codeode.com (www.codeode.com), that people should definitely check out. When Roger isn’t working on making the servers run better for uClassify, he is working on his own programming projects over at Kephyr.com (www.kephyr.com). Roger’s motto at the top of the site says it all: “Nice software for nice people!” Emil (Jon’s brother) works on uClassify in his spare time. Like the other two, Emil keeps a personal website (www.kipnic.com), which specializes in resources for webmasters. As if that did not keep him busy enough in his limited free time, he also created and updates a hockey pool website (www.yoursportpools.com).