Watch a DiscoverText Webinar

We have been holding Webinars for the last few years as a way to get the word out about our software, but also to learn about the issues people are having gathering, organizing, filtering, searching, coding and analyzing text data. When folks sign up but miss the Webinar, we often get requests for the recording. So, here is a recent DT Webinar.

Posted in DiscoverText, product | Comments Off

A Classifier for the Masses

DiscoverText took a leap forward a few weeks ago with the addition of a beta text classifier from the developers over at uClassify (www.uclassify.com). Integration of this tool into a one-of-a-kind active-learning system inside DiscoverText allows users to create and use topic, mood and sentiment classifiers on the fly. The need to make this kind of human language technology widely available was recognized years ago by uClassify founder Jon Kågström.

“We recognized that classifiers are mostly present at universities research departments and expensive commercial companies. We want to change that. We want everyone to have the possibility to use a top notch classifier.”

The joining of these two technologies, DiscoverText & uClassify, makes it possible for anyone to automate the tagging of very large datasets after only coding a fraction of the items. This machine-learning classifier for the masses greatly speeds up initial discovery and analysis of very large text data sets, including social media comments and open-ended survey answers. At uClassify, there is an index of user-created classifiers in addition to the many developed in house.

What is a classifier and how does it work?
According to the FAQ at uClassify,

A text classifier answers the question ‘To which predefined category is this text most likely to belong?’ For example, a classifier trained on web categories can answer “I am 99% certain that [this] web page belongs to the category food.”

A classifier works by applying a theorem from Thomas Bayes, which tells us how to update the probability that something belongs to a category as we observe evidence about it, even when the candidates are very similar and we can never be 100% sure. A good example is two boxes that look exactly alike but have slightly different contents. Suppose Box A has 10 red balls, 6 green balls, and 4 blue balls in it, whereas Box B has 5 red balls, 4 green balls, and 11 blue balls. Without taking all the balls out and counting the colors, you can’t easily tell which box is A and which is B. Using Bayes’ theorem, however, you can pull a couple of balls out of one box and, based on the colors you drew, estimate the probability that the box is A or B.
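To make the box example concrete, here is a minimal sketch in Python that applies Bayes' theorem to the ball counts above. It is our own illustration, not code from uClassify or DiscoverText. It assumes both boxes are equally likely before any ball is drawn and that each ball is put back before the next draw; the function name `posterior` is hypothetical.

```python
from fractions import Fraction

# Ball counts from the example above.
BOXES = {
    "A": {"red": 10, "green": 6, "blue": 4},
    "B": {"red": 5, "green": 4, "blue": 11},
}

def posterior(draws, boxes=BOXES):
    """Return P(box | draws), assuming equal priors and draws with replacement."""
    joint = {}
    for name, counts in boxes.items():
        total = sum(counts.values())
        # Start from the prior: every box is equally likely.
        probability = Fraction(1, len(boxes))
        # Multiply in the chance of each observed color for this box.
        for color in draws:
            probability *= Fraction(counts[color], total)
        joint[name] = probability
    # Normalize so the probabilities over all boxes sum to 1.
    evidence = sum(joint.values())
    return {name: value / evidence for name, value in joint.items()}

result = posterior(["red", "red"])
print({name: float(p) for name, p in result.items()})
```

Under these assumptions, drawing two red balls makes Box A four times as likely as Box B (0.8 versus 0.2), even though neither box has been emptied and counted.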

With a text classifier, instead of boxes you have categories and in the place of the colored balls you have text. You train a classifier by giving it large amounts of text that you have selected to be in a specific category and letting the classifier figure out what components in the text make each category different.
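The same idea carries over to text. The sketch below is a toy naive Bayes classifier written for illustration only; it is not uClassify's implementation, and the class name, the whitespace tokenizer, and the add-one smoothing are all our own assumptions. Categories play the role of the boxes and words play the role of the colored balls.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Toy multinomial naive Bayes classifier (illustrative, not uClassify's code)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training texts
        self.vocabulary = set()

    def train(self, category, text):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return a dict mapping each category to its estimated probability."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for category in self.doc_counts:
            # Prior: share of training texts in this category.
            score = math.log(self.doc_counts[category] / total_docs)
            total_words = sum(self.word_counts[category].values())
            for word in words:
                # Add-one smoothing so unseen words do not zero out a category.
                count = self.word_counts[category][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            log_scores[category] = score
        # Convert log scores back to probabilities that sum to 1.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}

clf = TinyNaiveBayes()
clf.train("positive", "love this store great service")
clf.train("positive", "great prices friendly staff")
clf.train("negative", "hate the long lines")
clf.train("negative", "terrible service rude staff")

probs = clf.classify("great friendly service")
print(max(probs, key=probs.get), probs)
```

With only four made-up training sentences the probabilities mean very little; a useful classifier needs far more training text per category, as described above.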

What can a classifier do?
Classifiers look for patterns in text that allow them to put the text into user-defined categories. The user tells the program which category each section of training text falls into, and the classifier then identifies similarities and differences. When the classifier is then used on a new piece of text, it tries to find these same similarities. This open method allows for a wide variety of classifiers. One of the first commercial uses of the uClassify classifiers was for spam filtering. Radian6 (www.radian6.com), a social media monitoring company, teamed up with the guys at uClassify and trained a classifier to weed out spam blogs (blog.uclassify.com). Other text classifiers that have been built and trained since then include sentiment, mood, topic/category, gender identifier, language identifier, age analyzer, and many more (www.uclassify.com/browse).

My own projects have included creating classifiers to classify group support between three separate universities in an online contest and one to classify like and dislike of some national store chains in Facebook comments. In light of the open nature of uClassify technology and the DiscoverText platform, we hope our users find this statement by the uClassify team to be true:

“We find it enormously exciting to see what happens when a tool for creativity is given to the community. We hope to see all kinds of beyond-our-imagination classifiers and incredible web applications being built.”

If you have an idea for a classifier, go for it. We think great things are possible. If you get stuck at any point in the process, just drop us a line here at DiscoverText so that we can help you out.

Who are the uClassify team?
Much praise must go out to Jon Kågström, Roger Karlsson, and Emil Kågström, the three-member team from Sweden that makes up uClassify. Jon has been working with text classification since 2004 and wrote a master’s thesis on “Improving Naive Bayesian Spam Filtering” (uClassify.com/About). Jon is a prolific programmer and has developed an assortment of free applications, available on his homepage Codeode.com (www.codeode.com), that people should definitely check out. When Roger isn’t working on making the servers run better for uClassify, he is working on his own programming projects over at Kephyr.com (www.kephyr.com). Roger’s motto at the top of the site says it all: “Nice software for nice people!” Emil (Jon’s brother) works on uClassify in his spare time. Like the other two, Emil also keeps a personal website (www.kipnic.com) that specializes in resources for webmasters. Not busy enough in his small amount of off time, he also created and updates a hockey pool website (www.yoursportpools.com).

Posted in general | 2 Comments

Peer Network Privacy Controls

Tonight, we introduce two slightly overlooked but very powerful features of DiscoverText: building peer networks within the system, and privacy controls. The peer functions and privacy controls within DiscoverText have been around since the inception of the system, so I’ll only cover the new features in depth. Feel free to take a look at the help wiki for more information.

Although DiscoverText is quite a powerful system when used by itself, if you form networks of peers within DiscoverText, you can assign those peers to help toil through your analysis. Imagine you have a 300,000-item archive and you wish to perform some in-depth analysis of it. You can certainly use DiscoverText’s browse, search and classification tools to analyze some of that data. If you want to dive deeper into the meat of the content, however, you can crowdsource the work by granting other users permissions within the system for building archives, buckets, and datasets. Peers also form the foundation of the coding tools, where you can annotate documents and provide active learning and training for the classification tools.

As a preliminary step to crowd-sourcing your analysis within DiscoverText, you must find and build a peer network. The first step in doing this is to find peers. From the navigation menu, choose “Peers / Requests”. This will show the users in your current peer network. If you click “Find Peers” in the Peer Options on the right, you’ll be able to search for and request peers to be in your network.

DiscoverText has long been able to search by a user’s first or last name in the search box. Recently, the programming team at Texifter added the ability to filter your results by a user’s company, education, or affiliations. This can help you easily build peer networks within your company, school, or other organization.

Of course, this depends on individual users filling in their account information and keeping it up to date, which may raise privacy concerns. To limit who can search for you and see you in the search results, we have an option to not be listed in the search results, as well as our newly added custom peer search filters. To change your privacy settings, go to your account and click on “Privacy Settings”.

The initial screen will allow you to choose what other users can view on your profile. Beneath this is the option to opt out of appearing in searches. Of course, if you opt out, none of your peers on the system will be able to find you and form a peer network with you.

Say, for example, you work at Texifter, and you only want other Texifter staff to be able to search for you and send you peer requests. As a prerequisite, you need to be listed in the search. If you then click on “add or edit custom filters”, you will be able to limit who can find you.

In the “Edit Custom Peer Search Filters” page, you can create filters to limit who can view you in the search results. In the example below, I’ve created a filter so that only other users with an email address that ends in “texifter.com” will be able to find me in the search results.

Basically, you have the choice to allow or deny a user based on specific criteria. At the moment as a pilot, we are limiting the field choices to email and country, but will expand later depending on the amount of use this functionality receives. Once you select your field, you can select the clause (“is”, “ends with”, etc.) and the value for the filter.

When a user searches for peers within the system, your individual filters are applied in order to determine whether or not that user will display you in their search results. We’ve added move up / down buttons on the right of each filter so you can change the order as well.
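To illustrate how an ordered list of allow/deny filters might be evaluated, here is a hypothetical sketch in Python. It is not DiscoverText's actual code. It assumes that the first matching filter decides the outcome and that a searcher who matches no filter is not shown the profile; the names `PeerFilter` and `is_visible_to` and the sample email addresses are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class PeerFilter:
    action: str      # "allow" or "deny"
    field_name: str  # "email" or "country" (the two pilot fields)
    clause: str      # "is" or "ends with"
    value: str

    def matches(self, searcher):
        """Check whether the searching user's profile satisfies this filter."""
        actual = searcher.get(self.field_name, "").lower()
        expected = self.value.lower()
        if self.clause == "is":
            return actual == expected
        if self.clause == "ends with":
            return actual.endswith(expected)
        raise ValueError(f"unsupported clause: {self.clause}")

def is_visible_to(searcher, filters, default=False):
    """Apply filters in order; the first match wins (an assumption)."""
    for peer_filter in filters:
        if peer_filter.matches(searcher):
            return peer_filter.action == "allow"
    # Assumed fallback when no filter matches.
    return default

# Order matters: the specific deny rule must come before the broad allow rule.
filters = [
    PeerFilter("deny", "email", "is", "intern@texifter.com"),
    PeerFilter("allow", "email", "ends with", "texifter.com"),
]

print(is_visible_to({"email": "staff@texifter.com"}, filters))
print(is_visible_to({"email": "intern@texifter.com"}, filters))
```

Under a first-match rule like this one, swapping the two filters would let the broad "ends with" rule catch every address first, which is why being able to move filters up and down matters.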

Let us know what you think about these new features and others, and as always we welcome you to contact us with any questions, issues or other feedback.

Posted in DiscoverText, product | Comments Off

70,000 Game of Thrones Tweets

I wanted to test the new Cloud Explorer developed by Texifter as part of the DiscoverText suite of text analysis tools. Since the revisions to the old tag cloud tool were inspired by comments from a Game of Thrones fan, Dr. R. Charli Carpenter, I decided to run the 70,000 Twitter Tweets I have in my archive through the new tool. After dropping out a few numbers and one offensive term, and changing the colors on a few items, this was my visualization:

Cloud Explorer Visualization of Most Frequent Terms in 70,000 Game of Thrones Tweets.

If, after looking at this visualization, I wanted to know why the term “getglue” was appearing more than 4,000 times (4,537 times to be precise), I could click on the term to search and see that @getglue is the frequently occurring Twitter address of a “social network for entertainment” that I had never heard of.

That is just one reason we call it DiscoverText!

Posted in general | Comments Off

Search All Public Facebook Posts

All this week, we’ve been highlighting some of the new features and functionality recently added to DiscoverText. Tonight, Texifter staff is excited to present the ability to search across all available public Facebook posts and pull those comments into a DiscoverText archive.

If you perform a search in Facebook, you are dynamically shown the people and pages that correspond to your search. If you look towards the bottom of the list, however, you will see a link to “show more results”. Clicking on this will display a full list of the results, as well as a small menu on the left showing you the different types of objects that are in the search result set:

The red circled item above is powerful: it allows you to search across all available public posts where your keyword(s) appear. This is great inside of Facebook, but if you have hundreds or thousands of search results, finding and analyzing all that text can be a daunting task. This is where the DiscoverText archive, filter, search and classify method comes in handy.

In previous posts, we’ve shown you how to scrape Facebook with DiscoverText and how to link an existing DiscoverText account to a Facebook account. Until now, however, we’ve only allowed the scraping of pages, groups and personal wall feeds. Now you have the ability to search across all public Facebook posts and import those comments. This is a big functionality upgrade.

To do this, log into your DiscoverText account using your Facebook credentials (or, link to Facebook from the Data Import screen inside of DiscoverText), and click on the “Facebook” live feed import type:

Note that “free” DiscoverText accounts do not have the ability to import live feeds, so you will see this option only if you have a professional or enterprise DiscoverText account, or if you are still in your 30-day trial.

Next, as you would with any other live feed import, select the archive to import your data into, or, choose to create a new archive:

Press “Continue” when ready, and finally, on the search screen, you will see a new option to search across all public posts:

Pressing the “Search Public Posts” button will query across all publicly available posts on Facebook and add those to your archive. Also, as with any other Live Feed data type in DiscoverText, you will have the option to set scheduled fetches so you can have DiscoverText continuously fetch new data as it becomes available for your query.

Note, however, that this will only search and gather public user posts; it will not search or pull in posts on page or group walls. Unfortunately, this is a limitation of Facebook’s search, and DiscoverText is at the mercy of the Facebook Graph API in this regard.

We hope you enjoy this new functionality and find it useful. As always, for any suggestions, questions, or comments, contact us at any time.

Posted in DiscoverText, product | 3 Comments

“Cloud Explorer” – More Than Just a Tag Cloud

Continuing from last night’s blog post about Texifter’s latest work on the new navigation design and functionality in DiscoverText, tonight I’d like to introduce you to DiscoverText’s Cloud Explorer.

Most traditional tag clouds, such as Wordle and TagCrowd, only give you a static view of the terms in your text. The new and improved DiscoverText Cloud Explorer not only lets you visualize the most frequent terms in your archives, but also lets you customize the terms and term colors and drill down into your archive by searching directly in DiscoverText on the terms.

To create a tag cloud in DiscoverText, first go to your archive details, either by right-clicking on the archive in the navigation tree or by clicking on the “Archive” link in the title when an archive is selected:

Then, in the “Archive Options” on the right side, choose “Generate Tag Cloud”:

The tag cloud generation will start and run asynchronously in the background:

After the tag cloud has completed its creation, you will receive a notification. To view your tag cloud, go back to your archive’s detail page and choose “View Tag Cloud”. Initially, your tag cloud may look a bit plain:

In the upper-right in the Cloud Explorer, you will find a set of controls for changing the number of words in your tag cloud, increasing or decreasing the font size, or increasing or decreasing the variation of the font sizes. In the future, we may add various options for different layouts as well.

Clicking on any term in the cloud will bring up a context menu:

For any word in the tag cloud, you can drill down further and run a dynamic search by choosing “search term” for the highlighted word.

If you are the archive owner, and wish to permanently remove words from the tag cloud, choose “remove from tag cloud permanently”. With this, you can remove very common words such as “the”, “and”, “http”, etc.

If you choose “remove from tag cloud for this session”, then the term will be removed from this particular tag cloud while you are logged into DiscoverText. If you log out and log back in, it will reappear. This is good for performing tag cloud-based triage when trying to analyze your archives to choose keywords without performing any destructive edits.

Finally, again if you are the owner of an archive, you will also see the option to set a term’s color. If you click on the small color swatch to the right, a small color selection popup will be shown (as it is shown in the screenshot above). You can choose a color to assign to the term and click “set term color” to permanently set the word’s color. This is useful for highlighting key terms while performing your analysis.
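For readers curious about what sits behind a tag cloud, the counting step can be sketched in a few lines of Python. This is an illustration only, not how Cloud Explorer is implemented; the function name `top_terms`, the tokenizing pattern, and the sample Tweets are assumptions. The `removed` argument plays the role of the terms you drop from the cloud.

```python
import re
from collections import Counter

def top_terms(texts, limit=5, removed=frozenset()):
    """Count term frequencies across texts, skipping any removed terms."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters, digits, @, # and apostrophes.
        for term in re.findall(r"[a-z0-9@#']+", text.lower()):
            if term not in removed:
                counts[term] += 1
    return counts.most_common(limit)

# Made-up sample data for the example.
tweets = [
    "Watching Game of Thrones with @getglue",
    "Game of Thrones is the best",
    "the finale of Game of Thrones",
]

# Dropping very common words changes which terms rise to the top.
print(top_terms(tweets))
print(top_terms(tweets, removed={"of", "the", "is", "with"}))
```

Because the removed set is passed in per call rather than stored, the sketch behaves like the "for this session" option: the underlying texts are never modified.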

I hope this gives you a bit of an introduction to one of the many new and exciting functions of DiscoverText we’ve been working on. You should also note that DiscoverText’s Cloud Explorer is only available for those users either still within the 30 day trial period, or those with professional or enterprise licenses.

As always, if you have any questions or comments, feel free to contact us.

Posted in DiscoverText, product | 6 Comments

Download the Recent DiscoverText Webinar

For the next 14 days, you can download and replay the entire 145 MB DiscoverText Webinar. Your PC may require a Codec from GoToMeeting to play the file properly.

Many thanks to Stranded Wind for a very generous evaluation!

Posted in general | Comments Off

Next Free DiscoverText Webinar Tuesday 12 PM EST

Register here for the next live, free, interactive training Webinar with Dr. Shulman.

This Webinar introduces new and existing DiscoverText users to the basic document ingest, search & code features, takes your questions, and demonstrates our newest tool, a machine-learning classifier that is currently in beta testing. This is also a chance to preview our “New Navigation” and advanced filters.

DiscoverText’s latest additions to our “Do it Yourself” platform can be easily trained to perform customized mood, sentiment and topic classification. Any custom classification scheme or topic model can be created and implemented by the user. You can also generate tag clouds and drill into the most frequently occurring terms or use advanced search and filters to create “buckets” of text.

The system makes it possible to capture, share and crowdsource text data analysis in novel ways. For example, you can collect text content from Facebook, Twitter & YouTube, as well as other social media or RSS feeds. Dataset owners can assign their “peers” to coding tasks. It is simple to measure the reliability of two or more coders’ choices. A distinctive feature is the ability to adjudicate coder choices for training purposes or to report validity by code, coder or project.

So please join us Tuesday June 7 at 12:00 PM EST (Noon) for an interactive Webinar. Find out why sorting thousands of items from social media, email and electronic document repositories is easier than ever. Participants in the Webinar will be invited to become beta testers of the new classification application.

Posted in general | Comments Off

Dwindling Osama bin Laden Tweets and the RT Champs

The running count in my DiscoverText “bin Laden” project is ~4.5 million unsharable Tweets. Though we can’t share them, we can describe them.

One of the interesting features of this dataset is the rapidly dwindling Tweet rate over the month of May. At the peak, we were looking at hundreds of thousands of Tweets per day, and over the last four days in May the total never crossed 15,000.

If you tag cloud the archive with 2.4 million bin Laden Tweets, this is what you see:

Some of the champions in the Re-Tweet derby (with the # of RTs we captured shown in parentheses):

RT @workforfood: ☑ Saddam Hussein ☑ Osama Bin Laden ☐ Internet Explorer (via @dudup) (2,053)

RT @TititicaReaI: CD DO BIN LADEN: 01 – Grenade ; 02 – Firework ; 03 – Dynamite ; 04 – Airplanes ; 05 – Blow ; 06 – Toxic ; 07 – Party in The USA (2,060)

RT @omgfacts: Bin Laden’s death was announced on May 1st, 2011. Hitler’s death was announced on May 1st, 1945. (2,396)

RT @sickipediabot: So Osama Bin Laden is dead… Amazing what the Americans can do when the Playstation Network is down. (2,512)

RT @jimmyfallon: Buried at sea? Tough year for the ocean-BP, Japan radiation & now “Hey mind if we put Bin Laden in here?” #FallonMono (3,948)

RT @TWlTTERWHALE: Please do not click on any links saying Osama Bin Laden EXECUTION Video! This is a virus that hacks accounts. RETWEET! (4,144)

The four largest RT sets were a single hoax post with a variety of now expired bit.ly URLs pointing to the “server at re-login.twitter.w2c.ru.”

Posted in general | 2 Comments

Connect Existing Facebook & DiscoverText Accounts

Many people have asked us “How do I import Facebook data if I have a regular DiscoverText account?” The short answer used to be that there was no way to pull in Facebook feeds within DiscoverText unless you registered and logged in with the “Connect with Facebook” option. Until now!

We’ve received many requests for this, and I’m happy to announce that it is now possible to link an existing DiscoverText account to a Facebook account to allow the importation of Facebook pages and groups.

Now, there are three ways you can allow your Facebook account access for DiscoverText to import Facebook data on your behalf:

1) The traditional method of registering and authorizing your DiscoverText account via the “Connect with Facebook” button from the home or registration page (see the help wiki page for an overview). Of course, with this method, you’ll be creating a brand new DiscoverText account.

2) If (and only if) the Facebook email address you use to log into Facebook is the same as the email address on file for your DiscoverText account, click on the “Connect with Facebook” button on the home or registration page – this should allow you to directly link your Facebook account to your DiscoverText account.

3) Link an existing DiscoverText account to Facebook from inside DiscoverText. To do this, go to the Import Data / Import Archive page, where you will see a new feature to Connect with Facebook:

Clicking on this button will contact Facebook and ask it to authorize DiscoverText to connect with your Facebook account. If you are not already logged onto Facebook, you will see the familiar Facebook screen:

After logging into Facebook, if this is the first time you’ve allowed DiscoverText to link up with your Facebook account, Facebook will ask you to authorize DiscoverText to access your Facebook account on your behalf. These permissions are crucial to allowing your DiscoverText account to pull in Facebook feeds on your behalf, especially the “Access my data at any time” permission, which allows scheduled feeds to fetch new data for you while you are not logged onto DiscoverText.

Finally, you’ll be sent back to the Import Data page, where you should now be able to access and import Facebook data:

We hope this new feature is helpful for you. As always, if you have any questions or comments, please feel free to contact us!

Posted in DiscoverText, product | 1 Comment

New DiscoverText Import Available: Congressional Bills Via GovTrack

Tonight we’ve added a new import ability to DiscoverText: any user with a Professional or Enterprise license (or the 30-day free trial license) can now directly import data on Federal Congressional bills.

Thanks to the excellent efforts of GovTrack and the Sunlight Foundation, we are able to use the GovTrack search API to let you search for bills and then, using the Sunlight Labs Congress API, pull in the bill summaries and related metadata as new importable documents in DiscoverText.

At the moment, only the summary of the bill text is available to us via the API. We hope that in the near future we will be able not only to import the full text, but also to integrate this new import option as a scheduled feed that automatically pulls in new statuses on bills as they become available.

Here’s a quick primer on the new import functionality:

1) From your data import page for a project, select “GovTrack Data”:

2) Enter your search criteria and select the Congressional Term to search in:

3) From your list of search results, pick and choose which items to add to your “shopping cart” of bills to import into an archive. Also from here, you can go back to the search page to enter a new search to add more items to your “cart”:

4) When you have your items in your cart and are ready to “checkout”, click on the “view selections” link to view your selections. From this view, you can remove any and all items from your selections:

5) Finally – when you have the items ready in your selection “cart”, click the “create archive” link at the top of your selection list to create an archive:

Once you click the “Import” button, sit back and allow DiscoverText to gather your results in the background and create (or add to) an archive.

We hope you enjoy this new import ability and find it useful. Keep a lookout here and on the @DiscoverTextDev and @Texifter Twitter accounts for new and exciting updates to DiscoverText, and as always, feel free to contact us with any questions or comments.

Posted in DiscoverText, product | 2 Comments

Coding Text – Part Three

Researchers interested in large text collections and their itinerant coders tend to muddle through with limited collaborative, cross-disciplinary resources upon which to draw. The generic criteria for high-quality codebook construction and effective coding are underdeveloped, even as the tools and techniques for measuring the limits of manual or machine coding grow ever more sophisticated. In that paradox there may be the seed of a partial solution to some of these issues. The ability to quickly and easily pre-test coding schemes and produce on-the-fly displays of coding inconsistencies is one way to more uniformly train coders to perform reliably (hence usefully) while ensuring a satisfactory level of valid observations. By the same token, the ability to permit an unlimited number of users to review or replicate all the coding and adjudication steps using a free, web-based platform would be a large and bold step onto our methodological and metaphorical bridge.

What are needed are more universal annotation metrics, a standard lexicon, and widely shared, semi-automated coding tools that make the work of humans more useful, fungible, and durable. Ideally, these tools would be interoperable, or combined in a single system. The new system would allow human coders to create annotations and allow other experts to efficiently examine, influence, and validate their work. At a deeper level, this calls for much better and more transparently codified approaches to training and deploying coders—an annotation science subfield—so that a more coherent and collaborative research community can form around this promising methodological domain.

Investigators in the social sciences use reliably coded texts to reach inferences about diverse phenomena. Many forms of public-sphere discourse and governmental records are readily amenable to coding; these include press content, policy documents, speeches, international treaties, and public comments submitted to government decision-makers, among many others.

Systematic analysis of large quantities of these sorts of texts represents an appealing new avenue for both theory building and hypothesis testing. It also represents a bridge across the divide between qualitative and quantitative methodologies in the social sciences. These large text datasets are ripe for mixed-methods work that can provide a rich, data-driven approach both to the macro and micro view of large-scale political phenomena.

Traditionally, social scientists working with text use a variety of qualitative research methods for in-depth case studies. For many legitimate and pragmatic reasons, these studies generally consist of a small number of cases or even just a single case. As Steven Rothman and Ron Mitchell note, the reliability of data drawn from qualitative research comes under greater scrutiny, as increased dataset complexity requires increased interpretation and, subsequently, leads to increased opportunity for error. The case study method is plagued by concerns about limitations on its external validity and the ability to reach generalized inferences. With the proliferation of easily available, large-scale digitized text datasets, an array of new opportunities exists for large-n studies of text-based political phenomena that can yield both qualitative and quantitative findings.

More to the point, high-quality manual annotation opens up the possibility for cross-disciplinary studies featuring collaboration between social and computational scientists. This second opportunity exists because researchers in the computational sciences, particularly those working in text classification, information retrieval (IR), opinion detection, and natural language processing (NLP), hunger for the elusive “gold standard” in manual annotation. Accurate coding with high levels of inter-rater reliability and validity is possible. For example, work by the eRulemaking Research Group on near-duplicate detection in mass e-mail campaigns demonstrated that focusing on a small number of codes, each with a clear-cut rule set, can produce just such a gold standard.
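As one illustration of how inter-rater reliability can be quantified, the sketch below computes Cohen's kappa for two coders who each assigned one code per item. Kappa is a widely used agreement statistic, but it is our choice for the example; the text above does not say which measure the eRulemaking Research Group or DiscoverText uses. The coder data is invented.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who each assigned one code per item."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must code the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items where the two codes match.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's own code frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical code throughout
    return (observed - expected) / (1 - expected)

# Invented codes for eight items.
coder_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
coder_b = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]

print(cohens_kappa(coder_a, coder_b))
```

Here the coders agree on 6 of 8 items (0.75), but with two evenly used codes half of that agreement is expected by chance, so kappa comes out at 0.5. That gap between raw agreement and chance-corrected agreement is why a small number of codes with clear-cut rules matters.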

Reliably coded corpora of sufficient size and containing consistently valid observations are essential to the process of designing and training NLP algorithms. We are likely to see more political scientists using methodologies that combine manual annotation and machine learning. In short, there are exciting possibilities for applied and basic research as techniques and tools emerge for reliably coding across the disciplines. To unleash the potential for this interdisciplinary approach, a research community must now form around the nuts and bolts questions of what and how to annotate, as well as how to train and equip the coders that make this possible.

Posted in general | 1 Comment