Before the groundbreaking research of Blei, Ng, and Jordan, presented in a 2002 paper, latent Dirichlet allocation (LDA) was underdeveloped and far from commercial use. LDA, a powerful statistical learning algorithm, is a generative model that explains sets of observations through unobserved groups, and those groups account for why some parts of the data are similar. Recently, the DiscoverText developers engineered a topic modeling and clustering system using LDA techniques. Developing and adapting this exciting technology for expanded use is an integral part of the future of the DiscoverText text analysis toolkit.
Here is an example of an LDA model. A user might request the creation of two topics, which a viewer could label HOT and COLD. The labels are arbitrary: the algorithm only groups words and does not name the topics it finds. Each topic assigns a probability to every word. One topic might give high probability to words such as sun, summer, and Florida, which the viewer can interpret as “HOT”. Naturally, the word hot itself will have high probability under this topic. The “COLD” topic likewise assigns a probability to each word: snow and blizzard might have high probability. Unlike the collection, coding, and classification of data typically undertaken by the staff at DiscoverText, developing a topic model with the LDA algorithm in DiscoverText requires no human interaction beyond specifying the number of topics the algorithm should develop. Every document in an archive is assigned a score for how well it fits each topic. From these scores, DiscoverText and the SIFTER™ Natural Language Processing modules work their magic to group the documents into a set of clusters, based on how similar each document is to the others in the same cluster.
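To make the HOT and COLD example concrete, here is a minimal sketch in Python. It is not the DiscoverText or SIFTER implementation, which is not described here. It fits LDA with collapsed Gibbs sampling, one common fitting method chosen for brevity. The function name `fit_lda`, the toy documents, and the hyperparameters `alpha` and `beta` are all assumptions made for illustration. The only thing the caller decides is the number of topics, and each document comes back with a score per topic.

```python
import random

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Hypothetical helper: fit LDA by collapsed Gibbs sampling.

    Returns (theta, phi, vocab): per-document topic scores,
    per-topic word probabilities, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]        # words in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(num_topics)]   # times word v is assigned to topic k
    topic_total = [0] * num_topics                      # total words assigned to topic k
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Repeatedly resample each word's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Turn the smoothed counts into probabilities.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][v] + beta) / (topic_total[t] + V * beta) for v in range(V)]
        for t in range(num_topics)
    ]
    return theta, phi, vocab

# Toy corpus (invented for illustration only).
docs = [
    "sun summer florida hot beach sun".split(),
    "hot summer sun florida".split(),
    "snow blizzard cold winter snow".split(),
    "cold blizzard winter snow ice".split(),
]
theta, phi, vocab = fit_lda(docs, num_topics=2)

# Simplest possible grouping: put each document in its best-scoring topic.
clusters = [max(range(2), key=lambda t: scores[t]) for scores in theta]

for t in range(2):
    top = sorted(range(len(vocab)), key=lambda v: phi[t][v], reverse=True)[:3]
    print(f"topic {t}:", [vocab[v] for v in top])
print("cluster per document:", clusters)
```

The topics come back numbered, not named. Deciding that one of them means “HOT” is still up to the person reading the top words. Grouping by best-scoring topic is only the simplest way to cluster on these scores; a production system may well do something more elaborate.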
The topic modeling and clustering algorithms being engineered by Texifter personnel were inspired by a client that wanted to build a comprehensive topic model but did not know where to start. After we built the customer a topic model by hand, we began experimenting with creating one using DiscoverText’s LDA modeling and clustering, which is currently in alpha. This automated processing produced a topic model with eight topics, all of which gave significant insight into the customer’s business. For example, one of the clusters returned the keywords: people, crew, enjoy, fun, culture, environment, coworkers, time, meet, and team. These words, brought together without any human coding, are noticeably similar and fit well into what we called Topic 2, which could be named “Culture, Environment and Coworkers.”
Once the topics are named, DiscoverText allows a classifier to be built around them, so that an entire dataset can be classified according to the topic model developed by the LDA-based clustering. DiscoverText’s Automated Topic Clustering tries to find the best fit for even coverage across all topics found in the corpus. So far, the topic model classifier has yielded promising results. In the future, DiscoverText will allow the user to reassign documents to topics and update the underlying model. The generative model can add new documents and infer their topics based on the existing model, or update the underlying model with the new data. Look for the LDA modeling and clustering processes to be in beta by the beginning of fall. If you have any questions or comments about DiscoverText or about using DiscoverText’s LDA Topic Model Platform, please email any of the knowledgeable DiscoverText staff.