Topic Modeling

📄 Peter Sachs Collopy, Topic Modeling, February 1, 2014.

In 2014, I did some work as a research assistant to Etienne Benson exploring how we might use software to interpret changing discourses about the environment. Etienne ultimately didn’t use these digital humanities techniques in his book on the topic, Surroundings: A History of Environments and Environmentalisms, but in the process I wrote this report on the potential of topic modeling specifically.

• • •

Although it has a longer history, topic modeling is now typically done using Latent Dirichlet Allocation, a hierarchical Bayesian model developed by computer scientist David Blei and his colleagues (initial article, technical description, less technical description). The premise, it’s worth stating explicitly, is a rather odd model of composition. “Documents are mixtures of topics,” explains one article, “where a topic is a probability distribution over words. A topic model is a generative model for documents: it specifies a simple probabilistic procedure by which documents can be generated. To make a new document, one chooses a distribution over topics. Then, for each word in that document, one chooses a topic at random according to this distribution, and draws a word from that topic. Standard statistical techniques can be used to invert this process, inferring the set of topics that were responsible for generating a collection of documents.” So topic modeling involves reverse engineering the creation of a text, which is conceptualized as a process of synthesizing multiple topics into a single document. As far as the computer can understand them, though, these “topics” are simply sets of words with associated probabilities. Because topics can be conceptualized as vectors, it is also relatively easy to measure correlations between words or correlations between topics.
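To make the generative story concrete, here is a minimal Python sketch of the procedure the quotation describes. It is an illustration only, not how MALLET or any other package is implemented: the two toy topics, their word probabilities, the Dirichlet parameters, and the function names are all invented for the example. The last function compares topics as vectors using cosine similarity, which stands in here for the correlation measures mentioned above.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document by the procedure quoted above.

    topics: list of dicts mapping word -> probability (hypothetical toy topics)
    alpha:  Dirichlet parameters, one per topic (assumed values)
    """
    # 1. Choose a distribution over topics for this document.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Choose a topic at random according to that distribution.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Draw a word from the chosen topic.
        vocab = list(topics[k])
        probs = list(topics[k].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

def topic_cosine(topic_a, topic_b):
    """Cosine similarity between two topics treated as word-probability vectors."""
    vocab = set(topic_a) | set(topic_b)
    dot = sum(topic_a.get(w, 0.0) * topic_b.get(w, 0.0) for w in vocab)
    norm_a = math.sqrt(sum(p * p for p in topic_a.values()))
    norm_b = math.sqrt(sum(p * p for p in topic_b.values()))
    return dot / (norm_a * norm_b)

# Toy topics, invented for illustration only.
TOPICS = [
    {"forest": 0.5, "river": 0.3, "wildlife": 0.2},
    {"pollution": 0.6, "industry": 0.3, "river": 0.1},
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
theta, words = generate_document(TOPICS, alpha=[0.5, 0.5], length=10, rng=rng)
print(theta)   # this document's mixture of the two topics
print(words)   # ten words drawn from that mixture
print(topic_cosine(TOPICS[0], TOPICS[1]))  # overlap comes only from "river"
```

Fitting a topic model runs this process in reverse: given only the words, it estimates both the topics and each document's mixture of them.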

Among digital humanists, the most widely used tool for topic modeling is MALLET (tutorial), “a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.” There are several alternatives, though, including libraries for C, Java, Python, MATLAB, and R (topicmodels, lda) as well as two Java applications: Topic Modeling Tool provides a graphical interface for MALLET (tutorial), while the Stanford Topic Modeling Toolbox is scriptable using Scala and processes CSV files. The Maryland Institute for Technology in the Humanities has produced some additional utilities for MALLET. Perhaps the easiest tool available is the Zotero extension Paper Machines, which incorporates MALLET. Paper Machines can model not only Zotero collections, but also JSTOR Data for Research CSV files, which otherwise require some preprocessing to be legible to MALLET. Its output doesn’t seem very versatile, though.

There are several precedents for tracking the rise and fall of topics over time, though few seem to have contributed to peer-reviewed historical articles. A project by historian Robert Nelson on the Richmond Daily Dispatch from 1860 to 1865, conducted using MALLET, produced a widely cited web presentation, and computer scientist David Newman and historian Sharon Block conducted a similar study of the Pennsylvania Gazette from 1728 to 1800. (Block and Newman also published a historical article on trends in the field of women’s history from 1985 to 2005 for which they used topic modeling to compare the prominence of research topics, but its diachronic analysis is all based on text-mining.) Historian Cameron Blevins has diachronically topic modeled Martha Ballard’s diary. Among several studies of academic journals by computer scientists, David Mimno’s work on the field of classics stands out for its careful methodology.

Perhaps the best model for using topic modeling in historical research is literature scholar Allen Riddell’s apparently unpublished paper on the history of German Studies, which contains a particularly clear technical explanation as well as a study of the rise and fall of topics ranging from gender to Goethe. Similarly, but in published articles, folklorist John Laudun and literature scholar Jonathan Goodwin used topic modeling to illuminate the dynamics of folklore studies’ turn toward performance, while Andrew Goldstone and Ted Underwood not only modeled the topics of the Publications of the Modern Language Association, but also produced network visualizations of the relationships between the topics.
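One simple way to chart the rise and fall of topics, once a model has assigned each document its topic proportions, is to average those proportions over the documents from each year. The Python sketch below shows that step only. It is not taken from any of the studies above, and the years, proportions, and function name are made up for illustration; in practice the proportions would come from a tool's per-document output.

```python
from collections import defaultdict

def topic_prevalence_by_year(docs):
    """Average each topic's share across the documents of each year.

    docs: iterable of (year, proportions) pairs, where proportions is the
    list of topic shares a topic model assigned to one document.
    """
    sums = {}
    counts = defaultdict(int)
    for year, props in docs:
        if year not in sums:
            sums[year] = [0.0] * len(props)
        for k, p in enumerate(props):
            sums[year][k] += p
        counts[year] += 1
    # Divide each year's running totals by that year's document count.
    return {year: [s / counts[year] for s in sums[year]] for year in sorted(sums)}

# Hypothetical documents: (year, [share of topic 0, share of topic 1]).
DOCS = [
    (1900, [0.75, 0.25]),
    (1900, [0.25, 0.75]),
    (1901, [0.5, 0.5]),
    (1901, [1.0, 0.0]),
]

trend = topic_prevalence_by_year(DOCS)
print(trend)  # {1900: [0.5, 0.5], 1901: [0.75, 0.25]}
```

In this approach the topics themselves stay fixed across the whole corpus, and only their prevalence changes from year to year.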

It is also possible to produce models in which topics themselves evolve over time, changing their word composition. As a demonstration of dynamic topic models, Blei and fellow computer scientist John Lafferty produced a browsable year-by-year model of Science. Blei has also published, with computer scientist Sean M. Gerrish, a study attempting to use such models to measure the influence of scientific publications without resort to citations or other conventional bibliometric tools. Similarly, computer scientists Xuerui Wang and Andrew McCallum (the latter the initial developer of MALLET) developed a model called Topics over Time, which they tested on Presidential State of the Union Addresses and proceedings of the Neural Information Processing Systems conference. Very few humanists—possibly only Laudun and Goodwin—have used these tools.
