Cyril de Kock, NLP Data Scientist
How to train a topic model in an unsupervised manner on a custom dataset.
This blogpost will explain how to use the BigARTM topic modeling library, how to preprocess the data for topic modeling using Spacy and how to calculate coherence using the Gensim package.
The ultimate goal at Findest is to realize ten thousand innovations within ten years. We help other companies and individuals innovate by finding the technologies, materials and knowledge they require. The technology scouts within Findest use IGOR^AI to locate the relevant technologies and deliver the results in a compact overview.
To find the right technologies the scouts have to search through millions of scientific papers, patents and websites. They do this by composing queries that filter the search space based on functions, properties and keywords. A basic example of such a query is:
Functions: Desalinate salt water OR remove salt.
Keywords: seawater, desalination, separation technique, salt, sustainable, membrane, electrodialysis, reverse electrodialysis, reverse osmosis
This query attempts to find technologies that can desalinate salt water. Notice how each of the functions separated by an OR forms an action-object pair. The query is refined by the addition of relevant keywords. The results could be tuned further by adding keywords that the search should or should not match, such as diesel or wastewater. The query and the technologies it found can be examined here.
Such a query will return anywhere from several thousand to tens of thousands of results. This makes it difficult for a user to get a feel for which types of technologies are present and in which fields. The number of search results is too large for a human to process, and users may struggle to come up with additional keywords to filter them. Filtering down more than a hundred documents is an arduous task and could lead to the user missing the right information.
At the Findest development team we devised a topic modeling approach to provide filters to use in search. We hypothesized the modeling should provide two main benefits: letting users filter the search results by topic, and surfacing new keywords with which to refine their queries.
Most topic modeling approaches take a (large) dataset of articles or tweets and attempt to discover narrative structure within the data. This often involves a lot of manual fine-tuning to find the best possible topic distribution; the research done by the Pew Research Center is an example of this. Our use case, however, requires training a new model for each query issued. This brings technical challenges which we will discuss in the following paragraphs.
Since searches can return hundreds of thousands of results, we limit the scope of the topic modeling: we only consider the first thousand hits sorted by relevance. This is a relatively small number of documents for topic modeling, but our users are seeking a few key results among hundreds of technologies, and reducing the scope is essential to finding the correct answers. Furthermore, we restrict ourselves to modeling scientific articles, as these are the most relevant for discovering new technology insights.
Our data consists of the titles and abstracts of the scientific articles retrieved upon issuing the query. An example of a document is the paper “Reverse osmosis in the desalination of brackish water and sea water”. The title and abstract body are concatenated into one text feature per document.
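As a minimal sketch of this step (the paper record below is a placeholder; the abstract text is invented for illustration, not the paper's real abstract):

```python
# Hypothetical paper record; the real abstract is much longer
paper = {
    "title": "Reverse osmosis in the desalination of brackish water and sea water",
    "abstract": "This paper reviews reverse osmosis as a separation technique.",
}

# One concatenated text feature per document
document = paper["title"] + ". " + paper["abstract"]
```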
Our method of choice is the Additive Regularization of Topic Models (ARTM) model from the paper “Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization”. The model is implemented as a Python module, available on GitHub. The library also contains a Latent Dirichlet Allocation (LDA) implementation in case you want to compare how it fares against the ARTM model. The difference between ARTM and probabilistic methods such as LDA is that ARTM does not make any probabilistic assumptions about the data. Instead, it adds weighted sums of regularizers to the optimization criterion, each differentiable with respect to the model parameters.
ARTM only accepts datasets in UCI Bag of Words (BOW) or Vowpal Wabbit format, and thus requires any tokenization and preprocessing to be completed beforehand. A convenient way of getting the data into the right format is the CountVectorizer class of scikit-learn, which produces a feature matrix in BOW form that is perfectly suited for use by ARTM.
Every element is preprocessed so that only topically relevant information remains.
We also remove a total of 1,192 stopwords from the text. This list combines scikit-learn’s English stopword list, the list of scientific stopwords curated by Seinecle and a number of stopwords we added ourselves. Lastly, we exclude any terms that occur in more than 90% of documents, as well as those that occur in less than 1% of documents.
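A sketch of how such a combined list can be assembled, using scikit-learn's built-in ENGLISH_STOP_WORDS frozenset; the scientific and custom entries shown here are small placeholders for the real curated lists:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Small placeholders for Seinecle's scientific list and our own additions;
# the real combined list contains 1,192 words
scientific_stopwords = {"et", "al", "fig", "table", "method", "results"}
custom_stopwords = {"technology", "approach"}

stopwords = set(ENGLISH_STOP_WORDS) | scientific_stopwords | custom_stopwords
```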
To preprocess the data efficiently, we first run all documents through Spacy’s pipe functionality and subsequently tokenize them with our custom preprocessor function. We provide this custom preprocessor to the CountVectorizer to ensure the object works well with Spacy’s output. The code shows the functions used to execute all of these steps:
Training the model
Now that we have our preprocessed data in BOW format, we can almost train our topic model. Initializing an ARTM model requires the data to be divided into batches and requires the user to specify the desired number of topics. Neither is desirable for live topic modeling, and later we will explain how we tackled this problem. For now, the ARTM model is initialized as follows:
The model is initialized with a few regularizers we estimate suit our purposes. You can find all regularizers and their descriptions in the ARTM documentation. We also add the TopTokensScore as a metric to keep track of, as it can later be used to retrieve topic representations. Here topic_names is a list of names, each representing one topic; we simply went with topic1, topic2, etc. The dictionary is the dataset dictionary as produced by the BatchVectorizer object of the ARTM package, which conveniently splits your data into batches for the ARTM model to learn from. Example usage can be seen below:
Here data is the result of the earlier preprocessing and cv is the CountVectorizer object we used to transform our data. Now that we have our model we can train it with the following one-liner:
The topic representations generated by the trained model can easily be retrieved using the TopTokensScore metric we added earlier. This will give us the five most representative words for each of the topics generated by the model, as illustrated in the following code fragment:
The resulting topic representations are:
We can see clear distinctions between the representations. Each topic focuses on a specific niche and there is no overlap in words between the topics. Attentive readers may have noticed that we never specified the number of topics we wanted the model to converge to. That is because we determine the number of topics automatically. The next section explains how.
Estimating N topics
To determine the optimal number of topics to initialize the model with, we iterate over a range of values and pick the N that yields the highest coherence. Coherence is a metric that estimates how interpretable topics are based on how well they are supported by the data. It is calculated from how often words appear together in the corpus and how they appear in the topic representations. We use the Gensim implementation of coherence, which requires the topic representations (e.g. topic 0 is represented by the words ‘osmosis’, ‘reverse’, ‘reverse osmosis’, ‘treatment’, ‘osmosis membrane’) and the tokenized documents in the corpus. Our CountVectorizer only gives us the BOW matrix and does not save the tokenized documents. Luckily, we can retrieve the tokenizer function used by the vectorizer and use it to tokenize our data once again, as follows:
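One way to do this is scikit-learn's build_analyzer(), which returns the full preprocessing-plus-tokenization callable the vectorizer applies internally. A small self-contained example with a default CountVectorizer (the toy documents are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Reverse osmosis membrane desalination",
        "Electrodialysis salt separation technique"]
cv = CountVectorizer()
cv.fit(docs)

# The analyzer reproduces the vectorizer's own preprocessing + tokenization
analyzer = cv.build_analyzer()
texts = [analyzer(doc) for doc in docs]
# texts[0] == ['reverse', 'osmosis', 'membrane', 'desalination']
```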
We can calculate the Coherence as shown:
The function returns a value between zero and one, where higher values indicate stronger coherence, which in turn indicates a better model. Now that we have a method to determine the best model, we can iteratively decide which number of topics best suits our purposes. This is done by simply looping over a range of topic counts while keeping track of the coherence score of the best-performing count.
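The selection loop can be sketched as a small helper, where train_and_score is a hypothetical hook that trains an ARTM model with n topics and returns its coherence:

```python
def best_num_topics(candidates, train_and_score):
    """Return the topic count (and its score) with the highest coherence.

    `train_and_score` is a hypothetical hook: given a topic count n, it
    trains an ARTM model with n topics and returns its coherence score.
    """
    best_n, best_score = None, float("-inf")
    for n in candidates:
        score = train_and_score(n)
        if score > best_score:  # keep track of the best-performing count
            best_n, best_score = n, score
    return best_n, best_score
```

For example, best_num_topics(range(5, 30, 5), score_fn) would train five models and keep the topic count whose model scored the highest coherence.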
The process of automatically determining the number of topics is likely less optimal than tweaking the results under human supervision. However, the described approach does allow us to perform fast, unsupervised topic modeling of the scientific abstracts returned by a query to IGOR^AI. In turn, this allows users of the technology scouting service to filter the search results and to learn new keywords from the generated topics, resulting in a better search experience and swifter discovery of technologies. We hope this blogpost was both clear and instructive and helps you develop your own topic modeling approach.