
Findest blog

Modeling Scientific Literature with BigARTM

6/7/2022


 

Author

Cyril de Kock, NLP Data Scientist

How to train a topic model in an unsupervised manner on a custom dataset.

This blog post explains how to use the BigARTM topic modeling library, how to preprocess data for topic modeling with spaCy, and how to calculate coherence with the Gensim package.


Introduction
The ultimate goal at Findest [1] is to realize ten thousand innovations within ten years. We help other companies and individuals innovate by finding technologies, materials and knowledge that they require. The technology scouts within Findest utilize IGOR^AI to locate the relevant technologies and deliver the results in a compact overview.

To find the right technologies, the scouts have to search through millions of scientific papers, patents and websites. They do this by composing queries that filter the search space based on functions, properties and keywords. A basic example of such a query is:

Functions: Desalinate salt water OR remove salt.

Keywords: seawater, desalination, separation technique, salt, sustainable, membrane, electrodialysis, reverse electrodialysis, reverse osmosis
This query attempts to find technologies that can desalinate salt water. Notice how each function separated by an OR forms an action-object pair. The query is refined by the addition of relevant keywords, and the results could be tuned further by adding keywords that the search should or should not match, such as diesel or wastewater.

Problem statement
A query like this will return anywhere from several thousand to tens of thousands of results. This makes it difficult for a user to get a feel for what types of technologies are present and in which fields. The number of search results is too large for a human to process, and users may struggle to come up with additional keywords to filter them. Sifting through more than a hundred documents is an arduous task and could lead to the user missing the right information.
The Findest development team therefore devised a topic modeling approach to provide filters to use in search. We hypothesized the modeling should provide the following benefits:

  • the topics structure the search results,
  • the topics can be used to filter the documents,
  • users can extract additional keywords from the topic representations.

Most topic modeling approaches take a (large) dataset of articles or tweets and attempt to discover narrative structure within the data. This often involves a lot of manual fine-tuning to discover the best possible topic distribution; an example is the research done by the Pew Research Center [2]. Our use case, however, requires training a new model for each query issued. This brings technical challenges which we will discuss in the following paragraphs.

Data
Since searches can return up to hundreds of thousands of results, we limit the scope of the topic modeling to the first thousand hits sorted by relevance. This is a relatively small number of documents for topic modeling. However, our users are seeking a few key results among hundreds of technologies, and reducing the scope is essential to finding the correct answers. Furthermore, we restrict ourselves to modeling scientific articles, as these are the most relevant for finding new technology insights.
Our data consists of the titles and abstracts of the scientific articles retrieved upon issuing the query. An example of a document is the paper “Reverse osmosis in the desalination of brackish water and sea water [3]”. The title and abstract body are concatenated into one feature per document.

Topic model
Our method of choice is the Additive Regularization of Topic Models (ARTM) model from the paper “Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization” [4]. The model has an implementation on GitHub as a Python module [5]. The library also provides a Latent Dirichlet Allocation [6] implementation in case you want to compare how it fares against the ARTM model. The difference is that ARTM does not make any probabilistic assumptions about the data. Instead, it adds weighted sums of regularizers, which are differentiable with respect to the model, to the optimization criterion.

Preprocessing
ARTM only accepts datasets in UCI Bag of Words (BOW) [7] or Vowpal Wabbit [8] format, and thus requires any tokenization and preprocessing to be completed beforehand. A convenient way of getting the data into the right format is to use the CountVectorizer class from scikit-learn. The vectorizer produces a feature matrix in BOW form, which is perfectly suited for use by ARTM.
Every document is preprocessed to include only topically relevant information. This includes:

  • spaCy tokenization and lemmatization,
  • removing punctuation and digit tokens,
  • removing tokens that are shorter than three characters,
  • lowercasing the text.

We also remove a total of 1,192 stopwords from the text. This list is composed of Scikit-learn’s English stopword list [9], the list of scientific stopwords curated by Seinecle [10] and a list of stopwords we added ourselves. Lastly, we exclude any terms that occur in more than 90% of documents as well as those that occur in less than 1% of documents.

To preprocess the data efficiently, we start by putting all documents through spaCy's pipe functionality and subsequently tokenize the data with our custom preprocessor function. We provide this custom preprocessor function to the CountVectorizer to ensure it works well with the output of spaCy.
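A minimal sketch of such a preprocessor, assuming spaCy Docs as input. The filters mirror the list above, but the function and variable names are our illustration, not the original implementation:

```python
def preprocess(doc):
    """Turn a spaCy Doc into a cleaned string: lemmatize, lowercase, and
    drop punctuation tokens, digit tokens and tokens under three characters."""
    return " ".join(
        tok.lemma_.lower()
        for tok in doc
        if not tok.is_punct and not tok.is_digit and len(tok.lemma_) >= 3
    )

# Sketch of the surrounding pipeline (assumes spaCy plus a model such as
# en_core_web_sm, and a stopword_list built as described above):
#   import spacy
#   from sklearn.feature_extraction.text import CountVectorizer
#   nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
#   docs = list(nlp.pipe(titles_and_abstracts))
#   cv = CountVectorizer(preprocessor=preprocess, stop_words=stopword_list,
#                        max_df=0.90, min_df=0.01)
#   data = cv.fit_transform(docs)
```

Passing preprocess as the vectorizer's preprocessor means CountVectorizer receives spaCy Docs and turns each into a plain string before its own tokenization runs.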
Training the model
Now that we have our preprocessed data in BOW format, we can almost train our topic model. Initializing an ARTM model requires the data to be divided into batches and requires the user to specify the desired number of topics. This is not desirable for live topic modeling, and later we will explain how we tackled this problem. For now, the ARTM model is initialized with a fixed number of topics.
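A sketch of that initialization, assuming the bigartm package is installed. The choice of regularizers and their tau weights below is illustrative, not the production configuration:

```python
def topic_names(n):
    """Generate placeholder names: topic1, topic2, ..."""
    return [f"topic{i}" for i in range(1, n + 1)]

def build_model(n_topics, dictionary, num_tokens=5):
    """Initialize an ARTM model with regularizers and a TopTokensScore."""
    import artm  # deferred so this module also loads without bigartm installed

    return artm.ARTM(
        topic_names=topic_names(n_topics),
        dictionary=dictionary,
        regularizers=[
            # tau values here are guesses; see the ARTM docs [11] for tuning
            artm.SmoothSparsePhiRegularizer(name="sparse_phi", tau=-0.1),
            artm.SmoothSparseThetaRegularizer(name="sparse_theta", tau=-0.5),
            artm.DecorrelatorPhiRegularizer(name="decorrelator_phi", tau=1e5),
        ],
        scores=[artm.TopTokensScore(name="top_tokens", num_tokens=num_tokens)],
    )
```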
The model is initialized with a few regularizers we estimate suit our purposes. You can find all regularizers and their descriptions in the ARTM documentation [11]. We also add the TopTokensScore as a metric to keep track of, as it can later be used to retrieve topic representations. Here topic_names is a list of names, each representing one topic; we simply went with topic1, topic2, etc. The dictionary is the dataset dictionary as produced by the BatchVectorizer object of the ARTM package, which conveniently splits your data into batches for the ARTM model to learn from.
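A sketch of that conversion; the bow_n_wd data format expects a (words x documents) count matrix plus an index-to-word vocabulary, so we transpose the CountVectorizer output. The helper name is our own:

```python
def make_batch_vectorizer(data, cv):
    """Wrap a CountVectorizer's sparse document-term matrix in an ARTM
    BatchVectorizer. `data` is the output of cv.fit_transform(...)."""
    import artm  # deferred so this module also loads without bigartm installed

    n_wd = data.T.toarray()  # ARTM wants words as rows, documents as columns
    vocabulary = dict(enumerate(cv.get_feature_names_out()))
    return artm.BatchVectorizer(data_format="bow_n_wd", n_wd=n_wd,
                                vocabulary=vocabulary)
```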
Here data is the result of the earlier preprocessing and cv is the CountVectorizer object we used to transform our data. Now that we have our model, training it comes down to a single call.
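In sketch form, with the pass count as an assumption:

```python
def train(model, batch_vectorizer, passes=20):
    """Fit the ARTM model offline; `passes` full sweeps over all batches."""
    model.fit_offline(batch_vectorizer=batch_vectorizer,
                      num_collection_passes=passes)
    return model
```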
The topic representations generated by the trained model can easily be retrieved using the TopTokensScore metric we added earlier. This gives us the five most representative words for each of the topics generated by the model.
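Reading the representations back out of the score tracker can look like this, assuming the score name matches the one registered at initialization:

```python
def top_tokens(model, score_name="top_tokens"):
    """Return {topic_name: [most representative tokens]} from the
    TopTokensScore collected during training."""
    return model.score_tracker[score_name].last_tokens
```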
The resulting topic representations are:
  • Topic 0: [‘osmosis’, ‘reverse’, ‘reverse osmosis’, ‘treatment’, ‘osmosis membrane’]
  • Topic 1: [‘pressure’, ‘fresh’, ‘fresh water’, ‘tank’, ‘pipe’]
  • Topic 2: [‘plant’, ‘technology’, ‘distillation’, ‘brine’, ‘performance’]
  • Topic 3: [‘salt water’, ‘concentrated’, ‘carbon’, ‘step’, ‘desalinated’]
  • Topic 4: [‘temperature’, ‘pump’, ‘steam’, ‘utilize’, ‘gas’]
  • Topic 5: [‘power’, ‘high’, ‘brackish’, ‘sea’, ‘brackish water’]
  • Topic 6: [‘desalination device’, ‘filter’, ‘structure’, ‘body’, ‘seawater desalination device’]
  • Topic 7: [‘ion’, ‘concentration’, ‘cell’, ‘material’, ‘solution’]

We can see clear distinctions between the representations. Each topic seems to focus on a specific niche, and there is no overlap in words between the topics. Attentive readers may have noticed that we never specified the number of topics we wanted the model to converge to. That is because we determine the number of topics automatically; the next section explains how.

Estimating N topics

To determine the optimal number of topics to initialize the model with, we iterate over a range of values and pick the N that yields the highest coherence. Coherence is a metric that estimates how interpretable topics are, based on how well they are supported by the data. It is calculated from how often words appear together in the corpus and how they appear in the topic representations. We use the Gensim implementation of coherence [12], which requires the topic representations (e.g. Topic 0 is represented by the words ‘osmosis’, ‘reverse’, ‘reverse osmosis’, ‘treatment’, ‘osmosis membrane’) and the tokenized documents in the corpus. Our CountVectorizer only gives us the BOW and does not save the tokenized documents. Luckily, we can retrieve the tokenizer function used by the vectorizer and use it to tokenize our data once again.
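One way to do that re-tokenization is through the vectorizer's analyzer, which applies the same preprocessing, tokenization, n-gram and stop-word settings, so the resulting tokens line up with the model's vocabulary (the helper name is our own):

```python
def tokenized_texts(cv, docs):
    """Re-tokenize documents with the CountVectorizer's own analysis chain,
    producing the token lists Gensim's CoherenceModel expects."""
    analyze = cv.build_analyzer()
    return [analyze(doc) for doc in docs]
```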
With the tokenized texts and the topic representations in hand, we can calculate the coherence.
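A sketch of that calculation with Gensim's CoherenceModel. We use the c_v measure here, which matches the zero-to-one range described below, though the post does not name the measure explicitly:

```python
def coherence_score(topics, texts):
    """Coherence of the topic representations against the tokenized corpus;
    higher values mean more interpretable topics."""
    from gensim.corpora import Dictionary        # deferred imports so the
    from gensim.models import CoherenceModel     # module loads without gensim

    cm = CoherenceModel(topics=topics, texts=texts,
                        dictionary=Dictionary(texts), coherence="c_v")
    return cm.get_coherence()
```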
The function returns a value between zero and one, where higher values indicate stronger coherence, which in turn indicates a better model. Now that we have a way to compare models, we can iteratively decide which number of topics best suits our purposes. This is done by simply looping over a range of topic counts while keeping track of the coherence score of the best-performing count.
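The selection loop itself is plain Python. Here train_and_score is a hypothetical callable that trains a model with n topics (e.g. by chaining the initialization, training and coherence steps described above) and returns the model together with its coherence score:

```python
def best_n_topics(candidate_ns, train_and_score):
    """Train a model per candidate topic count and keep the most coherent.
    `train_and_score(n)` must return a (model, coherence) pair."""
    best_model, best_n, best_score = None, None, float("-inf")
    for n in candidate_ns:
        model, score = train_and_score(n)
        if score > best_score:
            best_model, best_n, best_score = model, n, score
    return best_model, best_n, best_score
```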
Concluding remarks
Automatically determining the number of topics is likely less optimal than tweaking the results under human supervision. However, the described approach does allow us to perform fast, unsupervised topic modeling of the scientific abstracts returned by a query to IGOR^AI. In turn, this allows users of the technology scouting service to filter the search results and to learn new keywords from the generated topics, resulting in a better search experience and swifter discovery of technologies. We hope this blog post was clear and instructive and helps you develop your own topic modeling approach.


References
[1] https://www.findest.eu/
[2] https://medium.com/pew-research-center-decoded/interpreting-and-validating-topic-models-ff8f67e07a32
[3] https://www.sciencedirect.com/science/article/abs/pii/S0011916400886385
[4] https://link.springer.com/chapter/10.1007/978-3-319-12580-0_3
[5] https://github.com/bigartm/bigartm
[6] https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
[7] https://archive.ics.uci.edu/ml/datasets/bag+of+words
[8] https://vowpalwabbit.org/
[9] https://scikit-learn.org/stable/
[10] https://github.com/seinecle/Stopwords
[11] https://bigartm.readthedocs.io/en/stable/tutorials/regularizers_descr.html
[12] https://radimrehurek.com/gensim/models/coherencemodel.html

The Yellow Brick Road to Innovation: Technology Scouting

1/31/2022


 
Reading time: 3 minutes ⏱ 

For many companies, solving R&D challenges can reach a dead end when industry tunnel vision proves to be the decisive hurdle. Tried-and-tested technologies no longer meet innovation requirements, and new technologies are needed to remain competitive. Technology Scouting helps look beyond traditional means to gain an innovative and competitive advantage.

What is Technology Scouting?
Technology Scouting is an umbrella term for identifying technologies by gathering information from various sources (scientific papers, patents, websites, etc.). It is part of an innovation roadmap, performed by professionals to bring broader insight to a technical challenge and to remain at the forefront of innovation with existing and new technologies.

Typically, Technology Scouting manifests itself in the early stages of an innovation roadmap, from scoping the challenge at hand to developing and acquiring solutions for it. Typical examples of an innovation roadmap may include new product development or identifying emerging technologies and technical developments in the state of the art.
 

Why use Technology Scouting?
Technology Scouting helps zoom out of the typical solution space and view the technical challenge from a wider perspective. In particular, sources within and outside the respective industry are explored, whereby existing solutions that may seem non-obvious or unheard-of at first can be investigated to see if they can lead to an innovation.

Furthermore, Technology Scouting supports efficient and effective innovation management. This includes risk planning, decision making, and identifying long-term technological trends in the industry. It can also feed competitive intelligence, keeping tabs on competitors’ innovation projects that might prove a threat to future growth.

What’s the market for Technology Scouting?
Given the drive to innovate more quickly and effectively, it is no surprise to see huge investment in Technology Scouting. Many multinational companies, such as BMW and BAE Systems, already have dedicated departments to foster emerging technologies and evaluate which ones have the potential to bring innovation.

At present, web-based software is the most popular tool for Technology Scouting, taking a 70% share of the market according to a market research report [1]. Growth isn’t expected to stagnate any time soon either: the global Technology Scouting software market is forecast to grow substantially by 2026.
 

Conclusion
Technology Scouting proves to be a key part of an innovation roadmap. It helps identify existing and emerging technologies for innovation projects and supports strategic planning and competitive intelligence. With the market for Technology Scouting software predicted to boom over the next few years, it seems the appetite to hunt for innovation has only just begun.

In the next blog post, we will discuss Technology Scouting tools.
Sources
[1] http://www.precisionreports.co/global-technology-scouting-software-market-18819976

Authors

Cyril de Kock, NLP Data Scientist &
Lucas Tang, Technology Scout
