The following interactive visualization was obtained by running an LDA algorithm over a corpus of abstracts selected from 2.2 million scientific papers. An automated data cleaning/munging phase reduced the corpus to 3000 documents, all of which focus on cancer and air pollution. The goal of applying this topic modelling algorithm is to extract latent subjects of interest from raw text.
The LDA algorithm is unsupervised, so its only inputs are the text of the abstracts and the number of topics to discover.
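To make those two inputs concrete, here is a minimal sketch of LDA fitted with a collapsed Gibbs sampler in plain Python. It is a toy illustration, not the implementation used for this project: the function names `fit_lda` and `top_words`, the hyperparameters `alpha` and `beta`, the iteration count, and the four tiny documents are all assumptions made for the example.

```python
import random


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs is a list of token lists; num_topics is the number of topics to find.
    Returns one {word: probability} dictionary per topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Resample its topic given all the other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed estimate of each topic's distribution over words.
    return [
        {
            vocab[v]: (topic_word[k][v] + beta) / (topic_total[k] + vocab_size * beta)
            for v in range(vocab_size)
        }
        for k in range(num_topics)
    ]


def top_words(topic, n=3):
    """Return the n most probable words of one topic."""
    return sorted(topic, key=topic.get, reverse=True)[:n]


# Made-up, pre-tokenized documents standing in for the abstracts.
docs = [
    "lung cancer tumor cells cancer".split(),
    "tumor growth cancer cells".split(),
    "air pollution ozone particles pollution".split(),
    "ozone exposure air particles".split(),
]
topics = fit_lda(docs, num_topics=2)
for k, topic in enumerate(topics):
    print(k, top_words(topic))
```

No labels are passed anywhere: the sampler sees only the tokens and `num_topics`, which is what unsupervised means here. On a real corpus, a library implementation would be the practical choice.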
The diagram on the left is a projection of all the topics that the algorithm found. Projecting them into a two-dimensional space ties the semantic similarity between two topics to their closeness in the diagram: the nearer two topics are, the more similar they are. Clicking on a topic displays, in a horizontal bar chart on the right, the 30 words that are most representative of it. The red part of each bar is the frequency of the word within the topic; the blue part is the overall frequency of the word in the corpus. You can also adjust the relevance parameter λ with the slider in the upper right: it sets the balance between the word’s probability in the topic and its marginal probability across the whole corpus.
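The effect of λ can be sketched in a few lines of Python. This assumes the slider uses the relevance definition from the LDAvis tool, relevance(w, t) = λ log p(w | t) + (1 - λ) log(p(w | t) / p(w)), which the page does not state explicitly. The function names and the probabilities below are made up for illustration.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a word to a topic (assumed LDAvis definition)."""
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(
        p_word_given_topic / p_word
    )


def rank_terms(topic_probs, corpus_probs, lam, top_n=30):
    """Return the top_n words of a topic, sorted by decreasing relevance."""
    scores = {
        word: relevance(p, corpus_probs[word], lam)
        for word, p in topic_probs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical probabilities for three words of a larger vocabulary.
topic_probs = {"study": 0.10, "cancer": 0.05, "ozone": 0.04}    # p(word | topic)
corpus_probs = {"study": 0.12, "cancer": 0.02, "ozone": 0.005}  # p(word)

print(rank_terms(topic_probs, corpus_probs, lam=1.0))  # ranks by p(word | topic)
print(rank_terms(topic_probs, corpus_probs, lam=0.0))  # ranks by p(word | topic) / p(word)
```

With λ = 1 the ranking favors words that are simply frequent in the topic, so a common word such as "study" comes first. With λ = 0 it favors words that are much more frequent in the topic than in the corpus as a whole, so "ozone" comes first. Intermediate values blend the two orderings.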
If you find that at least one of the topics is relevant, please fill in the form at the end of this page. If you have more general comments or questions, feel free to get in touch at email@example.com.
This project is part of the Epidemium initiative, a data challenge that brings together cancer research and Big Data. The code and the data are available in a GitHub repository; see also this Jupyter notebook on NbViewer.