Topic discovery and document retrieval from the CORD-19 Coronavirus challenge dataset.
This is my first attempt at Natural Language Processing (NLP) and my initial submission to the Kaggle Covid-19 challenge: https://www.kaggle.com/covid19
In this Jupyter notebook I walk through some commonly used techniques (Latent Dirichlet Allocation, TF-IDF, etc.) for two tasks: unsupervised discovery of the underlying topics in a corpus of scientific literature on the novel coronavirus outbreak, and retrieval of the documents most relevant to some of the key questions for tackling the pandemic. Specifically, this mini-project aims to answer questions from one of the subtasks of the challenge:
What do we know about virus genetics, origin, and evolution? https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=567
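To give a feel for the retrieval half of this, here is a minimal sketch of TF-IDF ranking written from scratch with only the Python standard library. It is not the notebook's actual code, which may well rely on a library such as scikit-learn or gensim. The three documents and the query are made-up placeholders rather than CORD-19 text, and the helper names (tokenize, build_tfidf, to_vector, rank) are hypothetical. The smoothed IDF formula is one common variant and is assumed here, not taken from the notebook.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def to_vector(tokens, idf):
    """Turn tokens into an L2-normalised TF-IDF vector (a dict term -> weight).

    Tokens that are not in the IDF vocabulary are ignored.
    """
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_tfidf(docs):
    """Compute smoothed IDF weights and one TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    # Assumed smoothing: idf = ln((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [to_vector(tokens, idf) for tokens in tokenized]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def rank(query, docs):
    """Return (score, doc_index) pairs sorted from most to least similar."""
    idf, vectors = build_tfidf(docs)
    q = to_vector(tokenize(query), idf)
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Placeholder corpus and query, invented for illustration only.
docs = [
    "Phylogenetic analysis of coronavirus genomes suggests a bat origin.",
    "Hospital staffing and ventilator supply during the outbreak.",
    "Mutation rates and evolution of the viral genome over time.",
]
query = "virus genetics origin and evolution"

ranking = rank(query, docs)
for score, idx in ranking:
    print(f"{score:.3f}  {docs[idx]}")
```

Note that this toy version has no stop-word removal or stemming, so a filler word like "and" still contributes to the score and "virus" does not match "viral". A real pipeline would normally add that preprocessing before ranking.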
I have described the flow of the pipeline in Markdown cells in the notebook. By the end of the challenge I had done less than I had initially hoped, but I would still like to acknowledge how much I learned from places like stackoverflow.com and towardsdatascience.com. Please leave a comment if you have any suggestions or feedback. I hope this notebook will also be useful for anyone who wishes to get started with NLP!