Topic discovery and document retrieval from the CORD-19 Coronavirus challenge dataset

yunpengl9071/CORD-19-NLP

CORD-19-NLP

Topic discovery and document retrieval from the CORD-19 Coronavirus challenge dataset.

This is my first attempt at natural language processing (NLP) and an initial submission to the Kaggle COVID-19 challenge: https://www.kaggle.com/covid19

In this Jupyter notebook I walk through some commonly used techniques (Latent Dirichlet Allocation, TF-IDF, etc.) for two tasks: unsupervised discovery of the underlying topics in a corpus of scientific literature on the novel coronavirus outbreak, and retrieval of the documents most relevant to some of the key questions for tackling the pandemic. Specifically, this mini-project aims to answer questions from one of the subtasks of the challenge:

What do we know about virus genetics, origin, and evolution? https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=567
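To give a feel for the retrieval half of the pipeline, here is a minimal sketch of TF-IDF ranking with cosine similarity, written with the Python standard library only. It is not the code from the notebook: the three toy documents, the function names (`tokenize`, `vectorize`, `build_tfidf`, `rank`) and the smoothed IDF formula are assumptions made for illustration. It also skips stop-word removal and does not cover the LDA topic-discovery step.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(tokens, idf):
    """Turn tokens into an L2-normalised TF-IDF vector (dict: term -> weight).

    Terms that are not in the corpus vocabulary (not in `idf`) are ignored.
    """
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_tfidf(docs):
    """Compute IDF weights and one TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant) so every weight stays positive.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return idf, [vectorize(tokens, idf) for tokens in tokenized]


def rank(query, docs):
    """Return (document indices sorted by relevance, cosine scores)."""
    idf, vectors = build_tfidf(docs)
    q = vectorize(tokenize(query), idf)
    # Both vectors are unit length, so the dot product is the cosine.
    scores = [sum(w * v.get(t, 0.0) for t, w in q.items()) for v in vectors]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Made-up stand-ins for paper abstracts, not taken from CORD-19.
docs = [
    "Phylogenetic analysis of coronavirus genomes suggests a bat origin.",
    "Hospital staffing and ventilator supply during the outbreak.",
    "Mutation rates and evolution of the viral spike protein.",
]
order, scores = rank("virus genetics origin and evolution", docs)
for i in order:
    print(f"{scores[i]:.3f}  {docs[i]}")
```

Because no stop words are removed here, a common word such as "and" still contributes to the score. On a real corpus one would normally filter stop words and use a library implementation rather than this hand-rolled version.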

I have described the flow of the pipeline in Markdown cells in the notebook. By the end of the challenge I had done less than I had initially hoped, but I would still like to acknowledge how much I learned from places like stackoverflow.com and towardsdatascience.com. Please leave a comment if you have any suggestions or feedback. I hope that this notebook can also be useful for anyone who wishes to get started on NLP!
