CORD-19-NLP

Topic discovery and document retrieval from the CORD-19 Coronavirus challenge dataset.

This is my first attempt at natural language processing (NLP) and an initial submission to the Kaggle COVID-19 challenge: https://www.kaggle.com/covid19

In the Jupyter notebook (LDATopicModelingAndSimilarityMetrics.ipynb) I walk through some commonly used techniques (Latent Dirichlet Allocation, TF-IDF, etc.) for two tasks: unsupervised discovery of the underlying topics in a corpus of scientific literature on the novel coronavirus outbreak, and retrieval of the documents most relevant to some of the key questions for tackling the pandemic. Specifically, this mini-project aims to answer the questions in one of the challenge's subtasks:

What do we know about virus genetics, origin, and evolution? https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=567
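
As a rough illustration of the retrieval half of such a pipeline, the sketch below ranks a few documents against the task question using TF-IDF weights and cosine similarity. It is not the notebook's code: it uses only the Python standard library, the three example documents are made-up placeholders rather than CORD-19 abstracts, and the function names (`tokenize`, `build_tfidf`, `vectorize`, `cosine`, `rank`) are hypothetical. The smoothed IDF formula is one common variant, chosen here as an assumption.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(tokens, idf):
    """Map each known term to (raw term count * IDF); unknown terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def build_tfidf(docs):
    """Return (idf, vectors): the IDF table and one sparse TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    # Smoothed IDF (an assumed variant) so every weight stays positive.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [vectorize(tokens, idf) for tokens in tokenized]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, document index) pairs sorted from most to least similar."""
    idf, vectors = build_tfidf(docs)
    query_vec = vectorize(tokenize(query), idf)
    scores = [(cosine(query_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)


# Placeholder documents for illustration only (not taken from CORD-19).
docs = [
    "Phylogenetic analysis of coronavirus genomes suggests a bat origin.",
    "Hospital staffing and ventilator supply during the outbreak.",
    "Mutation rates and evolution of the virus spike protein gene.",
]
query = "What do we know about virus genetics, origin, and evolution?"

ranking = rank(query, docs)
for score, idx in ranking:
    print(f"{score:.3f}  {docs[idx]}")
```

On a real corpus one would normally also remove stop words and use a library implementation; this version only shows the mechanics of weighting and ranking.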

I describe the flow of the pipeline in the notebook's markdown cells. By the end of the challenge I had, of course, done less than I initially hoped, but I would still like to acknowledge how much I learned from sites like stackoverflow.com and towardsdatascience.com. Please leave a comment if you have any suggestions or feedback. I hope this notebook is also useful for anyone who wishes to get started with NLP!

yunpengl9071/CORD-19-NLP
