
Project 4 Summary

Overview

The overall goal of this project was to write a program that could take a query about any medical topic (e.g. a disease or a drug), query Pubmed for recent open access review papers, summarize the information in those papers, and return the summary to the user. I was interested in summarization because the number of medical and scientific papers published every year keeps increasing, and it's difficult for people to keep up, especially for clinicians who are busy taking care of patients. Current solutions include websites like UpToDate, DynaMed, and Medscape, which curate and summarize the latest and most relevant medical information, but producing these articles requires thousands of hours from hundreds of experts, which is expensive and time-consuming.

Therefore, I wanted to see if I could get code to do the summarization automatically. Specifically, I planned to pull papers from Pubmed, the main online archive of biomedical literature, which has a subset of papers that are open access and free for anybody to read.

Project design

For the design of my project, I used the API tools provided by Pubmed to send a query for a particular topic and then fetch the text of the most relevant open access review papers about that topic. Ideally, I also wanted the code to cluster the review papers by topic and to summarize the most relevant information from the papers about each topic. I planned on trying different summarization algorithms to see which one would work best.

Tools & Data

In terms of tools, I knew I had to build code that could accept a query, send it to Pubmed, and get back the text of the most relevant review papers on that topic. Pubmed has quite a good API, providing different E-utilities that allow developers to query the database for different kinds of information. The ESearch utility takes a query string and returns a list of PMIDs for papers about that topic. PMIDs, or Pubmed IDs, are numbers that serve as document identifiers for papers in Pubmed. I then had to convert each Pubmed ID to a Pubmed Central ID (PMID -> PMCID), because Pubmed Central is the archive for the open access papers. Not all documents are open access; only the open access documents even have a PMCID. Once I had the Pubmed Central IDs, I could use the EFetch utility to get the raw XML of the complete paper.
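A minimal sketch of this three-step pipeline (search, ID conversion, fetch) using the requests library is shown below. The endpoints are the public NCBI E-utilities and PMC ID Converter services, but the helper function names, the example query filter, and parameters like retmax are my own illustrative choices rather than the exact code in this repo.

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
IDCONV = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"

def search_pmids(query, retmax=20):
    # ESearch: returns PMIDs for the most relevant PubMed records matching the query.
    r = requests.get(f"{EUTILS}/esearch.fcgi",
                     params={"db": "pubmed", "term": query,
                             "retmode": "json", "retmax": retmax})
    return r.json()["esearchresult"]["idlist"]

def pmids_to_pmcids(pmids):
    # PMC ID Converter: only papers deposited in Pubmed Central come back with a PMCID.
    r = requests.get(IDCONV, params={"ids": ",".join(pmids), "format": "json"})
    return [rec["pmcid"] for rec in r.json()["records"] if "pmcid" in rec]

def fetch_fulltext_xml(pmcid):
    # EFetch against the PMC database returns the raw XML of the complete paper.
    r = requests.get(f"{EUTILS}/efetch.fcgi",
                     params={"db": "pmc", "id": pmcid, "retmode": "xml"})
    return r.text

# Example: restrict the search to review articles about a topic.
pmids = search_pmids("atrial fibrillation AND review[pt]")
pmcids = pmids_to_pmcids(pmids)
xml_docs = [fetch_fulltext_xml(p) for p in pmcids]
```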

I then used the ElementTree XML API for Python to parse the XML into clean text for the different sections, so I was able to store the title, keywords, abstract text, article text, and citations for each open access paper. By the time I had finished building this pipeline to query Pubmed, I was able to obtain the text data that I could then use for modeling.
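A rough sketch of that parsing step is below; the tag names (article-title, kwd, abstract, body, ref-list) come from the JATS schema that PMC full-text XML uses, but the exact paths vary from paper to paper, so this is illustrative rather than the exact parser in this repo.

```python
import xml.etree.ElementTree as ET

def text_of(node):
    # Flatten an element (and all of its children) into one clean string.
    return " ".join("".join(node.itertext()).split()) if node is not None else ""

def parse_pmc_article(xml_string):
    # EFetch on db=pmc wraps each paper in a <pmc-articleset> root element.
    root = ET.fromstring(xml_string)
    article = root.find(".//article")
    body = article.find(".//body")
    return {
        "title": text_of(article.find(".//article-title")),
        "keywords": [text_of(k) for k in article.findall(".//kwd")],
        "abstract": text_of(article.find(".//abstract")),
        "body": " ".join(text_of(p) for p in body.findall(".//p")) if body is not None else "",
        "citations": [text_of(r) for r in article.findall(".//ref-list/ref")],
    }
```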

Algorithms

Automatic summarization has two main categories, extractive and abstractive. Extractive summarization methods pick the most 'important' sentences and use those as the summary, which means they can only quote from the text. Abstractive summarization methods aim to paraphrase the content of the article, which is closer to how humans summarize a text, but this requires text generation. I was very interested in abstractive summarization, which is an active area of research, but it is done these days with neural networks, and due to time and compute constraints I was not able to really get into it.

There are libraries that provide implementations of different extractive summarization algorithms, and I used a library called Sumy. I used the methods Sumy provides for TextRank, LexRank, LSA (latent semantic analysis), and Luhn summarization. TextRank and LexRank are both based on the PageRank algorithm, except that they treat sentences as nodes instead of pages; they differ only in how they calculate the similarity between sentences. LSA summarization uses singular value decomposition to get the most important sentences for each 'topic' in the document. Luhn summarization picks sentences that have the highest frequency of 'important' words.
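Running all four Sumy summarizers on the same article text looks roughly like the sketch below; the parser, tokenizer, and summarizer classes are Sumy's public API, while the wrapper function and the five-sentence summary length are my own choices.

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.luhn import LuhnSummarizer

def summarize(text, n_sentences=5):
    # Sumy parses plain text into a document of sentences (the English
    # tokenizer relies on NLTK's punkt data), then each summarizer scores
    # the sentences and returns the top n as the extractive summary.
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summaries = {}
    for name, Summarizer in [("textrank", TextRankSummarizer),
                             ("lexrank", LexRankSummarizer),
                             ("lsa", LsaSummarizer),
                             ("luhn", LuhnSummarizer)]:
        summarizer = Summarizer()
        summaries[name] = [str(s) for s in summarizer(parser.document, n_sentences)]
    return summaries
```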

Overall, when I ran these different extractive summarization methods on the full articles, they didn't actually do very well at picking out what I thought were the most important sentences. LSA and TextRank/LexRank did seem to do better than Luhn, but none of them were great. There are quantitative ways to judge the 'quality' of a summary, specifically calculating a ROUGE or BLEU score (BLEU was originally developed to score machine translation, and ROUGE is its recall-oriented counterpart for summarization), but I ran out of time to write the code to calculate them. There are packages that will do this, e.g. pyrouge for ROUGE scores, and NLTK offers functions to calculate BLEU scores.
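For reference, computing a BLEU score with NLTK only takes a few lines; the reference and candidate summaries below are made-up tokenized examples, not output from this project.

```python
from nltk.translate.bleu_score import sentence_bleu

# A human-written (reference) summary and a machine-generated (candidate)
# summary, both tokenized into word lists.
reference = ["anticoagulation", "reduces", "stroke", "risk", "in", "atrial", "fibrillation"]
candidate = ["anticoagulation", "lowers", "stroke", "risk", "in", "atrial", "fibrillation"]

# sentence_bleu takes a list of reference token lists and one candidate.
score = sentence_bleu([reference], candidate)
print(f"BLEU: {score:.3f}")
```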

I also tried clustering, and I was specifically interested in seeing whether different clustering methods could differentiate between papers about different diseases. I used my Pubmed pipeline to get 52 review papers about Lewy body dementia and 36 review papers about atrial fibrillation, which are two completely different diseases. I found a set of word2vec vectors that were trained on biomedical texts, and for each paper abstract I calculated its mean word2vec vector. If you project the mean word2vec vectors with t-SNE, you can see that there is actually fairly good separation between the two groups.

[Figure: t-SNE plot of the abstracts, colored by their true disease labels]
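The featurization and projection steps look roughly like the sketch below; the file name of the pre-trained biomedical vectors is a placeholder, and the abstracts and labels arguments stand in for the output of the Pubmed pipeline rather than the exact code in this repo.

```python
import numpy as np
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE

def mean_vector(text, wv):
    # Average the word vectors of every in-vocabulary token in the abstract.
    tokens = [t for t in text.lower().split() if t in wv]
    return np.mean([wv[t] for t in tokens], axis=0)

def plot_tsne(abstracts, labels, vectors_path="biomedical_word2vec.bin"):
    # Load pre-trained biomedical word2vec vectors (file name and binary flag
    # are placeholders for whichever pre-trained set is used).
    wv = KeyedVectors.load_word2vec_format(vectors_path, binary=True)
    X = np.vstack([mean_vector(a, wv) for a in abstracts])
    # Project the high-dimensional mean vectors down to 2-D for plotting.
    X_2d = TSNE(n_components=2, random_state=42).fit_transform(X)
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels)
    plt.title("t-SNE of mean word2vec vectors per abstract")
    plt.show()
    return X  # keep the mean vectors for clustering
```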

However, when I ran different clustering algorithms on the mean word2vec vectors themselves, the clustering algorithms didn't do very well. These are the t-SNE plots of the clusters that DBSCAN, KMeans, and agglomerative clustering found, and you can see that in the word2vec space they can't differentiate well between the two groups.

[Figures: t-SNE plots of the DBSCAN, KMeans, and agglomerative clusters]
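The clustering itself is only a few lines of scikit-learn. Below is a sketch that assumes the X matrix and disease labels from the previous snippet; the adjusted Rand index at the end is one way to quantify how well the found clusters match the true labels, added here for illustration rather than something computed in the project, and the DBSCAN eps value is an arbitrary starting point that would need tuning.

```python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

def cluster_and_score(X, true_labels):
    # X: mean word2vec vectors; true_labels: 0/1 disease labels per abstract.
    predictions = {
        "KMeans": KMeans(n_clusters=2, random_state=42).fit_predict(X),
        "Agglomerative": AgglomerativeClustering(n_clusters=2).fit_predict(X),
        # DBSCAN's eps depends on the scale of the word2vec space.
        "DBSCAN": DBSCAN(eps=0.5, min_samples=5).fit_predict(X),
    }
    for name, pred in predictions.items():
        # Adjusted Rand index: 1.0 means the found clusters match the true
        # disease labels exactly; ~0 means no better than chance.
        print(name, adjusted_rand_score(true_labels, pred))
    return predictions
```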

What I would do differently next time & future work

I didn't get a chance to do topic modeling for this project, but I think it would be interesting to see whether topic modeling could pick out the different diseases for each paper. I also did a little bit of experimenting to see if clustering could pick out papers about different sub-topics; for example, for the 36 review papers on atrial fibrillation, I manually labeled each abstract based on its title and abstract text. When I tried clustering with KMeans and DBSCAN to see if they could pick out the same groups that I did, neither was able to recover the groups I had labeled. For example, 7 of the 36 papers were about anticoagulation (and even had the word 'anticoagulation' in the paper title!), but DBSCAN and KMeans assigned these papers to different groups.

Also, for future work I would definitely want to explore abstractive summarization, potentially by implementing a neural network in PyTorch or TensorFlow. The learning curves for both of these frameworks were too steep to tackle in the timeframe of this project, but they are things I definitely want to learn.
