Neil Saunders 2020-12-18 20:39:17
This report has 4 aims:
- obtain the identifiers for all retracted publications in PubMed
- obtain the identifiers for all articles in PubMed that cite those retracted publications
- generate citation networks based on these datasets
- explore the networks with some basic analysis
We search PubMed using the `rentrez` package. Knowing that there are currently around 8700 retracted articles, we can set `retmax` to a suitably high number. Alternatively, we can run an initial search, then use the value of `es$count` in a second search. This creates a data frame with PMIDs (article identifiers) in one column.
library(rentrez)
es <- entrez_search("pubmed", "Retracted Publication[PTYP]", retmax = 10000)
articles <- data.frame(pmid = es$ids)
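The two-step alternative mentioned above (read the hit count first, then search again with `retmax` set to that count) could look roughly like this. This is only a sketch: `search_all_ids`, its `search_fn` argument and `stub_search` are hypothetical names introduced for illustration. `search_fn` defaults to `rentrez::entrez_search` and exists only so that the logic can be tried offline with a stub.

```r
# Hypothetical helper: run a cheap search to read the hit count,
# then repeat the search with retmax set to that count.
search_all_ids <- function(term, db = "pubmed",
                           search_fn = rentrez::entrez_search) {
  first <- search_fn(db = db, term = term, retmax = 0)
  if (first$count == 0) {
    return(character(0))
  }
  full <- search_fn(db = db, term = term, retmax = first$count)
  full$ids
}

# Offline stub (an assumption for this example) that mimics the
# `count` and `ids` fields of a search result; the IDs are made up.
stub_search <- function(db, term, retmax) {
  all_ids <- c("101", "102", "103")
  list(count = length(all_ids), ids = head(all_ids, retmax))
}

ids <- search_all_ids("Retracted Publication[PTYP]", search_fn = stub_search)
```

Dropping the `search_fn = stub_search` argument would send the same two queries to PubMed instead of the stub.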
We use `entrez_link` to find citations in PubMed for the given PMID. Multiple citation PMIDs can be stored in each row of a list column in the data frame.
The `get_cites` function took around 2.5 hours to run, but completed successfully.
The final step is to unnest the `cites` column, generating each pair of article PMID and citing article PMID, per row. For articles without citations, `get_cites` returns NULL and so only PMIDs with one or more citations are retained. This is what we want.
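library(dplyr)
library(tidyr)

# Return the PMIDs of PubMed articles that cite the given PMID (NULL if none).
get_cites <- function(id) {
  el <- entrez_link(dbfrom = "pubmed", id = id, db = "pubmed")
  el$links$pubmed_pubmed_citedin
}
# lapply always returns a list, so cites is a list column with one element per PMID.
articles$cites <- lapply(articles$pmid, get_cites)
# One row per (retracted PMID, citing PMID) pair; named dataset to match the code that follows.
dataset <- articles %>%
  unnest(cites)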
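Because the lookup runs for hours, a single failed request can waste the whole run. The sketch below shows one possible way to guard each call and then unnest the result. It is not the method used in this report: `safe_cites` and `stub_cites` are hypothetical names, and the stub with its made-up PMIDs stands in for `get_cites` so that the example runs offline.

```r
library(tibble)
library(tidyr)

# Hypothetical wrapper: call a lookup function and return NULL on failure,
# so that one failed request does not abort a long run.
safe_cites <- function(id, link_fn) {
  tryCatch(link_fn(id), error = function(e) NULL)
}

# Offline stub standing in for get_cites(); all PMIDs are made up.
stub_cites <- function(id) {
  switch(id,
         "111" = c("901", "902"),
         "222" = NULL,
         "333" = "903",
         stop("lookup failed"))
}

toy <- tibble(pmid = c("111", "222", "333", "444"))
toy$cites <- lapply(toy$pmid, safe_cites, link_fn = stub_cites)

# unnest() drops rows whose list element is NULL (keep_empty = FALSE is the default).
toy_pairs <- unnest(toy, cites)
```

One caveat of this design: a failed lookup and an article with no citations both end up as NULL, so failures would need to be logged separately if the distinction matters.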
Each pair of article PMID and citing article PMID looks like this.
library(knitr)
library(kableExtra)

dataset %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"))
| pmid | cites |
|---|---|
| 32873781 | 33262330 |
| 32696949 | 33006362 |
| 32683951 | 32873282 |
| 32668870 | 33281107 |
| 32649709 | 33048995 |
| 32646999 | 32760174 |
| 32646999 | 32666253 |
| 32623526 | 33141364 |
| 32598092 | 32875064 |
| 32581016 | 33281107 |
We can count `pmid` to find the top 10 most-cited retracted articles.
Then we can retrieve the XML summary for those articles using `entrez_fetch` and parse the XML for the article titles.
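library(xml2)

# Count citing articles per retracted PMID and keep the 10 most cited.
top10 <- dataset %>%
  count(pmid, sort = TRUE) %>%
  head(10)
# Fetch the full XML records and extract the article titles.
x <- entrez_fetch("pubmed", top10$pmid, rettype = "xml")
titles <- read_xml(x) %>%
  xml_find_all("//ArticleTitle") %>%
  xml_text()
# Note: this assumes the titles come back in the same order as top10$pmid.
top10 %>%
  bind_cols(title = titles) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"))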
| pmid | n | title |
|---|---|---|
| 23432189 | 1081 | Primary prevention of cardiovascular disease with a Mediterranean diet. |
| 12609035 | 681 | An enhanced transient expression system in plants based on suppression of gene silencing by the p19 protein of tomato bushy stunt virus. |
| 16642001 | 598 | Lysyl oxidase is essential for hypoxia-induced metastasis. |
| 22088800 | 563 | Cardiac stem cells in patients with ischaemic cardiomyopathy (SCIPIO): initial results of a randomised phase 1 trial. |
| 24711954 | 475 | A comprehensive review on metabolic syndrome. |
| 15604363 | 419 | Visfatin: a protein secreted by visceral fat that mimics the effects of insulin. |
| 19524507 | 415 | A pleiotropically acting microRNA, miR-31, inhibits breast cancer metastasis. |
| 21753854 | 366 | Selective killing of cancer cells by a small molecule targeting the stress response to ROS. |
| 15222900 | 351 | TREEFINDER: a powerful graphical analysis environment for molecular phylogenetics. |
| 9500320 | 339 | Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. |
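Binding the titles to the counts by position relies on the fetched records coming back in the same order as the requested PMIDs. A slightly more defensive variant, sketched below, reads the PMID and the title from the same record and then joins on `pmid`. The XML here is a made-up, heavily trimmed stand-in for real PubMed output, and `titles_df`, `top_toy` and `top_titles` are names invented for this example.

```r
library(xml2)

# Made-up, heavily trimmed records that follow the PubMed XML layout.
xml_txt <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>222</PMID>
      <Article><ArticleTitle>Second toy title.</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>111</PMID>
      <Article><ArticleTitle>First toy title.</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'

records <- xml_find_all(read_xml(xml_txt), "//PubmedArticle")

# Extract PMID and title from the same record, one row per record.
titles_df <- data.frame(
  pmid  = xml_text(xml_find_first(records, "./MedlineCitation/PMID")),
  title = xml_text(xml_find_first(records, ".//ArticleTitle")),
  stringsAsFactors = FALSE
)

# Toy counts in a different order from the XML records.
top_toy <- data.frame(pmid = c("111", "222"), n = c(5L, 3L),
                      stringsAsFactors = FALSE)

# Join on pmid rather than relying on row order.
top_titles <- merge(top_toy, titles_df, by = "pmid")
```

The path `./MedlineCitation/PMID` is used rather than `//PMID` because PubMed records can contain other PMID elements, for example in comment and correction links.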
Now we bring out the `igraph` package. `graph.data.frame` converts the dataset to a graph. Then we can add additional attributes to the vertices.
We’ll write out the graph as graphml to use later in Gephi.
dataset_graph <- graph.data.frame(dataset)
V(dataset_graph)$label <- V(dataset_graph)$name
V(dataset_graph)$retracted <- ifelse(V(dataset_graph)$name %in% dataset$pmid, 1, 0)
write.graph(dataset_graph, file = "../../data/retracted_pmids_citations.graphml", format = "graphml")
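To see what the conversion does, here is a toy version with a made-up edge list. It uses `graph_from_data_frame`, the underscore-style name for `graph.data.frame` in igraph 1.0 and later. The first column becomes the source of each edge and the second the target, so edges point from a retracted article to the article citing it, and the out-degree of a retracted article equals its citation count.

```r
library(igraph)

# Made-up edge list: "R*" are retracted articles, single letters are citing articles.
toy <- data.frame(
  pmid  = c("R1", "R1", "R2", "R3"),
  cites = c("A", "R2", "B", "C"),
  stringsAsFactors = FALSE
)

toy_graph <- graph_from_data_frame(toy)

# Flag vertices that appear in the pmid column, i.e. the retracted articles.
V(toy_graph)$retracted <- ifelse(V(toy_graph)$name %in% toy$pmid, 1, 0)

# Out-degree counts edges leaving each vertex: the number of citing articles.
out_deg <- degree(toy_graph, mode = "out")
```

Note that R2 is both retracted and a citing article here, which is the kind of vertex that the retracted-only subgraph later in the report is built from.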
`components` finds the connected components of the graph. `groups` identifies the vertices in each component.
We can use `sapply` and `length` to find the top 10 largest components, i.e. the most-connected articles.
dataset_components <- components(dataset_graph)
dataset_groups <- groups(dataset_components)
top10 <- sapply(dataset_groups, length) %>%
  sort(decreasing = TRUE) %>%
  head(10)
top10
##    26     4  2459  1288  1692  1902  2937  1204   882  1305
## 55444   585   567   224   195   195   191   150   144   132
So the largest connected group still contains 55444 vertices of the original 84992.
We can create a subgraph of just those articles from the largest connected group, and write it out for later use.
dataset_subgraph <- subgraph(dataset_graph, which(V(dataset_graph)$name %in% dataset_groups[[26]]))
write.graph(dataset_subgraph, "../../data/retracted_pmids_subgraph.graphml", format = "graphml")
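The component number 26 above is read off the printed output. As an alternative sketch, the largest component can be picked programmatically from the `csize` and `membership` fields that `components` returns. The edge list is made up and the `toy_*` and `largest_*` names are only for this example.

```r
library(igraph)

# Made-up edge list with two separate groups of articles.
toy <- data.frame(
  pmid  = c("R1", "R1", "R2", "R3"),
  cites = c("A", "R2", "B", "C"),
  stringsAsFactors = FALSE
)
toy_graph <- graph_from_data_frame(toy)

# For a directed graph, components() finds weakly connected components by default.
toy_components <- components(toy_graph)

# Index of the largest component, then the vertices that belong to it.
largest_id <- which.max(toy_components$csize)
largest_vertices <- V(toy_graph)[toy_components$membership == largest_id]

# Keep those vertices and all edges among them.
toy_largest <- induced_subgraph(toy_graph, largest_vertices)
```

This avoids hard-coding a component number that may change when the data are refreshed.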
We can create another subgraph containing only retracted articles - i.e. one in which the citing articles were also retracted.
dataset_onlyretracted_subgraph <- subgraph(dataset_graph, V(dataset_graph)[retracted == 1])
write.graph(dataset_onlyretracted_subgraph, "../../data/onlyretracted_pmids_subgraph.graphml", format = "graphml")
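A small made-up example shows why this subgraph is so much sparser: an edge survives only when both the cited and the citing article are retracted. The sketch assumes the same `retracted` vertex attribute as above; all other names are for illustration only.

```r
library(igraph)

# Made-up edge list: "R*" are retracted, lower-case names are ordinary citing articles.
toy <- data.frame(
  pmid  = c("R1", "R1", "R2", "R3"),
  cites = c("R2", "a", "b", "c"),
  stringsAsFactors = FALSE
)
toy_graph <- graph_from_data_frame(toy)
V(toy_graph)$retracted <- ifelse(V(toy_graph)$name %in% toy$pmid, 1, 0)

# Keep only retracted vertices; edges to non-retracted articles are dropped.
retracted_only <- induced_subgraph(toy_graph, V(toy_graph)[retracted == 1])

# Component sizes, largest first.
component_sizes <- sort(components(retracted_only)$csize, decreasing = TRUE)
```

Of the four original edges only R1 to R2 remains, leaving one component of two articles and R3 on its own.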
As before, we can find the connected components in this graph.
dataset_onlyretracted_components <- components(dataset_onlyretracted_subgraph)
dataset_onlyretracted_groups <- groups(dataset_onlyretracted_components)
top10 <- sapply(dataset_onlyretracted_groups, length) %>%
  sort(decreasing = TRUE) %>%
  head(10)
top10
##  350  155 1957 3145  921 1314 2083 2410  509 2460
##   55   36   31   29   26   19   19   17   14   14
And as before, retrieve the XML and article titles for groups of interest. Let’s start with the largest group. We’ll just look at the top 20 out of 55.
x <- entrez_fetch("pubmed", dataset_onlyretracted_groups[[names(top10)[1]]], rettype = "xml")
titles <- read_xml(x) %>%
  xml_find_all("//ArticleTitle") %>%
  xml_text()
data.frame(pmid = dataset_onlyretracted_groups[[names(top10)[1]]],
           title = titles) %>%
  head(20) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"))
| pmid | title |
|---|---|
| 30233176 | Synthesis and characterization of a novel peptide-grafted Cs and evaluation of its nanoparticles for the oral delivery of insulin, in vitro, and in vivo study. |
| 26586942 | PLGA-encapsulated tea polyphenols enhance the chemotherapeutic efficacy of cisplatin against human cancer cells and mice bearing Ehrlich ascites carcinoma. |
| 26164001 | SUMO-specific protease 6 promotes gastric cancer cell growth via deSUMOylation of FoxM1. |
| 26032092 | Curcumin inhibits growth of prostate carcinoma via miR-208-mediated CDKN1A activation. |
| 25792385 | Curcumin enhances the radiosensitivity of U87 cells by inducing DUSP-2 up-regulation. |
| 23399702 | RETRACTED: Tea polyphenols enhance cisplatin chemosensitivity in cervical cancer cells via induction of apoptosis. |
| 23349727 | The different role of Notch1 and Notch2 in astrocytic gliomas. |
| 22806240 | Activated K-Ras and INK4a/Arf deficiency promote aggressiveness of pancreatic cancer by induction of EMT consistent with cancer stem cell phenotype. |
| 22363731 | 3,3’-Diindolylmethane exhibits antileukemic activity in vitro and in vivo through a Akt-dependent process. |
| 22261338 | RETRACTED: Increased Ras GTPase activity is regulated by miRNAs that can be attenuated by CDF treatment in pancreatic cancer cells. |
| 22213426 | Inactivation of Ink4a/Arf leads to deregulated expression of miRNAs in K-Ras transgenic mouse model of pancreatic cancer. |
| 21673986 | Activated K-ras and INK4a/Arf deficiency cooperate during the development of pancreatic cancer by activation of Notch and NF-κB signaling pathways. |
| 21503965 | Over-expression of FoxM1 leads to epithelial-mesenchymal transition and cancer stem cell phenotype in pancreatic cancer cells. |
| 21463919 | Notch-1 induces epithelial-mesenchymal transition consistent with cancer stem cell phenotype in pancreatic cancer cells. |
| 21408027 | Anti-tumor activity of a novel compound-CDF is mediated by regulating miR-21, miR-200, and PTEN in pancreatic cancer. |
| 20824697 | Restoring sensitivity to oxaliplatin by a novel approach in gemcitabine-resistant pancreatic cancer cells in vitro and in vivo. |
| 20658545 | Down-regulation of Notch-1 is associated with Akt and FoxM1 in inducing cell growth inhibition and apoptosis in prostate cancer cells. |
| 20599780 | Cyclodextrin-complexed curcumin exhibits anti-inflammatory and antiproliferative activities superior to those of curcumin through higher cellular uptake. |
| 20388782 | Gemcitabine sensitivity can be induced in pancreatic cancer cells through modulation of miR-200 and miR-21 expression by curcumin or its analogue CDF. |
| 20379844 | Platelet-derived growth factor-D contributes to aggressiveness of breast cancer cells by up-regulating Notch and NF-κB signaling pathways. |
Clearly a network of cancer-related articles. How about at the other end of the top 10?
x <- entrez_fetch("pubmed", dataset_onlyretracted_groups[[names(top10)[10]]], rettype = "xml")
titles <- read_xml(x) %>%
  xml_find_all("//ArticleTitle") %>%
  xml_text()
data.frame(pmid = dataset_onlyretracted_groups[[names(top10)[10]]],
           title = titles) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"))
| pmid | title |
|---|---|
| 23173109 | Strategy for prevention of hip fractures in patients with Parkinson’s disease. |
| 22372723 | Efficacy of antiresorptive agents for preventing fractures in Japanese patients with an increased fracture risk: review of the literature. |
| 21825080 | Once-weekly risedronate for prevention of hip fracture in women with Parkinson’s disease: a randomised controlled trial. |
| 21050796 | Amelioration of osteoporosis and hypovitaminosis D by sunlight exposure in Parkinson’s disease. |
| 19499964 | Efficacy of menatetrenone (vitamin K2) against non-vertebral and hip fractures in patients with neurological diseases: meta-analysis of three randomized, controlled trials. |
| 18384711 | Efficacy of risedronate against hip fracture in patients with neurological diseases: a meta-analysis of randomized controlled trials. |
| 18306478 | Comparison of effects of alendronate and raloxifene on lumbar bone mineral density, bone turnover, and lipid metabolism in elderly women with osteoporosis. |
| 17372126 | Risedronate and ergocalciferol prevent hip fracture in elderly men with Parkinson disease. |
| 16538619 | Alendronate and vitamin D2 for prevention of hip fracture in Parkinson’s disease: a randomized controlled trial. |
| 16087822 | Risedronate sodium therapy for prevention of hip fracture in men 65 years or older after stroke. |
| 16087821 | The prevention of hip fracture with risedronate and ergocalciferol plus calcium supplementation in elderly women with Alzheimer disease: a randomized controlled trial. |
| 15664003 | RETRACTED: Menatetrenone and vitamin D2 with calcium supplements prevent nonvertebral fracture in elderly women with Alzheimer’s disease. |
| 12913194 | Amelioration of osteoporosis and hypovitaminosis D by sunlight exposure in stroke patients. |
| 12110423 | Amelioration of osteoporosis by menatetrenone in elderly female Parkinson’s disease patients with vitamin D deficiency. |
Something has gone awry in the world of aging bones.
In summary: nice pictures, but not many insights.
We load the graphml files into Gephi for manipulation and visualisation. The OpenOrd layout was found to be fastest, and effective in arranging the graphs.
First, the largest connected subgraph. Vertices are coloured by modularity class.
Not sure we can conclude much from this, other than that there are several highly-connected areas of the graph which presumably relate to articles about a particular topic.
We can zoom into the graph, with some difficulty as it is large. This shows just how connected a retracted article can be. PMID 19524507 is an article titled A pleiotropically acting microRNA, miR-31, inhibits breast cancer metastasis. This article was retracted due to concerns regarding statistical analysis and data presentation.
We turn now to the subgraph containing only retracted articles and retracted citing articles. This is clearly less connected and easier to read.
Vertices are again coloured by modularity class, and vertex size reflects “authority” - a measure of informational importance.
Zooming in allows inspection of connected articles.
The large vertex PMID 22851539 is the article Tracking chromatid segregation to identify human cardiac stem cells that regenerate extensively the infarcted myocardium. It was retracted for somewhat mysterious reasons related to a figure (2E) in the article.