forked from dgrtwo/tidy-text-mining
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path09-nasa-metadata.Rmd
235 lines (177 loc) · 8.37 KB
/
09-nasa-metadata.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
# Case study: mining NASA metadata {#nasa}
```{r echo = FALSE}
library(knitr)
opts_chunk$set(message = FALSE, warning = FALSE, cache = TRUE)
options(width = 100, dplyr.width = 100)
library(ggplot2)
theme_set(theme_light())
```
There are 32,000+ datasets at [NASA](https://www.nasa.gov/), and we can use the metadata for these datasets to understand the connections between them. What is metadata? Metadata is data that gives information about other data, in this case, data about what is in these numerous NASA datasets (but not the datasets themselves). It includes information like the title of the dataset, description fields, what organization(s) within NASA is responsible for the dataset, and so forth. NASA places a high priority on making its data accessible, even requiring all NASA-funded research to be [openly accessible online](https://www.nasa.gov/press-release/nasa-unveils-new-public-web-portal-for-research-results), and the metadata for all its datasets is [publicly available online in JSON format](https://data.nasa.gov/data.json). Let's take a look at this metadata and see what is there.
## Getting the metadata
First, let's download the JSON file and take a look at the names.
```{r download}
library(jsonlite)
metadata <- fromJSON("https://data.nasa.gov/data.json")
names(metadata$dataset)
```
What kind of data is available here?
```{r sapply, dependson = "download"}
sapply(metadata$dataset, class)
```
It seems likely that the title, description, and keywords for each dataset may be most fruitful for drawing connections between datasets. It's a place to start anyway! Let's check them out.
```{r class, dependson = "download"}
class(metadata$dataset$title)
class(metadata$dataset$description)
class(metadata$dataset$keyword)
```
## Wrangling and tidying the data
Let's set up tidy data frames for title, description, and keyword and keep the dataset ids.
```{r title, dependson = "download", message=FALSE}
library(dplyr)
nasa_title <- data_frame(id = metadata$dataset$`_id`$`$oid`,
title = metadata$dataset$title)
nasa_title
```
```{r desc, dependson = "download", dplyr.width = 150}
nasa_desc <- data_frame(id = metadata$dataset$`_id`$`$oid`,
desc = metadata$dataset$description)
nasa_desc
```
These are having a hard time printing out; let’s print out part of a few.
```{r dependson = "desc"}
nasa_desc %>%
select(desc) %>%
sample_n(5)
```
Now we can do the keywords, which must be unnested since they are in a list-column.
```{r keyword, dependson = "download"}
library(tidyr)
nasa_keyword <- data_frame(id = metadata$dataset$`_id`$`$oid`,
keyword = metadata$dataset$keyword) %>%
unnest(keyword)
nasa_keyword
```
Now let's use tidytext's `unnest_tokens` for the title and description fields so we can do the text analysis. Let's also remove common English words.
```{r unnest, dependson = c("title","desc")}
library(tidytext)
nasa_title <- nasa_title %>%
unnest_tokens(word, title) %>%
anti_join(stop_words)
nasa_desc <- nasa_desc %>%
unnest_tokens(word, desc) %>%
anti_join(stop_words)
```
## Some initial simple exploration
What are the most common words in the NASA dataset titles?
```{r dependson = "unnest"}
nasa_title %>%
count(word, sort = TRUE)
```
What about the descriptions?
```{r dependson = "unnest"}
nasa_desc %>%
count(word, sort = TRUE)
```
It looks like we might want to remove digits and some "words" like "v1" from these dataframes before approaching something more meaningful like topic modeling.
```{r mystopwords, dependson = "unnest"}
mystopwords <- data_frame(word = c(as.character(1:10),
"v1", "v03", "l2", "l3", "v5.2.0",
"v003", "v004", "v005", "v006"))
nasa_title <- nasa_title %>%
anti_join(mystopwords)
nasa_desc <- nasa_desc %>%
anti_join(mystopwords)
```
What are the most common keywords?
```{r dependson = "keyword"}
nasa_keyword %>%
group_by(keyword) %>%
count(sort = TRUE)
```
It is possible that "Project completed" may not be a useful set of keywords to keep around for some purposes, and we may want to change all of these to lower or upper case to get rid of duplicates like "OCEANS" and "Oceans". Let's do that, actually.
```{r toupper, dependson = "keyword"}
nasa_keyword <- nasa_keyword %>%
mutate(keyword = toupper(keyword))
```
## Word co-ocurrences
Let's examine which words commonly occur together in the titles and descriptions of NASA datasets. We can then examine a word network in titles/descriptions; this may help us decide, for example, how many topics to look at in topic modeling.
```{r title_words, dependson = "mystopwords"}
library(widyr)
title_words <- nasa_title %>%
pairwise_count(word, id, sort = TRUE)
title_words
```
```{r desc_words, dependson = "mystopwords"}
desc_words <- nasa_desc %>%
pairwise_count(word, id, sort = TRUE)
desc_words
```
Let's plot networks of these co-occurring words.
```{r plot_title, dependson = "title_words", fig.height=6, fig.width=9}
library(ggplot2)
library(igraph)
library(ggraph)
set.seed(1234)
title_words %>%
filter(n >= 250) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
geom_node_point(color = "darkslategray4", size = 5) +
geom_node_text(aes(label = name), vjust = 1.8) +
ggtitle("Word Network in NASA Dataset Titles") +
theme_void()
```
This is a good start, although it looks like there may still a bit more cleaning to be done.
Let's look at the words in descriptions.
```{r plot_desc, dependson = "desc_words", fig.height=6, fig.width=9}
set.seed(2016)
desc_words %>%
filter(n >= 5000) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
geom_node_point(color = "indianred4", size = 5) +
geom_node_text(aes(label = name), vjust = 1.8) +
ggtitle("Word Network in NASA Dataset Descriptions") +
theme_void()
```
Here there are such *strong* connections between the top dozen or so words (words like "data", "resolution", and "instrument") that we may do better if we exclude these very highly connected words. Also, this may mean that tf-idf (as described in detail in [Chapter 4](#tfidf)) will be a good option to explore. But for now, let's add a few more stop words and look at one more word network for the description fields. Notice how we use `bind_rows` to add more custom stop words to the words we are already using; this approach can be used in many instances.
```{r plot_desc2, dependson = "desc_words", fig.height=6, fig.width=9}
mystopwords <- bind_rows(mystopwords,
data_frame(word = c("data", "global",
"instrument", "resolution",
"product", "level")))
nasa_desc <- nasa_desc %>%
anti_join(mystopwords)
desc_words <- nasa_desc %>%
pairwise_count(word, id, sort = TRUE)
set.seed(1234)
desc_words %>%
filter(n >= 4600) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
geom_node_point(color = "indianred4", size = 5) +
geom_node_text(aes(label = name), vjust = 1.8) +
ggtitle("Word Network in NASA Dataset Descriptions") +
theme_void()
```
We still are not seeing clusters the way we did with the titles (the descriptions appear to use very similar words compared to each other), so using tf-idf may be a better way to go when approaching the description fields.
Let's make a network of the keywords to see which keywords commonly occur together in the same datasets.
```{r plot_counts, dependson = "toupper", fig.height=7, fig.width=9}
keyword_counts <- nasa_keyword %>%
pairwise_count(keyword, id, sort = TRUE)
set.seed(1234)
keyword_counts %>%
filter(n >= 700) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
geom_node_point(color = "royalblue3", size = 5) +
geom_node_text(aes(label = name), vjust = 1.8) +
ggtitle("Co-occurrence Network in NASA Dataset Keywords") +
theme_void()
```
These are the most commonly co-occurring words, but also just the most common keywords in general. To more meaningfully examine which keywords are likely to appear together instead of separately, we need to find the correlation among the keywords as described in [Chapter 5](#ngrams).
TODO: correlation of keywords, tf-idf, and topic modeling