---
title: "Criminal Justice Dissertation Topic Analysis"
author: "Connor Concannon"
date: '2016-11-06'
output:
  html_document:
    fig_height: 7
    fig_width: 9
editor_options:
  chunk_output_type: console
---
## Introduction
Months ago, I was perusing the [American Society of Criminology's](http://asc41.com) website and came across a [list of recent dissertations](http://www.asc41.com/graduates1.htm), including topics, authors, and institutions.
The list was interesting to me for a few reasons. One, I need to begin thinking about a dissertation topic, and I would like to get a sense of recent dissertations without having to actually read them. Two, the list was an opportunity to do some exploratory analysis and use techniques and packages I am not too familiar with, namely web scraping. And three, it was a chance to publish a quick analysis and begin building a portfolio of analytical projects that are not necessarily work-related.
First, load a bunch of libraries.
```{r setup, include=T,message=F,warning=F,echo=F}
# Rgraphviz is distributed through Bioconductor; install it once if needed:
#source("http://bioconductor.org/biocLite.R")
#biocLite("Rgraphviz")
#library(Rgraphviz)
library(bnlearn)
#install.packages('bnlearn')
#devtools::install_github("espanta/lubripack")

# lubripack() installs any packages that are missing and then loads them
pkg <- c('tidytext','rvest','stringr','dplyr','tm','SnowballC','wordcloud','DescTools',
         'ggmap','Amelia','leaflet','highcharter','LDAvis','graph','tidyr','Rgraphviz','ggplot2')
library(lubripack)
lubripack(pkg)

# suppressPackageStartupMessages(
#   sapply(pkg,library,character.only=T)
# )
# options(scipen=1000)
```
## Scraping and Cleaning
Next, I take the URL, extract the data, and clean it up quite a bit. I had never tried any web scraping, but rvest makes it seem easy. I am sure this could be improved, chained into a neater workflow (a rough sketch of one appears after the chunk), errors handled, and so on, but I am not striving for perfection. This was the most frustrating part, as the page lacks a consistent format for the information. I am most interested in the topics, so the rest is not a big deal. The code is not too interesting, but can be [viewed here](https://github.com/concannon/ASCScraping/blob/master/rvest.R).
After parsing out the various fields, I geocoded the institutions with `ggmap::geocode()`.
```{r,warning=F,message=F,echo=F, eval=F}
#get data
phds <- read_html('http://www.asc41.com/graduates1.htm')
d <- phds %>%
html_nodes('p , em , strong') %>%
html_text() %>%
#gsub('“','"',.) %>%
gsub('\r\n','',.) %>%
gsub("[\\]",'',.)
##Parse
#Years
yr <- "[1-9][0-9]{3}"
years <- str_locate_all(d,yr)
yearloc <- lapply(years,'[',1)
yearloc <- data.frame(yearLoc=unlist(yearloc))
years <- data.frame(Year=as.integer(str_extract(d,yr)))
#find first period
period <- str_locate_all(d,"\\.")
period <- lapply(period,'[',1)
period <- data.frame(PeriodLoc=unlist(period))
#find special characters
char <- str_locate_all(d,"\\?")
char <- lapply(char,'[',1)
char <- data.frame(CharLoc=unlist(char))
#str_sub(d,1,char$CharLoc)
#find first comma
comma <- str_locate_all(d,",")
comma <- lapply(comma,'[',1)
comma <- data.frame(commaLoc=unlist(comma))
#str_sub(d,1,comma$commaLoc)
#find second comma
comma2 <- str_locate_all(d,",")
comma2 <- lapply(comma2,'[',2)
comma2 <- data.frame(comma2Loc=unlist(comma2))
#str_sub(d,1,comma2$comma2Loc)
#extract name
matches <- cbind(period,char,comma,comma2)
matches$min <- apply(matches[,c(1,4)],1,min,na.rm=T)
names <- data.frame(Name=str_sub(d,1,matches$min))
#extract chair
chair <- str_locate_all(d,"Chaired")
chair <- lapply(chair,'[',1)
chair <- data.frame(chairLoc=unlist(chair))
chairLoc <- chair
chair <- str_sub(d,chair$chairLoc,yearloc$yearLoc-1)
chair <- str_trim(chair)
chair <- str_replace(chair,"Chaired by","")
chair <- str_replace(chair,"Chaired","")
chair <- str_replace(chair,"by","")
chair <- str_trim(chair)
chair <- gsub('January|February|March|April|May|June|July|August|September|October|November|December',
'',chair)
chair <- str_trim(chair)
chair <- str_replace_all(chair,"[[:punct:]]","")
chair <- str_replace_all(chair,"\\t","")
chair <- str_replace_all(chair,"Dr","")
chair <- str_replace_all(chair,"PhD","")
chair <- str_replace_all(chair," ","")
chair <- str_trim(chair)
##Institution
inst <- str_sub(d,yearloc$yearLoc+4,str_length(d))
inst <- str_replace_all(inst,"[[:punct:]]","")
inst <- gsub("[[:digit:]]","",inst)
#inst <- gsub("[^[:alnum:]]","",inst)
inst <- str_trim(inst)
inst <- gsub("chaired by","",inst)
inst <- gsub("chaired","",inst)
inst <- gsub("by","",inst)
inst <- data.frame(inst)
##Geocode them
univ <- data.frame(inst=unique(inst))
univ$inst <- as.character(univ$inst)
univ$inst <- gsub('CUNY Graduate Center','New York City,NY',univ$inst)
univ <- na.omit(univ)
#geo <- geocode(univ$inst)
#saveRDS(geo,'geo.rds')
geo <- readRDS('geo.rds')
univ <- cbind(univ,geo)
names(univ) <- c('inst','lon','lat')
inst <- left_join(inst,univ,by='inst')
##title
title <- str_sub(d,matches$min,chairLoc$chairLoc)
title <- str_replace_all(title,"[[:punct:]]","")
#title <- str_trim(title)
#title <- gsub("[^[:alnum:]]","",title)
title <- data.frame(Title=title)
data <- cbind(years,names,chair,inst,title)
# missmap(data)
# Desc(data)
saveRDS(data,'Clean Grads2.rds')
rm(list=ls())
```
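
As noted above, the positional parsing could be chained into a neater workflow. Here is a rough, untested sketch of one alternative: a single regular expression per entry, applied with `tidyr::extract()`. The column names and the pattern are assumptions about the page layout, not the parsing actually used for the results below.

```{r, eval=F}
# Hypothetical one-pass parse of the scraped vector `d` from the chunk above.
# The regex is illustrative only: entries that deviate from the assumed
# "Name. Title. Chaired by X, Month Year. Institution" layout come back as NA.
parsed <- data.frame(raw = d, stringsAsFactors = FALSE) %>%
  tidyr::extract(raw,
                 into  = c("Name", "Title", "Chair", "Year", "Institution"),
                 regex = paste0("^(.*?)\\.\\s*(.+?)\\s+Chaired by\\s+(.*?),?\\s*",
                                "(?:January|February|March|April|May|June|July|August|",
                                "September|October|November|December)?\\s*",
                                "([1-9][0-9]{3})\\.?\\s*(.*)$"),
                 remove = FALSE) %>%
  mutate(Year = as.integer(Year))
```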
## Analysis
This plot shows the number of cases with missing data and which variables are affected. Latitude and longitude are the most problematic: I had trouble parsing the chair apart from the institution, and in some instances (e.g., CUNY Graduate Center) Google did not recognize the institution, so it was never geocoded.
```{r}
g <- readRDS("Clean Grads2.rds")
# keep dissertations listed between 2003 and 2015
g <- g %>% filter(Year > 2002, Year < 2016)
missmap(g, col = c('dimgrey', 'dodgerblue'))
```
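
As a quick numeric companion to the missingness map, the per-column NA counts can be tallied directly (a minimal sketch):

```{r}
# Count missing values per column; complements the missingness map above.
sapply(g, function(x) sum(is.na(x)))
```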
## Locations
Some of these locations were not properly geocoded, but you can still see the volume in the New York area, Michigan, Omaha, and Los Angeles.
```{r}
gjs <- g %>%
  select(inst, lat, lon) %>%
  filter(!is.na(lon)) %>%
  group_by(inst, lon = round(lon, 5), lat = round(lat, 5)) %>%
  summarise(num = n())

# keep one institution label per coordinate (the most frequent)
inst <- gjs %>% group_by(lon, lat) %>% arrange(-num) %>% filter(row_number() == 1)

test <- left_join(gjs, inst, by = c('lon', 'lat')) %>%
  ungroup() %>%
  select(inst.y, lon, lat, num.y) %>%
  unique()

leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  setView(lat = 40.95, lng = -96.58, zoom = 4) %>%
  addCircleMarkers(lng = ~lon, lat = ~lat, radius = ~num.y/2,
                   data = test,
                   popup = ~inst.y)
```
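
For the records that did not make it onto the map, a quick tally of institutions with missing coordinates (a minimal sketch) shows where the geocoding failed:

```{r}
# Institutions that failed to geocode (missing longitude), most frequent first.
g %>%
  filter(is.na(lon)) %>%
  count(inst, sort = TRUE)
```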
## Wordcloud
It had to be done. Wordclouds are trite plots, but they are also a good first step in examining text data. Perhaps not surprisingly, police (and related variants) is the most common term, and by a long shot.
```{r,warning=F,fig.width=8,fig.height=8}
# custom stop words: boilerplate title terms, months, names, and scraping artifacts
my_stop_words <- data.frame(word = c('univers','graduate','social','florida','houston',
                                     'michael','john','college','jay',
                                     'university','chaired','criminal','august','state',
                                     'may','december','crime','justice','the',
                                     'study','analysis','examination','and','the',
                                     '\\?','\\???','o','othe','use','among'),
                            stringsAsFactors = FALSE)

gtidy <- g %>%
  dplyr::select(Title) %>%
  mutate(Title = as.character(Title)) %>%
  na.omit() %>%
  unnest_tokens(word, Title) %>%
  anti_join(stop_words) %>%
  anti_join(my_stop_words)

gtidy %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 250))

top20 <- gtidy %>%
  count(word) %>%
  arrange(-n) %>%
  top_n(20)
top20

highchart() %>%
  hc_title(text = 'Top 20 Terms in CJ Dissertations', align = 'left') %>%
  hc_subtitle(text = 'Mentions in Dissertation Titles', align = 'left') %>%
  hc_add_series_labels_values(top20$word, top20$n,
                              name = 'Mentions', type = 'bar', dataLabels = list(enabled = F)) %>%
  hc_xAxis(categories = top20$word)
```
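
To see how much those "related variants" matter, the tokens can be collapsed with a stemmer before counting. This is a minimal illustrative sketch using SnowballC (already loaded above), not part of the original counts:

```{r}
# Collapse variants such as police/policing/policed to a common stem, then recount.
gtidy %>%
  mutate(stem = SnowballC::wordStem(word)) %>%
  count(stem, sort = TRUE) %>%
  head(10)
```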
## Word Association
The tm package has a great function, `findAssocs()`. It returns, for a given input term, the words whose correlation with that term meets a specified minimum. 'Risk' is strongly associated with modeling, testing, and instruments. The 'police' search also shows the influence of public-facing actions: contact and situations.
```{r,echo=T,warning=F}
ds <- VectorSource(na.omit(as.character(g$Title)))
term.matrix <- TermDocumentMatrix(Corpus(ds),
                                  control = list(removePunctuation = T,
                                                 tolower = T,
                                                 removeNumbers = T,
                                                 stopwords = T))
findAssocs(term.matrix, "risk", .2)
findAssocs(term.matrix, "police", .2)
```
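
Under the hood, `findAssocs()` is essentially reporting the correlation between two terms' frequency vectors across documents (here, titles). A rough manual check of a single pair looks like the sketch below; 'assessment' is just a hypothetical comparison term and may not be present, hence the guard.

```{r}
# Correlate two terms' document-frequency vectors; approximates what
# findAssocs() reports for that pair.
m <- as.matrix(term.matrix)
if (all(c("risk", "assessment") %in% rownames(m))) {
  cor(m["risk", ], m["assessment", ])
}
```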
## Tidy Text
tidytext is a package from David Robinson and Julia Silge; some of these examples come straight from their vignettes. Risk, offender, violent, delinquency, and prison carry the most negative sentiment, while work, support, intimate, fairness, and trust are mostly positive.
```{r}
# bing lexicon: positive/negative labels per word
bing <- sentiments %>%
  filter(lexicon == 'bing') %>%
  select(-score)

sent <- tidy(term.matrix) %>%
  inner_join(bing, by = c(term = 'word')) %>%
  count(term, sentiment, wt = count) %>%
  arrange(-n) %>% filter(n > 2) %>%
  ungroup() %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  arrange(sentiment)

sent %>%
  mutate(term = reorder(term, sentiment)) %>%
  ggplot(aes(term, sentiment, fill = sentiment)) + geom_bar(stat = 'identity') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle('Term Sentiment')

ggplot(sent, aes(x = negative, y = positive, group = term)) + geom_jitter() +
  geom_text(aes(label = term), check_overlap = T)
```
## Conclusion
This quick analysis shows the prevalence of police-related research. I was surprised how low the term 'risk' ranked. I suppose dissertations might often deal with the risk of a certain outcome without necessarily mentioning it in the title. I became a bit more comfortable with some text packages, so it was all worth it.
```{r,echo=F,include=F}
## TF-IDF
# Term frequency-inverse document frequency is a weight used to measure how important a particular word is to a document in a collection. There is still some cleaning left to do, but this indicates that terms like 'police' and
words <- tidy(term.matrix) %>%
mutate(total=sum(count)) %>%
bind_tf_idf(term,document,count) %>%
select(-total) %>%
arrange(tf_idf) %>%
select(-count,-tf) %>%
unique() %>%
arrange(tf_idf)
words
```