Citation network

You can find our presentation from 27.10.2018 here.

Part 1: getting data

Here you can find two Python 3 scripts that extract all references from a collection of articles and prepare the extracted material for Gephi.

The scripts require the bs4 module. You can get it with pip install beautifulsoup4.
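
Just to illustrate the extraction step, here is a minimal bs4 sketch (the CSS selector is a placeholder; the real markup of a reference list differs from site to site):

```python
from bs4 import BeautifulSoup

def extract_references(html: str) -> list[str]:
    """Return the raw text of every entry in a page's reference list."""
    soup = BeautifulSoup(html, "html.parser")
    # "div.references li" is a hypothetical selector, not the one our
    # scripts use; adjust it to the actual markup of the article pages.
    return [li.get_text(strip=True) for li in soup.select("div.references li")]
```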

We tried using two collections.

The first one, the Dialogue conference proceedings, requires one more module: PyPDF2, which you can also get with pip install. So far, Russian PDF files can't be read at all (PyPDF2 handles them poorly).
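
A minimal sketch of the PDF-to-text step, written against the current PyPDF2 API (the actual scripts may use the older PdfFileReader / extractText interface):

```python
from PyPDF2 import PdfReader

def pdf_to_text(path: str) -> str:
    """Concatenate the extractable text of every page in a PDF."""
    reader = PdfReader(path)
    # For Cyrillic PDFs this often returns empty or garbled strings,
    # which is exactly the Russian-PDF problem described above.
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```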

Due to some technical problems we were not able to get the citations from this collection, so we tried another one. In the future we may try using an existing Dialogue dataset.


Progress for each collection

Dialogue

| step | status |
|------|--------|
| Crawler | done |
| Get/download articles | done |
| Get English texts from PDF files | done |
| Get Russian texts from PDF files | is it even possible? |
| Get all citations | NO |
| Output data in the right format | in theory |
| User-friendly interface | almost |

FAIL

Journal of Linguistics

| step | status |
|------|--------|
| Crawler | done |
| Get/download articles | done |
| Get all citations | done |
| Get authors' surnames right | almost |
| Output data in the right format | done |
| User-friendly interface | done |

SUCCESS

Part 2: data management

In the data we got from the previous step, some authors' names were presented in different ways. Unfortunately, we couldn't get them all into the same format, but we manually corrected some of them (see a list of corrected authors).
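
The corrections can be thought of as a plain variant-to-canonical mapping; a sketch with invented example entries (not from our actual list):

```python
# Maps every variant spelling we noticed to one canonical surname.
# The entries below are made-up illustrations.
CORRECTIONS = {
    "Chomsky N.": "Chomsky",
    "N. Chomsky": "Chomsky",
}

def normalize(author: str) -> str:
    author = author.strip()
    return CORRECTIONS.get(author, author)
```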

The final version of the ready-for-Gephi .csv file can be found here.
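
For reference, Gephi's import wizard accepts an edge list as a CSV with "Source" and "Target" columns (plus an optional "Weight"); a sketch of writing one:

```python
import csv

def write_edges(edges, path="edges.csv"):
    """edges: an iterable of (citing_author, cited_author) pairs."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target"])  # header Gephi recognises
        writer.writerows(edges)
```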

Part 3: networks in Gephi

Full data

The graph we got consisted of 20,408 nodes (= authors) and about 60,000 edges (= instances of referencing).

We arranged and scaled the nodes by their in-degree (= by how many authors a certain person was referenced) and got the following graph, showing Noam Chomsky to be "the most popular" (he was referenced by 298 out of 948 authors, which is about a third of them!).

Well, we decided to delete Noam Chomsky from our network :)
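
In code terms, both steps look roughly like this (a networkx sketch for illustration; the project itself did this through Gephi's appearance panel):

```python
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("Author A", "Chomsky"),  # edge = "Author A references Chomsky"
    ("Author B", "Chomsky"),
    ("Author B", "Sapir"),
])

# In-degree = by how many authors a person is referenced.
most_cited, count = max(G.in_degree(), key=lambda pair: pair[1])
print(most_cited, count)  # Chomsky 2

# ...and the drastic fix from above:
G.remove_node(most_cited)
```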

However, this network was still too big to be analysed, so we filtered out some nodes (a networkx sketch of both filters follows the list):

  • "unpopular authors" - nodes, which in-degree was less than 20. Why 20? That allowed us to make the graph much smaller without losing too much relevant data, see:

  • "short papers" - nodes, wich out-degree was less than 100. Why 100? Same answer as before, see:

Filtered out

Then we used the standard clustering function to get this:

label size represents the in-degree, node size represents betweenness centrality

As you can see, there are 7 main clusters. See this PDF file for a better quality image.
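
Gephi's standard clustering here is its Modularity statistic, which runs Louvain community detection; a rough Python equivalent (networkx >= 2.8), purely for illustration:

```python
import networkx as nx

def cluster_authors(G: nx.DiGraph):
    # Louvain is defined on undirected graphs, so symmetrize first.
    return nx.community.louvain_communities(G.to_undirected(), seed=42)
```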

Analysing clusters

  1. Comparative linguistics
  2. Syntax
  3. Specialising in English (grammar mostly)
  4. Sociolinguistics & morphology (a really mixed one; we think it should be sub-clustered)
  5. Typology
  6. Speech analysis and phonology
  7. Specialising in Japanese & Oceanic syntax (it is questionable whether A.M. Zwicky and G.K. Pullum should be here)

In my opinion, they should be deleted from this network altogether. (c) Anna Polyanskaya

  8. Morphology
  9. A very small and very specific cluster of two Canadian linguists from Quebec: Carole Paradis and Darlene Lacharité

However, these clusters are not completely "clean": they are also affected by the authors' region of origin and region of interest.

Short note on centrality

Here you can see an unclustered network of the same authors. Node size represents betweenness centrality; label size and redness (the brightness of the red, to be precise) represent closeness centrality.

See this PDF file for a better quality image.
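
The two centralities behind the sizing can be recomputed with networkx for comparison (an assumption: we read them off Gephi's statistics panel):

```python
import networkx as nx

def centralities(G: nx.DiGraph):
    betweenness = nx.betweenness_centrality(G)  # who bridges the clusters
    closeness = nx.closeness_centrality(G)      # who is close to everyone else
    return betweenness, closeness
```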

This information is to be analysed in the future.

Conclusion

This method allows us to:

  • find the most cited authors (the most important ones for linguists in any field)
  • automatically cluster linguists by their field of research (still a WIP, but it does show some relevant results)
  • trace the reception of ideas in linguistic papers and see how those ideas came about (by analysing betweenness and closeness centrality)

Further work

  • Using the Dialogue dataset to make a network of Russian computational linguists
  • Analysing centrality and what it means
  • Investigating self-referencing (= self-loops)
  • Investigating strong bonds between authors (= edge weight)
  • Investigating a network filtered the other way around (keeping the "unpopular authors" and "short papers")

and a meme for y'all

Noam Chomsky in the linguistic world (oil on canvas). Source.


Presentation by Olga Vedenina

Output data management and presentation by Elizaveta Leonova

Code and presentation by Alina Morse & Anna Polyanskaya

Feel free to contact Anna for any questions.

HSE, 2018
