You can find our presentation from 27.10.2018 here.
Here you can find two Python-3 codes, which extract all references from a collection of articles and make extracted material ready for Gephy.
Codes require
bs4
module. You can get it usingpip install beutifulsoup4
.
We tried using two collections.
- The
final_dialogue.py
uses a collection of articles from Dialogue conference. It consists of 1748 articles in Russian and English, mostly in PDF format.
It requires one more module:
PyPDF2
. You can also get it withpip install
. So far, Russian PDF-files can't be read at all (pyPDF2 is really bad at it).
Due to some technical problems we were not able to get the citations from this collection, so we tried another one. In the future we may try using an existing Dialogue dataset.
- The
final_cambridge.py
uses a collection of articles from Journal of Linguistics. It consists of 934 articles in English from the Most cited section.
step | status |
---|---|
Crawler | done |
Get/download articles | done |
Get English texts from PDF-files | done |
Get Russian texts from PDF-files | is it even possible? |
Get all citations | NO |
Output data in right format | in theory |
User-friendly interface | almost |
FAIL
step | status |
---|---|
Crawler | done |
Get/download articles | done |
Get all citations | done |
Get authors' surnames right | almost |
Output data in right format | done |
User-friendly interface | done |
SUCCESS
In the data we got from the previous step some autors' names were presented in different ways. Unfortunately, we couldn't get them all to the same format, but we manually corrected some of them (see a list of corrected authors).
The final version of the ready-for-Gephi .csv file can be found here.
The graph we got consisted of 20408 nodes (= authors) and about 60000 edges (= instances of referencing).
We arranges and scaled the nodes by their in-degree (= by how many authors a certain person was referenced) and got the following graph, showing Noam Chomsky to be "the most popular" (he was referenced by 298 out of 948 authors, that's about third of them!)
Well, we decided to delete Noam Chomsky from our network :)
However, this network was still too big to be analysed, so we filetered out some nodes:
- "unpopular authors" - nodes, which in-degree was less than 20. Why 20? That allowed us to make the graph much smaller without losing too much relevant data, see:
- "short papers" - nodes, wich out-degree was less than 100. Why 100? Same answer as before, see:
Than we used the sandart clusterising function to get this:
label size represents the in-degree, node size represents betweenness centrality
As you can see, there are 7 main clusters. See this PDF file for a better quality image.
- Comparative linguistics
- Syntax
- Specialising in English (grammar mostly)
- Sociolinguistics & morphology (really mixed one, we think it shlould be sub-clusterised)
- Typology
- Speech analysis and phonology
- Specialising in Japanese & Oceanic syntax (it is questionable, whether A.M. Zwicky and G.K. Pullum should be here)
In my opinion, they should be also deleted from this network at all. (C) Anna Polyanskaya
- Morphology
- A very small and very specific cluster of two Canadian linguists from Quebec: Carole Paradis and Darlene Lacharité
However, this clusters are not comletely "clean": they are also affected by authors' region of origin and region of interest.
Here you can see an unclusterised network of the same authors. Node size represents betweenness centrality, lable size and redness (brightness of the red, to be percise) represent closeness centrality.
See this PDF file for a better quality image.
This information is to be analysed in the future.
This method allows us to:
- find out most cited authors (most important for linguists of any field of work)
- authomaticlly clusterise linguists by the field of their research ( it is still a WIP, but it does show some relevant results)
- establish reception of ideas in linguistic papers and see how those ideas came to be (by analysing betweenness and closeness centrality)
- Using the Dialogue dataset to make a network of Russian computational linguists
- Analysing centrality and what it means
- Investigating self-referencing (= self-loops)
- Investigating strong bound between authors (= edge weight)
- Investigating a network, filtered the other way around ("upopular authors" and "small papers")
Noam Chomsky in linguistic world (oil, canvas). Source.
Presentation by Olga Vedenina
Output data mangment and presentation by Elizaveta Leonova
Code and presentation by Alina Morse & Anna Polyanskaya
Feel free to contact Anna for any questions.
HSE, 2018