network-similarity

This repository holds the code and miscellaneous files that I generated in the process of writing my master's thesis. A subset of them can be used as a general package for creating a citation network from a collection of .txt files containing the references of your parent papers in a standard format. There are a lot of extra files, so here's a guide to what's important:

The "Marissa Graham Thesis" PDF in the ThesisClass folder is the final draft.
The "Thesis Presentation" PDF in the main folder contains the Beamer slides for my defense.
The "bibliographies" folder contains the reference lists I formatted by hand. The "ref_lists" folder stores the (manually corrected) references for each parent paper in the Paper class format. None of the other .csv or .txt files are necessary to store and load the dataset itself.
All of the .png files are necessary to compile my thesis and slides. A few of them are only for the slides, but I didn't keep track of which ones they are.
Getting things to play nice with Google Sheets will involve some JSON files and probably a rabbit hole of googling things if you're not used to API keys. Maybe even if you are.
The .png and .gml files are all in the main folder because the prospect of making sure LaTeX and Mathematica could still find them was more annoying than just scrolling a lot all the time.

Most relevant Mathematica notebooks:

"Miscellaneous Figures" (didn't put much effort into documentation since the figures have context in the thesis itself).
"Subject Tagged Stuff" (good documentation).
"Full Citation Network Statistics and Visualizations" (ok documentation).

.py files:

paper.py is a smallish file containing the Paper class for storing the metadata I cared about for each paper, looking up citations in CrossRef, and duplicate testing.
database.py is the workhorse file. Its Database class creates a database from a collection of reference lists for individual papers by parsing each entry and looking it up in CrossRef; writes spreadsheets of incorrect entries for easier manual correction; updates a database based on manually corrected spreadsheets (both .csv and Google Sheets); writes the citation network the database represents to a GML file; and creates various convenience file dumps and the BibTeX file.
subject_assignment.py is just a collection of functions which I used to tag papers with a subject based on keywords in their journal titles. Not intended to be part of the main package.
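
For a rough idea of what a CrossRef lookup involves, here is a minimal sketch using only the standard library. It is not the code in paper.py: the function names and the choice of returned fields are assumptions made for illustration. It relies on the public CrossRef REST API's /works endpoint with the query.bibliographic parameter, and on the "message" / "items" layout of its JSON responses.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(citation, rows=1):
    """Build a CrossRef /works URL that searches for a free-text citation."""
    params = {"query.bibliographic": citation, "rows": rows}
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def summarize_item(item):
    """Pull a few metadata fields out of one CrossRef work record (a dict)."""
    # CrossRef stores titles and journal names as lists; take the first entry.
    titles = item.get("title") or [None]
    journals = item.get("container-title") or [None]
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in item.get("author", [])
    ]
    return {
        "doi": item.get("DOI"),
        "title": titles[0],
        "journal": journals[0],
        "authors": authors,
    }


def lookup(citation):
    """Query CrossRef and return the top match, or None (needs network access)."""
    with urllib.request.urlopen(build_query_url(citation), timeout=30) as resp:
        items = json.load(resp)["message"]["items"]
    return summarize_item(items[0]) if items else None
```

The top hit for a free-text citation is not always the right paper, which is presumably why the manual correction passes described above were needed.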

.ipynb files:

Similarity Scoring Methods and Coupled Node-Edge Similarity Measure were toy implementations of a couple of specific algorithms and aren't relevant to the final thesis.
Scraping Attempts is from very early on, when I found out that you can't just scrape all the references for a DOI with BeautifulSoup.
Database Statistics contains basic statistics for the database that don't involve the citation network.
Bibliography Parsing has some remnants of the process of reading and writing .csv files to correct the database, but is mostly the process of tagging subjects via journal titles, and then a bit of scratch work from another class for some reason.
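
The idea of tagging subjects via journal titles can be sketched as a keyword match. This is an illustration only, not the contents of subject_assignment.py: the subject names, the keyword stems, and the function name are all hypothetical.

```python
# Hypothetical subject -> keyword stems mapping; the real lists are not shown here.
SUBJECT_KEYWORDS = {
    "biology": ("bio", "genom", "protein"),
    "computer science": ("comput", "algorithm", "pattern recognition"),
    "mathematics": ("math", "combinatori"),
    "physics": ("phys",),
}


def tag_subject(journal_title, keywords=SUBJECT_KEYWORDS, default="unknown"):
    """Return the first subject whose keyword stem appears in the journal title."""
    if not journal_title:
        return default
    title = journal_title.lower()
    # Dict order decides ties, so earlier subjects win when several match.
    for subject, stems in keywords.items():
        if any(stem in title for stem in stems):
            return subject
    return default
```

Because the first match wins, a title such as "Journal of Computational Biology" is tagged by whichever subject is listed first, so the ordering of the mapping is itself a design decision.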

.gml files:

citation_network.gml is the main citation network dataset used for tables and figures.
sciMet_dataset.gml and zewail_dataset.gml are the citation networks I used for comparison to my network. The cited source for these gives them to you in Pajek NET format, which Mathematica didn't recognize, so I loaded them in NetworkX and wrote the GML files myself.
The rest were either used to generate figures, or created in the process of making figures. See the Miscellaneous Figures Mathematica notebook for details.
I did not save the specific random networks used to generate figures. In hindsight, those numbers should have been averaged over multiple trials.
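
The Pajek-to-GML conversion mentioned above for the comparison datasets can be sketched with NetworkX as follows. This is a reconstruction under assumptions, not the exact code I ran: the function name is made up, and collapsing parallel edges into a simple graph is a choice made here for illustration.

```python
import networkx as nx


def pajek_to_gml(net_path, gml_path):
    """Read a Pajek NET file and write the same network as a GML file."""
    # read_pajek returns a multigraph, so collapse it to a simple graph first.
    multigraph = nx.read_pajek(net_path)
    if multigraph.is_directed():
        graph = nx.DiGraph(multigraph)
    else:
        graph = nx.Graph(multigraph)
    nx.write_gml(graph, gml_path)
    return graph
```

Calling pajek_to_gml on each of the two downloaded NET files would produce GML files that Mathematica can import.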

The Paper and Database classes should be usable as-is (the Jupyter notebooks have examples of how I'd call them), but I haven't tested them to a production-ready level, so use them at your own risk.
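
To show the shape of the data without either class, here is a minimal sketch of going from reference lists to a directed citation network in GML. It does not use or reproduce the Database class: the function name, the dict-of-lists input, and the short string keys are assumptions made for illustration.

```python
def citation_gml(ref_lists):
    """Return GML text for the directed citation network in ref_lists.

    ref_lists maps each parent paper's key to the keys of the papers it cites
    (a hypothetical in-memory form, not the repository's on-disk format).
    """
    # One edge per distinct (citing, cited) pair; self-citations are skipped.
    edges = sorted(
        {(p, c) for p, cited in ref_lists.items() for c in cited if c != p}
    )
    # Every parent and every cited paper becomes a node with an integer id.
    nodes = sorted(set(ref_lists) | {c for cited in ref_lists.values() for c in cited})
    ids = {key: i for i, key in enumerate(nodes)}

    lines = ["graph [", "  directed 1"]
    for key in nodes:
        lines += ["  node [", f"    id {ids[key]}", f'    label "{key}"', "  ]"]
    for parent, cited in edges:
        lines += [
            "  edge [",
            f"    source {ids[parent]}",
            f"    target {ids[cited]}",
            "  ]",
        ]
    lines.append("]")
    return "\n".join(lines) + "\n"
```

The sketch assumes the keys contain no double quotes, since labels are written without any escaping.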

