GitHub - ianmiell/word_counts: Calculation of vocabulary using Porter stemming of writers based on their corpuses


================================================================================
word_counts
================================================================================

GitHub repo containing the programs used and the output generated for the small
project described here:

http://zwischenzugs.wordpress.com/2011/03/06/shakespeare_unexceptional_vocabulary

================================================================================
Directory = Description
================================================================================
bin       = Contains scripts used to generate results
files     = Contains files of words generated from authors' works. (Please note
            that the works themselves have not been included due to copyright
            concerns. They are freely available online at Project Gutenberg, but
            require editing before analysis.)
results   = Results output for examined writers.



================================================================================
File                       = Description
================================================================================
bin/analyse.pl             = Reads words from stdin and outputs them sorted,
                             with numbers, possessives, and "'d"s replaced
                             with "ed"s.
bin/calculate_all.sh       = Takes all the files matching
                             "../files/*_files/<writer's name>_complete" and
                             writes the results from calculate.sh to
                             ../results/<writer's name>_results.txt
bin/calculate.sh           = Given a filename representing a writer's corpus,
                             outputs: number of words in the corpus; number of
                             words after analyse.pl is run over it; number of
                             unique words after analyse.pl is run over it;
                             number of unique words after stemmer.pl is run
                             over the corpus that has been through analyse.pl.
bin/stemmer.pl             = Takes words as input and outputs them in a
                             canonical Porter-stemmed form.
results/<writer's name>_results.txt
                           = Results of the analysis of the writer's corpus.
files/<writer's name>_files/<writer's name>_complete_words_all_lc
                           = All words used in corpus in lower case.
files/<writer's name>_files/<writer's name>_complete_words_all_lc_uniq
                           = As above, but each word is unique in the file
                             and in alphabetical order.
files/<writer's name>_files/<writer's name>_complete_stemmed_uniq
                           = As above, but each word is stemmed, unique and in
                             alphabetical order.
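
================================================================================
Illustrative sketch of the counts
================================================================================
The four figures that calculate.sh is described as producing can be sketched as
below. This is not the repo's code: it is written in Python rather than Perl or
shell, and the function names (normalise, toy_stem, corpus_counts) are
hypothetical. It assumes that analyse.pl drops tokens containing digits, strips
a trailing possessive "'s", and turns a trailing "'d" into "ed"; the real script
may behave differently. toy_stem is a crude suffix stripper standing in for
stemmer.pl and is NOT the Porter algorithm; pass a real Porter stemmer as the
stem argument to get comparable numbers.

```python
import re


def normalise(text):
    """Return the lower-cased words of text, sorted.

    Assumptions (not taken from the repo): tokens containing digits are
    dropped, a trailing possessive "'s" is stripped, and a trailing "'d"
    becomes "ed".
    """
    words = []
    for token in re.findall(r"[a-z0-9']+", text.lower()):
        if any(ch.isdigit() for ch in token):
            continue  # assumed handling of numbers: drop them
        token = re.sub(r"'s$", "", token)   # king's -> king
        token = re.sub(r"'d$", "ed", token)  # walk'd -> walked
        token = token.strip("'")
        if token:
            words.append(token)
    return sorted(words)


def toy_stem(word):
    """Placeholder suffix stripper; this is NOT Porter stemming."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def corpus_counts(text, stem=toy_stem):
    """Compute the four counts described for calculate.sh."""
    raw_words = text.split()            # whitespace-separated words in the corpus
    words = normalise(text)             # words after the analyse.pl-style pass
    unique_words = sorted(set(words))   # unique words, alphabetical
    unique_stems = sorted({stem(w) for w in unique_words})
    return {
        "raw_words": len(raw_words),
        "normalised_words": len(words),
        "unique_words": len(unique_words),
        "unique_stems": len(unique_stems),
    }


sample = "The king's men walk'd and walked; 3 kings walking."
counts = corpus_counts(sample)
print(counts)
```

Keeping the stemmer as a parameter separates the counting logic from the choice
of stemming algorithm, so the placeholder can be swapped out without touching
the rest of the sketch.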