Skip to content

ianmiell/word_counts

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 

Repository files navigation

================================================================================
word_counts
================================================================================

github repo containing the programs used and output generated for the small
project described here:

http://zwischenzugs.wordpress.com/2011/03/06/shakespeare_unexceptional_vocabulary

================================================================================
Directory = Description
================================================================================
bin       = Contains scripts used to generate results
files     = Contains files of words generated from authors' works (please note
            that the works themselves have not been included due to copyright 
            concerns. They are freely available online at Project Gutenberg, but
            require editing before analysis).
results   = Results output for examined writers.



================================================================================
File                       = Description
================================================================================
bin/analyse.pl             = Takes stdin, outputs words sorted with numbers,
                             possessives, and "'d"s replaced with "ed"s.
bin/calculate_all.sh       = Takes all the files in
                             "../files/*_files/<writer's name>_complete" and
                             outputs results from calculate.sh to
                             ../results/<writer's name>_results.txt
bin/calculate.sh           = Given a filename representing a writer's corpus, 
                             outputs: number of words in corpus; number of
                             words after analyse.pl is run over it; number of
                             unique words after analyse.pl is run over it; 
                             number of unique words after stemmer.pl is run
                             on corpus that has been through analyse.pl
bin/stemmer.pl             = Takes words as inputs and outputs them in a 
                             canonical Porter-stemmed form.
results/<writer's name>_results.txt
                           = Results of writer's corpus's analysis.
files/<writer'name>_files/<writer's name>_complete_words_all_lc
                           = All words used in corpus in lower case.
files/<writer'name>_files/<writer's name>_complete_words_all_lc_uniq
                           = As above, but each word is unique in the file
                             and in alphabetical order.
files/<writer'name>_files/<writer's name>_complete_stemmed_uniq
                           = As above, but each word is stemmed, unique and in
                             alphabetical order.




About

Calculation of vocabulary using Porter stemming of writers based on their corpuses

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published