ethnicityguesser

A machine-learning classifier based on nltk's maxent classifier to guess the ethnicity of a last name.

DEPENDENCIES: nltk

CURRENT ETHNICITIES: african chinese czech danish french indian italian japanese jewish spanish

DOCUMENTATION:

Note that all methods available to NLTK's MaxentClassifier are also available to the ethnicity classifier. See NLTKMaxentEthnicityClassifier.py for details; in general the method names are identical (e.g. NLTKMaxentEthnicityClassifier.prob_classify(), classify(), explain(), etc.). Only train() works a little differently, because the implementation takes a list of names rather than a list of featuresets.
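
Because train() starts from names rather than featuresets, each name has to be turned into a feature dictionary at some point. The sketch below shows roughly what such a conversion could look like. The helper names (name_features, names_to_featuresets) and the chosen features are assumptions made for illustration only; the features actually used are defined in NLTKMaxentEthnicityClassifier.py and may differ.

```python
def name_features(name):
    """Hypothetical feature extractor: map one last name to a feature dict."""
    name = name.lower()
    return {
        "first_letter": name[:1],
        "last_two": name[-2:],
        "last_three": name[-3:],
        "length": len(name),
    }


def names_to_featuresets(tokens):
    """Flatten ([names], ethnicity) pairs into (feature dict, ethnicity) pairs.

    The result has the list-of-(featureset, label) shape that NLTK
    classifiers are trained on.
    """
    return [
        (name_features(name), ethnicity)
        for names, ethnicity in tokens
        for name in names
    ]


tokens = [(["leventhal", "cohen"], "jewish"), (["sekhri"], "indian")]
featuresets = names_to_featuresets(tokens)
```

Each name becomes one labelled training example, so an ethnicity with many names contributes many featuresets.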

Usage:

-- From the Python interpreter:

FROM SCRATCH:

Instantiation:

>>> from runner import make_classifier
>>> mxec = make_classifier()

Training: (note that make_classifier already passes in training tokens)

>>> mxec.train()

Classification: (After training)

>>> mxec.classify('leventhal')
'jewish'
>>> mxec.classify('sekhri')
'indian'

FROM PICKLE:

Instantiation:

>>> import cPickle as pickle
>>> pickle_file = open('pickled_classifiers/[insert pickle file here, e.g jewishandindian.pkl]', 'rb')
>>> mxec = pickle.load(pickle_file)
>>> pickle_file.close()

Training: Done for you

Classification:

>>> mxec.classify('leventhal')
'jewish'
>>> mxec.classify('sekhri')
'indian'
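
The pickle-loading steps above can also be wrapped in a small helper that closes the file even if unpickling fails. This is a minimal sketch, not part of the repository: load_pickle is a hypothetical name, and the example uses Python 3's pickle module (the interpreter session above uses Python 2's cPickle). It round-trips a stand-in dictionary instead of a real classifier, since a classifier pickled under Python 2 may need extra care to load under Python 3.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Load one pickled object, closing the file even if unpickling fails."""
    with open(path, "rb") as pickle_file:
        return pickle.load(pickle_file)


# Round-trip demonstration with a stand-in object rather than a real classifier.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
with open(demo_path, "wb") as out_file:
    pickle.dump({"labels": ["jewish", "indian"]}, out_file)

loaded = load_pickle(demo_path)
```

As with any pickle, only load files from a source you trust, because unpickling can execute arbitrary code.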

-- In other code (as in a bigger project, etc):

from NLTKMaxentEthnicityClassifier import NLTKMaxentEthnicityClassifier as mxec
classifier = mxec(tokens) ## tokens must be a list of ([list of names], 'ethnicity') pairs. Ethnicities can be repeated.
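
As an illustration of the tokens structure described above, the following sketch builds the list of ([list of names], 'ethnicity') pairs from a plain dictionary. build_tokens is a hypothetical helper, not something the repository provides, and lowercasing the names is an assumption based on the lowercase examples in this README.

```python
def build_tokens(names_by_ethnicity):
    """Hypothetical helper: turn {ethnicity: [names]} into ([names], ethnicity) pairs."""
    tokens = []
    for ethnicity, names in sorted(names_by_ethnicity.items()):
        # Drop blank entries and normalise case and surrounding whitespace.
        cleaned = [n.strip().lower() for n in names if n.strip()]
        if cleaned:  # skip ethnicities that have no usable names
            tokens.append((cleaned, ethnicity))
    return tokens


tokens = build_tokens(
    {"jewish": ["Leventhal", " Cohen "], "indian": ["Sekhri"], "czech": []}
)
# An ethnicity may appear in more than one pair.
tokens.append((["goldberg"], "jewish"))
```

The resulting list is what would be passed as tokens when constructing the classifier.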

Training and Classification as above.

Pickling:

>>> mxec.pickleme(directory_name)

The pickled_names directory contains pickled ([list of names], 'ethnicity') pairs.

The pickled_classifiers directory contains pickled, already-trained classifiers.

POSSIBLE FUTURE WORK:

Move the data files over to the talentworks-data repository, create a new classifier written using our BaseClassifier but with the features identified by ethnicityguesser, and confirm that the precision/recall (P/R) stats do not regress.
