PrabhatCS/etk

etk

This repository will contain our toolkit for extracting information from web pages. It will be built in stages to contain the following capabilities:

  • Several structure extractors to identify the main content of a page and tables
  • A host of data extractors for common entities, including people, places, phone numbers, email addresses, and dates
  • A trainable algorithm to rank extractions
  • Automated experimentation to measure precision and recall of extractions
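
To make the extractor and measurement ideas above concrete, here is a minimal sketch of a regex-based email extractor scored against a hand-labeled gold set. It does not use etk's API: `EMAIL_RE`, `extract_emails`, `precision_recall`, and the sample data are all hypothetical names chosen for illustration, and the pattern is deliberately simple.

```python
import re

# Hypothetical pattern for illustration only; this is not one of etk's extractors.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the set of email-like strings found in text."""
    return set(EMAIL_RE.findall(text))


def precision_recall(extracted, gold):
    """Compute precision and recall of an extracted set against a gold set."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up page text and gold labels (assumed data, not from the repository).
page_text = "Contact alice@example.com or bob@example.org for details."
gold = {"alice@example.com", "carol@example.net"}

found = extract_emails(page_text)
precision, recall = precision_recall(found, gold)
print(sorted(found), precision, recall)
```

Here one of the two extracted addresses is in the gold set and one of the two gold addresses was found, so both precision and recall come out to 0.5.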

Setup

conda-env create .
source activate etk_env
python -m spacy download en

Run Tests

python -m unittest discover
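
By default, unittest discovery collects test cases from files whose names match test*.py. As a sketch of what such a file might look like, the example below tests a toy phone-number extractor. The file name, `PHONE_RE`, and `extract_phones` are assumptions made for illustration and are not taken from this repository.

```python
# test_phone_extractor.py (hypothetical file name matching the default test*.py pattern)
import re
import unittest

# Toy pattern for US-style numbers such as 213-555-0100; not etk's extractor.
PHONE_RE = re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}")


def extract_phones(text):
    """Return all phone-like substrings in text, in order of appearance."""
    return PHONE_RE.findall(text)


class TestPhoneExtractor(unittest.TestCase):
    def test_finds_us_style_number(self):
        self.assertEqual(extract_phones("Call 213-555-0100 today"), ["213-555-0100"])

    def test_returns_empty_list_when_no_number(self):
        self.assertEqual(extract_phones("No contact info here"), [])
```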

Launch Jupyter Notebook

jupyter notebook etk_examples.ipynb
or
jupyter notebook etk_extraction_using_config.ipynb

Before running the code in the notebook, change the kernel to Python [conda env:etk_env].
