A set of Python tools to download the Senate and House transcripts and convert them to usable text.
sh setup.sh
sh build-corpera.sh
The text transcripts are written to transcripts-txt/ and are named by chamber of Congress and date.
- Downloading PDFs by date range
- Converting them into usable text
- Separating the text by speaker and eliminating non-spoken text (see SeperateSpeeches.py)
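
The last step, splitting text by speaker, can be illustrated with a minimal sketch. This is not the code in SeperateSpeeches.py. The function name `separate_speeches` and the speaker pattern (a title followed by an upper-case surname, such as "Mr. SMITH." at the start of a line) are assumptions made for illustration, and real transcripts may need a more careful pattern.

```python
import re

# Assumed speaker marker at the start of a line, e.g. "Mr. SMITH." or
# "Ms. JONES of Ohio.". This pattern is a guess, not taken from the project.
SPEAKER_RE = re.compile(
    r"^(?P<speaker>(?:Mr|Mrs|Ms)\. [A-Z][A-Z'\-]+"
    r"(?: of [A-Z][a-z]+(?: [A-Z][a-z]+)*)?)\.\s*"
)


def separate_speeches(text):
    """Return a list of (speaker, speech) tuples found in `text`.

    Lines that appear before the first speaker marker are treated as
    non-spoken text (headers and the like) and are dropped.
    """
    speeches = []
    speaker, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SPEAKER_RE.match(line)
        if match:
            # A new speaker starts: store the previous speech, if any.
            if speaker is not None:
                speeches.append((speaker, " ".join(chunks)))
            speaker = match.group("speaker")
            rest = line[match.end():]
            chunks = [rest] if rest else []
        elif speaker is not None:
            # Continuation of the current speaker's speech.
            chunks.append(line)
    if speaker is not None:
        speeches.append((speaker, " ".join(chunks)))
    return speeches
```

Calling `separate_speeches` on the contents of one file from transcripts-txt/ would give one tuple per speech, in the order the speeches occur.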