iatv provides programmatic access to captions, transcripts, and video clips from archive.org's TV News Archive (TVNA for short).
The TVNA holds well over one million shows from 2012 to the present, and even older shows are being added.
For example, to get the first 100 search results from July 2016 on the Fox News channel (FOXNEWSW) that contain 'climate' or 'change':
from iatv import Show, search_items
items = search_items('climate change', channel='FOXNEWSW', time='201607', rows=100)
# filter out commercials
shows = [item for item in items if 'commercial' not in item]
Then let's download the whole transcript of the first record in shows. We won't just get the raw captions; get_transcript builds a readable transcript from them.
show = Show(shows[0]['identifier'])
trans = show.get_transcript()
iatv will print status updates to the screen as the transcript is built. Closed captions have to be fetched in roughly 60-second intervals, apparently for legal reasons. However, we can still build full transcripts from this access.
trans will be a list of strings, with each string corresponding to one (unattributed) speaker turn. We can join these together and build a reasonable transcript like so:
fulltext = '\n\n'.join(trans)
with open(show.identifier + '.txt', 'w', encoding='utf-8') as f:
    f.write(fulltext)
This tool is based on a few basic dependencies:
pycaption provides methods for parsing the SubRip text (SRT)-formatted captions that archive.org serves (a short usage sketch for both dependencies appears just below).
sumy is an auto-summarization library with many available summarization methods. It's interesting to study both as software and as a survey of computational-linguistics algorithms worth knowing about. For now I'm using it rather naively, applying Latent Semantic Analysis to generate summaries.
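To make the roles of these two dependencies concrete, here is a minimal sketch, assuming a local SRT file named captions.srt (a placeholder), of parsing captions with pycaption and then summarizing the resulting text with sumy's LSA summarizer. This only illustrates the two libraries' public APIs; it is not how iatv is implemented internally.
from pycaption import SRTReader
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

# parse SRT captions into caption objects, then flatten to plain text
srt_text = open('captions.srt', encoding='utf-8').read()  # placeholder path
caption_set = SRTReader().read(srt_text, lang='en')
plain_text = ' '.join(c.get_text() for c in caption_set.get_captions('en'))

# summarize the caption text with Latent Semantic Analysis
parser = PlaintextParser.from_string(plain_text, Tokenizer('english'))
summarizer = LsaSummarizer()
for sentence in summarizer(parser.document, 10):  # keep 10 sentences
    print(sentence)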
Note that if you are downloading the same transcripts often, for example while iteratively developing a processing pipeline, it is best practice to cache the results somehow. Here we'll show how to save them to files; this is not necessarily the best or most responsible way to do it.
import json
import os

from iatv import search_items, Show

days = ['01', '02', '03']
times = ['201607' + day for day in days]

items = [item
         for time in times
         for item in search_items('I', channel='FOXNEWSW', time=time, rows=1000)]

# filter out commercials
shows = [item for item in items if 'commercial' not in item]

for show_spec in shows:
    iden = show_spec['identifier']
    show = Show(iden)
    os.mkdir(iden)

    # save the built transcript
    ts = show.get_transcript()
    ts_file_path = os.path.join(iden, 'transcript.txt')
    with open(ts_file_path, 'w', encoding='utf-8') as f:
        f.write('\n\n'.join(ts))

    # save the show metadata as JSON
    md = show.metadata
    md_file_path = os.path.join(iden, 'metadata.json')
    with open(md_file_path, 'w', encoding='utf-8') as f:
        json.dump(md, f)
This is wrapped up in a function called download_all_transcripts that takes a list of show specifications (search results from search_items) and downloads all their transcripts and metadata to directories of the form <base_directory>/<show_identifier>/{transcript.txt,metadata.json}.
For example, to download all transcripts from July 2016, run
items = search_items('I', channel='FOXNEWSW', time='201607', rows=100000)
shows = [item for item in items if 'commercial' not in item]
download_all_transcripts(shows, base_directory='July2016')
Note that if the directory for an identifier already exists, it will be skipped to avoid re-downloading existing data. By default the function reports every URL it downloads from; to turn off this reporting, add the kwarg verbose=False to the download_all_transcripts call.
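If you want to adapt this behavior, here is a minimal sketch of how such a function could be written on top of iatv's Show class, assuming the skip-existing and verbose behavior described above. The name download_all_transcripts_sketch and its details are illustrative; this is not the library's actual implementation.
import json
import os

from iatv import Show


def download_all_transcripts_sketch(shows, base_directory='.', verbose=True):
    # hypothetical sketch: save transcript.txt and metadata.json per show
    os.makedirs(base_directory, exist_ok=True)

    for show_spec in shows:
        iden = show_spec['identifier']
        show_dir = os.path.join(base_directory, iden)

        # skip shows whose directory already exists to avoid re-downloading
        if os.path.isdir(show_dir):
            continue
        os.mkdir(show_dir)

        if verbose:
            print('downloading transcript and metadata for ' + iden)

        show = Show(iden)
        ts = show.get_transcript()

        with open(os.path.join(show_dir, 'transcript.txt'), 'w', encoding='utf-8') as f:
            f.write('\n\n'.join(ts))
        with open(os.path.join(show_dir, 'metadata.json'), 'w', encoding='utf-8') as f:
            json.dump(show.metadata, f)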
Now let's make summaries of all of these downloaded transcripts and save them to the same identifier-specific directories. Using the same variables as above:
base_directory = 'July2016'
n_sentences = 12
# run this then open July2016/FOXNEWSW_20160701_000000_The_OReilly_Factor/summary.txt
summarize_standard_dir(base_directory, n_sentences)
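If you're curious what such a directory-level summarizer might look like under the hood, here is a minimal sketch that walks the <base_directory>/<identifier> layout described above and writes a summary.txt next to each transcript.txt using sumy's LSA summarizer. The name summarize_standard_dir_sketch and its details are hypothetical, not iatv's actual implementation.
import os

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer


def summarize_standard_dir_sketch(base_directory, n_sentences):
    # hypothetical sketch: write summary.txt alongside each transcript.txt
    summarizer = LsaSummarizer()

    for iden in os.listdir(base_directory):
        ts_path = os.path.join(base_directory, iden, 'transcript.txt')
        if not os.path.isfile(ts_path):
            continue

        parser = PlaintextParser.from_file(ts_path, Tokenizer('english'))
        sentences = summarizer(parser.document, n_sentences)

        summary_path = os.path.join(base_directory, iden, 'summary.txt')
        with open(summary_path, 'w', encoding='utf-8') as f:
            f.write('\n'.join(str(s) for s in sentences))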
iatv will serve as a building block in a larger system of TV data management for scientific use.