Anyone who has watched enough of the TV show Suits knows that certain phrases are repeated in episode after episode. What's more, many different characters use these phrases, as if they all had the same way of talking.
To quantify this observation, this notebook downloads, parses and analyses the subtitles of all 134 episodes (9 seasons) of Suits. It uses n-grams to help find common phrases, and regular expressions to match those phrases and their variants across the subtitle corpus.
Reddit post
- requests and BeautifulSoup to fetch and parse episode transcripts from an online source
- Python's re (regular expressions) to match similar phrases
- nltk to detect the most common n-grams
- matplotlib and PowerPoint for final visualization
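
To make the n-gram and regular-expression steps concrete, here is a minimal sketch of the idea, not the notebook's actual code. It uses only the standard library (collections.Counter stands in for nltk's n-gram helpers), and the sample lines, the helper names (tokenize, ngrams, most_common_ngrams, count_matches) and the pattern are all made up for illustration.

```python
import re
from collections import Counter

# Toy lines invented for illustration; these are NOT real subtitle data.
lines = [
    "What are you talking about?",
    "I don't know what you're talking about.",
    "What the hell are you talking about?",
    "You have my word.",
]

def tokenize(text):
    """Lowercase the text and keep only words (apostrophes included)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def most_common_ngrams(lines, n, top=3):
    """Count n-grams line by line so they never span two subtitle lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(tokenize(line), n))
    return counts.most_common(top)

# Hypothetical pattern: once n-grams suggest a phrase, a regular
# expression can also catch close variants of it.
pattern = re.compile(
    r"\bwhat(?: the hell)? (?:are you|you're) talking about\b",
    re.IGNORECASE,
)

def count_matches(lines, pattern):
    """Total number of pattern matches over all lines."""
    return sum(len(pattern.findall(line)) for line in lines)

print(most_common_ngrams(lines, 2, top=1))
print(count_matches(lines, pattern))
```

Counting n-grams first surfaces candidate phrases without having to guess them up front; the regular expression then groups the variants ("are you" versus "you're", with or without "the hell") that exact n-gram counts would treat as separate phrases.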