Update README.md
The-Gupta authored Jun 25, 2020
1 parent a790383 commit c733deb
Showing 1 changed file with 21 additions and 0 deletions.
21 changes: 21 additions & 0 deletions README.md
@@ -4,3 +4,24 @@ Web Scraping of TED.com for complete Metadata, Transcript, Audio, Video, Images
Environment: Google Colab with Google Drive without any Hardware Accelerator

Scraped Data: https://www.kaggle.com/thegupta/ted-talk


### Context

I was looking for an interesting dataset for a personal Data Science project, and I'm a fan of TED. I searched for a TED dataset and found [Rounak's](https://www.kaggle.com/rounakbanik/ted-talks), but it is incomplete and outdated. So [I wrote my own scraper](https://github.com/The-Gupta/TED-Scraper/blob/master/Scraper.ipynb) and made it fast using Parallel Programming. It now **downloads the Metadata and Transcript of all 4609 Talks on the website in 300 seconds**\*. This is the **most comprehensive TED Talk dataset**, and it also includes [media files](https://drive.google.com/drive/folders/1clqw9izazxafPDuIekXQYYdI-J42VvCR) (images, audio, and video)!
\*Scraped on 24-JUN-20. Running the code scrapes all of TED.com and produces the latest dataset in about 5 minutes.
Downloading media files takes time: 2 minutes for the Speaker and Talk photos.
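
The speed-up described above comes from fetching many talk pages at once rather than one after another. The sketch below is not the notebook's code; it only illustrates one common way to parallelize I/O-bound scraping in Python, using a thread pool. The names `scrape_all` and `fake_fetch`, the worker count, and the placeholder URLs are all assumptions made for illustration. `fake_fetch` stands in for a real HTTP request plus HTML parsing.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], Dict[str, str]],
    max_workers: int = 16,
) -> List[Dict[str, str]]:
    """Call `fetch` on every URL concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `urls`, so row i matches URL i.
        return list(pool.map(fetch, urls))


def fake_fetch(url: str) -> Dict[str, str]:
    """Hypothetical stand-in for downloading and parsing one talk page."""
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    return {"url": url, "slug": slug}


if __name__ == "__main__":
    # Placeholder URLs, not real talk pages.
    talk_urls = [f"https://www.ted.com/talks/example_{i}" for i in range(5)]
    rows = scrape_all(talk_urls, fake_fetch, max_workers=4)
    print(len(rows), rows[0]["slug"])
```

Threads suit this kind of work because each request spends most of its time waiting on the network. Passing `fetch` as a parameter keeps the parallel part separate from the page-specific parsing.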

### Content

Each row corresponds to a Talk on TED.com, and the columns hold its Metadata (generic, speaker-related, and talk-related information) plus the Transcript.
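
As a rough illustration of the one-row-per-talk layout, the sketch below reads a tiny in-memory CSV with Python's standard `csv` module. The column names (`talk_title`, `speaker_name`, `transcript`) and the sample rows are hypothetical; the real dataset's columns may be named differently, and the helper names are made up for this example.

```python
import csv
import io

# Hypothetical columns and rows; the real dataset's schema may differ.
SAMPLE_CSV = """talk_title,speaker_name,transcript
Example talk A,Speaker One,Hello world.
Example talk B,Speaker Two,Another short transcript here.
"""


def load_talks(handle):
    """Read one dict per talk from an open CSV file handle."""
    return list(csv.DictReader(handle))


def transcript_word_count(row):
    """Count whitespace-separated words in a talk's transcript."""
    return len(row["transcript"].split())


talks = load_talks(io.StringIO(SAMPLE_CSV))
for talk in talks:
    print(talk["talk_title"], transcript_word_count(talk))
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle, and the column names adjusted to match the dataset.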


### Acknowledgements

I thank Google for Colab.


### Inspiration

We've got all of TED.com in an Excel sheet. Let's find some INSIGHTS WORTH SHARING!
