Online-courses-subtitles-topic-classification

Introduction

This is my undergraduate research project.
As MOOCs (Massive Open Online Courses) become more and more popular, we hope to find a way to make online courses easier to build. Videos alone do not make a course: they need to be carefully edited and supplemented with text materials such as subtitles. In many cases, volunteers edit long videos into small separate segments, each covering a single topic. In this project, we hope to automate this process using NLP and OCR technologies. Currently, we require both the videos and the subtitles. If good speech recognition methods become more widely available in the future, we might try using only the videos.

Environment

Tested on Windows 7 and 8 using Anaconda and Python 3.5.

This project uses:
PySceneDetect
ABBYY
Keras

Method

Preprocessing data

Training data are obtained from scanned textbooks from digital libraries, converted to text with ABBYY PDF Transformer+ and used for educational purposes only.
The original test data are course materials obtained from Coursera, used for educational purposes only.

We put the video files of the MOOC courses under the 'mp4' directory and the subtitle files under the 'srt' directory. Each video and its subtitle file should share the same base name.
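The directory layout above can be checked before any processing starts. The following is a minimal sketch, not code from this repository: the function name `pair_videos_with_subtitles` is hypothetical, and it assumes that "matching names" means an identical base name with the `.mp4` and `.srt` extensions.

```python
from pathlib import Path


def pair_videos_with_subtitles(root):
    """Pair each video in <root>/mp4 with the subtitle in <root>/srt of the same base name.

    Returns (pairs, unmatched): pairs is a list of (video_path, subtitle_path),
    unmatched is a sorted list of base names found in only one directory.
    """
    root = Path(root)
    videos = {p.stem: p for p in (root / "mp4").glob("*.mp4")}
    subtitles = {p.stem: p for p in (root / "srt").glob("*.srt")}

    # Names present in both directories can be processed together.
    common = sorted(videos.keys() & subtitles.keys())
    pairs = [(videos[name], subtitles[name]) for name in common]

    # Names present in only one directory are reported rather than silently skipped.
    unmatched = sorted(videos.keys() ^ subtitles.keys())
    return pairs, unmatched
```

Calling `pair_videos_with_subtitles(".")` from the main directory returns the pairs to process, along with any file that is missing its counterpart.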

We use "CutScene.py" to invoke PySceneDetect which cuts our MOOC videos into scenes, and save the lists of scenes in csv files. The first and last frames of each scene are saved under the main directory.
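As an illustration of this step, here is a rough sketch of detecting scenes and writing them to a CSV file. It is not the contents of "CutScene.py". It assumes PySceneDetect 0.6 or later (which provides `scenedetect.detect` and `ContentDetector`), which may be newer than the version this project was tested with. The function names and the CSV column names are hypothetical, and saving the first and last frames is not covered.

```python
import csv


def detect_scenes(video_path):
    """Return a list of (start_seconds, end_seconds), one tuple per detected scene.

    Assumption: PySceneDetect >= 0.6 is installed. The import is kept local so
    that the CSV helper below can be used without it.
    """
    from scenedetect import detect, ContentDetector

    scenes = detect(str(video_path), ContentDetector())
    return [(start.get_seconds(), end.get_seconds()) for start, end in scenes]


def write_scene_csv(scenes, csv_path):
    """Write one row per scene: scene number, start time and end time in seconds."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Hypothetical column names; the project's actual CSV layout is not documented here.
        writer.writerow(["scene", "start_sec", "end_sec"])
        for number, (start, end) in enumerate(scenes, start=1):
            writer.writerow([number, "{:.3f}".format(start), "{:.3f}".format(end)])
```

For example, `write_scene_csv(detect_scenes("mp4/lecture1.mp4"), "lecture1.csv")` would produce one CSV file for one video, where `lecture1` is a placeholder name.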

We use "ExtractTitlesAndCut.py" to cut the subtitles into 5-line text segments, each representing the sentence in its middle. Each segment is then saved with the title of the corresponding slide as its filename. The titles are extracted from the saved frame images using ABBYY OCR. These text segments are our test files for classification.
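The 5-line segmentation can be pictured as a sliding window over the subtitle lines. The sketch below is an illustration, not the code in "ExtractTitlesAndCut.py": the function names are hypothetical, it assumes standard SRT cues (index, timecode line, text), and it assumes that lines too close to the start or end to have two neighbours on each side are skipped. OCR and file naming are left out.

```python
import re

# Matches an SRT timecode line such as "00:00:01,000 --> 00:00:03,500".
TIMECODE = re.compile(r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}")


def read_srt_cues(srt_text):
    """Return the spoken text of each cue in an SRT document, one string per cue."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            if TIMECODE.search(line):
                # Everything after the timecode line is the cue text.
                text = " ".join(part.strip() for part in lines[i + 1:] if part.strip())
                if text:
                    cues.append(text)
                break
    return cues


def five_line_segments(cues, size=5):
    """Return (middle_index, lines) windows of `size` consecutive cues.

    Each window is centred on the cue it represents; cues without enough
    context on both sides are skipped (an assumption of this sketch).
    """
    half = size // 2
    return [
        (mid, cues[mid - half: mid + half + 1])
        for mid in range(half, len(cues) - half)
    ]
```

With six subtitle lines, this yields two segments: one centred on the third line and one centred on the fourth.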

Classification

Classification uses Keras. The base code is taken from the Keras blog.

About OCR

In this project, OCR is needed in two places. The first is obtaining the training data, where we use commercial software to turn scanned texts into digital text. The second is obtaining the title of each slide from the frames, where we use the ABBYY API to recognize the text.
Of all the OCR technologies I've tried, ABBYY (www.ocrsdk.com) provides the most accurate results. I hope they can provide more pages for me to complete this project.
