These Jupyter notebooks demonstrate an example application of NLP to energy-industry articles collected as PDF files.
The preprocessing notebook covers conversion from PDF to text, tokenization, and deletion of duplicate files.
About 600 articles were collected and converted into text files.
https://github.com/Jun-Tam/article_analysis_word2vec/blob/master/NLP_Articles_Preprocess.ipynb
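The tokenization and duplicate-deletion steps could look roughly like the sketch below. It is not taken from the notebook: the function names `tokenize` and `drop_duplicates`, the whitespace-normalized SHA-256 comparison, and the sample texts are all assumptions made for illustration, and the notebook may detect duplicates differently.

```python
import hashlib
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    # Hypothetical tokenizer: keeps letters and simple contractions only.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def drop_duplicates(docs):
    """Keep the first document seen for each distinct normalized content.

    `docs` maps a file name to the text extracted from that file.
    """
    seen = set()
    unique = {}
    for name, text in docs.items():
        # Collapse whitespace and case so that layout differences from
        # PDF extraction do not hide otherwise identical articles.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[name] = text
    return unique


# Made-up sample texts; b.txt differs from a.txt only in whitespace.
docs = {
    "a.txt": "Oil prices rose sharply.",
    "b.txt": "Oil  prices rose\nsharply.",
    "c.txt": "Shale output is growing.",
}
unique_docs = drop_duplicates(docs)
tokens = {name: tokenize(text) for name, text in unique_docs.items()}
```

Hashing normalized text only catches exact duplicates; near-duplicates (for example, the same article with a different header) would need a similarity measure instead.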
The analysis notebook covers bag-of-words (BoW), TF-IDF, word2vec, and word clouds.
https://github.com/Jun-Tam/article_analysis_word2vec/blob/master/NLP_Articles_Word2Vec.ipynb
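As a rough illustration of the BoW and TF-IDF steps, here is a minimal standard-library sketch. It assumes the plain definition tf = count / document length and idf = log(N / document frequency); the notebook may use a library implementation with different smoothing or normalization. The names `bag_of_words` and `tf_idf` and the tiny corpus are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Return a term -> count mapping for one tokenized document."""
    return Counter(tokens)


def tf_idf(corpus):
    """Compute TF-IDF weights for a list of tokenized documents.

    Returns one dict (term -> weight) per document.
    """
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in corpus:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up corpus of three tokenized "articles".
corpus = [
    ["oil", "price", "oil", "demand"],
    ["gas", "price", "supply"],
    ["oil", "supply", "shale"],
]
bow = bag_of_words(corpus[0])
weights = tf_idf(corpus)
```

With this definition, a term that appears in every document gets a weight of zero, which is why terms specific to a few articles stand out.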
The word counts across all the articles are shown below.
A word cloud is a useful tool for visualizing what people were interested in each year.
Using word2vec, "ness" vectors are defined, and each article is then expressed in terms of these ness vectors.
The time-series plots below show recent article trends for the individual ness vectors alongside the oil price history.
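One possible way to score an article against "ness" vectors is sketched below. Everything here is an assumption for illustration: the toy 3-dimensional embeddings stand in for trained word2vec vectors, the names `oilness` and `renewableness` and their seed words are hypothetical, and defining a ness vector as the mean of seed-word vectors and scoring by cosine similarity is only one plausible construction, not necessarily the one used in the notebook.

```python
import math

# Toy embeddings standing in for trained word2vec vectors.
# The values are made up for illustration only.
EMBEDDINGS = {
    "oil": [0.9, 0.1, 0.0],
    "crude": [0.8, 0.2, 0.1],
    "solar": [0.1, 0.9, 0.0],
    "wind": [0.0, 0.8, 0.2],
    "price": [0.4, 0.3, 0.6],
}


def mean_vector(words):
    """Average the embeddings of the words that are in the vocabulary."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        raise ValueError("none of the words are in the vocabulary")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical ness definitions: each is the mean of a few seed words.
NESS_SEEDS = {
    "oilness": ["oil", "crude"],
    "renewableness": ["solar", "wind"],
}
NESS_VECTORS = {name: mean_vector(seeds) for name, seeds in NESS_SEEDS.items()}


def ness_scores(tokens):
    """Score one tokenized article against every ness vector."""
    article_vector = mean_vector(tokens)
    return {
        name: cosine(article_vector, vector)
        for name, vector in NESS_VECTORS.items()
    }


# Out-of-vocabulary tokens are skipped when averaging.
scores = ness_scores(["crude", "oil", "price", "unknownword"])
```

Averaging such scores over all articles published in a given period would give one time series per ness vector, which could then be plotted next to a price series.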
Natural Language Processing in Action: Understanding, analyzing, and generating text with Python, Manning
Hobson Lane, Cole Howard, Hannes Max Hapke