Political News Filter classifies English news articles according to whether they cover policy topics.
It uses a broad characterization of politics: politics is about "who gets what, when, and how" (Lasswell, 1936). As a result, Political News Filter may classify business or tech news as political, depending on the actual content.
- Python 3.6+
- Pandas 0.24.1+
- NumPy 1.18.1+
- Keras 2.3.1+
- TensorFlow 2.1.0+
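
To check whether an existing environment meets these minimums, something like the following sketch could be used. It relies on importlib.metadata, which is part of the standard library only from Python 3.8 on, so it would not run on 3.6 or 3.7. The helper names parse_version, meets_minimum, and check_requirements are hypothetical, and the distribution names are assumed to match those on PyPI. The comparison only looks at the first three numeric components of a version string.

```python
import re
from importlib import metadata  # standard library since Python 3.8

# Minimum versions as listed above; keys are assumed PyPI distribution names.
MINIMUM_VERSIONS = {
    "pandas": "0.24.1",
    "numpy": "1.18.1",
    "Keras": "2.3.1",
    "tensorflow": "2.1.0",
}


def parse_version(version):
    """Turn '2.3.1' (or '2.3.1rc0') into a comparable tuple such as (2, 3, 1)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad '1.18' to (1, 18, 0)
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def check_requirements(minimums=MINIMUM_VERSIONS):
    """Map each package name to 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if meets_minimum(installed, minimum) else "too old"
    return report


for name, status in check_requirements().items():
    print(f"{name}: {status}")
```

This does not replace check_installation.sh, which remains the way to verify the installation as a whole.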
Political News Filter supports both CPU and GPU processing. The latter is faster but requires a CUDA-capable graphics card and the CUDA toolkit.
- Clone this repository:

      $ git clone https://github.com/lukasgebhard/Political-News-Filter.git
      $ cd Political-News-Filter
- Download pon_classifier.zip and extract it into the repository folder. Its uncompressed size is 1.2 GB.
- Install the Python dependencies. For example, create a virtual environment:

      $ virtualenv --python=python3.6 venv
      $ source venv/bin/activate
      $ pip install -r requirements.txt
- Verify that the installation was successful:

      $ ./check_installation.sh
      Hooray! Political News Filter is properly installed and ready to use.
Start a Python session:
$ python3
Create two example articles:
>>> political_article = '''White House declares war against terror. The US government officially announced a ''' \
'''large-scale military offensive against terrorism. Today, the Senate agreed to spend an ''' \
'''additional 300 billion dollars on the advancement of combat drones to be used against ''' \
'''global terrorism. Opposition members sharply criticize the government. ''' \
'''"War leads to fear and suffering. ''' \
'''Fear and suffering is the ideal breeding ground for terrorism. So talking about a ''' \
'''war against terror is cynical. It's actually a war supporting terror."'''
>>> nonpolitical_article = '''Table tennis world cup 2025 takes place in South Korea. ''' \
'''The 2025 world cup in table tennis will be hosted by South Korea, ''' \
'''the Table Tennis World Committee announced yesterday. ''' \
'''Three-time world champion, Hu Ho Han, did not pass the qualification round, ''' \
'''to the advantage of underdog Bob Bobby who has been playing outstanding matches ''' \
'''in the National Table Tennis League this year.'''
To filter a list of news articles, call filter_news:
>>> from political_news_filter import filter_news
>>> political_article == filter_news([political_article, nonpolitical_article])[0]
True
If you need more flexibility, you can directly call the underlying classifier:
>>> from political_news_filter import Classifier
>>> classifier = Classifier()
>>> probabilities = classifier.estimate([political_article, nonpolitical_article])
>>> probabilities[0] > 0.99
True
>>> probabilities[1] < 0.01
True
Please read the docstrings for further information.
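
As an illustration of that flexibility, the sketch below applies a custom decision threshold to the probabilities. It assumes, as the session above suggests, that estimate returns one probability of being political per article, in input order. The function select_political and its default threshold of 0.5 are hypothetical and not part of the package. The probabilities are hard-coded so that the example runs without the model.

```python
def select_political(articles, probabilities, threshold=0.5):
    """Return the articles whose political probability is at least `threshold`."""
    if len(articles) != len(probabilities):
        raise ValueError("Expected exactly one probability per article.")
    return [article for article, p in zip(articles, probabilities) if p >= threshold]


# Stand-in values; in practice they would come from Classifier().estimate(articles).
articles = [
    "Senate agrees on drone budget.",
    "Table tennis world cup 2025.",
    "Tech firm lobbies against new tax.",
]
probabilities = [0.99, 0.01, 0.62]

# A high threshold favors precision, a lower one favors recall.
strict = select_political(articles, probabilities, threshold=0.9)
lenient = select_political(articles, probabilities, threshold=0.5)
```

With these stand-in values, strict keeps only the first article, while lenient also keeps the third.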
Below are some benchmarks measured on a notebook with 6 CPU cores @ 2.6 GHz, a GPU with 4 GB of memory and CUDA compute capability 7.5, 32 GB RAM, and a PCIe SSD:
| Task | On CPU | On GPU |
|---|---|---|
| One-time initialization | 30 sec | 15 sec |
| Classification of 1,000 articles | 1.8 sec | 1.3 sec |
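
For a rough idea of what these figures mean for larger inputs, the sketch below extrapolates from the table. It assumes that classification time grows linearly with the number of articles, which the benchmarks do not confirm; the function estimate_runtime_sec is hypothetical.

```python
# Figures taken from the benchmark table above.
BENCHMARKS = {
    "cpu": {"init_sec": 30.0, "sec_per_1000": 1.8},
    "gpu": {"init_sec": 15.0, "sec_per_1000": 1.3},
}


def estimate_runtime_sec(n_articles, device):
    """Estimate total runtime, assuming throughput scales linearly with input size."""
    figures = BENCHMARKS[device]
    return figures["init_sec"] + n_articles / 1000 * figures["sec_per_1000"]


cpu_sec = estimate_runtime_sec(1_000_000, "cpu")  # 30 + 1000 * 1.8
gpu_sec = estimate_runtime_sec(1_000_000, "gpu")  # 15 + 1000 * 1.3
```

Under this assumption, one million articles would take about 1,830 seconds on the CPU and about 1,315 seconds on the GPU; actual times will depend on hardware and article length.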
The classifier is based on a model by Heng Zheng, submitted to Kaggle under the Apache 2.0 license. It is a convolutional neural network with a 100-dimensional GloVe embedding layer, followed by three convolutional layers, each followed by a ReLU layer and a pooling layer, and a final softmax output layer. Training minimizes a cross-entropy loss, with dropout used for regularization.
I created a labeled set of 0.57M news articles, selected from:
- The CommonCrawl news archive (extracted using news-please)
- The HuffPost dataset
- The BBC dataset
After fitting the classifier on 87.5 % of the articles, testing it on the remaining 12.5 % yields:
- F1 = 94.4 %
- Precision = 95.6 %
- Recall = 93.2 %
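
The three scores are consistent with each other: F1 is the harmonic mean of precision and recall. A minimal check, using the values reported above (the function name f1_score is chosen here for illustration):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(95.6, 93.2)
print(round(f1, 1))  # 94.4
```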
If you use Political News Filter, please cite our poster:
@InProceedings{POLUSA,
author = {Gebhard, Lukas and Hamborg, Felix},
title = {The POLUSA Dataset: 0.9M Political News Articles Balanced by Time and Outlet Popularity},
year = {2020},
month = {August},
booktitle = {Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL '20)},
venue = {Virtual event, China},
publisher = {Association for Computing Machinery},
doi = {10.1145/3383583.3398567}
}