```
flatiron-capstone
├── README.md
├── notebooks
│   ├── labeling
│   │   └── fb_labels.ipynb
│   ├── eda
│   │   ├── eda_polarity_and_length.ipynb
│   │   └── eda.py
│   ├── results
│   │   ├── master_results.ipynb
│   │   └── results.py
│   └── feature_importances
│       ├── feature_importance.ipynb
│       └── importances.py
├── src
│   ├── model.py
│   └── vectorize.py
└── appendix
    ├── resources.txt
    ├── setup_instructions.txt
    └── images
```
Language in political ad data will be distinct between liberals and conservatives.
I collected my data according to a guiding principle discussed in the paper *A Nation under Joint Custody: How Conflicting Family Models Divide US Politics*. In the paper, Eva Wehling discusses ‘elite-to-citizen’ discourse, which she describes as language used by politicians and elites that is directed toward citizens.
The source I used to obtain examples of elite-to-citizen discourse was a collection of over 200,000 Facebook political ads from 2019-2021, available as a CSV from the ProPublica Data Store.
To extract a smaller sample from the original data, I filtered it according to the following parameters:
- Data is labeled as liberal or conservative based on the affiliation of the organization that paid for the ad, as represented by the 'paid-by' column.
- Data is limited to only those rows whose payer appears more than one hundred times. This resulted in a data set of approximately 20,000 rows with approximately 200 unique payers.
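A minimal pandas sketch of this filtering step, assuming the ProPublica CSV has been downloaded locally (the file name and the exact 'paid-by' column name are taken from the description above and may differ from the actual export):

```python
import pandas as pd

# Load the ProPublica ad export (file name is an assumption).
ads = pd.read_csv("fbpac-ads-en-US.csv")

# Keep only rows whose payer appears more than one hundred times.
payer_counts = ads["paid-by"].value_counts()
frequent_payers = payer_counts[payer_counts > 100].index
ads = ads[ads["paid-by"].isin(frequent_payers)]

print(f"{len(ads)} rows, {ads['paid-by'].nunique()} unique payers")
```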
Next, I reviewed the list of payers and used the following combination of manual and automated techniques to label each ad.
- Named entity recognition (NER) with SpaCy was used to extract payers whose names included words similar to "Republican" and "Democrat".
- NER was also used to extract the names of any politicians, and to label the politicians whose affiliation was known.
- Politicians and organizations that could not be labeled using NER were manually researched on OpenSecrets to determine their political affiliation.
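The automated part of this labeling might look roughly like the sketch below, continuing from the filtered `ads` frame above; the cue lists and the `None` fallback for manual OpenSecrets review are illustrative, not the exact rules used in fb_labels.ipynb:

```python
import spacy

# Medium English pipeline; the labeling notebook may use a different model.
nlp = spacy.load("en_core_web_md")

# Illustrative keyword cues; unmatched payers were researched manually.
LIBERAL_CUES = {"democrat", "democratic", "democrats"}
CONSERVATIVE_CUES = {"republican", "republicans"}

def label_payer(payer_name):
    """Return 'liberal', 'conservative', or None when manual review is needed."""
    doc = nlp(payer_name)
    tokens = {tok.lower_ for tok in doc}
    if tokens & LIBERAL_CUES and not tokens & CONSERVATIVE_CUES:
        return "liberal"
    if tokens & CONSERVATIVE_CUES and not tokens & LIBERAL_CUES:
        return "conservative"
    # PERSON entities (doc.ents) can be checked against politicians with
    # known affiliations before falling back to manual OpenSecrets research.
    return None

ads["label"] = ads["paid-by"].map(label_payer)
```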
When the affiliation of every payer was determined and the correct label was applied to each row, I checked the values for a class imbalance and found 14,000 rows labeled liberal but only approximately 5,000 labeled conservative. To address the class imbalance, I under-sampled the liberal class and used 5,000 observations from each class for my analysis.
The original ad data from ProPublica had 27 features. Since this analysis focused on the text of the ad messages, all columns except 'message' and 'label' were dropped after the labeling process.
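A sketch of the column trimming and under-sampling, keeping the two classes in separate frames for the per-class preprocessing described next (the `random_state` is arbitrary):

```python
# Keep only the text and the label, then under-sample the liberal class so
# both classes contribute roughly 5,000 rows (counts from the section above).
ads = ads[["message", "label"]].dropna()

conservative = ads[ads["label"] == "conservative"]
liberal = ads[ads["label"] == "liberal"].sample(n=len(conservative), random_state=42)
```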
The following steps were performed on the liberal set and the conservative set separately; the two sets were combined later, during the modeling phase.
- Data from the ‘message’ column was used for the linguistic analysis. HTML and superfluous characters were removed, and each observation was transformed into a SpaCy document (a sketch of this preprocessing pipeline appears after this list).
- The documents were then lemmatized and filtered to remove digits, punctuation, stop words, and words of three characters or fewer.
- Each word was assigned a word vector using the SpaCy pre-trained model ‘en_core_web_md’.
- The vocabulary was trimmed to words that appear at least ten times. Then, a column was created for each word in the vocabulary. The value of each column was the mean vector for the word as it appears in that row.
- The compound polarity of each observation was calculated using the VADER Sentiment Intensity Analyzer.
- Messages were divided into long (length z-score > 1) and short (length z-score ≤ 1).
- 98.38% of ads are short and 1.62% are long.
- Of short ads, 50.75% are conservative and 49.25% are liberal.
- Of long ads, 4.32% are conservative and 95.68% are liberal.
- Longer ads were overwhelmingly liberal and positive.
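A sketch of the per-class cleaning, polarity, and length steps, assuming the `liberal` and `conservative` frames from the sketch above; the construction of the per-word vector columns is not reproduced here, and the exact cleaning rules in vectorize.py may differ:

```python
import re
from collections import Counter

import spacy
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Medium English model: ships with the word vectors used for the features.
nlp = spacy.load("en_core_web_md")
analyzer = SentimentIntensityAnalyzer()

def clean(text):
    """Strip HTML tags and collapse whitespace (a rough approximation)."""
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", text)).strip()

def lemmas(text):
    """Lemmatize and drop digits, punctuation, stop words, and short words."""
    return [
        tok.lemma_.lower()
        for tok in nlp(clean(text))
        if not (tok.is_digit or tok.is_punct or tok.is_stop or len(tok.text) <= 3)
    ]

def preprocess(frame):
    """Apply cleaning, polarity, and the length split to one class's ads."""
    frame = frame.copy()
    frame["lemmas"] = frame["message"].map(lemmas)
    # Compound polarity per message with VADER.
    frame["polarity"] = frame["message"].map(
        lambda m: analyzer.polarity_scores(clean(m))["compound"]
    )
    # Long vs. short messages by message-length z-score (computed per class here).
    lengths = frame["message"].str.len()
    frame["is_long"] = (lengths - lengths.mean()) / lengths.std() > 1
    return frame

liberal = preprocess(liberal)
conservative = preprocess(conservative)

# Trim the vocabulary to lemmas that appear at least ten times (shown for one class).
counts = Counter(l for doc in liberal["lemmas"] for l in doc)
vocab = {lemma for lemma, n in counts.items() if n >= 10}
```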
I fit and tested three different models (Logistic Regression, Naive Bayes, and Support Vector Classifier) to compare their performance, using the following process for each classifier:
- A sample of 1,000 observations was loaded from each vectorized dataframe (liberal and conservative) and concatenated to create one dataframe of 2,000 observations for the model.
- The data was divided into X and y, then split into a training and validation set.
- Each classifier was then fit to the training data and evaluated at three different test sizes: 0.2, 0.3, and 0.4.
Results for each classifier were combined and saved to the /results directory.
The process is repeated 10 times, and the results from each trial are combined into a master results data frame, which is then grouped by classifier and test size. The mean of each evaluation metric is extracted for each group and compared; a sketch of this loop follows.
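One trial of this evaluation loop might look roughly like the sketch below; `liberal_vec` and `conservative_vec` stand in for the two vectorized data frames and are assumed names:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

CLASSIFIERS = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian NB": GaussianNB(),
    "SVC": SVC(),
}

def run_trial(liberal_vec, conservative_vec, n=1000):
    """One trial: sample n rows per class, then score each classifier at each test size."""
    sample = pd.concat([liberal_vec.sample(n=n), conservative_vec.sample(n=n)])
    X = sample.drop(columns="label")
    y = (sample["label"] == "conservative").astype(int)

    rows = []
    for test_size in (0.2, 0.3, 0.4):
        X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=test_size)
        for name, clf in CLASSIFIERS.items():
            preds = clf.fit(X_train, y_train).predict(X_val)
            rows.append({
                "model": name,
                "test size": test_size,
                "f1": f1_score(y_val, preds),
                "precision": precision_score(y_val, preds),
                "recall": recall_score(y_val, preds),
                "accuracy": accuracy_score(y_val, preds),
                "roc_auc": roc_auc_score(y_val, preds),
            })
    return pd.DataFrame(rows)

# Ten trials, combined and then averaged by classifier and test size.
results = pd.concat(run_trial(liberal_vec, conservative_vec) for _ in range(10))
summary = results.groupby(["model", "test size"]).mean()
summary["mean_score"] = summary.mean(axis=1)
```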
model | test size | f1 | precision | recall | accuracy | roc_auc | mean_score |
---|---|---|---|---|---|---|---|
Gaussian NB | 0.2 | 9.65259 | 9.94671 | 9.37688 | 9.665 | 9.66357 | 9.66095 |
Gaussian NB | 0.3 | 9.56674 | 9.82702 | 9.32119 | 9.575 | 9.5767 | 9.57333 |
Gaussian NB | 0.4 | 9.47513 | 9.69629 | 9.26551 | 9.4825 | 9.48414 | 9.48071 |
Logistic Regression | 0.2 | 6.4213 | 5.9322 | 7.0201 | 6.15 | 6.15433 | 6.33559 |
Logistic Regression | 0.3 | 6.39771 | 5.95296 | 6.93709 | 6.11167 | 6.10613 | 6.30111 |
Logistic Regression | 0.4 | 6.3345 | 5.91879 | 6.83623 | 6.05875 | 6.05287 | 6.24023 |
SVC | 0.2 | 6.13127 | 6.12301 | 6.17085 | 6.1825 | 6.18244 | 6.15801 |
SVC | 0.3 | 6.04041 | 6.15965 | 5.98013 | 6.13167 | 6.13268 | 6.08891 |
SVC | 0.4 | 5.99584 | 6.11838 | 5.93052 | 6.09125 | 6.09246 | 6.04569 |
The model with the highest mean score was Gaussian Naive Bayes using an 80/20 train/test split. However, special consideration was given to precision: much political language is domain specific, so a large vocabulary was bound to be shared between the classes, and many features were likely to be roughly equally prevalent in both classes.
To retrieve the most important features in this model, I used scikit-learn's permutation importance.
- Obtained the results of the three models (Logistic Regression, Naive Bayes, and Support Vector Classifier) and selected Naive Bayes because it had the best performance.
- Loaded the vectorized data used to create the models. Divided the data by class, and then sampled 1000 observations from each class.
- The data frames were then concatenated, split into X and y data, then into a training and validation set, and fit to scikit-learn's GaussianNB.
- Then, the permutation importance was calculated for the selected sample of observations and stored.
- This process was repeated 10 times; the results of each test were combined into a master dataframe, which was then grouped by feature to calculate the mean of the mean importance for each feature.
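A sketch of this permutation-importance loop, again assuming the vectorized data frames are available as `liberal_vec` and `conservative_vec` (assumed names):

```python
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

runs = []
for _ in range(10):
    # Sample 1,000 rows per class from the vectorized data frames.
    sample = pd.concat([liberal_vec.sample(n=1000), conservative_vec.sample(n=1000)])
    X = sample.drop(columns="label")
    y = (sample["label"] == "conservative").astype(int)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

    clf = GaussianNB().fit(X_train, y_train)
    result = permutation_importance(clf, X_val, y_val, n_repeats=5)
    runs.append(pd.DataFrame({"feature": X.columns,
                              "mean_importance": result.importances_mean}))

# Combine the trials, then average the mean importance per feature.
master = pd.concat(runs)
top_ten = (master.groupby("feature")["mean_importance"]
                 .mean()
                 .sort_values(ascending=False)
                 .head(10))
```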
The ten features with the highest mean of mean importances:
features | mean_importance |
---|---|
debate | 0.0088 |
stage | 0.0078 |
corporate | 0.0052 |
democracy | 0.005 |
qualify | 0.0047 |
mainstream | 0.0044 |
media | 0.0042 |
dnc | 0.0041 |
pac | 0.004 |
organize | 0.0034 |
(coming soon) An analysis of how often the most important features occur in each class, and in what context.
- The ProPublica Facebook ad data set is very rich. There are still many valuable insights that can be gleaned from further examination of different features in the original data, especially the relationship between the ad target demographics and the content of the ad.
- Reimagining the model to be more sensitive to framing by using named entity recognition and POS tags to identify the key nouns and verbs used in the discourse to evoke certain frames, and to investigate who is evoking which frames.
- Extending this model to analyze citizen discourse using Reddit data and comparing elite-to-citizen and citizen political discourse. In addition, the model could be extended to distinguish between elite-to-citizen and citizen discourse.