
Text Processing Pipeline using Polars

This repository contains the code for my Medium post, "Fast String Processing with Polars - Scam Emails Dataset".

The project implements a text processing pipeline using the Polars library for efficient data manipulation and analysis. The pipeline is designed to handle text data, perform various preprocessing tasks, and extract useful features from the text.

Dataset

The dataset used in this project is the CLAIR collection of fraud emails by Radev, D. (2008). It is available from the ACL Data and Code Repository under the identifier ADCR2008T001, where more information about the dataset can also be found.

Dependencies

The following dependencies are required to run the text processing pipeline:

numpy==1.23.5
pandas==1.5.3
polars==0.17.14
nltk==3.8.1
scikit-learn==1.2.2
matplotlib==3.7.1
wordcloud==1.9.2
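
Because the versions above are pinned exactly, it can help to confirm that the active environment matches them before running the notebook. The snippet below is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical and not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list above.
PINNED = {
    "numpy": "1.23.5",
    "pandas": "1.5.3",
    "polars": "0.17.14",
    "nltk": "3.8.1",
    "scikit-learn": "1.2.2",
    "matplotlib": "3.7.1",
    "wordcloud": "1.9.2",
}


def check_pins(pinned):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    # Print one line per package whose installed version differs from the pin.
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package is installed at the listed version.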

Run in a Notebook

Install the required dependencies with pip: pip install -r requirements.txt
Open email_eda.ipynb and run its cells to load, pre-process, clean, and tokenise the emails. The notebook also clusters the texts and generates a word cloud for each cluster.
