Amazon-Review-AI-Detector

Open Source Packages

Libraries Used

  1. Scikit-learn

    • Description: Machine learning library for Python. Used for Logistic Regression and Multinomial Naive Bayes.
    • Language: Python.
  2. Hugging Face Transformers

    • Description: Library for state-of-the-art natural language processing models like GPT-2.
    • Language: Python.
  3. Optuna

    • Description: Framework for hyperparameter optimization.
    • Language: Python.
  4. Pandas

    • Description: Data manipulation and analysis library.
    • Language: Python.
  5. Matplotlib

    • Description: Library for creating static, animated, and interactive visualizations.
    • Language: Python.
  6. NumPy

    • Description: Fundamental package for numerical computation in Python, providing support for arrays and matrices.
    • Language: Python.
  7. Seaborn

    • Description: Statistical data visualization library based on Matplotlib.
    • Language: Python.
  8. PyTorch

    • Description: Deep learning framework providing tensor computation and automatic differentiation.
    • Language: Python.
  9. NLTK (Natural Language Toolkit)

    • Description: Library for processing and analyzing human language data (natural language processing).
    • Language: Python.
  10. TQDM

    • Description: Library for creating progress bars in Python.
    • Language: Python.
  11. Logging

    • Description: Built-in Python library for generating log messages.
    • Language: Python.
  12. JSON

    • Description: Built-in Python module for parsing and creating JSON data.
    • Language: Python.
  13. Time

    • Description: Built-in Python module for handling time-related functions.
    • Language: Python.


Datasets

Amazon Fake/Real Review Dataset

  • Source: Kaggle Dataset
  • Categories: Cell Phones and Accessories, Clothing, Home and Kitchen, Sports, and Toys.
  • Size: 100,000 data points (balanced dataset: 50% real and 50% fake reviews).
  • Preprocessing:
    • Text cleaning (removing punctuation, stopwords, and extra spaces).
    • Train-Test Split: 80% training, 20% testing.
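
A minimal sketch of this cleaning and splitting step, assuming a CSV with reviewText and class columns (the actual file and column names may differ):

```python
import re

import pandas as pd
from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")
from sklearn.model_selection import train_test_split

# Hypothetical file and column names for illustration.
df = pd.read_csv("amazon_reviews.csv")

STOPWORDS = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation and extra spaces, and drop stopwords."""
    text = re.sub(r"[^\w\s]", " ", text.lower())   # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()       # collapse extra spaces
    return " ".join(w for w in text.split() if w not in STOPWORDS)

df["cleaned"] = df["reviewText"].astype(str).map(clean_text)

# 80/20 split, stratified to preserve the 50/50 real/fake balance.
X_train, X_test, y_train, y_test = train_test_split(
    df["cleaned"], df["class"], test_size=0.2, stratify=df["class"], random_state=42
)
```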

Performance Measurement Tools

  1. Confusion Matrix

    • Evaluates model prediction performance.
    • Highlights Type I and Type II errors.
  2. ROC and Precision-Recall Curves

    • Visualization of performance metrics over different thresholds.
  3. Training and Evaluation Metrics

    • Accuracy, Precision, Recall, F1 Score across epochs.
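
All three tools can be produced with scikit-learn and Matplotlib; a minimal sketch, assuming y_test, y_pred (hard 0/1 predictions), and y_score (class-1 probabilities) from any of the models below:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    PrecisionRecallDisplay,
    RocCurveDisplay,
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

cm = confusion_matrix(y_test, y_pred)  # off-diagonal cells are the Type I/II errors
ConfusionMatrixDisplay(cm).plot()

# Performance across classification thresholds.
RocCurveDisplay.from_predictions(y_test, y_score)
PrecisionRecallDisplay.from_predictions(y_test, y_score)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 Score :", f1_score(y_test, y_pred))
plt.show()
```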

Models

Logistic Regression

Overview:
Logistic Regression is a classic statistical method for binary classification. In this project, it classifies Amazon reviews as spam (fake) or non-spam (real). Despite its simplicity, Logistic Regression is effective when feature relationships are linear or near-linear.

Key Features:

  • Simplicity and Interpretability: The model provides straightforward results and coefficients that explain the relationship between features and output.
  • Speed: Training is computationally efficient, even on large datasets.
  • Binary Classification: Suitable for a balanced dataset like the one used here.

Training Details:

  • Dataset Split: 80% training, 20% testing.
  • Preprocessing:
    • Text cleaning (lowercasing, stopword removal, punctuation removal).
    • Features extracted using TF-IDF or Count Vectorization.
  • Metrics Evaluated:
    • Accuracy, Precision, Recall, and F1 Score.
    • ROC and Confusion Matrix.

Use Case:
Logistic Regression works best when computational resources are limited, or a fast and interpretable model is preferred.
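
A minimal scikit-learn pipeline along these lines, with illustrative hyperparameters rather than the project's exact settings (X_train etc. come from the split shown earlier):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# TF-IDF features feeding a logistic regression classifier.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),                   # or CountVectorizer, as noted above
    ("logreg", LogisticRegression(max_iter=1000)),  # max_iter is an illustrative choice
])
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```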


GPT-2

Overview:
GPT-2 (Generative Pre-trained Transformer 2) is a deep learning model developed by OpenAI for natural language understanding and generation. In this project, GPT-2 is fine-tuned for text classification, specifically detecting fake and real reviews by analyzing patterns, semantics, and linguistic nuances.

Key Features:

  • Contextual Understanding: GPT-2 excels in understanding the context and generating human-like text.
  • Fine-tuned for Specific Tasks: By retraining on a labeled dataset, GPT-2 learns to distinguish spam reviews from genuine ones.
  • Adaptability: GPT-2 can be used for text classification, summarization, and other NLP tasks.

Training Details:

  • Dataset Split: 80% training, 20% testing.
  • Fine-tuning Parameters:
    • Epochs: Adjusted to optimize performance.
    • Batch Size: Fine-tuned for resource optimization.
    • Learning Rate: Optimized using Optuna.
  • Performance Metrics:
    • Accuracy, Precision, Recall, and F1 Score tracked across epochs.
    • Training Loss monitored for optimization.

Advanced Features:

  • Uses transformer architecture, which includes multi-head attention and positional encodings.
  • Capable of handling long-range dependencies in text, making it ideal for nuanced tasks like fake review detection.

Use Case:
GPT-2 is suitable for tasks requiring a high degree of text understanding or tasks where leveraging semantic and contextual information significantly improves performance.
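
A hedged sketch of such fine-tuning with Hugging Face Transformers; the epochs, batch size, and learning rate shown are placeholders for the Optuna-tuned values, and train_dataset/eval_dataset are assumed to be pre-tokenized datasets:

```python
from transformers import (
    GPT2ForSequenceClassification,
    GPT2Tokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# train_dataset / eval_dataset are assumed to carry input_ids,
# attention_mask, and labels columns.
args = TrainingArguments(
    output_dir="gpt2-review-classifier",
    num_train_epochs=3,             # placeholder; the project adjusted this
    per_device_train_batch_size=8,  # placeholder; tuned for resources
    learning_rate=5e-5,             # placeholder; the project used Optuna
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```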


Multinomial Naive Bayes

Overview:
MultinomialNB is a variant of the Naive Bayes algorithm that works well with discrete features such as word counts.

Key Features:

  • Speed: Training and inference are fast, even on large, high-dimensional datasets.
  • Count-based Features: Naturally suited to bag-of-words representations such as Count Vectorization.
  • Strong Baseline: Despite its feature-independence assumption, it is a competitive baseline for text classification.

Training Details:

  • Sample Size: 100,000 samples in total: 80,000 in the training set and 20,000 in the testing set.
  • Input: TextReview (with stopwords removed)
  • Output: Class (0 or 1)
  • Dataset Split: 80% training, 20% testing.
  • Parameters:
    • Max Features: 10,000
    • N-gram Range: (1, 5)
    • Alpha: 0.3

Use Case:
MultinomialNB is mostly used for classification tasks, especially text classification where documents are represented as word counts.
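
A minimal pipeline using the parameters listed above (the max features, n-gram range, and alpha values come from this README; X_train/y_train are assumed from the 80/20 split):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Parameters taken from the list above: 10,000 max features,
# (1, 5) n-gram range, and alpha = 0.3.
nb = Pipeline([
    ("counts", CountVectorizer(max_features=10_000, ngram_range=(1, 5))),
    ("nb", MultinomialNB(alpha=0.3)),
])
nb.fit(X_train, y_train)
print("Test accuracy:", nb.score(X_test, y_test))
```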

Gemini 1.5 API

Overview:
The Gemini 1.5 API is a cutting-edge large language model (LLM) optimized for a wide range of natural language processing tasks. In this project, it was fine-tuned to classify Amazon reviews, leveraging its advanced semantic understanding and text classification capabilities.

Key Features:

  • Pre-trained for Versatility: Gemini 1.5 comes pre-trained on extensive datasets, making it adaptable for tasks such as classification, summarization, and sentiment analysis.
  • High Accuracy in Text Understanding: Its ability to process complex language patterns contributes to robust classification results.
  • Cloud-based Scalability: As an API, it supports seamless integration and scales effortlessly for production environments.

Training Details:

  • Sample Size: 100,000 samples in total: 80,000 in the training set and 20,000 in the testing set.
  • Input: TextReview (with stopwords removed)
  • Output: Class (0 or 1)
  • Dataset Split: 80% training, 20% testing.
  • Fine-tuning Parameters:
    • Optimized using Optuna.
    • Epochs: Set to the maximum (2) allowed by Google AI Studio.
    • Batch Size: Tuned for resource efficiency.
    • Learning Rate: Adjusted to minimize overfitting and maximize generalization.
  • Metrics Evaluated:
    • Accuracy, Precision, Recall, F1 Score, and confusion matrix.
    • Cross-entropy loss reduced to ≤ 1 during training.

Advanced Features:

  • Transformer and MoE Architecture: Incorporates self-attention and positional encodings for understanding text at both a local and global level.
  • API-Driven Deployment: Accessible via a secure, scalable API endpoint, enabling real-time inference for review classification tasks.
  • Real-Time Integration: Integrated with other models in the system for collaborative inference, enhancing overall system accuracy.

Use Case:
Gemini 1.5 API is ideal for applications requiring a balance between high accuracy and deployment scalability, such as real-time content moderation, sentiment analysis, and review authenticity verification.
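
A hedged sketch of inference through the google-generativeai Python client; the API key, prompt, and model ID are illustrative, since this README does not specify the tuned model's identifier (a fine-tuned model would be referenced by its own ID):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# "gemini-1.5-flash" is a stand-in for the project's tuned model ID.
model = genai.GenerativeModel("gemini-1.5-flash")

def classify_review(review: str) -> str:
    """Ask the model for a 0 (real) / 1 (fake) label on a single review."""
    prompt = (
        "Classify the following Amazon review as 0 (real) or 1 (fake). "
        "Answer with a single digit.\n\nReview: " + review
    )
    return model.generate_content(prompt).text.strip()

print(classify_review("Great product, exactly as described!"))
```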


BERT + LSTM

Overview:
BERT + LSTM is a combined model, generally more powerful than either model alone: the LSTM adds robustness to the base BERT model for variable-length text input, and the combination is well suited to feature extraction, since text data is sequential in context.

Key Features:

  • Bidirectional: Unlike unidirectional models, which read in only one direction (e.g., left to right), a bidirectional model extracts more of the information embedded in the text, improving overall performance.
  • Variable-length Input: Combining with an LSTM equips the base BERT model to handle variable-length input.
  • Balance: It remains relatively small while being powerful at information extraction.

Training Details:

  • Dataset Split: 80% of the 100,000 data points used for training; the remaining 20% used for testing.
  • Data Input: reviewText, with data cleaning techniques applied.
  • Training Output: A confidence score list of size 2 (real or fake).
  • Use of Optuna: 5 trials to find the best parameter combination.
  • Epochs: 5
  • Preprocessing:
    • Data cleaning, including stopword removal and lowercasing.
    • Added a sentiment score to check sentiment consistency between the reviewSummary and reviewText columns. (Removed after the workshop presentation to focus exclusively on reviewText.)
  • Metrics Evaluated:
    • Accuracy, Precision, Recall, and F1 Score.
    • Confusion Matrix.

Use Case:
BERT + LSTM is widely used for text classification tasks, such as fake review detection for online shopping and sentiment analysis of short texts.
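
A minimal PyTorch sketch of the combined architecture, assuming bert-base-uncased and an illustrative LSTM hidden size (the project tuned its parameters with Optuna):

```python
import torch.nn as nn
from transformers import BertModel

class BertLstmClassifier(nn.Module):
    """BERT token embeddings fed through a bidirectional LSTM, then a 2-way head."""

    def __init__(self, hidden_size: int = 256, num_classes: int = 2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,  # 768 for bert-base
            hidden_size=hidden_size,                  # illustrative; tuned via Optuna
            batch_first=True,
            bidirectional=True,
        )
        self.classifier = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        lstm_out, _ = self.lstm(outputs.last_hidden_state)
        pooled = lstm_out[:, -1, :]      # last time step (a simple pooling choice)
        return self.classifier(pooled)   # two confidence scores: real vs. fake
```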


Model Comparison

| Feature          | Logistic Regression                      | GPT-2                                   | Gemini 1.5 API                          | BERT + LSTM                            |
|------------------|------------------------------------------|-----------------------------------------|-----------------------------------------|----------------------------------------|
| Complexity       | Low                                      | High                                    | High                                    | High                                   |
| Training Time    | ~28 seconds for 100 epochs               | Hours per epoch (GPU-accelerated)       | Cloud-based, offloaded computation      | ~35 min per trial, 3 h 54 min in total |
| Interpretability | High (coefficients are interpretable)    | Low (black-box neural network)          | Moderate (API abstraction)              | Low (deep, complex layers)             |
| Resource Needs   | Minimal (CPU sufficient)                 | High (requires GPU for efficient use)   | Moderate (API-driven scaling)           | High (A100 used in the training above) |
| Accuracy         | Moderate (good for linear data)          | High (excels at capturing nuanced data) | High (API trained on extensive datasets)| High (~90% accuracy)                   |
| Suitability      | Simple, fast tasks with limited features | Complex, semantic-rich text tasks       | Versatile, real-time NLP tasks          | Balanced performance and size          |

Infrastructure


To run this project on your computer, you can follow these steps:

1. Clone or Download the Project

On GitHub, clone or download this project to your local environment:

git clone https://github.com/HarrisonPW/Amazon-Review-AI-Detector.git
cd Amazon-Review-AI-Detector

To run the backend, go to the folder Amazon-Review-AI-Detector. To run the frontend, go to the folder amazon-spam-detector-frontend.

2. Run on Docker

Ensure Docker Desktop is Running.

Run the following command (Linux/macOS):

sudo docker-compose up --build -d

OR, on Windows (where sudo is not used):

docker-compose up --build -d

3. Start the Development Server

In the frontend folder (amazon-spam-detector-frontend), start the Vite development server (typically npm run dev). You should then see output in the terminal similar to:

  VITE v5.x.x  ready in xx ms

  ➜  Local:   http://localhost:5173/
  ➜  Network: use --host to expose

4. Access the Project in Your Browser

Open your browser and go to the local address shown in the output (usually http://localhost:5173). You should be able to see the frontend project running on your local server.

Explanation of Other Files

  • Dockerfile and docker-compose.yml: If you want to run this project in a Docker container, you can use these files. Run docker-compose up to start the container.
  • tsconfig.json: TypeScript configuration file, which defines the TypeScript compilation options.
  • eslint.config.js: ESLint configuration file, used for code style and quality checking.
