Amazon-Review-AI-Detector

Open Source Packages

Libraries Used

  1. Scikit-learn

    • Description: Machine learning library for Python. Used for Logistic Regression and Multinomial Naive Bayes.
    • Language: Python.
  2. Hugging Face Transformers

    • Description: Library for state-of-the-art natural language processing models like GPT-2.
    • Language: Python.
  3. Optuna

    • Description: Framework for hyperparameter optimization.
    • Language: Python.
  4. Pandas

    • Description: Data manipulation and analysis library.
    • Language: Python.
  5. Matplotlib

    • Description: Library for creating static, animated, and interactive visualizations.
    • Language: Python.
  6. NumPy

    • Description: Fundamental package for numerical computation in Python, providing support for arrays and matrices.
    • Language: Python.
  7. Seaborn

    • Description: Statistical data visualization library based on Matplotlib.
    • Language: Python.
  8. PyTorch

    • Description: Deep learning framework providing tensor computation and automatic differentiation.
    • Language: Python.
  9. NLTK (Natural Language Toolkit)

    • Description: Library for processing and analyzing human language data (natural language processing).
    • Language: Python.
  10. TQDM

    • Description: Library for creating progress bars in Python.
    • Language: Python.
  11. Logging

    • Description: Built-in Python library for generating log messages.
    • Language: Python.
  12. JSON

    • Description: Built-in Python module for parsing and creating JSON data.
    • Language: Python.
  13. Time

    • Description: Built-in Python module for handling time-related functions.
    • Language: Python.


Datasets

Amazon Fake/Real Review Dataset

  • Source: Kaggle Dataset
  • Categories: Cell Phones and Accessories, Clothing, Home and Kitchen, Sports, and Toys.
  • Size: 100,000 data points (balanced dataset: 50% real and 50% fake reviews).
  • Preprocessing:
    • Text cleaning (removing punctuation, stopwords, and extra spaces).
    • Train-Test Split: 80% training, 20% testing.
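
A minimal sketch of this cleaning and splitting step, assuming a CSV with reviewText and class columns (the actual file and column names may differ):

```python
import re

import pandas as pd
from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")
from sklearn.model_selection import train_test_split

# Hypothetical file and column names for illustration.
df = pd.read_csv("amazon_reviews.csv")

STOPWORDS = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation and extra spaces, and drop stopwords."""
    text = re.sub(r"[^\w\s]", " ", text.lower())   # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()       # collapse extra spaces
    return " ".join(w for w in text.split() if w not in STOPWORDS)

df["cleaned"] = df["reviewText"].astype(str).map(clean_text)

# 80/20 split, stratified to preserve the 50/50 real/fake balance.
X_train, X_test, y_train, y_test = train_test_split(
    df["cleaned"], df["class"], test_size=0.2, stratify=df["class"], random_state=42
)
```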

Performance Measurement Tools

  1. Confusion Matrix

    • Evaluates model prediction performance.
    • Highlights Type I and Type II errors.
  2. ROC and Precision-Recall Curves

    • Visualization of performance metrics over different thresholds.
  3. Training and Evaluation Metrics

    • Accuracy, Precision, Recall, F1 Score across epochs.
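
All three tools can be produced with scikit-learn and Matplotlib; a minimal sketch, assuming y_test, y_pred (hard 0/1 predictions), and y_score (class-1 probabilities) from any of the models below:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    PrecisionRecallDisplay,
    RocCurveDisplay,
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

cm = confusion_matrix(y_test, y_pred)  # off-diagonal cells are the Type I/II errors
ConfusionMatrixDisplay(cm).plot()

# Performance across classification thresholds.
RocCurveDisplay.from_predictions(y_test, y_score)
PrecisionRecallDisplay.from_predictions(y_test, y_score)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 Score :", f1_score(y_test, y_pred))
plt.show()
```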

Models

Logistic Regression

Overview:
Logistic Regression is a classic statistical method for binary classification. In this project, it classifies Amazon reviews as spam (fake) or non-spam (real). Despite its simplicity, Logistic Regression is effective when feature relationships are linear or near-linear.

Key Features:

  • Simplicity and Interpretability: The model provides straightforward results and coefficients that explain the relationship between features and output.
  • Speed: Training is computationally efficient, even on large datasets.
  • Binary Classification: Suitable for a balanced dataset like the one used here.

Training Details:

  • Dataset Split: 80% training, 20% testing.
  • Preprocessing:
    • Text cleaning (lowercasing, stopword removal, punctuation removal).
    • Features extracted using TF-IDF or Count Vectorization.
  • Metrics Evaluated:
    • Accuracy, Precision, Recall, and F1 Score.
    • ROC and Confusion Matrix.

Use Case:
Logistic Regression works best when computational resources are limited, or a fast and interpretable model is preferred.
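
A minimal scikit-learn pipeline along these lines, with illustrative hyperparameters rather than the project's exact settings (X_train etc. come from the split shown earlier):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# TF-IDF features feeding a logistic regression classifier.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),                   # or CountVectorizer, as noted above
    ("logreg", LogisticRegression(max_iter=1000)),  # max_iter is an illustrative choice
])
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```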


GPT-2

Overview:
GPT-2 (Generative Pre-trained Transformer 2) is a deep learning model developed by OpenAI for natural language understanding and generation. In this project, GPT-2 is fine-tuned for text classification, specifically detecting fake and real reviews by analyzing patterns, semantics, and linguistic nuances.

Key Features:

  • Contextual Understanding: GPT-2 excels in understanding the context and generating human-like text.
  • Fine-tuned for Specific Tasks: By retraining on a labeled dataset, GPT-2 learns to distinguish spam reviews from genuine ones.
  • Adaptability: GPT-2 can be used for text classification, summarization, and other NLP tasks.

Training Details:

  • Dataset Split: 80% training, 20% testing.
  • Fine-tuning Parameters:
    • Epochs: Adjusted to optimize performance.
    • Batch Size: Fine-tuned for resource optimization.
    • Learning Rate: Optimized using Optuna.
  • Performance Metrics:
    • Accuracy, Precision, Recall, and F1 Score tracked across epochs.
    • Training Loss monitored for optimization.

Advanced Features:

  • Uses transformer architecture, which includes multi-head attention and positional encodings.
  • Capable of handling long-range dependencies in text, making it ideal for nuanced tasks like fake review detection.

Use Case:
GPT-2 is suitable for tasks requiring a high degree of text understanding or tasks where leveraging semantic and contextual information significantly improves performance.
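
A hedged sketch of such fine-tuning with Hugging Face Transformers; the epochs, batch size, and learning rate shown are placeholders for the Optuna-tuned values, and train_dataset/eval_dataset are assumed to be pre-tokenized datasets:

```python
from transformers import (
    GPT2ForSequenceClassification,
    GPT2Tokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# train_dataset / eval_dataset are assumed to carry input_ids,
# attention_mask, and labels columns.
args = TrainingArguments(
    output_dir="gpt2-review-classifier",
    num_train_epochs=3,             # placeholder; the project adjusted this
    per_device_train_batch_size=8,  # placeholder; tuned for resources
    learning_rate=5e-5,             # placeholder; the project used Optuna
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```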


Multinomial Naive Bayes

Overview:
MultinomialNB is a variant of the Naive Bayes algorithm that works well with discrete features such as word counts.

Key Features:

  • Speed: Training and inference are fast, even on large, high-dimensional datasets.
  • Count-based Features: Naturally suited to bag-of-words representations such as Count Vectorization.
  • Strong Baseline: Despite its feature-independence assumption, it is a competitive baseline for text classification.

Training Details:

  • Sample Size: 100,000 samples in total: 80,000 in the training set and 20,000 in the testing set.
  • Input: TextReview (with stopwords removed)
  • Output: Class (0 or 1)
  • Dataset Split: 80% training, 20% testing.
  • Parameters:
    • Max Features: 10,000
    • N-gram Range: (1, 5)
    • Alpha: 0.3

Use Case:
MultinomialNB is mostly used for classification tasks, especially text classification where documents are represented as word counts.
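
A minimal pipeline using the parameters listed above (the max features, n-gram range, and alpha values come from this README; X_train/y_train are assumed from the 80/20 split):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Parameters taken from the list above: 10,000 max features,
# (1, 5) n-gram range, and alpha = 0.3.
nb = Pipeline([
    ("counts", CountVectorizer(max_features=10_000, ngram_range=(1, 5))),
    ("nb", MultinomialNB(alpha=0.3)),
])
nb.fit(X_train, y_train)
print("Test accuracy:", nb.score(X_test, y_test))
```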

Gemini 1.5 API

Overview:
The Gemini 1.5 API is a cutting-edge large language model (LLM) optimized for a wide range of natural language processing tasks. In this project, it was fine-tuned to classify Amazon reviews, leveraging its advanced semantic understanding and text classification capabilities.

Key Features:

  • Pre-trained for Versatility: Gemini 1.5 comes pre-trained on extensive datasets, making it adaptable for tasks such as classification, summarization, and sentiment analysis.
  • High Accuracy in Text Understanding: Its ability to process complex language patterns contributes to robust classification results.
  • Cloud-based Scalability: As an API, it supports seamless integration and scales effortlessly for production environments.

Training Details:

  • Sample Size: 100,000 samples in total: 80,000 in the training set and 20,000 in the testing set.
  • Input: TextReview (with stopwords removed)
  • Output: Class (0 or 1)
  • Dataset Split: 80% training, 20% testing.
  • Fine-tuning Parameters:
    • Optimized using Optuna.
    • Epochs: Set to the maximum (2) allowed by Google AI Studio.
    • Batch Size: Tuned for resource efficiency.
    • Learning Rate: Adjusted to minimize overfitting and maximize generalization.
  • Metrics Evaluated:
    • Accuracy, Precision, Recall, F1 Score, and confusion matrix.
    • Cross-entropy loss reduced to ≤ 1 during training.

Advanced Features:

  • Transformer and MoE Architecture: Incorporates self-attention and positional encodings for understanding text at both a local and global level.
  • API-Driven Deployment: Accessible via a secure, scalable API endpoint, enabling real-time inference for review classification tasks.
  • Real-Time Integration: Integrated with other models in the system for collaborative inference, enhancing overall system accuracy.

Use Case:
Gemini 1.5 API is ideal for applications requiring a balance between high accuracy and deployment scalability, such as real-time content moderation, sentiment analysis, and review authenticity verification.
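
A hedged sketch of inference through the google-generativeai Python client; the API key, prompt, and model ID are illustrative, since this README does not specify the tuned model's identifier (a fine-tuned model would be referenced by its own ID):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# "gemini-1.5-flash" is a stand-in for the project's tuned model ID.
model = genai.GenerativeModel("gemini-1.5-flash")

def classify_review(review: str) -> str:
    """Ask the model for a 0 (real) / 1 (fake) label on a single review."""
    prompt = (
        "Classify the following Amazon review as 0 (real) or 1 (fake). "
        "Answer with a single digit.\n\nReview: " + review
    )
    return model.generate_content(prompt).text.strip()

print(classify_review("Great product, exactly as described!"))
```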


BERT + LSTM

Overview:
BERT + LSTM is a combined model, generally more powerful than either model alone: the LSTM adds robustness to the base BERT model for variable-length text input, and the combination is well suited to feature extraction, since text data is sequential in context.

Key Features:

  • Bidirectional: Unlike unidirectional models, which read in only one direction (e.g., left to right), a bidirectional model extracts more of the information embedded in the text, improving overall performance.
  • Variable-length Input: Combining with an LSTM equips the base BERT model to handle variable-length input.
  • Balance: It remains relatively small while being powerful at information extraction.

Training Details:

  • Dataset Split: 80% of the 100,000 data points used for training; the remaining 20% used for testing.
  • Data Input: reviewText, with data cleaning techniques applied.
  • Training Output: A confidence score list of size 2 (real or fake).
  • Use of Optuna: 5 trials to find the best parameter combination.
  • Epochs: 5
  • Preprocessing:
    • Data cleaning, including stopword removal and lowercasing.
    • Added a sentiment score to check sentiment consistency between the reviewSummary and reviewText columns. (Removed after the workshop presentation to focus exclusively on reviewText.)
  • Metrics Evaluated:
    • Accuracy, Precision, Recall, and F1 Score.
    • Confusion Matrix.

Use Case:
BERT + LSTM is widely used for text classification tasks, such as fake review detection for online shopping and sentiment analysis of short texts.
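
A minimal PyTorch sketch of the combined architecture, assuming bert-base-uncased and an illustrative LSTM hidden size (the project tuned its parameters with Optuna):

```python
import torch.nn as nn
from transformers import BertModel

class BertLstmClassifier(nn.Module):
    """BERT token embeddings fed through a bidirectional LSTM, then a 2-way head."""

    def __init__(self, hidden_size: int = 256, num_classes: int = 2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,  # 768 for bert-base
            hidden_size=hidden_size,                  # illustrative; tuned via Optuna
            batch_first=True,
            bidirectional=True,
        )
        self.classifier = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        lstm_out, _ = self.lstm(outputs.last_hidden_state)
        pooled = lstm_out[:, -1, :]      # last time step (a simple pooling choice)
        return self.classifier(pooled)   # two confidence scores: real vs. fake
```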


Model Comparison

| Feature          | Logistic Regression                      | GPT-2                                   | Gemini 1.5 API                          | BERT + LSTM                            |
|------------------|------------------------------------------|-----------------------------------------|-----------------------------------------|----------------------------------------|
| Complexity       | Low                                      | High                                    | High                                    | High                                   |
| Training Time    | ~28 seconds for 100 epochs               | Hours per epoch (GPU-accelerated)       | Cloud-based, offloaded computation      | ~35 min per trial, 3 h 54 min in total |
| Interpretability | High (coefficients are interpretable)    | Low (black-box neural network)          | Moderate (API abstraction)              | Low (deep, complex layers)             |
| Resource Needs   | Minimal (CPU sufficient)                 | High (requires GPU for efficient use)   | Moderate (API-driven scaling)           | High (A100 used in the training above) |
| Accuracy         | Moderate (good for linear data)          | High (excels at capturing nuanced data) | High (API trained on extensive datasets)| High (~90% accuracy)                   |
| Suitability      | Simple, fast tasks with limited features | Complex, semantic-rich text tasks       | Versatile, real-time NLP tasks          | Balanced performance and size          |

Infrastructure


To run this project on your computer, you can follow these steps:

1. Clone or Download the Project

On GitHub, clone or download this project to your local environment:

git clone https://github.com/HarrisonPW/Amazon-Review-AI-Detector.git
cd Amazon-Review-AI-Detector

To run the backend, go to the folder Amazon-Review-AI-Detector. To run the frontend, go to the folder amazon-spam-detector-frontend.

2. Run on Docker

Ensure Docker Desktop is Running.

Run the following command (Linux/macOS):

sudo docker-compose up --build -d

OR, on Windows (where sudo is not used):

docker-compose up --build -d

3. Start the Development Server

In the frontend folder (amazon-spam-detector-frontend), start the Vite development server (typically npm run dev). You should then see output in the terminal similar to:

  VITE v5.x.x  ready in xx ms

  ➜  Local:   http://localhost:5173/
  ➜  Network: use --host to expose

4. Access the Project in Your Browser

Open your browser and go to the local address shown in the output (usually http://localhost:5173). You should be able to see the frontend project running on your local server.

Explanation of Other Files

  • Dockerfile and docker-compose.yml: If you want to run this project in a Docker container, you can use these files. Run docker-compose up to start the container.
  • tsconfig.json: TypeScript configuration file, which defines the TypeScript compilation options.
  • eslint.config.js: ESLint configuration file, used for code style and quality checking.
