This Python project utilizes the Natural Language Toolkit (NLTK) library to implement an N-gram language model. The code includes the following packages:
- Pandas: A data manipulation library used for handling and analyzing tabular data.
- NLTK (Natural Language Toolkit):
  - `bigrams`: Function for extracting bigrams from a sequence of words.
  - `lm.preprocessing.pad_both_ends`: Preprocessing function for padding both ends of a sequence.
  - `tokenize.WordPunctTokenizer`: Tokenizer that splits text into word tokens and punctuation tokens.
  - `lm.Vocabulary`: Class for constructing a vocabulary from a given text corpus.
  - `lm.Laplace`: Class implementing Laplace (add-one) smoothing for language modeling.

This Python project implements a text prediction system using a Laplace-smoothed bigram model. The goal is to predict the next word of a sentence from a prefix supplied by the user. The project relies on the Natural Language Toolkit (NLTK) library for processing and modeling natural language data.

- Data Preprocessing:
  - The project starts by reading a CSV file (`train.csv`) containing text data.
  - HTML tags are removed from the 'Body' column of the dataset using a function called `remove_html_tags`.
  - The text is tokenized using the `WordPunctTokenizer` from NLTK.
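
The preprocessing steps above can be sketched roughly as follows. The body of `remove_html_tags` is not shown in this README, so the regex version here is only an assumption about how it might work, and the `preprocess` helper and the sample string are hypothetical.

```python
import re

from nltk.tokenize import WordPunctTokenizer

# Assumption: a simple regex is enough to strip tags. The project's own
# remove_html_tags may be implemented differently.
TAG_RE = re.compile(r"<[^>]+>")


def remove_html_tags(text):
    """Replace anything that looks like an HTML tag with a space."""
    return TAG_RE.sub(" ", text)


tokenizer = WordPunctTokenizer()


def preprocess(body):
    """Hypothetical helper: strip tags from one document, then tokenize it."""
    return tokenizer.tokenize(remove_html_tags(body))


sample = "<p>How do I parse JSON in Python?</p>"
tokens = preprocess(sample)
print(tokens)
# ['How', 'do', 'I', 'parse', 'JSON', 'in', 'Python', '?']
```

With the CSV loaded through pandas, the same helper could be applied to the whole column, for example `pd.read_csv("train.csv")["Body"].apply(preprocess)`.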
- N-gram Model Building:
  - The corpus is then processed further by padding both ends of each sentence with special symbols (`<s>` and `</s>`) using the `pad_both_ends` function.
  - Bigrams are extracted from the padded sentences using the `bigrams` function.
  - The vocabulary is constructed using the `Vocabulary` class from NLTK.
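
As a small illustration of these three steps, the sketch below pads one toy sentence, extracts its bigrams, and builds a vocabulary. The sentence and the `unk_cutoff=1` setting are assumptions chosen for the example, not values taken from the project.

```python
from nltk import bigrams
from nltk.lm import Vocabulary
from nltk.lm.preprocessing import pad_both_ends

sentence = ["the", "cat", "sat"]  # toy tokenized sentence

# For a bigram model (n=2), one padding symbol is added on each side.
padded = list(pad_both_ends(sentence, n=2))
print(padded)
# ['<s>', 'the', 'cat', 'sat', '</s>']

# Each bigram pairs a token with its successor.
pairs = list(bigrams(padded))
print(pairs)
# [('<s>', 'the'), ('the', 'cat'), ('cat', 'sat'), ('sat', '</s>')]

# Words seen fewer than unk_cutoff times are mapped to '<UNK>' on lookup.
vocab = Vocabulary(padded, unk_cutoff=1)
print("cat" in vocab)       # True
print(vocab.lookup("dog"))  # <UNK>
```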
- Laplace Smoothing Model:
  - The Laplace smoothing model is implemented using the `Laplace` class from NLTK's language modeling module.
  - The model is trained on the bigram data.
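
A minimal training sketch, assuming a two-sentence toy corpus in place of the project's data. The variable names are illustrative, and the project's script may wire the pieces together differently.

```python
from nltk import bigrams
from nltk.lm import Laplace, Vocabulary
from nltk.lm.preprocessing import pad_both_ends

# Toy corpus (assumption): two already-tokenized sentences.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]

padded_sents = [list(pad_both_ends(sent, n=2)) for sent in corpus]

# One list of bigram tuples per sentence is the training input.
train_data = [list(bigrams(sent)) for sent in padded_sents]

# The vocabulary is built from every padded token.
vocab = Vocabulary(tok for sent in padded_sents for tok in sent)

model = Laplace(2, vocabulary=vocab)  # order 2 = bigram model
model.fit(train_data)

# Add-one estimate: (count(context, word) + 1) / (count(context) + |V|)
score = model.score("cat", ["the"])
print(score)
```

In this toy corpus "the cat" occurs once, "the" is followed by a word twice, and the vocabulary has 7 entries (six tokens plus `<UNK>`), so the score works out to (1 + 1) / (2 + 7) = 2/9.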
- Next-word Prediction:
  - User input is taken to provide a prefix for next-word prediction.
  - The Laplace model scores each word in the vocabulary based on its likelihood to follow the given prefix.
  - The top three predictions with their respective scores are displayed.
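
The prediction step could look roughly like the sketch below. `train_bigram_model` and `predict_next` are hypothetical helpers, the corpus is a toy one, and skipping the special symbols (`<s>`, `</s>`, `<UNK>`) as candidates is an assumption rather than something the README states. Because the model is a bigram model, only the last token of the prefix is used as context.

```python
import heapq

from nltk import bigrams
from nltk.lm import Laplace, Vocabulary
from nltk.lm.preprocessing import pad_both_ends
from nltk.tokenize import WordPunctTokenizer

SPECIAL = {"<s>", "</s>", "<UNK>"}  # assumption: never suggest these


def train_bigram_model(sentences):
    """Hypothetical helper: fit a Laplace bigram model on tokenized sentences."""
    padded = [list(pad_both_ends(s, n=2)) for s in sentences]
    vocab = Vocabulary(tok for s in padded for tok in s)
    model = Laplace(2, vocabulary=vocab)
    model.fit([list(bigrams(s)) for s in padded])
    return model


def predict_next(model, prefix, k=3):
    """Return the k highest-scoring (word, score) pairs for the prefix."""
    tokens = WordPunctTokenizer().tokenize(prefix)
    # A bigram model conditions on one word; fall back to the start symbol.
    context = [tokens[-1]] if tokens else ["<s>"]
    scored = (
        (word, model.score(word, context))
        for word in model.vocab
        if word not in SPECIAL
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


model = train_bigram_model(
    [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
)
top_three = predict_next(model, "I saw the")
for word, score in top_three:
    print(f"{word}: {score:.4f}")
```

In the real script the prefix would come from `input()`; a fixed string is used here so the example runs non-interactively. With this toy corpus, "cat" ranks first and "dog" second after "the".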
- Install the required dependencies: `pip install -r requirements.txt`
- Run the code: `python text_prediction.py`
links : louisbau