A Bigram Language Model is a type of statistical language model used in natural language processing to predict the probability of a word given the previous word in a sequence. Here's a breakdown of its key components and how it works:
Definition:
- A bigram is a pair of consecutive words in a sequence. For example, in the sentence "I love natural language processing," the bigrams are: "I love," "love natural," "natural language," and "language processing."
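To make the pairing concrete, here is a tiny Python sketch of one common way to form the bigrams of that sentence (the zip pairing is just an illustration, not the only approach):

# Pair each token with the token that follows it to form bigrams
tokens = "I love natural language processing".lower().split()
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
# [('i', 'love'), ('love', 'natural'), ('natural', 'language'), ('language', 'processing')]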
Probability Calculation:
- The model estimates the probability of a word given the previous word. Formally, the probability of a word $w_i$ given the previous word $w_{i-1}$ is represented as $P(w_i | w_{i-1})$.
- Using the chain rule of probability together with the Markov assumption that each word depends only on the word immediately before it, the probability of a sequence of words $w_1, w_2, \ldots, w_n$ can be approximated as:

$$P(w_1, w_2, \ldots, w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i | w_{i-1})$$
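For instance, for the example sentence above, the bigram approximation expands to:

$$P(\text{I love natural language processing}) \approx P(\text{I}) \, P(\text{love} | \text{I}) \, P(\text{natural} | \text{love}) \, P(\text{language} | \text{natural}) \, P(\text{processing} | \text{language})$$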
Training the Model:
- To train a bigram model, you need a large corpus of text. The model learns the probabilities from this corpus by counting the occurrences of each bigram.
- The probability of a word $w_i$ given $w_{i-1}$ is calculated as:

$$P(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$$

where $C(w_{i-1}, w_i)$ is the number of times the bigram $(w_{i-1}, w_i)$ occurs in the corpus and $C(w_{i-1})$ is the number of times the word $w_{i-1}$ occurs.
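For example, with purely hypothetical counts where the bigram "I love" appears 3 times and the word "I" appears 5 times in the corpus, the estimate would be $P(\text{love} | \text{I}) = 3/5 = 0.6$.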
Smoothing:
- In practice, many bigrams may not appear in the training corpus, leading to zero probabilities. Smoothing techniques such as Laplace smoothing (adding a small constant to all counts) are used to handle this issue.
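Here is a minimal sketch of add-one (Laplace) smoothing; the function name is illustrative, and the count dictionaries are assumed to have the same shape as the ones built in the demo later in this post:

# Minimal sketch of add-one (Laplace) smoothing for bigram probabilities.
# bigram_counts maps (w1, w2) pairs to counts and unigram_counts maps words
# to counts; both names here are illustrative.
def laplace_bigram_probability(w1, w2, bigram_counts, unigram_counts):
    vocab_size = len(unigram_counts)               # V: number of distinct words
    bigram_count = bigram_counts.get((w1, w2), 0)  # 0 for unseen bigrams
    return (bigram_count + 1) / (unigram_counts.get(w1, 0) + vocab_size)

With this estimate, a bigram never seen in training receives probability $1 / (C(w_{i-1}) + V)$ instead of 0.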
Text Generation:
- Bigram models can be used to generate text by starting with a seed word and repeatedly sampling the next word based on the learned probabilities.
Speech Recognition and Text Prediction:
- They are used to improve the accuracy of speech recognition systems and text input prediction in keyboards.
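As a rough illustration of the prediction use case (the function name is illustrative, and the nested dictionary is assumed to have the same shape as the one built in the demo below), picking the most likely next word is simply an argmax over the learned bigram distribution:

# Illustrative sketch: predict the most likely next word under a bigram model.
# bigram_probabilities is a dict of dicts, {w1: {w2: P(w2 | w1), ...}, ...}.
def predict_next_word(previous_word, bigram_probabilities):
    candidates = bigram_probabilities.get(previous_word, {})
    if not candidates:
        return None  # no continuation observed for this word
    return max(candidates, key=candidates.get)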
Machine Translation:
- In machine translation, bigram models help in predicting the next word in the translated sentence based on the previous word, improving fluency.
Short Context:
- Bigram models only consider one previous word, which limits their ability to capture long-range dependencies in the text.
Data Sparsity:
- Even with large corpora, many possible bigrams may not appear, leading to sparse data issues.
Lack of Semantics:
- They rely purely on statistical information without understanding the semantics of the words, which can lead to nonsensical predictions.
Bigram language models are a foundational concept in natural language processing, offering a simple yet powerful way to model word sequences based on local context. Despite their limitations, they form the basis for more advanced models like trigrams, n-grams, and neural network-based language models.
Sure! Let's create a simple demonstration of a Bigram Language Model using Python. We will:
- Build a bigram language model from a sample corpus.
- Calculate bigram probabilities.
- Use the model to generate text.
First, let's prepare a small corpus for the demonstration:
corpus = [
    "I love natural language processing",
    "I love machine learning",
    "natural language processing is fun",
    "machine learning is fascinating",
    "I love fun activities"
]
Next, we'll implement the bigram model:
- Tokenize the corpus into bigrams.
- Count the occurrences of each bigram and unigram.
- Calculate the bigram probabilities.
- Generate new text based on the bigram probabilities.
Here's the complete code:
import random
from collections import defaultdict, Counter

# Sample corpus
corpus = [
    "I love natural language processing",
    "I love machine learning",
    "natural language processing is fun",
    "machine learning is fascinating",
    "I love fun activities"
]

# Tokenize the corpus into bigrams
def get_bigrams(corpus):
    bigrams = []
    for sentence in corpus:
        tokens = sentence.lower().split()
        bigrams.extend([(tokens[i], tokens[i+1]) for i in range(len(tokens)-1)])
    return bigrams

bigrams = get_bigrams(corpus)
unigrams = [token for sentence in corpus for token in sentence.lower().split()]

# Count the occurrences of bigrams and unigrams
bigram_counts = Counter(bigrams)
unigram_counts = Counter(unigrams)

# Calculate bigram probabilities
bigram_probabilities = defaultdict(lambda: defaultdict(float))
for (w1, w2), count in bigram_counts.items():
    bigram_probabilities[w1][w2] = count / unigram_counts[w1]

# Generate text using the bigram model
def generate_text(start_word, num_words):
    current_word = start_word
    sentence = [current_word]
    for _ in range(num_words - 1):
        next_word_candidates = list(bigram_probabilities[current_word].keys())
        next_word_probabilities = list(bigram_probabilities[current_word].values())
        if next_word_candidates:
            current_word = random.choices(next_word_candidates, next_word_probabilities)[0]
            sentence.append(current_word)
        else:
            break
    return ' '.join(sentence)

# Example of text generation
start_word = "i"
num_words = 10
generated_text = generate_text(start_word, num_words)
print("Generated Text:", generated_text)
Tokenize the Corpus:
- The get_bigrams function splits each sentence into lowercase tokens and creates bigrams from consecutive token pairs.
Count Occurrences:
- We use Counter to count the occurrences of each bigram and unigram.
Calculate Bigram Probabilities:
- The probabilities are calculated as the count of a bigram divided by the count of the first word in the bigram (a hand-worked check follows this list).
Generate Text:
- The generate_text function starts with a given word and repeatedly samples the next word based on the bigram probabilities until the specified number of words is reached or no continuation is available.
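As a quick hand-worked check of the probability calculation against the sample corpus above: "i" starts three sentences and is always followed by "love", while "love" is followed once each by "natural", "machine", and "fun". Running the model should therefore give:

# Sanity check: these values follow from the counts in the sample corpus
print(bigram_probabilities["i"]["love"])        # C(i, love) / C(i) = 3 / 3 = 1.0
print(bigram_probabilities["love"]["natural"])  # C(love, natural) / C(love) = 1 / 3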
Run the code to see the generated text. The output will vary each time due to the random choice of the next word based on the learned probabilities.
Would you like to see the code executed?
- change some parameters to see what happens
- add a new corpus case:
corpus = [
    "I love natural language processing",
    "I love machine learning",
    "natural language processing is fun",
    "machine learning is fascinating",
    "I love fun activities",
    "I love Jisoo",
]
- start_word = "i" ??
- num_words = 10 ??