Bigram Language Model

A Bigram Language Model is a type of statistical language model used in natural language processing to predict the probability of a word given the previous word in a sequence. Here's a breakdown of its key components and how it works:

Basics of Bigram Models

  1. Definition:

    • A bigram is a pair of consecutive words in a sequence. For example, in the sentence "I love natural language processing," the bigrams are: "I love," "love natural," "natural language," and "language processing."
  2. Probability Calculation:

    • The model estimates the probability of a word given the previous word. Formally, the probability of a word $w_i$ given the previous word $w_{i-1}$ is represented as $P(w_i | w_{i-1})$.
    • Applying the chain rule of probability together with the bigram (first-order Markov) assumption, which conditions each word only on the word immediately before it, the probability of a sequence of words $w_1, w_2, \ldots, w_n$ can be approximated as:

$$P(w_1, w_2, \ldots, w_n) \approx P(w_1) \cdot P(w_2 | w_1) \cdot P(w_3 | w_2) \cdot \ldots \cdot P(w_n | w_{n-1})$$
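
For example, applying this approximation to the sentence "I love natural language processing" (lowercased) gives:

$$P(\text{i love natural language processing}) \approx P(\text{i}) \cdot P(\text{love} | \text{i}) \cdot P(\text{natural} | \text{love}) \cdot P(\text{language} | \text{natural}) \cdot P(\text{processing} | \text{language})$$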

  3. Training the Model:
    • To train a bigram model, you need a large corpus of text. The model learns the probabilities from this corpus by counting the occurrences of each bigram.
    • The probability of a word $w_i$ given $w_{i-1}$ is calculated as:

$$P(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$$

where $C(w_{i-1}, w_i)$ is the count of the bigram $(w_{i-1}, w_i)$ and $C(w_{i-1})$ is the count of the word $w_{i-1}$.
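
For example, in the small demo corpus used later in this chapter, the word "love" occurs three times and the bigram "love natural" occurs once, so $P(\text{natural} | \text{love}) = \frac{1}{3}$.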

  4. Smoothing:
    • In practice, many bigrams may not appear in the training corpus, leading to zero probabilities. Smoothing techniques such as Laplace smoothing (adding a small constant to all counts) are used to handle this issue; a minimal sketch follows this list.
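
As a minimal sketch of add-one (Laplace) smoothing, assuming the bigram_counts and unigram_counts Counter objects built in the code later in this chapter (the helper name laplace_prob is illustrative):

def laplace_prob(w1, w2, bigram_counts, unigram_counts, vocab_size):
    # Add 1 to the bigram count and add the vocabulary size to the
    # denominator, so unseen bigrams receive a small non-zero probability
    # and the distribution over next words still sums to 1.
    return (bigram_counts.get((w1, w2), 0) + 1) / (unigram_counts[w1] + vocab_size)

Here vocab_size is the number of distinct words in the corpus, e.g. len(unigram_counts).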

Applications

  1. Text Generation:

    • Bigram models can be used to generate text by starting with a seed word and repeatedly sampling the next word based on the learned probabilities.
  2. Speech Recognition and Text Prediction:

    • They are used to improve the accuracy of speech recognition systems and text input prediction in keyboards; a minimal next-word-prediction sketch follows this list.
  3. Machine Translation:

    • In machine translation, bigram models help in predicting the next word in the translated sentence based on the previous word, improving fluency.
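
As a minimal sketch of keyboard-style next-word prediction, assuming the bigram_probabilities table built in the code later in this chapter (the helper name predict_next_word is illustrative):

def predict_next_word(word, bigram_probabilities):
    # Return the most probable next word according to the bigram model,
    # or None if the word was never seen at the start of a bigram.
    candidates = bigram_probabilities.get(word)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

For example, predict_next_word("i", bigram_probabilities) returns "love" on the demo corpus, since every occurrence of "i" there is followed by "love".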

Limitations

  1. Short Context:

    • Bigram models only consider one previous word, which limits their ability to capture long-range dependencies in the text.
  2. Data Sparsity:

    • Even with large corpora, many possible bigrams may not appear, leading to sparse data issues.
  3. Lack of Semantics:

    • They rely purely on statistical information without understanding the semantics of the words, which can lead to nonsensical predictions.

Conclusion

Bigram language models are a foundational concept in natural language processing, offering a simple yet powerful way to model word sequences based on local context. Despite their limitations, they form the basis for more advanced models such as trigram and higher-order n-gram models, as well as neural network-based language models.

Let's create a simple demonstration of a Bigram Language Model using Python. We will:

  1. Build a bigram language model from a sample corpus.
  2. Calculate bigram probabilities.
  3. Use the model to generate text.

First, let's prepare a small corpus for the demonstration:

corpus = [
    "I love natural language processing",
    "I love machine learning",
    "natural language processing is fun",
    "machine learning is fascinating",
    "I love fun activities"
]

Next, we'll implement the bigram model:

  1. Tokenize the corpus into bigrams.
  2. Count the occurrences of each bigram and unigram.
  3. Calculate the bigram probabilities.
  4. Generate new text based on the bigram probabilities.

Here's the complete code:

import random
from collections import defaultdict, Counter

# Sample corpus
corpus = [
    "I love natural language processing",
    "I love machine learning",
    "natural language processing is fun",
    "machine learning is fascinating",
    "I love fun activities"
]

# Tokenize the corpus into bigrams
def get_bigrams(corpus):
    bigrams = []
    for sentence in corpus:
        tokens = sentence.lower().split()
        bigrams.extend([(tokens[i], tokens[i+1]) for i in range(len(tokens)-1)])
    return bigrams

bigrams = get_bigrams(corpus)
unigrams = [token for sentence in corpus for token in sentence.lower().split()]

# Count the occurrences of bigrams and unigrams
bigram_counts = Counter(bigrams)
unigram_counts = Counter(unigrams)

# Calculate bigram probabilities
bigram_probabilities = defaultdict(lambda: defaultdict(float))
for (w1, w2), count in bigram_counts.items():
    bigram_probabilities[w1][w2] = count / unigram_counts[w1]

# Generate text using the bigram model
def generate_text(start_word, num_words):
    current_word = start_word
    sentence = [current_word]
    for _ in range(num_words - 1):
        # Look up the learned distribution over possible next words.
        next_word_candidates = list(bigram_probabilities[current_word].keys())
        next_word_probabilities = list(bigram_probabilities[current_word].values())
        if next_word_candidates:
            # Sample the next word in proportion to its bigram probability.
            current_word = random.choices(next_word_candidates, weights=next_word_probabilities)[0]
            sentence.append(current_word)
        else:
            # Stop early if the current word never starts a bigram in the corpus.
            break
    return ' '.join(sentence)

# Example of text generation (the start word must be lowercase, since the
# corpus is lowercased during tokenization)
start_word = "i"
num_words = 10
generated_text = generate_text(start_word, num_words)
print("Generated Text:", generated_text)

Explanation:

  1. Tokenize the Corpus:

    • The get_bigrams function splits each sentence into tokens and creates bigrams.
  2. Count Occurrences:

    • We use Counter to count the occurrences of each bigram and unigram.
  3. Calculate Bigram Probabilities:

    • The probabilities are calculated as the count of a bigram divided by the count of the first word in the bigram.
  4. Generate Text:

    • The generate_text function starts with a given word and generates the next word based on the bigram probabilities until the specified number of words is reached or the current word has no recorded continuation.

Run the code to see the generated text. The output will vary each time due to the random choice of the next word based on the learned probabilities.
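
As a small extension (a sketch, not part of the code above; the helper name sentence_probability is illustrative), the chain-rule approximation from the beginning of the chapter can be evaluated with the same unigram_counts and bigram_probabilities objects:

def sentence_probability(sentence):
    # Approximate P(w1, ..., wn) as P(w1) times the product of the learned
    # bigram probabilities P(w_i | w_{i-1}). Unseen bigrams contribute 0,
    # since this model is unsmoothed.
    tokens = sentence.lower().split()
    prob = unigram_counts[tokens[0]] / sum(unigram_counts.values())
    for w1, w2 in zip(tokens, tokens[1:]):
        prob *= bigram_probabilities[w1][w2]
    return prob

print(sentence_probability("i love natural language processing"))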

To explore the model further, try changing a few parameters and re-running the code:

  • Add a new sentence to the corpus, for example:

corpus = [
    "I love natural language processing",
    "I love machine learning",
    "natural language processing is fun",
    "machine learning is fascinating",
    "I love fun activities",
    "I love Jisoo",
]

  • Change start_word = "i" to another lowercase word that appears in the corpus and observe how the generated text changes.
  • Change num_words = 10 to control the maximum length of the generated text.