CERMatch is a novel Python library designed for evaluating Optical Character Recognition (OCR) systems using Character Error Rate (CER) based metrics. This library provides a unique method for matching ground truth text words with predicted words, offering a comprehensive analysis of OCR accuracy.
- Calculate CER Match Score: Compares predicted text against ground truth using CER.
- Configurable Parameters: Customize CER threshold, weights for match categories, and normalization settings.
- Detailed Output: Provides a composite score and ratios of matched, errored, invented, and missed words.
- Flexibility: Options for case sensitivity, special character inclusion, ASCII transliteration, and word list returns.
CERMatch operates in several steps:
-
Normalization: The input texts (both predicted and ground truth) are normalized by removing extra spaces, handling case sensitivity, special characters, and transliterating to ASCII as per configuration.
-
Word Comparison: The normalized texts are split into words. Each word in the predicted text is then compared with the ground truth text.
-
CER Calculation: For each predicted word, the Character Error Rate (CER) is calculated against each word in the ground truth. The CER threshold is used to determine if a word is considered a match, an error, or an invented word.
-
Scoring: The algorithm calculates the percentages of matched, errored, invented, and missed words. These values are then used to compute a composite score based on provided weights, giving a holistic view of the OCR's accuracy.
-
Output: CERMatch returns a dictionary with the composite score and ratios of each word category. Optionally, it can also return lists of words in each category for a more detailed analysis. The entries in the output dictionary are as follows:
- composite_score: The composite score based on the provided weights.
- ratio_matched: The percentage of words that match between the predicted and ground truth texts.
- ratio_errors: The percentage of words that contain errors in the predicted text.
- ratio_invented: The percentage of words that are invented in the predicted text.
- ratio_missed: The percentage of words that are missed in the predicted text.
Install CERMatch using pip:
pip install cermatch
Here is a basic example of how to use CERMatch:
from cermatch import calculateCERMatch
text_pred = "Hello, my namo is me Diego Bonilla"
text_gt = "Hello, my name is Diego Bonilla S."
# cer_threshold: CER threshold for considering a match.
# composite_weights: Weights for matched, errors, invented, missed words.
result = calculateCERMatch(
text_pred, text_gt,
cer_threshold=0.5,
composite_weights=(0.5, 0.2, 0.15, 0.15)
)
print(result)
"""
Output:
{
'composite_score': 0.7642857142857142,
'ratio_matched': 0.7142857142857143,
'ratio_errors': 0.14285714285714285,
'ratio_invented': 0.14285714285714285,
'ratio_missed': 0.14285714285714285
}
"""
CERMatch is licensed under the MIT License - see the LICENSE file for details.