Farasa Desegmenter (Unsegment)

Simple Script to undo Farasa Segmentation

Tested on 106K words. Error rate: 0.074%. Error count: 79.

Function docs:

To import the functions, use: from desegmentors import *

desegmentword: Word desegmenter that takes a Farasa-segmented word and removes the '+' signs, joining the segments back into a single word

>>> desegment("ال+يومي+ة")
اليومية
>>> desegment("ل+ال+يومي+ة") # Handles Lam + the definite article Al (Al-Ta'rif)
لليومية
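
The two examples above can be reproduced with a minimal sketch like the one below. It is not the code in desegmentors.py: the name sketch_desegment_word is hypothetical, and only the two behaviors shown above are covered (joining segments, and merging a Lam prefix with a following Al into a doubled Lam). The actual script likely handles more cases.

```python
def sketch_desegment_word(word: str) -> str:
    """Hypothetical sketch: join Farasa segments of a single word.

    Assumption: the only special case handled is the Lam prefix followed
    by the definite article (ل + ال), which is written as لل.
    """
    segments = word.split("+")
    joined = []
    i = 0
    while i < len(segments):
        is_lam = segments[i] == "ل"
        next_is_al = i + 1 < len(segments) and segments[i + 1] == "ال"
        if is_lam and next_is_al:
            # Lam + Al: the Alef of the article is dropped in writing.
            joined.append("لل")
            i += 2
        else:
            joined.append(segments[i])
            i += 1
    return "".join(joined)


print(sketch_desegment_word("ال+يومي+ة"))
print(sketch_desegment_word("ل+ال+يومي+ة"))
```

Working on the list of segments rather than on the raw string avoids matching a stem that merely ends in Lam and is followed by "+ال".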

desegment_line: Simple wrapper over desegmentword that splits a string on the sep character and desegments each word

>>> desegment_line('ال+دراس+ات ال+نظري+ة ل+ال+تصميم ال+حديث')
الدراسات النظرية للتصميم الحديث
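
A line-level wrapper of this kind might look like the sketch below. The names sketch_desegment_line and word_fn are hypothetical, and the default of a single space for sep is an assumption, not something confirmed by the script. The word-level function is passed in so the sketch stays self-contained.

```python
from typing import Callable


def sketch_desegment_line(
    line: str,
    word_fn: Callable[[str], str],
    sep: str = " ",  # assumption: words are separated by a single space
) -> str:
    """Hypothetical sketch: split on sep, desegment each word, rejoin."""
    return sep.join(word_fn(word) for word in line.split(sep))


# Stand-in word function for illustration only: it just drops the '+' signs
# and does not handle special cases such as Lam + Al.
def strip_plus(word: str) -> str:
    return word.replace("+", "")


print(sketch_desegment_line("ال+دراس+ات ال+نظري+ة", strip_plus))
```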

desegment_arabert: Use this function if sentence tokenization was done using from arabert.preprocess_arabert import preprocess with Farasa enabled. AraBERT segmentation using Farasa adds a space after the '+' for prefixes, and before the '+' for suffixes

>>> desegment_arabert("ال+ دراس +ات ال+ نظري +ة ل+ ال+ تصميم ال+ حديث")
الدراسات النظرية للتصميم الحديث
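
One way to think about the AraBERT format is as plain Farasa output with extra spaces around the '+' markers. The sketch below, with the hypothetical name sketch_arabert_to_farasa, removes those spaces so the result can be handled like ordinary Farasa-segmented text. It illustrates the spacing convention described above and is not necessarily how desegment_arabert is implemented.

```python
def sketch_arabert_to_farasa(text: str) -> str:
    """Hypothetical sketch: undo the AraBERT spacing around '+' markers.

    Prefixes look like 'ال+ ' (space after '+') and suffixes look like
    ' +ات' (space before '+'); both spaces are removed here.
    """
    text = text.replace("+ ", "+")  # reattach prefixes to the following token
    text = text.replace(" +", "+")  # reattach suffixes to the preceding token
    return text


arabert_text = "ال+ دراس +ات ال+ نظري +ة ل+ ال+ تصميم ال+ حديث"
print(sketch_arabert_to_farasa(arabert_text))
```

The output still contains '+' signs, so a word-level desegmentation step is needed afterwards to obtain the final text.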
