iHub/text_disambiguation

text_disambiguation

Computer-mediated discourse such as email, SMS text messages, and online chat is full of spelling errors, non-standard words, abbreviations, false starts, repetitions, missing punctuation, inappropriate letter casing, pause-filling words, and cognitive errors. The preference for economy of words, which favors brief typing, and the need for semantic clarity together shape the structure of these messages. With the growing demand for data mining in the knowledge economy, text analytics has become an indispensable tool, and processing noisy text is essential in everyday applications. Short texts like these pose new challenges for standard natural language processing tools, which are usually designed for well-written text.

Because such messages behave differently from normal written text, and to reduce the considerable effort required to adapt the language model of a traditional translation system to SMS style, normalization is performed to moderate the irregularities in both English and Swahili text using a noisy channel model. The noisy channel model described in this paper incorporates two modules: a language unit, which models the probability that a user would select a word from a set of similar alternatives, and an error unit, which models the phonetic and graphemic factors involved in the formation of an ungrammatical word. The system works by first detecting ill-formed words, then generating a set of candidate corrections based on morphophonemic similarity to the invalid word, and finally ranking the candidates according to the language model and error model scores. Preliminary evaluation shows that the system can decode texting-language words to their standard counterparts with more than 65% accuracy.
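The detect, generate, and rank steps described above can be illustrated with a minimal Python sketch. It is not the repository's implementation: the toy English lexicon and its counts, the `ALPHA` penalty, and all function names are hypothetical, and plain edit distance stands in for the phonetic and graphemic error model. Each candidate `w` for a noisy token `s` is scored as `log P(w) + log P(s | w)`, where the second term is approximated by `-ALPHA * edit_distance(s, w)`.

```python
import math

# Hypothetical toy unigram counts; a real system would estimate these from a corpus.
UNIGRAM_COUNTS = {
    "tomorrow": 50, "see": 120, "you": 300, "great": 80, "tonight": 40,
    "the": 500, "to": 450, "text": 60, "later": 70, "late": 30,
}
TOTAL = sum(UNIGRAM_COUNTS.values())
ALPHA = 3.0  # assumed penalty per edit; stands in for a trained error model


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def candidates(token, max_dist=2):
    """Lexicon words within max_dist edits of the ill-formed token."""
    return [w for w in UNIGRAM_COUNTS if edit_distance(token, w) <= max_dist]


def score(token, word):
    """Noisy channel score: log language-model prob plus log error-model prob."""
    log_lm = math.log(UNIGRAM_COUNTS[word] / TOTAL)
    log_err = -ALPHA * edit_distance(token, word)
    return log_lm + log_err


def correct(token):
    """Return the token unchanged if it is in the lexicon, else the best candidate."""
    if token in UNIGRAM_COUNTS:  # detection step: in-lexicon words are well-formed
        return token
    cands = candidates(token)
    if not cands:  # nothing close enough; leave the token as typed
        return token
    return max(cands, key=lambda w: score(token, w))


def normalize(message):
    """Normalize a whitespace-separated message token by token."""
    return " ".join(correct(t) for t in message.lower().split())


if __name__ == "__main__":
    print(normalize("see you latr"))
```

In this sketch "latr" is one edit away from both "later" and "late", so the error term ties and the language model term decides in favor of the more frequent word. A fuller system would replace the edit-distance penalty with phonetic and graphemic similarity and would need a Swahili lexicon as well.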
