## Tokenize

The Tokenize package contains three functions that are extremely fast and efficient at tokenizing text. No regular expressions are used. The whole process requires only two passes over the data: the first for UTF8 normalization and accent removal, the second for everything else.

## Warning

The same underlying array is used for each token, so you must copy the slice of bytes passed to the wordfn function if you intend to keep it. See my Unleak package for an easy one-liner that does this. If you are counting token occurrences with my BinSearch package or with the native map implementation, or you are converting the slice of bytes to a string, then it is not necessary to copy the slice, since these implementations make their own copies.
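
If you prefer not to depend on Unleak, a plain copy inside the callback achieves the same thing. This is a minimal sketch using only standard Go; append with a nil destination allocates a fresh slice and copies the token bytes into it:

tokens := make([][]byte, 0, 100)

wordfn := func(word []byte) {
	// word points into a buffer that is reused for the next token,
	// so copy the bytes before keeping a reference to them
	tokens = append(tokens, append([]byte(nil), word...))
}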

## Features

  • UTF8 normalization
  • Accent removal, e.g. á -> a
  • Special characters converted to their common form, e.g. æ -> e
  • Lowercasing
  • Hyphenated words are split
  • Contractions removed, e.g. l'histoire -> histoire (but they're -> theyre)
  • All UTF8 scripts are supported.

For example:

Et l'Histoire de l'amitié.

Becomes

et
histoire
de
amitie

## Installation

go get github.com/AlasdairF/Tokenize

## Parameters

The optional parameters are:

  • lowercase: converts all letters to lowercase
  • stripAccents: removes accents, e.g. á -> a
  • stripContractions: removes contractions, e.g. l'histoire -> histoire
  • stripNumbers: removes all numbers
  • stripForeign: leaves only a-zA-Z0-9 (after accent removal)

## Recommended

Recommended settings for tokenization of English are:

lowercase, stripAccents, stripForeign

Recommended settings for tokenization of continental European languages are:

lowercase, stripAccents, stripContractions, stripForeign

Recommended settings for tokenization of international scripts are:

lowercase, stripContractions

All non-letters and non-numbers, such as punctuation and whitespace, are always stripped.
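
As a concrete illustration, the recommended English settings above map onto the boolean arguments of AllInOne (documented in the next section) like this; data and wordfn stand in for your own input and token callback:

// English: lowercase, stripAccents, stripForeign on; the rest off
lowercase, stripAccents, stripContractions, stripNumbers, stripForeign := true, true, false, false, true
tokenize.AllInOne(data, wordfn, lowercase, stripAccents, stripContractions, stripNumbers, stripForeign)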

## AllInOne

The first parameter is the []byte data to process, the second is the function for what to do with each token, and the remaining parameters are the options.

For example, if you want to put all words into a slice then you would use:

tokens := make([][]byte, 0, 100)

wordfn := func(word []byte) {
	tokens = append(tokens, unleak.Bytes(word)) // using my Unleak package to make a copy of the slice
}

lowercase, stripAccents, stripContractions, stripNumbers, stripForeign := true, true, true, false, true
tokenize.AllInOne(data, wordfn, lowercase, stripAccents, stripContractions, stripNumbers, stripForeign)
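
If you would rather work with strings, the byte-to-string conversion already makes its own copy (as noted in the warning above), so no Unleak call is needed. A minimal sketch:

tokens := make([]string, 0, 100)

wordfn := func(word []byte) {
	tokens = append(tokens, string(word)) // string(word) copies the bytes, so the reused buffer is safe
}

lowercase, stripAccents, stripContractions, stripNumbers, stripForeign := true, true, true, false, true
tokenize.AllInOne(data, wordfn, lowercase, stripAccents, stripContractions, stripNumbers, stripForeign)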

## WithProvidedBuffer

Exactly the same as AllInOne, but it accepts a reusable custom.Buffer. This is much faster if you are repeatedly using this package on small chunks of data.

import "github.com/AlasdairF/Custom"
buf := custom.NewBuffer(32)
tokenize.WithProvidedBuffer(buf, data, wordfn, lowercase, stripAccents, stripContractions, stripNumbers, stripForeign)
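
The saving comes from reusing the same buffer across many calls, for example when tokenizing one small document at a time. A sketch, where docs is assumed to be a [][]byte holding the documents:

buf := custom.NewBuffer(32)
for _, doc := range docs {
	// the buffer is reused for every document instead of being reallocated each time
	tokenize.WithProvidedBuffer(buf, doc, wordfn, lowercase, stripAccents, stripContractions, stripNumbers, stripForeign)
}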

## Paginate

Paginate is the same as AllInOne but it also recognizes custom page breaks. Four parameters are required: the first is the []byte data to process, the second is the page break marker as []byte, the third is the function for what to do with each token, and the fourth is the function for what to do whenever a page break marker is reached. Please note that the page break marker itself should contain only single-byte (ASCII) characters; I usually use {#} as the page break marker.

For example:

pages := make([][][]byte, 0, 10)
tokens := make([][]byte, 0, 100)

wordfn := func(word []byte) {
	tokens = append(tokens, unleak.Bytes(word))
}
pagefn := func() {
	pages = append(pages, tokens)
	tokens = make([][]byte, 0, 100)
}

lowercase, stripAccents, stripContractions, stripNumbers, stripForeign := true, true, true, false, true
tokenize.Paginate(data, []byte("[newpage]"), wordfn, pagefn, lowercase, stripAccents, stripContractions, stripNumbers, stripForeign)
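
If you are assembling the input yourself from individual pages, you can join them with the same marker before calling Paginate. A sketch, assuming rawPages is a [][]byte of page contents; the trailing marker is added so that, going by the description above, pagefn also fires after the final page:

import "bytes"

marker := []byte("[newpage]")
data := bytes.Join(rawPages, marker)
data = append(data, marker...) // trailing marker so the last page is also flushed by pagefn
tokenize.Paginate(data, marker, wordfn, pagefn, lowercase, stripAccents, stripContractions, stripNumbers, stripForeign)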
