This repository contains some fundamental data structures for NLP.
It is deprecated since I don't have time to maintain it. However, I might refactor this in the near future, and add the following components:
- Combinatory categorial grammar (CCG) parser
- Semantic role labeler
- word2vec
I am in favor of splitting LangKit into multiple sub-packages and focusing on one of the sub-packages.
Current features:
- Corpus readers
- Part-of-Speech Tagging (HMM)
- Language Modeling (Ngram model)
- Classification (Naive Bayes)
- Evaluation (F-score)
- Word Alignment (IBM models)
- Swift 3.0-dev 2016-05-03 build (DEVELOPMENT-SNAPSHOT-2016-04-25-a)
swiftenv is strongly recommended as a Swift version manager.
Simply add a dependency in Swift Package Manager.
dependencies: [
    .Package(url: "https://github.com/xinranmsn/LangKit", majorVersion: 0, minor: 2),
]
Then add import LangKit to your source file.
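For context, the dependencies entry above lives inside a package manifest. A minimal sketch of a complete Package.swift in the Swift 3-era manifest format might look like the following. The package name MyTagger is a hypothetical placeholder for your own package, and newer Swift Package Manager versions use a different manifest syntax.

```swift
// Package.swift (Swift 3-era manifest format, matching the snippet above)
import PackageDescription

let package = Package(
    name: "MyTagger", // hypothetical name; replace with your own package name
    dependencies: [
        .Package(url: "https://github.com/xinranmsn/LangKit", majorVersion: 0, minor: 2),
    ]
)
```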
- Train a part-of-speech tagger with your data
guard let taggedCorpus = CorpusReader(fromFile: "Data/train.txt", tokenizingWith: ^String.tagTokenized) else {
    print("❌ Corpora error!")
    exit(EXIT_FAILURE)
}
let tagger = PartOfSpeechTagger(taggedCorpus: taggedCorpus, smoothingMode: .goodTuring)
let sentence = "Colorless green ideas sleep furiously .".tokenized()
tagger.tag(sentence) |> print
- Train an n-gram language model with your data
guard let corpus = TokenCorpusReader(fromFile: "Data/train.txt") else {
    print("❌ Corpora error!")
    exit(EXIT_FAILURE)
}
let model = NgramModel(n: 3,
                       trainingCorpus: corpus,
                       smoothingMode: .none,
                       counter: TrieNgramCounter())
let sentence = "Colorless green ideas sleep furiously .".tokenized()
model.sentenceLogProbability(sentence) |> print
- Scripting in Swift
You can use LangKit from a script by adding a shebang line to the Swift source file. Example scripts are in Examples/Scripting/. Scripting in Swift is not a mature feature yet, so you'll need to build LangKit as a dynamic library.
swift build -Xswiftc -emit-library
cp LangKit.dylib .build/debug/LangKit.swiftmodule Examples/Scripting/lib/
cd Examples/Scripting/lib/
./Tagger.swift
This is what the shebang looks like:
#!/usr/bin/env swift -I<dir of LangKit.swiftmodule> -L<dir of LangKit.dylib> -lLangKit -target x86_64-apple-macosx10.10
I know. The -target x86_64-apple-macosx10.10 doesn't really look cool.
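The language-model usage above calls sentenceLogProbability on a trained NgramModel. As a rough illustration of what such a score involves, here is a minimal, self-contained sketch of a bigram model with additive (add-k) smoothing. It is not LangKit's implementation: ToyBigramModel, its methods, and the `<s>`/`</s>` boundary markers are hypothetical names chosen for this example, it uses bigrams rather than the trigrams in the snippet above to stay short, and it is written against current Swift syntax rather than the 2016 snapshot.

```swift
import Foundation

/// A toy bigram model with additive (add-k) smoothing.
/// Illustrative only; it does not mirror LangKit's NgramModel internals.
struct ToyBigramModel {
    private var bigramCounts: [String: [String: Int]] = [:]
    private var contextCounts: [String: Int] = [:]
    private var vocabulary: Set<String> = []
    private let k: Double

    // Hypothetical sentence boundary markers.
    static let start = "<s>"
    static let end = "</s>"

    init(sentences: [[String]], k: Double = 1.0) {
        self.k = k
        for sentence in sentences {
            let padded = [ToyBigramModel.start] + sentence + [ToyBigramModel.end]
            // The vocabulary holds every token that can be predicted,
            // so the start marker is excluded.
            vocabulary.formUnion(padded.dropFirst())
            for (previous, current) in zip(padded, padded.dropFirst()) {
                bigramCounts[previous, default: [:]][current, default: 0] += 1
                contextCounts[previous, default: 0] += 1
            }
        }
    }

    /// P(current | previous) = (count(previous, current) + k) / (count(previous) + k * |V|)
    func probability(of current: String, given previous: String) -> Double {
        let pairCount = Double(bigramCounts[previous]?[current] ?? 0)
        let contextCount = Double(contextCounts[previous] ?? 0)
        return (pairCount + k) / (contextCount + k * Double(vocabulary.count))
    }

    /// Sum of log bigram probabilities over the padded sentence (natural log).
    func sentenceLogProbability(_ sentence: [String]) -> Double {
        let padded = [ToyBigramModel.start] + sentence + [ToyBigramModel.end]
        return zip(padded, padded.dropFirst()).reduce(0.0) { sum, pair in
            sum + log(probability(of: pair.1, given: pair.0))
        }
    }
}

// Tiny made-up corpus, only for demonstration.
let corpus = [
    ["colorless", "green", "ideas"],
    ["green", "ideas", "sleep"],
]
let model = ToyBigramModel(sentences: corpus)
print(model.sentenceLogProbability(["green", "ideas"]))
```

With smoothing switched off (k = 0), any unseen bigram would get probability zero and the log probability would be negative infinity, which is the usual motivation for choosing a smoothing mode.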
You can use Xcode 7.3 with the Swift 3 dev toolchain, or the Swift 3 dev toolchain on its own. Xcode is recommended if you need a Playground (not available until the Swift 3 release version).
Make sure you have added Swift 3's bin directory to PATH.
Build:
$ swift build
Test:
$ swift test
The Foundation framework on Linux is based on apple/swift-corelibs-foundation, but the one on OS X is not, so API names may differ between the two platforms from time to time. Foundation APIs on OS X usually have the latest naming.
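One common way to cope with such platform differences is Swift's conditional compilation, which lets a single source file carry a separate branch per operating system. The sketch below only reports which branch was compiled. The function name platformFamily is a hypothetical helper, and no specific Foundation naming difference is assumed here; in real code each branch would hold the platform-specific call.

```swift
import Foundation

/// Returns a label for the platform this file was compiled for.
/// Each branch is where a platform-specific Foundation call would go.
func platformFamily() -> String {
    #if os(Linux)
    return "Linux (swift-corelibs-foundation)"
    #elseif os(macOS) // older toolchains spell this condition os(OSX)
    return "macOS (Apple Foundation)"
    #else
    return "other"
    #endif
}

print(platformFamily())
```

Only the branch matching the build target is compiled, so a call that does not exist on one platform does not break the build on the other.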
- Generate an Xcode project by executing swift build -X
- Switch the toolchain to Swift development snapshot
- Open LangKit.xcodeproj
Build: ⌘b
Test: ⌘u
- Language Modeling
  - N-gram language model
    - Trie counter
    - Dictionary counter
    - Smoothing
      - Additive
      - Good Turing
    - Intrinsic Evaluation
      - Perplexity
    - Incremental training
- Sequence Labeling
  - Hidden Markov model
    - Smoothing
      - Additive
      - Good Turing
    - Incremental training
  - Part-of-speech tagger
  - Maximum-entropy Markov model
- Preprocessing
  - Basics
  - Tokenization
    - Basics
    - Penn Treebank tokenizer
- Classification
  - Naive Bayes
  - Support vector machine
- Alignment
  - IBM Model 1
  - IBM Model 2
- Evaluation
  - F-score
- File IO
  - Corpus reader
  - ARPA LM file support
- Demo
  - Language identification ($ ./demo -n id)
  - HMM POS tagging ($ ./demo -n pos)
  - IBM Model 1 and 2 ($ ./demo -n (ibm1|ibm2))
  - Classification Evaluation ($ ./demo -n eval)
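The list above names F-score as the evaluation metric. For readers unfamiliar with it, here is a minimal sketch of how precision, recall, and the F-score for a single target label can be computed from gold and predicted labels. It is not LangKit's evaluation API: the FScore type, the fScore(gold:predicted:target:) function, and the sample tag sequences are all hypothetical, and the code uses current Swift syntax.

```swift
/// Precision, recall, and F-beta computed from raw counts for one label.
struct FScore {
    let truePositives: Int
    let falsePositives: Int
    let falseNegatives: Int

    /// Fraction of predicted target labels that were correct.
    var precision: Double {
        let predictedCount = truePositives + falsePositives
        return predictedCount == 0 ? 0 : Double(truePositives) / Double(predictedCount)
    }

    /// Fraction of gold target labels that were found.
    var recall: Double {
        let goldCount = truePositives + falseNegatives
        return goldCount == 0 ? 0 : Double(truePositives) / Double(goldCount)
    }

    /// F-beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives F1.
    func score(beta: Double = 1.0) -> Double {
        let betaSquared = beta * beta
        let denominator = betaSquared * precision + recall
        return denominator == 0 ? 0 : (1 + betaSquared) * precision * recall / denominator
    }
}

/// Counts true positives, false positives, and false negatives for `target`.
func fScore(gold: [String], predicted: [String], target: String) -> FScore {
    precondition(gold.count == predicted.count, "label sequences must have equal length")
    var tp = 0, fp = 0, fn = 0
    for (goldLabel, predictedLabel) in zip(gold, predicted) {
        if predictedLabel == target && goldLabel == target {
            tp += 1
        } else if predictedLabel == target {
            fp += 1
        } else if goldLabel == target {
            fn += 1
        }
    }
    return FScore(truePositives: tp, falsePositives: fp, falseNegatives: fn)
}

// Made-up tag sequences, only for demonstration.
let goldTags = ["NN", "VB", "NN", "JJ", "NN"]
let predictedTags = ["NN", "NN", "VB", "JJ", "NN"]
let result = fScore(gold: goldTags, predicted: predictedTags, target: "NN")
print(result.precision, result.recall, result.score())
```

Returning 0 when a denominator is zero is one convention among several; some evaluation tools report such cases as undefined instead.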
- Swift 2 is not supported.