An open-source Korean text processor made by Twitter
- As of early 2017, an official fork is maintained at http://openkoreantext.org. All development after version 4.4 is done in open-korean-text.
A Scala library for processing Korean text, with a Java wrapper. twitter-korean-text currently provides text normalization, tokenization (morphological analysis), and stemming. It handles short tweets as well as longer text. If you would like to take part in development, please join our community at Google Forum. Everyone is welcome, from beginners who want to learn how to use the library to those who want to contribute code.
The goal of twitter-korean-text is to extract index terms from big data and similar sources through simple Korean processing. It does not aim to be a complete morphological analyzer.
twitter-korean-text supports four features: normalization, tokenization, stemming, and phrase extraction.
Normalization (입니닼ㅋㅋ -> 입니다 ㅋㅋ, 샤릉해 -> 사랑해)
- 한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ -> 한국어를 처리하는 예시입니다 ㅋㅋ
Tokenization
- 한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어Noun, 를Josa, 처리Noun, 하는Verb, 예시Noun, 입Adjective, 니다Eomi ㅋㅋKoreanParticle
Stemming (입니다 -> 이다)
- 한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어Noun, 를Josa, 처리Noun, 하다Verb, 예시Noun, 이다Adjective, ㅋㅋKoreanParticle
Phrase extraction
- 한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어, 처리, 예시, 처리하는 예시
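The normalization examples above fold an elongated laugh into a short one. As a rough illustration of that idea only, and not the normalizer that twitter-korean-text actually uses, the following Java sketch moves a trailing ㅋ out of a syllable block when another ㅋ follows (닼ㅋ becomes 다ㅋㅋ) and then collapses any run of three or more identical characters down to two. The class and method names are hypothetical, and Hangul characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
/** Illustrative sketch only; NOT the normalizer used by twitter-korean-text. */
final class NormalizationSketch {
    private static final char SYLLABLE_FIRST = '\uAC00'; // first precomposed Hangul syllable
    private static final char SYLLABLE_LAST = '\uD7A3';  // last precomposed Hangul syllable
    private static final int FINALS_PER_BLOCK = 28;      // final-consonant slots per syllable
    private static final int FINAL_KIEUK = 24;           // slot index of the final consonant kieuk
    private static final char JAMO_KIEUK = '\u314B';     // standalone kieuk, the "k" of laughter

    /** Moves a final kieuk out of a syllable when a standalone kieuk follows it. */
    static String splitFinalKieuk(String text) {
        StringBuilder out = new StringBuilder(text.length() + 4);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isSyllable = c >= SYLLABLE_FIRST && c <= SYLLABLE_LAST;
            boolean nextIsKieuk = i + 1 < text.length() && text.charAt(i + 1) == JAMO_KIEUK;
            if (isSyllable && nextIsKieuk
                    && (c - SYLLABLE_FIRST) % FINALS_PER_BLOCK == FINAL_KIEUK) {
                // Subtracting the slot index yields the same syllable without a final consonant.
                out.append((char) (c - FINAL_KIEUK)).append(JAMO_KIEUK);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Collapses any run of three or more identical characters to exactly two. */
    static String collapseRepeats(String text) {
        return text.replaceAll("(.)\\1{2,}", "$1$1");
    }

    /** Applies both steps in order. */
    static String normalize(String text) {
        return collapseRepeats(splitFinalKieuk(text));
    }
}
```

Under these assumptions, normalize applied to 입니닼ㅋㅋㅋㅋㅋ yields 입니다ㅋㅋ. Corrections such as 샤릉해 -> 사랑해 need knowledge of real words, which this sketch does not attempt.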
Introductory Presentation: Google Slides
Gunja Agrawal kindly created a test API webpage for this project: http://gunjaagrawal.com/langhack/
Open-sourced here: twitter-korean-tokenizer-api
To include this in your Maven-based JVM project, add the following lines to your pom.xml:
<dependency>
  <groupId>com.twitter.penguin</groupId>
  <artifactId>korean-text</artifactId>
  <version>4.4</version>
</dependency>
The Maven site is available at http://twitter.github.io/twitter-korean-text/ and the Scaladocs are at http://twitter.github.io/twitter-korean-text/scaladocs/
modamoda kindly offered a .net wrapper: https://github.com/modamoda/TwitterKoreanProcessorCS
Ch0p kindly offered a node.js wrapper: twtkrjs
Youngrok Kim kindly offered a node.js wrapper: node-twitter-korean-text
Baeg-il Kim kindly offered a Python version: https://github.com/cedar101/twitter-korean-py
Jaepil Jeong kindly offered a Python wrapper: https://github.com/jaepil/twkorean
- The Python Korean NLP project KoNLPy now includes twitter-korean-text. twkorean is bundled in the KoNLPy package, which makes it easy to use from Python.
jun85664396 kindly offered a Ruby wrapper: twitter-korean-text-ruby
- This provides access to com.twitter.penguin.korean.TwitterKoreanProcessorJava (Java wrapper).
Jaehyun Shin kindly offered a Ruby wrapper: twitter-korean-text-ruby
- This provides access to com.twitter.penguin.korean.TwitterKoreanProcessor (Original Scala Class).
socurites's Korean analyzer for elasticsearch based on twitter-korean-text: tkt-elasticsearch
Clone the Git repository and build it with Maven.
git clone https://github.com/twitter/twitter-korean-text.git
cd twitter-korean-text
mvn compile
Open 'pom.xml' from your favorite IDE.
You can find these usage examples in the examples folder.
from Scala
import com.twitter.penguin.korean.TwitterKoreanProcessor
import com.twitter.penguin.korean.phrase_extractor.KoreanPhraseExtractor.KoreanPhrase
import com.twitter.penguin.korean.tokenizer.KoreanTokenizer.KoreanToken

object ScalaTwitterKoreanTextExample {
  def main(args: Array[String]): Unit = {
    val text = "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ #한국어"

    // Normalize
    val normalized: CharSequence = TwitterKoreanProcessor.normalize(text)
    println(normalized)
    // 한국어를 처리하는 예시입니다ㅋㅋ #한국어

    // Tokenize
    val tokens: Seq[KoreanToken] = TwitterKoreanProcessor.tokenize(normalized)
    println(tokens)
    // List(한국어(Noun: 0, 3), 를(Josa: 3, 1), (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2), (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2), (Space: 17, 1), #한국어(Hashtag: 18, 4))

    // Stemming
    val stemmed: Seq[KoreanToken] = TwitterKoreanProcessor.stem(tokens)
    println(stemmed)
    // List(한국어(Noun: 0, 3), 를(Josa: 3, 1), (Space: 4, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2), (Space: 9, 1), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2), (Space: 17, 1), #한국어(Hashtag: 18, 4))

    // Phrase extraction
    val phrases: Seq[KoreanPhrase] = TwitterKoreanProcessor.extractPhrases(tokens, filterSpam = true, enableHashtags = true)
    println(phrases)
    // List(한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4))
  }
}
from Java
import java.util.List;

import scala.collection.Seq;

import com.twitter.penguin.korean.TwitterKoreanProcessor;
import com.twitter.penguin.korean.TwitterKoreanProcessorJava;
import com.twitter.penguin.korean.phrase_extractor.KoreanPhraseExtractor;
import com.twitter.penguin.korean.tokenizer.KoreanTokenizer;

public class JavaTwitterKoreanTextExample {
  public static void main(String[] args) {
    String text = "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ #한국어";

    // Normalize
    CharSequence normalized = TwitterKoreanProcessorJava.normalize(text);
    System.out.println(normalized);
    // 한국어를 처리하는 예시입니다ㅋㅋ #한국어

    // Tokenize
    Seq<KoreanTokenizer.KoreanToken> tokens = TwitterKoreanProcessorJava.tokenize(normalized);
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaStringList(tokens));
    // [한국어, 를, 처리, 하는, 예시, 입니, 다, ㅋㅋ, #한국어]
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(tokens));
    // [한국어(Noun: 0, 3), 를(Josa: 3, 1), (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2), (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2), (Space: 17, 1), #한국어(Hashtag: 18, 4)]

    // Stemming
    Seq<KoreanTokenizer.KoreanToken> stemmed = TwitterKoreanProcessorJava.stem(tokens);
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaStringList(stemmed));
    // [한국어, 를, 처리, 하다, 예시, 이다, ㅋㅋ, #한국어]
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(stemmed));
    // [한국어(Noun: 0, 3), 를(Josa: 3, 1), (Space: 4, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2), (Space: 9, 1), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2), (Space: 17, 1), #한국어(Hashtag: 18, 4)]

    // Phrase extraction
    List<KoreanPhraseExtractor.KoreanPhrase> phrases = TwitterKoreanProcessorJava.extractPhrases(tokens, true, true);
    System.out.println(phrases);
    // [한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4)]
  }
}
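In the printed output above, each token appears as text(Pos: offset, length). The sketch below shows one way to read that printed form back into its parts and to look up the slice of the normalized text a token covers. PrintedToken is a hypothetical helper, not part of twitter-korean-text, and it assumes that offsets count Java String indices (UTF-16 chars). In real code, reading these values from the token objects returned by the library is preferable to parsing strings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; NOT part of twitter-korean-text. */
final class PrintedToken {
    // Matches the printed form "text(Pos: offset, length)"; the text part may be a space.
    private static final Pattern FORMAT =
            Pattern.compile("^(.*)\\(([A-Za-z]+): (\\d+), (\\d+)\\)$", Pattern.DOTALL);

    final String text;   // surface or stemmed form as printed
    final String pos;    // part-of-speech label, e.g. Noun or Josa
    final int offset;    // assumed: start index into the normalized text
    final int length;    // assumed: number of chars covered in the normalized text

    private PrintedToken(String text, String pos, int offset, int length) {
        this.text = text;
        this.pos = pos;
        this.offset = offset;
        this.length = length;
    }

    /** Parses one printed token, or throws if it does not match the expected form. */
    static PrintedToken parse(String printed) {
        Matcher m = FORMAT.matcher(printed);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized token: " + printed);
        }
        return new PrintedToken(m.group(1), m.group(2),
                Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4)));
    }

    /** Returns the part of the normalized text that this token covers. */
    String sliceOf(CharSequence normalized) {
        return normalized.subSequence(offset, offset + length).toString();
    }
}
```

For a stemmed token such as 이다(Adjective: 12, 3), the slice is the three original characters 입니다 rather than the stem itself, which is why the printed length can differ from the length of the token text.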
TwitterKoreanProcessor.scala is the central object that provides the interface for all the supported features.
To run all of the unit tests, use mvn test.
We provide tools for quality assurance and test resources. They can be found under src/main/scala/com/twitter/penguin/korean/qa and src/main/scala/com/twitter/penguin/korean/tools.
Refer to the general contribution guide. A project-specific contribution guide will be added later.
Detailed guide to installing and modifying the project
Tested on an Intel i7 at 2.3 GHz
Initial loading time: 2~4 sec
Average time to parse a chunk (eojeol): 0.12 ms
Tweets (average length ~50 characters)

Tweets | 100K | 200K | 300K | 400K | 500K | 600K | 700K | 800K | 900K | 1M |
---|---|---|---|---|---|---|---|---|---|---|
Time in Seconds | 57.59 | 112.09 | 165.05 | 218.11 | 270.54 | 328.52 | 381.09 | 439.71 | 492.94 | 542.12 |

Average per tweet: 0.54212 ms
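The per-tweet average follows directly from the table: 542.12 seconds for 1M tweets is 0.54212 ms per tweet. A small Java sketch of that arithmetic, using only the figures reported above (the class and method names are made up for illustration):

```java
/** Recomputes per-tweet averages from the benchmark table; names are illustrative. */
final class BenchmarkMath {
    /** Average processing time per tweet, in milliseconds. */
    static double millisPerTweet(double totalSeconds, long tweets) {
        return totalSeconds * 1000.0 / tweets;
    }

    /** Throughput in tweets per second. */
    static double tweetsPerSecond(double totalSeconds, long tweets) {
        return tweets / totalSeconds;
    }

    public static void main(String[] args) {
        // Tweet counts and measured times, copied from the table above.
        long[] tweets = {100_000, 200_000, 300_000, 400_000, 500_000,
                         600_000, 700_000, 800_000, 900_000, 1_000_000};
        double[] seconds = {57.59, 112.09, 165.05, 218.11, 270.54,
                            328.52, 381.09, 439.71, 492.94, 542.12};
        for (int i = 0; i < tweets.length; i++) {
            System.out.printf("%,d tweets: %.5f ms/tweet, %.0f tweets/s%n",
                    tweets[i],
                    millisPerTweet(seconds[i], tweets[i]),
                    tweetsPerSecond(seconds[i], tweets[i]));
        }
    }
}
```

The per-tweet cost stays between roughly 0.54 and 0.58 ms across all ten rows, so processing time grows close to linearly with the number of tweets.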
Benchmark test by KoNLPy
From http://konlpy.org/ko/v0.4.2/morph/
- Will Hohyon Ryu (유호현): https://github.com/nlpenguin | https://twitter.com/NLPenguin
Copyright 2014 Twitter, Inc.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0