Chinese Word Segmentation Component
Segmentation usage:
List<Word> words = WordSeg.seg("杨尚川是APDPlat应用级产品开发平台的作者");
System.out.println(words);
Output:
[杨尚川, 是, APDPlat, 应用, 级, 产品开发, 平台, 的, 作者]
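The segmentation idea behind dictionary-based tokenizers can be illustrated with a minimal forward-maximum-matching sketch. This is a simplified illustration, not the library's actual algorithm, and the tiny dictionary below is made up for the demo:

```java
import java.util.*;

public class MaxMatchDemo {
    // Forward maximum matching: at each position, greedily take the longest
    // dictionary word (up to maxLen chars); fall back to a single character.
    static List<String> seg(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            String match = null;
            for (int j = end; j > i; j--) {
                String cand = text.substring(i, j);
                if (dict.contains(cand) || j - i == 1) { // single char is the fallback
                    match = cand;
                    break;
                }
            }
            words.add(match);
            i += match.length();
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical mini-dictionary for illustration only
        Set<String> dict = new HashSet<>(Arrays.asList("杨尚川", "作者"));
        System.out.println(seg("杨尚川是作者", dict, 4)); // [杨尚川, 是, 作者]
    }
}
```

Real segmenters refine this with reverse matching, word frequencies, or statistical models to resolve ambiguous boundaries.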
Lucene plugin:
Analyzer analyzer = new ChineseWordAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("text", "杨尚川是APDPlat应用级产品开发平台的作者");
// Obtain attributes once before iterating; reset() is required before
// the first incrementToken() call in Lucene 4.x
CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
tokenStream.reset();
while(tokenStream.incrementToken()){
    System.out.println(charTermAttribute.toString()+" "+offsetAttribute.startOffset());
}
tokenStream.close();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_47, analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);
// Index a document so the search below has something to match
Document doc = new Document();
doc.add(new TextField("text", "杨尚川是APDPlat应用级产品开发平台的作者", Field.Store.YES));
indexWriter.addDocument(doc);
indexWriter.close();
IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(directory));
QueryParser queryParser = new QueryParser(Version.LUCENE_47, "text", analyzer);
Query query = queryParser.parse("text:杨尚川");
TopDocs docs = indexSearcher.search(query, Integer.MAX_VALUE);
Articles on word segmentation algorithms: