Can jiebaR segment text at a chosen granularity, and if so, how? #36

Open
WangYuKane opened this issue Jul 20, 2016 · 3 comments

Comments

@WangYuKane

No description provided.

@qinwf
Owner

qinwf commented Jul 22, 2016

Hello. There are several segmentation methods; for details, see here.

> cc = worker()
> cc$default = "full"

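default can be set to one of c("mix", "query", "hmm", "mp", "tag", "full").

In short: hmm segments directly with the HMM model, so it recognizes new words but does not use the dictionary; mp uses the dictionary; mix uses the dictionary and also recognizes new words; full segments in a search-engine style. The query method first segments with mix, then applies full to the longer words that result.

The full method is not yet covered in the R package documentation, but it works in the latest CRAN release. Just set default to full.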

> cc = worker()
> cc["中华人民共和国"]
[1] "中华人民共和国"
> cc$default = "full"
> cc["中华人民共和国"]
[1] "中华"           "中华人民"       "中华人民共和国" "华人"          
[5] "人民"           "人民共和国"     "共和"           "共和国"        
> cc$default = "hmm"
> cc["我是黄小明"]
[1] "我"     "是"     "黄小明"
> cc$default = "mp"
> cc["我是黄小明"]
[1] "我" "是" "黄" "小" "明"
> cc
Worker Type:  Jieba Segment

Default Method  :  mp
Detect Encoding :  TRUE
Default Encoding:  UTF-8
Keep Symbols    :  FALSE
Output Path     :  
Write File      :  TRUE
By Lines        :  FALSE
Max Word Length :  20
Max Read Lines  :  1e+05

Fixed Model Components:  

$dict
[1] "C:/Users/outwen/Documents/R/win-library/3.3/jiebaRD/dict/jieba.dict.utf8"

$user
[1] "C:/Users/outwen/Documents/R/win-library/3.3/jiebaRD/dict/user.dict.utf8"

$hmm
[1] "C:/Users/outwen/Documents/R/win-library/3.3/jiebaRD/dict/hmm_model.utf8"

$stop_word
NULL

$user_weight
[1] "max"

$timestamp
[1] 1469182808

$default $detect $encoding $symbol $output $write $lines $bylines can be reset.
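To make the overlapping output of full in the session above easier to read, here is a minimal base R sketch of that style of enumeration. It is a toy under stated assumptions, not jiebaR's actual implementation: `full_cut_sketch` and `toy_dict` are hypothetical names, the dictionary is a hand-written list rather than jieba.dict.utf8, and the single-character fallback rule is inferred from the outputs shown in this thread.

```r
# Toy illustration only: enumerate every dictionary word found at each
# start position, in order of start position and then length.
# Assumes the session runs in a UTF-8 locale.
full_cut_sketch <- function(text, dict) {
  chars <- strsplit(text, "")[[1]]
  n <- length(chars)
  out <- character(0)
  covered_to <- 0  # furthest character index covered by an emitted word
  for (i in seq_len(n)) {
    found <- FALSE
    if (i < n) {
      for (j in (i + 1):n) {
        w <- paste(chars[i:j], collapse = "")
        if (w %in% dict) {
          out <- c(out, w)
          covered_to <- max(covered_to, j)
          found <- TRUE
        }
      }
    }
    # Assumed fallback: keep a lone character only if no word covers it.
    if (!found && covered_to < i) {
      out <- c(out, chars[i])
      covered_to <- i
    }
  }
  out
}

# Hypothetical mini-dictionary, chosen to match the words in the session.
toy_dict <- c("来自", "中华", "中华人民", "中华人民共和国", "华人",
              "人民", "人民共和国", "共和", "共和国")

print(full_cut_sketch("中华人民共和国", toy_dict))
```

With this toy dictionary the sketch lists the same eight words, in the same order, as the full output in the session. Because every match is emitted, the words overlap, so joining them does not give back the original string.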

I will update the documentation later.

@WangYuKane
Author

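Thank you for the detailed reply. I only just saw it, sorry about that!

The full mode does produce words at several granularities, but it does not yield a complete sentence.

For example: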

cc["我来自中华人民共和国"]
[1] "我" "来自" "中华" "中华人民" "中华人民共和国" "华人"
[7] "人民" "人民共和国" "共和" "共和国"

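Although this gives words at different granularities, the output is not a sentence with complete meaning.

Is there a way to segment at different granularities and still get a sentence that keeps its complete meaning?

Thank you for your careful answer!

Best wishes!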

@qinwf
Owner

qinwf commented Jul 28, 2016

How do you define granularity?

Besides full, there are four other methods: "mix", "query", "hmm", and "mp". The full method is suited to search-engine segmentation.
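One way to explore the question is to run the same sentence through each method and compare the results side by side. The sketch below assumes jiebaR is installed; `compare_methods` is a hypothetical helper name, and it uses `segment()`, the function form of the `cc[...]` call used earlier in this thread. From the descriptions above, mix and mp should give a single non-overlapping segmentation, while query adds finer sub-words for long words, but treat that as something to check on your own text rather than a guarantee.

```r
# Hypothetical helper: segment one sentence with several jiebaR methods.
# Assumes jiebaR is installed; only worker(), $default and segment() are used.
compare_methods <- function(text,
                            methods = c("mix", "query", "hmm", "mp", "full")) {
  cc <- jiebaR::worker()
  results <- list()
  for (m in methods) {
    cc$default <- m                       # switch method, as in the session above
    results[[m]] <- jiebaR::segment(text, cc)
  }
  results
}

# Run only when jiebaR is available, so the script does not fail without it.
if (requireNamespace("jiebaR", quietly = TRUE)) {
  print(compare_methods("我来自中华人民共和国"))
}
```

If a non-overlapping result is needed together with finer pieces, one option is to keep the mix output as the sentence-level segmentation and look at the query or full output only for the longer words.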
