Output all sorted results of keyword extraction #27

qinwf · 2016-02-14T10:27:51Z

No description provided.

AlexYoung757 · 2016-07-09T01:30:09Z

@qinwf
Hello. Suppose I have a list containing several texts. If I set topn=4, will exactly 4 keywords always be extracted?

qinwf · 2016-07-09T03:38:01Z

@AlexYoung757 Not necessarily. topn is an upper bound: if a sentence contains fewer than 4 words, fewer than 4 keywords are returned.
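The "upper bound" behaviour can be illustrated with a minimal C++ sketch. It is not the jiebaR or cppjieba source; `Keyword` and `takeTopN` are hypothetical names introduced only for this example, and the sketch assumes the candidates are already weighted.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type for this sketch: a (word, weight) pair.
using Keyword = std::pair<std::string, double>;

// Hypothetical helper, not part of jiebaR or cppjieba: keep at most topN
// candidates, ordered by descending weight.
std::vector<Keyword> takeTopN(std::vector<Keyword> candidates, std::size_t topN) {
  // topN is only an upper bound: never ask for more items than exist.
  topN = std::min(topN, candidates.size());
  std::partial_sort(candidates.begin(), candidates.begin() + topN, candidates.end(),
                    [](const Keyword& a, const Keyword& b) { return a.second > b.second; });
  candidates.resize(topN);
  return candidates;
}
```

With three candidates and topN = 4, this sketch returns all three; with topN = 1 it returns only the highest-weighted one.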

cnhzzx · 2017-04-18T05:36:33Z

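@qinwf When extracting keywords I set topn to 1000, yet the number of keywords extracted is smaller than the number of segmented words. Why is that? Is it because some of the segmented words are not in the IDF corpus?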

qinwf · 2017-04-18T09:52:45Z

@cnhzzx Could you give me an example sentence so I can try to reproduce it?

cnhzzx · 2017-04-19T00:38:58Z

@qinwf

    keyworker = worker("keywords", user = "user_dict.txt", stop_word = "stop_words.txt", idf = "idf.txt", topn = 1000)
    wk = worker(user = "user_dict.txt", stop_word = "stop_words.txt")
    wk["今天股票跌很厉害"]
    [1] "股票" "跌" "厉害"
    vector_keywords(wk["今天股票跌很厉害"], keyworker)
    6.92433 4.76323
     "厉害"  "股票"

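keyworker does not recognize "跌", even though wk segments it out as a word. When setting up the keyworker engine I passed a custom IDF corpus, idf.txt, to which I had added an IDF value for the character "跌", but that seems to have no effect on keyworker.

Another question: I tried deleting "厉害" from the default IDF corpus, and keyworker still recognizes it, but its IDF value changed to a different number whose origin I cannot identify (not the earlier 6.92433!). How does keyworker actually work? Which IDF dictionary does it load at run time?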

qinwf · 2017-04-19T01:43:32Z

Thanks, I can reproduce it now.

The missing words are single-character words. In the upstream source, single-character words and stop words are skipped during extraction; see:

jiebaR/inst/include/lib/KeywordExtractor.hpp

Lines 113 to 115 in 00446d7

    
    if (IsSingleWord(words[i]) || stopWords_.find(words[i]) != stopWords_.end()) {
      continue;
    }

https://github.com/yanyiwu/cppjieba/blob/45809955f5a345886ec3d49cbed3ec68ced70b1c/include/cppjieba/KeywordExtractor.hpp#L67-L69

I am not sure what this rule is meant to achieve, so for now I will remove it from jiebaR.
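To make the effect of that rule concrete, here is a minimal C++ sketch of the filtering step with the single-character check switchable. It is not the jiebaR or cppjieba source: `isSingleCodePoint` and `filterCandidates` are hypothetical names, and the sketch assumes UTF-8 input and that "single word" means exactly one code point.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the upstream IsSingleWord check: true when the
// UTF-8 string holds exactly one code point (e.g. one Chinese character).
bool isSingleCodePoint(const std::string& word) {
  int codePoints = 0;
  for (unsigned char byte : word) {
    // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
    if ((byte & 0xC0) != 0x80) ++codePoints;
  }
  return codePoints == 1;
}

// Hypothetical helper: returns the words that survive the filter.
// Stop words are always dropped; single-character words only if requested.
std::vector<std::string> filterCandidates(const std::vector<std::string>& words,
                                          const std::unordered_set<std::string>& stopWords,
                                          bool skipSingleWords) {
  std::vector<std::string> kept;
  for (const std::string& word : words) {
    if (stopWords.count(word) != 0) continue;
    if (skipSingleWords && isSingleCodePoint(word)) continue;
    kept.push_back(word);
  }
  return kept;
}
```

In this sketch, calling `filterCandidates` with `skipSingleWords = true` drops "跌" along with the stop words, while `skipSingleWords = false` keeps it, which corresponds to removing the rule.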

cc @yanyiwu

master has been updated; you can install the latest version from GitHub.

85e0819

    > keyworker = worker("keywords", topn = 1000)
    > wk = worker()
    > vector_keywords(wk["今天股票跌很厉害"], keyworker)
    11.7392 6.92433 4.99212 4.76323
       "跌"  "厉害"  "今天"  "股票"

"很" is missing from the output because it is in the stop word list.

cnhzzx · 2017-04-19T02:51:30Z

@qinwf
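Thanks! It works now.

However, I could not find "跌" in the built-in IDF dictionary, so how is its IDF value of 11.7392 computed? (For other words such as "厉害", I can look the value up directly in the built-in IDF dictionary: 6.92433.)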

qinwf · 2017-04-19T10:26:44Z

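@cnhzzx It falls back to the average: a word that is not in the IDF dictionary is weighted by the mean of the IDF values in the dictionary.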

jiebaR/inst/include/lib/KeywordExtractor.hpp

Lines 93 to 97 in 85e0819

    
    if (cit != idfMap_.end()) {
      itr->second.weight *= cit->second;
    } else {
      itr->second.weight *= idfAverage_;
    }

jiebaR/inst/include/lib/KeywordExtractor.hpp

Line 177 in 85e0819

    idfAverage_ = idfSum / lineno;
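Put together, the two snippets above amount to a lookup with an average fallback. The following is a minimal C++ sketch of that idea, not the actual source: `averageIdf` and `tfidfWeight` are hypothetical names, and it assumes the weight starts out as the word's count in the text and that the dictionary holds one entry per line, so the entry count plays the role of `lineno`.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: mean of all IDF values in the dictionary,
// mirroring idfAverage_ = idfSum / lineno under the assumptions above.
double averageIdf(const std::unordered_map<std::string, double>& idfMap) {
  if (idfMap.empty()) return 0.0;  // avoid dividing by zero on an empty dictionary
  double sum = 0.0;
  for (const auto& entry : idfMap) sum += entry.second;
  return sum / static_cast<double>(idfMap.size());
}

// Hypothetical helper: weight = term frequency * IDF, where words missing
// from the dictionary use the precomputed average IDF instead.
double tfidfWeight(const std::string& word, double termFrequency,
                   const std::unordered_map<std::string, double>& idfMap,
                   double idfAverage) {
  auto it = idfMap.find(word);
  double idf = (it != idfMap.end()) ? it->second : idfAverage;
  return termFrequency * idf;
}
```

Under this reading, a word that is absent from the dictionary still receives a weight, which would also be consistent with the earlier observation that a word deleted from the dictionary kept appearing with a different value.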

qinwf pushed a commit that referenced this issue Apr 19, 2017

Do not skip single-character words #27

85e0819
