Output all sorted keyword extraction results #27

Open
qinwf opened this issue Feb 14, 2016 · 8 comments

Comments

Owner

qinwf commented Feb 14, 2016

No description provided.


AlexYoung757 commented Jul 9, 2016

@qinwf
Hello. Suppose I have a list containing several texts. If I set topn=4, will exactly 4 keywords always be extracted?

Owner Author

qinwf commented Jul 9, 2016

@AlexYoung757 Not necessarily. If the sentence has fewer than 4 words, the result will also contain fewer than 4 keywords.
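
The answer above treats topn as an upper bound. A minimal sketch of that behaviour, assuming candidates are (word, weight) pairs; the `Scored` alias and the `top_keywords` helper are hypothetical names for illustration and are not part of jiebaR or cppjieba:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type: a candidate keyword with its weight.
using Scored = std::pair<std::string, double>;

// Hypothetical helper: keep at most topn candidates, highest weight first.
std::vector<Scored> top_keywords(std::vector<Scored> candidates, std::size_t topn) {
    // topn is only an upper bound, so clamp it to the number of candidates.
    const std::size_t n = std::min(topn, candidates.size());
    // Sort just the first n positions by descending weight.
    std::partial_sort(candidates.begin(), candidates.begin() + n, candidates.end(),
                      [](const Scored& a, const Scored& b) { return a.second > b.second; });
    candidates.resize(n);
    return candidates;
}
```

With three candidates and topn = 4, this returns all three; with topn = 2, it returns the two with the highest weights.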


cnhzzx commented Apr 18, 2017

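@qinwf I set topn to 1000 for keyword extraction, but fewer keywords are extracted than there are segmented words. What is the reason? Is it because some of the segmented words are not in the IDF corpus?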

Owner Author

qinwf commented Apr 18, 2017

@cnhzzx Could you give an example sentence so that I can reproduce it?


cnhzzx commented Apr 19, 2017

@qinwf

keyworker = worker("keywords",user = "user_dict.txt", stop_word = "stop_words.txt",idf = "idf.txt",topn = 1000)
wk = worker(user = "user_dict.txt", stop_word = "stop_words.txt")
wk["今天股票跌很厉害"]
[1] "股票" "跌" "厉害"
vector_keywords(wk["今天股票跌很厉害"],keyworker)
6.92433 4.76323
"厉害" "股票"
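keyworker does not recognize "跌", a word that wk produced during segmentation. When setting up the keyworker engine I used a custom IDF corpus, idf.txt, to which I had added an IDF value for the character "跌", but that seems to have no effect on keyworker.
One more question: I tried deleting the word "厉害" from the default IDF corpus, and keyworker still recognizes "厉害", but its IDF value changes to another number of unknown origin (not the previous 6.92433!). How does keyworker actually work? Which IDF dictionary does it use at run time?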

qinwf pushed a commit that referenced this issue Apr 19, 2017
Owner Author

qinwf commented Apr 19, 2017

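Thanks, I can reproduce it now.

The missing words are single-character words. In the upstream source code, single-character words and stop words are skipped during extraction, see: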

if (IsSingleWord(words[i]) || stopWords_.find(words[i]) != stopWords_.end()) {
  continue;
}

https://github.com/yanyiwu/cppjieba/blob/45809955f5a345886ec3d49cbed3ec68ced70b1c/include/cppjieba/KeywordExtractor.hpp#L67-L69

I don't fully understand the purpose of this rule, so for now I will remove it in jiebaR.

cc @yanyiwu

master has been updated; you can install the latest version from GitHub.

85e0819

> keyworker = worker("keywords",topn = 1000)
> wk = worker()
> vector_keywords(wk["今天股票跌很厉害"],keyworker)
11.7392 6.92433 4.99212 4.76323 
   "跌"  "厉害"  "今天"  "股票" 

"很" is in the stop word list.
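
To illustrate the difference, here is a rough sketch of the candidate filter with and without the single-character rule. It is not the cppjieba or jiebaR code: `is_single_character` is a hypothetical stand-in for the upstream `IsSingleWord`, `keyword_candidates` is a made-up name, and the input is assumed to be UTF-8.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the upstream IsSingleWord check: true when the
// UTF-8 string holds exactly one code point (continuation bytes are 10xxxxxx).
bool is_single_character(const std::string& word) {
    int code_points = 0;
    for (unsigned char byte : word) {
        if ((byte & 0xC0) != 0x80) ++code_points;
    }
    return code_points == 1;
}

// Returns the words that remain candidates for keyword scoring.
// skip_single_characters = true mimics the upstream rule quoted above;
// false mimics the behaviour described for jiebaR after the change.
std::vector<std::string> keyword_candidates(
    const std::vector<std::string>& words,
    const std::unordered_set<std::string>& stop_words,
    bool skip_single_characters) {
    std::vector<std::string> kept;
    for (const std::string& word : words) {
        if (stop_words.count(word) != 0) continue;  // stop words are always dropped
        if (skip_single_characters && is_single_character(word)) continue;
        kept.push_back(word);
    }
    return kept;
}
```

For the words "今天", "股票", "跌", "很", "厉害" with "很" as the only stop word, the sketch keeps three candidates when the rule is on and four (including "跌") when it is off.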


cnhzzx commented Apr 19, 2017

@qinwf
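Thanks! It works now.
However, I could not find the word "跌" in the built-in IDF dictionary, so I don't know how its IDF value of 11.7392 was computed. (For other words, such as "厉害", I can find the corresponding IDF value, 6.92433, directly in the built-in IDF dictionary.)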

Owner Author

qinwf commented Apr 19, 2017

@cnhzzx The average is used. A word that has no entry in the IDF dictionary gets the average IDF of all entries:

if (cit != idfMap_.end()) {
  itr->second.weight *= cit->second;
} else {
  itr->second.weight *= idfAverage_;
}

idfAverage_ = idfSum / lineno;
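
To make the fallback concrete, here is a minimal sketch, not the cppjieba code itself. The names `IdfMap`, `average_idf` and `tfidf_weights` are hypothetical, and the sketch assumes that a word's starting weight is its term frequency in the input.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type: word -> IDF value, as loaded from an IDF dictionary file.
using IdfMap = std::unordered_map<std::string, double>;

// Mean of all IDF values in the dictionary (idfSum / lineno in the snippet above).
double average_idf(const IdfMap& idf) {
    if (idf.empty()) return 0.0;
    double sum = 0.0;
    for (const auto& entry : idf) sum += entry.second;
    return sum / static_cast<double>(idf.size());
}

// TF-IDF weight per word: term frequency times the word's IDF, or times the
// dictionary average when the word has no entry.
std::unordered_map<std::string, double> tfidf_weights(
    const std::vector<std::string>& words, const IdfMap& idf) {
    const double fallback = average_idf(idf);
    std::unordered_map<std::string, double> weights;
    for (const auto& word : words) weights[word] += 1.0;  // term frequency
    for (auto& entry : weights) {
        const auto it = idf.find(entry.first);
        entry.second *= (it != idf.end()) ? it->second : fallback;
    }
    return weights;
}
```

With a dictionary that contains only a = 2.0 and b = 4.0, the average is 3.0, so a word c that appears twice and has no entry would get a weight of 2 × 3.0 = 6.0.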
