add self-defined dict of proper nouns from wikipedia (增加wikipedia專有名詞的dict) #428

Open: wants to merge 1 commit into base: master

@skydome20 commented Jan 10, 2017

I downloaded the Wikipedia database dump (2016/10/20) and extracted the title of every Chinese article. Together these titles form a dictionary of proper nouns that can be loaded into jieba as a self-defined dictionary to improve segmentation accuracy, for example when calling jieba.cut.
Note that:

  1. The dictionary contains 1,418,574 proper nouns (article titles).
  2. The proper nouns are written in Traditional Chinese.
  3. Data source: https://zh.wikipedia.org/wiki/Wikipedia:%E6%95%B0%E6%8D%AE%E5%BA%93%E4%B8%8B%E8%BD%BD
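
The extraction step described above could look roughly like the sketch below. It is not the script used for this PR. It assumes the input is a MediaWiki XML export with `<page>`, `<title>`, and `<ns>` elements, and that only main-namespace pages (`ns` 0) count as articles. The helper names `local_name`, `iter_titles`, and `write_userdict` are hypothetical, and skipping titles that contain whitespace is an assumption made to keep the one-word-per-line output unambiguous.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_titles(xml_source):
    """Yield titles of main-namespace pages from a MediaWiki XML export.

    Assumption: each <page> has a <title> child and usually an <ns> child.
    """
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = ns = None
        for child in elem:
            name = local_name(child.tag)
            if name == "title":
                title = (child.text or "").strip()
            elif name == "ns":
                ns = (child.text or "").strip()
        if title and ns in (None, "0"):
            yield title
        elem.clear()  # drop the page's children to limit memory use


def write_userdict(titles, out):
    """Write one unique title per line; return how many were written.

    Assumption: titles containing whitespace are skipped.
    """
    seen = set()
    for title in titles:
        if title in seen or any(ch.isspace() for ch in title):
            continue
        seen.add(title)
        out.write(title + "\n")
    return len(seen)


# Tiny made-up sample standing in for the real dump.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>臺北市</title><ns>0</ns></page>
  <page><title>Talk:臺北市</title><ns>1</ns></page>
  <page><title>自然語言處理</title><ns>0</ns></page>
  <page><title>New York</title><ns>0</ns></page>
</mediawiki>"""

buffer = io.StringIO()
count = write_userdict(iter_titles(io.BytesIO(SAMPLE.encode("utf-8"))), buffer)
print(count)
print(buffer.getvalue(), end="")
```

A file written this way (UTF-8, one word per line) can then be passed to `jieba.load_userdict(path)` before calling `jieba.cut`. The file name and location are left to the user.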

add self-defined dict of proper noun from wikipedia
@skydome20 changed the title from "add self-defined dict of proper nouns from wikipedia" to "add self-defined dict of proper nouns from wikipedia (增加wikipedia專有名詞的dict)" on Jan 10, 2017