Add info about how to add new language, closes miso-belica#62
miso-belica committed Dec 16, 2019
1 parent 7dc904d commit cb7b97d
Showing 3 changed files with 59 additions and 2 deletions.
5 changes: 4 additions & 1 deletion README.md
@@ -4,7 +4,10 @@

Simple library and command line utility for extracting summary from HTML
pages or plain texts. The package also contains simple evaluation
- framework for text summaries. Implemented summarization methods are described in the [documentation](docs/summarizators.md). I also maintain a list of [alternative implementations](docs/alternatives.md) of the summarizers in various languages.
+ framework for text summaries. Implemented summarization methods are described in the [documentation](docs/summarizators.md). I also maintain a list of [alternative implementations](docs/alternatives.md) of the summarizers in various programming languages.

## Is my natural language supported?
There is a [good chance](docs/index.md#Tokenizer) it is. But if not, it is [not too hard to add](docs/how-to-add-new-language.md) it.

## Installation

48 changes: 48 additions & 0 deletions docs/how-to-add-new-language.md
@@ -0,0 +1,48 @@
# How to add support for a new natural language to Sumy

Let's say [someone wants](https://github.com/miso-belica/sumy/issues/62) to summarize documents in Russian with Sumy. The first thing you will need is a [tokenizer](index.md#Tokenizer), so let's check whether Sumy already has support for it.

```python
from sumy.nlp.tokenizers import Tokenizer
tokenizer = Tokenizer("ru")

# https://ru.m.wikipedia.org/wiki/Тунберг,_Грета
sentences = tokenizer.to_sentences("Гре́та Тинтин Элеонора Э́рнман Ту́нберг (швед. Greta Tintin Eleonora Ernman Thunberg; род. 3 января 2003[1][2], Стокгольм[1]) — шведская школьница, экологическая активистка. В 15 лет начала протестовать возле шведского парламента с плакатом «Школьная забастовка за климат», призывая к незамедлительным действиям по борьбе с изменением климата в соответствии с Парижским соглашением. Её действия нашли отклик по всему миру, породив массовые мероприятия, известные как «школьные забастовки за климат» или «пятницы ради будущего»")

for sentence in sentences:
    print(tokenizer.to_words(sentence))
```

So we are good here. But what if the tokenizer were missing? Then you would need to implement one yourself, or find a library for your language and wrap it in the API Sumy expects. That should be easy because Sumy needs an object with just two methods. The simplest naive tokenizer could look like this:

```python
from typing import List

class Tokenizer:
    @staticmethod
    def to_sentences(text: str) -> List[str]:
        return [s.strip() for s in text.split(".")]

    @staticmethod
    def to_words(sentence: str) -> List[str]:
        return [w.strip() for w in sentence.split(" ")]
```
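
Once you have it, you can plug it into Sumy just like the built-in one. Here is a quick sanity check, a sketch using the `PlaintextParser` described in the documentation:

```python
from sumy.parsers.plaintext import PlaintextParser

# any object with the to_sentences/to_words methods will do,
# so here we pass an instance of the naive Tokenizer from above
parser = PlaintextParser.from_string("The first sentence. The second one.", Tokenizer())
for sentence in parser.document.sentences:
    print(sentence)
```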

Another language-specific thing is the **Stemmer**. For Sumy, a stemmer is any callable object that accepts a word (string) and returns a word (string), possibly somehow changed. The role of the stemmer is to normalize different forms of a word into a single one. For example, for the words _teacher_, _teaching_ and _teach_ you want to return the root _teach_ for all of them because they have the same meaning. But of course, you want to return _sleep_ for _sleeping_. Some languages, like Japanese, do not need this and can simply return the original word. But for others, like Slovak, Czech or English, it's quite important to normalize all the different forms of a word. The simplest stemmer looks like the one below:

```python
def null_stemmer(word):
"""I am the same as from sumy.nlp.stemmers import null_stemmer :)"""
return word
```

```python
# it seems NLTK has our back again :)
from sumy.nlp.stemmers import Stemmer
stemmer = Stemmer("ru")
stem = stemmer("Элеонора") # элеонор
```

The last piece is a list of stop-words. Sumy has some stop-words built in, but you can download any free list from the internet. This piece is also optional because the summarizers can work without it, but it's **highly recommended** to provide one because it may dramatically increase the quality of the summaries. You can see one used in the example below.

And that's all. You put these parts together with the other parts from the README and send a pull request with your code :)
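
For example, a complete run for Russian might look like the sketch below. The `LsaSummarizer` is just one of the available summarizers, and the `get_stop_words("ru")` line assumes Sumy ships a stop-words list for your language; otherwise assign your own downloaded list.

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.utils import get_stop_words

text = "..."  # e.g. the Wikipedia paragraph from the tokenizer example above

parser = PlaintextParser.from_string(text, Tokenizer("ru"))
summarizer = LsaSummarizer(Stemmer("ru"))
summarizer.stop_words = get_stop_words("ru")  # or any list of words you provide

# print the 2 most important sentences
for sentence in summarizer(parser.document, 2):
    print(sentence)
```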
8 changes: 7 additions & 1 deletion docs/index.md
@@ -10,16 +10,22 @@ Sumy is able to create **extractive summary**. That means that it tries to find

Even though I focused on the Czech and Slovak languages in my work, I wanted Sumy to be extensible to other languages from the start. That's why I created it as a set of independent objects that can be replaced by the user of the library to add better or new capabilities to it.

### Document
The central object is the [`Document`](https://github.com/miso-belica/sumy/blob/master/sumy/models/dom/_document.py) which represents the whole document ready to be summarized. It consists of a collection of [`Paragraphs`](https://github.com/miso-belica/sumy/blob/master/sumy/models/dom/_paragraph.py), each of which consists of a collection of [`Sentences`](https://github.com/miso-belica/sumy/blob/master/sumy/models/dom/_sentence.py). Every sentence has a boolean flag `is_heading` indicating whether it's a normal sentence or a _heading_. It also has a tokenizer attached, so you can get the list of `words` from it. A `Word` is represented as a simple string.
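
To get a feeling for the structure, a hand-built `Document` might look roughly like the sketch below; the constructor signatures are my assumptions based on the linked modules, so double-check them before use.

```python
from sumy.models.dom import ObjectDocumentModel, Paragraph, Sentence
from sumy.nlp.tokenizers import Tokenizer

tokenizer = Tokenizer("english")

# a document with a single paragraph: one heading and one ordinary sentence
document = ObjectDocumentModel([
    Paragraph([
        Sentence("NICE HEADING", tokenizer, is_heading=True),
        Sentence("The first sentence of the paragraph.", tokenizer),
    ]),
])
print(document.sentences)
```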

- To create a `Document` (or `Parser`) you will need a [`Tokenizer`](https://github.com/miso-belica/sumy/blob/master/sumy/nlp/tokenizers.py). The `Tokenizer` is one of the **language-specific** part of the puzzle. I use [nltk library](https://www.nltk.org/api/nltk.tokenize.html) to do that so there is a great chance your language is covered by that library. Simply try to pass your language name to it and you will see if it will work :) If it raises the exception you have two choices. The 1st one is to send the pull request to Sumy with a new `Tokenizer` for your language. And the 2nd is to create your own `Tokenizer` and pass it to Sumy. And you know, now when you have it it should be easy to send the pull request with your code anyway. The tokenizer is any object with two methods `to_sentences(paragraph: str)` and `to_words(sentence: str)`.
+ ### Tokenizer
+ To create a `Document` (or `Parser`) you will need a [`Tokenizer`](https://github.com/miso-belica/sumy/blob/master/sumy/nlp/tokenizers.py). The `Tokenizer` is one of the **language-specific** parts of the puzzle. I use the [nltk library](https://www.nltk.org/api/nltk.tokenize.html) for that, so there is a great chance your language is covered by it. Simply try to pass your language name to it and you will see if it works :) If it raises an exception, you have two choices. The 1st one is to send a pull request to Sumy with a new `Tokenizer` for your language. The 2nd is to [create your own `Tokenizer`](how-to-add-new-language.md) and pass it to Sumy. And you know, once you have it, it should be easy to send the pull request with your code anyway. A tokenizer is any object with the two methods `to_sentences(paragraph: str)` and `to_words(sentence: str)`.

### Parser
You can create the `Document` by hand, but it would not be very convenient. That's why there is a [`DocumentParser`](https://github.com/miso-belica/sumy/blob/master/sumy/parsers/parser.py) for the job. It's the base class you can inherit and extend to create your own transformation from the input document format to the `Document` object. Sumy provides 2 implementations. The first one is the [`PlainTextParser`](https://github.com/miso-belica/sumy/blob/master/sumy/parsers/plaintext.py). The name is not quite accurate because some very simple formatting is expected: `Paragraphs` are separated by a single empty line, and a heading of a paragraph can be created by writing the whole sentence in UPPER CASE letters (see the example below). But that's all. The more interesting implementation is the [`HtmlParser`](https://github.com/miso-belica/sumy/blob/master/sumy/parsers/html.py). It is able to extract the main article from an HTML page with the help of the [breadability library](https://github.com/bookieio/breadability) and returns a `Document` with useful meta-information extracted from the HTML markup. Many other summarizers use the XML format for input documents, and it should not be hard to support it if you want to. All you have to do is inherit `DocumentParser` and define the property `DocumentParser.document` returning the `Document` object.
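
To illustrate the plain-text format described above (a heading in UPPER CASE letters, paragraphs separated by a single empty line):

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

text = """HEADING OF THE FIRST PARAGRAPH
The first sentence of the paragraph. And the second one.

The second paragraph starts after a single empty line."""

parser = PlaintextParser.from_string(text, Tokenizer("english"))
print(parser.document.paragraphs)
```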

### Preprocessing (optional)
Ok, now you know how to create the `Document` from your text. Next, you probably want to summarize it. Before we do that, you should know that the `Document` can be preprocessed in any way. You can transform/enhance it with information important to you. You can even add or remove parts of it. Whatever you need. In some edge cases, you can even create a new `Document`, as long as you adhere to the API.
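
For example, a preprocessing step that removes all headings could look roughly like this sketch built on the DOM classes described above:

```python
from sumy.models.dom import ObjectDocumentModel, Paragraph

def drop_headings(document):
    """Return a new Document with the heading sentences removed."""
    paragraphs = [
        Paragraph([s for s in paragraph.sentences if not s.is_heading])
        for paragraph in document.paragraphs
    ]
    return ObjectDocumentModel(paragraphs)
```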

### Stemmer
Then you need a [`Stemmer`](https://github.com/miso-belica/sumy/blob/master/sumy/nlp/stemmers/__init__.py). The `Stemmer` is just a fancy word for an algorithm that tries to normalize different forms of a word into a single one. The simplest stemmer implementation in Sumy is the so-called `null_stemmer`. It is handy for languages like Chinese/Japanese/Korean where words do not need to be unified. The Czech/Slovak language has a custom `Stemmer` in Sumy. All other languages use [nltk](https://www.nltk.org/api/nltk.stem.html) for this. So again, there is a good chance your language is covered. But a stemmer is any `callable` that takes a word and returns a word, which is good news for you because you can implement your own by simply creating a new function with a custom implementation.
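
And because a stemmer is just a callable, wrapping a third-party one takes only a few lines, e.g. with the Snowball stemmer from NLTK:

```python
from nltk.stem.snowball import SnowballStemmer

snowball = SnowballStemmer("russian")

def my_stemmer(word: str) -> str:
    # any word-in, word-out callable works as a Sumy stemmer
    return snowball.stem(word)
```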

### Summarizer
And we are reaching the finish line here. You have a `Document` created and you are not afraid to use your `Stemmer`. Now you are ready to choose one of the [`Summarizers`](https://github.com/miso-belica/sumy/tree/master/sumy/summarizers). Probably except for the [`RandomSummarizer`](https://github.com/miso-belica/sumy/blob/master/sumy/summarizers/random.py), which serves just as a lower bound when evaluating the quality of the summaries. A `Summarizer` needs a `Stemmer` as its dependency and, optionally, a list of stop-words. Although the stop-words are optional, I really recommend using them to get better results. You can use `sumy.utils.get_stop_words(language: str)` or simply provide your own list of words. After all of this, your summarizer is ready to serve you. Simply give it the `Document` and the count of sentences you want returned, and you are done.
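
Put together, the final step might look like this; `LexRankSummarizer` is used just as an example and `article.txt` is a hypothetical input file:

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.utils import get_stop_words

parser = PlaintextParser.from_file("article.txt", Tokenizer("english"))

summarizer = LexRankSummarizer(Stemmer("english"))
summarizer.stop_words = get_stop_words("english")

# the 3 most important sentences from the document
for sentence in summarizer(parser.document, 3):
    print(sentence)
```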

You can find some specifics of the summarizers on a [separate page](summarizators.md).
