Changelog

Finnish language model for spaCy

Version 0.15.1, 2024-11-14

* Better cleaning of training data
* Redacted person names in email addresses in the word frequency data

Version 0.15.0, 2024-10-19

* Compatible with spaCy 3.8
* Improved spam filter on the MC4 corpus

Version 0.14.0, 2023-10-14

* Compatible with spaCy 3.7
* The noun chunker includes chains of flats and nmods: e.g. "maaliskuun 7. päivänä"
* The parser doesn't try to detect nsubj:outer, dislocated and goeswith
  dependencies anymore. There's not enough training data to learn those.
* Tokenize "-kampanja" as ["-", "kampanja"]
* Tokenize "maa-" as ["maa", "-"]
* Tokenize "/kk" as ["/", "kk"]
* Other tokenizer improvements

Version 0.13.0, 2023-07-21

* Compatible with spaCy 3.6

Version 0.12.0, 2023-02-01

* Compatible with spaCy 3.5
* Fixed word occurrence probabilities (they had been broken for the past several versions)

Version 0.11.0, 2022-07-23

* Ported to spaCy 3.4
* Updated word vectors and word frequencies
* Minor fixes to the lemmatization

Version 0.10.0, 2022-05-07

* Floret embedding vectors trained on MC4_fi_cleaned

Version 0.10.0b1, 2022-04-09

* Ported to spaCy 3.3.0.dev0. Older spaCy versions are not supported anymore.
* Noun chunker now splits off appositions as independent phrases

Version 0.9.0, 2022-01-19

* The pipeline now includes a named-entity recognizer (NER)

Version 0.8.0, 2021-11-21

* Ported to spaCy 3.2. Older spaCy versions are not supported anymore.
* Vectors for out-of-vocabulary words generated by Floret embeddings
* Uses the default spaCy morphologizer instead of the custom Voikko-based morphologizer

Version 0.7.1, 2021-08-21

* Works on Python 3.7 again

Version 0.7.0, 2021-07-12

* Compatible with spaCy 3.1
* Minor improvements to analysis: prefer non-compound words

Version 0.6.0, 2021-04-11

* Improved tagging and parsing accuracy by pretraining
* Improved lemmatization accuracy by better handling of ambiguous inflections
* Morphological features (case, verb tense, person, etc.)
* Properly set POS SPACE on whitespace tokens

Version 0.5.0, 2021-03-14

* Ported to spaCy 3.0. Does not support spaCy 2.0 anymore.

Version 0.4.1, 2020-08-29

* Published as a PyPI package. The package name is spacy_fi_experimental_web_md

Version 0.4.0, 2020-07-06

* Ported to spaCy 2.3
* Include 500k keys and 20k vectors like in the official *_md models
* Include the word vectors for the most frequent words

Version 0.3.0, 2020-05-17

* Extract noun phrases
* Lemmatize inflected abbreviations: EU:ssa => EU
* Requires spaCy 2.2.4 or later

Version 0.2.0, 2020-01-26

* Tagging auxiliary verbs as AUX (previously VERB) following the UD convention
* Fixed bugs in lemmatization of compound words: ilmakuivata, esiopetus, etc.
* Improved lemmatization of pronouns, especially clitics: sinäkin, mekään, etc.
* Using the same Finnish tokenizer rules as the spaCy master branch

Version 0.1.0, 2020-01-11

Initial release