Scottish Gaelic word frequency lists built from online corpuses

innesmck/GaelicFrequencyLists

Scottish Gaelic Word Frequency Lists

What is this?

Sets of lists of Gaelic words and phrases, sorted by how often they occur in pieces of Gaelic text scraped from LearnGaelic.

Why would I want it?

Being able to follow the Litir Bheag, Litir do Luchd-ionnsachaidh or Watch Gaelic is a pretty important step up in learning Gaelic, because it opens up a huge amount of comprehensible input with clear audio and translations. Frequency lists can help with that - either by identifying the most common vocab you need for those resources to be comprehensible, or by helping you decide which of the new words you encounter will be most useful to learn first.
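As a rough illustration of the first use, here is a minimal sketch of working out how many of the most frequent words you would need to know to cover a given share of the text. The function name `words_needed_for_coverage` is hypothetical and the counts in the example are made up; it is not part of the scripts in this repo.

```python
from collections import Counter


def words_needed_for_coverage(counts, target):
    """Return how many of the most frequent words are needed so that
    they account for at least `target` (between 0 and 1) of all tokens."""
    total = sum(counts.values())
    if total == 0:
        return 0
    running = 0
    # Walk the counts from most to least frequent, accumulating coverage.
    for rank, count in enumerate(sorted(counts.values(), reverse=True), start=1):
        running += count
        if running / total >= target:
            return rank
    return len(counts)


# Made-up counts for illustration only, not taken from the real lists.
example_counts = Counter({"tha": 50, "agus": 30, "an": 15, "cat": 5})
print(words_needed_for_coverage(example_counts, 0.9))  # → 3
```

With real data you would fill `counts` from the word and count columns of a frequency list instead of the toy dictionary.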

Also, some people just like to work through frequency lists as a means of learning in general. If that's you, there is an existing frequency list at iGàidhlig you might also want to check out. But, because of the corpus it was created from, it favours organisational terms like "pàrlamaid" and "comataidh". The frequency lists here are just as biased by the sources they were created from, but that's not such a terrible thing if you're actually studying from those sources. Additionally, the frequency lists created from Watch Gaelic include spoken content on a fairly wide range of topics from BBC Alba programmes, so they are probably at least broadly applicable to watching TV, or maybe even in general.

What are the sources?

The lists were generated by processing all the text on the following sources.

What are the different types of list?

  • Frequency.csv - A list of individual words and a count of how many times they were encountered in the corpus. This can be useful for finding the vocabulary you need to learn to understand enough of the resource to benefit from it, or to make decisions about which new words you encounter to prioritise.
  • Bigrams1.csv - Pairs of words ordered by how often they appear together in the source text. This can be useful for finding common phrases like "gu bheil" or "'s docha", or which prepositions frequently appear with a verb like "coltach ri" or "fuireach anns". Excludes definite articles and some single character words, to avoid the list largely being "an cat", "an cù" etc.
  • Bigrams2.csv - Pairs of words of four characters or longer, to help find common phrases, or pairs of adjectives and nouns commonly used together.
  • Trigrams.csv - Sets of three words, to help find phrases that include the articles which had to be excluded from the bigrams.
  • You can also find all the lists together in a spreadsheet you can duplicate.
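To make the differences between the lists concrete, here is a minimal sketch of how counts like these could be produced from a piece of text. It is not the repo's actual script: the tokeniser, the `ARTICLES` set and the function names are all assumptions, and the real lists may filter words differently (for example, this sketch drops any pair containing an article or a single-character word, and it lets pairs run across sentence boundaries).

```python
import re
import unicodedata
from collections import Counter

# Assumption: a simple tokeniser that keeps accented letters, internal
# hyphens/apostrophes ("a-nis") and a leading apostrophe ("'s").
TOKEN_RE = re.compile(r"['’]?\w+(?:['’-]\w+)*")

# Assumption: the forms of the definite article to leave out of Bigrams1.
ARTICLES = {"an", "am", "na", "nan", "nam"}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    text = unicodedata.normalize("NFC", text).lower()
    return TOKEN_RE.findall(text)


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def build_lists(text):
    """Return (frequency, bigrams1, bigrams2, trigrams) as Counters."""
    tokens = tokenize(text)
    frequency = Counter(tokens)
    # Pairs with no definite article and no single-character word.
    bigrams1 = Counter(
        pair for pair in ngrams(tokens, 2)
        if not any(w in ARTICLES or len(w) == 1 for w in pair)
    )
    # Pairs where both words are four characters or longer.
    bigrams2 = Counter(
        pair for pair in ngrams(tokens, 2)
        if all(len(w) >= 4 for w in pair)
    )
    # All runs of three words, articles included.
    trigrams = Counter(ngrams(tokens, 3))
    return frequency, bigrams1, bigrams2, trigrams
```

Calling `build_lists(text)` and then `most_common()` on each returned `Counter` gives the entries in descending order of frequency, which is the ordering the lists use.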

What are the words?

The words exist in the list in all the forms found in the text, with no attempt to combine different word forms. So, for example, you'll find both "dèanamh" and "dhèanamh", "bliadhna" and "bhliadhna". This will likely bias the list a little towards certain types of words, but it's better than whatever mess I'd make of trying to fix it.
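If you want a rough idea of how much this split affects a particular word, here is a naive sketch that folds lenited forms (initial consonant followed by "h") into the unlenited form, but only when the unlenited form also appears in the list. This is an assumption-laden heuristic, not something the lists or scripts here do: the helper names are hypothetical, it expects lower-case words, and it will still get some words wrong, since it handles nothing other than lenition.

```python
from collections import Counter

# Consonants whose lenition is written by adding "h" after them.
LENITABLE = set("bcdfgmpst")


def unlenite(word):
    """Strip the "h" from a word-initial lenited consonant, e.g.
    "dhèanamh" -> "dèanamh". Words that do not fit the pattern are unchanged."""
    if len(word) > 2 and word[0] in LENITABLE and word[1] == "h":
        return word[0] + word[2:]
    return word


def merge_lenited(counts):
    """Add each lenited form's count to its unlenited form, but only if
    the unlenited form is itself present in `counts`."""
    merged = Counter()
    for word, count in counts.items():
        base = unlenite(word)
        key = base if base in counts else word
        merged[key] += count
    return merged
```

Requiring the unlenited form to be present keeps words like "tha" from being turned into forms that never occur in the text, at the cost of leaving some genuine pairs unmerged.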

What are the scripts?

The Python scripts used to scrape the content and create the lists are here. They are pretty bare-bones.

Who are you?

A Gaelic learner and not even a good one. Definitely not a linguist or anyone you should trust on this subject in any way. @innesmck
