Sets of lists of Gaelic words and phrases, sorted by how often they occur in pieces of Gaelic text scraped from LearnGaelic.
Being able to follow the Litir Bheag, Litir do Luchd-ionnsachaidh or Watch Gaelic is a pretty important step up in learning Gaelic, because it opens up a huge amount of comprehensible input with clear audio and translations. Frequency lists can help with that, either by identifying the most common vocab needed for the resources to be comprehensible, or by helping you decide which of the new words you encounter will be most useful to learn first.
Also, some people just like to work through frequency lists as a means of learning in general. If that's you, there is an existing frequency list at iGàidhlig you might also want to check out. However, because of the corpus it was created from, it favours organisational terms like "pàrlamaid" and "comataidh". The frequency lists here are just as biased by the sources they were created from, but that's not such a terrible thing if you're actually studying from those sources. Additionally, the frequency lists created from Watch Gaelic include spoken content on a fairly wide range of topics from BBC ALBA programmes, so they are probably at least broadly applicable to watching TV, or maybe even in general.
The lists were generated by processing all the text from the following sources.
- The Litir do Luchd-ionnsachaidh, a weekly letter on topics like folklore, history, nature and culture. The Litir Bheag is a simplified version of this with a full translation, so the frequency lists should be relevant to both. These lists are here.
- Watch Gaelic, extracts from factual shows on BBC ALBA covering a range of topics, with transcripts. These lists are here.
- There are also lists combined from both sources here.
- Frequency.csv - A list of individual words and a count of how many times they were encountered in the corpus. This can be useful for finding the vocabulary you need to learn to understand enough of the resource to benefit from it, or for deciding which of the new words you encounter to prioritise.
- Bigrams1.csv - Pairs of words ordered by how often they appear together in the source text. This can be useful for finding common phrases like "gu bheil" or "'s dòcha", or which prepositions frequently appear with a verb, like "coltach ri" or "fuireach anns". Excludes definite articles and some single-character words, to avoid the list largely being "an cat", "an cù" etc.
- Bigrams2.csv - Pairs of words of four characters or longer, to help find common phrases, or pairs of adjectives and nouns commonly used together.
- Trigrams.csv - Sets of three words, to help find phrases including articles, which had to be excluded from the bigrams.
- You can also find all the lists together in a spreadsheet you can duplicate.
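As a rough illustration of what the list descriptions above imply, here is a minimal Python sketch of counting words, bigrams and trigrams. It is not the code from this repo: the tokenizer, the `ARTICLES` set and the `build_lists` function are all hypothetical, and the real scripts may filter differently (for example, they exclude only "some" single-character words, whereas this sketch drops all of them).

```python
import re
from collections import Counter

# Assumption: an illustrative set of article forms to skip in Bigrams1.
# The actual scripts may use a different list.
ARTICLES = {"an", "am", "a'", "na", "nan", "nam"}


def tokenize(text):
    # Lowercase, then keep runs of word characters and apostrophes,
    # so that "'s" and "a'" survive as tokens.
    return re.findall(r"[\w']+", text.lower())


def ngrams(tokens, n):
    # Yield every run of n adjacent tokens as a tuple.
    return zip(*(tokens[i:] for i in range(n)))


def build_lists(text):
    tokens = tokenize(text)

    # Frequency: every word form counted as-is.
    frequency = Counter(tokens)

    # Bigrams1: adjacent pairs where neither word is an article
    # or a single character.
    bigrams1 = Counter(
        pair for pair in ngrams(tokens, 2)
        if all(w not in ARTICLES and len(w) > 1 for w in pair)
    )

    # Bigrams2: adjacent pairs where both words have four or more characters.
    bigrams2 = Counter(
        pair for pair in ngrams(tokens, 2)
        if all(len(w) >= 4 for w in pair)
    )

    # Trigrams: all runs of three adjacent words, articles included.
    trigrams = Counter(ngrams(tokens, 3))

    return frequency, bigrams1, bigrams2, trigrams


if __name__ == "__main__":
    sample = "Tha an cat gu bheil an cù gu bheil"
    freq, bi1, bi2, tri = build_lists(sample)
    for pair, count in bi1.most_common(3):
        print(" ".join(pair), count)
```

This sketch only counts pairs that were actually adjacent in the text. Filtering words out first and then pairing what remains would be another option, but it would produce pairs that never appeared side by side.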
Words appear in the lists in every form found in the text, with no attempt to combine different word forms. So, for example, you'll find both "dèanamh" and "dhèanamh", "bliadhna" and "bhliadhna". This will likely bias the lists a little towards certain types of words, but it's better than whatever mess I'd make of trying to fix it.
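If you want a rough idea of how many words from Frequency.csv you would need to know to cover a given share of the text, something like the sketch below might help. It assumes each row of the file is a word followed by a count, which may not match the real column layout, and the function names `load_counts` and `words_for_coverage` are made up for this example. Because word forms are not combined, "dèanamh" and "dhèanamh" count as two separate words here, so the result will overstate how much you really need to learn.

```python
import csv


def load_counts(path):
    # Assumption: each row is "word,count". Rows whose second column is not
    # a plain number (such as a header row) are skipped.
    with open(path, newline="", encoding="utf-8") as f:
        return [
            (row[0], int(row[1]))
            for row in csv.reader(f)
            if len(row) >= 2 and row[1].strip().isdigit()
        ]


def words_for_coverage(counts, target=0.8):
    # Return how many of the most frequent words are needed before their
    # combined count reaches the target share of all word occurrences.
    total = sum(count for _, count in counts)
    running = 0
    ranked = sorted(counts, key=lambda wc: wc[1], reverse=True)
    for rank, (_, count) in enumerate(ranked, start=1):
        running += count
        if running >= target * total:
            return rank
    return len(counts)


if __name__ == "__main__":
    # Made-up counts purely for demonstration, not real corpus figures.
    demo = [("tha", 50), ("agus", 30), ("cat", 15), ("cù", 5)]
    print(words_for_coverage(demo, target=0.6))
```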
The Python scripts used to scrape the content and create the lists are here. They are pretty bare-bones.
A Gaelic learner and not even a good one. Definitely not a linguist or anyone you should trust on this subject in any way. @innesmck