Komi-Zyrian corpora

This is the main page of the website that hosts linguistic corpora of the Komi-Zyrian language. Currently, two corpora are available: the corpus of contemporary written literary Komi-Zyrian (“the Main corpus”) and the corpus of social media in Komi-Zyrian. They differ in the kinds of texts they contain, but have mostly identical annotation and search capabilities. Here is a brief comparison:

	Main corpus	Social media corpus
Language	Komi-Zyrian	Komi-Zyrian and Russian
Size	1.76 million words	1.85 million words (the Komi-Zyrian part); 18.98 million words (the Russian part)
Texts	contemporary press (up to February 2019)	open posts and comments by Komi-speaking vkontakte users (up to December 2018)
Language variety	in most cases, standard written literary Komi-Zyrian or close to it	language of digital communication: closer to the spoken variety, influenced by the dialects and by Russian, with numerous instances of code-switching
Annotation	automatic morphological annotation (lemmatization, part of speech, all inflectional features), 92.2% of words analyzed (only tokens that do not contain digits or Latin characters are taken into account); no disambiguation; annotation of Russian loanwords; annotation of several lexical/semantic classes and word formation (animate/human nouns, body parts, transport, different classes of proper names, several nominal derivational suffixes); glossing; Russian translation of lemmata	automatic morphological annotation (lemmatization, part of speech, all inflectional features), 89.1% of words analyzed (only tokens that do not contain digits or Latin characters are taken into account); no disambiguation; annotation of Russian loanwords; annotation of several lexical/semantic classes and word formation (animate/human nouns, body parts, transport, different classes of proper names, several nominal derivational suffixes); glossing; Russian translation of lemmata
Metadata	title of the text; author or title of the newspaper; creation year (exact date in the case of newspapers); genre	group name (for groups); publicly available user metadata: sex (for everyone) and, if available, birth year (grouped in 5-year spans), with real names and nicknames of the users hidden; creation year; message type (post/comment); language (tagged automatically, independently for each sentence)
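
The share of analyzed words in the table is counted only over tokens without digits or Latin characters. The following is a minimal sketch of how such a figure could be computed, not the actual script used for the corpora. The function name `coverage`, the `(wordform, analyzed)` pair format and the sample tokens are hypothetical, and the filter is simplified to ASCII digits and unaccented Latin letters.

```python
import re

# Tokens containing a digit or a Latin letter are left out of the count,
# mirroring the rule stated in the comparison table (simplified: ASCII only).
EXCLUDED = re.compile(r"[0-9A-Za-z]")


def coverage(tokens):
    """Return the percentage of counted tokens that received an analysis.

    tokens: iterable of (wordform, analyzed) pairs, where analyzed is a bool.
    This input format is an assumption made for the sketch.
    """
    counted = [(w, a) for w, a in tokens if not EXCLUDED.search(w)]
    if not counted:
        return 0.0
    analyzed = sum(1 for _, a in counted if a)
    return 100.0 * analyzed / len(counted)


# Toy data: "2019" and "vk" are skipped; "абв" stands for an
# out-of-vocabulary token that got no analysis.
sample = [
    ("кань", True),
    ("воӧ", True),
    ("2019", False),
    ("vk", False),
    ("абв", False),
]
print(round(coverage(sample), 1))
```

In this toy sample, three tokens are counted and two of them are analyzed, so the unanalyzed digit and Latin tokens do not lower the figure.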

Apart from the corpora available here, there is at least one other publicly available written Komi corpus, developed by the FU-Lab team. It contains over 40 million tokens of fiction and has an on-the-fly morphological analyzer (but no search by morphology or lemma). Additionally, there is a spoken corpus in the Komi Media Collection project by the same authors and a spoken corpus of the Pechora dialect, collected in the field by Moscow-based linguists.

You can find more detailed information about the Komi-Zyrian social media corpus and its development in this paper. Please consider citing it if your research is based on this corpus:

Timofey Arkhangelskiy. 2019. Corpora of social media in minority Uralic languages. Proceedings of the fifth Workshop on Computational Linguistics for Uralic Languages, pages 125–140, Tartu, Estonia, January 7 - January 8, 2019.

What is a corpus?

A language corpus is a collection of texts in that language which has been enriched with additional linguistic information, called annotation, and, preferably, equipped with a search engine. Here you will find a short list of frequently asked questions about the Komi-Zyrian corpora.

— Who needs corpora?

First of all, corpora are used by linguists. The search engine and annotation of corpora are designed in such a way that you can make linguistic queries such as “find all nouns in the genitive case” or “find all forms of the word кань followed by a verb”. Apart from linguists, a corpus can be a useful tool for language teachers, language learners, and even native speakers.
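
To give an idea of what a query like “all forms of the word кань followed by a verb” amounts to, here is a toy sketch over annotated tokens. It is not the corpus search engine: the `Token` class, the function names and the two-word sample are hypothetical, and the sample is assembled from word forms mentioned on this page rather than taken from the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    wordform: str
    lemma: str
    tags: frozenset  # grammatical tags such as "N" or "V"


def has_tags(token, required):
    """True if the token carries every tag in `required`."""
    return set(required) <= token.tags


def lemma_followed_by(sentence, lemma, next_tags):
    """Return index pairs (i, i + 1) where token i has the given lemma
    and token i + 1 carries all tags in `next_tags`."""
    return [
        (i, i + 1)
        for i in range(len(sentence) - 1)
        if sentence[i].lemma == lemma and has_tags(sentence[i + 1], next_tags)
    ]


# Toy sentence with one analysis per token (an assumption for the sketch).
sentence = [
    Token("кань", "кань", frozenset({"N", "sg", "nom"})),
    Token("воӧ", "воны", frozenset({"V"})),
]

hits = lemma_followed_by(sentence, "кань", {"V"})
print(hits)
```

The point of the sketch is that the query refers to lemmata and tags rather than to literal strings, which is what distinguishes a corpus search from a plain text search.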

— Can I use the corpus as a library?

No, these corpora are not designed for that. When you work with a corpus, you make a query, i.e. search for a particular word, phrase or construction, and get back all sentences that contain what you searched for. By default, the sentences are shown in random order. You can expand the context of each sentence you get, i.e. look at its neighboring sentences. However, you may do so only a limited number of times for each sentence. Therefore, it is impossible to read an entire text in the corpus. This restriction exists for copyright protection.

— Can I use the corpus as a dictionary?

Each Komi-Zyrian word in the corpus has a Russian translation (no English translations are available at the moment). However, the translations are only provided as auxiliary information for users who do not speak Komi. They are kept short and simple by design: they do not list all senses and do not provide usage examples as real dictionaries do. If you want to know how to translate a word, the right way to do so is to consult a dictionary.

— What is morphological annotation and how do you get it?

The corpora located here are lemmatized and morphologically annotated. Lemmatization means that each word in the texts is annotated with its lemma, i.e. dictionary/citation form. Morphological annotation means that each word is annotated for its grammatical features, such as part of speech, number, case, tense, etc. Since the corpora in question are too large for manual annotation to be feasible, they were annotated automatically with a program called a morphological analyzer. The analyzer uses a manually compiled grammatical dictionary and a formalized description of Komi-Zyrian inflection. The analyzer, together with the necessary materials, is freely available in my bitbucket repository. Unfortunately, automatic annotation means, first, that out-of-vocabulary words are not annotated and, second, that some words receive several competing analyses. For example, confronted with the form воӧ, the analyzer cannot determine whether it should be analyzed as the 1sg possessive form of во (“my year”), the illative form of the same word (“in a year”) or even a form of the verb воны “come”. Russian sentences in the social media corpus were annotated with the mystem analyzer.
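
The ambiguity described above can be pictured as one word form carrying a list of analyses. The sketch below is only an illustration under stated assumptions: the dictionary layout and the `matches` function are hypothetical, the tag sets are reduced to the few tags needed for the example, and it is assumed that a search counts a token as a hit when any one of its analyses fits the query.

```python
# Hypothetical storage: each word form maps to every analysis proposed for it.
# Only the three readings of воӧ mentioned above are listed.
analyses = {
    "воӧ": [
        {"lemma": "во", "tags": {"N", "1sg"}},  # 1sg possessive reading
        {"lemma": "во", "tags": {"N", "ill"}},  # illative reading
        {"lemma": "воны", "tags": {"V"}},       # form of the verb "come"
    ],
}


def matches(wordform, lemma=None, tags=()):
    """True if at least one stored analysis satisfies the query.

    Without disambiguation, lemma and tags must fit the SAME analysis,
    but it does not matter which of the competing analyses that is.
    Word forms missing from the dictionary never match.
    """
    for analysis in analyses.get(wordform, []):
        if lemma is not None and analysis["lemma"] != lemma:
            continue
        if set(tags) <= analysis["tags"]:
            return True
    return False


print(matches("воӧ", tags={"V"}))                 # verb reading exists
print(matches("воӧ", lemma="во", tags={"ill"}))   # illative reading exists
print(matches("воӧ", lemma="воны", tags={"ill"})) # no single analysis fits
print(matches("абв", tags={"N"}))                 # out-of-vocabulary form
```

Under that assumption, a search for verbs and a search for illative nouns would both return воӧ, so some hits will be false positives that have to be filtered by hand.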

Komi language

Komi-Zyrian is one of the two literary standards for the dialect continuum known as Komi, which belongs to the Permic group of the Uralic family, along with Udmurt. The number of Komi-Zyrian speakers is estimated at 150,000. Komi-Zyrian uses a Cyrillic orthography based on the Russian alphabet, with two additional letters. Almost all morphological markers are suffixes that mostly attach to the stem agglutinatively. Nominal grammatical categories are number, case, and possessiveness. The direct object can be marked either with the nominative or with the accusative (differential object marking, DOM). The word order in the sentence is free, with SVO (subject – verb – object) being the default.

Tagset

The grammatical features of the words in the corpora are marked with short tags. Here is the full list of tags used in the Komi-Zyrian corpora. Both corpora use the same set of tags.

A — adjective
APRO — adjectival pronoun
ADV — adverb
ADVPRO — adverbial pronoun
CONJ — conjunction
IMIT — ideophone
INTRJ — interjection
N — noun
NUM — numeral
PARENTH — parenthetic word
PART — particle
PN — proper noun (subtype of nouns)
POST — postposition
PREDIC — predicative
PRO — pronoun
V — verb
1 — 1st person
1pl — 1pl possessive suffix
1sg — 1sg possessive suffix
2 — 2nd person
2pl — 2pl possessive suffix
2sg — 2sg possessive suffix
3 — 3rd person
3pl — 3pl possessive suffix
3sg — 3sg possessive suffix
abbr — abbreviation
abl — ablative case
acc — accusative case
anim — animate noun
app — approximative case
atten — attenuative derivation (-i̮št-)
attr — any attributive
attr_a — general attributive in -a
attr_loc — locative attributive in -sa
body — body part
car — caritive case
card — cardinal number
case_comp — case compounding
caus — causative (-əd-)
cns — consecutive case in -la
coll — collective numeral
com — comitative
comp — comparative (-ǯi̮k)
cvb.gen — general converb (-ig)
cvb.lim — limitative converb (-təǯ́)
cvb.mon — converb in -mən
cvb.neg — negative converb (-təg)
cvb.sim — converb of simultaneity (-əmən)
dat — dative case
delim — delimitative derivation (-l-)
dem — demonstrative pronoun
distr — distributive numeral
egr — egressive case
el — elative case
famn — family name
fut — future tense
gen — genitive case
hum — human
ill — illative case
imp — imperative
impers — impersonal verb (annotation incomplete)
indef — indefinite pronoun
inf — infinitive
ins — instrumental case
intr — intransitive verb (annotation incomplete)
iter — iterative (-av-)
loc — locative/inessive
missp — typo
neg — negative form
neg_attr — negative attributive
nmlz — nominalization in -əm
nmlz_in — locative nominalization in -in
nmlz_lun — abstract noun in -lun
nmlz_tor — abstract noun in -tor
nom — nominative case
oblin — oblinative (adjective in -əś)
ord — ordinal number
pass — passive
pass_sjy — passive (-śi̮-)
pass_ysj — passive (-i̮ś-)
patrn — patronymic
period — numeral in -pərjə
pers — personal pronoun
persn — personal (given) name
pl — plural
pr — case in -śa
prol — prolative case
prs — present tense
pst — first (direct) past tense
pst2 — second (evidential) past tense
ptcp.act — active participle
ptcp.neg — negative participle
refl — reflexive pronoun
rel_adj — relational adjective
rel_n — relational noun (inflected postposition)
rus — Russian borrowing (or borrowing through Russian)
rus_afx — Russian affix with native stem
sg — singular
short — short form of a personal pronoun
supernat — noun that denotes a supernatural being (this category is a byproduct of animacy/humanness annotation; since it is not clear whether these cases should be classified as human, we put them in a separate box, so that users can decide for themselves)
term — terminative case
time_meas — time measurement unit
tr — transitive verb (annotation incomplete)
topn — toponym (geographical name)
transport — transport
vn — verbal noun in -an
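
If you process tagged output in your own scripts, the list above can be turned into a lookup table. The snippet below is a minimal sketch that covers only a handful of the tags. The names `TAGS` and `describe` are hypothetical, and the comma-separated tag string is an assumed format chosen for illustration, not necessarily the one the corpus interface uses.

```python
# A small excerpt of the tagset above, copied as tag -> description.
TAGS = {
    "N": "noun",
    "V": "verb",
    "sg": "singular",
    "pl": "plural",
    "gen": "genitive case",
    "ill": "illative case",
    "1sg": "1sg possessive suffix",
    "pst": "first (direct) past tense",
}


def describe(tag_string, sep=","):
    """Expand a tag string into readable descriptions.

    The separator is an assumption; tags missing from TAGS are reported
    as unknown instead of raising an error.
    """
    parts = [t.strip() for t in tag_string.split(sep) if t.strip()]
    return [TAGS.get(t, f"unknown tag: {t}") for t in parts]


print(describe("N,pl,gen"))
print(describe("V, pst, xyz"))
```

Reporting unknown tags instead of failing is a deliberate choice here, since the excerpt is incomplete and the annotation may contain tags added later.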

The tagset for the Russian-language part (Russian sentences in the social media corpus) can be found in the Russian National Corpus.

Authors

The corpora and the morphological analyzer are developed and maintained by Timofey Arkhangelskiy. The first versions of the corpora were released in 2019 as part of his postdoctoral project supported by the Alexander von Humboldt Foundation. The corpora are hosted by the School of Linguistics at HSE, Moscow.
