Sakha long vowels (II, AA) behave like their short-vowel counterparts
But diphthongs (IA) behave like high vowels (I): they round after any round V and do not trigger rounding of low Vs
Solution:
Each twol harmony rule (char-to-char mapping) is sensitive to whether the harmonising V is a component of a long V or diphthong
Many V harmony alternations required multiple rules to implement
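The rounding behaviour described above can be sketched in Python. This is a toy illustration only, not the transducer's twol rules: `Nucleus` and `is_rounded` are hypothetical names, backness harmony is ignored, and the assumption that a low vowel rounds only after a low round vowel is inferred from the statement that high vowels do not trigger rounding of low vowels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nucleus:
    """A vowel nucleus (hypothetical helper type for this sketch)."""
    high: bool
    rounded: bool
    kind: str = "short"  # "short", "long", or "diphthong"


def patterns_as_high(n: Nucleus) -> bool:
    # Long vowels pattern with their short counterparts (height decides);
    # diphthongs pattern with high vowels regardless of their components.
    return n.high or n.kind == "diphthong"


def is_rounded(target: Nucleus, trigger: Nucleus) -> bool:
    """Decide whether a harmonising vowel surfaces as round after `trigger`."""
    if not trigger.rounded:
        return False
    if patterns_as_high(target):
        # High vowels and diphthongs round after any round vowel.
        return True
    # Assumption: low vowels round only after a round vowel that
    # patterns as low (i.e. not a high vowel and not a diphthong).
    return not patterns_as_high(trigger)


HIGH_ROUND = Nucleus(high=True, rounded=True)
LOW_ROUND = Nucleus(high=False, rounded=True)
LONG_LOW_ROUND = Nucleus(high=False, rounded=True, kind="long")
ROUND_DIPHTHONG = Nucleus(high=False, rounded=True, kind="diphthong")
LOW_TARGET = Nucleus(high=False, rounded=False)
HIGH_TARGET = Nucleus(high=True, rounded=False)
DIPHTHONG_TARGET = Nucleus(high=False, rounded=False, kind="diphthong")
```

The single `patterns_as_high` check is what lets one decision function cover short vowels, long vowels, and diphthongs; in twol the same distinction has to be spelled out in the context of each char-to-char rule.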
Implementation and Challenges
Two-directional consonant assimilation
Problem:
forms like /tutn-bIt-A/ ‘use-past-3’ realised as [tutummuta] (twol: тутyн>BIт>A:тутуммута)
/n/ triggers nasalisation of the following /b/
/b/ triggers labialisation of the preceding /n/
Solution:
Mutual influence not problematic in twol
rules are sensitive to underlying form (left side of :) of adjacent symbol, not surface form (right side of :)
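A minimal Python sketch of why this works, assuming a simplified toy input string (`tutunbuta`, with vowels already resolved) rather than the real twol lexical form. The function names are hypothetical. The parallel version consults only the underlying neighbours, as twol rules do; the sequential version is shown for contrast.

```python
def realise_parallel(underlying: str) -> str:
    """Map each symbol by looking only at its UNDERLYING neighbours."""
    surface = []
    for i, ch in enumerate(underlying):
        prev = underlying[i - 1] if i > 0 else ""
        nxt = underlying[i + 1] if i + 1 < len(underlying) else ""
        if ch == "n" and nxt == "b":
            surface.append("m")  # /n/ labialises before underlying /b/
        elif ch == "b" and prev == "n":
            surface.append("m")  # /b/ nasalises after underlying /n/
        else:
            surface.append(ch)
    return "".join(surface)


def realise_sequential(underlying: str) -> str:
    """Apply the same two changes as ordered rewrites on the surface string."""
    step1 = underlying.replace("nb", "mb")  # labialise n before b
    step2 = step1.replace("nb", "nm")       # nasalise b after n: no "nb" is left
    return step2


TOY_INPUT = "tutunbuta"  # assumed simplified form, for illustration only
```

With `TOY_INPUT`, the parallel version yields `tutummuta`, while the sequential version stops at `tutumbuta` because the first rewrite destroys the context the second one needs.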
Many alternations in a single stem
Problem:
forms like уһун ‘swim[imp]’ / устар ‘swim-pres’
Several different alternations involved:
с ~ һ: intervocalic lenition
н ~ т: sonority restrictions
I ~ ∅: consonant cluster restrictions
Solution:
≥1 twol mapping for each alternation
each mapping sensitive to the others & to other parts of morphophonological context
y used for high vowel ~ ∅ alternation, as in previous work (Washington et al., 2019)
Novel grammatical understanding
Existing literature:
Sakha exhibits many non-finite verb forms
Some have finite uses
Generally categorised roughly as "participle" or "converb"
Our contribution:
Categorised each form carefully based on uses:
verbal noun, verbal adjective, verbal adverb, infinitive
Implemented each use separately
Results in some syncretism (forms existing across multiple categories)
Concluded that there is not a strict participle/converb dichotomy
Documented in more detail in Washington et al. (2021)
Evaluation
Coverage
Naïve coverage: the number of forms in a corpus that receive an analysis, regardless of whether or not the analysis is correct (e.g., in context)
Corpus          Tokens   Coverage
Newspapers      ~16M     91.04%
Wikipedia       ~2.4M    91.30%
New Testament   ~190K    94.53%
Over 90% coverage: robust morphology
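Naïve coverage as defined above can be computed along the following lines. This is a sketch: `analyse` stands in for whatever lookup interface the transducer exposes (a hypothetical callable returning a list of analyses), and the toy lexicon uses placeholder strings rather than real transducer output.

```python
from typing import Callable, Iterable, List


def naive_coverage(tokens: Iterable[str],
                   analyse: Callable[[str], List[str]]) -> float:
    """Share of tokens that receive at least one analysis, correct or not."""
    total = 0
    covered = 0
    for token in tokens:
        total += 1
        if analyse(token):  # a non-empty list means the token is covered
            covered += 1
    return covered / total if total else 0.0


# Toy stand-in for the transducer: placeholder analyses, not real output.
TOY_LEXICON = {
    "уһун": ["placeholder-analysis-1"],
    "устар": ["placeholder-analysis-2"],
}


def toy_analyse(token: str) -> List[str]:
    return TOY_LEXICON.get(token, [])


toy_tokens = ["уһун", "устар", "xyz", "уһун"]
toy_result = naive_coverage(toy_tokens, toy_analyse)  # 3 of 4 tokens covered
```

Coverage here is counted over tokens, matching the token counts in the table, so frequent words weigh more than rare ones.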
Precision & Recall
created gold standard:
1000 valid words of Sakha
randomly selected from Wikipedia corpus
manually annotated output of transducer
Results:
Corpus      Precision   Recall
Wikipedia   98.52%      75.42%
i.e., nearly every analysis returned by the transducer was deemed correct, but many correct analyses were not returned by the transducer (mostly due to incomplete coverage)
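One way to compute such figures from a gold standard is sketched below. It assumes micro-averaging over individual analyses, which the slides do not specify; the function name, word keys, and analysis labels are placeholders.

```python
from typing import Dict, Set, Tuple


def precision_recall(gold: Dict[str, Set[str]],
                     predicted: Dict[str, Set[str]]) -> Tuple[float, float]:
    """Micro-averaged precision and recall over analyses of the gold words."""
    tp = fp = fn = 0
    for word, gold_analyses in gold.items():
        returned = predicted.get(word, set())
        tp += len(gold_analyses & returned)  # correct analyses returned
        fp += len(returned - gold_analyses)  # returned but not in gold
        fn += len(gold_analyses - returned)  # in gold but not returned
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data with placeholder labels (not real annotations).
toy_gold = {"w1": {"a", "b"}, "w2": {"c"}, "w3": {"d"}}
toy_predicted = {"w1": {"a"}, "w2": {"c"}, "w3": set()}
toy_precision, toy_recall = precision_recall(toy_gold, toy_predicted)
```

In the toy data everything returned is correct (precision 1.0) but half of the gold analyses are missing (recall 0.5), the same high-precision, lower-recall shape as the reported results.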
Future work
Correct minor issues in implementation of some morphophonological alternations
identified recently in data generation for a shared task (Pimentel et al., 2021)
Morphological/syntactic disambiguation
More language technology applications of transducer? (spell checkers, MT, etc.?)
Conclusion
Robust transducer
high coverage
high precision
moderate recall
Lots of room for improvement!
Ready for use in language technology applications, downstream tasks
This work has also contributed to documentation of Sakha grammar
A project of the Language Learning Lab at the University of Helsinki
Revita
Computer-Assisted Language Learning system
User uploads arbitrary text
System generates exercises
User practices with exercises
System gives feedback
Revita modules
Conclusion
The functionality of the language learning system is under development. For languages with more speakers, many more linguistic resources and tools are available than for Sakha.
For example, currently, the Sakha system has only noun–postposition government rules.
Cross-Lingual Embeddings for Less-Represented Languages in European News Media
Template-based news generation
HICP (Harmonised Index of Consumer Prices)
Template example
Result example
(Generated in Russian; English translation:) In April 2021, in Finland, the monthly growth rate of the harmonised index of consumer prices for the category 'health' was 2.5 points. The monthly growth rate of the harmonised index of consumer prices for the category 'health' was 2.3 percentage points higher than the EU average. In March 2021, the monthly growth rate of the harmonised index of consumer prices for the category 'health' was 2.2 percentage points higher than the EU average. The monthly growth rate of the harmonised index of consumer prices for the category 'health' was 2.3 points. Finland had the 2nd highest monthly growth rate of the harmonised index of consumer prices for the category 'health' among all observed countries. In April 2021, Finland had the 1st highest value among all observed countries.
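The general idea of template-based generation can be sketched in Python with plain string formatting. The actual template format of the system is not shown in these slides, so `TEMPLATE`, `render`, and the field names are hypothetical; the filled-in values are taken from the first sentence of the result example, rendered in English.

```python
# Hypothetical template: the slot names are assumptions for illustration.
TEMPLATE = (
    "In {month} {year}, in {country}, the monthly growth rate of the "
    "harmonised index of consumer prices for the category '{category}' "
    "was {value} points."
)


def render(template: str, **fields: object) -> str:
    """Fill the named slots of a template with values from a data record."""
    return template.format(**fields)


sentence = render(
    TEMPLATE,
    month="April",
    year=2021,
    country="Finland",
    category="health",
    value=2.5,
)
```

Because every sentence comes from the same fixed wording, the output is repetitive, as the result example shows, but each statement stays tied to a value in the data.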
Turkic benchmark
A project of the Turkic Interlingua community
The goal is to create benchmarks for Extractive Summarization, SQuAD Question Answering, NLI, NER, and some other tasks for 23 Turkic languages
List of languages
Altai
Azerbaijani
Bashkir
Chuvash
Crimean Tatar
Gagauz
Iraqi Turkmen
Karachay-Balkar
Karakalpak
Kazakh
Khakas
Kirghiz
Kumyk
Sakha
Salar
Shor
Tatar
Turkish
Turkmen
Tuvinian
Urum
Uyghur
Uzbek
Conclusions
A free/open-source morphological analyser and generator for Sakha
Sakha in Revita, a computer-assisted language learning platform