AnKaS: Development and Analysis of the Database of Livvi-Karelian Speech Annotations

Authors are hidden for peer review¹

¹ Affiliation is hidden for peer review
INTERSPEECH 2024 (submitted)

Paper (coming soon) | Data | Vocabulary | Code (coming soon)

TODO List

Data collection

Data labelling

INTERSPEECH paper submission

GitHub page creation

arXiv paper submission (after acceptance)

Release code and models (after acceptance)

Abstract

This paper presents a new Livvi-Karelian corpus that addresses challenges encountered in low-resource language research. The main research goal was to collect and annotate new speech data and to create a transcription dictionary. The corpus includes transcripts of radio broadcasts with speech from 17 speakers (7 male and 10 female). It covers about 4.5 hours of audio recordings and contains 32,037 words, making it a useful resource for linguistic research. A distinctive feature of the corpus is code-switching between Livvi-Karelian and Russian. Baseline experiments were carried out with the Kaldi toolkit. Hybrid DNN/HMMs with factorized time-delay neural networks were used for acoustic modeling, and trigram and LSTM-based models were used for language modeling. The proposed model achieved a word error rate (WER) of 26%.

AnKaS Database Key Information

The Database of Annotations of Karelian Speech (AnKaS) includes timestamps, textual transcriptions, and code-switching marks for Livvi-Karelian radio broadcasts.

The database is stored in JSON format, with a separate .json file for each speaker. The following keys are used:

"phrase_id" is the phrase number for this speaker;
"link" is a link to the audio recording;
"time_start" is the start time of the phrase;
"time_end" is the end time of the phrase;
"sentence" is the textual transcription;
"sentence_rus" is the textual transcription with code-switching indicated by brackets and the tag "rus".
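
As a rough illustration of how a speaker file with these keys might be read, here is a minimal Python sketch. It assumes that each .json file holds an array of phrase objects and that "time_start" and "time_end" are numeric values in seconds; neither assumption is confirmed by this page. The file name speaker_01.json, the helper names, and the sample record are hypothetical placeholders, not corpus data.

```python
import json
import os
import tempfile


def load_phrases(json_path):
    """Load one speaker file (assumed to be a JSON array of phrase objects)."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


def phrase_duration(phrase):
    """Phrase duration, assuming time_start/time_end are numbers in seconds."""
    return float(phrase["time_end"]) - float(phrase["time_start"])


# Placeholder record for demonstration only; it is not taken from AnKaS.
sample = [{
    "phrase_id": 1,
    "link": "https://example.org/audio.mp3",
    "time_start": 12.5,
    "time_end": 15.0,
    "sentence": "placeholder transcription",
    "sentence_rus": "placeholder transcription",
}]

# Write the placeholder to a temporary file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "speaker_01.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)
    phrases = load_phrases(path)

durations = [phrase_duration(p) for p in phrases]
print(durations)
```

If the real files store times as strings such as "00:01:23", phrase_duration would need a different parser.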

The files train.txt, dev.txt, and test.txt contain the lists of phrases for training, fine-tuning, and testing the system, respectively. Each entry consists of two numbers: the speaker's ID followed by the phrase's ID.
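
The split lists could be read along the following lines. This is only a sketch: it assumes one entry per line with the two numbers separated by whitespace, and that both IDs are integers. The helper name parse_split and the example lines are hypothetical.

```python
def parse_split(lines):
    """Turn lines such as '3 17' into (speaker_id, phrase_id) tuples.

    Assumes whitespace-separated integer IDs, one entry per line.
    """
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pairs.append((int(fields[0]), int(fields[1])))
    return pairs


# Made-up lines for illustration; with the real data one would pass
# open("train.txt", encoding="utf-8") instead.
example_lines = ["1 1", "1 2", "", "3 17"]
train_pairs = parse_split(example_lines)
print(train_pairs)
```

Each pair can then be matched against the "phrase_id" field in the corresponding speaker's .json file.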

Vocabulary Information

voc.txt contains a list of words from the textual transcriptions in the AnKaS database together with their phonemic representations. The vocabulary includes Karelian words as well as the most frequent Russian words. Transcriptions for Russian words were made according to Russian transcription rules.
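
A possible way to load the vocabulary is sketched below. The exact layout of voc.txt is not described here, so the sketch assumes a Kaldi-style lexicon in which each line holds a word followed by its whitespace-separated phonemes, and in which a word may appear on several lines with alternative pronunciations. The helper name parse_lexicon and the example entries are hypothetical and do not come from voc.txt.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Map each word to a list of pronunciations (each a list of phonemes).

    Assumes lines of the form 'word ph1 ph2 ...'.
    """
    lexicon = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and words without a transcription
        word, phones = fields[0], fields[1:]
        lexicon[word].append(phones)
    return dict(lexicon)


# Artificial entries for illustration only.
example_lines = ["foo f o o", "bar b a r", "bar b a a r"]
lexicon = parse_lexicon(example_lines)
print(lexicon["bar"])
```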

Database Metadata

Database Features	Value
Number of Speakers	17 (7 male, 10 female)
Total Duration	4.5 h
Number of Utterances	4,385
Word Occurrences	32,037
Unique Words	9,117
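
A few derived figures follow directly from the table. The short Python sketch below computes them. Since the total duration is reported only approximately (about 4.5 hours), the results are rough estimates; the variable names are chosen here for illustration.

```python
# Values copied from the metadata table above.
total_hours = 4.5
num_utterances = 4385
word_occurrences = 32037
unique_words = 9117

# Average utterance length in seconds (approximate, as the duration is rounded).
avg_utterance_seconds = total_hours * 3600 / num_utterances

# Average number of words per utterance.
avg_words_per_utterance = word_occurrences / num_utterances

# Type-token ratio: unique words divided by word occurrences.
type_token_ratio = unique_words / word_occurrences

print(f"{avg_utterance_seconds:.2f} s per utterance")
print(f"{avg_words_per_utterance:.2f} words per utterance")
print(f"type-token ratio {type_token_ratio:.3f}")
```

These values work out to roughly 3.7 seconds and 7.3 words per utterance, with a type-token ratio of about 0.28.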