Speechlab

Slavic Languages

Over the last ten years, we have built a highly automated procedure that allowed us to develop ASR systems for seven South Slavic languages within a relatively short period. For all modules of the systems we used only publicly (and freely) available data from the Internet.

To make the development efficient, we tried to benefit from the relations and similarities between the languages as much as possible. We have built a common platform that includes: a) Latin alphabet coding for all the languages (i.e. also for those written in Cyrillic), b) a common phonetic inventory that is helpful mainly during the initial bootstrapping phase, c) a versatile G2P (grapheme-to-phoneme) tool that is applicable, and easily adaptable, to most Slavic languages, d) a versatile digit-to-text translator that handles most of the number-forming patterns occurring in these languages.
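As an illustration of item a), the following is a minimal sketch of Latin alphabet coding for a language written in Cyrillic. It assumes the standard one-to-one correspondence between Serbian Cyrillic and Gaj's Latin alphabet; the page does not state which coding scheme the platform actually uses, and the names `SERBIAN_CYR_TO_LAT` and `to_latin` are hypothetical.

```python
# Hypothetical sketch: map Serbian Cyrillic text to the Latin alphabet.
# The table is the standard Serbian Cyrillic <-> Gaj's Latin correspondence,
# not necessarily the coding used in the platform described above.
SERBIAN_CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def to_latin(text: str) -> str:
    """Transliterate Cyrillic letters; leave all other characters unchanged."""
    out = []
    for ch in text:
        lower = ch.lower()
        latin = SERBIAN_CYR_TO_LAT.get(lower)
        if latin is None:
            # Not a mapped Cyrillic letter (Latin text, digits, punctuation).
            out.append(ch)
        elif ch != lower:
            # Uppercase input: capitalize only the first Latin letter,
            # so that a digraph such as "lj" becomes "Lj".
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


print(to_latin("Београд"))
```

With a single coding like this, text resources for the Cyrillic-script languages can pass through the same downstream tools as the Latin-script ones. Other languages would need their own tables.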

Experiments performed on real data show that the ASR systems achieve results good enough for automatic monitoring of broadcast stations in this region of Europe. Once the systems are in daily use, we intend to collect more data and use it for further improvements, particularly in acoustic modeling.

radiotrans

Application tracking selected radio stations in seven languages in real time
(screenshot taken on 4.1.2018 at 2:05 PM)


TAČR: TA04010199 - MULTILINMEDIA - Multilingual platform for monitoring and analysis of multimedia (2015 - 2017)