V. S. Rogozhina

Postgraduate Student, Department of Applied and Experimental Linguistics,
Institute of Applied and Mathematical Linguistics, Faculty of the Humanities and
Applied Sciences, MSLU; e-mail: mslu.italiano@gmail.com
The development of speech corpora and acoustic-phonetic databases is
indispensable for any research and development work on spoken language
systems. This paper describes the main areas of application of speech corpora
and demonstrates the process of creating a Polish speech database. For
this purpose 40 female and 40 male native speakers of Polish were recorded. The 21
hours of direct face-to-face conversations, interviews and discussions, as well as
audiobooks and audio tapes of Polish textbooks, were analyzed, segmented and
transcribed. Two types of analysis were carried out during the research: acoustic
and perceptual. As a result, an annotated speech database was created together
with a set of transcription rules.
Key words: speech corpora; acoustic-phonetic databases; spoken language
systems; Polish speech database; annotated speech database; database
management system.
V. S. Rogozhina
Postgraduate student, Department of Applied and Experimental Linguistics,
Institute of Applied and Mathematical Linguistics, Faculty of the Humanities and
Applied Sciences, MSLU
(a case study of the Polish language)
Modern speech technologies for automatic speech recognition and speech
synthesis are inconceivable without corpus linguistics and spoken-language
databases. This paper addresses a problem of modern corpus linguistics through
the creation of a spoken-language database of Polish. For this purpose the
voices of 43 men and 40 women were recorded, speakers of mainly the standard
variety of Polish. Twenty-one hours of monologues, dialogues, audiobooks, and
spontaneous and quasi-spontaneous speech were analyzed, segmented and
transcribed. Two types of analysis were carried out during the research:
acoustic and perceptual. As a result, an annotated spoken-language database and
a set of transcription rules were created.
Keywords: corpus linguistics; spoken-language database; language processing
systems; Polish speech database; annotated database; database management
systems.
Spoken language is central to human communication and is closely linked
to both national identity and individual existence
(http://www.ldc.upenn.edu/annotation/). In speech and language technology,
including speech synthesis and recognition, speaker identification, language
identification and message understanding, the common basic need is
speech corpora (http://cslu.cse.ogi.edu/HLTsurvey/ch12node5). Speech
is produced differently by each speaker: every utterance is produced by a
unique vocal tract, which leaves its traces on the signal
(http://www.ldc.upenn.edu/annotation/). Speech corpora vary in features such
as recording conditions, environments, age groups, media used, sampling rates,
data collection protocols, annotation levels and tags. That is why the general
purpose of a speech corpus is to cover as much variability as possible
so that it can be used in various applications [1; 2]. The creation of large
speech databases is one of the important conditions for solving the problems of
speech recognition and speech synthesis [3]. This problem has been examined
and extensively covered by R. K. Potapova and the Department of Applied
and Experimental Linguistics of Moscow State Linguistic University. In
the article “The Main Tendencies of Multilingual Corpus Linguistics”
[1], R. K. Potapova describes the stages of creating speech databases for
French and Arabic. She also points out the challenges that one
may face when developing a speech database.
As for Polish, no audio corpus of acceptable quality had been
created until 1998. In 1998 the Polish SpeechDat(E) database was created
within the SpeechDat(E) project (one of a series of European projects aimed
at creating large telephone speech databases). It contains
recordings of 1,000 Polish speakers (488 male, 512 female) made
over the Polish fixed telephone network [4]. This database, however, is
intended primarily to meet the needs of telecommunication services.
The main aim of the research is to create a speech database that is
suitable for speech recognition and speaker identification. Thus,
the immediate goal of the present paper is confined to giving an overall
description of the main stages of the development of the Polish speech
database.
The creation of a speech database falls into the following stages [5]:
• accumulating sufficient audio material
• analysing the recordings obtained
• segmenting the recordings into chunks
• orthographic transcription
• developing transcription rules for the target language
• phonemic transcription of each segment using the transcription rules
• saving all files on a data carrier (CD-RW or DVD+R)
During the research carried out by the Department of Applied and
Experimental Linguistics of Moscow State Linguistic University, 21 hours of
recordings were obtained in a variety of ways. Eleven hours of recordings are
direct face-to-face conversations, interviews and discussions taken from the
Polish broadcasting website www.polskieradio.pl; the rest of the data consists
of audiobooks and audio tapes of Polish textbooks. To make segmentation easier,
a unique ID was given to each speaker. Table 1 presents information
on the recordings made by female speakers (speaker, source, ID, duration).
Table 1
“Female speakers”
Speaker              Source
Blanca Kutyłowska    Audiobook: Opowiedzcie, jak tam żyjecie
Barbara Utlinska     Audiobook: Pod sloncem Toskanii
Hanna Kaminska       Audiobook: Bella Toskania
                     Audiobook: Weisberger Lauren - Diabeł ubiera się u Prady
Elzbieta Kijowska    Audiobook: Kossak Zofia - Bursztyny
Klaudia Binkowska    Audiobook: Zapolska Gabriela - Z pamietnikow mlodej mezatki
The research involved 40 female and 40 male native speakers. All
recordings were digitized as 16-bit PCM (*.wav) mono files
with a sampling frequency of 16 kHz. A verbatim transcript was made
of all recordings. To facilitate the transcription process, the interactive
signal processing tool PRAAT was used (for more information on PRAAT see
http://www.fon.hum.uva.nl/praat/). PRAAT makes it possible to
visualize the speech signal while creating and viewing
an orthographic transcription. During the transcription process, the audio
files were segmented using Adobe Audition 1.5 and Sound Forge 7.0 by
inserting time markers in unfilled pauses between words. At a later stage
these markers are used as anchor points for the automatic alignment of the
transcript and the speech file. For the broad phonetic transcription of the
data, the SAMPA set was used.
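The recording format stated above (16-bit PCM, mono, 16 kHz) can be checked automatically before files enter the database. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and are not part of the tools used in the study.

```python
import wave

# Target format as stated in the text: 16-bit PCM, mono, 16 kHz.
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH_BYTES = 2
EXPECTED_SAMPLE_RATE_HZ = 16000


def check_wav_format(path):
    """Return a list of problems; an empty list means the file conforms."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"expected mono, found {wav.getnchannels()} channels")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"expected 16-bit samples, found {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() != EXPECTED_SAMPLE_RATE_HZ:
            problems.append(f"expected 16000 Hz, found {wav.getframerate()} Hz")
    return problems


def duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Summing `duration_seconds` over all files of a speaker would give the per-speaker duration of the kind listed in Table 1.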
The Polish speech database created for the “Foresight” project
is being developed with two major objectives. The first is to support
fundamental research on acoustic-phonetic, lexical, semantic and
syntactic phenomena in the language. The second is to capture the variability
that arises from differences among speakers, sex, speaking environments,
recording conditions, transmission channels, etc., which is essential for
solving the tasks of speech recognition and speaker identification.
The created Polish speech database includes the 21 hours of direct
face-to-face conversations, interviews and discussions as well as audiobooks
in Polish. After acoustic and perceptual analysis, the audio files
were segmented into 2256 relatively short chunks (of approximately 20 to
30 seconds each). Figure 1 shows a segmented audio file that was afterwards
cut into chunks.
Figure 1. An example of a segmented audio file (f7)
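The cutting step can be illustrated with a small sketch. It assumes that the time markers placed in pauses are available as a list of times in seconds, and it chooses cut points so that chunks fall in the 20 to 30 second range where the pauses allow. The greedy policy and the function name are assumptions made for illustration. The study placed markers and cut the files manually in audio editors.

```python
def plan_chunks(pause_times, total_duration, min_len=20.0, max_len=30.0):
    """Choose cut points among pause markers; return (start, end) pairs in seconds.

    Cuts are made only at pause markers, so a chunk can exceed max_len
    when no pause falls inside the target window.
    """
    pauses = sorted(t for t in pause_times if 0.0 < t < total_duration)
    chunks = []
    start = 0.0
    while total_duration - start > max_len:
        # Pauses that would give a chunk of min_len..max_len seconds.
        in_window = [t for t in pauses if start + min_len <= t <= start + max_len]
        if in_window:
            cut = in_window[-1]   # latest pause inside the window
        else:
            later = [t for t in pauses if t > start + max_len]
            if not later:
                break             # no usable pause left: keep the rest whole
            cut = later[0]        # first pause after the window
        chunks.append((start, cut))
        start = cut
    chunks.append((start, total_duration))
    return chunks
```

The returned pairs are contiguous and cover the whole file, so each pair can be passed directly to whatever tool performs the actual cut.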
The speech database consists of separate folders named after the speaker’s
ID (e.g. f1, f2, m1, m2, etc.). Each prompt utterance is stored in a
separate file. Each folder contains all the audio chunks produced by one
speaker as well as the orthographic text with a phonemic transcription
in SAMPA, saved in *.txt file format. The created database comprises
83 Polish speakers (40 female, 43 male). The whole speech database is
split across 2 CDs, each of which holds 40 speaker sessions.
Note: SAMPA is an ASCII encoding system for various languages, including
Dutch, based on the International Phonetic Alphabet (IPA).
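A folder layout of this kind is easy to traverse programmatically. The sketch below assumes that speaker IDs consist of `f` or `m` followed by a number, as in the examples above, and that audio chunks are `.wav` files stored next to `.txt` transcripts. The file names used in any example call (such as `f1_001.wav`) are hypothetical, since the paper does not specify a naming scheme for individual chunks.

```python
import re
from pathlib import Path

# Speaker IDs as described in the text: "f" or "m" followed by a number.
SPEAKER_ID = re.compile(r"^([fm])(\d+)$")


def parse_speaker_id(speaker_id):
    """Split an ID such as 'f7' into ('female', 7)."""
    match = SPEAKER_ID.match(speaker_id)
    if match is None:
        raise ValueError(f"not a speaker ID: {speaker_id!r}")
    sex = "female" if match.group(1) == "f" else "male"
    return sex, int(match.group(2))


def list_speaker_files(root):
    """Map each speaker folder under root to its audio and transcript files."""
    layout = {}
    for folder in sorted(Path(root).iterdir()):
        # Folders whose names are not speaker IDs are ignored.
        if folder.is_dir() and SPEAKER_ID.match(folder.name):
            layout[folder.name] = {
                "audio": sorted(p.name for p in folder.glob("*.wav")),
                "text": sorted(p.name for p in folder.glob("*.txt")),
            }
    return layout
```

Comparing the `audio` and `text` lists for each speaker is a quick way to spot chunks that still lack a transcript.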
Along with the creation of the database, a number of phonetic rules were
developed (Appendix 1). Microsoft Office Access 2007 was chosen as the
database management system (DBMS) (Figure 2).
Figure 2. An example of the DBMS
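To make the structure of such a database concrete, the sketch below expresses one possible schema in SQLite through Python's standard `sqlite3` module. This is a substitute chosen for illustration, not the Access database used in the study. The table and column names are assumptions loosely based on the fields listed for Table 1 (speaker, source, ID, duration) and on the per-chunk orthographic and SAMPA transcriptions.

```python
import sqlite3

# Hypothetical schema: one row per speaker, one row per audio chunk.
SCHEMA = """
CREATE TABLE speakers (
    speaker_id TEXT PRIMARY KEY,                        -- e.g. 'f7'
    name       TEXT NOT NULL,
    sex        TEXT NOT NULL CHECK (sex IN ('f', 'm')),
    source     TEXT NOT NULL,                           -- e.g. audiobook title
    duration_s REAL                                     -- total duration in seconds
);
CREATE TABLE segments (
    segment_id   INTEGER PRIMARY KEY,
    speaker_id   TEXT NOT NULL REFERENCES speakers(speaker_id),
    audio_file   TEXT NOT NULL,
    orthographic TEXT NOT NULL,
    sampa        TEXT NOT NULL
);
"""


def create_database(path=":memory:"):
    """Create the tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def segments_per_speaker(conn):
    """Return a dict mapping speaker ID to its number of segments."""
    rows = conn.execute(
        "SELECT speaker_id, COUNT(*) FROM segments "
        "GROUP BY speaker_id ORDER BY speaker_id"
    )
    return dict(rows.fetchall())
```

Keeping speakers and segments in separate tables means that speaker metadata is stored once, however many chunks a speaker contributes.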
The creation of a high-quality speech corpus is a rather complicated
technological task. To solve the problems of speech corpus
development, special coordinating centers were set up for recording,
storing, distributing and creating public and standardized language
resources, including speech resources (http://cslu.cse.ogi.edu/HLTsurvey/).
Among them are:
– LDC (Linguistic Data Consortium, http://www.ldc.upenn.edu)
– CSLU (Center for Spoken Language Understanding, Oregon
Graduate Institute, http://www.CSLU.ogi.edu)
– ELRA (European Language Resources Association, http://www.elra.
Although the collection of speech corpora offered by these centers is
growing every year, only three Polish speech databases are
worth mentioning: the Polish SpeechDat(E) database, PELCRA (Polish
and English Language Corpora for Research and Applications), and Korpus
Języka Polskiego Wydawnictwa Naukowego PWN. These corpora contain
nearly 2 million words, but they cannot be used without
being purchased. Only trial versions of such corpora are available on the
internet. To use the full version one has to be a member of the relevant
coordinating center, which is also costly. That is why it is important for
the development of corpus linguistics in Russia to create its own collection
of speech databases for different languages. A Polish speech
database could make a valuable contribution to this collection.
Speech corpora can be classified by their characteristics and the
purpose for which they were created: task-specific corpora, general-purpose
corpora, lexical, morphological, syntactic and semantic corpora,
acoustic-phonetic databases, databases of suprasegmental features,
databases of source and tract parameters, etc. The speech corpus created here
is intended for generic use. Although extensive work has been carried out,
much remains to be done. Recordings need to be made in a greater
number of environments and with a variety of devices, for example over fixed
and mobile telephones, in hands-free settings, in different types of transport
such as private cars and public vehicles, and in regions with different
geographical conditions. Besides recordings in different conditions, one studio
recording by a professional male or female speaker must be made. The question
of which type of database management system (DBMS) to use remains open.
