Entropic Latino40 Speech Database
Author: Jared Bernstein
Developers: Bill Grundy, Jared Bernstein, Elizabeth Rosenfeld,
Amir Najmi, Psi Mankoski.
Entropic Research Laboratory, Inc.
600 Pennsylvania Ave. SE
Washington, DC 20003
email: info@entropic.com
Introduction:
Entropic Research Laboratory designed the Latino40 database to
provide a set of recordings for training speaker-independent systems
that recognize Latin American Spanish. The resulting database, called
Entropic Latino40, was recorded between 11 July and 9 September 1994
in Palo Alto, California.
The database comprises about 5000 utterance files. These files
include about 125 utterances from each of 40 different speakers, 20
male and 20 female. The recordings were all made with a high-quality,
head-mounted microphone (Shure SM10A) in an office environment, and
the utterances were digitized in 16-bit samples at 16 kHz.
Material:
The Linguistic Data Consortium provided 13,000 sentences that had been
selected (apparently from Latin American newspaper text) by people
working at Texas Instruments. No documentation was available on the
sentence set, and the sentences include a number of anomalous or
ambiguous forms. The sentences are all shorter than 80 characters,
and are not grouped into larger constituents like paragraphs or
stories.
Each of the 13,000 sentences is identified by its own number, sss1
through sss13000. The set of sentences was divided into 13 distinct
subsets of 1000 sentences each, and each successive speaker read from
the next subset of 1000 sentences, rotating through the 13 subsets.
For each speaker, the first 125 acceptable sentences are included in
the Latino40 database. It was necessary to reject 10 or 15 sentences
for many speakers, and as many as 150 for one speaker, in order to
find 125 acceptable ones. The following is a sample of 21 sentences
from subset 3 that includes the longest sentence (sss3816) among the
entire 13,000:
- sss3800 Hay problemas muy serios en estos momentos.
- sss3801 Se habían ilustrado varias características.
- sss3802 A Venezuela y a Guyana les irá particularmente bien.
- sss3803 A su juicio, tal hipótesis era técnicamente inadecuada.
- sss3804 No hacemos gestos porque no somos actores, añadió.
- sss3805 Queremos un poco de seguridad para regresar.
- sss3806 Este límite máximo se aplicó a tres escalas.
- sss3807 Si el Consejo acepta eso, no tengo nada que decir.
- sss3808 Esto puede lograrse a través de una fórmula legal.
- sss3809 No vamos a defraudar al pueblo, dijo Reina.
- sss3810 El economista de la firma, no lo cree.
- sss3811 Postergan juicios políticos en contra de los ministros.
- sss3812 Esa es la parte fácil, dijo el portavoz.
- sss3813 También se debe tener en cuenta otro elemento.
- sss3814 En poco tiempo se iban a instalar diez más.
- sss3815 ¿Es esto un símbolo del dominio japonés de la electrónica?
- sss3816 En mil novecientos ochenta y nueve no se denegó ese tipo de autorización.
- sss3817 Procura evitar accidentes y vertimientos.
- sss3818 Además, se promoverán actividades industriales.
- sss3819 Los hombres utilizan ese tiempo para esconder sus armas.
- sss3820 Ha abierto los ojos y mueve las manos.
Speakers:
Speakers were all paid volunteers who had been informally solicited in
the Palo Alto area. All speakers were adults, ranging in age from 18
to 59 years. All claimed to be native speakers of Latin American
Spanish, although one speaker was rejected entirely because his accent
sounded Brazilian to the person verifying the recordings. Seven
speakers were from Peru; five each from Argentina, Colombia,
Guatemala, and Nicaragua; three from Venezuela; and two each from
Chile, Costa Rica, Cuba, El Salvador, and Mexico.
Naming Convention:
The speakers are identified with four characters: two letters and two
digits. The first letter identifies the country of origin: e.g. 'a'
for Argentina, 'p' for Peru, etc., but 'c' for Colombia, 'b' for Cuba,
'h' for Chile, and 'r' for Costa Rica. The second letter identifies
the speaker's gender: 'm' male or 'f' female. The two-digit number is
an arbitrary identifier in the range 01 to 40.
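The naming convention above can be decoded mechanically. The following
is a minimal Python sketch, not part of the Latino40 distribution. The
function name parse_speaker_id and both lookup tables are illustrative
assumptions; the letters 'g', 'm', 'n', 's' and 'v' are not spelled
out in the text and are inferred from the speaker table.

```python
import re

# Country letters. 'a', 'p', 'c', 'b', 'h' and 'r' are stated in the text;
# 'g', 'm', 'n', 's' and 'v' are inferred from the speaker table (assumption).
COUNTRY_CODES = {
    "a": "Argentina", "b": "Cuba", "c": "Colombia", "g": "Guatemala",
    "h": "Chile", "m": "Mexico", "n": "Nicaragua", "p": "Peru",
    "r": "Costa Rica", "s": "El Salvador", "v": "Venezuela",
}
GENDER_CODES = {"m": "male", "f": "female"}

# One country letter, one gender letter, two digits.
SPEAKER_ID = re.compile(r"([a-z])([mf])(\d{2})")


def parse_speaker_id(speaker_id):
    """Split an id such as 'hf28' into country, gender and number."""
    match = SPEAKER_ID.fullmatch(speaker_id)
    if match is None:
        raise ValueError(f"not a Latino40 speaker id: {speaker_id!r}")
    country_letter, gender_letter, digits = match.groups()
    if country_letter not in COUNTRY_CODES:
        raise ValueError(f"unknown country letter in {speaker_id!r}")
    number = int(digits)
    if not 1 <= number <= 40:
        raise ValueError(f"speaker number out of range in {speaker_id!r}")
    return {
        "country": COUNTRY_CODES[country_letter],
        "gender": GENDER_CODES[gender_letter],
        "number": number,
    }


if __name__ == "__main__":
    print(parse_speaker_id("hf28"))
```

Note that the second letter, not the first, carries the gender, so an
id such as 'mm32' denotes a male speaker from Mexico.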
The forty speakers are identified as follows:
id    age  subset  origin                     verifier note
af01  30    5      Santa Cruz, Argentina
af13  27   12      Buenos Aires, Argentina
af14  43    5      Buenos Aires, Argentina
am19  41   10      Buenos Aires, Argentina
am26  28    4      Buenos Aires, Argentina
bm21  30    2      Havana, Cuba
bm22  55    4      Havana, Cuba
cf11  52    2      Cali, Colombia
cf30  34    5      Bogota, Colombia
cm02  23   11      Bogota, Colombia
cm05  40    7      Bogota, Colombia
cm07  37    8      Bogota, Colombia
gf10  30   13      Quetzaltenango, Guatemala
gf18  27    1      Guatemala City, Guatemala  poor reading
gf20  34   12      San Marcos, Guatemala
gf38  30    8      Guatemala, Guatemala
gm06  29   10      San Marcos, Guatemala      119 sentences; poor reading
gm17  18   12      Guatemala City, Guatemala
hf28  43    9      Valparaiso, Chile
hf39  39   11      Vina del Mar, Chile
hm12  59    9      Santiago, Chile
mf27  28    9      D. F., Mexico
mm32  32    9      Durango, Mexico
nf34  23    6      Granada, Nicaragua         slow reading
nf35  29   10      Managua, Nicaragua
nm15  54    6      Managua, Nicaragua
nm23  44    7      Managua, Nicaragua
pf31  39   13      Lima, Peru
pf33  37    2      Lima, Peru                 slow reading
pf37  23   10      Cusco, Peru
pf40  40    3      Lima, Peru                 uvular /rr/
pm03  36    3      Lima, Peru
pm16  57    3      Lima, Peru
pm24  31    4      Lima, Peru                 poor reading
rf29  59    7      San Jose, Costa Rica
rf36  35   11      San Jose, Costa Rica       poor reading
sf09  46    8      San Salvador, El Salvador
sm04  24    6      San Salvador, El Salvador  poor reading
vf08  28   11      Valencia, Venezuela
vm25  33    5      Caracas, Venezuela
Recording Procedure:
Speakers were seated in an upholstered chair facing the console of a
Silicon Graphics Indy (SGI) workstation computer. Each speaker was
introduced to the procedure to be followed and signed a consent form
to participate in the data collection. Speakers were instructed how to
control the recording software, how to wear the microphone properly,
and how to judge whether or not a particular read rendition would be
acceptable.
The speakers wore a Shure SM10A unidirectional head-worn dynamic
microphone, and controlled the recording session at their own pace
using a recording program designed for the purpose. The recordings
were controlled principally through a "record" button that displayed
the text of the Spanish sentence and initiated recording. The
recording of a sentence was typically ended by pushing a "record
next" button, which terminated the recording of the current sentence
and then initiated the display and recording of the next sentence.
Speakers had access to a full set of other controls that permitted
them to play and re-record earlier sentences if they wished, and move
about in the database they were constructing.
After an initial period during which an Entropic supervisor monitored
the speaker's reading and recording control, speakers were left to
monitor their own reading and recordings.
Speech signals went from the Shure microphone through a Rane MS-1
preamplifier into the 'line input' jack on the SGI Indy workstation.
The gains of the Rane preamplifier and the SGI system were set and
checked once toward the beginning of the recording session and were
left fixed at those levels.
Recording Environment:
The room was a small carpeted office with a floor area of
approximately 3.9 m by 2.9 m and a ceiling height of 2.7 m. The room
was heated and cooled by forced air that entered via a vent high on
the wall above a large cabinet and about 3 m from the subject's head.
The room had two doors that were usually left open, which sometimes
exposed the microphone to passing conversation or to incoming call
signals from nearby telephones.
Except for the carpeted floor, most surfaces in the room were hard and
smooth. For example, subjects sat at a table with a plastified
hardwood surface; there was a large white board immediately to the
subject's right, and the wall behind the computer console was entirely
glass.
Physical dimensions: 3.9 m x 2.9 m (floor to ceiling 2.7 m)
             door     <-------- glass wall ------------>
----------| | | | | |------------------------------------
|       |             |            ________        |    |
|       |             |           |        |       |    |
|       |             |   table   |  SGI   |       |    |
|cabinet|             |           | console|       |    |
|       |             |            --------        |    |
|       |             |                            |    |
|       |             ==============================    |
|       |                                               |
|_______|                        subject                |
|                                seated                 |
|                                                       |
 _                                                      |
   _                                           _________|
door _                                        |         |
      _       --------------                  |  file   |
      _       | bookshelf  |                  | cabinet |
       |-------------------------------------------------
Verification:
The recordings were verified to be fluent and to correspond to the
presented text. Verification was performed by an educated Argentine.
Verification was accomplished primarily by rejecting spoken renditions
that did not correspond to the original text. In general, text was
not altered to correspond to an acceptable, but variant, spoken token.
Some sentences were excluded because of anomalies in the text as
presented. A sentence token was considered a fluent reading if it
contained all and only the printed words in the correct order (no
false starts or repeats) and the words were pronounced in accordance
with any accepted Spanish letter-sound values. This leaves some
inconsistencies due to dialect differences and, more importantly,
leaves some foreign words (especially proper names) pronounced with
pseudo-English or pseudo-French values.
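The word-level part of the fluency criterion ("all and only the
printed words in the correct order") can be illustrated in code. The
Python sketch below is illustrative only: verification of Latino40 was
done by a human listener, and the function names and the idea of
comparing against a typed transcript of what was spoken are
assumptions made for the example. It does not address the
pronunciation part of the criterion.

```python
import re
import unicodedata


def normalize_words(text):
    """Return the lowercase word sequence of text, ignoring punctuation."""
    # NFC keeps accented letters as single characters so they stay inside words.
    text = unicodedata.normalize("NFC", text).lower()
    # Runs of letters, or runs of digits; punctuation and spaces are dropped.
    return re.findall(r"[^\W\d_]+|\d+", text)


def is_fluent_reading(prompt_text, spoken_transcript):
    """True if the transcript has all and only the prompt words, in order."""
    return normalize_words(prompt_text) == normalize_words(spoken_transcript)


if __name__ == "__main__":
    prompt = "Queremos un poco de seguridad para regresar."
    # An exact reading is accepted.
    print(is_fluent_reading(prompt, "queremos un poco de seguridad para regresar"))
    # A repeated word counts as a repeat and is rejected.
    print(is_fluent_reading(prompt, "queremos un un poco de seguridad para regresar"))
```

An exact sequence comparison is used rather than an edit-distance
threshold because the stated criterion tolerates no insertions,
deletions, or reorderings.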
Resulting Files:
The raw speech files were processed to delete excessive initial and
final silences, using a modified version of the find_ep endpointing
program that is part of Entropic's ESPS package. The files are
distributed in NIST SPHERE compressed format.
File headers are formatted as in the following example:
[as printed by SPHERE "h_read"]
database_id latino40
database_version 1.0
sample_rate 16000
sample_n_bytes 2
sample_sig_bits 16
sample_coding pcm,embedded-shorten-v1.09
channel_count 1
microphone Shure SM-10a
prompt_type printed
recording_site ERL Palo Alto
native_language spanish
geographic_origin Santa Cruz, Argentina
age 30
gender Female
sample_count 76801
prompt_text No habiendo objeciones, así quedó acordado.
sample_max 14030
sample_min -13585
sample_byte_format 10
sample_checksum 64953
speaker_Id af01
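
The header fields are enough to derive an utterance's duration and
uncompressed size. The Python sketch below parses a listing in the
"name value" layout printed above (not the on-disk SPHERE header,
which is laid out differently) and is an illustration only; the
function names are assumptions, and only a few of the fields from the
example are reproduced.

```python
# A few fields copied from the example listing above.
HEADER_LISTING = """\
database_id latino40
sample_rate 16000
sample_n_bytes 2
channel_count 1
geographic_origin Santa Cruz, Argentina
sample_count 76801
speaker_Id af01
"""


def parse_header_listing(listing):
    """Map each field name to its value, both kept as strings."""
    fields = {}
    for line in listing.splitlines():
        line = line.strip()
        if not line:
            continue
        # The name is the first token; the value is the rest of the line.
        name, _, value = line.partition(" ")
        fields[name] = value.strip()
    return fields


def duration_seconds(fields):
    """Utterance length: number of samples divided by the sampling rate."""
    return int(fields["sample_count"]) / int(fields["sample_rate"])


def uncompressed_bytes(fields):
    """Size of the sample data once the shorten compression is undone."""
    return (int(fields["sample_count"])
            * int(fields["sample_n_bytes"])
            * int(fields["channel_count"]))


if __name__ == "__main__":
    fields = parse_header_listing(HEADER_LISTING)
    print(fields["speaker_Id"], duration_seconds(fields), uncompressed_bytes(fields))
```

For the example header this gives 76801 / 16000, or about 4.8 seconds
of speech, and 76801 x 2 = 153602 bytes of sample data.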