Entropic Latino40 Speech Database

Author: Jared Bernstein

Developers: Bill Grundy, Jared Bernstein, Elizabeth Rosenfeld, Amir Najmi, Psi Mankoski.

Entropic Research Laboratory, Inc.
600 Pennsylvania Ave. SE
Washington, DC 20003

email: info@entropic.com

Introduction:

Entropic Research Laboratory designed the Latino40 database to provide a set of recordings for training speaker-independent systems that recognize Latin American Spanish. The resulting database, called Entropic Latino40, was recorded between 11 July and 9 September 1994 in Palo Alto, California.

The database comprises about 5000 utterance files. These files include about 125 utterances from each of 40 different speakers, 20 male and 20 female. The recordings were all made with a high-quality, head-mounted microphone (Shure SM10A) in an office environment, and the utterances were digitized in 16-bit samples at 16 kHz.

Material:

The Linguistic Data Consortium provided 13,000 sentences that had been selected (apparently from Latin American newspaper text) by people working at Texas Instruments. No documentation was available on the sentence set, and the sentences include a number of anomalous or ambiguous forms. The sentences are all shorter than 80 characters, and are not grouped into larger constituents like paragraphs or stories.

Each of the 13,000 sentences is identified by its own number, sss1 through sss13000. The set of sentences was divided into 13 distinct subsets of 1000 sentences each, and each successive speaker read from the next subset of 1000 sentences, rotating through the 13 subsets. For each speaker, the first 125 acceptable sentences are included in the Latino40 database. It was necessary to reject 10 or 15 sentences for many speakers, and as many as 150 for one speaker, in order to find 125 acceptable ones. The following is a sample of 21 sentences from subset 3 that includes the longest sentence (sss3816) among the entire 13,000:

sss3800 Hay problemas muy serios en estos momentos.
sss3801 Se habían ilustrado varias características.
sss3802 A Venezuela y a Guyana les irá particularmente bien.
sss3803 A su juicio, tal hipótesis era técnicamente inadecuada.
sss3804 No hacemos gestos porque no somos actores, añadió.
sss3805 Queremos un poco de seguridad para regresar.
sss3806 Este límite máximo se aplicó a tres escalas.
sss3807 Si el Consejo acepta eso, no tengo nada que decir.
sss3808 Esto puede lograrse a través de una fórmula legal.
sss3809 No vamos a defraudar al pueblo, dijo Reina.
sss3810 El economista de la firma, no lo cree.
sss3811 Postergan juicios políticos en contra de los ministros.
sss3812 Esa es la parte fácil, dijo el portavoz.
sss3813 También se debe tener en cuenta otro elemento.
sss3814 En poco tiempo se iban a instalar diez más.
sss3815 ¿Es esto un símbolo del dominio japonés de la electrónica?
sss3816 En mil novecientos ochenta y nueve no se denegó ese tipo de autorización.
sss3817 Procura evitar accidentes y vertimientos.
sss3818 Además, se promoverán actividades industriales.
sss3819 Los hombres utilizan ese tiempo para esconder sus armas.
sss3820 Ha abierto los ojos y mueve las manos.

Speakers:

Speakers were all paid volunteers who had been informally solicited in the Palo Alto area. All speakers were adults, ranging in age from 18 to 59 years. All claimed to be native speakers of Latin American Spanish, although one speaker was rejected entirely because his accent sounded Brazilian to the person verifying the recordings. According to the speaker table below, seven speakers were from Peru; six from Guatemala; five each from Argentina and Colombia; four from Nicaragua; three from Chile; and two each from Costa Rica, Cuba, El Salvador, Mexico, and Venezuela.

Naming Convention:

The speakers are identified with four characters: two letters followed by two digits. The first letter identifies the country of origin: e.g. 'a' for Argentina, 'p' for Peru, etc., but 'c' for Colombia, 'b' for Cuba, 'h' for Chile, and 'r' for Costa Rica. The second letter identifies the speaker's gender: 'm' for male or 'f' for female. The two-digit number is an arbitrary identifier in the range 01 to 40.
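The convention above can be decoded mechanically. The following is a minimal Python sketch, not part of the Latino40 distribution; the function name parse_speaker_id and the lookup tables are hypothetical. The letters a, p, c, b, h, and r are stated in the text, while g, m, n, s, and v are read off the speaker table.

```python
import re

# Country letters: a, p, c, b, h, r are given in the text above;
# g, m, n, s, v are inferred from the speaker table (assumption).
COUNTRY_CODES = {
    "a": "Argentina", "b": "Cuba", "c": "Colombia", "g": "Guatemala",
    "h": "Chile", "m": "Mexico", "n": "Nicaragua", "p": "Peru",
    "r": "Costa Rica", "s": "El Salvador", "v": "Venezuela",
}
GENDER_CODES = {"m": "male", "f": "female"}

# One country letter, one gender letter, two digits.
_ID_PATTERN = re.compile(r"([a-z])([mf])(\d{2})")


def parse_speaker_id(speaker_id):
    """Split an id such as 'af01' into (country, gender, number)."""
    match = _ID_PATTERN.fullmatch(speaker_id)
    if match is None:
        raise ValueError(f"not a Latino40 speaker id: {speaker_id!r}")
    country_letter, gender_letter, digits = match.groups()
    if country_letter not in COUNTRY_CODES:
        raise ValueError(f"unknown country letter: {country_letter!r}")
    number = int(digits)
    if not 1 <= number <= 40:
        raise ValueError(f"speaker number out of range: {number}")
    return COUNTRY_CODES[country_letter], GENDER_CODES[gender_letter], number


if __name__ == "__main__":
    print(parse_speaker_id("af01"))
    print(parse_speaker_id("hm12"))
```

For example, parse_speaker_id("af01") yields Argentina, female, speaker number 1.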

The forty speakers are identified as follows:

 id  age subset   origin                     verifier note

af01  30    5    Santa Cruz, Argentina
af13  27   12    Buenos Aires, Argentina
af14  43    5    Buenos Aires, Argentina
am19  41   10    Buenos Aires, Argentina
am26  28    4    Buenos Aires, Argentina
bm21  30    2    Havana, Cuba
bm22  55    4    Havana, Cuba
cf11  52    2    Cali, Colombia
cf30  34    5    Bogota, Colombia
cm02  23   11    Bogota, Colombia
cm05  40    7    Bogota, Colombia
cm07  37    8    Bogota, Colombia
gf10  30   13    Quetzaltenango, Guatemala
gf18  27    1    Guatemala City, Guatemala   poor reading
gf20  34   12    San Marcos, Guatemala
gf38  30    8    Guatemala, Guatemala
gm06  29   10    San Marcos, Guatemala       119 sentences; poor reading
gm17  18   12    Guatemala City, Guatemala
hf28  43    9    Valparaiso, Chile
hf39  39   11    Vina del Mar, Chile
hm12  59    9    Santiago, Chile
mf27  28    9    D. F., Mexico
mm32  32    9    Durango, Mexico
nf34  23    6    Granada, Nicaragua          slow reading
nf35  29   10    Managua, Nicaragua
nm15  54    6    Managua, Nicaragua
nm23  44    7    Managua, Nicaragua
pf31  39   13    Lima, Peru
pf33  37    2    Lima, Peru                  slow reading
pf37  23   10    Cusco, Peru
pf40  40    3    Lima, Peru                  uvular /rr/
pm03  36    3    Lima, Peru
pm16  57    3    Lima, Peru
pm24  31    4    Lima, Peru                  poor reading
rf29  59    7    San Jose, Costa Rica
rf36  35   11    San Jose, Costa Rica        poor reading
sf09  46    8    San Salvador, El Salvador
sm04  24    6    San Salvador, El Salvador   poor reading
vf08  28   11    Valencia, Venezuela
vm25  33    5    Caracas, Venezuela

Recording Procedure:

Speakers were seated in an upholstered chair facing the console of a Silicon Graphics Indy (SGI) workstation computer. Each speaker was introduced to the procedure to be followed and signed a consent form to participate in the data collection. Speakers were instructed in how to control the recording software, how to wear the microphone properly, and how to judge whether or not a particular read rendition would be acceptable.

The speakers wore a Shure SM10A unidirectional head-worn dynamic microphone, and controlled the recording session at their own pace using a recording program designed for the purpose. Control of the recordings was principally accomplished through a "record" button that displayed the text of the Spanish sentence and initiated recording. The recording of a sentence was typically ended by pushing a "record next" button, which terminated the recording of the current sentence and then initiated the display and recording of the next sentence. Speakers had access to a full set of other controls that permitted them to play and re-record earlier sentences if they wished, and to move about in the database they were constructing.

After an initial period during which an Entropic supervisor monitored the speaker's reading and recording control, speakers were left to monitor their own reading and recordings.

Speech signals went from the Shure microphone through a Rane MS-1 preamplifier into the 'line input' jack on the SGI Indy workstation. The gains of the Rane preamplifier and the SGI system were set and checked once toward the beginning of the recording session and were left fixed at that level.

Recording Environment:

The room was a small carpeted office with a floor area of approximately 3.9 m by 2.9 m and a ceiling height of 2.7 m. The room was heated and cooled by forced air that entered via a vent high on the wall above a large cabinet and about 3 m from the subject's head. The room had two doors that were usually left open, and sometimes exposed the microphone to passing conversation or incoming call signals from various nearby telephones.

Except for the carpeted floor, most surfaces in the room were hard and smooth. For example, subjects sat at a table with a plastic-laminated hardwood surface; there was a large white board immediately to the subject's right, and the wall behind the computer console was entirely glass.

Physical dimensions: 3.9 m x 2.9 m (floor to ceiling 2.7 m)

             door      <-------- glass wall ------------>
----------| | | | | |------------------------------------  
|       |               |                 ________     | |
|       |               |                |        |    | |
|       |               |       table    |   SGI  |    | |
|cabinet|               |                | console|    | |
|       |               |                 --------     | |
|       |               |                              | |
|       |                ==============================  |
|       |                                                |
|_______|                                  subject       |
        |                                   seated       |
        |                                                |
        _                                                |
        _                                       _________|
  door  _                                     |          |
        _             --------------          |  file    |
        _            |   bookshelf  |         | cabinet  |
        |-------------------------------------------------

Verification:

The recordings were verified to be fluent and to correspond to the presented text. Verification was performed by an educated Argentine, primarily by rejecting spoken renditions that did not correspond to the original text. In general, text was not altered to correspond to an acceptable, but variant, spoken token. Some sentences were excluded because of anomalies in the text as presented. A sentence token was considered a fluent reading if it contained all and only the printed words in the correct order (no false starts or repeats) and the words were pronounced in accordance with any accepted Spanish letter-sound values. This leaves some inconsistencies due to dialect differences and, more importantly, leaves some foreign words (especially proper names) pronounced with pseudo-English or pseudo-French values.
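The word-level part of the fluency criterion ("all and only the printed words in the correct order") can be illustrated in code. The Latino40 verification itself was done by a human listener; the Python sketch below only shows how the same rule might be applied to a hypothetical word transcript of a recording. The function names are illustrative, and the check says nothing about pronunciation.

```python
import re


def words(text):
    """Lowercase the text and return its words, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def matches_prompt(prompt, transcript):
    """True if the transcript has exactly the prompt's words, in order.

    Any inserted, omitted, repeated, or reordered word makes the
    comparison fail, mirroring the 'all and only' rule in the text.
    """
    return words(prompt) == words(transcript)


if __name__ == "__main__":
    prompt = "Queremos un poco de seguridad para regresar."  # sss3805
    print(matches_prompt(prompt, "queremos un poco de seguridad para regresar"))
    print(matches_prompt(prompt, "queremos un un poco de seguridad para regresar"))
```

Under this rule the first transcript is accepted and the second, which contains a repeated word, is rejected.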

Resulting Files:

The raw speech files were processed to delete excessive initial and final silences, using a modified version of the find_ep endpointing program that is part of Entropic's ESPS package. The files are distributed in NIST SPHERE compressed format.
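As a rough illustration of what trimming initial and final silence involves, here is a minimal Python sketch of a simple amplitude-threshold approach. It is not the find_ep algorithm and not Entropic's modified version of it; the function name trim_silence, the 10 ms frame length, the threshold, and the padding are all assumptions chosen for the example.

```python
def trim_silence(samples, frame_len=160, threshold=500, pad_frames=10):
    """Drop leading and trailing frames whose mean level is below threshold.

    samples    -- list of signed 16-bit sample values
    frame_len  -- samples per frame (160 is 10 ms at 16 kHz; assumed)
    threshold  -- mean absolute amplitude that counts as speech (assumed)
    pad_frames -- frames of context kept on each side (assumed)
    """
    active = []  # indices of frames judged to contain speech
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        level = sum(abs(s) for s in frame) / len(frame)
        if level >= threshold:
            active.append(start // frame_len)
    if not active:
        return []  # nothing above threshold: treat the file as silence
    first = max(active[0] - pad_frames, 0)
    last = active[-1] + pad_frames  # inclusive; slicing clips at the end
    return samples[first * frame_len:(last + 1) * frame_len]


if __name__ == "__main__":
    # 0.2 s of silence, 0.1 s of "speech", 0.2 s of silence at 16 kHz.
    signal = [0] * 3200 + [1000] * 1600 + [0] * 3200
    print(len(signal), len(trim_silence(signal, pad_frames=2)))
```

A real endpointer typically also has to cope with background noise, breath sounds, and low-energy consonants at utterance edges, which a fixed threshold handles poorly.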

File headers are formatted as in the following example (as printed by the SPHERE "h_read" utility):

database_id latino40
database_version 1.0
sample_rate 16000
sample_n_bytes 2
sample_sig_bits 16
sample_coding pcm,embedded-shorten-v1.09
channel_count 1
microphone Shure SM-10a
prompt_type printed
recording_site ERL Palo Alto
native_language spanish
geographic_origin Santa Cruz, Argentina
age 30
gender Female
sample_count 76801
prompt_text No habiendo objeciones, así quedó acordado.
sample_max 14030
sample_min -13585
sample_byte_format 10
sample_checksum 64953
speaker_Id af01
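
The header fields above are enough to work out the length and uncompressed size of an utterance. The following Python sketch parses the printed "name value" form shown above (not the raw on-disk SPHERE header, which has a different layout) and derives both figures. The function names are hypothetical, and only a few of the fields from the example are repeated here.

```python
# A few fields copied from the example header above.
SAMPLE_HEADER = """\
database_id latino40
sample_rate 16000
sample_n_bytes 2
channel_count 1
geographic_origin Santa Cruz, Argentina
sample_count 76801
speaker_Id af01
"""


def parse_printed_header(text):
    """Turn 'name value' lines into a dict; values are kept as strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first space only, so values may contain spaces.
        name, _, value = line.partition(" ")
        fields[name] = value.strip()
    return fields


def duration_seconds(fields):
    """Utterance length: number of samples divided by the sample rate."""
    return int(fields["sample_count"]) / int(fields["sample_rate"])


def uncompressed_bytes(fields):
    """Size of the sample data once decompressed, excluding the header."""
    return (int(fields["sample_count"])
            * int(fields["sample_n_bytes"])
            * int(fields["channel_count"]))


if __name__ == "__main__":
    header = parse_printed_header(SAMPLE_HEADER)
    print(header["geographic_origin"])
    print(duration_seconds(header), uncompressed_bytes(header))
```

For the example header, 76801 samples at 16000 samples per second is about 4.8 seconds of speech, or 153602 bytes of 16-bit single-channel sample data before compression.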