AUTOMATIC SPEECH RECOGNITION IN PORTUGUESE APPLIED TO RADIO COMMUNICATION
Name: LUCAS GRIGOLETO SCART
Publication date: 06/03/2024
Examining board:
Name | Role |
---|---|
FILIPE WALL MUTZ | Internal Examiner |
JORGE LEONID ACHING SAMATELO | Internal Examiner |
MARIANA RAMPINELLI FERNANDES | External Examiner |
RAQUEL FRIZERA VASSALLO | Chair |
Summary: Speech is the main form of communication used between humans, and as such understanding
spoken language is one of the main goals of natural language processing. Automatic speech
recognition, the focus of this work, is the ability of a machine to recognize the content of
words and sentences in a spoken language and transform them into a textual format. Currently,
methods based on deep neural networks dominate the field of speech processing, achieving
state-of-the-art results in multiple applications.
As the field of speech recognition continues to evolve, several challenges arise when attempting to
adapt models to new languages and datasets, particularly in the context of radio communication
recordings, as presented in this study. Compared to English, Portuguese has less available
annotated speech data, making it essential to explore methods for effectively utilizing unlabeled
data during training. Additionally, radio communication recordings exhibit a substantial degree
of variation in background noise and speaker characteristics compared to other audio datasets.
This variability can affect the accuracy and robustness of the model.
This study proposes utilizing out-of-domain annotated data through a data augmentation method
to build baseline models. In addition, we explore the effective use of unlabeled in-domain data
via self-training techniques by generating pseudo-labels. Finally, we present an efficient training
recipe for scaling the fine-tuning of large models while minimizing computational costs. These models
were then deployed as part of a broader speech processing application that was developed to
assist in the auditing process of recorded railway communications.
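The abstract does not spell out the augmentation procedure, so the following is only a minimal sketch of one common ingredient of such methods: adding noise to clean out-of-domain speech at a chosen signal-to-noise ratio (SNR). The function names `mean_power` and `mix_at_snr` are hypothetical, the signals are plain Python lists of samples, and nothing here should be read as the author's actual simulation pipeline.

```python
import math
from itertools import cycle, islice


def mean_power(signal):
    """Average power of a signal given as a sequence of float samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the requested SNR in dB.

    Illustrative assumption: plain additive noise, looped to cover the utterance.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise so that it covers the whole utterance.
    noise = list(islice(cycle(noise), len(speech)))
    speech_power = mean_power(speech)
    noise_power = mean_power(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must have non-zero power")
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the target noise power.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]
```

At an SNR of 0 dB, the most challenging level mentioned in the abstract, the scaled noise carries the same average power as the speech itself.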
When training with the simulated data, we observed a relative reduction
of 51.7% in the character error rate at the most challenging noise level (SNR of 0
dB), with a similar decrease at all noise levels, compared with the vanilla model. With
self-training using in-domain data, we observed a reduction of 63.8% in character error rate
compared to the baseline model. We hope that the methodology developed in this work will
pave the way for more robust speech recognition models with future applications in radio
communication.
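For readers unfamiliar with the metric, the sketch below shows how a character error rate (CER) and a relative reduction between two systems are conventionally computed. It assumes the standard definition of CER as the character-level edit distance divided by the reference length. The abstract does not state the exact scoring setup (for example, text normalization), and the helper names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Fraction by which an error rate dropped relative to the baseline."""
    return (baseline - improved) / baseline
```

As a purely illustrative case with made-up numbers, a model that lowers the CER from 0.20 to 0.10 achieves a relative reduction of 50%, even though the absolute drop is only 10 percentage points.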