Hirofumi Inaguma

Toward low-latency and accurate simultaneous interpretation from speech

Humans understand speech incrementally as they listen, but can machines do the same?

Automatic speech recognition (ASR) and speech translation (ST) are crucial techniques for simultaneous interpretation of spontaneous speech. Recent progress in deep learning has enabled the rapid development of so-called “end-to-end” ASR systems that dispense with sophisticated modularization and can even surpass traditional hybrid systems in accuracy. To generate transcriptions as soon as possible, online streaming decoding has been studied. However, because end-to-end optimization rewards accuracy alone, streaming models tend to exploit future acoustic observations whenever doing so improves recognition accuracy; the resulting delay degrades the user experience and hurts downstream NLP tasks.
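To make the latency problem concrete, a common diagnostic is to compare the frame at which a streaming decoder emits each token against that token's time in a forced alignment; the gap is the perceived (emission) latency. The following is a minimal sketch of that measurement, not code from the talk: the 40 ms frame shift and the emission_frames/reference_frames inputs are hypothetical stand-ins for what a real decoder and aligner would produce.

    # Minimal sketch: quantifying token emission latency in streaming ASR.
    # Assumptions (not from the talk): inputs are per-token frame indices,
    # and encoder frames arrive every 40 ms after subsampling.

    FRAME_SHIFT_SEC = 0.04  # hypothetical frame shift after subsampling

    def token_emission_latency(emission_frames, reference_frames):
        """Return per-token and mean latency (seconds) of a streaming decoder.

        emission_frames:  frame index at which the model emitted each token
        reference_frames: frame index of each token's end in a forced alignment
        """
        assert len(emission_frames) == len(reference_frames)
        delays = [
            (e - r) * FRAME_SHIFT_SEC
            for e, r in zip(emission_frames, reference_frames)
        ]
        return delays, sum(delays) / len(delays)

    # Example: the model waits for future frames before emitting each token.
    delays, mean_delay = token_emission_latency(
        emission_frames=[30, 55, 82], reference_frames=[18, 47, 70]
    )
    print(delays, mean_delay)  # [0.48, 0.32, 0.48] 0.4266...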

In this talk, I’ll introduce novel training frameworks that reduce perceived latency and improve accuracy at the same time in streaming sequence-to-sequence ASR models. To this end, I propose leveraging alignment information extracted from an external hybrid ASR model. I will then extend the method to a purely end-to-end framework that requires no external models. Finally, I’ll present recent efforts toward low-latency and accurate end-to-end speech translation based on non-autoregressive sequence generation.
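As a rough illustration of the alignment-based idea, one can add a regularizer during training that penalizes the model whenever its expected emission time for a token lags behind the time given by the external hybrid ASR alignment. The sketch below is a hedged reading of that idea, not the talk's actual formulation; the tensor layout, function name, and penalty weight are all assumptions.

    import torch

    def delayed_emission_penalty(emit_probs, ref_frames, weight=1.0):
        """Hypothetical latency regularizer for streaming seq2seq ASR training.

        emit_probs: (num_tokens, num_frames) tensor; emit_probs[i, t] is the
                    model's probability of emitting token i at encoder frame t
                    (e.g., a per-token attention or alignment distribution).
        ref_frames: (num_tokens,) tensor of reference emission frames taken
                    from an external hybrid ASR forced alignment.
        Returns a scalar penalty that grows when the expected emission frame
        of a token falls behind its reference frame.
        """
        num_frames = emit_probs.size(1)
        frames = torch.arange(num_frames, dtype=emit_probs.dtype)
        expected_frames = (emit_probs * frames).sum(dim=1)  # E[t] per token
        # Penalize only late emissions; early emissions are not discouraged.
        late = torch.clamp(expected_frames - ref_frames.to(emit_probs.dtype), min=0)
        return weight * late.mean()

    # This term would be added to the usual training loss, e.g.:
    # loss = ce_loss + delayed_emission_penalty(emit_probs, ref_frames, 0.5)

For the non-autoregressive direction, the simplest widely used instance is CTC-style parallel decoding: every output position is predicted at once from frame-wise classifiers and then collapsed, instead of being generated token by token. Again a minimal illustration, not the specific model from the talk:

    def ctc_greedy_decode(log_probs, blank=0):
        """Collapse frame-wise argmax predictions into an output sequence:
        merge repeated labels, then drop blanks. All frames are decided in
        parallel, so decoding cost does not grow with output length."""
        best = log_probs.argmax(dim=-1).tolist()  # (num_frames,) label ids
        out, prev = [], blank
        for label in best:
            if label != blank and label != prev:
                out.append(label)
            prev = label
        return out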


Hirofumi Inaguma is a PhD student at Kyoto University, advised by Tatsuya Kawahara. He works on automatic speech recognition (ASR) and speech translation. He is one of the main contributors to ESPnet.

Presentation Materials

Talk Video
Talk Slides