Speaker diarization and speech recognition in the semi-automatization of audio description: An exploratory study on future possibilities?


  • Héctor Delgado, Universitat Autònoma de Barcelona
  • Anna Matamala, Universitat Autònoma de Barcelona
  • Javier Serrano, Universitat Autònoma de Barcelona




This article presents an overview of the technological components used in audio description and suggests a new scenario in which speech recognition, machine translation, and text-to-speech, combined with human revision, could be used to increase audio description provision. The article focuses on a process in which speaker diarization and speech recognition are used together to obtain a semi-automatic transcription of the audio description track. The technical process is presented and the experimental results are summarized.

Author Biographies

Héctor Delgado, Universitat Autònoma de Barcelona

BS in Computer Science Engineering from Universidad de Sevilla, Spain, and MS in Multimedia Technologies from Universitat Autònoma de Barcelona, Spain. PhD candidate at the Department of Telecommunications and Systems Engineering at Universitat Autònoma de Barcelona, Cerdanyola del Vallès, Barcelona, Spain.

Anna Matamala, Universitat Autònoma de Barcelona

BA in Translation and Interpreting from Universitat Autònoma de Barcelona, and PhD in Applied Linguistics from Universitat Pompeu Fabra (Barcelona). Tenured senior lecturer at Universitat Autònoma de Barcelona (Spain).

Javier Serrano, Universitat Autònoma de Barcelona

BA in Computer Science (Universitat Autònoma de Barcelona) and PhD in Automatic Control (Computer Science Program, UAB). Associate Professor at Universitat Autònoma de Barcelona.





How to Cite

Delgado, H., Matamala, A., & Serrano, J. (2015). Speaker diarization and speech recognition in the semi-automatization of audio description: An exploratory study on future possibilities? Cadernos de Tradução, 35(2), 308–324. https://doi.org/10.5007/2175-7968.2015v35n2p308