Mon, 9 May, 12:00 - 12:45 UTC
Audio/video content transcription and associated activities form the predominant application market for ASR (1). As a world leader in the human-in-the-loop content transcription space, Rev is in a unique position to assess and understand the impact ASR can have on the productivity of transcriptionists whose job is to produce publishable transcripts. The usefulness of providing the output of NLP as a first draft for humans to post-edit was first discussed with respect to MT (Bar-Hillel 1951). More recently, ASR output has been provided as a first draft to increase transcriptionist productivity; the correlation between that productivity and automatic transcript quality can be analyzed within Rev’s unique AI-powered transcription marketplace. Importantly, Papadopoulou et al. 2021 showed that the system with the lowest WER is not necessarily the system requiring the least transcriptionist effort.
We propose an analysis of the interaction between several metrics of ASR, diarization, punctuation and truecasing accuracy, and the productivity of our 50,000 Revvers, who transcribe more than 15,000 hours of media every week. We also examine the effects of noise conditions, audio/video content and errors that may impact transcriptionist quality of experience. For example, when we released OOV recovery using subword models, the occasional generation of nonsense words prompted strong reactions from transcriptionists who believed a bug had been introduced.
Because Rev owns both the AI and the human-centric marketplace, we have a unique advantage for studying the productivity and quality impact of model changes. In addition, we have enabled a virtuous circle whereby transcriptionist edits feed into improved models. Through our work, we hope to focus attention on the human productivity and quality of experience aspects of improvements in ASR and related technologies. Given both the predominance of content transcription applications and the still-elusive objective of perfect machine performance, keeping the human in the loop, in both practice and mind, is crucial.
(1.) “Speech-to-Text API Market: Global Forecast to 2026”, Markets and Markets, 2021. “Fraud Detection and Prevention” is the largest application, but that is speaker verification. “Risk and Compliance Management” is the second largest, but that is essentially content transcription for a specific purpose. “Content Transcription” is the third largest but, combined with “Risk and Compliance Management”, forms the largest application space (ahead of Customer Management, Contact Center Management, Subtitle Generation and Other Applications).
As Head of AI at Rev, Miguel leads the research and development team responsible for building the world’s most accurate English speech-to-text model, deployed internally to increase transcriptionists’ productivity, and externally as an API powering some of the most innovative speech tech companies.
Prior to Rev, Miguel spent 15 years as a Speech Scientist building voice applications in the call center and voice assistant industry at Nuance Communications, and, later on, in the automotive industry at VoiceBox.
In a past life, Miguel earned a graduate degree in Mathematics at McGill University in Montreal, studying the mathematical properties and structure of phylogenetic trees.