Industry Expert Sessions
Sun, 8 May, 14:00 - 14:45 France Time (UTC +2)
Sun, 8 May, 12:00 - 12:45 UTC
Sun, 8 May, 08:00 - 08:45 New York Time (UTC -4)
Digital audio signal processing technology is often implemented in such a way that it requires little or no user interaction for it to function as intended. The user of a teleconferencing device or music playback system often isn't aware of the acoustic echo cancellation, noise suppression, speech enhancement, or various limiters and equalizers that are being applied to improve the quality of the audio. However, sometimes the user is ideally placed to provide inputs or measurements to the system that will improve its performance, or indeed enable it to perform. This is where DSP meets user experience, presenting new and sometimes challenging considerations for both the signal processing methods and interaction design.
In this presentation I will discuss the development and design of a feature called Trueplay, which exists on all Sonos products today. Trueplay is a user-facing audio feature, used to adapt the sound of our speakers to the listening environment. Early in the design of Trueplay, it was recognised that in order to estimate how a loudspeaker sounds in a room, there really is no substitute for an in-room measurement at multiple locations, including measurements not made directly on the speaker itself (as would be the case with on-board microphones). This raised the question of whether it would be possible to get a regular user to make a room-average acoustic measurement in their home - something that would normally be performed by acoustics or audio systems engineers - whilst maintaining the simplicity and quality of the Sonos user experience. I will discuss how the entire process - including the measurement method, stimulus tones, user guidance, and feedback - was informed by balancing the objectives of premium sound quality with human-centric design, to achieve not only a performant result, but also a pleasant, and perhaps even magical, end user experience.
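To make the idea of a room-average measurement concrete, here is a hypothetical sketch (not Sonos's actual Trueplay algorithm): magnitude responses measured at several listener positions are averaged in the power domain, and a corrective EQ in dB is derived against a target response. The function names and the flat target are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of a room-average measurement and correction, for
# illustration only. Real systems involve calibrated stimuli, smoothing,
# and perceptual constraints not shown here.

def room_average_response(measurements):
    """measurements: array of shape (n_positions, n_bins), magnitude spectra.
    Averaging in the power domain gives a spatially averaged response."""
    power = np.mean(np.abs(measurements) ** 2, axis=0)
    return np.sqrt(power)

def correction_db(avg_response, target_response):
    """Per-bin gain (in dB) that moves the averaged response toward the target."""
    eps = 1e-12  # guard against log of zero
    return 20 * np.log10((target_response + eps) / (avg_response + eps))

# Toy example: three measurement positions, four frequency bins, flat target.
meas = np.array([[1.0, 2.0, 0.5, 1.0],
                 [1.0, 1.0, 0.5, 1.0],
                 [1.0, 2.0, 0.5, 1.0]])
avg = room_average_response(meas)
eq = correction_db(avg, np.ones(4))
```

The power-domain average is a common choice for spatial averaging because it reflects perceived steady-state energy across the listening area rather than any single position.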
Adib Mehrabi is a Senior Manager in the Advanced Technology group at Sonos, Inc., and an Honorary Lecturer at Queen Mary University of London, UK. He received his PhD from the Centre for Digital Music at Queen Mary University of London, UK, and a BSc in Audio Engineering from the University of the West of England, UK. Prior to working at Sonos, Adib was Head of Research at Chirp - a company that developed audio signal processing and machine learning methods for transmitting data between devices using sound. Adib currently leads the Advanced Rendering and Immersive Audio research group at Sonos, Inc.
Sun, 8 May, 15:00 - 15:45 France Time (UTC +2)
Sun, 8 May, 13:00 - 13:45 UTC
Sun, 8 May, 09:00 - 09:45 New York Time (UTC -4)
In the 1968 motion picture "2001: A Space Odyssey", Stanley Kubrick famously pictured mankind's aspiration to create artificial intelligences that can communicate with humans; but even before that film, we, human beings, had made incessant efforts to develop human-like intellectual systems. The pursuit of human-level automatic speech recognition (ASR) technology, along the same line, has its own history, one that has stimulated a great deal of technological advances throughout the journey. This Industry Expert talk reviews the recent history of this odyssey by the speech signal processing and machine learning communities to achieve or even exceed human parity in ASR systems, focusing on the breakthroughs made in the deep learning era in the context of Switchboard and LibriSpeech, the two most widely adopted standard benchmark datasets. In addition, we discuss how the industry carries the knowledge gained by the research communities through these breakthroughs - for example, the transfer learning paradigm of leveraging neural network models trained in an unsupervised way, such as Wav2Vec or HuBERT - into practical settings in the wild, where ASR services must meet diverse market demands in a scalable manner. We also suggest ways to address in-practice requirements, like memory constraints on hand-held devices or latency for streaming use cases, in order to make ASR services dependable with human-level accuracy across various end-user scenarios.
The targeted audience of this Industry Expert talk is broad, ranging from graduate students starting their research careers in the speech field to senior industry practitioners who want to apply state-of-the-art speech recognition approaches to their own problem domains. The talk will provide the timely perspectives of an industry expert - formerly a well-known research scientist in prestigious industry research labs including IBM Watson, and now a leading figure driving technology development in ASAPP's cutting-edge AI services in the customer experience (CX) domain - on the present landscape and future directions of scalable ASR modeling and deployment, which should be well received by, and inspire, many ICASSP attendees.
Dr. Kyu J. Han is a Sr. Director of Speech Modeling and ML Data Labeling at ASAPP, leading an applied science team primarily working on automatic speech recognition and voice analytics for ASAPP's machine learning services to its enterprise customers in the customer experience (CX) domain. He received his Ph.D. from the University of Southern California, USA in 2009 under Prof. Shrikanth Narayanan and has since held research positions at IBM Watson, Ford Research, Capio (acquired by Twilio), JD AI Research and ASAPP. At IBM he participated in the IARPA Biometrics Exploitation Science and Technology (BEST) project and the DARPA Robust Automatic Transcription of Speech (RATS) program. He led a research team at Capio, where the team achieved state-of-the-art performance in telephony speech recognition and successfully completed a government-funded project for noise-robust, on-prem ASR system integration across 13 different languages. Dr. Han is actively involved in speech community activities, serving as a reviewer for IEEE, ISCA and ACL journals and conferences. He was a member of the Speech and Language Processing Technical Committee (SLTC) of the IEEE Signal Processing Society from 2019-2021, where he served as the Chair of the Workshop Subcommittee, and he served on the Organizing Committee of IEEE SLT-2021. He was a Survey Talk speaker and a Doctoral Consortium panelist at Interspeech 2019, and a Tutorial speaker at Interspeech 2020. In addition, he won the ISCA Award for the Best Paper Published in Computer Speech & Language 2013-2017. He is a Senior Member of the IEEE.
Sun, 8 May, 16:00 - 16:45 France Time (UTC +2)
Sun, 8 May, 14:00 - 14:45 UTC
Sun, 8 May, 10:00 - 10:45 New York Time (UTC -4)
On the road to digital transformation (DX), technology trends and challenges are changing dynamically along with changes in people and society. For example, with the global spread of COVID-19, human beings have begun adopting new lifestyles for security and safety, and new styles of society for sustainability in daily life. In light of these changes, Dr. Liu will introduce an industrial-level framework in this talk from a signal processing perspective, describing how NEC is trying to sense and understand human behavior in the digital world toward the realization of a digital society.
This talk will mainly demonstrate a series of selected research achievements within our framework that have contributed to both academia and industry. The framework is composed of a series of cutting-edge technologies that sense data generated in the real world, transform them into readable, visible, and modellable digital forms, and finally analyze these digital data to understand human behavior. Such technologies include human sensing by traditional cameras [MM’14, MM’16], 360 cameras [MM’19, WACV’20, ICIP’21], and microwave sensors [MM’20]; action recognition [MM’19, WACV’20, MM’20, ICIP’21]; object tracking [MIPR’19, BigMM’19]; human-object interaction [CBMI’19]; scene recognition [MM’19]; behavioral pattern analysis [MM’16, ICMR’18, MIPR’19]; retrieval [MM’14, MM’16, MM’17, ICMR’18, CBMI’19, ICASSP’21]; and visualization [SIGGRAPH’16, ICMR’18], towards a full understanding of human behavior. These works will be introduced as a general overview with interactive technical demos and interesting insights into human behavior sensing and understanding, achieved by adopting effective processing techniques and designing efficient algorithms. Finally, Dr. Liu will pick out and share some challenging issues and directions for the realization of the digital society of the future.
Ref.: all references are technical works published by our research group at the corresponding international conferences.
Jianquan Liu is currently a principal researcher at the Biometrics Research Laboratories of NEC Corporation, working on the topics of multimedia data processing. He is also an adjunct assistant professor at the Graduate School of Science and Engineering, Hosei University, Japan. Prior to NEC, he was a development engineer at Tencent Inc. from 2005 to 2006, and was a visiting researcher at the Chinese University of Hong Kong in 2010. His research interests include high-dimensional similarity search, multimedia databases, web data mining and information retrieval, cloud storage and computing, and social network analysis. He has published 60+ papers at major international/domestic conferences and journals, received 20+ international/domestic awards, and filed 60+ PCT patents. He has also successfully transformed these technological contributions into commercial products in the industry. He is or has been serving as the Industry Co-chair of IEEE ICIP 2023; the General Co-chair of IEEE MIPR 2021; the PC Co-chair of IEEE ICME 2020, AIVR 2019, BigMM 2019, ISM 2018, ICSC 2018, ISM 2017, ICSC 2017, IRC 2017, and BigMM 2016; the Workshop Co-chair of IEEE AKIE 2018 and ICSC 2016; and the Demo Co-chair of IEEE MIPR 2019 and MIPR 2018. He is a member of ACM, IEEE, IEICE, IPSJ, APSIPA and the Database Society of Japan (DBSJ), a member of the expert committees for IEICE Mathematical Systems Science and its Applications and IEICE Data Engineering, and an associate editor of IEEE MultiMedia Magazine, ITE Transactions on Media Technology and Applications, and the Journal of Information Processing (JIP). Dr. Liu received the M.E. and Ph.D. degrees from the University of Tsukuba, Japan.
Mon, 9 May, 14:00 - 14:45 France Time (UTC +2)
Mon, 9 May, 12:00 - 12:45 UTC
Mon, 9 May, 08:00 - 08:45 New York Time (UTC -4)
Audio/video content transcription and associated activities form the predominant application market for ASR (1). As a world leader in the human-in-the-loop content transcription space, Rev is in a unique position to assess and understand the impact ASR can have on the productivity of transcriptionists whose job is to produce publishable transcripts. The usefulness of providing NLP output as a first draft for humans to post-edit was first discussed with respect to machine translation (Bar-Hillel 1951). More recently, ASR output has been provided as a first draft to increase transcriptionist productivity; its correlation with automatic transcript quality can be analyzed within Rev’s unique AI-powered transcription marketplace. Importantly, Papadopoulou et al. (2021) showed that the system with the lowest word error rate (WER) is not necessarily the system requiring the least transcriptionist effort.
We propose an analysis of the interaction between several metrics of ASR, diarization, punctuation and truecasing accuracy, and the productivity of our 50,000 Revvers transcribing more than 15,000 hours of media every week. We also examine the effects of noise conditions, audio/video content and errors that may impact transcriptionist quality of experience. For example, upon releasing OOV-recovery using subword models, the generation of occasional nonsense words aroused strong reactions that a bug had been introduced.
Because Rev owns both the AI and the human-centric marketplace, we have a unique advantage for studying the productivity and quality impact of model changes. In addition, we have enabled a virtuous circle whereby transcriptionist edits feed into improved models. Through our work, we hope to focus attention on the human productivity and quality of experience aspects of improvements in ASR and related technologies. Given both the predominance of content transcription applications, and the still elusive objective of perfect machine performance, keeping the human in the loop in both practice and mind is crucial.
(1.) “Speech-to-Text API Market: Global Forecast to 2026”, Markets and Markets, 2021. “Fraud Detection and Prevention” is the largest segment, but that is speaker verification. “Risk and Compliance Management” is the second largest, but that is essentially content transcription for a specific purpose. “Content Transcription” is the third largest but, combined with Risk and Compliance Management, forms the largest application space (over Customer Management, Contact Center Management, Subtitle Generation and Other Applications).
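For context, the WER discussed above is the standard word-level edit-distance metric; a minimal, self-contained sketch (this is the textbook computation, not Rev’s internal tooling):

```python
# Minimal word error rate (WER): Levenshtein distance over words,
# normalized by reference length. Substitutions, insertions and
# deletions each cost 1.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion out of six words
```

Note that WER counts all edits equally, which is precisely why it can diverge from post-editing effort: a nonsense word and a harmless contraction each cost one substitution, yet demand very different amounts of transcriptionist attention.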
As Head of AI at Rev, Miguel leads the research and development team responsible for building the world’s most accurate English speech-to-text model, deployed internally to increase transcriptionist’s productivity, and externally as an API powering some of the most innovative tech speech companies.
Prior to Rev, Miguel spent 15 years as a Speech Scientist building voice applications in the call center and voice assistant industry at Nuance Communications, and, later on, in the automotive industry at VoiceBox.
In a past life, Miguel earned a graduate degree in Mathematics at McGill University in Montreal, studying the mathematical properties and structure of phylogenetic trees.
Mon, 9 May, 15:00 - 15:45 France Time (UTC +2)
Mon, 9 May, 13:00 - 13:45 UTC
Mon, 9 May, 09:00 - 09:45 New York Time (UTC -4)
Recently we have seen a rapid rise in the amount of education data available through the digitization of education. This huge amount of education data usually takes the form of a mixture of images, videos, speech, text, etc. It is crucial to consider data from different modalities to build successful applications in AI in education (AIED). This talk targets AI researchers and practitioners who are interested in applying state-of-the-art multimodal machine learning techniques to tackle some of the hard-core AIED tasks, such as automatic short answer grading, student assessment, class quality assurance, and knowledge tracing.
In this talk, I will share some recent developments in successfully applying multimodal learning approaches in AIED, with a focus on classroom multimodal data. Beyond introducing recent advances in computer vision, speech, and natural language processing in education respectively, I will discuss how to combine data from different modalities and build AI-driven educational applications on top of these data. Participants will learn about recent trends and emerging challenges in this topic, representative tools and learning resources for obtaining ready-to-use models, and how related models and techniques benefit real-world AIED applications.
Zitao Liu is the Head of Engineering, Xueersi 1 on 1 at TAL Education Group (NYSE:TAL), one of the leading education and technology enterprises in China. His research is in the area of machine learning, and includes contributions in artificial intelligence in education, multimodal knowledge representation and user modeling. He has published his research in highly ranked conference proceedings such as NeurIPS, AAAI, WWW and AIED, serves on the executive committee of the International AI in Education Society, and serves as an organizer and program committee member for top-tier AI conferences and workshops. He won 1st place in the NeurIPS 2020 Education Challenge (Task 3), 1st place in the UbiComp 2020 time series classification challenge, 1st place in the CCL 2020 humor computation competition and 2nd place in the EMNLP 2020 ClariQ challenge. He is an ACM/CCF Distinguished Speaker and a recipient of the Beijing Nova Program 2020. Before joining TAL, Zitao was a senior research scientist at Pinterest and received his Ph.D. degree in Computer Science from the University of Pittsburgh.
Mon, 9 May, 16:00 - 16:45 France Time (UTC +2)
Mon, 9 May, 14:00 - 14:45 UTC
Mon, 9 May, 10:00 - 10:45 New York Time (UTC -4)
The age of voice computing promises to open up exciting new opportunities that once seemed possible only in science-fiction movies. But it also brings new challenges and implications for privacy and security, as well as accessibility and diversity. MyVoice AI specializes in speech technology that can run inference on "the edge" - on very small devices that are not connected to the cloud and operate on very low resources. The low-resource environments we work with involve deep neural networks that are optimized for ultra-low power and ultra-low memory and can run directly on a chip or battery-powered device. A familiar example of ultra-low power speech technology is the "wake-up" word detection used for Alexa, Siri, Google, etc. While wake word technology has already gone to market, there are many other opportunities for important speech technology innovation on the edge. MyVoice AI is developing several of these technologies, with a focus on speaker verification. Speaker verification at the edge means that user data is not transferred away from the device, adding to user privacy. This technology enables smart devices to respond only to authorized users, such as unlocking a car door or accessing personalized settings with a remote control.
This presentation will give an overview of speech signal processing applications on the edge and describe some of the engineering challenges that involve creating solutions that operate in ultra-low resource environments. We will discuss techniques for achieving state of the art performance despite using smaller and more compact neural networks. We will also highlight the need for new standards and testing protocols to be developed by the signal processing community for the purpose of streamlining innovation in this area. We will present our vision for how speech signal processing on the edge will transform everyday lives of consumers while enhancing privacy and accessibility.
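As a hedged sketch of how embedding-based speaker verification typically makes its accept/reject decision (the on-device embedding network is assumed and stubbed out here, and all names and the threshold are illustrative):

```python
import numpy as np

# Hypothetical sketch of the scoring step in embedding-based speaker
# verification: compare an enrollment embedding with a test embedding by
# cosine similarity, then threshold. A real system would extract these
# embeddings with a compact neural network running on-device.

def cosine_score(enrolled: np.ndarray, test: np.ndarray) -> float:
    return float(np.dot(enrolled, test) /
                 (np.linalg.norm(enrolled) * np.linalg.norm(test)))

def verify(enrolled: np.ndarray, test: np.ndarray, threshold: float = 0.7) -> bool:
    """Accept if the embeddings are similar enough; threshold is illustrative
    and would be tuned against target false-accept / false-reject rates."""
    return cosine_score(enrolled, test) >= threshold

# Toy stand-ins for network outputs: a perturbed copy mimics the same
# speaker; an independent random vector mimics a different speaker.
rng = np.random.default_rng(1)
enrolled = rng.normal(size=128)
same = enrolled + 0.1 * rng.normal(size=128)
other = rng.normal(size=128)
```

Because only the comparison (not the raw audio) is needed at decision time, this kind of scoring is well suited to staying entirely on-device, which is the privacy property described above.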
MyVoice AI is a privately held company and is a pioneer and leader in conversational AI. MyVoice AI is building the most secure end-to-end voice intelligence platform using advanced machine learning technologies. MyVoice AI licenses software and services to bring speaker verification to the edge, enabling a more seamless and privacy-enhanced authentication experience. We specialize in state-of-the-art deep neural network and deep learning techniques, delivering the world’s smallest footprint and power efficient training and inference engines. Our customers include financial institutions and edge AI embedded platform leaders.
Dr. Jennifer Williams is an internationally recognized innovator in speech technology. She has more than a decade of experience developing speech and text applications. She spent five years on staff at MIT Lincoln Laboratory as a US Department of Defense civilian contractor. She holds a PhD from the University of Edinburgh in data science with a specialization in speech processing using deep learning. She is a committee member of the ISCA PECRAC, and helps organize events for the ISCA speech privacy and security special interest group. Dr. Williams also serves as a reviewer for numerous speech and language conferences.
Tue, 10 May, 14:00 - 14:45 France Time (UTC +2)
Tue, 10 May, 12:00 - 12:45 UTC
Tue, 10 May, 08:00 - 08:45 New York Time (UTC -4)
The proposed presentation is essentially an overview of continual lifelong learning for (deep) neural network based machine learning systems, with a focus on conversational artificial intelligence.
It is structured in three parts: one, the evolution of continual lifelong learning in the context of deep learning and takeaways that inspire the present; two, the current state-of-the-art in continual lifelong learning; and three, the challenges with present techniques and where the current research to address those challenges is headed. Throughout all three parts, insights will be mapped to spoken language understanding and conversational AI systems in general.
Most machine learning based products today make copious use of deep learning based systems across a wide variety of domains and modes of signals - visual, aural, linguistic, gestural and spatial. However, most deployed systems are also typically trained on batches of stationary data, without accounting for possible changes in the incoming signal data, and thus suffer from regression or a certain degree of catastrophic forgetting upon encountering signal data drift. Continual lifelong learning reduces the effort of retraining such systems by endowing intelligent systems with generalization skills; current and future research in this field will therefore lead to such systems functioning with higher autonomy than they currently do.
In this talk, we will cover how this paradigm has evolved thus far, and how it applies to conversational AI - dialog systems in particular, which are based on spoken language understanding and typically involve aural and linguistic signals. We will also cover the current state of the art - systems that employ transfer learning and multimodal learning approaches toward achieving continual lifelong learning - and research from the last couple of years, which focuses on understanding the tradeoffs between generalization and alleviating catastrophic forgetting; characterizing the latter has led to novel neural network structures, which are yet to be effectively and widely productized. We will go over these approaches in the talk as well, placing particular emphasis on networks that employ regularization-based plasticity (to penalize forgetting of older signals) and experience replay (to integrate information from the past in periodic episodes during the training phase).
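As a rough, hypothetical illustration of the two mechanisms just mentioned (not drawn from any specific system in the talk), regularization-based plasticity can be sketched as a quadratic penalty on drifting away from previously learned weights, in the spirit of Elastic Weight Consolidation, and experience replay as a small reservoir-sampled buffer of past examples:

```python
import random
import numpy as np

# Illustrative sketches only; names, shapes and the importance weights
# are assumptions, not any production system's implementation.

def plasticity_penalty(params, old_params, importance, lam=1.0):
    """Quadratic penalty on drift from earlier-task parameters, weighted by
    a per-parameter importance estimate (e.g. a Fisher approximation)."""
    return lam / 2.0 * float(np.sum(importance * (params - old_params) ** 2))

class ReplayBuffer:
    """Reservoir-sampled store of past examples, replayed alongside new data
    during training to integrate information from earlier episodes."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            j = random.randrange(self.seen)   # keeps a uniform sample over history
            if j < self.capacity:
                self.data[j] = example

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))

# The penalty is zero at the old optimum and grows as parameters drift.
old = np.array([1.0, -2.0])
importance = np.array([0.5, 2.0])
assert plasticity_penalty(old, old, importance) == 0.0
```

The two mechanisms are complementary: the penalty discourages overwriting parameters important to old tasks, while the replay buffer periodically re-exposes the model to old data distributions.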
Lastly, we will cover some insights from recent research as to where current challenges lie and where the next few years will take us with respect to novel neural network architectures. This is of particular fascination and importance to the signal processing community - both in academia and industry - as we discover, improve and integrate more biological aspects of lifelong learning exhibited in mammalian brains, such as multisensory integration, into evolving artificial intelligent and autonomous systems.
Pooja received her Bachelor's degree in Telecommunications Engineering from Ramaiah Institute of Technology, Bangalore, India (2013) and her M.S. degree in Electrical Engineering from the University of Southern California (2015), where she specialized in vision-based robotics and machine learning. She has held research intern roles at the USC Institute for Creative Technologies, the USC Robotic Embedded Systems Laboratory, and Nvidia, where she worked on projects related to robot navigation and 3D scanning hardware. Since 2015, she has been a part of the AI/ML software industry in Silicon Valley, developing and delivering Conversational AI based products. As part of Knowles Intelligent Audio, she shipped keyword-spotting systems for consumer mobile phones utilizing Hidden Markov Models and Deep Neural Networks. As part of Cisco, she developed deep learning based keyword spotting systems for the Webex voice assistant deployed in enterprise meeting devices, and shipped call control and notification features for Cisco 730 series smart Bluetooth headsets. Currently, as a part of IBM Data and AI, she is working at the intersection of speech recognition, natural language understanding and software to help bring intelligent conversational experiences to Quick Service Restaurants like McDonald's drive-thrus. Outside of her day job, Pooja is a part of the World Economic Forum's Global Shapers initiative, and works with the Palo Alto chapter to help the local community prepare for ramifications of the fourth industrial revolution, currently focusing on social impact in education through bridging the digital divide.
Tue, 10 May, 15:00 - 15:45 France Time (UTC +2)
Tue, 10 May, 13:00 - 13:45 UTC
Tue, 10 May, 09:00 - 09:45 New York Time (UTC -4)
Human-centric applications like automated screening of chronic cardiovascular diseases (CVDs) for remote and primary care are of immense global importance. Atrial Fibrillation (AF) is the most common sustained cardiac arrhythmia and is associated with significant mortality and morbidity. The benchmark deep learning-based approach from Andrew Ng's team at Stanford [1] and our own domain-knowledge-augmented, signal processing features based machine learning approach (global winner of the PhysioNet Challenge 2017) [2] have performed robustly at near expert-level classification for AF detection from single-lead electrocardiogram (ECG). To create real impact, these models need to run on wearable and implantable edge devices like smartwatches, smart bands and implantable loop recorders (ILRs) for short-time and long-time ECG screening. However, such models are computationally expensive, with more than 100 MB model memory size, whereas these edge devices are run by tiny low-power microcontroller units, often limited to sub-MB memory and a strict power budget due to limited battery life. This necessitates elegant model size reduction approaches that do not penalize performance. In our earlier work [3], it is shown that a Knowledge Distillation (KD)-based piecewise linear approach is capable of trimming down the memory requirement of the DL model by nearly 150x, with more than 5000x reduction in computational complexity (and hence power consumption) for a simple ECG analytics task, with less than 1% loss in inferencing accuracy. However, for such approaches there is a severe performance degradation issue in more complex analytics tasks (like AF detection) that use a sophisticated base DL model like [1] - the inferencing accuracy loss can exceed 20%.
We introduce a more powerful iterative pruning-based model size reduction approach, based on the Lottery Ticket Hypothesis (LTH), that starts from a complex DL model like [1] and can elegantly prune it to find a sub-network that is compact in parameter space (approx. 100x reduction) yet incurs less than 1% loss in inferencing accuracy [4]. In future, we plan to extend this work via hybrid KD-LTH approaches for a spectrum of human-centric applications on tiny edge devices like wearables and implantables - some of the applications we are working on include other cardiac, musculoskeletal and neurological disorders, geriatric care, human behavior sensing for neuromarketing, and Brain-Computer Interface/Human-Robot Interaction (BCI/HRI).
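To illustrate the general mechanism (a simplified sketch, not the LTH-ECG implementation described above), iterative magnitude pruning repeatedly trains, removes the smallest-magnitude surviving weights, and rewinds the survivors to their initial values; the training step below is stubbed out with an identity function:

```python
import numpy as np

# Simplified sketch of iterative magnitude pruning in the spirit of the
# Lottery Ticket Hypothesis. The model is a bare weight vector and
# training is stubbed out; only the prune-and-rewind loop is shown.

def prune_step(weights, mask, frac=0.2):
    """Drop a fraction `frac` of the smallest-magnitude surviving weights."""
    surviving = np.abs(weights[mask])
    k = int(frac * surviving.size)
    if k == 0:
        return mask
    threshold = np.sort(surviving)[k - 1]
    return mask & (np.abs(weights) > threshold)

def lottery_ticket(init_weights, train_fn, rounds=5, frac=0.2):
    """Iteratively train the masked network, prune, and rewind the surviving
    weights to their original initialization before the next round."""
    mask = np.ones_like(init_weights, dtype=bool)
    for _ in range(rounds):
        trained = train_fn(init_weights * mask)   # train the current sub-network
        mask = prune_step(trained, mask, frac)    # prune by trained magnitude
        # rewind: next round restarts from init_weights under the new mask
    return mask

rng = np.random.default_rng(0)
w0 = rng.normal(size=1000)
mask = lottery_ticket(w0, train_fn=lambda w: w, rounds=10, frac=0.2)
print(f"surviving weights: {mask.sum()} / {mask.size}")
```

Pruning roughly 20% of survivors per round compounds quickly, which is how such schemes reach the ~100x parameter reductions quoted above while keeping a trainable sub-network.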
References: [1] Andrew Ng et al., “Cardiologist-Level Arrhythmia Detection and Classification in Ambulatory Electrocardiograms Using a Deep Neural Network,” Nature Medicine, 2019. [2] A. Pal et al., “Detection of atrial fibrillation and other abnormal rhythms from ECG using a multi-layer classifier architecture,” Physiological Measurement, June 2019. [3] A. Pal et al., “Resource Constrained CVD Classification Using Single Lead ECG On Wearable and Implantable Devices,” IEEE EMBC 2021. [4] A. Pal et al., “LTH-ECG: Lottery Ticket Hypothesis-based Deep Learning Model Compression for Atrial Fibrillation Detection from Single Lead ECG On Wearable and Implantable Devices,” IEEE EMBC 2022, submitted.
Arpan Pal has more than 29 years of experience in the areas of Intelligent Sensing, Signal Processing & AI, Edge Computing and Affective Computing. Currently, as Chief Scientist and Research Area Head, Embedded Devices and Intelligent Systems, TCS Research, he is working in the areas of Connected Health, Smart Manufacturing and Remote Sensing. He is on the editorial board of notable journals like ACM Transactions on Embedded Systems and the Springer Nature Journal on Computer Science, and is on the TPC of notable conferences like ICASSP and EUSIPCO. He has filed 165+ patents (out of which 85+ are granted in different geographies) and has published 140+ papers and book chapters in reputed conferences and journals. He has also authored two books - one on IoT and another on Digital Twins in Manufacturing. He is on the governing/review/advisory boards of Indian Government organizations like CSIR and MeitY, educational institutions like IIT and IIIT, and technology incubation centers like TIH. Prior to joining Tata Consultancy Services (TCS), Arpan worked for DRDO, India as a Scientist for Missile Seeker Systems, and at Rebeca Technologies (erstwhile Macmet Interactive Technologies) as their Head of Real-time Systems. He holds B.Tech and M.Tech degrees from IIT Kharagpur, India, and a PhD from Aalborg University, Denmark. Home Page - https://www.tcs.com/embedded-devices-intelligent-systems LinkedIn - http://in.linkedin.com/in/arpanpal Google Scholar - http://scholar.google.co.in/citations?user=hkKS-xsAAAAJ&hl=en Orcid - https://orcid.org/0000-0001-9101-8051
Tue, 10 May, 16:00 - 16:45 France Time (UTC +2)
Tue, 10 May, 14:00 - 14:45 UTC
Tue, 10 May, 10:00 - 10:45 New York Time (UTC -4)
Voice assistants are becoming ubiquitous. In emerging contexts such as healthcare and finance, where the user base is broader than the market for consumer devices, poor performance on diverse populations will deter adoption. Models developed on ‘standard’ demographic datasets are not sufficiently robust to variability in human language production and behaviour. Evaluation of core (and evolving) assistant performance is ad hoc and intermittent, providing insufficient insight to focus and prioritize the case for engineering investment in areas of demographic underperformance.
We propose a multi-dimensional, quarterly benchmark to evaluate the evolution of voice assistant performance in both standard and diverse populations. Population dimensions include:
- age & gender
- regional dialects, ethnolects and regional sublects, e.g., of African American Language
- multilingualism, second-language & foreign-language accents
- intersections of all of the above
Environmental dimensions include:
- noise levels & background speech
- distance from mic
- indoor/outdoor setting
- device hardware specifications & device type (wearable, in-car, smart speaker, etc)
Relevance & attractiveness to ICASSP
Improving human-centric signal processing requires a re-orientation to the variability of human language and behavioural signals. We can design better models by comprehensively anticipating the variability. Better post-production evaluation methods such as the proposed benchmark can help developers understand how their models will enable or deter standard and diverse users from interacting with their products.
Our benchmark is grounded in extensive experience of factors impacting signal variability in human language technology. We propose to cover core existing and emerging performance dimensions – Accuracy, Agreeableness, Adaptability and Acceleration – and core input dimensions – Skills (System Actions or Tasks), Environments and Demographics. All dimensions will be evaluated at a regular cadence to capture the impact of model upgrades and evolution of user behaviours.
Benchmark reports will combine human ratings, computational linguistic analysis and automated metrics to provide rich, actionable insights into the key factors impacting system performance. The intersection of linguistic variation, skill/task, and environment can impact performance in unforeseen ways. Two specific cases will be discussed: the elderly and African American Language speakers.
Inspirations and motivations
Appen provides a wide range of HCI evaluation and training data services. We have observed that evaluation of voice assistant models rarely reflects the complexity of real-world deployment. Appen’s proposed benchmark will provide insightful feedback to support model tuning for better real-world performance.
Ilia Shifrin is Senior Director of AI Specialists at Appen. He oversees a team of 70 distinguished data and language professionals who enable global NLP solutions and provide enterprise-scale multilingual and multimodal data collection, data annotation, and AI evaluation services to the world's largest corporations. Ilia Shifrin is an avid data researcher and a perfectionist in localization and AI personalization, with over 15 years of leadership and hands-on R&D experience in language and data engineering.
David Brudenell is VP of Solutions & Advanced Research at Appen. As Vice President, David works with many of the most accomplished and deeply knowledgeable solution architects, engineers, project managers, technical and AI specialists in the machine learning and artificial intelligence industries.
Dr MingKuan Liu is Senior Director of Data Science & Machine Learning at Appen. MingKuan has decades of industry R&D expertise in speech recognition, natural language processing, search & recommendation, fraud activity detection, and e-Commerce areas.
Dr. Judith Bishop is Chair of the External Advisory Board for the MARCS Institute for Brain, Behaviour and Development. For over 17 years, Judith has led global teams delivering AI training data and evaluation products to global multinational, government, enterprise, and academic technology developers.
Wed, 11 May, 14:00 - 14:45 France Time (UTC +2)
Wed, 11 May, 12:00 - 12:45 UTC
Wed, 11 May, 08:00 - 08:45 New York Time (UTC -4)
Future communication networks must address spectrum scarcity to accommodate the extensive growth of heterogeneous wireless devices. Efforts are underway to address spectrum coexistence, enhance spectrum awareness, and bolster authentication schemes. Wireless signal recognition is becoming increasingly significant for spectrum monitoring, spectrum management, and secure communications, among other uses. Along with spectrum crunch and throughput challenges, such a massive scale of wireless devices exposes unprecedented threat surfaces. RF fingerprinting is heralded as a candidate technology that can be combined with cryptographic and zero-trust security measures to ensure data privacy, confidentiality, and integrity in wireless networks. Consequently, comprehensive spectrum awareness on the edge has the potential to serve as a key enabler for emerging beyond-5G networks.
Signal Intelligence (SIGINT) refers to techniques that characterize unknown RF signals, providing actionable information to the remaining components of the communication system. State-of-the-art studies in this domain (i) focus only on a single task - modulation, signal (protocol), or emitter classification - which in many cases yields insufficient information for a system to act on, and (ii) do not address edge deployment during the neural network design phase. Motivated by the relevance of this subject in the context of advanced signal processing for future communication networks, in this talk I will discuss some of the recent work performed to overcome these challenges and the related findings. Next, we describe some of the active areas of research in the domain to motivate several of the open problems and opportunities. Thereafter, we pave a path forward, providing research directions to enhance these capabilities over the next few years. ICASSP, being the flagship conference of the IEEE Signal Processing Society, is the perfect venue to present these ideas and to motivate researchers to contribute to this exciting field of research and development.
Dr. Jithin Jagannath is the Chief Technology Scientist and Founding Director of the Marconi-Rosenblatt AI/ML Innovation Lab at ANDRO Computational Solutions. He is also an Adjunct Assistant Professor in the Department of Electrical Engineering at the University at Buffalo, State University of New York. Dr. Jagannath received his Ph.D. degree in Electrical Engineering from Northeastern University. He is an IEEE Senior Member and serves as an IEEE Industry DSP Technology Standing Committee member. He also serves on the Federal Communications Commission's (FCC) Communications Security, Reliability, and Interoperability Council (CSRIC VIII) Working Group 1. Dr. Jagannath was the recipient of the 2021 IEEE Region 1 Technological Innovation Award with the citation, "For innovative contributions in machine learning techniques for the wireless domain".
Dr. Jagannath heads several of ANDRO's research and development projects in the fields of beyond-5G, signal processing, RF signal intelligence, cognitive radio, cross-layer ad-hoc networks, Internet-of-Things, AI-enabled wireless, and machine learning. He has been the Technical Lead and Principal Investigator (PI) of several multi-million-dollar research projects at ANDRO, including a Rapid Innovation Fund (RIF) and several Small Business Innovation Research (SBIR) awards for customers including the U.S. Army, U.S. Navy, USSOCOM, and the Department of Homeland Security (DHS). He is the inventor of 11 U.S. patents (granted, pending, and provisional). He has been invited to give various talks, including keynotes, on machine learning and beyond-5G wireless communication, and has been invited to serve on the Technical Program Committees of several leading technical conferences.
Wed, 11 May, 15:00 - 15:45 France Time (UTC +2)
Wed, 11 May, 13:00 - 13:45 UTC
Wed, 11 May, 09:00 - 09:45 New York Time (UTC -4)
For coverage and capacity optimization, Uplink Power Control is one of the key steps, in addition to antenna tilting and Downlink Power Control. For self-organizing networks, automated algorithms for Uplink Power Control are a necessity. However, Uplink Power Control affects the noise in neighboring cells, so it is important to detect this interference to monitor uplink noise. Uplink noise due to power control manifests as static interference in the channel. The current state-of-the-art baseline is based on regression models. We propose automated detection of static interference in the uplink channel of cells using machine learning models, and we have evaluated it on customer data from LTE networks with high accuracy. The detected cells are subsequently used to correct the Nominal power parameter through a proposed teacher-student model based on the primary cell and its neighbors. This approach shows better performance than the state-of-the-art baseline methods.
The dual of static interference is dynamic interference, which can be attributed to traffic load, Passive Intermodulation (PIM), thermal noise, etc. PIM identification is a major component in troubleshooting modern wireless communication systems, and the introduction of carrier aggregation has increased PIM occurrences. Current state-of-the-art approaches rely on manual rule-based and hardware-based debugging; they detect PIM long after the event has occurred and incur incidental costs. We propose an ensemble of time-series-based machine learning and signal processing approaches that can automatically identify PIM in real time by analyzing Key Performance Indicators (KPIs) of the primary cell and its nearest neighbors. We validate our results under various environmental conditions on data from LTE and 5G consumer networks. We have further extended the work to multi-frequency time series to handle finer time granularities and detect PIM anomalies in an online-learning setting. We also propose a self-supervised reinforcement learning approach to predict PIM-related anomalies before they happen: we forecast the environmental conditions that give rise to PIM from offline historical data and use that model to predict future occurrences. Experimental results on real-world datasets comprising 50,000+ cells show that PIM is accurately predicted 60% of the time. To the best of our knowledge, this is the first work able to predict PIM anomalies before they happen.
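As a toy illustration of KPI-based anomaly flagging (far simpler than the ensemble described in the talk), a rolling z-score check over an uplink-noise KPI series might look like this; the window, threshold, and data are invented for the example.

```python
# Toy illustration only: a rolling z-score detector over an uplink-noise KPI
# series; the window, threshold, and data are invented, and this is far
# simpler than the ensemble approach described in the talk.
import numpy as np

def zscore_anomalies(kpi, window=12, threshold=3.0):
    """Flag samples deviating more than `threshold` sigmas from a trailing window."""
    kpi = np.asarray(kpi, dtype=float)
    flags = np.zeros(len(kpi), dtype=bool)
    for t in range(window, len(kpi)):
        ref = kpi[t - window:t]
        mu, sigma = ref.mean(), ref.std()
        if sigma > 0 and abs(kpi[t] - mu) > threshold * sigma:
            flags[t] = True
    return flags

# Flat noise floor (dBm-like) with one PIM-style jump at t=20:
series = [-100.0 + 0.1 * (-1) ** i for i in range(30)]
series[20] = -70.0
flags = zscore_anomalies(series)
```

A real deployment would of course operate on multiple KPIs jointly and across neighboring cells, as the abstract describes.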
After PIM identification, we propose a binary-search-based solution that is amenable to real-time implementation. We show through simulations that this search, in tandem with a reinforcement learning based solution, can dynamically mitigate and cancel PIM. Results show that the number of steps needed to identify and mitigate PIM in the uplink frequency is reduced by a large factor.
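The binary-search idea can be sketched as follows, assuming a made-up monotone "channels above some unknown edge are PIM-free" structure; the reinforcement learning component that the talk pairs with the search is omitted here.

```python
# Hedged illustration: the talk pairs a binary search with reinforcement
# learning; shown here is only the search skeleton, over a made-up monotone
# "channels above some unknown edge are PIM-free" predicate.

def find_pim_edge(is_clean, lo, hi):
    """Smallest index in [lo, hi] for which is_clean(index) is True,
    assuming is_clean is monotone (False ... False True ... True)."""
    while lo < hi:
        mid = (lo + hi) // 2
        if is_clean(mid):
            hi = mid          # the edge is at mid or below
        else:
            lo = mid + 1      # the edge is above mid
    return lo

# Simulated spectrum: PIM pollutes channels 0-41, channels 42+ are clean.
edge = find_pim_edge(lambda ch: ch >= 42, 0, 99)
```

The appeal is the logarithmic step count: roughly log2(100) ≈ 7 probes instead of a linear sweep over 100 channels, consistent with the claimed reduction "by a large factor".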
To summarize, our contributions include using machine learning algorithms for: (1) robust interference classification, (2) demonstrating a p0-nominal recommender as a teacher-student model, (3) time-series-analysis-based PIM identification, (4) extending the approach to multi-frequency time series and online learning, (5) demonstrating a self-supervised reinforcement learning approach to predict PIM anomalies before they happen, and (6) mitigating PIM, in spite of environmental unknowns, by employing binary search in conjunction with ML/RL-based approaches.
Serene Banerjee has 17+ years of industrial experience following her Ph.D., completed in 2004 at The University of Texas at Austin under Prof. Brian L. Evans. She received a B. Tech. (H) in Electronics and Electrical Communications Engineering from IIT Kharagpur in 1999. At Ericsson she focuses on developing AI/ML algorithms for Radio Access Networks. Prior to Ericsson she was with Texas Instruments, HP, and Johnson Controls.
Wed, 11 May, 16:00 - 16:45 France Time (UTC +2)
Wed, 11 May, 14:00 - 14:45 UTC
Wed, 11 May, 10:00 - 10:45 New York Time (UTC -4)
1. Overview and technical contents
Despite achieving high standard accuracy in a variety of machine learning tasks, deep learning models built upon neural networks have recently been identified as having the issue of lacking adversarial robustness. The decision-making of well-trained deep learning models can be easily falsified and manipulated, resulting in ever-increasing concerns in safety-critical and security-sensitive applications requiring certified robustness and guaranteed reliability. In recent years, there has been a surge of interest in understanding and strengthening adversarial robustness of an AI model in different phases of its life cycle, including data collection, model training, model deployment (inference), and system-level (software+hardware) vulnerabilities, giving rise to different robustness factors and threat assessment schemes.
This presentation will provide an overview of recent advances in the research of adversarial robustness and industrial perspectives, featuring both comprehensive research topics and technical depth. We will cover three fundamental pillars in adversarial robustness: attack, defense, and verification. Attack refers to efficient generation of adversarial examples or poisoned data samples for robustness assessment under different attack assumptions (e.g., white-box v.s. black-box attacks, prediction-evasion v.s. model stealing). Defense refers to adversary detection and robust training algorithms to enhance model robustness. Verification refers to attack-agnostic metrics and certification algorithms for proper evaluation of adversarial robustness and standardization. For each pillar, we will emphasize the tight connection between signal processing techniques and adversarial robustness, ranging from fundamental techniques such as first-order and zero-order optimization, minimax optimization, geometric analysis, model compression, data filtering and quantization, subspace analysis, active sampling, frequency component analysis to specific applications such as computer vision, automatic speech recognition, natural language processing, and data regression. Furthermore, we will also cover new applications originating from adversarial robustness research, such as data-efficient transfer learning and model watermarking and fingerprinting.
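As a minimal example of the "attack" pillar, the fast-gradient-sign idea can be shown on a toy logistic-regression model with an analytic input gradient; the weights, input, and epsilon below are invented purely for illustration.

```python
# Minimal white-box attack sketch in the FGSM style, on a toy logistic
# "model" with an analytic input gradient; weights, input, and epsilon are
# invented for illustration and unrelated to any real system.
import numpy as np

def fgsm(x, w, b, y, eps):
    """One-step L-infinity attack: x + eps * sign(d loss / d x), logistic loss."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # predicted P(class = 1)
    grad_x = (p - y) * w                    # d cross-entropy / d x
    return x + eps * np.sign(grad_x)

w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 0.5])                    # logit w @ x + b = 1.5 > 0: class 1
x_adv = fgsm(x, w, b, y=1.0, eps=0.9)       # perturbation drives the logit down
```

The same first-order principle, scaled up to deep networks via automatic differentiation, underlies many of the white-box attacks covered in the presentation.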
2. Relevance and attractiveness to ICASSP
Many of the contents in adversarial robustness are related to signal processing methods and techniques, such as (adversary) detection, sparse signal processing, data recovery, and robust machine learning and signal processing. The presentation covers both advanced research topics and libraries (e.g., the IBM Adversarial Robustness Toolbox), which is suitable for ICASSP attendees including researchers and practitioners.
3. Novelty, inspirations, and motivations to the audience
a) Help the audience quickly grasp the research progress and existing tools in the fast-growing field of adversarial robustness
b) Provide gateways for interested researchers with signal processing backgrounds to contribute to this research field and expand the impact of signal processing
c) Create synergies between signal processing and adversarial robustness to identify and solve challenging tasks in adversarial robustness and deep learning
d) Offer unique perspectives from industrial researchers studying trustworthy machine learning
4. References (videos on this topic)
a) IBM Research YouTube: https://youtu.be/9B2jKXGUZtc
b) MLSS 2021: https://youtu.be/rrQi86VQiuc
c) CVPR 2021 tutorial: https://youtu.be/ZmkU1YO4X7U
Dr. Pin-Yu Chen is a research staff member at IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. He is also the chief scientist of the RPI-IBM AI Research Collaboration and PI of ongoing MIT-IBM Watson AI Lab projects. Dr. Chen received his Ph.D. degree in electrical engineering and computer science from the University of Michigan, Ann Arbor, USA, in 2016. Dr. Chen's recent research focuses on adversarial machine learning and robustness of neural networks. His long-term research vision is building trustworthy machine learning systems. At IBM Research, he received the honor of IBM Master Inventor and several research accomplishment awards, including an IBM Corporate Technical Award in 2021. His research contributes to IBM open-source libraries including the Adversarial Robustness Toolbox (ART 360) and AI Explainability 360 (AIX 360). He has published more than 40 papers related to trustworthy machine learning at major AI and machine learning conferences, given tutorials at AAAI'22, IJCAI'21, CVPR('20,'21), ECCV'20, ICASSP'20, KDD'19, and Big Data'18, and organized several workshops on adversarial machine learning. He received a NeurIPS 2017 Best Reviewer Award, and was also the recipient of the IEEE GLOBECOM 2010 GOLD Best Paper Award. More details can be found at his personal website: www.pinyuchen.com
Thu, 12 May, 14:00 - 14:45 France Time (UTC +2)
Thu, 12 May, 12:00 - 12:45 UTC
Thu, 12 May, 08:00 - 08:45 New York Time (UTC -4)
Astro is Amazon's household robot, released in September 2021. It is designed for home security monitoring, remote elder care, video calls, photo/selfie taking, music playback, and all tasks performed by a regular Amazon Echo device; it is connected to the cloud and has access to all its resources. Its two wheels enable mobility, allowing it to follow a user with entertainment or to deliver calls, messages, timers, alarms, or reminders. Other capabilities include floor plan mapping, face recognition, high-fidelity audio reproduction, acoustic event recognition, and finding the directions of sound sources.
Sophisticated hardware components are incorporated into Astro to support advanced functionalities; these include multiple cameras and sensors for obstacle avoidance, navigation, and stereo depth. An eight-microphone array is incorporated for audio capture. Acoustic events in the vicinity are monitored, with their directions estimated using sound source localization (SSL) algorithms. By knowing the directions of acoustic events, Astro can formulate the most suitable response depending on the circumstances; for instance, when a user utters the wake-word "Astro", the robot rotates to face the user, based on the estimated direction of the wake-word.
Direction finding for sound sources is challenging for Astro because of the conditions found in typical indoor environments: reflection, reverberation, acoustic interference, multiple sources, etc.; other factors that further complicate development are obstruction and motion. In this talk we describe the design choices for SSL in Astro. We start with the microphone array, followed by the SSL algorithms; we then delve into reflection rejection, noise and interference suppression, performance measurement criteria, and the test framework. We further describe the role of the SSL module within the larger system known as the audio front end (AFE), and its interactions with other modules such as the wake-word detector and floor plan mapping.
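One common textbook building block for such SSL systems (not necessarily Astro's actual algorithm) is GCC-PHAT time-delay estimation between a microphone pair; a minimal sketch on a synthetic broadband signal:

```python
# Generic textbook sketch of GCC-PHAT time-delay estimation between two
# microphone signals, one common SSL building block; this is not claimed to
# be Astro's actual algorithm, and the test signal is synthetic.
import numpy as np

def gcc_phat_delay(sig, ref):
    """Estimate the delay (in samples) of sig relative to ref via GCC-PHAT."""
    n = len(sig) + len(ref)
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.maximum(np.abs(R), 1e-12)        # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return int(np.argmax(np.abs(cc))) - max_shift

rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)                # broadband source at mic 1
sig = np.concatenate([np.zeros(5), ref[:-5]])  # same source, 5 samples later
delay = gcc_phat_delay(sig, ref)
```

Pairwise delays like this, combined across an eight-microphone array geometry, are what let a direction of arrival be triangulated; the PHAT weighting is popular precisely because it is relatively robust to the reverberation mentioned above.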
Reference: Chu et al., Multichannel Audio Front End for Far-field Speech Recognition, EUSIPCO 2018.
Wai C. Chu is a Principal Scientist in the Audio Technology Team at Amazon Lab126. He joined Amazon in 2010 and has been involved with algorithm design and software implementation for several projects requiring audio and speech processing. He is the system and software co-architect for audio front end (AFE) processing in Echo (Doppler), and has designed algorithms for acoustic echo cancellation, dereverberation, noise reduction, packet loss concealment, residual echo suppression, etc. He has served as technical leader for Alexa voice communication since December 2015. Besides developing on-device software for speech processing and deploying it successfully to millions of Echo devices, he also designed cloud-based speech enhancement solutions for messaging. Currently his main focus is sound source localization and speech quality assessment. His primary programming language is C++, with exposure to Java and Python. Prior to joining Amazon he held engineering positions in corporations such as Texas Instruments and NTT DoCoMo, and in start-ups such as Intervideo and Shotspotter. He received a PhD in Electrical Engineering from the Pennsylvania State University. He has an extensive publication record with 1600 citations in Google Scholar, is a regular reviewer for various conferences and journals, is the author of the textbook "Speech Coding Algorithms" (Wiley 2003), and holds 30 US patents.
Google Scholar: https://scholar.google.com/citations?user=itLWaaYAAAAJ&hl=en&oi=ao
Research Gate: https://www.researchgate.net/profile/Wai-Chu-7
Amazon Author: https://www.amazon.com/Wai-C.-Chu/e/B001HMKLK4%3Fref=dbs_a_mng_rwt_scns_share
Publons: https://publons.com/researcher/1509783/wai-c-chu/
Thu, 12 May, 15:00 - 15:45 France Time (UTC +2)
Thu, 12 May, 13:00 - 13:45 UTC
Thu, 12 May, 09:00 - 09:45 New York Time (UTC -4)
Sound recognition is a prominent field of machine learning that has penetrated our everyday lives. It is already in active use in millions of smart homes, and on millions of smartphones and smart speakers.
Temporal event detection applications such as sound event detection (SED) or keyword spotting (KWS), often aim for low-latency and low-power operation in a broad range of constrained devices without compromising the quality of performance.
In these temporal event detection problems, models are often designed and optimized to estimate instantaneous probabilities and rely on ‘ad-hoc decision post-processing’ to determine the occurrence of events. Within the constraints of commercial deployment, the impact of post-processing on the final performance of the product can be significant, sometimes reducing errors by orders of magnitude. However, these constraints are often ignored in academic challenges and publications, therefore the focus is mainly on improving the performance of the ML models. Furthermore, both in academia and industry, the design and optimization of the ML models often disregards the effect of such decision post-processing, hence potentially leading to suboptimal performance and products.
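A minimal sketch of what such decision post-processing can look like (the median window and hysteresis thresholds here are invented for the example, not any product's actual parameters):

```python
# Toy sketch of decision post-processing for temporal event detection: median
# smoothing plus hysteresis thresholds. Window size and thresholds are
# invented for the example, not any product's actual parameters.

def smooth(probs, k=3):
    """Median filter with odd window k, edge-padded."""
    h = k // 2
    padded = [probs[0]] * h + list(probs) + [probs[-1]] * h
    return [sorted(padded[i:i + k])[h] for i in range(len(probs))]

def events(probs, on=0.6, off=0.4):
    """Turn per-frame probabilities into (start, end) event intervals."""
    out, active, start = [], False, 0
    for i, p in enumerate(probs):
        if not active and p >= on:
            active, start = True, i
        elif active and p < off:
            out.append((start, i))
            active = False
    if active:
        out.append((start, len(probs)))
    return out

raw = [0.1, 0.9, 0.2, 0.9, 0.8, 0.9, 0.1, 0.1]
split = events(raw)                # raw probabilities: one event split in two
merged = events(smooth(raw))       # smoothed: a single coherent event
```

Even this crude post-processing halves the event count on the toy input, which is exactly the kind of error reduction the talk argues is routinely left on the table when only the model is optimized.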
In this talk, I aim to demonstrate the importance of decision post-processing in temporal ML problems and help to bring it to the attention of the broad research community so that more optimized solutions can be realized.
Dr Çağdaş Bilen gained his PhD from NYU Tandon School of Engineering and went on to do postdoctoral work at Strasbourg University, Technicolor and INRIA. He has also worked in other research labs such as AT&T (Bell) Labs and HP Labs before joining Audio Analytic in 2018.
He has a keen interest in a greater sense of hearing.
Dr Bilen has authored articles in highly respected international journals and conferences and holds numerous patents on the topics of audio and multimedia signal representation, estimation and modelling. These include topics such as audio inverse problems (audio inpainting, source separation and audio compression) using nonnegative matrix factorization and on fast image search algorithms with sparsity and deep learning.
“My role at Audio Analytic allows me the opportunity to apply my passion for signal processing and machine learning and to explore how a greater sense of hearing can re-shape the way that humans and machines interact.”
Çağdaş leads Audio Analytic's respected research team in developing core technologies and tools that can further advance the field of machine listening. This cutting-edge work has led to a number of significant technical breakthroughs and patents, such as loss function frameworks, post-biasing technology, a powerful temporal decision engine, and an approach to model evaluation called Polyphonic Sound Detection Score (PSDS), which has been adopted as an industry-standard metric by the DCASE Challenge.
Thu, 12 May, 16:00 - 16:45 France Time (UTC +2)
Thu, 12 May, 14:00 - 14:45 UTC
Thu, 12 May, 10:00 - 10:45 New York Time (UTC -4)
Speech interaction systems have been widely used in smart devices, but due to the low accuracy of speech recognition in noisy scenes and of natural language understanding in complicated scenes, human-machine interaction is far from natural compared to human-human interaction. For example, speech interaction systems still rely on speech wake-up to initiate interaction. Moreover, when two or more people speak simultaneously, or when people interact with the machine in a noisy scene, the performance of speech recognition is still poor, resulting in a poor user experience. This proposal presents a multimodal speech interaction solution for automotive scenarios. The main speaker is detected using lip movement to assist the speech signal. At the same time, an end-to-end speech-to-intent model is used to distinguish whether the main speaker is speaking to the machine or to a passenger in the car. Through this solution, a natural interaction that does not depend on speech wake-up is implemented; the success rate of interaction is comparable to that of traditional speech recognition at a false wake-up rate of 0.6 times in 24 hours. In addition, a multimodal, multi-channel speech separation technology is proposed: the speech, the spatial information provided by the microphone array, and the lip movement of the main speaker are input into a CLDNN network to learn the mask of the main speaker relative to the mixed speech. Through this method, the main speaker's speech is enhanced, and the background noise and interfering speakers' speech are suppressed. In scenes with multiple people talking at the same time and background music, the gain in speech recognition accuracy is more than 30%. This proposal implements a new way of speech interaction: a more reliable, natural, and robust human-computer interaction achieved through the combination of speech and vision.
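The mask-application step can be illustrated with an oracle ratio mask standing in for the CLDNN's predicted mask; the toy 2x3 "spectrograms" below are invented magnitudes, purely to show how a learned mask recovers the main speaker from the mixture.

```python
# Stand-in sketch for the mask-based enhancement step: an oracle ratio mask
# plays the role of the CLDNN-predicted mask, and the 2x3 "spectrograms" are
# invented magnitudes, purely to show how the mask is applied.
import numpy as np

def apply_mask(mix_mag, mask):
    """Enhanced main-speaker magnitudes = element-wise mask * mixture."""
    return mask * mix_mag

# Toy time-frequency magnitudes for the target speaker and the interference:
target = np.array([[1.0, 0.0, 2.0], [0.5, 3.0, 0.0]])
noise = np.array([[0.1, 1.0, 0.2], [0.5, 0.3, 2.0]])
mask = target / (target + noise + 1e-12)     # oracle ratio mask in [0, 1]
enhanced = apply_mask(target + noise, mask)  # recovers approximately `target`
```

In the actual system the mask is of course predicted, not oracle; the network infers it from the mixture, the array's spatial cues, and the main speaker's lip movement.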
The solution has been widely used by Chinese car manufacturers, and can be extended to smart homes, robots and other fields.
The author graduated from the University of Science and Technology of China with a Ph.D. degree and is a senior engineer. He is currently the deputy dean of the iFLYTEK AI Research Institute, responsible for the research and development of intelligent voice technology. He led the team that completed the research and development of the second- and third-generation speech recognition systems of iFLYTEK, which greatly improved the performance of speech recognition in complicated scenes. He successfully developed a series of speech transcription products, represented by iFLYREC, which provides an online speech transcription service. The speech interaction system he developed is widely used in automotive, smart home, and other fields. In recent years, more than 40 of his patents have been granted or published.
Fri, 13 May, 14:00 - 14:45 France Time (UTC +2)
Fri, 13 May, 12:00 - 12:45 UTC
Fri, 13 May, 08:00 - 08:45 New York Time (UTC -4)
In recent years, the multi-armed bandit (MAB) framework has attracted a lot of attention in various applications, from signal processing and information retrieval to healthcare and finance, due to its stellar performance combined with certain attractive properties, such as learning from limited feedback. The multi-armed bandit field is currently flourishing, as novel problem settings and algorithms motivated by various practical applications are being introduced on top of the classical bandit problem. This presentation aims to provide a comprehensive review of the top recent developments in multiple signal processing applications of the multi-armed bandit. Furthermore, we identify important current trends and provide new perspectives pertaining to the future of this exciting and fast-growing field.
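For readers new to the classical stochastic bandit that these developments build on, a minimal UCB1 sketch (with invented Bernoulli arm means) looks like this:

```python
# Minimal UCB1 sketch of the classical stochastic bandit; the Bernoulli arm
# means, horizon, and seed are invented for the demo.
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Play UCB1 against Bernoulli arms; return how often each arm was pulled."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts, sums = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:                       # initialization: play each arm once
            a = t - 1
        else:                            # empirical mean + exploration bonus
            a = max(range(k),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < arm_means[a] else 0.0
        counts[a] += 1
        sums[a] += reward
    return counts

pulls = ucb1([0.2, 0.5, 0.8], horizon=2000)
# The best arm (index 2) typically ends up pulled far more often than the worst.
```

The exploration bonus shrinks as an arm accumulates pulls, which is exactly the learning-from-limited-feedback property that makes bandits attractive in the applications surveyed in this talk.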
Dr. Djallel Bouneffouf has worked for many years in the field of online machine learning, with a main research interest in building autonomous systems that can learn to be competent in uncertain environments. He has conducted research in both the public and private sectors. He spent 5 years at Nomalys, a mobile app development company in Paris, France, where he developed the first risk-aware recommender system; one year at Orange Labs (Lannion, France), where he proposed modeling active learning as a contextual bandit; 2 years at the BC Cancer Agency (Vancouver, Canada), where he developed a very fast clustering algorithm that helps biologists analyze collected unstructured data; and 5 years at IBM (USA and Ireland), where he proposed the first reinforcement learning model that can mimic different brain disorders, as well as a novel model of attention based on a multi-armed bandit algorithm. According to Google Scholar, Dr. Bouneffouf is among the 10 most cited scientists in his field; he has more than 40 publications in top-tier conferences, over 1564 citations, and has served on the program committees of more than 20 conferences.
Fri, 13 May, 15:00 - 15:45 France Time (UTC +2)
Fri, 13 May, 13:00 - 13:45 UTC
Fri, 13 May, 09:00 - 09:45 New York Time (UTC -4)
Sensing technologies play an important role in realizing smart and sustainable buildings. Sensors of different modalities are increasingly being integrated into lighting systems that are ubiquitous in buildings. In this talk, the technical challenges in integrating sensors into a connected lighting system will be presented, along with the underlying signal processing problems in delivering sensing data to realize new and improved building applications. The presentation will cover the following topics:
- Sensing-driven smart building applications: lighting/HVAC controls, system monitoring and diagnostics, space management and user comfort/well-being
- Sensor system architectures and deployment aspects
- Sensor data analytics and machine learning based sensor data processing techniques
The presentation will provide a general overview of sensing in connected lighting and the enabled applications, while providing a flavor of the underlying data and signal processing opportunities.
Ashish Pandharipande is currently Lead Systems Engineer at NXP Semiconductors in Eindhoven, The Netherlands. Prior to that, he has held positions at Signify, Philips Research, and Samsung Advanced Institute of Technology. He also holds an adjunct position as Distinguished Research Associate at the Eindhoven University of Technology.
His research interests are in sensing, networked controls, data analytics, machine learning and system applications. He has more than 190 scientific publications in IEEE conferences and journals, and about 100 patents/filings, covering areas like IoT lighting, advanced sensing and control, data analytics and data-enabled services, machine learning, and cognitive wireless systems.
Ashish is a senior member of the IEEE. He is currently a topical editor (Sensor Data Processing Area) for IEEE Sensors Journal, senior editor for IEEE Signal Processing Letters, and former associate editor for Lighting Research & Technology Journal.
Ashish Pandharipande received the B.E. degree in Electronics and Communications Engineering from Osmania University, Hyderabad, India, in 1998, the M.S. degrees in Electrical and Computer Engineering, and Mathematics, and the Ph.D. degree in Electrical and Computer Engineering from the University of Iowa, Iowa City, in 2000, 2001 and 2002, respectively.
Fri, 13 May, 16:00 - 16:45 France Time (UTC +2)
Fri, 13 May, 14:00 - 14:45 UTC
Fri, 13 May, 10:00 - 10:45 New York Time (UTC -4)
We will give an overview of CAIRaoke, an effort to build neural conversational AI models to power the next generation of task-oriented virtual digital assistants with augmented/virtual reality capabilities. We then continue with some of the challenges that we faced in training CAIRaoke dialog models, namely noisy training data resulting in faulty models, and a lack of variation in training conversation flow data leading to poor generalization to real-world conditions. These prompted us to create new public benchmarks for quantifying robustness in task-oriented dialog as well as new modeling techniques to solve these challenges. We will survey the state-of-the-art efforts across industry and academia to tackle such challenges, and will also briefly present three of our recent efforts to solve them. The first paper (CheckDST: https://arxiv.org/pdf/2112.08321.pdf) provides new metrics for quantifying real-world generalization of dialog state tracking performance. The second method (TERM: https://arxiv.org/pdf/2007.01162.pdf) is a simple tweak to the widely used empirical risk minimization framework that can promote robustness against noisy outlier samples. The third method (DAIR: https://arxiv.org/pdf/2110.11205.pdf) is a simple regularization add-on that targets performance consistency when data augmentation is used for better generalization to unseen examples. We will end the talk with an overview of existing open challenges in the field that we hope the signal processing society can tackle. Relevance to ICASSP: This presentation provides an overview of the challenges of realizing neural conversational models and presents theoretical methods, inspired by information theory and signal processing, to address some of them. We end by presenting more open challenges that we hope the ICASSP community can tackle to help realize neural conversational models.
Given that signal processing (SP) society has been a pioneer in image processing, speech processing, and video processing techniques, we believe that SP society is uniquely positioned to lead the way to solve these challenges and the main goal of this talk is to expose these problems and forge concrete connections between industrial research and SP society to this effect.
Ahmad Beirami is a research scientist at Meta AI, leading research to power the next generation of virtual digital assistants with AR/VR capabilities. His research broadly involves learning models with robustness and fairness considerations in large-scale systems. Prior to that, he led the AI agent research program for automated playtesting of video games at Electronic Arts. Before moving to industry in 2018, he held a joint postdoctoral fellow position at Harvard & MIT, focused on problems in the intersection of core machine learning and information theory. He is the recipient of the Sigma Xi Best PhD Thesis Award from Georgia Tech in 2014 for his work on fundamental limits of redundancy elimination from network traffic data.