
Google Unveils Dataset for Virtual Assistant Training
Researchers from Google, the University of Rochester, the University of California, and Columbia University have introduced a new dataset of over 550K multilingual conversations between humans and virtual assistants in various contexts, intended to make the training of language models more realistic. Google also announced the new dataset in a blog post.
With the wide adoption of virtual assistants such as Google Assistant, Alexa, and Siri, researchers have taken an interest in the study of task-oriented dialogue; however, the lack of datasets that capture a wide range of user pain points has limited the impact of academic research in this field.
Although some custom datasets have been created, they lack the typical speech phenomena necessary for model training, leading to underperforming models and dissatisfaction with assistant interactions. The new dataset, named PRESTO and released on March 17, 2023, spans six languages (German, English, Spanish, French, Hindi, and Japanese) and contains a diverse array of challenges that occur in real-world natural language understanding (NLU) tasks, including disfluencies (e.g., repeated phrases and filler words), code-switching or code-mixing (i.e., switching between or mixing words from two languages), and user revisions (i.e., revising requests due to mistakes, or changing or canceling requests).