Conversations by Native Speakers Only

from Realistic Dialogues, No Translation: Google Unveils Dataset for Virtual Assistant Training

by Slator Language Industry Intelligence

What sets PRESTO apart from other datasets is that it includes only conversations provided by native speakers of each language, with no translation. As the authors of the research paper introducing the dataset explain, prior large multilingual datasets obtained their non-English conversations by translating English conversations into other languages, “resulting in unnatural and synthetic utterances which are unlikely to be spoken by native speakers of the non-English language.”

A typical user interacts with virtual assistants in a virtual world (i.e., context) that may contain structured objects, such as a list of contacts on the user’s phone, a shopping list, or a to-do list. According to the authors, PRESTO “is the only large-scale human generated conversational parsing dataset that provides structured context such as a user’s contacts and lists for each example.”

They explained that, depending on the query, this context may or may not be needed to correctly interpret the user’s utterances. Semantic parsing models often struggle to determine which part of the context is relevant to a given utterance (if any). Therefore, the authors emphasized that “modeling solutions should have the ability to model (and ignore) such structured information.”
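To make the idea of structured context concrete, here is a minimal Python sketch. It is illustrative only: the `Example` record, its field names, and the `relevant_context` helper are hypothetical and do not reflect PRESTO's actual schema or any model described by the authors. The naive string match simply stands in for the decision a semantic parser has to make about which context, if any, matters for a given utterance.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Example:
    """Hypothetical record: one utterance plus the user's structured context."""
    utterance: str
    contacts: list[str] = field(default_factory=list)
    lists: dict[str, list[str]] = field(default_factory=dict)


def relevant_context(example: Example) -> dict[str, list[str]]:
    """Naive heuristic (an assumption, not the authors' method):
    keep only the contacts and list names mentioned in the utterance."""
    text = example.utterance.lower()
    return {
        "contacts": [c for c in example.contacts if c.lower() in text],
        "lists": [name for name in example.lists if name.lower() in text],
    }


# The same context accompanies every query; only some queries need it.
context = {"contacts": ["Maria", "John"], "lists": {"shopping": ["milk"]}}

call = Example("call Maria", **context)
weather = Example("what's the weather today", **context)
shopping = Example("add eggs to my shopping list", **context)

print(relevant_context(call))
print(relevant_context(weather))
print(relevant_context(shopping))
```

In this sketch the weather query matches nothing in the context, which mirrors the case in which a model should ignore the structured information entirely. A real parser would have to learn that decision rather than rely on substring matching.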
