Conversations by Native Speakers Only

What sets PRESTO apart from other datasets is that it includes only conversations contributed by native speakers of each language, with no translation involved. As the authors of the research paper introducing the dataset explain, prior large multilingual datasets contain non-English conversations obtained by translating English conversations into other languages, “resulting in unnatural and synthetic utterances which are unlikely to be spoken by native speakers of the non-English language.”

A typical user interacts with virtual assistants in a virtual world (i.e., context) that may contain structured objects, such as a list of contacts on the user’s phone, a shopping list, or a to-do list. According to the authors, PRESTO “is the only large-scale human generated conversational parsing dataset that provides structured context such as a user’s contacts and lists for each example.”

They explained that, depending on the query, this context may or may not be needed to correctly interpret the user’s utterances. Semantic parsing models often struggle to determine which part of the context is relevant to a given utterance (if any). Therefore, the authors emphasized that “modeling solutions should have the ability to model (and ignore) such structured information.”
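
To make the idea of relevant versus irrelevant context concrete, here is a minimal Python sketch. It is purely illustrative: the `Example` class, its field names, and the `relevant_context` helper are hypothetical and do not reflect PRESTO's actual schema or the authors' models. The naive substring match stands in for whatever a real semantic parser would learn.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Example:
    """Hypothetical record: an utterance plus structured context (not PRESTO's real schema)."""
    utterance: str
    contacts: List[str] = field(default_factory=list)
    lists: Dict[str, List[str]] = field(default_factory=dict)


def relevant_context(example: Example) -> dict:
    """Keep only the context items that are literally mentioned in the utterance.

    This is a naive stand-in for a learned model: it keeps a contact or a list
    only when its name appears as a substring of the lowercased utterance.
    """
    text = example.utterance.lower()
    contacts = [c for c in example.contacts if c.lower() in text]
    lists = {name: items for name, items in example.lists.items() if name.lower() in text}
    return {"contacts": contacts, "lists": lists}


# Shared, made-up context for two different utterances.
context = dict(contacts=["Maria", "John"], lists={"shopping": ["milk", "eggs"]})

needs_context = Example("Call Maria", **context)
ignores_context = Example("What's the weather today?", **context)

print(relevant_context(needs_context))
print(relevant_context(ignores_context))
```

In the first utterance the contact list is needed to resolve "Maria", while in the second the same context should be ignored entirely, which is the behavior the authors say modeling solutions need to support.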
