MS MARCO: Microsoft's Reading Comprehension and Question Answering Dataset

Reposted from "Neo Fung's Blog" (https://www.neofung.org/2017/10/19/微软阅读理解、问答数据集-MS-MARCO/)

Microsoft Machine Reading Comprehension (MS MARCO) is a new large scale dataset for reading comprehension and question answering. In MS MARCO, all questions are sampled from real anonymized user queries. The context passages, from which answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers to the queries are human generated if they could summarize the answer.

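From a quick look, the initial release of the dataset contains 100,000 questions.
What is striking is that these questions were not written by editors working from web documents. They were mined from years of Bing search logs, which makes them about as close as you can get to how people actually ask questions: neither clear nor concise.
The material supplied for finding each answer is the ten most relevant passages retrieved by an IR system.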
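
To make that structure concrete, here is a minimal Python sketch of what one record could look like: a real user query, ten retrieved passages, and a human-written answer. The class and field names (`Passage`, `Example`, `is_selected`) and the sample text are made up for illustration; they are not taken from the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Passage:
    text: str
    # Hypothetical flag: marks a passage the annotator relied on for the answer.
    is_selected: bool = False


@dataclass
class Example:
    query: str                      # a user query, as typed into the search engine
    passages: List[Passage]         # the candidate passages returned by the IR system
    answers: List[str] = field(default_factory=list)  # human-written answers, if any


def selected_passages(example: Example) -> List[str]:
    """Return the text of every passage flagged as supporting the answer."""
    return [p.text for p in example.passages if p.is_selected]


# Toy record with placeholder text, only to show the shape of the data.
example = Example(
    query="what is ms marco",
    passages=[Passage(text=f"retrieved passage {i}") for i in range(10)],
    answers=["a reading comprehension dataset"],
)
example.passages[3].is_selected = True

print(len(example.passages))
print(selected_passages(example))
```

Keeping all ten passages on the record, rather than only the one that supports the answer, reflects the setup described above: a model has to find the answer among retrieved text, most of which may be irrelevant.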
