Visual Question Answering for Cultural Heritage

Paper title

Visual Question Answering for Cultural Heritage

Authors

Pietro Bongini, Federico Becattini, Andrew D. Bagdanov, Alberto Del Bimbo

Abstract

Technology and the fruition of cultural heritage are becoming increasingly entwined, especially with the advent of smart audio guides, virtual and augmented reality, and interactive installations. Machine learning and computer vision are important components of this ongoing integration, enabling new interaction modalities between user and museum. Nonetheless, the most frequent way of interacting with paintings and statues remains taking pictures. Yet images alone can only convey the aesthetics of the artwork; they lack the information that is often required to fully understand and appreciate it. Usually this additional knowledge comes both from the artwork itself (and therefore the image depicting it) and from an external source of knowledge, such as an information sheet. While the former can be inferred by computer vision algorithms, the latter needs more structured data to pair visual content with relevant information. Regardless of its source, this information must still be effectively transmitted to the user. A popular emerging trend in computer vision is Visual Question Answering (VQA), in which users interact with a neural network by posing questions in natural language and receiving answers about the visual content. We believe that this will be the evolution of smart audio guides for museum visits and of simple image browsing on personal smartphones. This will turn the classic audio guide into a smart personal instructor with which the visitor can interact by asking for explanations focused on specific interests. The advantages are twofold: on the one hand, the cognitive burden on the visitor will decrease, since the flow of information is limited to what the user actually wants to hear; on the other hand, it offers the most natural way of interacting with a guide, favoring engagement.
