1 Introduction
Vast amounts of multimedia content are being produced, archived and digitised, resulting in great troves of data of interest. Examples include user-generated content, such as images, videos, text and audio posted by users on social media and wikis, and content provided through official publishers and distributors, such as digital libraries, organisations and online museums. This digital content can serve as a valuable source of inspiration for the cultural and creative industries, enabling them to produce new assets or to enhance and (re-)use existing ones.
However, the re-use and re-purposing of digital content is currently realised mainly through individual designers' skills and a variety of non-interlinked, heterogeneous tools. As a result, the content remains largely under-exploited, despite its great potential for re-use and re-purposing, due to the lack of appropriate solutions for its retrieval and integration into the design process. For example, existing heterogeneous multimedia content, such as videos and images of buildings and objects, can be collected and transformed (e.g. into 3D models), so as to inspire and support the creation of new content in the creative industries. One of the main challenges in this area is to maximise the potential for re-purposing digital content through the development of innovative technologies that systematically analyse, combine and link heterogeneous multimedia content, fostering its searchability and reusability in different contexts.
In this paper we describe V4Ann, an ontology-based framework for capturing and interlinking digital assets and their annotations at two levels: (a) the content analysis level, during which visual and textual content is analysed to extract labels, called atoms; and (b) the retrieval and re-purposing level, where the assets (e.g. 3D models and images) are interlinked and contextually enriched to facilitate their discovery. At the content analysis level, V4Ann provides the conceptual structures to capture and interlink multimedia analysis results on digital content, such as video, image and text. During retrieval and re-purposing, V4Ann provides practical retrieval capabilities, allowing users, e.g. game designers, to search for assets relevant to their needs. V4Ann is part of the V4Design platform, enriching multimedia processing with a semantic annotation layer.
The main contributions of this paper are the following:
- We describe a resource annotation model that implements the W3C standard for defining annotations (Web Annotation Data Model [17]).
- We define a core set of rules that perform valid inferences for annotation propagation and interlinking, as well as for validity checking.
- We propose an atom similarity metric, along with a search algorithm for keyword-based digital asset retrieval.
The rest of the paper is structured as follows: Sect. 2 presents related work. Section 3 gives an overview of the framework and presents our motivation. In Sect. 4 we describe the basic concepts of the V4Ann annotation model, while in Sect. 5 we elaborate on the inference and validation capabilities. Section 6 describes the atom similarity metric and the searching functionality. In Sect. 7 we present evaluation results and, finally, in Sect. 8 we conclude our work.
2 Related Work
Annotations are typically used to convey information about a resource or associations between resources. Simple examples include a comment or tag on a single web page, image or video, or a blog post about a news article. In 2017, the Web Annotation Data Model (WADM) [17] became the W3C recommendation for defining annotations. It provides an extensible, interoperable framework for expressing annotations, such that they can easily be shared between platforms.
In the domain of digital libraries, the Europeana Data Model (EDM) [4] adopts an open and scalable approach that can accommodate the range and level of detail of particular standards, such as LIDO for museums, EAD for archives or METS for digital libraries. EDM is not built on any particular standard; however, it is conceptually in line with WADM and the ORE initiative.
The Open Provenance Model (OPM) [11] makes it possible to specify what caused "things" to be, i.e., how "things" depended on others and resulted in specific states. In essence, it allows provenance information to be exchanged between systems by means of a compatibility layer based on a shared provenance model. OPM predates PROV-O [9] and takes a very similar approach to modelling provenance, relating agents, artifacts and processes; the concepts of OPM are covered by equivalent PROV-O concepts. PAV [3] extends PROV-O and specifies Provenance, Authoring and Versioning information.
The Dublin Core metadata (DCMI) standard is a simple yet effective general-purpose set of 15 elements for describing a wide range of networked resources. Although DCMI favours document-like objects, it can be applied to other resources as well. The SKOS Core Vocabulary [10] is a model for expressing the basic structure and content of concept schemes. Specifically for multimedia, the Ontology for Media Resources was developed by the W3C Media Annotations Working Group to identify a minimal set of core properties for describing and retrieving information about media resources. VidOnt [18] provides a formally grounded core reference ontology for video representation. Several attempts have also been made to map the XML Schema of MPEG-7 to RDFS and OWL [19], and X3D to OWL, e.g. OntologyX3D [6] and the 3D Modeling Ontology (3DMO).
V4Ann aims to serve as the semantic annotation layer of multimedia processing results, fostering data exchange among analysis services and supporting human consumption. In order to promote interoperability and extensibility, it implements the WADM pattern, introducing the concept of atoms and providing several annotation entities and properties. In contrast to existing models that mostly focus on metadata defined by data providers and curators, V4Ann aims to capture content analysis results (e.g. visual and textual analysis), serving as a semantic middleware for metadata exchange. For example, EDM views refer to digital representations, whereas in V4Ann a view represents an atom-based interpretation of a content analysis procedure, e.g. aesthetics extraction. However, V4Ann provides alignments to conceptual structures of existing models, such as EDM, ORE and SKOS (see Sect. 4 for more details).

Fig. 1. The position of V4Ann in the integrated V4Design platform.
3 Key Concepts and Motivation
In a world where visual and textual data are in abundance, creative industries need to re-use and re-purpose them in order to remain competitive and to offer society and the creative economy new value. V4Design is an H2020 project that aims at exploiting state-of-the-art digital content analysis techniques to generate 3D models, extract aesthetic and stylistic information from paintings and videos, localise buildings and objects of interest within visual content, and integrate these results with textual information, so as to inspire and support the design and architecture industries, as well as 3D and VR game design.
Annotation propagation and linking: In a multimodal content analysis setting, like in V4Design, a single media type can be analysed by multiple technologies. For example, an image can be used for extracting building masks, as well as for aesthetics (style) extraction. Also, in many cases, there are interdependencies among the components, e.g. 3D model reconstruction needs as input video frame masks extracted by building localisation. It is important to have an efficient and interoperable way to represent, exchange and further link metadata, both structurally and semantically.
Context-aware retrieval: V4Design aims to create new multimedia content that can be integrated in existing architecture and video game design platforms, such as Unity and Rhino. Therefore, there is a need for practical and efficient retrieval mechanisms on top of the multimodal annotations. For example, to allow users to search for assets with certain styles or with advanced contextual filters, such as "castles near lakes".
In order to address the aforementioned challenges, V4Ann capitalises on and combines existing Semantic Web standards for resource annotation and interlinking, inference and validation. More precisely, the WADM model is used as the core resource annotation pattern, combined with existing structured ontologies and schemata (Sect. 4). SPIN rules [7] and SHACL shapes [8] are used to derive additional relations among the annotated resources and for validating the generated knowledge graphs (Sect. 5). Finally, keyword-based context-aware retrieval is facilitated to retrieve assets (Sect. 6).
4 V4Ann Annotation Model
Figure 2 illustrates the upper-level concepts of the V4Ann annotation model. The conceptual model revolves around the notions of annotations, media types, views and atoms. Annotations serve as resource containers, implementing the annotation pattern of WADM. Each annotation associates a media type (image, video, text, 3D model) with a view, which encapsulates a set of atoms. Each view defines one or more atoms, e.g. entities, tags, styles, etc. that are derived from multimedia content analysis. These atoms describe: (a) Aesthetics, i.e. architectural styles and creators that are extracted from images and videos; (b) Object and building types that are recognised in images and videos; (c) Named entities and concepts that are extracted from textual descriptions, e.g. image captions; (d) Images and video frames used to reconstruct a 3D model. All atoms derived by aesthetics, localisation and text analysis are disambiguated, i.e. they are already mapped to WordNet, BabelNet or DBpedia resources by the content analysis services. Figure 2 also presents SKOS mappings to the ORE specification, as well as subclass and subproperty relations to WADM and EDM. In the following we describe each key concept in detail.
4.1 Annotation Resources




Fig. 2. The core concepts of the V4Ann annotation model, defined as specialisations of WADM (oa namespace). Mappings to other models are also depicted, such as to the Europeana Data Model (EDM) and the Object Reuse and Exchange (ORE) initiative.
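To make the pattern concrete, the following minimal sketch assembles such an annotation graph with rdflib. The describes and hasContext properties are those introduced in Sects. 4.2 and 4.3; the v4ann namespace, the hasAtom property and the AestheticsView class are illustrative assumptions rather than the published V4Ann vocabulary.

```python
# A minimal sketch of the V4Ann annotation pattern with rdflib.
# Only the WADM (oa) terms are standard; the v4ann namespace,
# hasAtom and AestheticsView are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

OA = Namespace("http://www.w3.org/ns/oa#")
V4 = Namespace("http://example.org/v4ann#")  # hypothetical namespace

g = Graph()
g.bind("oa", OA)
g.bind("v4ann", V4)

annotation = URIRef("http://example.org/ann/1")
image = URIRef("http://example.org/media/img42")
view = URIRef("http://example.org/view/aesthetics42")
atom = URIRef("http://dbpedia.org/resource/Baroque_architecture")

# The annotation targets a media type resource (here an image),
# which carries descriptive information such as its source URL ...
g.add((annotation, RDF.type, OA.Annotation))
g.add((annotation, V4.describes, image))
g.add((image, RDF.type, V4.Image))
g.add((image, V4.source, Literal("http://example.org/archive/img42.jpg")))

# ... and associates it with a view that encapsulates disambiguated atoms.
g.add((annotation, V4.hasContext, view))
g.add((view, RDF.type, V4.AestheticsView))
g.add((view, V4.hasAtom, atom))

print(g.serialize(format="turtle"))
```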
4.2 Media Types
In order to define the targets of annotations (describes property assertions), V4Ann provides the MediaType upper-level class. The media types for annotations are: Video, Text, Image, Mask Image, Texture Image and 3DModel. Each media type can be associated with additional descriptive information, such as the source of the asset (e.g. its URL), license information, the date of retrieval, etc. Intuitively, each media type resource represents a single multimedia asset for which a set of annotation atoms needs to be captured.
4.3 Views and Atoms
Views are container classes that encapsulate the annotation metadata (atoms) and are used in hasContext property assertions. Each media type has a different view. For example, the atoms of spatio-temporal building (BuildingView) and object (ObjectView) localisation in images and videos specify their type, i.e. whether the image or video contains a building, an object or a painting. The semantics of OWL 2 allows us to define useful complex class descriptions to specify further dependencies, as illustrated below. It should be noted that content analysis is not part of the V4Ann framework. As described in Sect. 3, V4Ann aims to semantically capture the results of content analysis, which is part of the overall V4Design platform [1].
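To illustrate, a hypothetical axiom of this kind could require every building view to carry at least one building-type atom; the hasAtom property follows the sketch in Sect. 4.1 and, like the class names, is an assumption rather than the exact V4Ann vocabulary:

$$\textsf{BuildingView} \sqsubseteq \textsf{View} \sqcap \exists \textsf{hasAtom}.\textsf{BuildingType}$$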



Object and Building Localisation. The localisation of buildings and interior objects in art- and architecture-related movies, documentaries and art images aims to extract content that can be re-purposed and re-used in a meaningful and innovative way. Examples include buses and trains, as well as statues, buildings, etc.







5 Inference and Validation
5.1 Implicit Relationships
Additional inferences are derived by combining native OWL 2 RL reasoning with custom rules. The former is based on the OWL 2 RL profile semantics (OWL 2 RL/RDF rules [12]), which is implemented by state-of-the-art triple stores, such as GraphDB. However, the semantics of OWL 2 is limited; for example, only instances connected in a tree-like manner can be modelled [13]. V4Ann therefore implements domain rules on top of the graphs to express richer relations: SPARQL-based CONSTRUCT graph patterns identify the valid inferences that can be made on the annotation graphs. It is beyond the scope of this paper to provide extensive coverage of the relevant reasoning capabilities; in the following we present the concept of atom propagation, which illustrates the principal idea.

Fig. 3. Example of atom propagation. The dashed arrow illustrates the enrichment of the 3D annotation resource with the aesthetics style derived from visual analysis.
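As an illustration of such a rule, the propagation of style atoms from the images used in 3D reconstruction to the resulting model can be phrased as a CONSTRUCT pattern and executed with rdflib. The reconstructedFrom property and the other v4ann names carry over from the hypothetical sketch of Sect. 4.1; they are assumptions, not the exact V4Ann vocabulary.

```python
# A hedged sketch of atom propagation as a SPARQL CONSTRUCT rule run
# with rdflib: a 3D model annotation inherits the style atoms of the
# images from which the model was reconstructed. All v4ann names are
# illustrative assumptions.
from rdflib import Graph

PROPAGATE_STYLE = """
PREFIX v4ann: <http://example.org/v4ann#>
CONSTRUCT {
  ?modelView v4ann:hasAtom ?styleAtom .
}
WHERE {
  ?modelAnn  v4ann:describes ?model ;
             v4ann:hasContext ?modelView .
  ?model     v4ann:reconstructedFrom ?image .
  ?imageAnn  v4ann:describes ?image ;
             v4ann:hasContext ?imageView .
  ?imageView a v4ann:AestheticsView ;
             v4ann:hasAtom ?styleAtom .
}
"""

def propagate_styles(g: Graph) -> int:
    """Run the rule and merge the inferred triples back into the graph."""
    before = len(g)
    for triple in g.query(PROPAGATE_STYLE):
        g.add(triple)
    return len(g) - before  # number of newly derived triples
```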

5.2 Validation and Consistency Checking
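V4Ann checks the semantic consistency of the generated annotation graphs against SHACL shapes [8] (cf. Sect. 3). As a minimal sketch, assuming a hypothetical shape that requires each annotation to describe exactly one media type resource, such a check could be wired up with pySHACL as follows:

```python
# A minimal validation sketch with pySHACL; the shape below is a
# hypothetical example, not the actual V4Ann shapes graph.
from pyshacl import validate
from rdflib import Graph

SHAPES = """
@prefix sh:    <http://www.w3.org/ns/shacl#> .
@prefix oa:    <http://www.w3.org/ns/oa#> .
@prefix v4ann: <http://example.org/v4ann#> .

v4ann:AnnotationShape a sh:NodeShape ;
    sh:targetClass oa:Annotation ;
    sh:property [
        sh:path v4ann:describes ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .
"""

def is_consistent(data: Graph) -> bool:
    shapes = Graph().parse(data=SHAPES, format="turtle")
    conforms, _report, report_text = validate(data, shacl_graph=shapes)
    if not conforms:
        print(report_text)  # human-readable validation report
    return conforms
```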

6 Context-Based Asset Retrieval
In Sects. 4 and 5 we described the process of creating the V4Ann annotation graphs, which involves the representation and further interlinking (e.g. through annotation propagation) of media type atoms. In this section we describe the approach of V4Ann towards enabling keyword-based, context-aware retrieval of assets, capitalising on the concept of the local context.
Definition 1

The local context of an atom $t$ is defined as the tuple $\langle r, he, ho \rangle$, where $r$ is the set of conceptually relevant terms, $he$ is the set of hypernyms and $ho$ is the set of hyponyms of $t$.

Fig. 4. (a) Generic local context of an atom: relevant atoms are extracted from ConceptNet and BabelNet properties, hypernyms stem from WordNet and IS-A BabelNet relationships, hyponyms stem from WordNet; (b) example local context for "Gendarmenmarkt".
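A local context of this kind can be approximated with off-the-shelf resources. The sketch below uses NLTK's WordNet interface for hypernyms and hyponyms and ConceptNet's public /related endpoint for relevant terms; the BabelNet relations used by V4Ann require an API key and are omitted here.

```python
# A rough sketch of building the local context <r, he, ho> of an atom
# (Definition 1) from WordNet and ConceptNet. BabelNet relations,
# which V4Ann also uses, are omitted here.
import requests
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def local_context(term: str, limit: int = 10):
    he, ho = set(), set()
    for synset in wn.synsets(term):
        he.update(l.name() for h in synset.hypernyms() for l in h.lemmas())
        ho.update(l.name() for h in synset.hyponyms() for l in h.lemmas())

    # Conceptually relevant terms from ConceptNet's /related endpoint.
    resp = requests.get(
        f"http://api.conceptnet.io/related/c/en/{term}",
        params={"filter": "/c/en", "limit": limit},
    )
    r = {edge["@id"].rsplit("/", 1)[-1] for edge in resp.json().get("related", [])}
    return r, he, ho

r, he, ho = local_context("castle")
```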
6.1 The $\mathcal{AH}$ Metric

Given two atoms $A$ and $B$, the $\mathcal{AH}$ metric determines their matching degree based on the atom similarity $S(A, B) \in [0..1]$ (Sect. 6.2) and the following matching criteria:

- 1. exact (e). The two atoms should have either the same URI, or they should be equivalent concepts, that is, $A \equiv B$.
- 2. plugin (p). The atom $B$ should belong to the set of hypernyms of $A$ or to the set of relevant concepts of $A$, that is, $B \in he_A \cup r_A$.
- 3. subsume (su). The atom $B$ should belong to the set of hyponyms of $A$, that is, $B \in ho_A$.



Definition 2

Let $S_A$ and $S_B$ be two sets of atoms and $F$ a matching criterion. The set-based matching degree of $S_B$ against $S_A$ is defined as

$$\mathcal{AH}_{set}(S_A, S_B, F) = \frac{\displaystyle\sum_{\forall B \in S_B} \max_{\forall A \in S_A} \bigl[\mathcal{AH}(B, A, F)\bigr]}{|S_B|}$$








6.2 Atom Similarity S
As the similarity function $S(A, B)$, V4Ann uses a heuristic that takes into account the information captured in the local contexts of $A$ and $B$, i.e. in the sets $r$, $he$ and $ho$ (see Definition 1). The implementation of $S$ is summarised in the following priority rules, where $s_e$, $s_p$ and $s_{su}$ denote the scores assigned to exact, plugin and subsume matches, respectively, with $s_e > s_p > s_{su} > 0$:

- $r_1$: if $A \equiv B$, then $S(A, B) = s_e$.
- $r_2$: if $B \in he_A \cup r_A$, then $S(A, B) = s_p$.
- $r_3$: if $B \in ho_A$, then $S(A, B) = s_{su}$.
- $r_4$: otherwise, $S(A, B) = 0$.
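A compact sketch of these rules, together with the set-based aggregation of Definition 2, is given below; the intermediate scores for the plugin and subsume cases are illustrative assumptions, and the matching criterion $F$ of Definition 2 is folded into the similarity function for brevity.

```python
# A sketch of the atom similarity S (priority rules r1-r4) and the
# aggregation AH_set of Definition 2. The scores 0.8 and 0.6 for the
# plugin and subsume cases are illustrative assumptions.
def S(a, b, ctx):
    """Similarity of atoms a, b; ctx maps an atom to its <r, he, ho>."""
    r_a, he_a, ho_a = ctx[a]
    if a == b:                   # r1: exact match (equivalence simplified
        return 1.0               #     to URI identity in this sketch)
    if b in he_a or b in r_a:    # r2: plugin match
        return 0.8
    if b in ho_a:                # r3: subsume match
        return 0.6
    return 0.0                   # r4: no match

def AH_set(S_A, S_B, ctx):
    """Average best-match similarity of the atoms in S_B against S_A."""
    if not S_A or not S_B:
        return 0.0
    return sum(max(S(a, b, ctx) for a in S_A) for b in S_B) / len(S_B)
```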





Table 1. The number of annotations and atoms in the V4Ann annotation graphs, along with the average size of the local context of each atom ($r + he + ho$).

| #annotations | #atoms | Avg. local context size |
|---|---|---|
| 17245 | 154610 | 17 per atom |
7 Evaluation and Discussion
7.1 Digital Content
Table 2. Example questions answered by users.

| # | Question | Mean (SD) |
|---|---|---|
| Q1 | Atoms that are derived from visual analysis are most of the time correct | |
| Q2 | Atoms that are derived from text analysis are most of the time correct | |
| Q3 | Many times irrelevant results are top-ranked | |
| Q4 | There are many irrelevant results | |
| Q5 | It takes too long for the system to provide a response | |
| Q6 | There are too many "No results" responses | |
7.2 Evaluation
User-Centred. A user-centred evaluation has been performed with a twofold purpose: first, to collect qualitative feedback on the results, as well as on non-functional aspects, such as query response time; second, and most importantly, to generate an annotation dataset and assess the performance of V4Ann.
Quality of atoms: The quality and relevance of the local contexts depend on the performance of content analysis, e.g. visual and textual analysis. Table 2 shows that visual analysis generally provides better results than text analysis (Q1, Q2).
Retrieval results: According to Q3, the system achieves good top-ranked accuracy; however, the complete result set contains quite a lot of irrelevant entries (Q4). As we explain in the next section, this mainly relates to the amount of context provided in the query (i.e. the number of keywords). Due to the local context, the system was able to provide a response in most cases (Q6), even if only partially correct (Q4).
Response time: The response time of the system was positively assessed (Q5). The average response time was 4.1 seconds, which includes query analysis, building of the local context and execution of the search algorithm.


It should be noted that the overall performance of V4Ann strongly depends on the quality of the atoms, which in turn depends on the quality of the analysis results provided to V4Ann. For example, if aesthetics extraction provides the wrong style for a painting, this will affect precision, since V4Ann does not aim at improving the classification of incoming atoms. However, we plan to integrate multimodal data aggregation and fusion techniques to derive the most plausible classification of atoms and help improve the contextual information captured in local contexts.



Table 3. Average precision and recall (top-20 results).

| | Recall | Precision | Recall | Precision |
|---|---|---|---|---|
| exact | 0.59 | 0.77 | 0.44 | 0.51 |
| plugin | 0.67 | 0.69 | 0.52 | 0.48 |
| subsume | 0.73 | 0.61 | 0.59 | 0.42 |
8 Conclusion
In this paper we presented V4Ann, an ontology-based framework for representing, linking and enriching the results of multimedia analysis on digital content. V4Ann generates annotation graphs of image, video and textual analysis and of 3D model reconstruction, so as to facilitate the systematic processing, integration and organisation of information and to establish practical re-purposing mechanisms.
The annotation model of V4Ann reuses existing standards and schemata, building the atom-based annotation graphs on top of standard ontologies, controlled vocabularies and patterns. The vocabularies are defined in OWL 2 and atoms are associated with assets using the WADM pattern. As such, V4Ann promotes interoperability and fosters the use of declarative languages to identify further inferences and to ensure the semantic consistency of the knowledge graphs. We also elaborated on the concept of local contexts, as well as on the $\mathcal{AH}$ metric for asset retrieval. We evaluated the framework using actual multimedia content and atoms provided by the V4Design modules and discussed the findings.
V4Ann is accessible through Rhinoceros 3D (Rhino) and Unity plugins developed in the V4Design project, through which users (architects and video game designers) can search for assets and import them into the scene. For future work, we plan to implement context-aware algorithms to improve the classification accuracy of incoming atoms, as well as to extend the context-aware retrieval algorithm with more sophisticated similarity metrics and functions.
Acknowledgments
This work was supported by the EC funded projects V4Design (H2020-779962) and MindSpaces (H2020-825079).

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.