Easy use and creation of vocabularies: the LoCloud vocabulary microservice

Gerda and Walter Koch of AIT Applied Information Technology Graz, Austria write:

LoCloud provides support services and tools to regional and local heritage institutions. These tools and services simplify ingestion of data into Europeana and help to improve metadata quality with the use of cloud-based services. One of the LoCloud services developed during the second project year is the vocabulary microservice.

Microservices? An introduction

The past years have seen a move from monolithic applications towards microservice architectures. Microservice architectures are generally described as suites of small, independent services combined into a single application, in contrast to monolithic applications, which are built as single units comprising a client-side user interface, a database and a server-side application. While any change to a monolithic application requires rebuilding and redeploying the entire application, in a microservice architecture it is comparatively easy to introduce new versions of individual services or to integrate new services. This will become even more important in the future, as applications will increasingly be deployed to the cloud.

Figure 1 – Monolithic versus microservices architecture

Microservices are independently deployable and scalable, and they can even be written in different programming languages by different software teams. Usually, the services are built around business capabilities and communicate via web service requests or remote procedure calls.

The LoCloud microservices have been developed by different project partner teams and are being deployed in the LoCloud MORe [1] aggregator. Additionally, some of the services provide front-end applications. The LoCloud vocabulary microservice provides the business capability for vocabulary management and enrichment of metadata.

Vocabulary Microservice

Vocabulary services support the enrichment of metadata (catalogue data) by adding vocabulary terms to metadata records. This can be done right “at the beginning” when the object is registered in the local cataloguing system. Alternatively, the data can be enriched by an automated service “after” cataloguing has taken place. The second possibility often prevails when data is ingested into joint virtual catalogues (like Europeana) where common vocabularies provide a means to semantically link data and to support easy browsing through the entire repository.

The requirement for the LoCloud vocabulary service team was to implement a cloud-based vocabulary web service and vocabulary application for the LoCloud network. The vocabulary application should support the development of multilingual, semantic thesauri for local heritage content and the vocabulary webservice had to be based on international standards such as SKOS [2] and the ISO thesaurus norms [3].
Four main use cases for the vocabulary services exist.
Three use cases for vocabulary provision: use the vocabulary microservice…
a) …automatically in the various enrichment workflows, through the generic enrichment service
b) …by choice, through the aggregator user interface
c) …in local cataloguing systems, via web services.
One use case for vocabulary creation, including the import of existing vocabularies: use the service…
d) …with a cloud-based online tool.
There was a six-month implementation period and a three-month testing period for the service scheduled in the LoCloud workplan. A systematic survey of existing applications revealed that the open source tool TemaTres (http://sourceforge.net/projects/tematres/) was the best starting point for the rapid development and implementation of such a service.
TemaTres [4] supports the handling of vocabularies in accordance with the ISO thesaurus standards. It allows for import and export of data as simple text files or in SKOS format. The tool was installed in the cloud testlab of LoCloud and then adapted to the project’s needs. These needs mainly concerned the implementation of a simplified administration and import facility. Additionally, we implemented a note extension that allows coordinates to be stored for place names. Two new web service [5] calls, “import” and “linkTerm”, were added to the extensive list of already available TemaTres web services. For the import of multilingual vocabularies, a new identifier for concepts was introduced, which automatically connects the terms in the various languages. This identifier also carries the link information of the publicly published vocabulary and is later added to the metadata during the enrichment process.
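
To illustrate how a shared concept identifier can connect terms across languages during import, here is a minimal Python sketch. The row layout, the identifiers and the base URI are assumptions made up for illustration; this is not the actual TemaTres import format.

```python
from collections import defaultdict

# Hypothetical import rows: (concept_id, language, term).
ROWS = [
    ("c001", "en", "church"),
    ("c001", "de", "Kirche"),
    ("c001", "es", "iglesia"),
    ("c002", "en", "castle"),
    ("c002", "de", "Burg"),
]

# Hypothetical base URI under which the vocabulary is published.
BASE_URI = "http://vocabulary.example.org/monuments/"


def group_by_concept(rows):
    """Collect the labels of each concept, keyed by language."""
    concepts = defaultdict(dict)
    for concept_id, lang, term in rows:
        concepts[concept_id][lang] = term
    return dict(concepts)


def concept_uri(concept_id):
    """Build the link that would be written into enriched metadata."""
    return BASE_URI + concept_id


concepts = group_by_concept(ROWS)
print(concepts["c001"]["de"], concept_uri("c001"))
```

Because every language variant shares one identifier, a match on any label can resolve to the same published link.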
How the LoCloud vocabulary services may be used.
Use Case 1 – Automatic usage through the generic enrichment service.

The LoCloud microservice “Generic enrichment” automatically receives the vocabularies available in the vocabulary tool and uses them during the automated enrichment process conducted in the MORe aggregator.

To date, around 30 vocabularies have been imported into the vocabulary tool. A list of these vocabularies can be found on the LoCloud Vocabularies page.

Figures 2 and 3 show how the microservices Vocabulary and Vocabulary Matching (currently available for English and Spanish) are added to a data source enrichment plan in the MORe aggregator.

Figure 4 – Use Case 1 – Automatic usage through the generic enrichment service

Consequently, the EDM metadata is checked against the LoCloud vocabularies and where appropriate the vocabulary links are added.
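
The matching step can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the record is a simplified dictionary rather than real EDM, the vocabulary entries and URIs are invented, and the exact-match rule is a stand-in for whatever the generic enrichment service actually does.

```python
# Hypothetical vocabulary: lower-cased label -> concept URI.
VOCABULARY = {
    "church": "http://vocabulary.example.org/monuments/c001",
    "castle": "http://vocabulary.example.org/monuments/c002",
}


def enrich(record, vocabulary):
    """Return a copy of the record with links for matching subject terms."""
    links = []
    for subject in record.get("subject", []):
        uri = vocabulary.get(subject.strip().lower())
        if uri is not None and uri not in links:
            links.append(uri)
    enriched = dict(record)
    enriched["subject_links"] = links
    return enriched


record = {"title": "Parish church of St. Mary", "subject": ["Church", "baroque"]}
print(enrich(record, VOCABULARY)["subject_links"])
```

Terms without a vocabulary match (here "baroque") are left untouched, so the enrichment only ever adds links.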

Use Case 2 – Match vocabulary terms in the aggregator.

Figure 5 – Selection of individual vocabulary terms for a Subject Collection in MORe

Instead of matching the vocabularies automatically to the metadata, it is possible to select individual terms from the LoCloud vocabularies that should be added to the metadata. These selections are called Subject Collections in MORe.

Use Case 3 – Use the vocabularies in your local cataloguing systems via web services

Figure 6 – Integration of a vocabulary to a local cataloguing tool via webservice

One of the benefits of microservices is that these small and independent services may be plugged into other applications and used via the APIs [6] they provide. Figure 6 depicts an example of a local cataloguing system for music archiving that calls the vocabulary web service in the metadata field “Genre(s)”. The technical information on how the services can be integrated is published on the LoCloud online support centre. There are currently fifteen different web service calls available for vocabulary integration.
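
As a rough idea of what such an integration could look like from the cataloguing system's side, the following Python sketch builds the request URL for one call. The host name is hypothetical, and the `services.php` endpoint with `task` and `arg` parameters is an assumption about the TemaTres-style interface; the authoritative list of calls and parameters is the documentation on the LoCloud support centre.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the address given in the documentation.
SERVICE_URL = "https://vocabulary.example.org/genres/services.php"


def build_request_url(task, arg):
    """Build a GET request URL for one web service call (assumed parameters)."""
    return SERVICE_URL + "?" + urlencode({"task": task, "arg": arg})


# E.g. look up candidate terms while a cataloguer types into "Genre(s)".
print(build_request_url("search", "folk music"))
```

The sketch deliberately stops at building the URL; sending the request and parsing the response depend on the deployed service.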

Use Case 4 – The vocabulary experimental application.

Figure 7 – LoCloud Vocabularies, Website

The LoCloud vocabularies have been imported into the TemaTres tool in order to use them via web services in the aggregation process. But TemaTres can also be used to create vocabularies from scratch or to import existing vocabularies. The advantage of importing vocabularies into the tool is clear: the vocabularies become available in SKOS format with their own web presence and can be used for semantic linking afterwards. With the consent of the vocabulary owner, the vocabulary might then become available to other LoCloud partners in the MORe aggregator. Moreover, the LoCloud TemaTres installation allows online collaboration in creating and extending vocabularies (for example, in order to create new translations of existing vocabularies).

The vocabulary application is accessible through this link (see figure 7). Please check the LoCloud support portal for information on a test user account which is necessary to access and create a new vocabulary.

***

[1] https://support.locloud.eu/MORE (26 January 2015)

[2] http://www.w3.org/2004/02/skos/ (26 January 2015)

[3] http://www.niso.org/schemas/iso25964/ (26 January 2015)

[4] http://www.vocabularyserver.com/ (26 January 2015)

[5] http://en.wikipedia.org/wiki/Web_service (26 January 2015)

[6] http://en.wikipedia.org/wiki/Application_programming_interface (26 January 2015)

LoCloud and MINT

The MINT service is a web-based platform designed and developed by NTUA (the National Technical University of Athens, a partner in LoCloud) to facilitate the aggregation of digital cultural heritage content and metadata in Europe.

The service covers all the steps of aggregation workflows, from the ingestion, mapping and aggregation of metadata records to the implementation of a variety of remediation approaches for the resulting repository. The platform offers users an organisation management system that enables the deployment and operation of different aggregation schemes (thematic or cross-domain, international, national or regional) and corresponding access rights. Registered organisations can upload their metadata records (via HTTP, FTP or OAI-PMH) in XML or CSV serialisation in order to manage, aggregate and publish their collections.

A reference metadata model serves as the aggregation schema to which the ingested (standard or proprietary) schemata are aligned. Users can define their metadata crosswalks with the help of a visual mapping editor for the XSL language. The mapping is performed with simple drag-and-drop or input operations, which are then translated into the corresponding code. The mapping editor visualises both the input and the target XSD in an intuitive interface that provides access to, and navigation of, the structure and data of the input schema, as well as the structure, documentation and restrictions of the target one. It supports string manipulation functions for input elements in order to perform 1-n and m-1 mappings (with the option of concatenation or element repetition) between the two models. Additionally, structural element mappings are allowed, as well as constant or controlled value (target schema enumerations) assignment, conditional mappings (with a complex condition editor) and value mappings between input and target value lists. Mappings can be applied to ingested records, edited, downloaded and shared as templates between users of the platform.
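
The kinds of mapping listed above can be made concrete with a short Python analogy. MINT itself generates XSLT from the visual editor; the sketch below is not that generated code, and every field name, value list and default in it is hypothetical.

```python
def apply_mapping(source):
    """Map a flat source record to a hypothetical target schema."""
    target = {}

    # m-1 mapping: concatenate several input elements into one target element.
    parts = [source.get("street", ""), source.get("city", "")]
    target["place"] = ", ".join(p for p in parts if p)

    # 1-n mapping: split one input element into repeated target elements.
    keywords = source.get("keywords", "")
    target["subject"] = [k.strip() for k in keywords.split(";") if k.strip()]

    # Value mapping between an input value list and a target value list
    # (the "TEXT" fallback is an arbitrary choice for this sketch).
    type_map = {"foto": "IMAGE", "film": "VIDEO"}
    target["type"] = type_map.get(source.get("kind", "").lower(), "TEXT")

    # Constant value assignment.
    target["provider"] = "Example Museum"

    # Conditional mapping: only emit a date when the input has one.
    if source.get("year"):
        target["date"] = source["year"]
    return target


source = {"street": "Hauptplatz 1", "city": "Graz",
          "keywords": "church; baroque", "kind": "Foto"}
print(apply_mapping(source))
```

Keeping each rule as a separate, declarative step mirrors why such mappings can be shared as templates: the rules describe the crosswalk, not a particular record.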

Preview interfaces present the steps of the aggregation to users, including the current input XML record, the XSLT of their mappings, the transformed record in the target schema, subsequent transformations from the target schema to other models of interest (e.g. Europeana’s metadata schema), and available HTML renderings of each XML record. Users can transform their selected collections using complete and validated mappings in order to publish them in available target schemas for the required aggregation and remediation steps.

The MINT platform has been deployed for a variety of aggregation workflows corresponding to the whole or parts of the backend services. Specifically, it has served the aggregation of a significant amount of museum content for Europeana through the ATHENA project, which ingested and aligned to the LIDO format over 4 million items from 135 organisations. The resulting repository offers an OAI-PMH interface presenting the records in the Europeana Semantic Elements schema (ESE). The use of a reference model allowed rapid support for the updated ESE versions that were introduced during the project (2008-2011), with minimal input from providers. The users’ effort to align their data to an adopted domain model also motivated them to update their collection management systems and improve the quality of their annotations in order to take advantage of a well-defined, machine-understandable model and, subsequently, to control and enrich their organisation’s contribution and visibility through the aggregator and Europeana.

The MINT ingestion platform used in the LoCloud project is meant for large-scale ingestion of metadata, with the final aim of delivering to Europeana a significant amount of content from small and medium cultural institutions. The development of MINT started within the ATHENA project, when the NTUA team integrated all the necessary components for ingesting, mapping and publishing metadata to Europeana into a common technology platform, and it evolved through its use in other Europeana-feeder projects such as Linked Heritage, EuScreen, ECLAP, Carare, Europeana Fashion, Europeana Photography and others. The MINT platform provides content holders with the ability to perform the required mapping of their own metadata schemas into LIDO, Carare2.0 and EDM. It enables the ingestion of metadata from multiple sources, the mapping of the imported records to a target metadata schema and the transformation and storage of the metadata in a repository. Although its deployment is also guided by expediency, the system has been developed using established tools and standards and embodying best practices, in order to present familiar content provider procedures in a way that is intuitive and transparent for newcomers as well.

Vassilis Tzouvaras
Senior Researcher
National Technical University of Athens

Micro-services and a lightweight digital library

LoCloud will implement a series of micro-services intended to improve the experience of Europeana’s users and to benefit content-providing institutions.

Currently, work is in progress on the development of an experimental application to enable local cultural institutions to collaborate in the development of multilingual vocabularies for local history and archaeology. Moreover, some of the technical partners are implementing Natural Language Processing (NLP) tools to analyse and enrich the metadata being provided to Europeana, and are developing and testing an application that captures and enriches the available metadata for content uploaded to Wikimedia. The idea behind this effort is to test the potential of Wikimedia for crowdsourcing projects and as a means for small institutions and individuals to capture local heritage.

A prototype geo-location tool is being developed and will be made available to partners for testing in the coming months. This service will offer content providers the opportunity to use a combination of local place name data and spatial filtering mechanisms to assign geo-location coordinates to a given digital object, as well as to improve the accuracy of existing geo-coding. Moreover, the spatial metadata will be useful for detailed navigation, for example on mobile devices, offering mobile users the chance to explore cultural content relevant to the place they are in.
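
The combination of place name lookup and spatial filtering can be sketched in a few lines of Python. This is an illustration only, not the prototype's design: the gazetteer entries and coordinates are invented, and a simple bounding box stands in for whatever spatial filter the tool will use.

```python
# Hypothetical local gazetteer with invented coordinates:
# (place name, latitude, longitude).
GAZETTEER = [
    ("Oldbridge", 49.35, 8.14),
    ("Oldbridge", 50.73, 12.00),
    ("Millfield", 49.39, 11.36),
]


def geolocate(name, bbox, gazetteer=GAZETTEER):
    """Return coordinates of entries matching the name inside the bounding box.

    bbox is (min_lat, min_lon, max_lat, max_lon).
    """
    min_lat, min_lon, max_lat, max_lon = bbox
    return [
        (lat, lon)
        for place, lat, lon in gazetteer
        if place.lower() == name.lower()
        and min_lat <= lat <= max_lat
        and min_lon <= lon <= max_lon
    ]


# The name alone is ambiguous; the provider's region narrows it down.
print(geolocate("oldbridge", (49.0, 8.0, 50.0, 9.0)))
```

The spatial filter is what resolves ambiguity: a name that occurs several times in the gazetteer yields one candidate once the search is limited to the provider's region.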

Other services under development are an historic place names service, metadata enrichment services allowing a wide range of connections between items in a collection, and a Wikimedia application which will enable the integration of Wikimedia services with the LoCloud infrastructure.

Finally, work has progressed with regard to the development of a Lightweight Digital Library System. A prototype will hopefully be ready before the end of 2014 when it will undergo a testing phase by the LoCloud content provider partners. This tool will be useful for those small institutions that currently do not have a digital archive and therefore need a user-friendly system for cataloguing their digital content and metadata.

Reviewing cloud computing for LoCloud

During the first five months of the LoCloud project, a review of cloud computing was conducted by a working group consisting of The Danish Agency for Culture, Rijksdienst voor het Cultureel Erfgoed in the Netherlands, The Spanish Ministerio de Educación, Cultura y Deporte, Vilniaus Universitas in Lithuania, Universitaet Duisburg-Essen in Germany and Univerzita Komenskeho v Bratislave in Slovakia.

On the basis of the review, an analytical report has been produced as the first deliverable of the project, which will be made available from the project web site, after approval by the European Commission.

The purpose of this report is to monitor the state of the art of cloud computing and to assess aspects of the cloud relevant to the needs of the project and of small and medium-sized institutions. The report is intended to inform content providers in their further action planning. It is based primarily on desk research and analysis of the available literature.

The first section of the report offers a general description of cloud computing, the different kinds of infrastructure and models of service available, and the advantages and potential risks associated with the technology.

The second section offers an introduction to the uptake of cloud computing by small and medium-sized enterprises in the EU and the barriers that exist. It also presents a brief overview of European policy regarding cloud computing, and an analysis of the potential for cloud computing in the heritage sector.

In the third and final section, special attention is paid to the needs of the LoCloud project and to small and medium sized cultural institutions.

The findings of the review are summarised below.

Cloud computing has become ubiquitous, but the concept has no strict definition. Ideally, cloud computing is meant to turn computing into a utility like water or power. Elasticity, availability, improved resource utilisation and support for multiple tenants are key features of the concept. There are three main models of service: Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS).

Cloud computing may aid heritage institutions with its oft-cited benefits such as cost effectiveness, quick deployment and access to resources beyond the abilities of individual small institutions. Developers of cloud-based services in the heritage sector should distinguish between the three user groups: content providers & aggregators, the general public, and scholars.

Though cloud computing is still emerging, it has received a stamp of approval: the European Commission has adopted a cloud computing strategy based on reports from expert working groups and open consultations. The strategy was adopted in September 2012 and is part of the ‘Digital Agenda for Europe’.

There is high awareness and willingness to participate in cloud-based development from the heritage institutions and agencies voicing their opinion in this report. The barriers to participating cited are mainly lack of knowledge and skills, trust and legal issues. The main legal obstacle is the fact that many institutions are charged with the governance of their data and there will often be restrictions as to where that data may be placed and whom it may be given to. It lies at the heart of cloud computing that the customer may not know exactly where the data resides.

There are a number of SaaS providers offering services for the cultural sector. Some of the commercial vendors of collections management systems offer cloud-based versions of their software, and in the library domain OCLC offers a number of relevant services. However, none of these come with plug-in aggregation tools for Europeana.

There is probably still a need for online tools with a very low barrier to entry which are suited to the needs (and budgets) of smaller local and community museums. This is the window of opportunity for the LoCloud project. LoCloud builds on past successful projects such as Europeana Local and CARARE and aims to bring the benefits of cloud computing especially to small and medium-sized cultural institutions, to aid them in aggregating their data to Europeana.

Beyond the space: the LoCloud historical place names service

The Faculty of Communication of Vilnius University, in cooperation with the Digital Curation Unit of the Institute for the Management of Information Systems of Athena Research and Innovation Centre in Information Communication & Knowledge Technologies, Angewandte Informationstechnik Forschungsgesellschaft mbH, Universidad del País Vasco and Javni Zavod Republike Slovenije za Varstvo Kulturne Dediscine, and building on the results of the CARARE project, is developing an application to enable local cultural institutions to collaborate in the development of an historical place name microservice.

Why historical place names?

Nowadays, space and time are considered to be the most important dimensions of reality. Much attention is paid to their scientific analysis in physics and astronomy. These dimensions are also very important in research on cultural heritage, in understanding its role in contemporary society and in its use in cultural industries. Historical space and time are important aspects of the life cycle of a cultural heritage object, since they help to identify, interpret and communicate that object and/or the ideas attached to it. Moreover, the dating of sources and their association with certain geographical spaces allow for further historical interpretations.

Historical place names (HPN) are the point where past time and space meet. HPN are place names that exist in history (as opposed to contemporary place names) and are fixed in historical sources. An HPN is considered to be a place appellation which may be used to refer to several places, because its application may change over time (similar to E48 in CIDOC-CRM). The place as an object (similar to E53 in CIDOC-CRM, excluding movable objects), on the other hand, is determined as a GIS-defined immovable geographic object: a point, polygon or line (such as landscapes, inhabited places, buildings, natural objects (mountains, rivers, etc.), administrative areas, etc.). A place name can be understood as an historical identifier for several places (with the same meaning as E4 in CIDOC-CRM) and/or as a kind of immovable heritage (“non-material products of our minds”, e.g. E28 Conceptual Object in CIDOC-CRM).
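
The distinction between a place and its appellations can be illustrated with a small Python data model. This is a sketch under stated assumptions, not the HPN Thesaurus schema: the class and field names are hypothetical, the place names are invented, and validity is reduced to a simple range of years.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    """An immovable geographic object (compare E53 in CIDOC-CRM)."""
    place_id: str
    geometry_type: str  # "Point", "Polygon" or "LineString"


@dataclass(frozen=True)
class PlaceName:
    """A historical appellation attached to a place for a span of years."""
    name: str
    place_id: str
    year_from: int
    year_to: int


def places_named(name, year, names):
    """Return the ids of places that carried the given name in the given year."""
    return [n.place_id for n in names
            if n.name == name and n.year_from <= year <= n.year_to]


PLACES = {"p1": Place("p1", "Point"), "p2": Place("p2", "Polygon")}
NAMES = [
    PlaceName("Oldbridge", "p1", 1200, 1500),
    PlaceName("Oldbridge", "p2", 1501, 1800),
    PlaceName("Newbridge", "p1", 1501, 1900),
]

for place_id in places_named("Oldbridge", 1600, NAMES):
    print(place_id, PLACES[place_id].geometry_type)
```

Modelling names and places as separate records is what allows one name to point to different places at different times, and one place to carry several names.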

The transcoding of reality from an analogue to a digital system during heritage digitisation affects how HPN used in the real world are applied to an artificial system. In this way HPN become a link between reality and virtuality, ensuring the quality of digitisation, the interoperability of reality and virtuality, internal interoperability within the information system and external interoperability between several systems, as well as efficient communication of digital data.

On the other hand, when interpreting and using the space of a certain period, it is important to take into account the invisible “human factor” – the people who lived in particular historical periods – which we can call “historical or cultural multilingualism”. This can be defined as terminological differences in the common language determined by the cultural differences of various nations. According to the communicative model of C. S. Peirce, the pioneer of the American school of semiotics, a term (in our case an HPN) is a conventional sign which is developed by the interpretant in his mind when perceiving the object of reality. So, in terms of the communication of meanings, HPN are the work of different human groups intended to name the same object of reality (the place). By contrast, miscommunication and non-interoperability occur at the level of signs (words) rather than objects. Scientists and politicians of the 19th and 20th centuries brought additional confusion into the understanding of historical space through nationalistic historical narratives, thus having a huge impact on history and social geography. Computing technology, based on algorithms and binary code, provides possibilities for maximising objectivity in the geographical representation of reality. Paradoxically, the comprehension of space based on historical narratives is much stronger than that based on the ICT discourse. When digitising cultural heritage, we link it less to modern geographical realities than to the historical space of the 19th century, which was marked by the myths and narratives of nationalism.

What does the HPN thesaurus contain?

Two methodological models can be employed for the digitisation of historical geographical and chronological data: the “text oriented” model and the “object oriented” model. The “text oriented” model was created at the early stage of the computerisation of cultural heritage. It is based on a “hierarchical” paradigm and usually describes the world via hierarchically organised controlled vocabularies of proper names. Despite the evident significance of the “text oriented” model for the development of the digitisation of cultural heritage, it is also necessary to note its essential limitations. The actual world (reality) is continuous and is composed of interconnected objects (not concepts) that are organised according to a non-hierarchical structure. The “object oriented” model proposes a different point of view. This model was created during the modern stage of the computerisation of cultural heritage. It is based on a “network” paradigm and usually describes the world via a network-organised ontology of objects. The ontological “object oriented” model is more closely connected with reality: real place-time and place-time appellations are described as separate classes of reality.

The HPN microservice will be developed on the basis of HPN Thesaurus, which is intended for aggregation, storage and long-term preservation of historical geo-information. The principal schema of the HPN microservice is presented in Fig. 1.

The HPN Thesaurus is a controlled vocabulary that can be used to aggregate and preserve historical geo-information and to improve interoperability and semantics within historical geo-information, between historical geo-information and contemporary geo-data, and in access to information about cultural heritage. The HPN Thesaurus can be used as a data standard at the point of documentation or cataloguing (as a controlled vocabulary or authority for the cataloguer or indexer, providing preferred names/terms and synonyms for places, structure and classification schemes); as a browsing assistant in the CARARE and LoCloud databases and in Europeana (a knowledge base that shows semantic links and paths between historical and contemporary places); and as a research tool (information and contextual knowledge about historical place names and places).

The HPN Thesaurus is a qualification of the CARARE metadata schema at the conceptual level (“Heritage Asset Identification Set” – global type “Spatial” – “Historical name”). The strength of the HPN Thesaurus lies in its ability to collect the full range of historical geo-information about digitised cultural heritage, born-digital objects, related events and their representations, and to support the full range of HPN micro-services and use cases. The implementation of the HPN Thesaurus and the HPN micro-services is closely connected with the creation and implementation of the LoCloud Geolocation enrichment services (D3.3) and Vocabulary services (D3.4), due to be released in Autumn

How will it work?

The HPN shall perform the following functions:

1. Reliably transfer historical geo-data from a series of local and international databases, information systems and/or providers to the HPN Thesaurus, including the possibility of providing historical geo-data manually via a user interface. The system will support the semantic mapping and transfer of historic geo-data from local systems to the LoCloud HPN Thesaurus. The HPN geo-data will be imported in the GeoJSON, JSON, CSV, SQL and TXT formats. They will then be matched with other historical geo-data in the HPN Thesaurus, using an automatic HPN data import tool. After the matching, a manual quality check will be carried out and new HPN will be added to the HPN Thesaurus. A similar procedure is used for other enrichment scenarios. The scenarios for the enrichment of the HPN Thesaurus are presented in Fig. 2. On the one hand, this process will ensure interoperability between different historical geo-data sets. On the other hand, it will create tools for enabling crowdsourcing and the wiki paradigm in the HPN field.

2. Analyse and enrich the HPN data in the metadata sets of provided objects. The analysis and enrichment tool will be based on an integrated algorithm that normalises and reconciles similar place names by estimating similarities between names and geographic coordinates. It could rank accuracy using a dedicated algorithm: for example, if a name and the relevant coordinates are exact, the match is ranked at 100%; if the name is exact but the coordinates deviate by 50%, it is ranked at 75%; if the name is not exact and the coordinates do not fall within the allowed deviation, it is ranked at 0%. A user interface will enable each LoCloud partner to view, correct and quality-check the results of the reconciliation algorithm. Each partner will be able to log in and view a list colour-coded from green to red according to the percentage of accuracy.
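
One scoring rule that reproduces the three example values given in function 2 is a plain average of a name score and a coordinate score. The Python sketch below shows that reading; it is an assumption, not the project's algorithm. In particular, the all-or-nothing name comparison, the linear coordinate score and the colour thresholds are invented for illustration.

```python
def name_score(a, b):
    """1.0 for an exact (case-insensitive) name match, otherwise 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def coordinate_score(deviation):
    """Fall linearly from 1.0 (no deviation) to 0.0 at the allowed limit.

    `deviation` is expressed as a fraction of the allowed deviation,
    e.g. 0.5 for coordinates that deviate by 50%.
    """
    return max(0.0, 1.0 - deviation)


def accuracy(name_a, name_b, deviation):
    """Average the two partial scores and return a percentage."""
    return 100.0 * (name_score(name_a, name_b) + coordinate_score(deviation)) / 2


def colour(percentage):
    """Hypothetical thresholds for the green-to-red list."""
    if percentage >= 75.0:
        return "green"
    if percentage >= 40.0:
        return "yellow"
    return "red"


for candidate, deviation in [("Vilna", 0.0), ("Vilna", 0.5), ("Wilno", 1.0)]:
    score = accuracy("Vilna", candidate, deviation)
    print(candidate, score, colour(score))
```

A production version would more likely use a graded string similarity for names, so that spelling variants score somewhere between 0 and 1 instead of failing outright.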

Rimvydas Laužikas
Ingrida Vosyliūtė
Faculty of Communication, Vilnius University, Lithuania