Entity Reconciliation Guide

Introduction

This page covers details you should know before reconciling the entities in your data. Some of the advice is aimed at your group's lead researchers, because it addresses larger project-level decisions about URI use. Other concepts and examples are for the research assistants who do the actual reconciliation work.

LINCS’s Approach to URIs

LINCS’s strategy is to prioritize re-using existing URIs as identifiers whenever possible. We do this to enrich the linked data that exists online while limiting the number of duplicate identifiers we add to the LOD ecosystem that people need to choose between.

The principles we follow are:

  • Every entity in the LINCS triplestore has one primary URI that acts as the identifier to represent that real-world concept.
    • We try to consistently use the same URI for the same entity across all datasets.
    • This requires incoming datasets to reconcile against datasets already in LINCS and prioritize the URIs that LINCS is already using.
  • Each entity can have many owl:sameAs relationships to connect it to additional equivalent URIs from other sources.

One of the benefits of the LINCS knowledge graph is that our contributing datasets contain obscure entities that are not well represented elsewhere online. However, this creates a challenge for reconciliation, as we often cannot find external URIs for entities. In these cases, projects that have the capacity to do so will mint URIs using a project-specific namespace. As a final option, LINCS will mint URIs using the namespace http://id.lincsproject.ca/.

Basically any named thing in your data should be reconciled so that we can represent it with a URI. A few examples of things we would try to reconcile include:

  • People
  • Places
  • Companies
  • Specific objects
  • Categories of objects
  • Creative works
  • Materials
  • Abstract concepts
  • Political movements

Choosing a Source

LINCS has a selection of LOD sources, or authority files, that we tend to use. Below is a description of what we like about each of our most commonly used sources and where it fits in our order of preference. However, each project has its own priorities, which may lead to a different order of preference, and the domain of your data will affect this as well. We suggest that before you start reconciling, you investigate the authority files of interest to ensure you are comfortable connecting your data to each source. You should also look for domain-specific sources not listed here.

DBpedia

DBpedia is a good source for entities that are notable enough to have Wikipedia pages. We tend to use Wikidata before DBpedia when the same entity is in both.

GeoNames

GeoNames is our first choice source for modern geographic locations. If you cannot find a location in GeoNames then, in order of LINCS preference, try Getty TGN, VIAF, or Wikidata.
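
As a rough illustration of this order of preference, the sketch below picks one URI from a list of candidates already found for the same place. The function name and the candidate URIs are hypothetical, and the namespaces are the ones this guide lists under Valid URIs; treat it as one possible way to encode the preference, not as LINCS tooling.

```python
# Namespaces in LINCS's stated order of preference for places.
PLACE_SOURCE_PREFERENCE = [
    "https://sws.geonames.org/",        # GeoNames (first choice)
    "http://vocab.getty.edu/tgn/",      # Getty TGN
    "http://viaf.org/viaf/",            # VIAF
    "http://www.wikidata.org/entity/",  # Wikidata
]


def pick_preferred_place_uri(candidates):
    """Return the candidate from the most preferred source, or None if none match."""
    for namespace in PLACE_SOURCE_PREFERENCE:
        for uri in candidates:
            if uri.startswith(namespace):
                return uri
    return None


# Hypothetical candidates for one place (the identifiers are made up).
candidates = [
    "http://www.wikidata.org/entity/Q0000",
    "http://vocab.getty.edu/tgn/0000000",
]
preferred = pick_preferred_place_uri(candidates)  # the TGN URI wins over Wikidata
```

The URIs that are not chosen do not need to be discarded; they can still be attached to the entity as owl:sameAs links.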

Getty

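There are 4 separate datasets in Getty that you can search through for different types of entities. Three of them are referenced elsewhere in this guide: the Art & Architecture Thesaurus (AAT), the Thesaurus of Geographic Names (TGN), and the Union List of Artist Names (ULAN).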

For person entities, LINCS uses Getty as the second choice after VIAF when the people are likely to be artists.

LINCS

If you plan on publishing your converted data with LINCS, then you will need to reconcile against existing LINCS data. This helps us prevent having the same entity in our triplestore under multiple URIs.

LOC

The Library of Congress (LOC) is a good place for concepts and types. There are many different groups of terms within LOC so you will have to browse to find relevant groupings.

VIAF

The Virtual International Authority File (VIAF) is our first choice for bibliographic records as well as people and companies connected to those records—like authors and publishers.

VIAF also contains geographic locations, but we more often use GeoNames for those.

Wikidata

Wikidata contains more than 100 million entities covering a large variety of types. We often use Wikidata when we cannot find an entity in a domain-specific source. Wikidata comes with the caveat that it is community-created, so the way that entities are defined is subject to frequent changes.

Because Wikidata is so widely used, it is a good place to find out about other domain-specific sources. If you search for an entity on Wikidata and scroll to the “identifiers” section at the bottom of the page, you can see URIs for that entity from other authority files. This can help you learn about other authorities relevant to your domain. However, be sure to research those newly found authority files and ensure that the identifiers they provide are valid LOD URIs.

Wikidata acts as a bridge between many authority files. Once you have one external URI, from VIAF for example, you can query Wikidata using SPARQL to find the Wikidata URI that corresponds to a VIAF URI. This can help you add additional owl:sameAs links to your data or switch the authority of preference.
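
A minimal sketch of such a lookup is shown below. It only builds the SPARQL query text; sending it to the Wikidata Query Service is left out. It assumes the Wikidata property P214 (VIAF ID) is the link you want to follow. The function names and the VIAF number are hypothetical.

```python
import re


def viaf_id_from_uri(viaf_uri: str) -> str:
    """Extract the numeric ID from a VIAF URI such as http://viaf.org/viaf/12345."""
    match = re.fullmatch(r"https?://viaf\.org/viaf/(\d+)/?", viaf_uri.strip())
    if match is None:
        raise ValueError(f"Not a recognizable VIAF URI: {viaf_uri!r}")
    return match.group(1)


def build_viaf_lookup_query(viaf_uri: str) -> str:
    """Build a SPARQL query that finds Wikidata items carrying this VIAF ID."""
    viaf_id = viaf_id_from_uri(viaf_uri)
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT ?item WHERE {\n"
        f'  ?item wdt:P214 "{viaf_id}" .\n'  # P214 is Wikidata's VIAF ID property
        "}"
    )


# Hypothetical VIAF URI used purely for illustration.
query = build_viaf_lookup_query("http://viaf.org/viaf/12345")
print(query)
```

Each ?item returned by the query is a Wikidata entity URI that can be added as an owl:sameAs link or used in place of the VIAF URI.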

Reconciling Vocabulary Terms

The concepts and techniques for reconciling entities in your data apply to vocabulary terms as well. Whenever possible, choose a vocabulary term that is already used frequently in LINCS data. This will help connect your data to other datasets. If multiple vocabulary terms match yours, you can typically use more than one.

Specific vocabularies that are already in use in LINCS data include:

  • DBpedia
  • GeoNames
  • Getty Art & Architecture Thesaurus (AAT)
  • Getty Thesaurus of Geographic Names (TGN)
  • Getty Union List of Artist Names (ULAN)
  • Homosaurus
  • Library of Congress Subject Headings
  • Library of Congress Names
  • MARC List for Languages
  • MARC Relators
  • Nomenclature for Museum Cataloging
  • VIAF
  • Wikidata

See our Vocabularies documentation for additional background and the Vocabulary Browser to find vocabulary terms created by or used in LINCS.

Valid URIs

If you would like to reconcile your data against a source not listed in our documentation, first check that it is a source of linked data and that it hosts a permanent URI for each entity. If you are unsure about using a source, check with the Conversion Team.

When using URIs from any source, make sure that the namespace and formatting of each URI match exactly what the source lists. This should be the permanent link for the entity, and not necessarily the link you see in the address bar of your web browser.

Here are the namespaces of the sources we frequently use, with common errors listed:

DBpedia

  • http://dbpedia.org/resource/
    • Not https://dbpedia.org/page/

GeoNames

  • https://sws.geonames.org/
    • https not http

Getty

  • These start with http://vocab.getty.edu/ followed by the vocabulary name. Be careful not to use the page URIs that start with http://vocab.getty.edu/page/.
  • AAT
    • http://vocab.getty.edu/aat/
  • ULAN
    • http://vocab.getty.edu/ulan/
  • TGN
    • http://vocab.getty.edu/tgn/

LOC

  • There are multiple valid namespaces within LOC data, typically beginning with http://id.loc.gov/authorities/
    • The URI should be listed under “URIs” within an entity’s web page
    • Use http, not https, and do not include .html at the end

VIAF

  • http://viaf.org/viaf/
    • You can find this listed as Permalink within a record’s web page
    • There should not be a trailing /

Wikidata

  • http://www.wikidata.org/entity/
    • http not https
    • /entity/ not /wiki/
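
The common errors above can be caught mechanically. The following sketch rewrites a URI copied from a browser into the form listed for each source. The rule table and function name are hypothetical, the identifiers in the example are made up, and the patterns cover only the mistakes named in this section, so anything unrecognized is returned unchanged for manual review.

```python
import re

# (pattern, template) pairs; each template rebuilds the URI in the listed form.
_RULES = [
    # DBpedia: /resource/ not /page/, http not https
    (re.compile(r"https?://dbpedia\.org/(?:page|resource)/(.+)"),
     "http://dbpedia.org/resource/{0}"),
    # GeoNames: https not http
    (re.compile(r"https?://sws\.geonames\.org/(.+)"),
     "https://sws.geonames.org/{0}"),
    # Getty: drop the /page/ segment
    (re.compile(r"https?://vocab\.getty\.edu/(?:page/)?(aat|ulan|tgn)/(.+)"),
     "http://vocab.getty.edu/{0}/{1}"),
    # LOC: http not https, no .html suffix
    (re.compile(r"https?://id\.loc\.gov/(.+?)(?:\.html)?"),
     "http://id.loc.gov/{0}"),
    # VIAF: no trailing slash
    (re.compile(r"https?://viaf\.org/viaf/(\d+)/?"),
     "http://viaf.org/viaf/{0}"),
    # Wikidata: http not https, /entity/ not /wiki/
    (re.compile(r"https?://www\.wikidata\.org/(?:wiki|entity)/(Q\d+)"),
     "http://www.wikidata.org/entity/{0}"),
]


def normalize_uri(uri: str) -> str:
    """Return the URI in the namespace form listed above, or unchanged if unknown."""
    uri = uri.strip()
    for pattern, template in _RULES:
        match = pattern.fullmatch(uri)
        if match:
            return template.format(*match.groups())
    return uri


print(normalize_uri("https://www.wikidata.org/wiki/Q36774"))
print(normalize_uri("https://dbpedia.org/page/London"))
```

A check like this is a convenience only. It does not confirm that the entity exists, so it is still worth opening a sample of the resulting URIs.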

Non-LOD Sources

If you cannot find an LOD source for a URI, but can find references to the entity in documents on the web, there are ways we can include those in your conceptual mapping.

For example, the Map of Early Modern London (MoEML) project included the Wikipedia pages in which its entities were mentioned:

<https://mapoflondon.uvic.ca/MORE14>
    rdf:type crm:E21_Person ;
    rdfs:label "Dame Alice More (née Harpur)"@en ;
    crm:P129i_is_subject_of <https://en.wikipedia.org/wiki/Alice_More> .

<https://en.wikipedia.org/wiki/Alice_More>
    rdf:type crm:E73_Information_Object ;
    crm:P2_has_type <http://www.wikidata.org/entity/Q36774> .

Minting URIs

When a project cannot find an existing URI for an entity, does not approve of the URIs it finds, or does not have the capacity to reconcile all entities, we can mint new URIs instead.

The first option is for the project or data owner to mint and host new URIs. It is then the responsibility of the data owner to maintain those URIs, keeping them stable and online.

Examples of namespaces that contributing projects used to mint their own URIs include:

  • https://mapoflondon.uvic.ca/
  • https://personography.1890s.ca/
  • https://anthologiagraeca.org/api/

If your project is not able to commit to minting and hosting URIs, then LINCS can mint them for you under the namespace http://id.lincsproject.ca/.

Note that your data is going to end up with entities with the namespace http://id.lincsproject.ca/ because CIDOC CRM introduces intermediate nodes for events that do not have URIs elsewhere in LOD sources.

URIs in your Data

When multiple LINCS projects each use the same URI as an entity’s primary identifier, people will be able to easily view the merged version of those records in ResearchSpace and query each dataset to see the individual contributions. This shared use of primary URIs helps with the “linked-ness” of the linked data.

With that said, we do have projects that choose to use their own identifiers as the primary identifiers—even if the same entity is already in LINCS—so that their whole dataset is consistent. That choice is ultimately up to your research team.

Here are some examples of how URIs found through reconciliation can be added to your data, using Map of Early Modern London (MoEML) data as an illustrative sample:

Option 1

When we find an external URI for an entity, we use that as the primary identifier for that entity in LINCS. We then have two sub-choices for how the project specific URI could be used:

  1. Project URIs become the objects of owl:sameAs relationships:
<http://www.wikidata.org/entity/QYYY> owl:sameAs <https://mapoflondon.uvic.ca/XXX> .
  2. Project URIs become identifiers for the entities:
<http://www.wikidata.org/entity/QYYY> crm:P1_is_identified_by <http://id.lincsproject.ca/AAA> .
<http://id.lincsproject.ca/AAA> rdf:type crm:E42_Identifier .
<http://id.lincsproject.ca/AAA> crm:P190_has_symbolic_content "FLEM1" .
<http://id.lincsproject.ca/AAA> crm:P2_has_type <http://id.lincsproject.ca/BBB> .
<http://id.lincsproject.ca/BBB> rdfs:label "Map of Early Modern London Project Identifier" .

When there is no existing URI for an entity from any authority source, we have two more choices:

  1. Use project URIs as the primary identifier
  2. Mint a LINCS URI and use that as the primary identifier and then connect to the project URI using one of the choices above

Option 2

Every person in the data would have a MoEML URI as the primary identifier for that entity.

We would connect those entities to their reconciled value using owl:sameAs:

<https://mapoflondon.uvic.ca/XXX> owl:sameAs <http://www.wikidata.org/entity/QYYY> .
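
The difference between the owl:sameAs choice in Option 1 and Option 2 is only which URI sits in the subject position. The sketch below emits the link as an N-Triples line for either choice. The function name and the `primary` parameter are hypothetical, and XXX and QYYY are the same placeholders used in the examples above.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(project_uri: str, external_uri: str, primary: str = "external") -> str:
    """Return one N-Triples line linking the primary URI to the other URI."""
    if primary == "external":
        # Option 1: the external URI is the primary identifier.
        subject, obj = external_uri, project_uri
    elif primary == "project":
        # Option 2: the project URI is the primary identifier.
        subject, obj = project_uri, external_uri
    else:
        raise ValueError("primary must be 'external' or 'project'")
    return f"<{subject}> <{OWL_SAME_AS}> <{obj}> ."


project = "https://mapoflondon.uvic.ca/XXX"
external = "http://www.wikidata.org/entity/QYYY"
print(same_as_triple(project, external, primary="external"))
print(same_as_triple(project, external, primary="project"))
```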

De-duplication within your Data

As explained in the Reconcile Entities step, we recommend that you enhance your source data with internal unique identifiers for each entity before starting to reconcile. These can be temporary identifiers that will be replaced before your data is published. The benefit is that if you extract entities from your text and reconcile them separately, the internal identifiers tell you exactly where each new URI belongs in the source or converted data.

Depending on your approach, having these internal identifiers can also help you de-duplicate your own data before you start reconciling it against external sources. If you do not de-duplicate your data first, you will still effectively de-duplicate it as long as you assign the same external URI to every occurrence of an entity. The downside is that reconciliation may take longer, because you will be looking up each instance of the same thing.
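
One simple way to combine internal identifiers with de-duplication is sketched below. It assumes that two mentions refer to the same entity when their labels match after trimming whitespace and ignoring case and they share a type. That assumption is naive, since different people can share a name, so the groups should be reviewed by hand. The function names, the identifier format, and the sample mentions are all hypothetical.

```python
def normalize_label(label: str) -> str:
    """Collapse whitespace and ignore case so trivial variants compare equal."""
    return " ".join(label.split()).casefold()


def assign_internal_ids(mentions, prefix="ent"):
    """Give every (label, type) mention an internal ID, reusing IDs for duplicates.

    Returns a list of IDs parallel to `mentions` and the key-to-ID mapping.
    """
    ids_by_key = {}
    assigned = []
    for label, entity_type in mentions:
        key = (normalize_label(label), entity_type)
        if key not in ids_by_key:
            ids_by_key[key] = f"{prefix}-{len(ids_by_key) + 1:04d}"
        assigned.append(ids_by_key[key])
    return assigned, ids_by_key


mentions = [
    ("Alice More", "person"),
    ("  alice   more", "person"),  # same entity, different spacing and case
    ("London", "place"),
]
assigned, ids_by_key = assign_internal_ids(mentions)
print(assigned)
```

With this in place, each distinct internal identifier only has to be looked up once, and the resulting URI can then be written back to every mention that carries that identifier.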

Placeholder URIs

Projects often do not have reconciliation complete by the time we start setting up the code or tools to implement the rest of the conversion. In these cases, we introduce placeholder URIs that can be swapped out in the final data once reconciliation is complete. LINCS, for example, uses the namespace http://temp.lincsproject.ca/ for placeholders.
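
The swap itself can be a simple lookup once reconciliation results are available. The sketch below assumes triples are held as plain (subject, predicate, object) string tuples and that a mapping from placeholder to final URI has been produced elsewhere. The function names and the sample URIs are hypothetical, and this is not a description of the actual LINCS conversion code.

```python
TEMP_NAMESPACE = "http://temp.lincsproject.ca/"


def replace_placeholders(triples, final_uris):
    """Return new triples with placeholder subjects and objects swapped for final URIs."""
    def swap(term):
        return final_uris.get(term, term)
    return [(swap(s), p, swap(o)) for s, p, o in triples]


def remaining_placeholders(triples):
    """List placeholder URIs that still need a final URI."""
    return sorted({term for triple in triples for term in triple
                   if term.startswith(TEMP_NAMESPACE)})


triples = [
    ("http://temp.lincsproject.ca/person1", "rdfs:label", "Example Person"),
    ("http://temp.lincsproject.ca/place1", "rdfs:label", "Example Place"),
]
final_uris = {"http://temp.lincsproject.ca/person1": "http://www.wikidata.org/entity/QYYY"}

updated = replace_placeholders(triples, final_uris)
print(remaining_placeholders(updated))
```

Running a check like `remaining_placeholders` before publishing helps confirm that no temporary URIs remain in the final data.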