Reconcile Entities

Introduction

Entity reconciliation, also called entity linking or named entity disambiguation, is the step where unique identifiers, in the form of URIs, are added to your data to represent each unique entity. The goal is to use the same identifier every time the same real-world thing is mentioned in your data, in other LINCS data, and, ideally, in linked data elsewhere on the web. Using the same identifier connects all of the statements made about that entity into a rich and informative graph.

This page covers how reconciliation fits into the data conversion workflows and the available tools, while our Reconciliation Guide gets into the details of how to reconcile entities.

For a list of all reconciliation tools that LINCS uses, see the Reconcile page.

Resources Needed

Because reconciliation is so time consuming, we recommend that you start reconciling your data as early as possible. You can follow our Reconciliation Guide to start reconciling even before you have committed to the rest of the LINCS conversion process.

Reconciliation can be completed in tandem with the other conversion steps. Use placeholder values in the conversion until your research team has finished reconciling. Once URIs have been found, they can be added to either the source data or to the converted data to replace the placeholder values. The dataset, however, will not be published by LINCS until either the research team finishes their reconciliation or LINCS and the research team come to an agreement that no more reconciliation can take place, and new URIs need to be minted for the remaining entities. Note that once the data is published, the research team can continue to enhance it, including further reconciliation.

Note

Reconciliation will always be completed by your research team because it requires domain knowledge to ensure you are choosing the correct identifiers. This is a great task for undergraduate or graduate research assistants.

The Conversion Team can offer guidance on this step and particularly with how to set up your data for reconciliation and how to merge the URIs back into your data.

Time Commitment

Reconciliation tends to be the slowest part of the conversion process. Tools that perform automated linking can speed it up, but at the cost of accuracy. That loss of accuracy is greater for the kind of data coming into LINCS, because it references obscure or historically overlooked entities that are not well represented in existing LOD sources.

LINCS’s approach is to mix automation with manual review:

  1. Start with tools that automatically suggest candidate matches for entities
  2. If possible, apply filtering based on the context for each entity in your data and the authority data to separate trustworthy suggestions from ones that need review
  3. Have students manually review the uncertain candidates
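
The three steps above can be sketched in Python. This is a minimal illustration, not LINCS's actual tooling: the `Candidate` record, its `score` field, the birth-year comparison used as context, and the 0.9 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """One suggested match for an entity (hypothetical structure)."""
    entity_id: str                        # internal identifier from your data
    uri: str                              # candidate URI suggested by a tool
    score: float                          # tool-reported similarity, assumed 0.0-1.0
    source_birth_year: Optional[int]      # birth year recorded in your data
    authority_birth_year: Optional[int]   # birth year in the authority record


def triage(candidates, min_score=0.9):
    """Split candidates into trusted matches and ones that need manual review."""
    trusted, needs_review = [], []
    for c in candidates:
        # Context check: both sources must give the same known birth year.
        context_agrees = (
            c.source_birth_year is not None
            and c.source_birth_year == c.authority_birth_year
        )
        if c.score >= min_score and context_agrees:
            trusted.append(c)
        else:
            needs_review.append(c)
    return trusted, needs_review


# Made-up candidates using placeholder example.org URIs.
candidates = [
    Candidate("person-001", "http://example.org/authority/A1", 0.97, 1850, 1850),
    Candidate("person-002", "http://example.org/authority/B2", 0.95, 1790, None),
    Candidate("person-003", "http://example.org/authority/C3", 0.62, 1901, 1901),
]
trusted, needs_review = triage(candidates)
```

Here only `person-001` is trusted automatically. The other two go to a human reviewer: one because the authority record has no birth year to compare, the other because its score is low. Which context fields and threshold make sense will depend on your data.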

With that mixed approach, you can estimate the time needed by assuming that each entity in your data will need 1-5 minutes for a human to reconcile it. This range depends on how familiar the person is with the data and whether they will need to spend time researching the entities to confirm a match.
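
As a rough worked example of that estimate, the helper below multiplies an entity count by the 1-5 minute range. The function name and the 3000-entity dataset are hypothetical.

```python
def estimate_review_hours(entity_count, min_minutes=1, max_minutes=5):
    """Return (low, high) estimates, in hours, for manually reconciling entities."""
    low = entity_count * min_minutes / 60
    high = entity_count * max_minutes / 60
    return low, high


low, high = estimate_review_hours(3000)
print(f"{low:.0f} to {high:.0f} hours")  # prints "50 to 250 hours"
```

A dataset of 3000 entities would therefore need somewhere between 50 and 250 hours of review time, which is why starting early matters.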

For large datasets, it is not always feasible to carefully reconcile all entities. Our strategy for this has been:

  1. Reconcile as much as is feasible
  2. When the data is otherwise ready for publication but reconciliation is incomplete, your team can decide with LINCS when to stop reconciling and mint URIs for the remaining unreconciled entities
  3. Once the data is published, you can slowly continue to add URIs for the unreconciled entities in ResearchSpace Review

Set Up your Data

For each workflow there are typically two options for setting up your data:

  • Use a tool that takes your data in its original format and allows you to add URIs to the source data. For example, LEAF Writer lets you annotate XML and TEI data with entity types and URIs.
    • In this case, you should not need to do any setup beyond the typical Clean Data step.
  • Use a script or query tool to pull entities out of your data, along with contextual information about those entities. Then use a tool such as VERSD or OpenRefine to find URIs for the entities. Finally, use another script or query tool to insert those URIs back into either the source data or the converted data.
    • This case will typically result in one or more spreadsheets where each row represents one entity and the columns contain contextual details about the entity. For example, you may have one internal unique identifier per row to represent a person, and then columns for their name, birth date, and death date so that you can quickly check if candidate URIs are correct.
    • LINCS typically uses custom scripts to complete this step. The Conversion Team can offer advice and sample scripts from previous conversions.

Caution

Consider enhancing your source data with internal unique identifiers for each entity. These can be temporary identifiers that will be replaced before your data is published. The benefit is that, after you extract entities from your data and reconcile them, the identifiers tell you exactly where each new URI belongs in the source or converted data.
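
The extraction option and the temporary identifiers described above can be sketched together. This is a minimal sketch, not one of LINCS's scripts: the XML layout, the `tempId` attribute, the `temp-person-NNNN` pattern, and the column names are all assumptions for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical source data; real element and attribute names will differ.
SOURCE_XML = """<people>
  <person><name>Jane Doe</name><birth>1850</birth><death>1920</death></person>
  <person><name>John Roe</name><birth>1790</birth></person>
</people>"""

FIELDNAMES = ["temp_id", "name", "birth_date", "death_date", "uri"]


def extract_entities(xml_text):
    """Tag each person with a temporary ID and build one spreadsheet row per person."""
    root = ET.fromstring(xml_text)
    rows = []
    for number, person in enumerate(root.iter("person"), start=1):
        temp_id = f"temp-person-{number:04d}"
        # Store the ID in the source so URIs can be merged back later.
        person.set("tempId", temp_id)
        rows.append({
            "temp_id": temp_id,
            "name": person.findtext("name", default=""),
            "birth_date": person.findtext("birth", default=""),
            "death_date": person.findtext("death", default=""),
            "uri": "",  # left empty; filled in during reconciliation
        })
    return root, rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text, one entity per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


root, rows = extract_entities(SOURCE_XML)
csv_text = rows_to_csv(rows)
```

The resulting CSV text can be saved to a file and opened in a spreadsheet or reconciliation tool. Because the same `temp_id` now appears in both the XML and the spreadsheet, a later script can match each row back to its element.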

Reconcile your Entities

For structured data, extract entities and their context from your source data. This process may require a custom script, but often it will be as easy as using a simplified version of your source spreadsheets or using an online tool to convert structured data into a spreadsheet. To find and confirm URIs, use VERSD if your data is bibliographic and OpenRefine otherwise. Note that OpenRefine accepts a broad range of starting file types so you may be able to skip the initial extraction step.

Particularly for small datasets, you may find it sufficient to look up URIs manually and add them directly to your source data, or to wait and add them to the converted data.

Merge Reconciled Data

The research team and Conversion Team will use a custom script or the Linked Data Enhancement API to merge the new URIs with either the cleaned version of the source data or the converted data.

The Conversion Team will mint new URIs for anything that could not be reconciled.
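
A custom merge script might look roughly like the sketch below. It does not use the Linked Data Enhancement API, and the `example.org` namespace, the UUID-based minting scheme, and the record layout are assumptions for illustration. The real namespace and minting rules are decided with the Conversion Team.

```python
import uuid

# Placeholder namespace for newly minted URIs (an assumption, not a LINCS namespace).
MINT_NAMESPACE = "http://example.org/entity/"


def mint_uri():
    """Create a new URI for an entity that could not be reconciled."""
    return MINT_NAMESPACE + str(uuid.uuid4())


def merge_uris(records, reconciled):
    """Attach a URI to every record, minting one where reconciliation found none.

    records:    list of dicts, each with a "temp_id" key
    reconciled: dict mapping temp_id to the URI chosen during reconciliation
    """
    merged = []
    minted = {}  # temp_id -> minted URI, so repeated mentions share one URI
    for record in records:
        temp_id = record["temp_id"]
        uri = reconciled.get(temp_id) or minted.get(temp_id)
        if not uri:
            uri = mint_uri()
            minted[temp_id] = uri
        merged.append({**record, "uri": uri})
    return merged, minted


# Made-up records; the second entity is mentioned twice and was never reconciled.
records = [
    {"temp_id": "temp-person-0001", "name": "Jane Doe"},
    {"temp_id": "temp-person-0002", "name": "John Roe"},
    {"temp_id": "temp-person-0002", "name": "J. Roe"},
]
reconciled = {"temp-person-0001": "http://example.org/authority/A1"}
merged, minted = merge_uris(records, reconciled)
```

Keeping the `minted` lookup means that every mention of the same unreconciled entity receives the same new URI, which preserves the goal of one identifier per real-world thing.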

Choose Vocabularies

As with reconciling entities, you need to choose vocabulary terms to use in your data and include their URIs. These terms are often used to give entities and relationships finer-grained types than the broad types that CIDOC CRM provides. Choosing appropriate terms for your project will require you to explore the terms' definitions to find ones that fit. When possible, prioritize terms that are already frequently used within LINCS data, to increase the connections and the potential for interesting queries between your data and other LINCS data.

See our Vocabularies documentation for additional background and the Vocabulary Browser to find vocabulary terms created by or used in LINCS.