Entity Matching Practice

Introduction

There are three steps to matching entities in your data to authority files:

1) Gather context about your entity

Review your data to see what facts you know about the entity you are trying to match. This could come from structured data like metadata records or sentences that mention the entity from a text document.

2) Find candidate matches

There are many approaches to this step. You may use a tool like OpenRefine that suggests potential matches. If the tool does not suggest a correct match, or you are working with an authority file that the tool does not connect to, then you will likely need to go to the external data source directly and use its search functionality.

3) Review the candidates and decide if one is a match

You need to review the context you have about the entity in your source data and compare it to the information about the candidate in the external data source. Does the evidence match well enough to claim they represent the same entity?

Examples

Let’s walk through examples for different types of entities. Try to think of your answer to each problem before looking at our suggestions.

Persons

Wikidata

Charles O’Brien

Entity

Name from Metadata:

  • Charles O’Brien

Excerpts:

“September 1998 — Hilary Mantel’s eighth novel, The Giant, O’Brien, published by Fourth Estate, traced the hapless career of a historical figure, the Irish Giant, a man who measured a little under eight feet tall.”

“Charles Byrne travelled from Ireland to London at the end of the eighteenth century to put himself on display as a freak or monster. Though he took ingenious steps to try to keep his body out of the hands of dissecters, he was indeed dissected in the end by the famous Scottish surgeon and scientist Dr John Hunter.”

Source:

Candidates

We start by searching Wikidata using the person name in our metadata, Charles O’Brien. We do a quick scan of the results and the Wikidata description of each one. We find a British colonial governor, a cricketer (1921-1980), and a composer (1882-1968). None of the candidates look correct.

Let's move on and search for the name mentioned in the text, Charles Byrne. We see an Irish painter, an animator who worked at Disney, and an Irish entertainer. That one is promising, so we will visit the page and investigate.
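The name searches described above can also be run programmatically. The following is a minimal sketch, assuming you want to call the public Wikidata search API (the wbsearchentities action) from Python. The function names, the User-Agent string, and the result limit are our own illustrative choices, not part of any tool described here.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(name, language="en", limit=10):
    """Build a wbsearchentities request URL for a name search."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def search_candidates(name):
    """Return (id, label, description) for each hit. Needs network access."""
    request = urllib.request.Request(
        build_search_url(name),
        # Illustrative User-Agent; replace with one that identifies your project.
        headers={"User-Agent": "entity-matching-practice-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in data.get("search", [])
    ]
```

Calling search_candidates("Charles O'Brien") and then search_candidates("Charles Byrne") should give you the same kind of list we scanned by hand: an identifier, a label, and a short description for each candidate. You still need to read the descriptions and decide which candidate deserves a closer look.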

Review

Let's review the evidence in Wikidata for this candidate: http://www.wikidata.org/entity/Q1063865

Date of death:

  • Listed as 1783, which would correctly make him a historical figure in September 1998 when Hilary Mantel wrote about him.

Medical condition:

  • Gigantism, which is strong evidence for a match.

Occupation:

  • Circus performer, which is again strong evidence for a match.

Notable people’s Wikidata pages often link to their Wikipedia pages. In this case, the linked page explicitly references Hilary Mantel’s novel The Giant, O’Brien.

Match Confirmed

Mary Fane

Entity

Name from Metadata:

  • Mary Fane

Excerpts:

“Frances Neville, Baroness Abergavenny, died, having entrusted her collection of prayers to her daughter on her deathbed.”

“Frances Neville, Baroness Abergavenny had one child (or one surviving child), a daughter, Mary (Nevill) Fane, born on 25 March 1554, to whom she entrusted her collection of prayers.”

Source:

Candidates

All we know about this person is their name and their mother. See if you can find Mary in Wikidata. Can you think of a faster way by using the mother relationship?

Review

Instead of reviewing every Mary Fane in Wikidata, it is faster to find the famous mother, Frances Neville, Baroness Abergavenny, for whom there is only one match in Wikidata. On the mother’s page, we find the child property and click through to Mary Neville, Baroness le Despenser.

On Mary’s Wikidata page, we see:

  • The alternative name Frances Nevill is a match to the name in our excerpt.
  • She married a Thomas Fane, so the last name Fane in our metadata makes sense.
  • We know that the mother is a match.

Match Confirmed

Note that if your data has many known family relationships, you can speed up your reconciliation using SPARQL queries to find the family members of your already reconciled entities.
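As a rough illustration of that note, here is a minimal sketch of a SPARQL lookup for the children of an already reconciled entity, sent to the public Wikidata Query Service from Python. It assumes the Wikidata child property (P40). The function names and the User-Agent string are hypothetical, and you would pass in the identifier of the parent you have already matched.

```python
import json
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def children_query(parent_qid):
    """Build a SPARQL query listing the children (P40) of one Wikidata item."""
    if not (parent_qid.startswith("Q") and parent_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {parent_qid!r}")
    return (
        "SELECT ?child ?childLabel WHERE {\n"
        f"  wd:{parent_qid} wdt:P40 ?child .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}"
    )


def run_query(query):
    """Send the query and return (uri, label) pairs. Needs network access."""
    url = SPARQL_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(
        url,
        # Illustrative User-Agent; replace with one that identifies your project.
        headers={"User-Agent": "entity-matching-practice-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [
        (row["child"]["value"], row["childLabel"]["value"])
        for row in data["results"]["bindings"]
    ]
```

Running run_query(children_query(qid)) with the mother's identifier should return the same child we found by clicking through her page. Swapping P40 for another family property lets you follow other relationships in the same way.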

Adelaide Manning

Entity

Name from Metadata:

  • Manning, Adelaide

Excerpts:

“While in England, Sarojini Naidu lived at 5 Pembridge Crescent, Notting Hill, at the house of Adelaide Manning, a social reformer and strong believer in the importance of women's education, who hosted a number...”

“1864: Unitarian and feminist Mentia Taylor formed in London the Pen and Pencil Club to foster literary and artistic exchange.

Arthur Munby described the first meeting, at which the subjects were Suspense and Witchcraft: Everyone has to contribute something in prose or verse or in painting or sculpture. People sat round the drawingroom & listened to the stories, and then looked at the drawings &c afterwards. He mentions good contributions from Adelaide Manning and a Miss Keary (either Annie or Eliza); other members included Frances Power Cobbe, Edwin Arnold, Austin Dobson, Edmund Gosse, and William Allingham.”

Sources:

  • The Orlando Project: Author profile of Sarojini Naidu
  • The Orlando Project: Freestanding event in 1864: Unitarian and feminist Mentia Taylor formed in London the Pen and Pencil Club to foster literary and artistic exchange.

Candidates

Based on the information in Orlando, we know that Adelaide Manning was a social reformer who hosted the author Sarojini Naidu at her house between 1895 and 1898.

Both VIAF and Wikidata produce one candidate match each for Adelaide Manning.

Review

Let’s start by reviewing the VIAF candidate. This candidate is British, but would have been about four years old when Sarojini Naidu lived at their house. For this reason, it does not seem reasonable to accept this record as a match.

By contrast, the Wikidata candidate offers us much more information. This candidate was alive during the correct era (1828-1905), lived in England, and was a social reformer.
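The date reasoning used in this review can be written down as a small check. This is only a sketch: the function names and the minimum age of 16 are our own assumptions, and the birth year of 1891 for the VIAF candidate is inferred from "about four years old" in 1895 rather than taken from the record.

```python
def age_at(birth_year, event_year):
    """Approximate age in whole years at the time of an event."""
    return event_year - birth_year


def plausible_for_event(birth_year, death_year, event_start, event_end, min_age=16):
    """True if the person was alive and at least min_age at some point in the event window.

    death_year may be None when the record gives no death date.
    min_age is an assumed threshold; adjust it to suit the kind of event.
    """
    earliest = max(event_start, birth_year + min_age)
    latest = event_end if death_year is None else min(event_end, death_year)
    return earliest <= latest


# Wikidata candidate: 1828-1905, hosting a guest between 1895 and 1898.
wikidata_ok = plausible_for_event(1828, 1905, 1895, 1898)

# VIAF candidate: birth year 1891 is an assumption inferred from the text above.
viaf_ok = plausible_for_event(1891, None, 1895, 1898)
```

A check like this only rules candidates out. A candidate that passes still needs the other evidence, such as occupation and location, before you confirm the match.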

Match Confirmed

Adolfo Suárez

Entity

Name from Metadata:

  • Suárez, Adolfo

Excerpts:

“Franco was succeeded as Spanish ruler, as he had arranged by negotiations in 1969, by Juan Carlos, grandson of the last reigning king of Spain. At this date only 17% of the Spanish parliament was elected by universal suffrage, and the expectations were that Juan Carlos would reign briefly or as a shadow of Franco, or both. In a referendum on 15 December 1976, however, unintimidated by a backdrop of terrorist murders and kidnappings, the Spanish people overwhelmingly voted for democratic constitutional reform, and in June the next year the king presided over democratic elections. They elected a coalition government under Premier Adolfo Suárez.”

Source:

  • The Orlando Project: Freestanding event on 20 November 1975: The Spanish dictator Francisco Franco died…

Candidates

Based on this excerpt, we know that Adolfo Suárez was Premier of Spain in the late 1970s, leading the coalition government elected in June 1977.

Review

Since the excerpt does not include any mention of published work by Adolfo Suárez, we can assume that the VIAF candidate will not be as useful when it comes to confirming this match.

Because we know of his political involvement, we can likely get a more accurate match with the Wikidata candidate, which foregrounds his time as the Prime Minister of Spain (1976-1981).

Note that VIAF is not going to be as useful when the people are not writers. In these cases, Wikidata will likely be better.

Match Confirmed

VIAF

John Venn

Entity

Name from Metadata:

  • Venn, John,, 1759 - 1813

Excerpts:

“Her family was English, white; most of her male relations were merchants or clergymen. Various members of her family belonged to the Evangelical Anglican group called the Clapham Sect, a coterie of social reformers and anti-slavery activists, whose founding members included her maternal uncle John Venn and possibly her grandfather Henry Venn (1725 - 1797) and her father.”

“The group was established when John Venn became rector of Clapham, a post he held from 8 June 1792. Henry Thornton was a driving force in the establishment of the society, and his home on Clapham Common became the organization's first meeting place. Other important members included Charles Grant, William Wilberforce, James Stephen, Zachary Macaulay and Hannah More.”

Sources:

  • The Orlando Project: Author profile of Charlotte Elliott
  • The Orlando Project: Freestanding event in mid 1792-1815: These were the active years of the informal evangelical Anglican group later called the Clapham Sect (then known as the Saints).

Candidates

We know several things about John Venn based on the information provided in the metadata and excerpts: he lived from 1759 to 1813 and was a rector of Clapham as well as the maternal uncle of Charlotte Elliott.

VIAF and Wikidata both produce multiple candidate entities for John Venn.

Since we have metadata for John Venn that provides birth and death dates, VIAF may be a good starting point to locate a match for this entity.

Review

Based on the matching lifespan, we can be almost certain that the VIAF record for “Venn, John, 1759-1813” is the correct candidate entity.

However, if we want further confirmation, we can scroll down to the “About” heading near the bottom of his VIAF page. Here, we can find a link to a Wikipedia page for “John Venn (priest)” that provides more details about his occupation as rector of Clapham and his involvement with the Clapham Sect.

Match Confirmed

John Colet

Entity

Name from Metadata:

  • Colet, John

Excerpt:

“Even after its final reprinting, a prayer from it re-appeared in Daily Devotions, 1641 (attributed to John Colet), which in turn was reprinted at dates up to 1722 and was gifted from a mother to her daughter in 1812.”

Source:

Candidates

Using the context provided in Frances Neville, Baroness Abergavenny’s profile, we know that a work titled Daily Devotions was attributed to John Colet in 1641.

When we are looking to confirm a match based on published work attributed to a person, VIAF is often a better starting point than Wikidata.

Review

Searching for John Colet on VIAF produces a candidate entity of a theologian who lived from 1467 to 1519. We can see that there are many works attributed to this person, among which is listed: Daily devotions, or, the Christians morning and evening sacrifice. Digested into prayers, and meditations, for every day of the weeke, and other occasions. With some short directions for a godly life. Written by John Colet., 1646.

Match Confirmed

Addie W. Hunton

Entity

Name from Metadata:

  • Hunton, Addie W.

Excerpt:

“The numbering of Pan-African Congresses is counted differently by different sources. The first was held in London in 1900, organized by a West Indian barrister, Henry Sylvester-Williams, and attended by W. E. B. DuBois among others. At a second, held in Paris to coincide with the peace talks at Versailles in 1919, a few African-American women attended, like Addie W. Hunton and Ida Gibbs Hunt.”

Source:

  • The Orlando Project: Freestanding event on 15-21 October 1945: The fifth Pan-African Congress, held in Manchester, UK, marked the beginning of the end of colonial rule in Africa and the Caribbean.

Candidates

From this excerpt, we know that Addie W. Hunton was an African-American woman who was politically active around 1919, based on her attendance at the Pan-African Congress that year. However, this context in our source is a bit vague and does not provide enough specific information that would allow us to confirm a match with full confidence.

If we start by searching for a candidate match on VIAF, a record for Hunton, Addie W., 1866-1943 appears. As this person would have been of a reasonable age to attend the Pan-African Congress in 1919, she is worth further investigation.

Review

This VIAF candidate appears to be a weak but reasonable match based on what we know about Addie W. Hunton in our source data: she was an African American woman active in relevant political causes who would have been approximately 53 years old when she attended the Pan-African Congress in 1919.

We might be able to gain a bit more confidence in this match by checking out the Wikipedia page for Addie Waites Hunton linked in the About subheading of her VIAF page. It might mention something about her belief that more African-American women should be involved in the Pan-African movement, which was highly male-dominated at the time.

Match Confirmed

Multiple records in your data

Thomas Price

Entity

Sometimes, when you are working with a batch of data from the same source, you will have two different person records that have the same name. For instance, in the Orlando dataset, there are two entities named Thomas Price, which can be differentiated by their metadata and the contexts in which they appear: Price, Thomas,, manufacturer was the father of author Ellen Wood, and Price, Thomas,, 1787 - 1848 was a Welsh priest who went by the name Carnhuanawc. Note that it is important to check URIs or another identifier to ensure that you are locating the correct entity, especially when you are locating these entities in a database.

In this example, we are interested in looking at Price, Thomas,, 1787 - 1848.
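When you work through a batch, it can help to flag shared names before you start reconciling. The sketch below assumes the metadata headings follow the shape shown in these examples: a name, then a double comma, then an optional qualifier such as a date range or an occupation. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches a qualifier such as "1787 - 1848".
DATE_RANGE = re.compile(r"^(\d{4})\s*-\s*(\d{4})$")


def parse_heading(heading):
    """Split a heading like 'Price, Thomas,, 1787 - 1848' into name and qualifier."""
    name, _, qualifier = heading.partition(",,")
    name = name.strip()
    qualifier = qualifier.strip()
    match = DATE_RANGE.match(qualifier)
    dates = (int(match.group(1)), int(match.group(2))) if match else None
    return {"name": name, "qualifier": qualifier or None, "dates": dates}


def find_shared_names(headings):
    """Group headings by name and keep only names used by more than one record."""
    groups = defaultdict(list)
    for heading in headings:
        groups[parse_heading(heading)["name"]].append(heading)
    return {name: items for name, items in groups.items() if len(items) > 1}


headings = [
    "Price, Thomas,, manufacturer",
    "Price, Thomas,, 1787 - 1848",
    "Venn, John,, 1759 - 1813",
    "Manning, Adelaide",
]
shared = find_shared_names(headings)
```

Here shared contains only the two Thomas Price headings, which tells you to check the qualifier and the identifier of each record before you search an authority file.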

Name from Metadata:

  • Price, Thomas,, 1787 - 1848

Excerpts:

“Her supporters in her scholarly project were the learned Welsh clergymen Thomas Price (whose bardic name was Carnhuanawc) and John Jones (Tegid).”

“Jane Williams published The Literary Remains of the Rev. Thomas Price, Carnhuanawc, in two volumes edited by herself.”

Sources:

Candidates

Once we have confirmed which Thomas Price in our dataset we intend to reconcile, we can move forward with locating a matching candidate entity in another source. For instance, this record in VIAF, which has the same name and birth and death dates as this person in Orlando, looks like a good potential match.

Review

This entity on VIAF matches the birth and death dates, name and alternative name, as well as the nationality of Thomas Price listed in Orlando.

Match Confirmed

Using additional sources

Adewale Maja-Pearce

Entity

Name from Metadata:

  • Maja-Pearce, Adewale

Excerpt:

“Maja-Pearce, Adewale. “Where to begin?”. London Review of Books, Vol. 40, No. 8, pp. 20-4.”

Source:

  • The Orlando Project: Bibliography

Candidates

Not much information about Adewale Maja-Pearce is provided on their person page in Orlando because they are included in the dataset through a citation rather than being written into the prose of an event or author profile. We do know something about them based on this information: they are the author of an article titled “Where to begin?”.

Because we know that Adewale Maja-Pearce is an author, we can still search for a candidate record in VIAF.

Review

In reviewing this candidate record on VIAF, there does not seem to be any mention of “Where to begin?”.

One way to double check this is through a quick Google search for Adewale Maja-Pearce and “Where to begin?”. This shows that “Where to begin?” is mentioned on the Wikipedia page for Adewale Maja-Pearce, which also features other works listed on the VIAF page. This confirms that this entity is a match.

Match Confirmed

Charles Martindale

Entity

Charles Martindale was a religious figure whom one student reconciled by comparing the distance between the place of burial listed on his Wikidata page and the church he is associated with in Orlando. While we don’t expect students to do that much to confirm a match, it’s an example of how you can use context and external information in creative ways to confirm or discount a potential match.
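If you have coordinates for both places, the distance comparison described above can be sketched with the haversine formula. This is a minimal illustration, assuming coordinates in decimal degrees and a spherical Earth. The function names and the idea of a distance threshold are our own, and you would supply the coordinates yourself from the two records.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_km(coord_a, coord_b, threshold_km):
    """True if two (lat, lon) pairs are no more than threshold_km apart."""
    return haversine_km(*coord_a, *coord_b) <= threshold_km
```

A short distance is supporting evidence only. It shows that the candidate is associated with the same area, not that the candidate is the same person.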

Name from Metadata:

  • Martindale, Charles

Excerpt:

“Ann Bridge was received into the Catholic Church in Farm Street, London, by Father Charles Martindale.”

Source:

Candidates

Searching for this person on Wikidata produces candidate entities for both Charles Martindale and Cyril Martindale.

Review

Although at first glance Cyril Martindale’s Wikidata record might not look like a match, a closer look at the page shows that his full name is Cyril Charles Martindale.

You can also follow the link to his Wikipedia page, which provides further context clues by mentioning his employment at Farm Street Catholic Church in London.

Match Confirmed

Organizations

CBC

Entity

When you are reconciling an organization that consists of multiple parts or subsidiaries, it is important to ensure that you are matching to the correct part of the organization. If you accidentally link to a smaller part of the organization instead of the main body, it can result in false equivalencies.

Name from Metadata:

  • Canadian Broadcasting Corporation

Excerpt:

“In addition, if Newfoundland were given a public station there might also be requests from different quarters in New Brunswick, Saskatchewan and Alberta for the establishment of C.B.C. stations in those provinces”

Source:

Candidates

In Wikidata, a parent organization and its subdivisions may have separate records. For example, a Wikidata record that represents a parent organization is CBC/Radio-Canada, which established and oversaw various regional stations across the country. Subsidiaries of this organization include CBRT-DT (a CBC television station in Calgary), CBX (a CBC Radio One station in Edmonton), and CBYK-FM (a CBC Radio One station in Kamloops).

Review

This excerpt refers to the parent organization’s role in creating subsidiaries in various locations across Canada. Therefore, the correct match is CBC/Radio-Canada and not a child organization.

Match Confirmed

Locations

Daly's Theatre

Entity

Name from Metadata:

  • Daly's Theatre

Excerpt:

“Early 1934: Lesley Storm began her long and productive career as a playwright (for which she is chiefly remembered) when her first play, Dark Horizon, opened at Daly's Theatre in London.”

“In the same year John Oliver Hobbes and Moore also collaborated on the one-act comedy Journeys End in Lovers' Meeting (titled from Shakespeare), which was performed in June 1895 (according to her father's memoir) at Daly's Theatre in London…”

“March 1892: Alfred, Lord Tennyson's The Foresters: Robin Hood & Maid Marion had its first performance at Daly's Theatre in New York. After playing in New York successfully, Tennyson's drama was published and was staged in London. Tennyson had first printed the play as a trial book for his own use in 1881.”

Source:

Candidates

This is an example where we need to be careful about which location our data refers to, because the same name is used for a theatre in London (presumably London, England) and for a theatre in New York (presumably New York City, though it could be elsewhere in New York State).

When we search Wikidata for Daly's Theatre, we need to be careful again, as there are multiple theatres with the same name. Locations and organizations also commonly change names over time, and you may need to consider whether an entity still counts as the same one after a name change. This is something to determine within your project when choosing authority URIs.

Review

Let's review the options in Wikidata for Daly's Theatre in London and use dates and locations to help us narrow the matches.

Daly's Theatre in London

  • We know that Lesley Storm lived from 1903 to 1975, and that her first play opened at the theatre in early 1934
  • We know that the John Oliver Hobbes play was performed in June 1895

Our first candidate seems correct for the London theatre based on dates and location. Depending on your project, you may be asked to do more research to find stronger evidence or this may be enough.
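Narrowing by dates and location can be sketched as a simple filter. The candidate records below are entirely hypothetical: the labels, cities, and opening and closing years are made-up placeholders to show the logic, not values taken from Wikidata. Only the event years (1895 and 1934) come from our excerpts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VenueCandidate:
    label: str
    city: str
    opened: Optional[int]  # None when the record gives no opening year
    closed: Optional[int]  # None when the record gives no closing year


def fits(candidate, city, event_years):
    """True if the candidate is in the right city and was open for every event year."""
    if candidate.city != city:
        return False
    for year in event_years:
        if candidate.opened is not None and year < candidate.opened:
            return False
        if candidate.closed is not None and year > candidate.closed:
            return False
    return True


# Hypothetical candidates with placeholder years, for illustration only.
candidates = [
    VenueCandidate("Candidate A", "London", 1890, 1940),
    VenueCandidate("Candidate B", "New York", 1870, 1920),
    VenueCandidate("Candidate C", "London", 1950, None),
]

# Event years from the London excerpts: June 1895 and early 1934.
london_matches = [c.label for c in candidates if fits(c, "London", [1895, 1934])]
```

With these placeholder values, only Candidate A survives the filter. With real records you would fill in the city and years from each candidate's page, and treat missing dates as "unknown" rather than as a mismatch, as the None handling does here.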

Materials

Entity

The xDX Project is a structured dataset focused on physical objects, and its data includes descriptions of the materials those objects are made of.

For example, they have many objects made of metal:

The University of Saskatchewan Art Collection also lists materials for their artworks. For example:

Candidates

The Getty Art & Architecture Thesaurus is a good authority to use for materials, as it has many detailed material URIs and a built-in hierarchy that shows which materials fall under shared categories.

If we matched all of these specific types of metal to a generic Getty term for metal, we would inadvertently be saying that all metal objects are brass and stainless steel and wrought iron and bronze. Instead, we need to match each specific type of metal to its own exact match. We can optionally say that each of those specific matches has the broader term metal.

Review

Our confirmed specific matches are then:

For all of these material types, we can set a shared broader term of metal (http://vocab.getty.edu/aat/300010900).
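The difference between an exact match and a broader term can be made concrete with a few triples. The sketch below assumes you record matches with the SKOS properties exactMatch and broader, which is one common choice rather than something this guide prescribes. The local material URIs and the specific AAT URIs are placeholders that you must replace with real values. Only the metal URI comes from the text above.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
AAT_METAL = "http://vocab.getty.edu/aat/300010900"


def material_triples(local_to_aat, broader_uri=AAT_METAL):
    """Build (subject, predicate, object) triples for specific matches and a shared broader term."""
    triples = []
    for local_uri, aat_uri in local_to_aat.items():
        # The local term matches one specific AAT material exactly.
        triples.append((local_uri, SKOS + "exactMatch", aat_uri))
        # The specific AAT material falls under the shared broader term.
        triples.append((aat_uri, SKOS + "broader", broader_uri))
    return triples


# Placeholder URIs for illustration only; these are not real identifiers.
matches = {
    "https://example.org/material/brass": "http://vocab.getty.edu/aat/BRASS_ID_PLACEHOLDER",
    "https://example.org/material/bronze": "http://vocab.getty.edu/aat/BRONZE_ID_PLACEHOLDER",
}
triples = material_triples(matches)
```

Each local material keeps its own specific match, and the shared category is expressed separately, so nothing in the output claims that brass and bronze are the same material.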

Match Confirmed