Aller au contenu principal

Data Cleaning Guide

Introduction

Below are common data cleaning tasks that LINCS has applied to previously converted data.

First, determine which tasks apply to your data and what your team has the capacity to update. You may choose to prioritize the easiest tasks that have the biggest impact, while leaving less impactful steps to fix slowly on your published data.

Most of these concepts are relevant to any format of data—the input for any of the four conversion workflows—but they will be applied slightly differently in each case. Maybe an entity is in a spreadsheet cell for structured data, an XML attribute for semi-structured data or TEI data, or in the middle of a sentence in natural language data. The guiding principles behind why these cleaning tasks are important remain the same.

Separate Entity Mentions

It is important to separate each mention of an entity. This task is most relevant to structured and TEI data workflows where the goal is for any element in your source data to represent one single entity. Separating each entity allows LINCS’s automated or semi-automated methods of Implement Conceptual Mapping to grab a single entity at a time.

Split Entity Lists

Imagine your starting data is a spreadsheet. The goal is for every cell in the sheet to represent one single thing—be that an entity, label, concept, ID, or type. If you have a column that contains a list of a person’s names, for example, then you should split that column into several columns so that each name is in its own cell.

Here is a sample of Yellow Nineties Personography data. Note how the cells in the “alternateName” column contain multiple names, separated by a semi-colon:

IDnamealternateNamesameAs
farr-florenceFlorence FarrMary Lester; Florence Emery; S. S. D. Dhttp://viaf.org/viaf/94732733
norregard-julieJulie NorregardEva; Julie Norregaard Le Galliennehttp://viaf.org/viaf/26376093

The cleaned version of this data has each name in a separate column:

IDnamealternateName1alternateName2alternateName3sameAs
farr-florenceFlorence FarrMary LesterFlorence EmeryS. S. D. Dhttp://viaf.org/viaf/94732733
norregard-julieJulie NorregardEvaJulie Norregaard Le Galliennehttp://viaf.org/viaf/26376093

This change can also be made in XML data. Note how the <Y90s_also_known_as> element contains a list of names:

<?xml version="1.0" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="https://personography.1890s.ca/persons/farr-florence/">
<y90s_name>Florence Farr</y90s_name>
<Y90s_also_known_as>Mary Lester; Florence Emery; S. S. D. D</y90s:also_known_as>
<Y90s_same_as rdf:resource="http://viaf.org/viaf/94732733"/>
</rdf:Description>
<rdf:Description rdf:about="https://personography.1890s.ca/persons/norregard-julie/">
<y90s_name>Julie Norregard</y90s_name>
<Y90s_also_known_as>Eva; Julie Norregaard Le Gallienne</y90s:also_known_as>
<Y90s_same_as rdf:resource="http://viaf.org/viaf/26376093"/>
</rdf:Description>
</rdf:RDF>

Each name should go in its own element:

<?xml version="1.0" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="https://personography.1890s.ca/persons/farr-florence/">
<y90s_name>Florence Farr</y90s_name>
<Y90s_also_known_as>Mary Lester</y90s:also_known_as>
<Y90s_also_known_as>Florence Emery</y90s:also_known_as>
<Y90s_also_known_as>S. S. D. D</y90s:also_known_as>
<Y90s_same_as rdf:resource="http://viaf.org/viaf/94732733"/>
</rdf:Description>
<rdf:Description rdf:about="https://personography.1890s.ca/persons/norregard-julie/">
<y90s_name>Julie Norregard</y90s_name>
<Y90s_also_known_as>Eva</y90s:also_known_as>
<Y90s_also_known_as>Julie Norregaard Le Gallienne</y90s:also_known_as>
<Y90s_same_as rdf:resource="http://viaf.org/viaf/26376093"/>
</rdf:Description>
</rdf:RDF>

Pull Information out of Sentences

If you have an element in your data that contains embedded information like full sentence notes about an object, then you have three main options.

Option 1

You can keep the note as full text and it will end up as a full sentence string in the RDF data, like in this example from Canadian Centre for Ethnomusicology data.

As a spreadsheet:

IDNOTE
CCEA2003.5.12On display in Studio 27, FAB. June 22, 2009

As XML:

<?xml version='1.0' encoding='UTF-8'?>
<artifacts>
<catalogue_artifact>
<id_number>CCEA2003.5.12</id_number>
<note>On display in Studio 27, FAB. June 22, 2009</note>
</catalogue_artifact>
</artifacts>

When converted into RDF, this becomes:

<http://exampleURI/artifacts/note/CCEA2003.5.12>
a crm:E33_Linguistic_Object ;
rdfs:label "Note for Canadian Centre for Ethnomusicology object CCEA2003.5.12"@en ;
crm:P190_has_symbolic_content "On display in Studio 27, FAB. June 22, 2009"@en ;
crm:P2_has_type <http://vocab.getty.edu/aat/300027200> .

This is a valid option and will allow you to search these notes in your converted data for keywords.

Option 2

You can pull out information into new elements in your source data. In that example, you may add columns “displayLocation” and “displayStartDate”:

IDdisplayLocationdisplayDate
CCEA2003.5.12Studio 27, FAB.June 22, 2009

Or as XML, one option could be:

<?xml version='1.0' encoding='UTF-8'?>
<artifacts>
<catalogue_artifact>
<id_number>CCEA2003.5.12</id_number>
<display>
<location>Studio 27, FAB.</location>
<start_date>June 22, 2009</start_date>
</display>
</catalogue_artifact>
</artifacts>

Now, when this is converted into RDF, there can be relationships for where each artifact is on display. This would allow you to query your data to find all of the artifacts on display at a given location during a certain time period.

Option 3

If the text is extensive, you may choose to run all of the text through the natural language data workflow in addition to using your current workflow on the rest of the data.

Refer to Entities Consistently

In LOD, the exact way in which you refer to an entity matters. If you use two labels for something and there is a discrepancy in the spelling, capitalization, or spacing, then it will appear as two separate labels which may imply meaning and intention behind the choice.

This step is to ensure you always refer to the same entity in the same way. The exception here is if your data contains intentional alternate names or names in a different language. If so, separate that information into different columns or elements and explicitly identify things like the languages, or any sources or hierarchy of alternate names.

astuce

Make sure to:

  • Use consistent spelling
  • Use consistent capitalization
  • Use consistent punctuation
  • Use consistent accented characters in non-English text
  • Remove leading, trailing, and duplicated spaces
  • Remove or replace erroneous characters and symbols (e.g., HTML tags left in the data or errors from encoding changes like &nbsp; instead of a space)
  • Replace obscure abbreviations or acronyms with full names if you have them

There are tools such as OpenRefine that offer clustering functionalities to spot words or phrases that are very similar but are not quite equal. Most spreadsheet tools like OpenRefine, Google Sheets, or Microsoft Excel also allow you to apply facets and filters to columns in your data so you can see a list of unique terms and choose what to rename and combine.

Person Names

LINCS prefers to have the primary label for a person entity to be of the form Firstname Lastname rather than Lastname, Firstname. This order results in cleaner labels for activities connected to those people, such as "Friend relationship between Aubrey Beardsley and Joseph Pennell"@en in Yellow Nineties instead of "Friend relationship between Beardsley, Aubrey and Pennell, Joseph"@en.

If possible, add this name format option to the version of the source data you will convert.

There is still the option to include additional labels that use other name formats. LINCS often pulls in alternate labels from external sources and we cannot control for all name variants so you will find a mix in LINCS data.

Dates

The date format you choose is typically not important because we have options to include different formats that vary in specificity. What matters is that it is consistent throughout your data—especially within a single column—and that it is clear what calendar is being followed.

Geographic Coordinates

ResearchSpace has functionality to display geographic locations with coordinates in visual maps. You will get specific guidance on this during the Develop Conceptual Mapping step, but if you have coordinates in your data, you could prepare them by converting them into strings of the following formats:

"POINT(lon lat)"

"MULTIPOINT(lon lat,lon lat)"

"LINESTRING(lon lat,lon lat,lon lat)"

"POLYGON((lon lat, lon lat, lon lat))"

"MULTIPOLYGON(((lon lat, lon lat),(lon lat)))"

"MULTILINESTRING((lon lat, lon lat), (lon lat))"

For example, a simple point coordinate will end up in the converted data as a string like this:

@prefix crm: <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix geonames: <https://sws.geonames.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

geonames:2077453 a crm:E53_Place ;
rdfs:label "Australind Shoal"@en ;
crm:P168_place_is_defined_by "POINT(114.91667 -21.56667)" .

For details on the supported geometry types, refer to the specifications for the GeoJSON format.

Unknown Values

The way that you indicate an unknown value matters.

For example, if there is a written work where it is important that the author is unknown then you may choose to have the value “unknown” in the author column and there will be an entity linked to that work in the final RDF data called “unknown.”

If, however, you have incomplete data, leave that data blank. If you have a mix of null, n/a, ?, or unknown values, they will unintentionally end up in your data.

Entity Identifiers

Introduce Internal Unique IDs

This is discussed in more detail during the Reconcile Entities step, but it is often useful to add unique internal identifiers for entities. You may choose to handle that during the data cleaning step of your workflow.

URIs and Prefixes

If there are entity identifiers in your data, like unique project IDs or Uniform Resource Identifiers (URIs) from any source, you need to make sure they are valid and that you have included the official version of them.

When using external URIs in your data, make sure to:

  • Remove or add a / to the end of a URI. When using a Virtual International Authority File (VIAF) URI as an entity identifier, for example, the URI should not have a trailing slash even though the Uniform Resource Locator (URL) in the address bar for a record has one.
  • Double check your use of http vs https. For example, VIAF and Wikidata both use http for their LOD URIs, while GeoNames uses https.
  • Confirm you are using the official LOD version of an entity URI. For example, when you visit a page on Wikidata, the page will have a URL starting with https://www.wikidata.org/wiki/, but the official namespace URI for entity identifiers is http://www.wikidata.org/entity/. Similarly, if you are using project specific URIs, decide what prefix to use and be consistent.

LINCS has a list of commonly used identifier prefixes as a useful first place to check.

Natural Language Data

There are some cleaning tasks specific to natural language data.

  • Extract text from any semi-structured format like HTML or XML. Tags and annotations will cause errors in the natural language extraction tools, but are fine to keep if you are following the semi-structured workflow or using tools like LEAF-Writer that are designed for XML documents.
    • If there was structured information that you removed like entity tags and identifiers, then you could supplement your extracted natural language data with a spreadsheet of where each entity mention is and the information you removed about it. This process will help you map that extra information back to the correct entities at the end and further enhance your converted data.
  • Fix spacing, punctuation, and capitalization errors caused by the text extraction or poor OCR. The automated extraction tools will have better performance the closer your text is to perfect grammatically correctly text.
  • Replace obscure abbreviations or acronyms with full names if you have them.
  • Separate headings from the original text with new lines. If there is important heading hierarchy that you want to keep, you could use #s like in markdown format or have different numbers of new lines between certain levels of headings.
  • If there is a logical way to split large texts into self-contained sections, it may be beneficial to help break up the conversion into parts.