Data Cleaning Guide
Introduction
Below are common data cleaning tasks that LINCS has applied to previously transformed data.
First, determine which tasks apply to your data and what your team has the capacity to update. You may choose to prioritize the easiest tasks that have the biggest impact, while leaving less impactful steps to fix slowly on your published data.
Most of these concepts are relevant to any format of data—the input for any of the four transformation workflows—but they will be applied slightly differently in each case. Maybe an entity is in a spreadsheet cell for structured data, an XML attribute for semi-structured data or TEI data, or in the middle of a sentence in natural language data. The guiding principles behind why these cleaning tasks are important remain the same.
Separate Entity Mentions
It is important to separate each mention of an entity. This task is most relevant to structured and TEI data workflows where the goal is for any element in your source data to represent one single entity. Separating each entity allows LINCS’s automated or semi-automated methods of Implement Conceptual Mapping to grab a single entity at a time.
Split Entity Lists
Imagine your starting data is a spreadsheet. The goal is for every cell in the sheet to represent one single thing—be that an entity, label, concept, ID, or type. If you have a column that contains a list of a person’s names, for example, then you should split that column into several columns so that each name is in its own cell.
Here is a sample of Yellow Nineties Personography data. Note how the cells in the “alternateName” column contain multiple names, separated by a semi-colon:
ID | name | alternateName | sameAs |
---|---|---|---|
farr-florence | Florence Farr | Mary Lester; Florence Emery; S. S. D. D | http://viaf.org/viaf/94732733 |
norregard-julie | Julie Norregard | Eva; Julie Norregaard Le Gallienne | http://viaf.org/viaf/26376093 |
The cleaned version of this data has each name in a separate column:
ID | name | alternateName1 | alternateName2 | alternateName3 | sameAs |
---|---|---|---|---|---|
farr-florence | Florence Farr | Mary Lester | Florence Emery | S. S. D. D | http://viaf.org/viaf/94732733 |
norregard-julie | Julie Norregard | Eva | Julie Norregaard Le Gallienne | http://viaf.org/viaf/26376093 |
This change can also be made in XML data. Note how the <Y90s_also_known_as>
element contains a list of names:
<?xml version="1.0" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="https://personography.1890s.ca/persons/farr-florence/">
<y90s_name>Florence Farr</y90s_name>
<Y90s_also_known_as>Mary Lester; Florence Emery; S. S. D. D</y90s:also_known_as>
<Y90s_same_as rdf:resource="http://viaf.org/viaf/94732733"/>
</rdf:Description>
<rdf:Description rdf:about="https://personography.1890s.ca/persons/norregard-julie/">
<y90s_name>Julie Norregard</y90s_name>
<Y90s_also_known_as>Eva; Julie Norregaard Le Gallienne</y90s:also_known_as>
<Y90s_same_as rdf:resource="http://viaf.org/viaf/26376093"/>
</rdf:Description>
</rdf:RDF>
Each name should go in its own element:
<?xml version="1.0" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="https://personography.1890s.ca/persons/farr-florence/">
<y90s_name>Florence Farr</y90s_name>
<Y90s_also_known_as>Mary Lester</y90s:also_known_as>
<Y90s_also_known_as>Florence Emery</y90s:also_known_as>
<Y90s_also_known_as>S. S. D. D</y90s:also_known_as>
<Y90s_same_as rdf:resource="http://viaf.org/viaf/94732733"/>
</rdf:Description>
<rdf:Description rdf:about="https://personography.1890s.ca/persons/norregard-julie/">
<y90s_name>Julie Norregard</y90s_name>
<Y90s_also_known_as>Eva</y90s:also_known_as>
<Y90s_also_known_as>Julie Norregaard Le Gallienne</y90s:also_known_as>
<Y90s_same_as rdf:resource="http://viaf.org/viaf/26376093"/>
</rdf:Description>
</rdf:RDF>
Pull Information out of Sentences
If you have an element in your data that contains embedded information like full sentence notes about an object, then you have three main options.
Option 1
You can keep the note as full text and it will end up as a full sentence string in the RDF data, like in this example from Canadian Centre for Ethnomusicology data.
As a spreadsheet:
ID | NOTE |
---|---|
CCEA2003.5.12 | On display in Studio 27, FAB. June 22, 2009 |
As XML:
<?xml version='1.0' encoding='UTF-8'?>
<artifacts>
<catalogue_artifact>
<id_number>CCEA2003.5.12</id_number>
<note>On display in Studio 27, FAB. June 22, 2009</note>
</catalogue_artifact>
</artifacts>
When transformed into RDF, this becomes:
<http://exampleURI/artifacts/note/CCEA2003.5.12>
a crm:E33_Linguistic_Object ;
rdfs:label "Note for Canadian Centre for Ethnomusicology object CCEA2003.5.12"@en ;
crm:P190_has_symbolic_content "On display in Studio 27, FAB. June 22, 2009"@en ;
crm:P2_has_type <http://vocab.getty.edu/aat/300027200> .
This is a valid option and will allow you to search these notes in your transformed data for keywords.
Option 2
You can pull out information into new elements in your source data. In that example, you may add columns “displayLocation” and “displayStartDate”:
ID | displayLocation | displayDate |
---|---|---|
CCEA2003.5.12 | Studio 27, FAB. | June 22, 2009 |
Or as XML, one option could be:
<?xml version='1.0' encoding='UTF-8'?>
<artifacts>
<catalogue_artifact>
<id_number>CCEA2003.5.12</id_number>
<display>
<location>Studio 27, FAB.</location>
<start_date>June 22, 2009</start_date>
</display>
</catalogue_artifact>
</artifacts>
Now, when this is transformed into RDF, there can be relationships for where each artifact is on display. This would allow you to query your data to find all of the artifacts on display at a given location during a certain time period.
Option 3
If the text is extensive, you may choose to run all of the text through the natural language data workflow in addition to using your current workflow on the rest of the data.
Refer to Entities Consistently
In LOD, the exact way in which you refer to an entity matters. If you use two labels for something and there is a discrepancy in the spelling, capitalization, or spacing, then it will appear as two separate labels which may imply meaning and intention behind the choice.
This step is to ensure you always refer to the same entity in the same way. The exception here is if your data contains intentional alternate names or names in a different language. If so, separate that information into different columns or elements and explicitly identify things like the languages, or any sources or hierarchy of alternate names.
Make sure to:
- Use consistent spelling
- Use consistent capitalization
- Use consistent punctuation
- Use consistent accented characters in non-English text
- Remove leading, trailing, and duplicated spaces
- Remove or replace erroneous characters and symbols (e.g., HTML tags left in the data or errors from encoding changes like
instead of a space) - Replace obscure abbreviations or acronyms with full names if you have them
There are tools such as OpenRefine that offer clustering functionalities to spot words or phrases that are very similar but are not quite equal. Most spreadsheet tools like OpenRefine, Google Sheets, or Microsoft Excel also allow you to apply facets and filters to columns in your data so you can see a list of unique terms and choose what to rename and combine.
Person Names
LINCS prefers to have the primary label for a person entity to be of the form Firstname Lastname
rather than Lastname, Firstname
. This order results in cleaner labels for activities connected to those people, such as "Friend relationship between Aubrey Beardsley and Joseph Pennell"@en
in Yellow Nineties instead of "Friend relationship between Beardsley, Aubrey and Pennell, Joseph"@en
.
If possible, add this name format option to the version of the source data you will transform.
There is still the option to include additional labels that use other name formats. LINCS often pulls in alternate labels from external sources and we cannot control for all name variants so you will find a mix in LINCS data.
Dates
The date format you choose is typically not important because we have options to include different formats that vary in specificity. What matters is that it is consistent throughout your data—especially within a single column—and that it is clear what calendar is being followed.
Geographic Coordinates
ResearchSpace has functionality to display geographic locations with coordinates in visual maps. You will get specific guidance on this during the Develop Conceptual Mapping step, but if you have coordinates in your data, you could prepare them by transforming them into strings of the following formats:
"POINT(lon lat)"
"MULTIPOINT(lon lat,lon lat)"
"LINESTRING(lon lat,lon lat,lon lat)"
"POLYGON((lon lat, lon lat, lon lat))"
"MULTIPOLYGON(((lon lat, lon lat),(lon lat)))"
"MULTILINESTRING((lon lat, lon lat), (lon lat))"
For example, a simple point coordinate will end up in the transformed data as a string like this:
@prefix crm: <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix geonames: <https://sws.geonames.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
geonames:2077453 a crm:E53_Place ;
rdfs:label "Australind Shoal"@en ;
crm:P168_place_is_defined_by "POINT(114.91667 -21.56667)" .
For details on the supported geometry types, refer to the specifications for the GeoJSON format.
Unknown Values
The way that you indicate an unknown value matters.
For example, if there is a written work where it is important that the author is unknown then you may choose to have the value “unknown” in the author column and there will be an entity linked to that work in the final RDF data called “unknown.”
If, however, you have incomplete data, leave that data blank. If you have a mix of null
, n/a
, ?
, or unknown
values, they will unintentionally end up in your data.
Entity Identifiers
Introduce Internal Unique IDs
This is discussed in more detail during the Match Entities step, but it is often useful to add unique internal identifiers for entities. You may choose to handle that during the data cleaning step of your workflow.
URIs and Prefixes
If there are entity identifiers in your data, like unique project IDs or Uniform Resource Identifiers (URIs) from any source, you need to make sure they are valid and that you have included the official version of them.
When using external URIs in your data, make sure to:
- Remove or add a
/
to the end of a URI. When using a Virtual International Authority File (VIAF) URI as an entity identifier, for example, the URI should not have a trailing slash even though the Uniform Resource Locator (URL) in the address bar for a record has one. - Double check your use of
http
vshttps
. For example, VIAF and Wikidata both usehttp
for their LOD URIs, while GeoNames useshttps
. - Confirm you are using the official LOD version of an entity URI. For example, when you visit a page on Wikidata, the page will have a URL starting with
https://www.wikidata.org/wiki/
, but the official namespace URI for entity identifiers ishttp://www.wikidata.org/entity/
. Similarly, if you are using project specific URIs, decide what prefix to use and be consistent.
LINCS has a list of commonly used identifier prefixes as a useful first place to check.
Natural Language Data
There are some cleaning tasks specific to natural language data.
- Extract text from any semi-structured format like HTML or XML. Tags and annotations will cause errors in the natural language extraction tools, but are fine to keep if you are following the semi-structured workflow or using tools like LEAF-Writer that are designed for XML documents.
- If there was structured information that you removed like entity tags and identifiers, then you could supplement your extracted natural language data with a spreadsheet of where each entity mention is and the information you removed about it. This process will help you map that extra information back to the correct entities at the end and further enhance your transformed data.
- Fix spacing, punctuation, and capitalization errors caused by the text extraction or poor OCR. The automated extraction tools will have better performance the closer your text is to perfect grammatically correctly text.
- Replace obscure abbreviations or acronyms with full names if you have them.
- Separate headings from the original text with new lines. If there is important heading hierarchy that you want to keep, you could use #s like in markdown format or have different numbers of new lines between certain levels of headings.
- If there is a logical way to split large texts into self-contained sections, it may be beneficial to help break up the transformation into parts.