Implement Conceptual Mapping
Introduction
The Implement Conceptual Mapping step is where your data is finally converted from its original structure and ontology into LINCS RDF.
Resources Needed
For TEI data and natural language data, your team will do this step using LINCS tools. Our tools use templates for common relationships, so you can get output in a few minutes, though you may need to spend some time experimenting with how you process your source data to get the output you want.
For structured data and semi-structured data, we still have tools to help, but our approach is customized to each dataset, so the process takes longer. An experienced user could convert a dataset in a few days, but we find this step ends up taking a few weeks to a few months for the average project once you account for training, implementation, and troubleshooting. For these workflows, the work is a combined effort between the LINCS Conversion Team and your research team.
| | Research Team | Ontology Team | Conversion Team | Storage Team |
|---|---|---|---|---|
| Set Up your Data | ✓ | ✓ | ✓ | |
| Transform your Data | ✓ | ✓ | ✓ | |
Set Up your Data
To proceed with this step, you must have a conceptual mapping developed for your specific data. Ideally, this mapping will be final so that you do not need to redo this implementation step later on. That said, it is fine to start with a mapping that only covers certain relationships of interest and then add to the mapping, and to this implementation step, in phases.
It is best if you have already cleaned your data before this step. However, if your implementation is going to use code or a tool that can be rerun easily, then it is fine to start on this step before you have finished data cleaning. You can rerun the implementation step when the final cleaned data is ready.
Transform Your Data
Whenever possible, use tools or scripts that let you easily edit and rerun this step. That way, if you find errors in your source data or add more data later on, you can rerun this step to convert it quickly.
- Structured Data
- Semi-Structured Data
- TEI Data
- Natural Language Data
Every dataset in this category comes with a unique starting structure and, by this point, should have its own conceptual mapping. To pull each piece of information from the source data and reconnect it as CIDOC CRM triples, LINCS prefers to use the 3M mapping tool.
The 3M mapping tool takes XML documents as input and, through its graphical user interface, allows users to select data from their source files and map it into custom CIDOC CRM triples. We have found that this is the easiest method to get consistently converted data. LINCS has developed 3M documentation to guide you through creating your first mapping file, and the Ontology Team and Conversion Team can provide support as you get started.
You may choose to use 3M if:
- Your data is already in XML or is in a format that can be easily converted to XML (e.g., a spreadsheet or JSON files)
- You do not have a team member with programming experience and need a tool with a graphical user interface
- Your data contains many relationships, so the reliability of 3M's output and its handling of intermediate nodes will be a large benefit
Alternatively, you may choose to write custom scripts to convert your data instead of using 3M (a minimal sketch follows this list). Custom scripts may be the better choice if:
- You have a team member who understands the source data, understands the conceptual mapping, and has sufficient programming experience
- Your data only covers a small number of relationships so learning 3M is not worth the time investment
- Your data is in a highly normalized relational database and the code needed to transform the relational data into XML would be equivalent to code needed to output triples
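If you go the scripting route, a library such as rdflib makes it straightforward to build and serialize triples. Below is a minimal sketch, assuming a hypothetical `people.csv` with `id` and `name` columns and an invented base URI; the CIDOC CRM patterns you actually apply should come from your own conceptual mapping.

```python
# Minimal sketch of a custom conversion script. The file name, columns,
# and base URI are placeholders; only the CRM class/property URIs are real.
import csv

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
BASE = Namespace("http://example.org/lincs-project/")  # placeholder base URI

graph = Graph()
graph.bind("crm", CRM)

with open("people.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        person = BASE[f"person/{row['id']}"]
        # Type each row as an E21 Person and attach the name as a label.
        graph.add((person, RDF.type, CRM.E21_Person))
        graph.add((person, RDFS.label, Literal(row["name"])))

# Serialize to Turtle so the output can be inspected, and rerun the
# script whenever the source data changes.
graph.serialize("people.ttl", format="turtle")
```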
3M requires your data to be input as XML. If your structured data is not in XML, you can convert it following our Preparing Data for 3M documentation. This documentation also gives suggestions for ways to edit your XML data to make working in 3M easier.
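As a rough illustration of that preparation step, the sketch below flattens a hypothetical `records.csv` into simple XML using Python's standard library. The element names are invented, and your column headers would need to be valid XML element names; see the Preparing Data for 3M documentation for recommended structures.

```python
# Sketch: flatten a CSV file into simple XML suitable as 3M input.
# File and element names are placeholders.
import csv
import xml.etree.ElementTree as ET

root = ET.Element("records")
with open("records.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        record = ET.SubElement(root, "record")
        for column, value in row.items():
            # One child element per column keeps XPath selection in 3M simple.
            ET.SubElement(record, column).text = value

ET.ElementTree(root).write("records.xml", encoding="utf-8", xml_declaration=True)
```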
The Conversion Team and Ontology Team use 3M or custom scripts to write out the conceptual mapping and run the transformation on the data, resulting in LINCS RDF. To make sure that the output from 3M is correct, either your research team or the Conversion and Ontology Teams transform a small sample of the data and vet the results using the built-in 3M visualization tools and a manual comparison process. The full dataset is then converted.
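One way to script part of that comparison is with rdflib's graph comparison utilities. The sketch below assumes you have a hand-vetted sample conversion and a fresh conversion of the same records saved as Turtle files; the file names are placeholders.

```python
# Sketch: diff a vetted sample conversion against a fresh run.
from rdflib import Graph
from rdflib.compare import graph_diff, to_isomorphic

vetted = to_isomorphic(Graph().parse("sample_vetted.ttl", format="turtle"))
fresh = to_isomorphic(Graph().parse("sample_new.ttl", format="turtle"))

in_both, only_vetted, only_fresh = graph_diff(vetted, fresh)

# Triples unique to one side are worth inspecting before converting
# the full dataset.
for triple in only_fresh:
    print("unexpected:", triple)
for triple in only_vetted:
    print("missing:", triple)
```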
For semi-structured data, LINCS recommends that you use LEAF-Writer, a web-based editor that allows you to mark up XML documents, including tagging and reconciling entities. The tool does not require any programming knowledge, but it does take manual effort to tag the documents. While this can be time-consuming, we have found that for unique semi-structured data, a manual approach is worth the effort for the resulting quality. With LEAF-Writer, you can continue to slowly mark up your documents and re-extract the output until you are happy with it.
In the future, LINCS will provide tools to convert the web annotation data produced in LEAF-Writer into CIDOC CRM and to publish the output with LINCS. While that process is being developed, the Conversion Team can work with you through this step.
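For orientation, the sketch below shows roughly what a single web annotation tagging an entity mention can look like under WADM, written here as a Python dictionary. The exact structure LEAF-Writer produces may differ, and all URIs are placeholders.

```python
# Sketch of a WADM-style annotation linking a text span to a reconciled
# entity. Document and entity URIs are placeholders.
import json

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    # The body points at the reconciled entity, e.g. an external LOD URI.
    "body": {"id": "http://www.wikidata.org/entity/Q80", "purpose": "identifying"},
    # The target anchors the mention in the source document.
    "target": {
        "source": "http://example.org/documents/letter-001.xml",
        "selector": {
            "type": "TextQuoteSelector",
            "exact": "Tim Berners-Lee",
        },
    },
}

print(json.dumps(annotation, indent=2))
```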
If LEAF-Writer is not the right fit, you can use other XML processing tools or create custom extraction scripts that execute the conceptual mapping for a given dataset.
Development of custom scripts and work done in LEAF-Writer would both be completed by your research team, with the Conversion Team available to offer advice.
For TEI data, you can use the LINCS instance of XTriples to select a conversion template and automatically extract CIDOC CRM triples from your TEI files.
The LINCS XTriples templates expect your source files to conform to the templates in LEAF-Writer. If you are working with files that do not conform to these templates, please transform your TEI using the XSLTs linked from our XTriples documentation, which also provides details on the conversion templates themselves.
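If you are scripting that transformation yourself, one option is lxml, as in the minimal sketch below; the file names are hypothetical, and the stylesheet would be one of the XSLTs linked from the XTriples documentation.

```python
# Sketch: apply an XSLT stylesheet to a TEI file with lxml.
# Both file names are placeholders.
from lxml import etree

source = etree.parse("letter-001.tei.xml")
transform = etree.XSLT(etree.parse("tei-to-leaf-template.xsl"))

result = transform(source)
result.write("letter-001.converted.xml", encoding="utf-8", xml_declaration=True)
```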
For additional extractions not covered by the XTriples templates, you will need to follow another workflow. If you have fairly regularized XML data, you could follow the structured data workflow; otherwise, use the semi-structured data workflow. If there is significant natural language text contained in the TEI documents, use the natural language workflow to extract facts from those textual elements.
The Natural Language Data workflow is still in progress. Check back over the next few months as we release the tools described here.
The task of extracting triples from natural language text in an automated way, without a human manually marking up a document, breaks down into two subtasks: named entity recognition (NER) and relation extraction (RE). A computer system predicts which words or phrases in the text represent named entities and what relationships the text expresses between those entities.
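To make the NER half of the task concrete, here is a generic sketch using spaCy rather than the LINCS APIs themselves; it assumes the `en_core_web_sm` model is installed.

```python
# Sketch: generic named entity recognition with spaCy (for illustration
# only; not the LINCS extraction APIs).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Margaret Laurence wrote The Stone Angel in Vancouver.")

# Each predicted entity span comes with a label such as PERSON or GPE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```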
LINCS has developed APIs to make these automated tasks accessible. These APIs take plain text as input and output either:
- Triples where the predicate (the relationship) must come from a list of allowable predicates
- Triples where the predicate can be any word or phrase from the text
The first option is the fastest and most reliable way to generate valid LOD, but the restricted set of predicates limits the number of triples you will get. The second option will give you many more triples to start with and can act as a productive first step in a more manual approach, where a human goes through and cleans up the extracted triples. LINCS tools provide both of these options.
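Purely to illustrate the difference between the two options, the sketch below shows the kinds of triples each might return for an invented sentence; the actual API response format may differ.

```python
# Invented example output shapes only; the actual LINCS API responses
# may be structured differently.

# Option 1: predicates constrained to an allowable list.
closed_triples = [
    ("Margaret Laurence", "author_of", "The Stone Angel"),
]

# Option 2: predicates taken verbatim from the text. You get more triples,
# but phrases like "spent the summer in" need cleanup before they can
# become valid LOD.
open_triples = [
    ("Margaret Laurence", "author_of", "The Stone Angel"),
    ("Margaret Laurence", "spent the summer in", "Vancouver"),
]
```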
This level of automation is meant to be a faster, though less precise, conversion method than the structured conversion workflow or a manual treatment of natural language texts. If your research team has the time, you can put more manual curation into it, using the tools as a starting point.
These extraction APIs will be part of LINCS-API and made accessible through programming notebooks and eventually through tools such as NERVE. In the meantime, NERVE is a great starting point for creating LOD from natural language texts even without the future relation extraction functionality. It allows you to tag entities in the text, reconcile them against external LOD sources, and then connect the mentions of those entities to the source text using the Web Annotation Data Model (WADM). LINCS can then help you transform NERVE's output into CIDOC CRM triples ready for publication with LINCS.
One of the systems that we are using behind the scenes of these APIs is made possible through our collaboration with Diffbot. We have worked with them to tailor their Natural Language Processing API to handle the unique challenges of processing humanities texts.
The triples output from these automated systems may change slightly each time you run them if you edit the input text or if the system has been updated since your last run. If you plan to run the tools on the same texts multiple times, consider how you will merge all of your results afterwards. For example, if you are manually reviewing the output to remove incorrect facts, keep track of those rejections so that you can automatically remove them from future results.
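One lightweight way to carry those rejections forward is to keep them in a file and subtract them from each new run. The sketch below assumes triples are stored as JSON arrays of [subject, predicate, object]; the file names and format are placeholders.

```python
# Sketch: filter previously rejected triples out of a fresh extraction run.
# File names and storage format are placeholders.
import json

def load_triples(path: str) -> set[tuple[str, str, str]]:
    with open(path, encoding="utf-8") as f:
        return {tuple(t) for t in json.load(f)}

rejected = load_triples("rejected_triples.json")    # built up during review
extracted = load_triples("latest_extraction.json")  # output of a fresh run

# Drop anything you have already marked as incorrect.
kept = extracted - rejected
print(f"kept {len(kept)} of {len(extracted)} extracted triples")
```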
After extracting triples with these tools, you may want to inspect the results, removing inaccurate extractions and potentially adding missed triples, depending on how you choose to balance time against data quality. Finally, LINCS has an additional API to transform triples with predicates from our allowable list into CIDOC CRM triples ready for the Validate and Enhance step.
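To give a sense of what that final transformation involves, the sketch below expands an invented flat "created" predicate into an event-based CIDOC CRM pattern with an intermediate E65 Creation node. The real allowable predicate list and the patterns LINCS applies are defined by that API, not by this example.

```python
# Sketch: expand a flat ("Margaret Laurence", "created", "The Stone Angel")
# triple into a CRM creation event. URIs under BASE are placeholders.
from rdflib import BNode, Graph, Namespace
from rdflib.namespace import RDF

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
BASE = Namespace("http://example.org/lincs-project/")

graph = Graph()
graph.bind("crm", CRM)

person = BASE["person/margaret-laurence"]
work = BASE["work/the-stone-angel"]

# The intermediate node: an E65 Creation event carried out by the person
# and producing the work.
creation = BNode()
graph.add((creation, RDF.type, CRM.E65_Creation))
graph.add((creation, CRM.P14_carried_out_by, person))
graph.add((creation, CRM.P94_has_created, work))

print(graph.serialize(format="turtle"))
```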
You should now have RDF data that follows LINCS’s ontology and vocabulary standards. Your data may not be quite ready for ingestion into the LINCS triplestore yet, but it will be after some final cleanup in the next step.