Export Data
Introduction
Before converting your data, you need to prepare a version of it that is easy to share and work with. The idea is to export all of the data that you want to convert from your project’s data store and save it in a format suited for your next conversion step. Because this is a custom step for each project, relying on your unique data store, LINCS can only guide you on what the outcome should look like.
Resources Needed
If your data needs to be extracted from a data store like a relational database, you will likely need support from your database administrator. Support from a team member with basic programming knowledge (e.g., undergraduate Python experience) is useful for restructuring your data.
This step could range from a one-hour task of downloading text files and moving them to a shared location, to a multi-day task of exporting complex relational data into XML files that group the relevant information.
| | Research Team | Ontology Team | Conversion Team | Storage Team |
| --- | --- | --- | --- | --- |
| Identify your Source Data | ✓ | ✓ | | |
| Send a Representative Sample | ✓ | | | |
| Export your Full Dataset | ✓ | | | |
| Send your Full Dataset | ✓ | | | |
Identify your Source Data
The research team identifies which dataset or part of a dataset they hope to convert. You will need to work with your technical team to answer:
- Where is the data stored?
- How can you access it?
- Whose help do you need to access it? To export it? To restructure it?
- Can you make changes directly to it during the cleaning and reconciliation steps?
- Will you convert all of your data? Or only some fields?
Many projects have their data available to view on a website. When we talk about exporting data, we mean that you need to find out where the data is actually stored—where is the website getting the data from? You will need to go straight to that source or identify that there is an API or a data download link that can handle the export for you.
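If there is an API, a short script is often all the export step requires. The following is a minimal sketch, assuming a hypothetical JSON endpoint and query parameters; substitute your project's actual API, and keep the saved file as the stable starting point for the rest of the workflow.

```python
import json

import requests  # third-party library: pip install requests

# Hypothetical endpoint and parameters for illustration only.
response = requests.get(
    "https://api.example.org/records",
    params={"collection": "letters", "format": "json"},
    timeout=30,
)
response.raise_for_status()

# Save the raw export so the conversion workflow has a stable starting point.
with open("letters_export.json", "w", encoding="utf-8") as f:
    json.dump(response.json(), f, ensure_ascii=False, indent=2)
```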
Choose your Source Data
LINCS emphasizes finding your source data because that will be the copy of your data that we use as the starting point for your conversion workflow. For many of the workflows, we develop code and mappings that rely on the structure of your source data remaining constant—even if the contents change through cleaning and reconciliation as we go.
Think of your source data as the place where you will go to make changes if you find data errors during the conversion process. Would you go back to the database, edit there, re-export, then re-run the LINCS conversion code? This would mean you have improved the version of your data that you are likely to use for other purposes. Or would you rather export your data into an easy-to-work-with format, such as a spreadsheet, clean your data in the sheet, run the LINCS conversion code, and leave changes to your true source data for a future project?
An important consideration is whether you intend to continue converting more data to add to LINCS after the initial conversion. If yes, it is better to spend time at the start making sure the export and cleaning steps are easily repeatable without needing to waste time duplicating manual work like creating a spreadsheet with special formatting. If not, and this is a one-time conversion, it is safe to prioritize speed.
If you plan to contribute your data to LINCS, your research team, the Ontology Team, and the Conversion Team will meet to discuss the structure and contents of the identified dataset. We will discuss the overarching research goals of the research team so we can create linked open data (LOD) that is useful and meaningful.
It is helpful for the research team to come to this meeting prepared with some research questions they are hoping to answer.
A common scenario for the structured data conversion workflow is where a project has a relational database that is the true initial source of their data. They have a few options of what to count as the source of data when working with LINCS:
- Treat that relational database as the source. While we are converting data, if errors need to be fixed in the data, the research team makes changes directly to their database. Their database administrator then creates a copy or a data dump of that database, which includes all of the data and the schema that tells us how the data is organized. LINCS’s conversion code takes that data dump as the input or starting point.
- If there is an API that allows us to request data from their database, then they still treat the database as the source like in the previous option. But we do not need a database dump to be regenerated each time significant changes are made. Instead, the conversion code would consider the API as the starting point, where we can call on it from the conversion code to get up-to-date data.
- The research team does not have access, permission, or capacity to make changes directly to the source database. Instead, they create a one-time export of their database into a format they like working in, such as a spreadsheet or XML document. Cleaning and reconciliation happen directly in that new format and the conversion code uses that new format as the starting point (see the export sketch below).
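For the third option, the one-time export can itself be scripted so it is easy to repeat. Below is a minimal sketch that dumps every table of a SQLite database to its own CSV file; the database file name is hypothetical, and for systems like PostgreSQL or MySQL your database administrator might instead use a dump utility (e.g., pg_dump or mysqldump) or the matching Python driver.

```python
import csv
import sqlite3

# Hypothetical database file; swap in your own connection details.
conn = sqlite3.connect("project_data.db")
cur = conn.cursor()

# List the tables in the database.
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
tables = [row[0] for row in cur.fetchall()]

# Dump each table to its own CSV file for cleaning and reconciliation.
for table in tables:
    cur.execute(f'SELECT * FROM "{table}"')
    headers = [column[0] for column in cur.description]
    with open(f"{table}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(cur.fetchall())

conn.close()
```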
Send a Representative Sample
When you start working with LINCS, send us a representative data sample so that we can help create a conceptual mapping.
A representative sample (see the preparation sketch below):
- Must include all of the fields you want to be present in your converted data. For example, a spreadsheet should have a column for every category of data.
- Must not include fields that should remain private to your institution or that will not be useful in LOD form. Examples include internal database IDs, institutional information about object purchasing, or personal information about still-living persons.
- Must include blank fields or placeholders for data you will add before conversion is complete. If adding the fields is not possible, communicate to LINCS the changes you intend to make.
- Should not include blank fields or placeholders for data you will not have time to add in the near future. Those fields are usually best left for a second round of conversion once you are comfortable with the process.
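One way to prepare a sample that meets these criteria is to script it from your full export, so it can be regenerated if your fields change. The sketch below assumes a hypothetical full_export.csv and hypothetical column names; adjust the private fields you drop and the placeholder fields you add to match your own data.

```python
import pandas as pd

# Hypothetical file and column names; substitute your own export and fields.
full = pd.read_csv("full_export.csv")

# Drop fields that should stay private or will not be useful as LOD.
sample = full.drop(columns=["internal_id", "purchase_price"], errors="ignore")

# Add blank placeholder columns for data you plan to fill in before conversion.
for planned_field in ["birth_place", "viaf_id"]:
    if planned_field not in sample.columns:
        sample[planned_field] = ""

# Keep every column but only a handful of representative rows.
sample.head(25).to_csv("representative_sample.csv", index=False)
```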
LINCS will request a representative sample in our first meeting. Come prepared and kick-start the conversion process.
Export your Full Dataset
The expected output depends on the structure of your data and the conversion tools you plan to use. It is helpful to review the rest of the conversion workflow steps before completing this step, particularly the Implement Conceptual Mapping step.
The guidance below is organized by the type of data you are exporting: structured data, semi-structured data, TEI data, or natural language data.

Structured Data
If you plan to use 3M for conversion, as LINCS typically does for structured data, then export the data and transform it into XML document(s) following the suggestions in Preparing Data for 3M.
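As a rough illustration of that kind of transformation, the sketch below groups rows from a hypothetical people.csv into one XML record per row. The element names are placeholders; the structure 3M actually expects should follow the Preparing Data for 3M guidance.

```python
import csv
import xml.etree.ElementTree as ET

# Hypothetical input file; assumes an "id" column and headers that are
# usable as XML element names.
root = ET.Element("records")

with open("people.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        record = ET.SubElement(root, "record", id=row["id"])
        for field, value in row.items():
            if field != "id" and value:
                ET.SubElement(record, field).text = value

ET.ElementTree(root).write("people.xml", encoding="utf-8", xml_declaration=True)
```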
Otherwise, you can:
- Set up a custom data export, in the programming language of your choice, that outputs your data as LINCS-compliant Resource Description Framework (RDF), as defined by your Develop Conceptual Mapping step (see the sketch below).
- Find a tool online that suits your data and export your data into the format the tool expects as input.
See the Implement Conceptual Mapping page for help deciding if 3M or a custom solution is right for your data.
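If you go the custom-export route, a library such as rdflib (Python) can emit RDF directly. The sketch below is illustrative only and not a definitive LINCS-compliant export: the base URI is hypothetical and the classes and properties you actually use are determined in your Develop Conceptual Mapping step.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
DATA = Namespace("https://example.org/data/")  # hypothetical base URI

g = Graph()
g.bind("crm", CRM)

# One illustrative set of triples describing a single (placeholder) person record.
person = DATA["person/example-person"]
g.add((person, RDF.type, CRM["E21_Person"]))
g.add((person, RDFS.label, Literal("Example Person", lang="en")))

# Serialize to Turtle; other RDF formats are also supported.
g.serialize(destination="people.ttl", format="turtle")
```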
Semi-Structured Data

Typically, it is easiest to export data into individual XML files that can be worked on one at a time. However, if you have short documents with highly connected content, you may prefer to combine them into a single XML file.

Be sure to name each file with a unique and meaningful title, such as a unique document identifier. Each file should have a .xml extension and, while it is fine for the XML to contain tags unique to your project, it needs to be valid XML. You can check this using an online XML validator.
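As a quick local check before sharing your files, the sketch below parses every file in a hypothetical export folder and reports any that fail to parse. Note that this only confirms the XML is well formed; an online validator or schema-aware tool can check more than that.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical folder containing the exported XML files.
for path in sorted(Path("export").glob("*.xml")):
    try:
        ET.parse(path)
    except ET.ParseError as err:
        print(f"{path.name}: not well-formed XML ({err})")
```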
TEI Data

Export the data so you end up with an individual file for each TEI document. Be sure to name each file with a unique and meaningful title, such as a unique document identifier. Each file should have a .xml extension.
Natural Language Data

Export the data as cleanly as possible into TXT files. Be sure to name each file with a unique and meaningful title, such as a unique document identifier. If you extracted text that was embedded within documents, you will likely want to keep track of where each excerpt originated from, for example by saving the offset of the text in the original document.

Images or PDFs that contain text need to be processed (e.g., through OCR) by the research team so we have well-formatted, clean, plain text documents, typically in English. Similarly, formats like Microsoft Word documents should be saved as plain text, as the formatting embedded in such file types is not useful here.
Depending on the conversion tools you choose, you may find it best to save long texts in a single file, or to split them into smaller files by chapters, sections, paragraphs, or sentences. Read through the rest of the conversion workflow steps to get an idea of what will best suit your data and remember you can experiment with multiple options using a sample of your data.
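For example, if you decide to split a long text into smaller files, a short script can write the pieces and keep a manifest recording where each one starts in the original document. The sketch below assumes a hypothetical plain text file and splits on blank lines; adjust the splitting rule to whatever unit suits your data.

```python
import csv
from pathlib import Path

# Hypothetical source file and output folder.
source = Path("documents/long_text.txt")
out_dir = Path("export/natural_language")
out_dir.mkdir(parents=True, exist_ok=True)

text = source.read_text(encoding="utf-8")

# Split on blank lines and record the character offset where each chunk
# starts in the source text, so excerpts can be traced back later.
rows = []
offset = 0
for i, chunk in enumerate(text.split("\n\n")):
    stripped = chunk.strip()
    if stripped:
        start = text.index(stripped, offset)
        name = f"{source.stem}_{i:04d}.txt"
        (out_dir / name).write_text(stripped, encoding="utf-8")
        rows.append({"file": name, "source": source.name, "offset": start})
    offset += len(chunk) + 2  # skip past this chunk and the blank-line separator

# Save a manifest so each excerpt's origin and offset are preserved.
with open(out_dir / "manifest.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["file", "source", "offset"])
    writer.writeheader()
    writer.writerows(rows)
```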
Send your Full Dataset
As we get further into the conversion process, if LINCS is completing any of the conversion steps with you, we will need a copy of your full dataset before any more work can continue.
If the data is publicly available on a website or through an API, then please share links and documentation. This documentation helps us understand the meaning you intend to convey through the data.