Breaking Down Barriers to Data Conversion
If there’s one thing I have learned during my library, archival, and information graduate studies, it is that information institutions are averse to change. The archival profession progresses at a glacial pace. This is juxtaposed with the leaps and bounds made in information technology over the past twenty-five years. At first glance, it is hard to see why many libraries are still using the antiquated MARC format for their bibliographic records, or why archival institutions in Canada are still mandated to use the 2008 release of Rules for Archival Description, which doesn’t have any real solution for describing electronic records...
Why can’t we modernize our standards and our practices? Why can’t we develop shiny new software to provide more elegant, complex, and robust information infrastructure? The technology to do and build these things is there—all we have to do is reach out and grab it.
The problem boils down to the cost. The cost of building new infrastructure, the cost of the labour needed to make systemic changes, the cost of educating staff in new systems. Most Canadian information institutions rely on funding—often public funding—to operate. These institutions are not designed to turn a profit; their value is precisely in being free for their users. Unfortunately, funding is rarely enough to support capital projects such as infrastructure upgrades, and it’s hard to explain to the budgeting department why it’s so expensive to move from an old data standard to a new one or why this might be an important project. Information infrastructure—and the value it brings—is so often invisible to the public.
With this perspective in mind, the data transformation aspect of LINCS is a massive endeavour. We have gone through the ontological challenge of choosing a standard, CIDOC CRM, for the Linked Data (LD) triples in our triplestore. We have conceptualized how existing metadata schemas and ways of describing information can fit into this new framework through the process of mapping: identifying the elements in a target schema that are functionally equivalent to the elements in a source schema. We have picked out platforms for storing the resulting transformed data. Even with all these puzzle pieces in place, we must still take the original data for each dataset being contributed to LINCS and turn it into CIDOC CRM TTL triples for ingestion into the new triplestore system. Moving from one traditional metadata standard to another is challenging on its own, let alone moving from element-value style metadata to CIDOC CRM. We can write TTL triples using CIDOC CRM’s structure by hand (and we certainly did, when testing out how to map our first dataset to be transformed, the Yellow Nineties Personography), but we can’t transform a whole dataset manually, let alone all of the datasets that are part of the project, because that would take time, and time is money. Manual transformation is messy and slow, and it does not contribute to the replicability of processes, whereas automated transformation pipelines decrease errors, save time, and are reusable for future batches of data.
Having the right tools for the job makes all the difference. Having an online, free, open-source platform like the X3ML Toolkit and its data transformation tool Mapping Memory Manager (3M) makes an even bigger difference. 3M converts data from one structure to another; in our case, from a custom metadata schema in XML-RDF to CIDOC CRM in TTL-RDF. 3M can ingest a sample of metadata from an original dataset and be given a target ontology to transform the data to. Then, it’s a matter of telling 3M which element from the original data you want to map, and what pattern in your target ontology you want to map it to. Once a mapping is set up and tested, it can transform a full dataset in seconds. 3M has automated the process of data transformation with minimal customization required, relieving us of the need to create a bespoke tool and saving the data transformation team valuable time. It has allowed us to transform individual datasets over the course of a few weeks, without adding tool development time on top of the transformation itself.
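To make the idea of a reusable mapping concrete, here is a minimal Python sketch of the general pattern: declare once which source element goes to which target property, then apply that declaration to every record. This is not how 3M is implemented, and it is deliberately flatter than a real CIDOC CRM mapping. The sample XML, the element names, and the `ex:` properties are all hypothetical, invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of source metadata; the element names are invented.
SAMPLE_XML = """
<people>
  <person id="p1"><name>Ada Example</name><birthplace>London</birthplace></person>
  <person id="p2"><name>Ben Sample</name></person>
</people>
"""

# Declarative mapping: source element name -> target predicate.
# These predicates are illustrative placeholders, not the LINCS mapping.
MAPPING = {
    "name": "rdfs:label",
    "birthplace": "ex:bornIn",
}


def transform(xml_text, mapping):
    """Apply the same mapping to every record; return (subject, predicate, value) triples."""
    triples = []
    root = ET.fromstring(xml_text)
    for person in root.iter("person"):
        subject = f"ex:{person.get('id')}"
        for child in person:
            predicate = mapping.get(child.tag)
            # Skip elements with no mapping or no text content.
            if predicate and child.text:
                triples.append((subject, predicate, child.text.strip()))
    return triples


triples = transform(SAMPLE_XML, MAPPING)
for s, p, o in triples:
    print(f'{s} {p} "{o}" .')
```

Because the mapping lives in one place, the same function can be rerun on a new batch of records without touching the logic, which is the replicability that manual transformation lacks.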
The ontological magic of teaching 3M how to transform metadata to CIDOC CRM happens in the Matching Table, a tab within the 3M tool. This was the primary arena of my work in the transformation process. The Matching Table can handle complex transformations, allowing for the multiple connected triples that CIDOC CRM requires in order to semantically express nuanced information, for example, conceptualizing the school a person attended as an education activity with roles, as seen in the bottom row of the Matching Table below.
The 3M Matching Table.
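As a rough illustration of why one source element can expand into several connected triples, the Python sketch below turns a single "school attended" value into an education activity linked to both the person and the school. The class and property names are real CIDOC CRM terms, but the particular pattern, the `ex:` identifiers, and the helper functions are assumptions made for this example; it is not the mapping LINCS used, and it leaves out the roles that the actual mapping expresses.

```python
# Prefix declarations for the Turtle output. The ex: namespace is a placeholder.
PREFIXES = [
    "@prefix crm: <http://www.cidoc-crm.org/cidoc-crm/> .",
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
    "@prefix ex: <http://example.org/> .",
]


def education_triples(person_id, school_name):
    """Expand one 'school attended' value into connected CIDOC CRM-style triples."""
    slug = school_name.lower().replace(" ", "_")
    person = f"ex:{person_id}"
    school = f"ex:{slug}"
    activity = f"ex:{person_id}_education_{slug}"
    return [
        (person, "a", "crm:E21_Person"),
        (activity, "a", "crm:E7_Activity"),
        # The activity is typed as "education" via a hypothetical vocabulary term.
        (activity, "crm:P2_has_type", "ex:education"),
        ("ex:education", "a", "crm:E55_Type"),
        # Both the person and the school take part in the activity.
        (activity, "crm:P11_had_participant", person),
        (activity, "crm:P11_had_participant", school),
        (school, "a", "crm:E74_Group"),
        (school, "rdfs:label", f'"{school_name}"'),
    ]


def to_turtle(triples):
    """Serialize (subject, predicate, object) tuples as simple Turtle statements."""
    body = [f"{s} {p} {o} ." for s, p, o in triples]
    return "\n".join(PREFIXES + [""] + body)


triples = education_triples("p1", "Example College")
print(to_turtle(triples))
```

A single element-value pair in the source becomes eight statements here, which gives a sense of how quickly hand-written TTL grows and why a tool that applies the pattern automatically matters.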
Using 3M, we defined the desired CIDOC CRM–compliant output for each Yellow Nineties Personography metadata element, verified that each element was transforming correctly using a small sample of real data, and transformed the entire Yellow Nineties Personography dataset to CIDOC CRM in TTL format. We’re looking forward to automating this process further by using a graphing tool or by writing code that can help us vet the data, but for now we’re very happy with the output of this first transformation. It’s certainly a milestone for the team that worked on the transformation of Yellow Nineties Personography and for LINCS in general. We could not have completed the transformation process without having the right tool, a tool that happened to be the free, open-source option. 3M greatly reduces the labour required to transform data, making the transformation process more achievable. My hope is that our success in transforming the data can serve as an example of how other institutions may do the same, all without breaking the bank. We’re confident that 3M is the right fit for LINCS and are ready to transform more datasets in the future!
For more information about 3M and how it fits into the framework of an Extract-Transform-Load pipeline, check out the blog post by Justin, my colleague and teammate in data transformation.