Corpus Description

Paco Nathan edited this page Mar 8, 2020 · 7 revisions

The corpus for this machine learning competition is a graph of research publications linked to the datasets used in that research, plus other metadata. It uses the ADRF vocabulary, which builds on DCAT and other vocabularies.

The corpus is provided in both TTL and JSON-LD serialization formats. The former is more human-readable and can have axioms applied for consistency checking, while the latter is generally more usable for machines. It takes about two lines of Python to convert between the two formats.

Entity Definitions

Entities in the graph each have a title field and an id (a generated UUID), and as much as possible they are uniquely identified by persistent identifiers. These linked data annotations have been verified by domain experts.

| entity | persistent identifier | required fields | optional fields | notes |
| --- | --- | --- | --- | --- |
| dataset | ADRF vocab ID | id, title, provider | doi, url, alt_title (list), description, date | |
| provider | ROR | id, title | ror, url, description | |
| publication | DOI | id, title, datasets (list), journal | doi, url, pdf (open access URL) | open access PDFs are downloaded and provided in a public S3 bucket |
| journal | ISSN | id, titles (list), issn (list) | url | first element in titles list is the ISO 4 standard abbreviation; first element in issn list is the linking ISSN |
| author | ORCID | id, title | orcid, url | |
| topic | madsrdf:Authority | label | | |
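
The required fields in the table above can be checked mechanically. The following is a minimal sketch, assuming each entity has already been loaded into a plain Python dict keyed by the field names used in the table; that record shape, the helper names, and the sample values are all hypothetical and not part of the corpus tooling.

```python
# Required fields per entity type, transcribed from the table above.
REQUIRED_FIELDS = {
    "dataset": ["id", "title", "provider"],
    "provider": ["id", "title"],
    "publication": ["id", "title", "datasets", "journal"],
    "journal": ["id", "titles", "issn"],
    "author": ["id", "title"],
    "topic": ["label"],
}


def missing_fields(kind, record):
    """Return the required fields that are absent or empty in `record`."""
    return [name for name in REQUIRED_FIELDS[kind] if not record.get(name)]


def journal_abbreviation(journal):
    """First entry of `titles`, which the notes describe as the ISO 4 abbreviation."""
    return journal["titles"][0]


def linking_issn(journal):
    """First entry of `issn`, which the notes describe as the linking ISSN."""
    return journal["issn"][0]


# Hypothetical records, for illustration only.
incomplete_dataset = {"id": "dataset-001", "title": "Example Survey"}
example_journal = {
    "id": "journal-001",
    "titles": ["Ex. J.", "Example Journal"],
    "issn": ["0000-0000", "1111-1111"],
}

print(missing_fields("dataset", incomplete_dataset))
print(journal_abbreviation(example_journal), linking_issn(example_journal))
```

A check like this only confirms that the fields are present; it says nothing about whether the identifiers themselves are valid.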

Usage

For examples of how to read and write the corpus files in Python -- both in TTL and JSON-LD formats -- see the write_corpus() method in the gen_ttl.py script.

The corpus will be extended over time, with updates managed using GitHub tags and versioning. After each update, previous entries in the leaderboard will get re-evaluated.

Note that names in the dct:alternative field are merely informational -- they record what our human annotators encountered when reading PDFs to identify dataset references manually. For the purposes of the competition, the ML models don't need to use them in any way.
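
For readers who do want to inspect those alternative names, here is a minimal sketch that pulls them out of a single JSON-LD node using only the standard library. It assumes the node is in compacted form with a `dct:alternative` key; the exact key depends on the `@context` of the corpus file, and the sample node and the `alternative_names` helper are hypothetical. JSON-LD allows a property to hold either a single value or a list, and each value may be a plain string or an object with `@value`, so the helper normalizes all of these.

```python
import json


def alternative_names(node, key="dct:alternative"):
    """Return the alternative names of a JSON-LD node as a list of strings."""
    value = node.get(key, [])
    if not isinstance(value, list):
        value = [value]  # a lone value may appear without a list wrapper
    names = []
    for item in value:
        if isinstance(item, dict):
            item = item.get("@value")  # value object, e.g. {"@value": "..."}
        if isinstance(item, str):
            names.append(item)
    return names


# Hypothetical node, for illustration only.
node = json.loads("""
{
  "@id": "http://example.org/dataset-001",
  "dct:title": "Example Survey",
  "dct:alternative": ["ES", {"@value": "Example Survey 2020"}]
}
""")

print(alternative_names(node))
```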