Corpus Description

The corpus for this machine learning competition provides a graph of research publications linked to the datasets used in that research, along with other metadata. The graph uses the ADRF vocabulary, which is based on DCAT and other vocabularies.

The corpus is provided in both TTL and JSON-LD serialization formats. The former is more human-readable and can have axioms applied for consistency checking, while the latter is generally more usable for machines. It takes about two lines of Python to convert between the two formats.
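
As a rough illustration of that conversion, here is a minimal sketch that assumes the `rdflib` library (version 6 or later, where JSON-LD support is built in and `serialize()` returns a string). The helper name `convert_rdf` and the sample triple are hypothetical and are not taken from the corpus or its scripts.

```python
def convert_rdf(text: str, source_format: str, target_format: str) -> str:
    """Parse RDF text in one serialization and return it in another."""
    # Imported here so rdflib is only needed when a conversion is requested.
    # Assumption: rdflib >= 6, which bundles the JSON-LD parser and serializer.
    from rdflib import Graph

    graph = Graph()
    graph.parse(data=text, format=source_format)
    return graph.serialize(format=target_format)


# Hypothetical sample; the identifier is a placeholder, not a corpus entry.
SAMPLE_TTL = """
@prefix dct: <http://purl.org/dc/terms/> .

<https://example.org/dataset-1> dct:title "Example Survey" .
"""

if __name__ == "__main__":
    # TTL -> JSON-LD; swap the two format names to go the other way.
    print(convert_rdf(SAMPLE_TTL, "turtle", "json-ld"))
```

The two lines that do the work are `graph.parse(...)` and `graph.serialize(...)`; the rest is scaffolding so the sketch runs on its own.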

Entity Definitions

Each entity in the graph has a title field and an id (a generated UUID), and wherever possible it is also uniquely identified by a persistent identifier. These linked data annotations have been verified by domain experts.

entity	persistent identifier	required fields	optional fields	notes
dataset	ADRF vocab ID	`id`, `title`, `provider`	`doi`, `url`, `alt_title` (list), `description`, `date`
provider	ROR	`id`, `title`	`ror`, `url`, `description`
publication	DOI	`id`, `title`, `datasets` (list), `journal`	`doi`, `url`, `pdf` (open access URL)	open access PDFs are downloaded and provided in a public S3 bucket
journal	ISSN	`id`, `titles` (list), `issn` (list)	`url`	first element in `titles` list is the ISO 4 standard abbreviation; first element in `issn` list is the linking ISSN
author	ORCID	`id`, `title`	`orcid`, `url`
topic	madsrdf:Authority	`label`
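
The required fields in the table above could be checked with something like the following sketch. It assumes each entity has already been loaded into a plain Python dict keyed by field name, which is an assumption about how a participant might hold the data and not something the corpus tooling is known to do. The function names and the sample records are hypothetical.

```python
# Required fields per entity type, copied from the table above.
REQUIRED_FIELDS = {
    "dataset": {"id", "title", "provider"},
    "provider": {"id", "title"},
    "publication": {"id", "title", "datasets", "journal"},
    "journal": {"id", "titles", "issn"},
    "author": {"id", "title"},
    "topic": {"label"},
}


def missing_fields(kind: str, record: dict) -> list:
    """Return the required fields that `record` lacks, sorted by name."""
    return sorted(REQUIRED_FIELDS[kind].difference(record))


def journal_abbreviation(journal: dict) -> str:
    """First element of `titles` is the ISO 4 standard abbreviation."""
    return journal["titles"][0]


def linking_issn(journal: dict) -> str:
    """First element of `issn` is the linking ISSN."""
    return journal["issn"][0]


if __name__ == "__main__":
    # Hypothetical records; the values are placeholders, not corpus data.
    dataset = {"id": "placeholder-uuid", "title": "Example Survey"}
    print(missing_fields("dataset", dataset))  # → ['provider']

    journal = {
        "id": "placeholder-uuid",
        "titles": ["Ex. J.", "Example Journal"],
        "issn": ["0000-0000", "1111-1111"],
    }
    print(journal_abbreviation(journal), linking_issn(journal))
```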

Usage

For examples of how to read and write the corpus files in Python -- both in TTL and JSON-LD formats -- see the write_corpus() method in the gen_ttl.py script.

The corpus will be extended over time, with updates managed using GitHub tags and versioning. After each update, previous entries in the leaderboard will get re-evaluated.

Note that names in the dct:alternative field are merely informational -- what our human annotators have encountered when reading PDFs to identify dataset references manually. For the purposes of the competition the ML models don't need to use them in any way.
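
For anyone who wants to inspect those alternative names anyway, the sketch below lists them per subject. It assumes `rdflib` version 6 or later and that dct refers to the Dublin Core Terms namespace, which `rdflib` exposes as `DCTERMS`. The helper name `alternative_titles` and the sample data are hypothetical.

```python
from __future__ import annotations


def alternative_titles(ttl_text: str) -> dict[str, list[str]]:
    """Map each subject IRI to its sorted dct:alternative values."""
    # Imported here so rdflib is only needed when this helper is called.
    from rdflib import Graph
    from rdflib.namespace import DCTERMS

    graph = Graph()
    graph.parse(data=ttl_text, format="turtle")

    found: dict[str, list[str]] = {}
    for subject, alt in graph.subject_objects(DCTERMS.alternative):
        found.setdefault(str(subject), []).append(str(alt))
    return {subject: sorted(names) for subject, names in found.items()}


# Hypothetical sample; the identifier and names are placeholders.
SAMPLE_ALT_TTL = """
@prefix dct: <http://purl.org/dc/terms/> .

<https://example.org/dataset-1> dct:title "Example Survey" ;
    dct:alternative "ES", "Example Svy" .
"""

if __name__ == "__main__":
    for subject, names in alternative_titles(SAMPLE_ALT_TTL).items():
        print(subject, names)
```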
