Note

To run this notebook in JupyterLab, load examples/ex1_0.ipynb

Building a graph in RDF using rdflib

First we'll build a Graph object in rdflib to which we can add nodes and relations:

import rdflib

g = rdflib.Graph()

In RDF, a graph is constructed from triples, each of which represents an RDF statement that has exactly three components:

  • subject: the entity being annotated
  • predicate: a relation between the subject and the object
  • object: another entity or a literal value

We'll represent the anytime crepes recipe by making programmatic calls to rdflib, starting with a URL constructed from the recipe id as an initial node. We'll show this as our first subject s to be annotated using RDF statements.

uri = "https://www.food.com/recipe/327593"
s = rdflib.URIRef(uri)
s
rdflib.term.URIRef('https://www.food.com/recipe/327593')

When working with KGs, it's an important practice to use persistent identifiers: identifiers that are unique and that keep resolving to the same resource over time, in other words the opposite of link rot.

We could have used other ways to identify that node, such as a unique name. Even so, if we think of this recipe as a resource online, then its URL is both unique and persistent as long as the "food.com" website is available.

Next we'll use rdf:type as the predicate p to describe the subject as an instance of wtm:Recipe:

from rdflib.namespace import RDF

p = RDF.type
p
rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type')

The rdf:type predicate comes from a vocabulary that is predefined in rdflib, but now we'll need to reference other vocabularies. We'll use the NamespaceManager in rdflib to bind and access the namespaces for those vocabularies; the nm variable holds it:

nm = g.namespace_manager

By convention we use a prefix as a convenient way to abbreviate each namespace. For example, in the rdf:type predicate above the rdf: prefix is an abbreviation for the full http://www.w3.org/1999/02/22-rdf-syntax-ns# URL of the RDF namespace. See the http://prefix.cc/ online resource to look up the common usages for prefixes.

Next we'll define the wtm prefix for the "What to Make Base Ontology" at http://purl.org/heals/food/

uri = "http://purl.org/heals/food/"
ns_wtm = rdflib.Namespace(uri)

prefix = "wtm"
nm.bind(prefix, ns_wtm)

Now we can use this wtm: namespace to reference the object o as the wtm:Recipe entity:

o = ns_wtm.Recipe
o
rdflib.term.URIRef('http://purl.org/heals/food/Recipe')

Note how that object resolves to the URL http://purl.org/heals/food/Recipe – which is a link to the vocabulary's RDF description.

Finally, we'll add the tuple (s, p, o,) to the graph:

g.add((s, p, o,))
g
<Graph identifier=Neb99c9c875314576b4462509c2c9af99 (<class 'rdflib.graph.Graph'>)>

Now let's add the remaining metadata for the anytime crepes recipe. The required cooking time of "8 minutes" can be represented as a predicate wtm:hasCookTime and the literal 8 which we'll define as an xsd:integer value:

p = ns_wtm.hasCookTime
p
rdflib.term.URIRef('http://purl.org/heals/food/hasCookTime')
from rdflib.namespace import XSD

o = rdflib.Literal("8", datatype=XSD.integer)
o
rdflib.term.Literal('8', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer'))
g.add((s, p, o,))
<Graph identifier=Neb99c9c875314576b4462509c2c9af99 (<class 'rdflib.graph.Graph'>)>

Now let's add the three ingredients ["eggs", "milk", "whole wheat flour"] based on the vocabulary http://purl.org/heals/ingredient/ of food ingredients:

p = ns_wtm.hasIngredient
p
rdflib.term.URIRef('http://purl.org/heals/food/hasIngredient')
uri = "http://purl.org/heals/ingredient/"
ns_ind = rdflib.Namespace(uri)

prefix = "ind"
nm.bind(prefix, ns_ind)
o = ns_ind.ChickenEgg
o
rdflib.term.URIRef('http://purl.org/heals/ingredient/ChickenEgg')
g.add((s, p, o,))
<Graph identifier=Neb99c9c875314576b4462509c2c9af99 (<class 'rdflib.graph.Graph'>)>
g.add((s, p, ns_ind.CowMilk,))
g.add((s, p, ns_ind.WholeWheatFlour,))
<Graph identifier=Neb99c9c875314576b4462509c2c9af99 (<class 'rdflib.graph.Graph'>)>

To confirm what we've built so far, we can iterate through each of the (s, p, o,) statements in the graph:

for s, p, o in g:
    print(s, p, o)
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasIngredient http://purl.org/heals/ingredient/CowMilk
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasCookTime 8
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasIngredient http://purl.org/heals/ingredient/WholeWheatFlour
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasIngredient http://purl.org/heals/ingredient/ChickenEgg
https://www.food.com/recipe/327593 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://purl.org/heals/food/Recipe

Serialization as "Turtle" statements

First let's show how to serialize the graph in Turtle format, selected with format="ttl" or format="turtle". In rdflib 6.0 and later, serialize() returns a Python string when no encoding is given; earlier versions returned a byte array, which then had to be decoded with a Unicode codec:

# use a separate name so the subject node `s` is not overwritten
ttl = g.serialize(format="ttl")
print(ttl)
@prefix ind: <http://purl.org/heals/ingredient/> .
@prefix wtm: <http://purl.org/heals/food/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://www.food.com/recipe/327593> a wtm:Recipe ;
    wtm:hasCookTime 8 ;
    wtm:hasIngredient ind:ChickenEgg,
        ind:CowMilk,
        ind:WholeWheatFlour .

Similarly, we can serialize the graph as RDF statements to a file tmp.ttl in the local directory:

g.serialize(destination="tmp.ttl", format="ttl", encoding="utf-8") ;

Try taking a look at the tmp.ttl file. Is it the same as the serialization shown above?

Serialization as JSON-LD

Next, let's serialize the graph in JSON-LD format, stored in the tmp.jsonld file in the local directory:

data = g.serialize(
    format="json-ld",
    indent=2,
    encoding="utf-8",
    )

with open("tmp.jsonld", "wb") as f:
    f.write(data)

Try taking a look at the tmp.jsonld file. Each entity, relation, and literal value has a full URL known as an IRI (Internationalized Resource Identifier) which identifies a resource used to define it.

We can make these JSON-LD files a bit more succinct by adding a context that defines prefixes for each of the vocabularies used:

context = {
    "@language": "en",
    "wtm": "http://purl.org/heals/food/",
    "ind": "http://purl.org/heals/ingredient/",
    }
context
{'@language': 'en',
 'wtm': 'http://purl.org/heals/food/',
 'ind': 'http://purl.org/heals/ingredient/'}

Now we'll serialize again as JSON-LD, this time using the context:

data = g.serialize(
    format="json-ld",
    context=context,
    indent=2,
    encoding="utf-8",
    )

with open("tmp.jsonld", "wb") as f:
    f.write(data)

Open the tmp.ttl and tmp.jsonld files and compare them. Notice how the ttl file is easier for people to read, while the json-ld file has all of the metadata explicitly linked and is easier for machines to read, even simply as a JSON file without using any semantic technologies.


Exercises

Exercise 1:

By using ns_ind.AllPurposeFlour to represent "flour" as another possible ingredient, how would you extend the graph to represent the German Egg Pancakes https://www.food.com/recipe/406738 recipe?

Exercise 2:

The wtm:hasCookTime predicate uses an xsd:integer literal to represent cooking time in minutes. That may be confusing to someone who is not familiar with this dataset. Instead, represent the cooking time using an xsd:duration literal.


Last update: 2022-03-23