Note
To run this notebook in JupyterLab, load examples/ex1_0.ipynb
Building a graph in RDF using rdflib
First we'll build a Graph object in rdflib to which we can add nodes and relations:
import rdflib
g = rdflib.Graph()
In RDF, a graph is constructed from triples, each of which represents an RDF statement that has exactly three components:
- subject: the entity being annotated
- predicate: a relation between the subject and the object
- object: another entity or a literal value
We'll represent the anytime crepes recipe by making programmatic calls to rdflib, starting with a URL constructed from the recipe id as an initial node.
We'll show this as our first subject s to be annotated using RDF statements.
uri = "https://www.food.com/recipe/327593"
s = rdflib.URIRef(uri)
s
rdflib.term.URIRef('https://www.food.com/recipe/327593')
Throughout work with KGs, it is important practice to use identifiers that are both unique and persistent, in other words the opposite of link rot.
We could have used other ways to identify that node, such as a unique name. Even so, if we think of this recipe as a resource online, then its URL is both unique and persistent as long as the "food.com" website is available.
Next we'll use rdf:type as the predicate p to describe the subject as an instance of wtm:Recipe:
from rdflib.namespace import RDF
p = RDF.type
p
rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type')
While the rdf:type predicate above comes from a vocabulary that is predefined in rdflib, now we'll need to reference other vocabularies.
To bind and access the namespaces for those vocabularies we'll use the graph's NamespaceManager in rdflib, held here in the nm variable:
nm = g.namespace_manager
By convention we use a prefix as a convenient way to abbreviate each namespace.
For example, in the rdf:type predicate above the rdf: prefix is an abbreviation for the full http://www.w3.org/1999/02/22-rdf-syntax-ns# URL of the RDF namespace.
See the http://prefix.cc/ online resource to look up the common usages for prefixes.
Next we'll define the wtm prefix for the "What to Make Base Ontology" at http://purl.org/heals/food/
uri = "http://purl.org/heals/food/"
ns_wtm = rdflib.Namespace(uri)
prefix = "wtm"
nm.bind(prefix, ns_wtm)
Now we can use this wtm: namespace to reference the object o as the wtm:Recipe entity:
o = ns_wtm.Recipe
o
rdflib.term.URIRef('http://purl.org/heals/food/Recipe')
Note how that object resolves to the URL http://purl.org/heals/food/Recipe – which is a link to the vocabulary's RDF description.
Finally, we'll add the tuple (s, p, o,) to the graph:
g.add((s, p, o,))
g
<Graph identifier=Neb99c9c875314576b4462509c2c9af99 (<class 'rdflib.graph.Graph'>)>
Now let's add the remaining metadata for the anytime crepes recipe.
The required cooking time of "8 minutes" can be represented as a predicate wtm:hasCookTime and the literal 8 which we'll define as an xsd:integer value:
p = ns_wtm.hasCookTime
p
rdflib.term.URIRef('http://purl.org/heals/food/hasCookTime')
from rdflib.namespace import XSD
o = rdflib.Literal("8", datatype=XSD.integer)
o
rdflib.term.Literal('8', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer'))
g.add((s, p, o,))
<Graph identifier=Neb99c9c875314576b4462509c2c9af99 (<class 'rdflib.graph.Graph'>)>
Now let's add the three ingredients ["eggs", "milk", "whole wheat flour"] based on the vocabulary http://purl.org/heals/ingredient/ of food ingredients:
p = ns_wtm.hasIngredient
p
rdflib.term.URIRef('http://purl.org/heals/food/hasIngredient')
uri = "http://purl.org/heals/ingredient/"
ns_ind = rdflib.Namespace(uri)
prefix = "ind"
nm.bind(prefix, ns_ind)
o = ns_ind.ChickenEgg
o
rdflib.term.URIRef('http://purl.org/heals/ingredient/ChickenEgg')
g.add((s, p, o,))
<Graph identifier=Neb99c9c875314576b4462509c2c9af99 (<class 'rdflib.graph.Graph'>)>
g.add((s, p, ns_ind.CowMilk,))
g.add((s, p, ns_ind.WholeWheatFlour,))
<Graph identifier=Neb99c9c875314576b4462509c2c9af99 (<class 'rdflib.graph.Graph'>)>
To confirm what we've built so far, we can iterate through each of the (s, p, o,) statements in the graph:
for s, p, o in g:
    print(s, p, o)
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasIngredient http://purl.org/heals/ingredient/CowMilk
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasCookTime 8
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasIngredient http://purl.org/heals/ingredient/WholeWheatFlour
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasIngredient http://purl.org/heals/ingredient/ChickenEgg
https://www.food.com/recipe/327593 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://purl.org/heals/food/Recipe
Serialization as "Turtle" statements
First let's show how to serialize the graph as ttl or turtle format.
With rdflib 6.0 or later, serialize() returns a string when no encoding is given; earlier releases returned a byte array, which had to be decoded with a Unicode codec before printing:
s = g.serialize(format="ttl")
print(s)
@prefix ind: <http://purl.org/heals/ingredient/> .
@prefix wtm: <http://purl.org/heals/food/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<https://www.food.com/recipe/327593> a wtm:Recipe ;
    wtm:hasCookTime 8 ;
    wtm:hasIngredient ind:ChickenEgg,
        ind:CowMilk,
        ind:WholeWheatFlour .
Similarly, we can serialize the graph as RDF statements to a file tmp.ttl in the local directory:
g.serialize(destination="tmp.ttl", format="ttl", encoding="utf-8") ;
Try taking a look at the tmp.ttl file.
Is it the same as the serialization shown above?
Serialization as JSON-LD
Next, let's serialize the graph in JSON-LD format, stored in the tmp.jsonld file in the local directory:
data = g.serialize(
    format="json-ld",
    indent=2,
    encoding="utf-8",
)
with open("tmp.jsonld", "wb") as f:
    f.write(data)
Try taking a look at the tmp.jsonld file.
Each entity, relation, and literal value has a full URL known as an IRI (Internationalized Resource Identifier) which identifies a resource used to define it.
We can make these JSON-LD files a bit more succinct by adding a context that defines prefixes for each of the vocabularies used:
context = {
    "@language": "en",
    "wtm": "http://purl.org/heals/food/",
    "ind": "http://purl.org/heals/ingredient/",
}
context
{'@language': 'en',
'wtm': 'http://purl.org/heals/food/',
'ind': 'http://purl.org/heals/ingredient/'}
Now we'll serialize again as JSON-LD, this time using the context:
data = g.serialize(
    format="json-ld",
    context=context,
    indent=2,
    encoding="utf-8",
)
with open("tmp.jsonld", "wb") as f:
    f.write(data)
Open the tmp.ttl and tmp.jsonld files and compare the difference.
Notice how the ttl file is easier to read (for people), while the json-ld file has all of the metadata explicitly linked and is easier for machines to read, even simply as a JSON file, not using any semantic technologies.
Exercises
Exercise 1:
By using ns_ind.AllPurposeFlour to represent "flour" as another possible ingredient, how would you extend the graph to represent the German Egg Pancakes https://www.food.com/recipe/406738 recipe?
Exercise 2:
The wtm:hasCookTime predicate uses an xsd:integer literal to represent cooking time in minutes.
That may be confusing to someone who is not familiar with this dataset.
Instead, represent the cooking time using an xsd:duration literal.