To run this notebook in JupyterLab, load examples/ex1_1.ipynb

Leveraging the kglab abstraction layer

Now let's try the previous examples – i.e., building a simple recipe KG – again, but this time using the kglab library to make things a wee bit easier...

import kglab

namespaces = {
    "wtm": "",
    "ind": "",
    }

kg = kglab.KnowledgeGraph(
    name = "A recipe KG example based on",
    namespaces = namespaces,
    )

Once we have the kg object instantiated, we can use its short-cuts for the food-related vocabularies. Then we construct the graph from a sequence of RDF statements:

import rdflib

node = rdflib.URIRef("")

kg.add(node, kg.get_ns("rdf").type, kg.get_ns("wtm").Recipe)
kg.add(node, kg.get_ns("wtm").hasCookTime, rdflib.Literal("PT8M", datatype=kg.get_ns("xsd").duration))
kg.add(node, kg.get_ns("wtm").hasIngredient, kg.get_ns("ind").ChickenEgg)
kg.add(node, kg.get_ns("wtm").hasIngredient, kg.get_ns("ind").CowMilk)
kg.add(node, kg.get_ns("wtm").hasIngredient, kg.get_ns("ind").WholeWheatFlour)
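As a rough illustration of what a namespace short-cut does, the sketch below mimics the idea in plain Python: attribute access on a prefix object appends a local name to a base URL. The class `ToyNamespace` and the `http://example.org/...` URLs are assumptions made up for this example; this is not the kglab or rdflib implementation, and these are not the vocabularies used above.

```python
class ToyNamespace(str):
    """Hypothetical stand-in for an RDF namespace prefix."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so built-in
        # string methods such as .title still take precedence.
        if name.startswith("__"):
            raise AttributeError(name)
        # Expand the short-cut: base URL + local name.
        return str(self) + name


# Placeholder base URLs, chosen only for this sketch.
wtm = ToyNamespace("http://example.org/wtm#")
ind = ToyNamespace("http://example.org/ind#")

recipe = wtm.Recipe
egg = ind.ChickenEgg

print(recipe)
print(egg)
```

The point of the design is that statements can be written with short, readable names while the graph itself still stores full URLs.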

Iterating through these statements shows their full URLs:

for s, p, o in kg.rdf_graph():
    print(s, p, o)

As an alternative way to examine these statements, we can serialize the graph to a string – in this case also using Turtle format:

s = kg.save_rdf_text(format="ttl")
print(s)
@prefix ind: <> .
@prefix wtm: <> .
@prefix xsd: <> .

<> a wtm:Recipe ;
    wtm:hasCookTime "PT8M"^^xsd:duration ;
    wtm:hasIngredient ind:ChickenEgg,
        ind:CowMilk,
        ind:WholeWheatFlour .

Overall, with the kglab library the KG serialization to multiple formats (Turtle, XML, JSON-LD, etc.) becomes much simpler:

kg.save_rdf("tmp.ttl")
kg.save_rdf("tmp.xml", format="xml")
kg.save_jsonld("tmp.jsonld")

Try opening these files to confirm their serialized contents.

Next, we'll use the Parquet format for columnar storage. This technology has been especially effective for Big Data frameworks that need efficient data management and analytics. Parquet data is simple to partition into multiple files (e.g., for distributed processing per partition) and the columns can be selectively decompressed on file reads (e.g., for predicate pushdown optimizations).

kg.save_parquet("tmp.parquet")

Let's compare the relative file sizes for these formats:

import pandas as pd
import os

file_paths = ["tmp.jsonld", "tmp.ttl", "tmp.xml", "tmp.parquet"]
file_sizes = [os.path.getsize(file_path) for file_path in file_paths]

df = pd.DataFrame({"file_path": file_paths, "file_size": file_sizes})
     file_path  file_size
0   tmp.jsonld        860
1      tmp.ttl        333
2      tmp.xml        671
3  tmp.parquet       3556
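To make the comparison easier to read, one option is to express each size as a multiple of the smallest file. The sketch below does this for the byte counts reported above; the helper name `relative_sizes` is an assumption introduced for this example and is not part of kglab or pandas.

```python
def relative_sizes(sizes):
    """Map each file name to its size divided by the smallest size."""
    smallest = min(sizes.values())
    return {path: size / smallest for path, size in sizes.items()}


# File sizes in bytes, copied from the table above.
sizes = {
    "tmp.jsonld": 860,
    "tmp.ttl": 333,
    "tmp.xml": 671,
    "tmp.parquet": 3556,
}

ratios = relative_sizes(sizes)

# Print the files from smallest to largest with their ratio.
for path, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{path:<12} {sizes[path]:>5} bytes  {ratio:5.2f}x")
```

On these numbers the Parquet file is roughly ten times the size of the Turtle file.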

Parquet uses compression based on a "dictionary" approach, so it adds overhead for small files such as this KG. We'll revisit this comparison across file formats with a larger KG.
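To see why a dictionary-based scheme can cost more than it saves on tiny inputs, here is a minimal pure-Python sketch. It is a simplified model and not Parquet's actual encoding: the helper names, the fixed 4-byte index width, and the sample values are all assumptions made for illustration.

```python
def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> index in `dictionary`
    indices = []      # one index per input value
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original values from the dictionary and indices."""
    return [dictionary[i] for i in indices]


def raw_size(values):
    """Bytes needed to store every value as UTF-8 text."""
    return sum(len(v.encode("utf-8")) for v in values)


def encoded_size(dictionary, indices, index_bytes=4):
    """Rough size: distinct values once, plus a fixed-width index per row."""
    return raw_size(dictionary) + index_bytes * len(indices)


# A tiny column with no repeated values: the indices are pure overhead.
small = ["wtm:hasIngredient", "wtm:hasCookTime", "rdf:type"]
# A larger column with heavy repetition: the dictionary pays off.
large = ["wtm:hasIngredient"] * 1000 + ["wtm:hasCookTime"] * 10

for name, column in [("small", small), ("large", large)]:
    dictionary, indices = dictionary_encode(column)
    print(name, raw_size(column), encoded_size(dictionary, indices))
```

In this model the encoded form is larger than the raw text for the small column and much smaller for the repetitive one, which matches the intuition that the benefit only shows up as the graph grows.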


Exercise 1:

Using the kglab library, extend the graph by adding another recipe, such as German Egg Pancakes, then serialize to the same file formats again. How do the relative file sizes compare as the size of the graph grows?

Last update: 2021-04-10