Note

To run this notebook in JupyterLab, load examples/ex1_1.ipynb

Leveraging the kglab abstraction layer

Now let's try the previous examples – i.e., building a simple recipe KG – again, but this time using the kglab library to make things a wee bit easier...

import kglab

namespaces = {
    "wtm": "http://purl.org/heals/food/",
    "ind": "http://purl.org/heals/ingredient/",
    }

kg = kglab.KnowledgeGraph(
    name = "A recipe KG example based on Food.com",
    base_uri = "https://www.food.com/recipe/",
    language = "en",
    namespaces = namespaces,
    )

Once the kg object is instantiated, we can use its namespace shortcuts (via kg.get_ns()) for the food-related vocabularies, then construct the graph from a sequence of triples:

import rdflib
from rdflib.namespace import RDF, XSD

node = rdflib.URIRef("https://www.food.com/recipe/327593")

kg.add(node, RDF.type, kg.get_ns("wtm").Recipe)
kg.add(node, kg.get_ns("wtm").hasCookTime, rdflib.Literal("PT8M", datatype=XSD.duration))
kg.add(node, kg.get_ns("wtm").hasIngredient, kg.get_ns("ind").ChickenEgg)
kg.add(node, kg.get_ns("wtm").hasIngredient, kg.get_ns("ind").CowMilk)
kg.add(node, kg.get_ns("wtm").hasIngredient, kg.get_ns("ind").WholeWheatFlour)
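
The cook time above is stored as an xsd:duration literal, whose lexical form follows ISO 8601: "PT8M" reads as a period (P) with a time part (T) of 8 minutes. As a minimal sketch, assuming only the day, hour, minute, and second fields are needed, the hypothetical helper parse_duration below (not part of kglab or rdflib) converts such strings into datetime.timedelta values using only the standard library:

```python
import re
from datetime import timedelta

# Hypothetical helper for illustration; not part of kglab or rdflib.
# Handles only the day and time fields of an ISO 8601 duration
# (e.g. "PT8M", "P1DT2H30M"). Years and months are deliberately
# unsupported because they have no fixed length.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)

def parse_duration(text: str) -> timedelta:
    match = _DURATION.match(text)
    # Reject non-matching strings and degenerate forms such as "P", "PT", "P1DT".
    if match is None or text == "P" or text.endswith("T"):
        raise ValueError(f"unsupported duration: {text!r}")
    # Keep only the fields that were present in the string.
    parts = {name: float(value) for name, value in match.groupdict().items() if value}
    return timedelta(**parts)

cook_time = parse_duration("PT8M")
print(cook_time.total_seconds() / 60)  # cook time in minutes
```

A conversion like this is handy when cook times need to be compared or summed numerically rather than treated as opaque strings.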

Iterating through these triples shows their full URLs:

for s, p, o in kg.rdf_graph():
    print(s, p, o)
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasIngredient http://purl.org/heals/ingredient/CowMilk
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasIngredient http://purl.org/heals/ingredient/ChickenEgg
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasCookTime PT8M
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasIngredient http://purl.org/heals/ingredient/WholeWheatFlour
https://www.food.com/recipe/327593 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://purl.org/heals/food/Recipe

As another way to examine the triples, we can serialize the graph to a string – in this case, using Turtle format:

s = kg.rdf_graph().serialize(format="ttl")
# rdflib 6+ returns a str here; older versions return bytes
print(s.decode("utf-8") if isinstance(s, bytes) else s)
@prefix ind: <http://purl.org/heals/ingredient/> .
@prefix wtm: <http://purl.org/heals/food/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://www.food.com/recipe/327593> a wtm:Recipe ;
    wtm:hasCookTime "PT8M"^^xsd:duration ;
    wtm:hasIngredient ind:ChickenEgg,
        ind:CowMilk,
        ind:WholeWheatFlour .
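
The Turtle output abbreviates each URL using the @prefix declarations, whereas the earlier loop printed full URLs. The two forms are related by simple string concatenation of the namespace and the local name. A minimal sketch, reusing the namespaces dictionary from this notebook with a hypothetical expand helper (illustrative only; kglab and rdflib manage prefixes internally in their own way):

```python
# Same prefix-to-namespace mapping used to build the KG above.
namespaces = {
    "wtm": "http://purl.org/heals/food/",
    "ind": "http://purl.org/heals/ingredient/",
}

def expand(curie: str, prefixes: dict) -> str:
    """Expand a prefixed name such as 'wtm:Recipe' into a full URL.

    Hypothetical helper for illustration; not a kglab or rdflib function.
    """
    prefix, sep, local_name = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {curie!r}")
    return prefixes[prefix] + local_name

print(expand("wtm:Recipe", namespaces))
print(expand("ind:CowMilk", namespaces))
```

The printed values should match the full URLs shown when iterating through the triples.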

Overall, the kglab library makes it much simpler to serialize a KG to multiple formats (Turtle, XML, JSON-LD, etc.):

kg.save_rdf("tmp.ttl")
kg.save_rdf("tmp.xml", format="xml")
kg.save_jsonld("tmp.jsonld")

Try opening these files to confirm their serialized contents.

Next, we'll use the Parquet format for columnar storage. This technology has proven especially effective in Big Data frameworks for efficient data management and analytics. A dataset is simple to partition into multiple files (e.g., for distributed processing per partition), and columns can be selectively decompressed on file reads (e.g., for predicate pushdown optimizations).

kg.save_parquet("tmp.parquet")

Let's compare the relative file sizes for these formats:

import pandas as pd
import os

file_paths = ["tmp.jsonld", "tmp.ttl", "tmp.xml", "tmp.parquet"]
file_sizes = [os.path.getsize(file_path) for file_path in file_paths]

df = pd.DataFrame({"file_path": file_paths, "file_size": file_sizes})
df
     file_path  file_size
0   tmp.jsonld        864
1      tmp.ttl        344
2      tmp.xml        686
3  tmp.parquet       3555

Parquet uses compression based on a "dictionary" approach, so it adds overhead for small files such as this KG. We'll revisit this comparison across file formats with a larger KG.
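
To put the comparison on a common scale, the sizes can be expressed relative to the smallest file. The sketch below hard-codes the byte counts reported in the table above (they will differ on other machines or library versions) and uses only plain Python; the variable names are illustrative:

```python
# File sizes in bytes, copied from the table above.
file_sizes = {
    "tmp.jsonld": 864,
    "tmp.ttl": 344,
    "tmp.xml": 686,
    "tmp.parquet": 3555,
}

# Express each size as a multiple of the smallest file.
smallest = min(file_sizes.values())
ratios = {path: round(size / smallest, 2) for path, size in file_sizes.items()}

# Print from smallest to largest.
for path, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{path:12} {ratio:6.2f}x")
```

For this tiny graph the Parquet file comes out at roughly ten times the size of the Turtle file, which is consistent with the fixed overhead described above.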


Exercises

Exercise 1:

Using the kglab library, extend the graph by adding another recipe, such as German Egg Pancakes (https://www.food.com/recipe/406738), then serialize it to the same file formats again. How do the relative file sizes compare as the size of the graph grows?


Last update: 2021-01-21