Note
To run this notebook in JupyterLab, load examples/ex1_1.ipynb
Leveraging the kglab abstraction layer¶
Now let's try the previous examples – i.e., building a simple recipe KG – again, but this time using the kglab library to make things a wee bit easier...
import kglab
namespaces = {
    "wtm": "http://purl.org/heals/food/",
    "ind": "http://purl.org/heals/ingredient/",
}

kg = kglab.KnowledgeGraph(
    name = "A recipe KG example based on Food.com",
    namespaces = namespaces,
)
Once we have the kg object instantiated, we can use its short-cuts for the food-related vocabularies.
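As a quick check (an addition here, not part of the original notebook), get_ns() returns an rdflib Namespace object, so attribute access on it expands to a full URI:

wtm = kg.get_ns("wtm")
print(wtm.Recipe)  # expected: http://purl.org/heals/food/Recipe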
Then construct the graph from a sequence of RDF statements:
import rdflib
node = rdflib.URIRef("https://www.food.com/recipe/327593")
kg.add(node, kg.get_ns("rdf").type, kg.get_ns("wtm").Recipe)
kg.add(node, kg.get_ns("wtm").hasCookTime, rdflib.Literal("PT8M", datatype=kg.get_ns("xsd").duration))
kg.add(node, kg.get_ns("wtm").hasIngredient, kg.get_ns("ind").ChickenEgg)
kg.add(node, kg.get_ns("wtm").hasIngredient, kg.get_ns("ind").CowMilk)
kg.add(node, kg.get_ns("wtm").hasIngredient, kg.get_ns("ind").WholeWheatFlour)
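As a quick sanity check (again an addition, not from the original notebook), rdf_graph() exposes the underlying rdflib.Graph, so its length gives the number of triples added so far:

# five triples were added above: one type, one cook time, three ingredients
print(len(kg.rdf_graph()))  # expected: 5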
Iterating through these statements shows their full URLs:
for s, p, o in kg.rdf_graph():
    print(s, p, o)
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasIngredient http://purl.org/heals/ingredient/CowMilk
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasCookTime PT8M
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasIngredient http://purl.org/heals/ingredient/WholeWheatFlour
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasIngredient http://purl.org/heals/ingredient/ChickenEgg
https://www.food.com/recipe/327593 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://purl.org/heals/food/Recipe
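The same statements can also be retrieved with SPARQL. Here is a minimal sketch (assuming kglab's query() method, which hands the query string to rdflib underneath) rather than part of the original walkthrough:

# list the ingredients of any recipe in the graph
sparql = """
PREFIX wtm: <http://purl.org/heals/food/>
SELECT ?ingredient
WHERE {
    ?recipe wtm:hasIngredient ?ingredient .
}
"""

for row in kg.query(sparql):
    print(row.ingredient)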
As an alternative way to examine these statements, we can serialize the graph to a string – in this case also using Turtle format:
s = kg.save_rdf_text(format="ttl")
print(s)
@prefix ind: <http://purl.org/heals/ingredient/> .
@prefix wtm: <http://purl.org/heals/food/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://www.food.com/recipe/327593> a wtm:Recipe ;
    wtm:hasCookTime "PT8M"^^xsd:duration ;
    wtm:hasIngredient ind:ChickenEgg,
        ind:CowMilk,
        ind:WholeWheatFlour .
Overall, with the kglab library, serializing the KG to multiple formats (Turtle, XML, JSON-LD, etc.) becomes much simpler:
kg.save_rdf("tmp.ttl")
kg.save_rdf("tmp.xml", format="xml")
kg.save_jsonld("tmp.jsonld")
Try opening these files to confirm their serialized contents.
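Alternatively, here is a programmatic round-trip check, a sketch that assumes kglab's load_rdf() method: load one of the files back into a fresh KnowledgeGraph and compare triple counts.

# load the Turtle file into a new graph and compare sizes
kg_check = kglab.KnowledgeGraph(namespaces=namespaces)
kg_check.load_rdf("tmp.ttl", format="ttl")
print(len(kg_check.rdf_graph()) == len(kg.rdf_graph()))  # expected: True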
Next, we'll use the Parquet format for columnar storage. This technology has been especially effective in Big Data frameworks that need to handle data management and analytics efficiently. A Parquet dataset is simple to partition into multiple files (e.g., for distributed processing per partition), and its columns can be selectively decompressed on file reads (e.g., for predicate pushdown optimizations).
kg.save_parquet("tmp.parquet")
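Since Parquet is columnar, the saved file can be read back directly as a DataFrame. This is a sketch, not part of the original notebook, and the exact column layout is whatever kglab writes:

import pandas as pd

# inspect the triples as rows of a DataFrame
df_check = pd.read_parquet("tmp.parquet")
print(df_check.head())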
Let's compare the relative file sizes for these formats:
import pandas as pd
import os
file_paths = ["tmp.jsonld", "tmp.ttl", "tmp.xml", "tmp.parquet"]
file_sizes = [os.path.getsize(file_path) for file_path in file_paths]
df = pd.DataFrame({"file_path": file_paths, "file_size": file_sizes})
df
|   | file_path | file_size |
|---|---|---|
| 0 | tmp.jsonld | 860 |
| 1 | tmp.ttl | 333 |
| 2 | tmp.xml | 671 |
| 3 | tmp.parquet | 3724 |
Parquet uses compression based on a "dictionary" approach, which adds overhead for small files such as this KG. We'll revisit this comparison of file formats later with a larger KG.
Exercises¶
Exercise 1:
Using the kglab library, extend the graph by adding another recipe, such as German Egg Pancakes (https://www.food.com/recipe/406738), then serialize out to the three file formats again.
How do the relative file sizes compare as the size of the graph grows?