Note
To run this notebook in JupyterLab, load examples/ex1_1.ipynb
Leveraging the kglab abstraction layer¶
Now let's try the previous examples – i.e., building a simple recipe KG – again, but this time using the kglab library to make things a wee bit easier...
import kglab
namespaces = {
    "wtm": "http://purl.org/heals/food/",
    "ind": "http://purl.org/heals/ingredient/",
}

kg = kglab.KnowledgeGraph(
    name = "A recipe KG example based on Food.com",
    namespaces = namespaces,
)
Once we have the kg object instantiated, we can use its short-cuts for the food-related vocabularies.
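As a quick check (an addition here, not part of the original notebook), get_ns() returns an rdflib Namespace object, so attribute access on it expands to a full URI:

wtm = kg.get_ns("wtm")
print(wtm.Recipe)  # expected: http://purl.org/heals/food/Recipe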
Then construct the graph from a sequence of RDF statements:
import rdflib
node = rdflib.URIRef("https://www.food.com/recipe/327593")
kg.add(node, kg.get_ns("rdf").type, kg.get_ns("wtm").Recipe)
kg.add(node, kg.get_ns("wtm").hasCookTime, rdflib.Literal("PT8M", datatype=kg.get_ns("xsd").duration))
kg.add(node, kg.get_ns("wtm").hasIngredient, kg.get_ns("ind").ChickenEgg)
kg.add(node, kg.get_ns("wtm").hasIngredient, kg.get_ns("ind").CowMilk)
kg.add(node, kg.get_ns("wtm").hasIngredient, kg.get_ns("ind").WholeWheatFlour)
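As a quick sanity check (again an addition, not from the original notebook), rdf_graph() exposes the underlying rdflib.Graph, so its length gives the number of triples added so far:

# five triples were added above: one type, one cook time, three ingredients
print(len(kg.rdf_graph()))  # expected: 5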
Iterating through these statements shows their full URLs:
for s, p, o in kg.rdf_graph():
    print(s, p, o)
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasIngredient http://purl.org/heals/ingredient/CowMilk
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasCookTime PT8M
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasIngredient http://purl.org/heals/ingredient/WholeWheatFlour
https://www.food.com/recipe/327593 http://purl.org/heals/food/hasIngredient http://purl.org/heals/ingredient/ChickenEgg
https://www.food.com/recipe/327593 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://purl.org/heals/food/Recipe
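The same statements can also be retrieved with SPARQL. Here is a minimal sketch (assuming kglab's query() method, which hands the query string to rdflib underneath) rather than part of the original walkthrough:

# list the ingredients of any recipe in the graph
sparql = """
PREFIX wtm: <http://purl.org/heals/food/>
SELECT ?ingredient
WHERE {
    ?recipe wtm:hasIngredient ?ingredient .
}
"""

for row in kg.query(sparql):
    print(row.ingredient)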
As an alternative way to examine these statements, we can serialize the graph to a string – in this case also using Turtle format:
s = kg.save_rdf_text(format="ttl")
print(s)
@prefix ind: <http://purl.org/heals/ingredient/> .
@prefix wtm: <http://purl.org/heals/food/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://www.food.com/recipe/327593> a wtm:Recipe ;
    wtm:hasCookTime "PT8M"^^xsd:duration ;
    wtm:hasIngredient ind:ChickenEgg,
        ind:CowMilk,
        ind:WholeWheatFlour .
Overall, with the kglab library, serializing the KG to multiple formats (Turtle, XML, JSON-LD, etc.) becomes much simpler:
kg.save_rdf("tmp.ttl")
kg.save_rdf("tmp.xml", format="xml")
kg.save_jsonld("tmp.jsonld")
Try opening these files to confirm their serialized contents.
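Alternatively, here is a programmatic round-trip check, a sketch that assumes kglab's load_rdf() method: load one of the files back into a fresh KnowledgeGraph and compare triple counts.

# load the Turtle file into a new graph and compare sizes
kg_check = kglab.KnowledgeGraph(namespaces=namespaces)
kg_check.load_rdf("tmp.ttl", format="ttl")
print(len(kg_check.rdf_graph()) == len(kg.rdf_graph()))  # expected: True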
Next, we'll use the Parquet format for columnar storage. This technology has been especially effective in Big Data frameworks that need to handle data management and analytics efficiently. A Parquet dataset is simple to partition into multiple files (e.g., for distributed processing per partition), and its columns can be selectively decompressed on file reads (e.g., for predicate pushdown optimizations).
kg.save_parquet("tmp.parquet")
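Since Parquet is columnar, the saved file can be read back directly as a DataFrame. This is a sketch, not part of the original notebook, and the exact column layout is whatever kglab writes:

import pandas as pd

# inspect the triples as rows of a DataFrame
df_check = pd.read_parquet("tmp.parquet")
print(df_check.head())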
Let's compare the relative file sizes for these formats:
import pandas as pd
import os
file_paths = ["tmp.jsonld", "tmp.ttl", "tmp.xml", "tmp.parquet"]
file_sizes = [os.path.getsize(file_path) for file_path in file_paths]
df = pd.DataFrame({"file_path": file_paths, "file_size": file_sizes})
df
|   | file_path | file_size |
|---|---|---|
| 0 | tmp.jsonld | 860 |
| 1 | tmp.ttl | 333 |
| 2 | tmp.xml | 671 |
| 3 | tmp.parquet | 3724 |
Parquet uses compression based on a "dictionary" approach, which adds overhead for small files such as this KG. We'll revisit this comparison of file formats later with a larger KG.
Exercises¶
Exercise 1:
Using the kglab library, extend the graph by adding another recipe, such as German Egg Pancakes (https://www.food.com/recipe/406738), then serialize out to the three file formats again.
How do the relative file sizes compare as the size of the graph grows?