Skip to content

Note

To run this notebook in JupyterLab, load examples/ex2_0.ipynb

Build a medium size KG from a CSV dataset

First let's initialize the KG object as we did previously:

import kglab

namespaces = {
    "wtm":  "http://purl.org/heals/food/",
    "ind":  "http://purl.org/heals/ingredient/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    }

kg = kglab.KnowledgeGraph(
    name = "A recipe KG example based on Food.com",
    base_uri = "https://www.food.com/recipe/",
    namespaces = namespaces,
    )

Here's a way to describe the namespaces that are available to use:

kg.describe_ns()
prefix namespace
0 dct http://purl.org/dc/terms/
1 owl http://www.w3.org/2002/07/owl#
2 prov http://www.w3.org/ns/prov#
3 rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#
4 rdfs http://www.w3.org/2000/01/rdf-schema#
5 schema http://schema.org/
6 sh http://www.w3.org/ns/shacl#
7 xsd http://www.w3.org/2001/XMLSchema#
8 wtm http://purl.org/heals/food/
9 ind http://purl.org/heals/ingredient/
10 skos http://www.w3.org/2004/02/skos/core#

Next, we'll define a dictionary that maps (somewhat magically) from strings (i.e., "labels") to ingredients defined in the http://purl.org/heals/ingredient/ vocabulary:

common_ingredient = {
    "water": kg.get_ns("ind").Water,
    "salt": kg.get_ns("ind").Salt,
    "pepper": kg.get_ns("ind").BlackPepper,
    "black pepper": kg.get_ns("ind").BlackPepper,
    "dried basil": kg.get_ns("ind").Basil,

    "butter": kg.get_ns("ind").Butter,
    "milk": kg.get_ns("ind").CowMilk,
    "egg": kg.get_ns("ind").ChickenEgg,
    "eggs": kg.get_ns("ind").ChickenEgg,
    "bacon": kg.get_ns("ind").Bacon,

    "sugar": kg.get_ns("ind").WhiteSugar,
    "brown sugar": kg.get_ns("ind").BrownSugar,
    "honey": kg.get_ns("ind").Honey,
    "vanilla": kg.get_ns("ind").VanillaExtract,
    "vanilla extract": kg.get_ns("ind").VanillaExtract,

    "flour": kg.get_ns("ind").AllPurposeFlour,
    "all-purpose flour": kg.get_ns("ind").AllPurposeFlour,
    "whole wheat flour": kg.get_ns("ind").WholeWheatFlour,

    "olive oil": kg.get_ns("ind").OliveOil,
    "vinegar": kg.get_ns("ind").AppleCiderVinegar,

    "garlic": kg.get_ns("ind").Garlic,
    "garlic clove": kg.get_ns("ind").Garlic,
    "garlic cloves": kg.get_ns("ind").Garlic,

    "onion": kg.get_ns("ind").Onion,
    "onions": kg.get_ns("ind").Onion,
    "cabbage": kg.get_ns("ind").Cabbage,
    "carrot": kg.get_ns("ind").Carrot,
    "carrots": kg.get_ns("ind").Carrot,
    "celery": kg.get_ns("ind").Celery,
    "potato": kg.get_ns("ind").Potato,
    "potatoes": kg.get_ns("ind").Potato,
    "tomato": kg.get_ns("ind").Tomato,
    "tomatoes": kg.get_ns("ind").Tomato,

    "baking powder": kg.get_ns("ind").BakingPowder,
    "baking soda": kg.get_ns("ind").BakingSoda,
}

This is where use of NLP work to produce annotations begins to overlap with KG pratices.

Now let's load our dataset of recipes – the dat/recipes.csv file in CSV format – into a pandas dataframe:

import pandas as pd

df = pd.read_csv("../dat/recipes.csv")
df.head()
id name minutes tags description ingredients
0 164636 1 1 1 tempura batter 5 ['15-minutes-or-less', 'time-to-make', 'course... i use this everytime i make onion rings, hot p... ['egg', 'flour', 'water']
1 144841 2 step pound cake for a kitchen aide mixer 110 ['time-to-make', 'course', 'preparation', 'occ... this recipe was published in a southern living... ['flour', 'sugar', 'butter', 'milk', 'eggs', '...
2 189437 40 second omelet 25 ['30-minutes-or-less', 'time-to-make', 'course... you'll need an "inverted pancake turner" for t... ['eggs', 'water', 'butter']
3 19104 all purpose dinner crepes batter 90 ['weeknight', 'time-to-make', 'course', 'main-... this basic crepe recipe can be used for all yo... ['eggs', 'salt', 'flour', 'milk', 'butter']
4 64793 amish friendship starter 14405 ['weeknight', 'time-to-make', 'course', 'cuisi... this recipe was given to me years ago by a fri... ['sugar', 'flour', 'milk']

Then iterate over the rows in the dataframe, representing a recipe in the KG for each row:

import rdflib

for index, row in df.iterrows():
    recipe_id = row["id"]
    node = rdflib.URIRef("https://www.food.com/recipe/{}".format(recipe_id))
    kg.add(node, kg.get_ns("rdf").type, kg.get_ns("wtm").Recipe)

    recipe_name = row["name"]
    kg.add(node, kg.get_ns("skos").definition, rdflib.Literal(recipe_name))

    cook_time = row["minutes"]
    cook_time_literal = "PT{}M".format(int(cook_time))
    code_time_node = rdflib.Literal(cook_time_literal, datatype=kg.get_ns("xsd").duration)
    kg.add(node, kg.get_ns("wtm").hasCookTime, code_time_node)

    ind_list = eval(row["ingredients"])

    for ind in ind_list:
        ingredient = ind.strip()
        ingredient_obj = common_ingredient[ingredient]
        kg.add(node, kg.get_ns("wtm").hasIngredient, ingredient_obj)

Notice how the xsd:duration literal is now getting used to represent cooking times.

We've structured this example such that each of the recipes in the CSV file has a known representation for all of its ingredients. There are nearly 250K recipes in the full dataset from https://food.com/ so the common_ingredient dictionary would need to be extended quite a lot to handle all of those possible ingredients.

At this stage, our graph has grown by a couple orders of magnitude, so its visualization should be more interesting now. Let's take a look:

VIS_STYLE = {
    "wtm": {
        "color": "orange",
        "size": 20,
    },
    "ind":{
        "color": "blue",
        "size": 35,
    },
}

subgraph = kglab.SubgraphTensor(kg)
pyvis_graph = subgraph.build_pyvis_graph(notebook=True, style=VIS_STYLE)

pyvis_graph.force_atlas_2based()
pyvis_graph.show("tmp.fig01.html")

png

Given the defaults for this kind of visualization, there's likely a dense center mass of orange (recipes) at the center, with a close cluster of common ingredients (dark blue), surrounded by less common ingredients and cooking times (light blue).

Performance analysis of serialization methods

Let's serialize this recipe KG constructed from the CSV dataset to a local TTL file, while measuring the time and disk space required:

import time

write_times = []

t0 = time.time()
kg.save_rdf("tmp.ttl")
write_times.append(round((time.time() - t0) * 1000.0, 2))

Let's also serialize the KG into the other formats that we've been using, to compare relative sizes for a medium size KG:

t0 = time.time()
kg.save_rdf("tmp.xml", format="xml")
write_times.append(round((time.time() - t0) * 1000.0, 2))

t0 = time.time()
kg.save_jsonld("tmp.jsonld")
write_times.append(round((time.time() - t0) * 1000.0, 2))

t0 = time.time()
kg.save_parquet("tmp.parquet")
write_times.append(round((time.time() - t0) * 1000.0, 2))
import pandas as pd
import os

file_paths = ["tmp.ttl", "tmp.xml", "tmp.jsonld", "tmp.parquet"]
file_sizes = [os.path.getsize(file_path) for file_path in file_paths]

df = pd.DataFrame({"file_path": file_paths, "file_size": file_sizes, "write_time": write_times})
df["ms_per_byte"] = df["write_time"] / df["file_size"]
df
file_path file_size write_time ms_per_byte
0 tmp.ttl 56780 239.19 0.004213
1 tmp.xml 159397 78.95 0.000495
2 tmp.jsonld 131850 112.60 0.000854
3 tmp.parquet 14545 106.94 0.007352

Notice the relative sizes and times? Parquet provides for compression in a way that works well with RDF. The same KG stored as a Parquet file is ~10% the size of the same KG stored as JSON-LD. Also the XML version is quite large.

Looking at the write times, Parquet is relatively fast (after its first invocation) and its reads are faster. The eponymous Turtle format is human-readable although relatively slow. XML is fast to write, but much larger on disk and difficult to read. JSON-LD is interesting in that any JSON library can read and use these files, without needing semantic technologies, per se; however, it's also large on disk.


Exercises

Exercise 1:

Select another ingredient in the http://purl.org/heals/ingredient/ vocabulary that is not in the common_ingredient dictionary, for which you can find at least one simple recipe within https://food.com/ searches. Then add this recipe into the KG.


Last update: 2021-04-10