To run this notebook in JupyterLab, load examples/ex0_0.ipynb

Data Sources

Throughout this tutorial we'll work with data in the dat subdirectory:

!ls -goh ../dat
total 109896
-rw-r--r--   1     36K Jan 26 16:38 acq.ttl
-rw-r--r--   1     40M Jan 15 09:57 all_ind.csv
-rw-r--r--   1     24K Jan 15 09:57 data_prep.ipynb
drwxr-xr-x  10    320B Nov  6 03:10 food_com
-rw-r--r--   1    1.0K Feb 10 14:54 gorm.ttl
-rw-r--r--   1    2.7K Jan 15 09:57 nom.ttl
drwxr-xr-x  13    416B Nov 25 10:02 psl
-rwxr-xr-x   1    128K Jan 15 09:57 recipes.csv
-rw-r--r--   1     65K Jan 15 09:57 recipes.ttl
drwxr-xr-x   5    160B Jan 15 09:57 titanic
-rw-r--r--   1     14M Jan 15 09:57 weatherAUS.csv

In particular, we'll work with a series of progressive examples based on the dat/recipes.csv CSV file. This data comes from a Kaggle dataset of recipe metadata:

" Recipes and Interactions"
Shuyang Li
Kaggle (2019)

One of the simpler recipes in that dataset is "anytime crepes":

  • id: 327593
  • minutes: 8
  • ingredients: "['egg', 'milk', 'whole wheat flour']"

The tutorial begins by showing how to represent the metadata for this recipe in a knowledge graph, then gradually builds up more and more information about this collection of recipes.

To start, let's load and examine the CSV data:

import pandas as pd

df = pd.read_csv("../dat/recipes.csv")
df.head()
|   | id     | name                                       | minutes | tags                                              | description                                       | ingredients                                       |
| 0 | 164636 | 1 1 1 tempura batter                       | 5       | ['15-minutes-or-less', 'time-to-make', 'course... | i use this everytime i make onion rings, hot p... | ['egg', 'flour', 'water']                         |
| 1 | 144841 | 2 step pound cake for a kitchen aide mixer | 110     | ['time-to-make', 'course', 'preparation', 'occ... | this recipe was published in a southern living... | ['flour', 'sugar', 'butter', 'milk', 'eggs', '... |
| 2 | 189437 | 40 second omelet                           | 25      | ['30-minutes-or-less', 'time-to-make', 'course... | you'll need an "inverted pancake turner" for t... | ['eggs', 'water', 'butter']                       |
| 3 | 19104  | all purpose dinner crepes batter           | 90      | ['weeknight', 'time-to-make', 'course', 'main-... | this basic crepe recipe can be used for all yo... | ['eggs', 'salt', 'flour', 'milk', 'butter']       |
| 4 | 64793  | amish friendship starter                   | 14405   | ['weeknight', 'time-to-make', 'course', 'cuisi... | this recipe was given to me years ago by a fri... | ['sugar', 'flour', 'milk']                        |

Now let's drill down to the metadata for the "anytime crepes" recipe:

recipe_row = df[df["name"] == "anytime crepes"].iloc[0]
recipe_row
id                                                        327593
name                                              anytime crepes
minutes                                                        8
tags           ['15-minutes-or-less', 'time-to-make', 'course...
description    from my friend linda, this is an oh-so-easy-an...
ingredients                 ['egg', 'milk', 'whole wheat flour']
Name: 8, dtype: object

Given that we have a rich source of linked data to use, next we need to focus on knowledge representation. We'll use the FoodOn ontology (see below) to represent recipes, making use of two of its controlled vocabularies.

The first one defines an entity called Recipe, which we'll use to represent our recipe data from the dataset.

It's a common practice to abbreviate the first part of the URL for a controlled vocabulary with a prefix. In this case we'll use the prefix conventions used in previous publications related to this ontology:

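The prefixes are wtm: for the first vocabulary, which defines Recipe, and ind: for the second, which defines the ingredient terms used below.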
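
To make the prefix convention concrete, here is a minimal sketch of how a prefixed name expands into a full URL. The namespace URLs below are assumptions for illustration, since the text above does not state them, and `expand` is a hypothetical helper rather than part of kglab.

```python
# Assumed namespace URLs, for illustration only; substitute the ones
# that your ontology actually publishes.
NAMESPACES = {
    "wtm": "http://purl.org/heals/food/",
    "ind": "http://purl.org/heals/ingredient/",
}


def expand(curie, namespaces=NAMESPACES):
    """Expand a prefixed name such as 'wtm:Recipe' into a full URL."""
    prefix, sep, local_name = curie.partition(":")
    if not sep or prefix not in namespaces:
        raise ValueError(f"unknown or missing prefix in {curie!r}")
    return namespaces[prefix] + local_name


print(expand("wtm:Recipe"))
print(expand("ind:ChickenEgg"))
```

Keeping the prefix table in one dictionary means a change of namespace only has to be made in a single place.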

Now let's represent the data using this ontology, starting with the three ingredients for the anytime crepes recipe:

import ast
ingredients = ast.literal_eval(recipe_row["ingredients"])
['egg', 'milk', 'whole wheat flour']

These ingredients become represented, respectively, as:

  • ind:ChickenEgg
  • ind:CowMilk
  • ind:WholeWheatFlour
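
The mapping from ingredient strings to ontology terms can be sketched as a simple lookup. This is only an illustration: `INGREDIENT_TERMS` and `recipe_triples` are hypothetical names, the subject `ind:AnytimeCrepes` is an assumption, and `wtm:hasIngredient` is borrowed from the full FoodOn recipe example later on this page. Only the three ingredient terms come from the list above.

```python
# Hypothetical lookup from dataset ingredient strings to ontology terms.
INGREDIENT_TERMS = {
    "egg": "ind:ChickenEgg",
    "milk": "ind:CowMilk",
    "whole wheat flour": "ind:WholeWheatFlour",
}


def recipe_triples(subject, ingredients):
    """Return (subject, predicate, object) triples for one recipe."""
    triples = [(subject, "rdf:type", "wtm:Recipe")]
    for name in ingredients:
        # A KeyError here flags an ingredient that still needs a mapping.
        triples.append((subject, "wtm:hasIngredient", INGREDIENT_TERMS[name]))
    return triples


# "ind:AnytimeCrepes" is an assumed identifier for this recipe.
for triple in recipe_triples("ind:AnytimeCrepes", ["egg", "milk", "whole wheat flour"]):
    print(triple)
```

Plain tuples are used here to keep the sketch free of dependencies; a graph library would hold the same subject, predicate, object structure.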

Ontology Sources

We'll use several different sources for data and ontology throughout the kglab tutorial, although most of it focuses on progressive examples that use FoodOn.

FoodOn – subtitled "a farm to fork ontology" – takes a comprehensive view of the data and metadata involved in our food supply, beginning with seed genomics, micronutrients, the biology of food allergies, etc. This work is predicated on leveraging large knowledge graphs to represent the different areas of science, technology, business, public policy, etc.:

The need to represent knowledge about food is central to many human activities including agriculture, medicine, food safety inspection, shopping patterns, and sustainable development. FoodOn is an ontology – a controlled vocabulary which can be used by both people and computers – to name all parts of animals, plants, and fungi which can bear a food role for humans and domesticated animals, as well as derived food products and the processes used to make them.

For more details, see:

For primary sources, see: [vardeman2014ceur], [sam2014odp], [dooley2018npj], [hitzler2018]

We'll work through several examples of representation, although here's an example of what a full recipe in FoodOn would look like:

owl:NamedIndividual a wtm:Recipe ;
  rdf:about ind:BananaBlueberryAlmondFlourMuffin ;
  wtm:hasIngredient ind:AlmondMeal ;
  wtm:hasIngredient ind:AppleCiderVinegar ;
  wtm:hasIngredient ind:BakingSoda ;
  wtm:hasIngredient ind:Banana ;
  wtm:hasIngredient ind:Blueberry ;
  wtm:hasIngredient ind:ChickenEgg ;
  wtm:hasIngredient ind:Honey ;
  wtm:isRecommendedForCourse wtm:Dessert ;
  wtm:isRecommendedForMeal wtm:Breakfast ;
  wtm:isRecommendedForMeal wtm:Snack ;
  wtm:hasCookTime "PT60M"^^xsd:duration ;
  wtm:hasCookingTemperature "350"^^xsd:integer ;
  wtm:serves "4"^^xsd:integer ;
  rdfs:label "banana blueberry almond flour muffin" ;
  skos:definition "a banana blueberry muffin made with almond flour" ;
  skos:scopeNote "recipe" ;
  prov:wasDerivedFrom .
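
The example above stores cook time as an xsd:duration literal, while the CSV data records minutes as an integer. A minimal sketch of that conversion follows, assuming a minutes-only duration form is acceptable; the helper name `minutes_to_duration` is hypothetical.

```python
def minutes_to_duration(minutes):
    """Format a non-negative whole number of minutes as an xsd:duration string."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return f"PT{minutes}M"


print(minutes_to_duration(8))   # the "anytime crepes" recipe
print(minutes_to_duration(60))  # matches the "PT60M" literal in the example
```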

Graph Size Comparisons

One frequently asked question is about the size of the graphs that we're using in the kglab tutorial. The short answer: "No, these aren't trivial graphs."

We'll start out with small examples, to show the basics for how to construct an RDF graph.

Most of the examples here will use a knowledge graph with ~300 nodes and ~2000 edges. This is a non-trivial size, especially when you start working with some graph algorithms. Again, this tutorial has learning as its main intent, and this size of graph is ideal for running queries, validation, graph algorithms, visualization, etc., with the kinds of compute and memory resources available on contemporary laptops.
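
For a sense of scale, here is some back-of-the-envelope arithmetic on the approximate figures above. It treats the graph as a simple directed graph with no self-loops, which is an assumption; the exact counts in the tutorial graph will differ.

```python
nodes = 300   # approximate node count from the text
edges = 2000  # approximate edge count from the text

# Average number of outgoing edges per node in a directed graph.
avg_out_degree = edges / nodes

# Density: the fraction of all possible directed edges (no self-loops) present.
density = edges / (nodes * (nodes - 1))

print(f"average out-degree: {avg_out_degree:.2f}")
print(f"density: {density:.4f}")
```

A graph this sparse is still easy to visualize and query interactively, which fits the learning focus described above.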

In other words, we prioritize datasets that are large enough for examples to illustrate common use cases, though small enough for learners to understand.

As rough points of comparison for graph size:

  • 10^6 or more nodes are needed for deep learning
  • graphs of up to roughly 10^8 nodes can run on contemporary laptops
  • larger graphs require hardware accelerators (e.g., GPUs) or cloud-based clusters

The full recipes.tsv dataset includes nearly 250,000 recipes. In some of the later examples, we'll work with that entire dataset – which is definitely non-trivial.

Last update: 2021-04-10