Data Sources

Throughout this tutorial we'll work with data in the dat subdirectory:

!ls -goh ../dat

total 110000
-rw-r--r--   1     36K Jan 26  2021 acq.ttl
-rw-r--r--@  1     40M Jan 15  2021 all_ind.csv
-rw-r--r--   1     24K Jan 15  2021 data_prep.ipynb
drwxr-xr-x  10    320B Nov  6  2020 food_com
-rw-r--r--   1    3.0K Apr 24  2021 gorm.jsonld
-rw-r--r--   1    3.8K Apr 25  2021 gorm.parquet
-rw-r--r--   1    1.0K Feb 13  2021 gorm.ttl
-rw-r--r--   1    5.0K Apr 24  2021 nom.jsonld
-rw-r--r--   1    4.9K Apr 25  2021 nom.parquet
-rw-r--r--   1    2.7K Jan 15  2021 nom.ttl
drwxr-xr-x  13    416B Nov 25  2020 psl
-rwxr-xr-x   1    128K Jan 15  2021 recipes.csv
-rw-r--r--   1     65K Jan 15  2021 recipes.ttl
-rw-r--r--   1     26K May  9  2021 roam.json
drwxr-xr-x   5    160B Jan 15  2021 titanic
-rw-r--r--   1     14M Jan 15  2021 weatherAUS.csv

In particular, we'll work with a series of progressive examples based on the dat/recipes.csv file. This data comes from a Kaggle dataset of recipe metadata from Food.com:

"Food.com Recipes and Interactions"
Shuyang Li
Kaggle (2019)
https://doi.org/10.34740/kaggle/dsv/783630

One of the simpler recipes in that dataset is "anytime crepes" at https://www.food.com/recipe/327593

id: 327593
minutes: 8
ingredients: "['egg', 'milk', 'whole wheat flour']"

The tutorial begins by showing how to represent the metadata for this recipe in a knowledge graph, then gradually builds up more and more information about this collection of recipes.

To start, let's load and examine the CSV data:

from os.path import dirname
import os
import pandas as pd

df = pd.read_csv(dirname(os.getcwd()) + "/dat/recipes.csv")
df.head()

	id	name	minutes	tags	description	ingredients
0	164636	1 1 1 tempura batter	5	['15-minutes-or-less', 'time-to-make', 'course...	i use this everytime i make onion rings, hot p...	['egg', 'flour', 'water']
1	144841	2 step pound cake for a kitchen aide mixer	110	['time-to-make', 'course', 'preparation', 'occ...	this recipe was published in a southern living...	['flour', 'sugar', 'butter', 'milk', 'eggs', '...
2	189437	40 second omelet	25	['30-minutes-or-less', 'time-to-make', 'course...	you'll need an "inverted pancake turner" for t...	['eggs', 'water', 'butter']
3	19104	all purpose dinner crepes batter	90	['weeknight', 'time-to-make', 'course', 'main-...	this basic crepe recipe can be used for all yo...	['eggs', 'salt', 'flour', 'milk', 'butter']
4	64793	amish friendship starter	14405	['weeknight', 'time-to-make', 'course', 'cuisi...	this recipe was given to me years ago by a fri...	['sugar', 'flour', 'milk']

Now let's drill down to the metadata for the "anytime crepes" recipe:

recipe_row = df[df["name"] == "anytime crepes"].iloc[0]
recipe_row

id                                                        327593
name                                              anytime crepes
minutes                                                        8
tags           ['15-minutes-or-less', 'time-to-make', 'course...
description    from my friend linda, this is an oh-so-easy-an...
ingredients                 ['egg', 'milk', 'whole wheat flour']
Name: 8, dtype: object

Given that we have a rich source of linked data to use, next we need to focus on knowledge representation. We'll use the FoodOn ontology (see below) to represent recipes, making use of two of its controlled vocabularies:

http://purl.org/heals/food/
http://purl.org/heals/ingredient/

The first one defines an entity called Recipe, which has the full URL http://purl.org/heals/food/Recipe, and we'll use that to represent our recipe data from the Food.com dataset.

It's a common practice to abbreviate the first part of the URL for a controlled vocabulary with a prefix. In this case we'll use the prefix conventions used in previous publications related to this ontology:

URL	prefix
http://purl.org/heals/food/	`wtm:`
http://purl.org/heals/ingredient/	`ind:`
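To make the prefix convention concrete, here is a minimal sketch in plain Python that expands a prefixed name into its full URL and abbreviates it again. The helper names `expand` and `abbreviate` are hypothetical and exist only for this illustration; RDF libraries provide their own namespace handling.

```python
# Prefix table from the text above. The helper functions are illustrative
# only and are not part of any library.
PREFIXES = {
    "wtm": "http://purl.org/heals/food/",
    "ind": "http://purl.org/heals/ingredient/",
}

def expand(name: str) -> str:
    """Expand a prefixed name such as 'wtm:Recipe' into a full URL."""
    prefix, _, local = name.partition(":")
    return PREFIXES[prefix] + local

def abbreviate(url: str) -> str:
    """Abbreviate a full URL using the first matching prefix, if any."""
    for prefix, base in PREFIXES.items():
        if url.startswith(base):
            return f"{prefix}:{url[len(base):]}"
    return url  # no known prefix: leave the URL unchanged

print(expand("wtm:Recipe"))
print(abbreviate("http://purl.org/heals/ingredient/ChickenEgg"))
```

The first call prints the full URL for Recipe given earlier, and the second prints the abbreviated form `ind:ChickenEgg`.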

Now let's represent the data using this ontology, starting with the three ingredients for the anytime crepes recipe:

import ast
# parse the stringified list safely, without executing arbitrary code
ingredients = ast.literal_eval(recipe_row["ingredients"])
ingredients

['egg', 'milk', 'whole wheat flour']

These ingredients become represented, respectively, as:

ind:ChickenEgg
ind:CowMilk
ind:WholeWheatFlour
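One way to sketch this mapping in code is a small lookup table. The dictionary below is hand-written for this single recipe and is an assumption made for illustration; a larger dataset would need a much bigger table or some form of entity linking.

```python
import ast

# Hand-written lookup covering only the three ingredients of this recipe
# (an illustrative assumption, not a complete vocabulary).
INGREDIENT_TERMS = {
    "egg": "ind:ChickenEgg",
    "milk": "ind:CowMilk",
    "whole wheat flour": "ind:WholeWheatFlour",
}

raw = "['egg', 'milk', 'whole wheat flour']"  # as stored in the CSV column
names = ast.literal_eval(raw)                 # parse the list literal safely
terms = [INGREDIENT_TERMS[name] for name in names]
print(terms)
```

A name missing from the table raises a `KeyError`, which is useful here because it flags ingredients that still need a vocabulary term.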

Ontology Sources

We'll use several different sources for data and ontology throughout the kglab tutorial, although most of it focuses on progressive examples that use FoodOn.

FoodOn – subtitled "a farm to fork ontology" – takes a comprehensive view of the data and metadata involved in our food supply, beginning with seed genomics, micronutrients, the biology of food allergies, etc. This work is predicated on leveraging large knowledge graphs to represent the different areas of science, technology, business, public policy, etc.:

The need to represent knowledge about food is central to many human activities including agriculture, medicine, food safety inspection, shopping patterns, and sustainable development. FoodOn is an ontology – a controlled vocabulary which can be used by both people and computers – to name all parts of animals, plants, and fungi which can bear a food role for humans and domesticated animals, as well as derived food products and the processes used to make them.

For more details, see:

For primary sources, see: [vardeman2014ceur], [sam2014odp], [dooley2018npj], [hitzler2018]

We'll work through several examples of representation, although here's an example of what a full recipe in FoodOn would look like:

owl:NamedIndividual a wtm:Recipe ;
    rdf:about ind:BananaBlueberryAlmondFlourMuffin ;
    wtm:hasIngredient ind:AlmondMeal ;
    wtm:hasIngredient ind:AppleCiderVinegar ;
    wtm:hasIngredient ind:BakingSoda ;
    wtm:hasIngredient ind:Banana ;
    wtm:hasIngredient ind:Blueberry ;
    wtm:hasIngredient ind:ChickenEgg ;
    wtm:hasIngredient ind:Honey ;
    wtm:isRecommendedForCourse wtm:Dessert ;
    wtm:isRecommendedForMeal wtm:Breakfast ;
    wtm:isRecommendedForMeal wtm:Snack ;
    wtm:hasCookTime "PT60M"^^xsd:duration ;
    wtm:hasCookingTemperature "350"^^xsd:integer ;
    wtm:serves "4"^^xsd:integer ;
    rdfs:label "banana blueberry almond flour muffin" ;
    skos:definition "a banana blueberry muffin made with almond flour" ;
    skos:scopeNote "recipe" ;
    prov:wasDerivedFrom https://www.allrecipes.com/recipe/238012/banana-blueberry-almond-flour-muffins-gluten-free/?internalSource=hub%20recipe&referringContentType=Search .
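As a rough sketch of how the much smaller "anytime crepes" recipe could be written in a similar style, the following plain-Python snippet formats its metadata as a Turtle-like statement. The helper `recipe_to_turtle` is hypothetical, and treating the CSV `minutes` column as `wtm:hasCookTime` is an assumption made for illustration; the tutorial's own examples may model this differently.

```python
# Values taken from the "anytime crepes" row of the recipes CSV file.
recipe_url = "https://www.food.com/recipe/327593"
label = "anytime crepes"
minutes = 8
ingredient_terms = ["ind:ChickenEgg", "ind:CowMilk", "ind:WholeWheatFlour"]

def recipe_to_turtle(url, label, minutes, ingredients):
    """Format one recipe as a Turtle-style statement (hypothetical helper)."""
    lines = [f"<{url}> a wtm:Recipe ;"]
    lines += [f"    wtm:hasIngredient {term} ;" for term in ingredients]
    # Assumption: the CSV `minutes` column maps onto wtm:hasCookTime,
    # written as an xsd:duration in minutes.
    lines.append(f'    wtm:hasCookTime "PT{minutes}M"^^xsd:duration ;')
    lines.append(f'    rdfs:label "{label}" .')
    return "\n".join(lines)

turtle = recipe_to_turtle(recipe_url, label, minutes, ingredient_terms)
print(turtle)
```

Building the text by hand keeps the sketch free of dependencies. It does not declare prefixes or escape special characters, so an RDF library is the better choice for real data.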

Graph Size Comparisons

One frequently asked question is about the size of the graphs that we're using in the kglab tutorial. The short answer: "No, these aren't trivial graphs."

We'll start out with small examples, to show the basics for how to construct an RDF graph.

Most of the examples here will use a knowledge graph with ~300 nodes and ~2000 edges. This is a non-trivial size, especially once you start working with graph algorithms. Because this tutorial has learning as its main intent, a graph of this size is ideal for running queries, validation, graph algorithms, visualization, etc., with the kinds of compute and memory resources available on contemporary laptops.

In other words, we prioritize datasets that are large enough for examples to illustrate common use cases, though small enough for learners to understand.

As a rough comparison of graph sizes:

10^6 or more nodes are needed for deep learning
10^8 can run on contemporary laptops
larger graphs require hardware accelerators (e.g., GPUs) or cloud-based clusters
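To show where node and edge counts like these come from, here is a minimal sketch that counts them for a handful of triples. It assumes one common convention: every distinct subject or object is a node, and every distinct triple is an edge. The `recipe:` prefix is hypothetical and stands in for the Food.com recipe URL.

```python
# Toy triples (subject, predicate, object); prefixed names stand in for
# full URLs, and the `recipe:` prefix is a made-up placeholder.
triples = [
    ("recipe:327593", "rdf:type", "wtm:Recipe"),
    ("recipe:327593", "wtm:hasIngredient", "ind:ChickenEgg"),
    ("recipe:327593", "wtm:hasIngredient", "ind:CowMilk"),
    ("recipe:327593", "wtm:hasIngredient", "ind:WholeWheatFlour"),
]

# Nodes: every distinct subject or object. Edges: every distinct triple.
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
edges = len(set(triples))
print(len(nodes), edges)  # 5 4
```

Other conventions exist, such as excluding literal values from the node count, so reported sizes can differ slightly between tools.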

The full recipes.tsv dataset includes nearly 250,000 recipes. In some of the later examples, we'll work with that entire dataset – which is definitely non-trivial.

Last update: 2022-03-23