Note
To run this notebook in JupyterLab, load examples/ex0_0.ipynb
Data Sources
Throughout this tutorial we'll work with data in the dat subdirectory:
!ls -goh ../dat
total 110000
-rw-r--r-- 1 36K Jan 26 2021 acq.ttl
-rw-r--r--@ 1 40M Jan 15 2021 all_ind.csv
-rw-r--r-- 1 24K Jan 15 2021 data_prep.ipynb
drwxr-xr-x 10 320B Nov 6 2020 food_com
-rw-r--r-- 1 3.0K Apr 24 2021 gorm.jsonld
-rw-r--r-- 1 3.8K Apr 25 2021 gorm.parquet
-rw-r--r-- 1 1.0K Feb 13 2021 gorm.ttl
-rw-r--r-- 1 5.0K Apr 24 2021 nom.jsonld
-rw-r--r-- 1 4.9K Apr 25 2021 nom.parquet
-rw-r--r-- 1 2.7K Jan 15 2021 nom.ttl
drwxr-xr-x 13 416B Nov 25 2020 psl
-rwxr-xr-x 1 128K Jan 15 2021 recipes.csv
-rw-r--r-- 1 65K Jan 15 2021 recipes.ttl
-rw-r--r-- 1 26K May 9 2021 roam.json
drwxr-xr-x 5 160B Jan 15 2021 titanic
-rw-r--r-- 1 14M Jan 15 2021 weatherAUS.csv
In particular, we'll work with a series of progressive examples based on the dat/recipes.csv CSV file.
This data comes from a Kaggle dataset of recipe metadata from Food.com:
"Food.com Recipes and Interactions"
Shuyang Li
Kaggle (2019)
https://doi.org/10.34740/kaggle/dsv/783630
One of the simpler recipes in that dataset is "anytime crepes" at https://www.food.com/recipe/327593
- id: 327593
- minutes: 8
- ingredients: "['egg', 'milk', 'whole wheat flour']"
The tutorial begins by showing how to represent the metadata for this recipe in a knowledge graph, then gradually builds up more and more information about this collection of recipes.
To start, let's load and examine the CSV data:
from pathlib import Path
import pandas as pd
# the dat directory sits next to the notebook's working directory
df = pd.read_csv(Path.cwd().parent / "dat" / "recipes.csv")
df.head()
 | id | name | minutes | tags | description | ingredients |
---|---|---|---|---|---|---|
0 | 164636 | 1 1 1 tempura batter | 5 | ['15-minutes-or-less', 'time-to-make', 'course... | i use this everytime i make onion rings, hot p... | ['egg', 'flour', 'water'] |
1 | 144841 | 2 step pound cake for a kitchen aide mixer | 110 | ['time-to-make', 'course', 'preparation', 'occ... | this recipe was published in a southern living... | ['flour', 'sugar', 'butter', 'milk', 'eggs', '... |
2 | 189437 | 40 second omelet | 25 | ['30-minutes-or-less', 'time-to-make', 'course... | you'll need an "inverted pancake turner" for t... | ['eggs', 'water', 'butter'] |
3 | 19104 | all purpose dinner crepes batter | 90 | ['weeknight', 'time-to-make', 'course', 'main-... | this basic crepe recipe can be used for all yo... | ['eggs', 'salt', 'flour', 'milk', 'butter'] |
4 | 64793 | amish friendship starter | 14405 | ['weeknight', 'time-to-make', 'course', 'cuisi... | this recipe was given to me years ago by a fri... | ['sugar', 'flour', 'milk'] |
Now let's drill down to the metadata for the "anytime crepes" recipe:
recipe_row = df[df["name"] == "anytime crepes"].iloc[0]
recipe_row
id 327593
name anytime crepes
minutes 8
tags ['15-minutes-or-less', 'time-to-make', 'course...
description from my friend linda, this is an oh-so-easy-an...
ingredients ['egg', 'milk', 'whole wheat flour']
Name: 8, dtype: object
Given that we have a rich source of linked data to use, next we need to focus on knowledge representation. We'll use the FoodOn ontology (see below) to represent recipes, making use of two of its controlled vocabularies:
The first one defines an entity called Recipe, which has the full URL http://purl.org/heals/food/Recipe, and we'll use it to represent our recipe data from the Food.com dataset.
It's a common practice to abbreviate the first part of the URL for a controlled vocabulary with a prefix. In this case we'll use the prefix conventions used in previous publications related to this ontology:
URL | prefix |
---|---|
http://purl.org/heals/food/ | wtm: |
http://purl.org/heals/ingredient/ | ind: |
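To make the prefix convention concrete, here is a minimal sketch in plain Python that expands a prefixed name into its full URL and abbreviates a full URL back again. The two prefixes come from the table above; the helper names expand_curie and compact_url are hypothetical, introduced only for illustration, and are not part of kglab.

```python
# Prefix table copied from the tutorial text above.
PREFIXES = {
    "wtm": "http://purl.org/heals/food/",
    "ind": "http://purl.org/heals/ingredient/",
}


def expand_curie(curie: str) -> str:
    """Expand a prefixed name such as 'wtm:Recipe' into its full URL."""
    prefix, _, local_name = curie.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return PREFIXES[prefix] + local_name


def compact_url(url: str) -> str:
    """Abbreviate a full URL with the first matching prefix, if any."""
    for prefix, base in PREFIXES.items():
        if url.startswith(base):
            return f"{prefix}:{url[len(base):]}"
    # No known prefix matches, so return the URL unchanged.
    return url


print(expand_curie("wtm:Recipe"))
print(compact_url("http://purl.org/heals/ingredient/ChickenEgg"))
```

In practice an RDF library would manage these namespace bindings; the sketch only shows what the abbreviation means.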
Now let's represent the data using this ontology, starting with the three ingredients for the anytime crepes recipe:
import ast
# literal_eval parses the list literal without executing arbitrary code
ingredients = ast.literal_eval(recipe_row["ingredients"])
ingredients
['egg', 'milk', 'whole wheat flour']
These ingredients become represented, respectively, as:
ind:ChickenEgg
ind:CowMilk
ind:WholeWheatFlour
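One way to picture this mapping is as a lookup from ingredient labels to vocabulary terms. The sketch below is an assumption-laden illustration: the INGREDIENT_TERMS table is hypothetical and contains only the three pairs named above, and the helper names are invented for this example. Labels without an entry are reported rather than guessed.

```python
import ast
from typing import List, Tuple

# Hypothetical lookup table: only these three pairs appear in the tutorial.
INGREDIENT_TERMS = {
    "egg": "ind:ChickenEgg",
    "milk": "ind:CowMilk",
    "whole wheat flour": "ind:WholeWheatFlour",
}


def parse_ingredients(cell: str) -> List[str]:
    """Parse the stringified list stored in the CSV 'ingredients' column."""
    return ast.literal_eval(cell)


def to_terms(labels: List[str]) -> Tuple[List[str], List[str]]:
    """Split labels into mapped vocabulary terms and unmapped labels."""
    mapped, unmapped = [], []
    for label in labels:
        term = INGREDIENT_TERMS.get(label.strip().lower())
        if term is None:
            unmapped.append(label)
        else:
            mapped.append(term)
    return mapped, unmapped


cell = "['egg', 'milk', 'whole wheat flour']"
mapped, unmapped = to_terms(parse_ingredients(cell))
print(mapped)
print(unmapped)
```

Keeping the unmapped labels separate makes it easy to see which ingredients still need a term before the data is added to a graph.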
Ontology Sources
We'll use several different sources for data and ontology throughout the kglab tutorial, although most of it focuses on progressive examples that use FoodOn.
FoodOn – subtitled "a farm to fork ontology" – takes a comprehensive view of the data and metadata involved in our food supply, beginning with seed genomics, micronutrients, the biology of food allergies, etc. This work is predicated on leveraging large knowledge graphs to represent the different areas of science, technology, business, public policy, etc.:
The need to represent knowledge about food is central to many human activities including agriculture, medicine, food safety inspection, shopping patterns, and sustainable development. FoodOn is an ontology – a controlled vocabulary which can be used by both people and computers – to name all parts of animals, plants, and fungi which can bear a food role for humans and domesticated animals, as well as derived food products and the processes used to make them.
For more details, see:
- https://foodon.org/design/foodon-relations/
- https://foodkg.github.io/docs/ontologyDocumentation/Ingredient/doc/index-en.html
- https://foodkg.github.io/foodkg.html
- https://github.com/foodkg/foodkg.github.io
For primary sources, see: [vardeman2014ceur], [sam2014odp], [dooley2018npj], [hitzler2018]
We'll work through several examples of representation. For reference, here's an example of what a full recipe in FoodOn would look like:
owl:NamedIndividual a wtm:Recipe ;
  rdf:about ind:BananaBlueberryAlmondFlourMuffin ;
  wtm:hasIngredient ind:AlmondMeal ;
  wtm:hasIngredient ind:AppleCiderVinegar ;
  wtm:hasIngredient ind:BakingSoda ;
  wtm:hasIngredient ind:Banana ;
  wtm:hasIngredient ind:Blueberry ;
  wtm:hasIngredient ind:ChickenEgg ;
  wtm:hasIngredient ind:Honey ;
  wtm:isRecommendedForCourse wtm:Dessert ;
  wtm:isRecommendedForMeal wtm:Breakfast ;
  wtm:isRecommendedForMeal wtm:Snack ;
  wtm:hasCookTime "PT60M"^^xsd:duration ;
  wtm:hasCookingTemperature "350"^^xsd:integer ;
  wtm:serves "4"^^xsd:integer ;
  rdfs:label "banana blueberry almond flour muffin" ;
  skos:definition "a banana blueberry muffin made with almond flour" ;
  skos:scopeNote "recipe" ;
  prov:wasDerivedFrom https://www.allrecipes.com/recipe/238012/banana-blueberry-almond-flour-muffins-gluten-free/?internalSource=hub%20recipe&referringContentType=Search .
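As a rough sketch of how the much smaller "anytime crepes" row could be expressed with the same predicates, the plain-Python example below builds subject, predicate, object tuples without any RDF library. Several choices here are assumptions rather than the tutorial's method: the recipe's Food.com URL is used as the subject, the minutes column is treated as the cook time, and the helper names are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def minutes_to_duration(minutes: int) -> str:
    """Format whole minutes as an ISO 8601 duration, e.g. 60 -> 'PT60M'."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return f"PT{minutes}M"


def recipe_triples(url: str, minutes: int, ingredient_terms: List[str]) -> List[Triple]:
    """Build triples for one recipe (assumes 'minutes' means cook time)."""
    subject = f"<{url}>"
    duration = minutes_to_duration(minutes)
    triples = [
        (subject, "a", "wtm:Recipe"),
        (subject, "wtm:hasCookTime", f'"{duration}"^^xsd:duration'),
    ]
    # One wtm:hasIngredient statement per ingredient term.
    triples.extend((subject, "wtm:hasIngredient", term) for term in ingredient_terms)
    return triples


triples = recipe_triples(
    "https://www.food.com/recipe/327593",
    8,
    ["ind:ChickenEgg", "ind:CowMilk", "ind:WholeWheatFlour"],
)
for s, p, o in triples:
    print(s, p, o, ".")
```

Each printed line is one statement about the recipe; a full recipe like the muffin example simply carries more of them.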
Graph Size Comparisons
One frequently asked question is about the size of the graphs that we're using in the kglab tutorial. The short answer: "No, these aren't trivial graphs."
We'll start out with small examples, to show the basics for how to construct an RDF graph.
Most of the examples here will use a knowledge graph with ~300 nodes and ~2000 edges. This is a non-trivial size, especially when you start working with some graph algorithms. Again, this tutorial has learning as its main intent, and this size of graph is ideal for running queries, validation, graph algorithms, visualization, etc., with the kinds of compute and memory resources available on contemporary laptops.
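For a sense of scale, here is a back-of-the-envelope calculation. It assumes exactly 300 nodes and 2000 edges (the text gives only approximate counts) and treats the graph as a simple directed graph, which is a simplification since an RDF graph can hold several edges between the same pair of nodes.

```python
def mean_out_degree(nodes: int, edges: int) -> float:
    """Average number of outgoing edges per node."""
    return edges / nodes


def directed_density(nodes: int, edges: int) -> float:
    """Share of all possible directed node pairs that are connected."""
    return edges / (nodes * (nodes - 1))


# Assumed round figures based on the approximate sizes quoted above.
nodes, edges = 300, 2000

print(round(mean_out_degree(nodes, edges), 2))   # 6.67
print(round(directed_density(nodes, edges), 4))  # 0.0223
```

So each node has between six and seven outgoing edges on average, while only about 2% of the possible directed pairs are connected.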
In other words, we prioritize datasets that are large enough for examples to illustrate common use cases, though small enough for learners to understand.
- graphs with 10^6 or more nodes are needed for deep learning
- graphs with about 10^8 nodes can run on contemporary laptops
- larger graphs require hardware accelerators (e.g., GPUs) or cloud-based clusters
The full recipes.tsv dataset includes nearly 250,000 recipes. In some of the later examples, we'll work with that entire dataset – which is definitely non-trivial.