To run this notebook in JupyterLab, load
Throughout this tutorial we'll work with data in the
!ls -goh ../dat
total 109896
-rw-r--r-- 1 36K Jan 26 16:38 acq.ttl
-rw-r--r-- 1 40M Jan 15 09:57 all_ind.csv
-rw-r--r-- 1 24K Jan 15 09:57 data_prep.ipynb
drwxr-xr-x 10 320B Nov 6 03:10 food_com
-rw-r--r-- 1 1.0K Feb 10 14:54 gorm.ttl
-rw-r--r-- 1 2.7K Jan 15 09:57 nom.ttl
drwxr-xr-x 13 416B Nov 25 10:02 psl
-rwxr-xr-x 1 128K Jan 15 09:57 recipes.csv
-rw-r--r-- 1 65K Jan 15 09:57 recipes.ttl
drwxr-xr-x 5 160B Jan 15 09:57 titanic
-rw-r--r-- 1 14M Jan 15 09:57 weatherAUS.csv
"Food.com Recipes and Interactions"
One of the simpler recipes in that dataset is
"anytime crepes" at https://www.food.com/recipe/327593
- id: 327593
- minutes: 8
- ingredients: "['egg', 'milk', 'whole wheat flour']"
The tutorial begins by showing how to represent the metadata for this recipe in a knowledge graph, then gradually builds up more and more information about this collection of recipes.
To start, let's load and examine the CSV data:
import pandas as pd

df = pd.read_csv("../dat/recipes.csv")
df.head()
| | id | name | minutes | tags | description | ingredients |
|---|---|---|---|---|---|---|
| 0 | 164636 | 1 1 1 tempura batter | 5 | ['15-minutes-or-less', 'time-to-make', 'course... | i use this everytime i make onion rings, hot p... | ['egg', 'flour', 'water'] |
| 1 | 144841 | 2 step pound cake for a kitchen aide mixer | 110 | ['time-to-make', 'course', 'preparation', 'occ... | this recipe was published in a southern living... | ['flour', 'sugar', 'butter', 'milk', 'eggs', '... |
| 2 | 189437 | 40 second omelet | 25 | ['30-minutes-or-less', 'time-to-make', 'course... | you'll need an "inverted pancake turner" for t... | ['eggs', 'water', 'butter'] |
| 3 | 19104 | all purpose dinner crepes batter | 90 | ['weeknight', 'time-to-make', 'course', 'main-... | this basic crepe recipe can be used for all yo... | ['eggs', 'salt', 'flour', 'milk', 'butter'] |
| 4 | 64793 | amish friendship starter | 14405 | ['weeknight', 'time-to-make', 'course', 'cuisi... | this recipe was given to me years ago by a fri... | ['sugar', 'flour', 'milk'] |
Now let's drill down to the metadata for the
"anytime crepes" recipe
# .iloc[0] selects the first matching row as a Series
recipe_row = df[df["name"] == "anytime crepes"].iloc[0]
recipe_row
id             327593
name           anytime crepes
minutes        8
tags           ['15-minutes-or-less', 'time-to-make', 'course...
description    from my friend linda, this is an oh-so-easy-an...
ingredients    ['egg', 'milk', 'whole wheat flour']
Name: 8, dtype: object
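Matching on the recipe name works here, but names in a recipe collection are not guaranteed to be unique. As a hedged sketch, the same row can be selected by its `id` value instead. The small `sample_df` below is a hypothetical stand-in built from two of the rows shown above, so the snippet runs without the CSV file:

```python
import pandas as pd

# hypothetical stand-in for the full DataFrame, using values shown above
sample_df = pd.DataFrame([
    {
        "id": 164636,
        "name": "1 1 1 tempura batter",
        "minutes": 5,
        "ingredients": "['egg', 'flour', 'water']",
    },
    {
        "id": 327593,
        "name": "anytime crepes",
        "minutes": 8,
        "ingredients": "['egg', 'milk', 'whole wheat flour']",
    },
])

# filter on the numeric id, then confirm exactly one row matched
matches = sample_df[sample_df["id"] == 327593]
assert len(matches) == 1, "expected exactly one recipe with this id"

# take that single row as a Series
sample_row = matches.iloc[0]
print(sample_row["name"], sample_row["minutes"])
```

Checking the number of matches before calling `.iloc[0]` guards against silently picking the first of several rows.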
Given that we have a rich source of linked data to use, next we need to focus on knowledge representation. We'll use the FoodOn ontology (see below) to represent recipes, making use of two of its controlled vocabularies:
The first one defines an entity called
Recipe which has the full URL of http://purl.org/heals/food/Recipe and we'll use that to represent our recipe data from the Food.com dataset.
It's a common practice to abbreviate the first part of the URL for a controlled vocabulary with a prefix. In this case we'll use the prefix conventions used in previous publications related to this ontology:
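To illustrate how a prefix abbreviation works, here is a minimal plain-Python sketch. The prefix `wtm`, used for this vocabulary elsewhere in this tutorial, is assumed to stand for `http://purl.org/heals/food/`, which is inferred from the `Recipe` URL given above. The helper `expand_prefixed_name` is hypothetical and is not part of kglab or any RDF library:

```python
# assumed mapping, inferred from the full URL of Recipe given above
PREFIXES = {
    "wtm": "http://purl.org/heals/food/",
}


def expand_prefixed_name(name: str, prefixes: dict = PREFIXES) -> str:
    """Expand a prefixed name such as 'wtm:Recipe' into a full URL."""
    prefix, sep, local_name = name.partition(":")

    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {name!r}")

    return prefixes[prefix] + local_name


recipe_url = expand_prefixed_name("wtm:Recipe")
print(recipe_url)
```

In practice an RDF library manages these namespace bindings; the sketch only shows that a prefixed name is shorthand for string concatenation.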
Now let's represent the data using this ontology, starting with the three ingredients for the anytime crepes recipe:
ingredients = eval(recipe_row["ingredients"])
ingredients
['egg', 'milk', 'whole wheat flour']
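The `ingredients` column stores each list as a string, which is why the call to `eval` is needed. Since `eval` executes any Python expression it is given, a more defensive option for data from an external source is `ast.literal_eval` from the standard library, which accepts only literals. A minimal sketch, using the string value shown above:

```python
import ast

# the raw string value of the ingredients field for "anytime crepes"
raw_ingredients = "['egg', 'milk', 'whole wheat flour']"

# parse the string into a Python list without executing arbitrary code
parsed_ingredients = ast.literal_eval(raw_ingredients)

# anything other than a literal is rejected rather than executed
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True

print(parsed_ingredients, rejected)
```

For this dataset both approaches return the same list; the difference matters only if a field contains something other than a list literal.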
These ingredients are then represented, respectively, as:
We'll use several different sources for data and ontology throughout the kglab tutorial, although most of it focuses on progressive examples that use FoodOn.
FoodOn – subtitled "a farm to fork ontology" – takes a comprehensive view of the data and metadata involved in our food supply, beginning with seed genomics, micronutrients, the biology of food allergies, etc. This work is predicated on leveraging large knowledge graphs to represent the different areas of science, technology, business, public policy, etc.:
The need to represent knowledge about food is central to many human activities including agriculture, medicine, food safety inspection, shopping patterns, and sustainable development. FoodOn is an ontology – a controlled vocabulary which can be used by both people and computers – to name all parts of animals, plants, and fungi which can bear a food role for humans and domesticated animals, as well as derived food products and the processes used to make them.
For more details, see:
We'll work through several examples of representation, although here's an example of what a full recipe in FoodOn would look like:

owl:NamedIndividual a wtm:Recipe ;
    rdf:about ind:BananaBlueberryAlmondFlourMuffin ;
    wtm:hasIngredient ind:AlmondMeal ;
    wtm:hasIngredient ind:AppleCiderVinegar ;
    wtm:hasIngredient ind:BakingSoda ;
    wtm:hasIngredient ind:Banana ;
    wtm:hasIngredient ind:Blueberry ;
    wtm:hasIngredient ind:ChickenEgg ;
    wtm:hasIngredient ind:Honey ;
    wtm:isRecommendedForCourse wtm:Dessert ;
    wtm:isRecommendedForMeal wtm:Breakfast ;
    wtm:isRecommendedForMeal wtm:Snack ;
    wtm:hasCookTime "PT60M"^^xsd:duration ;
    wtm:hasCookingTemperature "350"^^xsd:integer ;
    wtm:serves "4"^^xsd:integer ;
    rdfs:label "banana blueberry almond flour muffin" ;
    skos:definition "a banana blueberry muffin made with almond flour" ;
    skos:scopeNote "recipe" ;
    prov:wasDerivedFrom https://www.allrecipes.com/recipe/238012/banana-blueberry-almond-flour-muffins-gluten-free/?internalSource=hub%20recipe&referringContentType=Search .
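In the example above, the cook time is written as `"PT60M"^^xsd:duration`, an ISO 8601 duration meaning 60 minutes. The `minutes` column in the Food.com data could be converted to the same form. The helper name `minutes_to_duration` is hypothetical, and whether this tutorial maps `minutes` onto `wtm:hasCookTime` in exactly this way is an assumption:

```python
def minutes_to_duration(minutes: int) -> str:
    """Format a whole number of minutes as an ISO 8601 duration string."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")

    # "PT" introduces the time part of a duration; "M" marks minutes
    return f"PT{int(minutes)}M"


# 60 minutes, as in the muffin example above
muffin_cook_time = minutes_to_duration(60)

# 8 minutes, as in the "anytime crepes" row
crepes_cook_time = minutes_to_duration(8)

print(muffin_cook_time, crepes_cook_time)
```

Keeping the value in minutes, rather than normalizing to hours and minutes, mirrors the `PT60M` form used in the example.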
Graph Size Comparisons
One frequently asked question is about the size of the graphs that we're using in the kglab tutorial. The short answer: "No, these aren't trivial graphs."
We'll start out with small examples, to show the basics for how to construct an RDF graph.
Most of the examples here will use a knowledge graph with ~300 nodes and ~2000 edges. This is a non-trivial size, especially when you start working with some graph algorithms. Again, this tutorial has learning as its main intent, and this size of graph is ideal for running queries, validation, graph algorithms, visualization, etc., with the kinds of compute and memory resources available on contemporary laptops.
In other words, we prioritize datasets that are large enough for examples to illustrate common use cases, though small enough for learners to understand.
- 10^6 or more nodes are needed for deep learning
- 10^8 can run on contemporary laptops
- larger graphs require hardware accelerators (e.g., GPUs) or cloud-based clusters
The recipes.tsv dataset includes nearly 250,000 recipes. In some of the later examples, we'll work with that entire dataset – which is definitely non-trivial.