Note
To run this notebook in JupyterLab, load examples/ex8_0.ipynb
Vector embedding with gensim¶
Let's make use of deep learning through a technique called embedding to analyze the relatedness of the labels used for recipe ingredients.
Among the most closely related ingredients:
- Some are very close synonyms and should be consolidated to improve data quality
- Others are distinct ingredients that frequently pair with the query, which is useful for recommendations
On the one hand, this approach is quite helpful for analyzing the NLP annotations that go into a knowledge graph.
On the other hand, it can be used along with SKOS or similar vocabularies for ontology-based discovery within the graph, e.g., for an advanced search UI.
Curating annotations¶
We'll be working with the labels for ingredients that go into our KG. Looking at the raw data, there are many cases where slightly different spellings are being used for the same entity.
As a first step let's define a list of synonyms to substitute, prior to running the vector embedding. This will help produce better quality results.
Note that this kind of work comes under the general heading of curating annotations ... which is what we spend so much time doing in KG work. It's similar to how data preparation accounts for roughly 80% of the workload for data science teams, and for good reason.
SYNONYMS = {
"pepper": "black pepper",
"black pepper": "black pepper",
"egg": "egg",
"eggs": "egg",
"vanilla": "vanilla",
"vanilla extract": "vanilla",
"flour": "flour",
"all-purpose flour": "flour",
"onions": "onion",
"onion": "onion",
"carrots": "carrot",
"carrot": "carrot",
"potatoes": "potato",
"potato": "potato",
"tomatoes": "tomato",
"fresh tomatoes": "tomato",
"fresh tomato": "tomato",
"garlic": "garlic",
"garlic clove": "garlic",
"garlic cloves": "garlic",
}
Analyze ingredient labels from 250K recipes¶
import csv
MAX_ROW = 250000 # 231638
max_context = 0
min_context = 1000
recipes = []
vocab = set()
with open("../dat/all_ind.csv", "r") as f:
    reader = csv.reader(f)
    next(reader, None)  # remove file header

    for i, row in enumerate(reader):
        id = row[0]
        ind_set = set()

        # substitute synonyms
        for ind in set(eval(row[3])):
            if ind in SYNONYMS:
                ind_set.add(SYNONYMS[ind])
            else:
                ind_set.add(ind)

        if len(ind_set) > 1:
            recipes.append([id, ind_set])
            vocab.update(ind_set)

            max_context = max(max_context, len(ind_set))
            min_context = min(min_context, len(ind_set))

        if i > MAX_ROW:
            break
print("max context: {} unique ingredients per recipe".format(max_context))
print("min context: {} unique ingredients per recipe".format(min_context))
print("vocab size", len(list(vocab)))
max context: 43 unique ingredients per recipe
min context: 2 unique ingredients per recipe
vocab size 14931
Since we've performed this data preparation work, let's use pickle to save the prepared recipes data to the tmp.pkl file:
import pickle
pickle.dump(recipes, open("tmp.pkl", "wb"))
recipes[:3]
[['137739',
{'butter',
'honey',
'mexican seasoning',
'mixed spice',
'olive oil',
'salt',
'winter squash'}],
['31490',
{'cheese',
'egg',
'milk',
'prepared pizza crust',
'salt and pepper',
'sausage patty'}],
['112140',
{'cheddar cheese',
'chili powder',
'diced tomatoes',
'ground beef',
'ground cumin',
'kidney beans',
'lettuce',
'rotel tomatoes',
'salt',
'tomato paste',
'tomato soup',
'water',
'yellow onions'}]]
Later we can restore the pickled Python data structure for use in other applications. The output above shows the first few entries, to illustrate the format.
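For example, to restore the prepared recipes in a later session (a minimal sketch, assuming the tmp.pkl file written above is present):

import pickle

# reload the prepared recipes list from the pickle file
with open("tmp.pkl", "rb") as f:
    recipes = pickle.load(f)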
Now reshape this data into a vector of vectors of ingredients per recipe, to use for training a word2vec vector embedding model:
vectors = []

for id, ind_set in recipes:
    v = []

    for ind in ind_set:
        v.append(ind)

    vectors.append(v)
vectors[:3]
[['mexican seasoning',
'mixed spice',
'salt',
'honey',
'winter squash',
'butter',
'olive oil'],
['milk',
'prepared pizza crust',
'egg',
'cheese',
'sausage patty',
'salt and pepper'],
['ground cumin',
'water',
'tomato soup',
'diced tomatoes',
'yellow onions',
'ground beef',
'lettuce',
'salt',
'rotel tomatoes',
'tomato paste',
'chili powder',
'kidney beans',
'cheddar cheese']]
We'll use the Word2Vec implementation in the gensim library (i.e., deep learning) to train an embedding model. This approach tends to work best if the training data has at least 100K rows.

Let's also show how to serialize the word2vec results, saving them to the tmp.w2v file so they can be restored later for other use cases.

NB: there is work in progress to replace gensim with pytorch.
import gensim
MIN_COUNT = 2
model_path = "tmp.w2v"
model = gensim.models.Word2Vec(vectors, min_count=MIN_COUNT, window=max_context)
model.save(model_path)
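To restore the serialized model later, gensim provides a matching load() method; a quick sketch:

# reload the trained word2vec model from the tmp.w2v file
model = gensim.models.Word2Vec.load(model_path)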
The get_related() function takes any ingredient as input, using the embedding model to find the most similar other ingredients, along with calculating levenshtein edit distances (string similarity) among these labels. Then it calculates percentiles for both metrics in numpy and returns the results as a pandas DataFrame.
import numpy as np
import pandas as pd
import pylev

def get_related (model, query, n=20, granularity=100):
    """return a DataFrame of the closely related items"""
    try:
        bins = np.linspace(0, 1, num=granularity, endpoint=True)

        v = sorted(
            model.wv.most_similar(positive=[query], topn=n),
            key=lambda x: x[1],
            reverse=True
        )

        df = pd.DataFrame(v, columns=["ingredient", "similarity"])

        s = df["similarity"]
        quantiles = s.quantile(bins, interpolation="nearest")
        df["sim_pct"] = np.digitize(s, quantiles) - 1

        df["levenshtein"] = [ pylev.levenshtein(d, query) / len(query) for d in df["ingredient"] ]

        s = df["levenshtein"]
        quantiles = s.quantile(bins, interpolation="nearest")
        df["lev_pct"] = granularity - np.digitize(s, quantiles)

        return df
    except KeyError:
        return pd.DataFrame(columns=["ingredient", "similarity", "percentile"])
Let's try this with dried basil as the ingredient to query, and review the top 50 most similar other ingredients returned as the DataFrame df:
pd.set_option("display.max_rows", None)
df = get_related(model, "dried basil", n=50)
df
 | ingredient | similarity | sim_pct | levenshtein | lev_pct |
---|---|---|---|---|---|
0 | dried basil leaves | 0.711843 | 99 | 0.636364 | 76 |
1 | dried rosemary | 0.646414 | 97 | 0.636364 | 76 |
2 | dry basil | 0.629536 | 95 | 0.272727 | 98 |
3 | dried italian seasoning | 0.613078 | 93 | 1.272727 | 28 |
4 | fresh basil | 0.596147 | 91 | 0.363636 | 94 |
5 | dried marjoram | 0.588682 | 89 | 0.636364 | 76 |
6 | italian herb seasoning | 0.565614 | 87 | 1.454545 | 16 |
7 | italian seasoning | 0.554528 | 85 | 1.090909 | 46 |
8 | dried parsley | 0.552430 | 83 | 0.454545 | 90 |
9 | dried sweet basil leaves | 0.543795 | 81 | 1.181818 | 36 |
10 | dried rosemary leaves | 0.522495 | 79 | 1.181818 | 36 |
11 | dried whole thyme | 0.510558 | 77 | 1.000000 | 58 |
12 | dried parsley flakes | 0.506447 | 75 | 1.000000 | 58 |
13 | basil | 0.483062 | 73 | 0.545455 | 88 |
14 | dried thyme leaves | 0.473144 | 71 | 1.000000 | 58 |
15 | mild italian sausage | 0.469430 | 69 | 1.363636 | 24 |
16 | italian-style tomatoes | 0.461798 | 67 | 1.727273 | 10 |
17 | part-skim mozzarella cheese | 0.461220 | 65 | 2.090909 | 0 |
18 | italian spices | 0.460701 | 63 | 1.090909 | 46 |
19 | cooked pearl barley | 0.459609 | 61 | 1.272727 | 28 |
20 | white kidney beans | 0.453172 | 59 | 1.181818 | 36 |
21 | dried oregano leaves | 0.452838 | 57 | 1.090909 | 46 |
22 | reduced-fat mozzarella cheese | 0.452544 | 55 | 2.090909 | 0 |
23 | dry oregano | 0.452370 | 53 | 0.818182 | 70 |
24 | dried summer savory | 0.451501 | 51 | 1.090909 | 46 |
25 | parmesan rind | 0.450826 | 49 | 0.909091 | 68 |
26 | quick-cooking barley | 0.445577 | 47 | 1.454545 | 16 |
27 | canned tomato sauce | 0.443889 | 45 | 1.272727 | 28 |
28 | dried thyme | 0.443561 | 43 | 0.454545 | 90 |
29 | cooked pasta | 0.437754 | 41 | 0.636364 | 76 |
30 | italian-style tomato paste | 0.435926 | 39 | 1.909091 | 6 |
31 | italian-style diced tomatoes | 0.435351 | 37 | 2.090909 | 0 |
32 | italian sausage | 0.434764 | 35 | 1.000000 | 58 |
33 | orzo pasta | 0.434725 | 33 | 0.636364 | 76 |
34 | button mushroom | 0.432255 | 31 | 1.181818 | 36 |
35 | italian sausages | 0.430821 | 29 | 1.090909 | 46 |
36 | dried red pepper flakes | 0.429260 | 27 | 1.454545 | 16 |
37 | lasagna noodles | 0.429071 | 25 | 1.181818 | 36 |
38 | italian seasoning mix | 0.428897 | 23 | 1.454545 | 16 |
39 | dried sage | 0.428746 | 21 | 0.363636 | 94 |
40 | sweet italian sausage | 0.427985 | 19 | 1.545455 | 14 |
41 | hot pepper flakes | 0.423853 | 17 | 1.272727 | 28 |
42 | rubbed sage | 0.423643 | 15 | 0.727273 | 72 |
43 | italian style breadcrumbs | 0.422995 | 13 | 1.909091 | 6 |
44 | fresh basil leaves | 0.422796 | 11 | 1.000000 | 58 |
45 | fresh mushrooms | 0.422383 | 9 | 1.090909 | 46 |
46 | t-bone type lamb chops | 0.420398 | 7 | 1.727273 | 10 |
47 | dry lentils | 0.420370 | 5 | 0.727273 | 72 |
48 | instant minced garlic | 0.419762 | 3 | 1.363636 | 24 |
49 | dried tarragon | 0.417663 | 1 | 0.636364 | 76 |
Note how some of the most similar items, based on vector embedding, are synonyms or special forms of our query ingredient dried basil: dried basil leaves, dry basil, dried sweet basil leaves, etc. These also rank high on the levenshtein percentile, i.e., their labels are only a few edits away from the query.
Let's plot the similarity measures:
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use("ggplot")
df["similarity"].plot(alpha=0.75, rot=0)
plt.show()
Notice the inflection points at approximately 0.57 and again at 0.47 in that plot.
We could use some statistical techniques (e.g., clustering) to segment the similarities into a few groups:
- highest similarity – potential synonyms for the query
- mid-range similarity – potential hypernyms and hyponyms for the query
- long-tail similarity – other ingredients that pair well with the query
In this example, below a threshold of the 75th percentile for vector embedding similarity, the related ingredients are less about being synonyms and more about other foods that pair well with basil.
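As a quick illustration of that threshold (a small sketch, not part of the original notebook), we can filter the DataFrame on its sim_pct column to surface the likely synonym candidates:

# keep only rows at or above the 75th percentile of embedding similarity;
# these are the strongest synonym/variant candidates for the query
synonym_candidates = df[df["sim_pct"] >= 75]
print(synonym_candidates[["ingredient", "similarity"]])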
Let's define another function rank_related() which ranks the related ingredients based on a combination of these two metrics. This uses a cheap approximation of a pareto archive for the ranking, which comes in handy for recommender systems and custom search applications that must combine multiple ranking metrics:
from kglab import root_mean_square

def rank_related (df):
    df2 = df.copy(deep=True)

    # combine the two percentile columns (sim_pct, lev_pct) into one score
    df2["related"] = df2.apply(lambda row: root_mean_square([row[2], row[4]]), axis=1)

    return df2.sort_values(by=["related"], ascending=False)
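As a quick sanity check (an aside, not in the original notebook): the root-mean-square of the two percentiles for dry basil, 95 and 98, reproduces its related score in the table below:

# root-mean-square: sqrt((95**2 + 98**2) / 2) -> roughly 96.51
print(np.sqrt(np.mean(np.square([95, 98]))))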
rank_related(df)
 | ingredient | similarity | sim_pct | levenshtein | lev_pct | related |
---|---|---|---|---|---|---|
2 | dry basil | 0.629536 | 95 | 0.272727 | 98 | 96.511657 |
4 | fresh basil | 0.596147 | 91 | 0.363636 | 94 | 92.512161 |
0 | dried basil leaves | 0.711843 | 99 | 0.636364 | 76 | 88.252479 |
1 | dried rosemary | 0.646414 | 97 | 0.636364 | 76 | 87.134953 |
8 | dried parsley | 0.552430 | 83 | 0.454545 | 90 | 86.570780 |
5 | dried marjoram | 0.588682 | 89 | 0.636364 | 76 | 82.755664 |
13 | basil | 0.483062 | 73 | 0.545455 | 88 | 80.848624 |
28 | dried thyme | 0.443561 | 43 | 0.454545 | 90 | 70.530135 |
3 | dried italian seasoning | 0.613078 | 93 | 1.272727 | 28 | 68.676779 |
7 | italian seasoning | 0.554528 | 85 | 1.090909 | 46 | 68.341056 |
11 | dried whole thyme | 0.510558 | 77 | 1.000000 | 58 | 68.165240 |
39 | dried sage | 0.428746 | 21 | 0.363636 | 94 | 68.106534 |
12 | dried parsley flakes | 0.506447 | 75 | 1.000000 | 58 | 67.041032 |
14 | dried thyme leaves | 0.473144 | 71 | 1.000000 | 58 | 64.826692 |
9 | dried sweet basil leaves | 0.543795 | 81 | 1.181818 | 36 | 62.677747 |
6 | italian herb seasoning | 0.565614 | 87 | 1.454545 | 16 | 62.549980 |
23 | dry oregano | 0.452370 | 53 | 0.818182 | 70 | 62.084620 |
10 | dried rosemary leaves | 0.522495 | 79 | 1.181818 | 36 | 61.388110 |
29 | cooked pasta | 0.437754 | 41 | 0.636364 | 76 | 61.061444 |
25 | parmesan rind | 0.450826 | 49 | 0.909091 | 68 | 59.266348 |
33 | orzo pasta | 0.434725 | 33 | 0.636364 | 76 | 58.587541 |
18 | italian spices | 0.460701 | 63 | 1.090909 | 46 | 55.158861 |
49 | dried tarragon | 0.417663 | 1 | 0.636364 | 76 | 53.744767 |
42 | rubbed sage | 0.423643 | 15 | 0.727273 | 72 | 52.004807 |
21 | dried oregano leaves | 0.452838 | 57 | 1.090909 | 46 | 51.792857 |
15 | mild italian sausage | 0.469430 | 69 | 1.363636 | 24 | 51.657526 |
47 | dry lentils | 0.420370 | 5 | 0.727273 | 72 | 51.034302 |
20 | white kidney beans | 0.453172 | 59 | 1.181818 | 36 | 48.872283 |
24 | dried summer savory | 0.451501 | 51 | 1.090909 | 46 | 48.564390 |
32 | italian sausage | 0.434764 | 35 | 1.000000 | 58 | 47.900939 |
16 | italian-style tomatoes | 0.461798 | 67 | 1.727273 | 10 | 47.900939 |
19 | cooked pearl barley | 0.459609 | 61 | 1.272727 | 28 | 47.460510 |
17 | part-skim mozzarella cheese | 0.461220 | 65 | 2.090909 | 0 | 45.961941 |
44 | fresh basil leaves | 0.422796 | 11 | 1.000000 | 58 | 41.743263 |
22 | reduced-fat mozzarella cheese | 0.452544 | 55 | 2.090909 | 0 | 38.890873 |
35 | italian sausages | 0.430821 | 29 | 1.090909 | 46 | 38.451268 |
27 | canned tomato sauce | 0.443889 | 45 | 1.272727 | 28 | 37.476659 |
26 | quick-cooking barley | 0.445577 | 47 | 1.454545 | 16 | 35.106979 |
34 | button mushroom | 0.432255 | 31 | 1.181818 | 36 | 33.593154 |
45 | fresh mushrooms | 0.422383 | 9 | 1.090909 | 46 | 33.143627 |
37 | lasagna noodles | 0.429071 | 25 | 1.181818 | 36 | 30.991934 |
30 | italian-style tomato paste | 0.435926 | 39 | 1.909091 | 6 | 27.901613 |
31 | italian-style diced tomatoes | 0.435351 | 37 | 2.090909 | 0 | 26.162951 |
41 | hot pepper flakes | 0.423853 | 17 | 1.272727 | 28 | 23.162470 |
36 | dried red pepper flakes | 0.429260 | 27 | 1.454545 | 16 | 22.192341 |
38 | italian seasoning mix | 0.428897 | 23 | 1.454545 | 16 | 19.811613 |
48 | instant minced garlic | 0.419762 | 3 | 1.363636 | 24 | 17.102631 |
40 | sweet italian sausage | 0.427985 | 19 | 1.545455 | 14 | 16.688319 |
43 | italian style breadcrumbs | 0.422995 | 13 | 1.909091 | 6 | 10.124228 |
46 | t-bone type lamb chops | 0.420398 | 7 | 1.727273 | 10 | 8.631338 |
Notice how the "synonym" cases tend to move up to the top now? Meanwhile, the "pairs well with" cases fall into the lower half of the ranked list: fresh mushrooms, lasagna noodles, button mushroom, white kidney beans, etc.
Exercises¶
Exercise 1:
Build a report for a human-in-the-loop reviewer, using the rank_related() function while iterating over vocab to make algorithmic suggestions for possible synonyms.
Exercise 2:
How would you make algorithmic suggestions for a reviewer about which ingredients could be related to a query, e.g., using the skos:broader and skos:narrower relations in the skos vocabulary to represent hypernyms and hyponyms respectively?
This could extend the KG to provide a kind of thesaurus about recipe ingredients.
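To illustrate how such relations might look in the graph, here's a brief sketch using rdflib with a hypothetical namespace for ingredient concepts (not a full solution to the exercise):

from rdflib import Graph, Namespace
from rdflib.namespace import SKOS

# hypothetical namespace for ingredient concepts in this KG
IND = Namespace("https://example.org/ingredient/")

g = Graph()
g.bind("skos", SKOS)

# "basil" is a broader concept (hypernym) of "dried basil",
# while "dried basil" is a narrower concept (hyponym) of "basil"
g.add((IND["dried_basil"], SKOS.broader, IND["basil"]))
g.add((IND["basil"], SKOS.narrower, IND["dried_basil"]))

print(g.serialize(format="turtle"))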