Note

To run this notebook in JupyterLab, load examples/ex8_0.ipynb

Vector embedding with gensim

Let's make use of deep learning, through a technique called embedding, to analyze the relatedness of the labels used for recipe ingredients.

Among the most closely related ingredients:

  • Some are very close synonyms and should be consolidated to improve data quality
  • Others are ingredients that frequently pair with the query, which is useful for recommendations

On the one hand, this approach is quite helpful for analyzing the NLP annotations that go into a knowledge graph. On the other hand, it can be used along with SKOS or similar vocabularies for ontology-based discovery within the graph, e.g., for an advanced search UI.

Curating annotations

We'll be working with the labels for ingredients that go into our KG. Looking at the raw data, there are many cases where slightly different spellings are being used for the same entity.

As a first step let's define a list of synonyms to substitute, prior to running the vector embedding. This will help produce better quality results.

Note that this kind of work comes under the general heading of curating annotations ... which is what we spend so much time doing in KG work. It's similar to how data preparation is roughly 80% of the workload for data science teams, and for good reason.

SYNONYMS = {
    "pepper": "black pepper",
    "black pepper": "black pepper",

    "egg": "egg",
    "eggs": "egg",

    "vanilla": "vanilla",
    "vanilla extract": "vanilla",

    "flour": "flour",
    "all-purpose flour": "flour",

    "onions": "onion",
    "onion": "onion",

    "carrots": "carrot",
    "carrot": "carrot",

    "potatoes": "potato",
    "potato": "potato",

    "tomatoes": "tomato",
    "fresh tomatoes": "tomato",
    "fresh tomato": "tomato",

    "garlic": "garlic",
    "garlic clove": "garlic",
    "garlic cloves": "garlic",
}

Analyze ingredient labels from 250K recipes

import ast
import csv

MAX_ROW = 250000 # 231638 rows in the full dataset

max_context = 0
min_context = 1000

recipes = []
vocab = set()

with open("../dat/all_ind.csv", "r") as f:
    reader = csv.reader(f)
    next(reader, None) # skip the file header

    for i, row in enumerate(reader):
        id = row[0]
        ind_set = set()

        # substitute synonyms; ast.literal_eval parses the serialized
        # ingredient list more safely than eval
        for ind in set(ast.literal_eval(row[3])):
            if ind in SYNONYMS:
                ind_set.add(SYNONYMS[ind])
            else:
                ind_set.add(ind)

        if len(ind_set) > 1:
            recipes.append([id, ind_set])
            vocab.update(ind_set)

            max_context = max(max_context, len(ind_set))
            min_context = min(min_context, len(ind_set))

        if i > MAX_ROW:
            break

print("max context: {} unique ingredients per recipe".format(max_context))
print("min context: {} unique ingredients per recipe".format(min_context))
print("vocab size", len(list(vocab)))
max context: 43 unique ingredients per recipe
min context: 2 unique ingredients per recipe
vocab size 14931

Since we've performed this data preparation work, let's use pickle to save this larger superset of the recipes dataset to the tmp.pkl file:

import pickle

with open("tmp.pkl", "wb") as f:
    pickle.dump(recipes, f)

recipes[:3]
[['137739',
  {'butter',
   'honey',
   'mexican seasoning',
   'mixed spice',
   'olive oil',
   'salt',
   'winter squash'}],
 ['31490',
  {'cheese',
   'egg',
   'milk',
   'prepared pizza crust',
   'salt and pepper',
   'sausage patty'}],
 ['112140',
  {'cheddar cheese',
   'chili powder',
   'diced tomatoes',
   'ground beef',
   'ground cumin',
   'kidney beans',
   'lettuce',
   'rotel tomatoes',
   'salt',
   'tomato paste',
   'tomato soup',
   'water',
   'yellow onions'}]]

Then we can restore the pickled Python data structure for use later in other use cases. The output above shows the first few entries, to illustrate the format.
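
For instance, restoring the prepared recipes in a later session is just a matter of reading the same tmp.pkl file back:

import pickle

# reload the prepared recipes saved above
with open("tmp.pkl", "rb") as f:
    recipes = pickle.load(f)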

Now reshape this data into a vector of vectors of ingredients per recipe, to use for training a word2vec vector embedding model:

vectors = [list(ind_set) for id, ind_set in recipes]

vectors[:3]
[['mexican seasoning',
  'mixed spice',
  'salt',
  'honey',
  'winter squash',
  'butter',
  'olive oil'],
 ['milk',
  'prepared pizza crust',
  'egg',
  'cheese',
  'sausage patty',
  'salt and pepper'],
 ['ground cumin',
  'water',
  'tomato soup',
  'diced tomatoes',
  'yellow onions',
  'ground beef',
  'lettuce',
  'salt',
  'rotel tomatoes',
  'tomato paste',
  'chili powder',
  'kidney beans',
  'cheddar cheese']]

We'll use the Word2Vec implementation in the gensim library (i.e., deep learning) to train an embedding model. This approach tends to work best when the training data has at least 100K rows.

Let's also show how to serialize the word2vec results, saving them to the tmp.w2v file so they can be restored later for other use cases.

NB: there is work in progress to replace gensim with pytorch.

import gensim

MIN_COUNT = 2
model_path = "tmp.w2v"

model = gensim.models.Word2Vec(vectors, min_count=MIN_COUNT, window=max_context)
model.save(model_path)
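
Restoring the serialized model later is symmetric, using the load() class method (a minimal sketch, assuming the tmp.w2v file saved above):

import gensim

# reload the trained embedding model
model = gensim.models.Word2Vec.load("tmp.w2v")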

The get_related() function takes any ingredient as input, using the embedding model to find the most similar other ingredients, while also calculating the Levenshtein edit distances (string similarity) among these labels. Then it calculates percentiles for both metrics in numpy and returns the results as a pandas DataFrame.

import numpy as np
import pandas as pd
import pylev

def get_related (model, query, n=20, granularity=100):
    """return a DataFrame of the closely related items"""
    try:
        bins = np.linspace(0, 1, num=granularity, endpoint=True)

        # rank the nearest neighbors from the embedding model
        v = sorted(
            model.wv.most_similar(positive=[query], topn=n),
            key=lambda x: x[1],
            reverse=True
        )

        df = pd.DataFrame(v, columns=["ingredient", "similarity"])

        # percentile of each vector similarity score
        s = df["similarity"]
        quantiles = s.quantile(bins, interpolation="nearest")
        df["sim_pct"] = np.digitize(s, quantiles) - 1

        # Levenshtein distance, normalized by the query length, then
        # inverted so that higher percentiles mean closer spellings
        df["levenshtein"] = [ pylev.levenshtein(d, query) / len(query) for d in df["ingredient"] ]
        s = df["levenshtein"]
        quantiles = s.quantile(bins, interpolation="nearest")
        df["lev_pct"] = granularity - np.digitize(s, quantiles)

        return df
    except KeyError:
        # the query term is out of vocabulary
        return pd.DataFrame(columns=["ingredient", "similarity", "sim_pct", "levenshtein", "lev_pct"])

Let's try this with dried basil as the ingredient to query, and review the top 50 most similar other ingredients returned as the DataFrame df:

pd.set_option("display.max_rows", None)

df = get_related(model, "dried basil", n=50)
df
ingredient similarity sim_pct levenshtein lev_pct
0 dried basil leaves 0.711843 99 0.636364 76
1 dried rosemary 0.646414 97 0.636364 76
2 dry basil 0.629536 95 0.272727 98
3 dried italian seasoning 0.613078 93 1.272727 28
4 fresh basil 0.596147 91 0.363636 94
5 dried marjoram 0.588682 89 0.636364 76
6 italian herb seasoning 0.565614 87 1.454545 16
7 italian seasoning 0.554528 85 1.090909 46
8 dried parsley 0.552430 83 0.454545 90
9 dried sweet basil leaves 0.543795 81 1.181818 36
10 dried rosemary leaves 0.522495 79 1.181818 36
11 dried whole thyme 0.510558 77 1.000000 58
12 dried parsley flakes 0.506447 75 1.000000 58
13 basil 0.483062 73 0.545455 88
14 dried thyme leaves 0.473144 71 1.000000 58
15 mild italian sausage 0.469430 69 1.363636 24
16 italian-style tomatoes 0.461798 67 1.727273 10
17 part-skim mozzarella cheese 0.461220 65 2.090909 0
18 italian spices 0.460701 63 1.090909 46
19 cooked pearl barley 0.459609 61 1.272727 28
20 white kidney beans 0.453172 59 1.181818 36
21 dried oregano leaves 0.452838 57 1.090909 46
22 reduced-fat mozzarella cheese 0.452544 55 2.090909 0
23 dry oregano 0.452370 53 0.818182 70
24 dried summer savory 0.451501 51 1.090909 46
25 parmesan rind 0.450826 49 0.909091 68
26 quick-cooking barley 0.445577 47 1.454545 16
27 canned tomato sauce 0.443889 45 1.272727 28
28 dried thyme 0.443561 43 0.454545 90
29 cooked pasta 0.437754 41 0.636364 76
30 italian-style tomato paste 0.435926 39 1.909091 6
31 italian-style diced tomatoes 0.435351 37 2.090909 0
32 italian sausage 0.434764 35 1.000000 58
33 orzo pasta 0.434725 33 0.636364 76
34 button mushroom 0.432255 31 1.181818 36
35 italian sausages 0.430821 29 1.090909 46
36 dried red pepper flakes 0.429260 27 1.454545 16
37 lasagna noodles 0.429071 25 1.181818 36
38 italian seasoning mix 0.428897 23 1.454545 16
39 dried sage 0.428746 21 0.363636 94
40 sweet italian sausage 0.427985 19 1.545455 14
41 hot pepper flakes 0.423853 17 1.272727 28
42 rubbed sage 0.423643 15 0.727273 72
43 italian style breadcrumbs 0.422995 13 1.909091 6
44 fresh basil leaves 0.422796 11 1.000000 58
45 fresh mushrooms 0.422383 9 1.090909 46
46 t-bone type lamb chops 0.420398 7 1.727273 10
47 dry lentils 0.420370 5 0.727273 72
48 instant minced garlic 0.419762 3 1.363636 24
49 dried tarragon 0.417663 1 0.636364 76

Note how some of the most similar items, based on vector embedding, are synonyms or special forms of our query dried basil ingredient: dried basil leaves, dry basil, dried sweet basil leaves, etc. These also tend to rank high on the Levenshtein measure, since their spellings are close to the query.

Let's plot the similarity measures:

import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use("ggplot")

df["similarity"].plot(alpha=0.75, rot=0)
plt.show()

[plot: similarity scores for the 50 related ingredients, in descending rank order]

Notice the inflection points at approximately 0.56 and again at 0.47 in that plot. We could use some statistical techniques (e.g., clustering, as sketched below) to segment the similarities into a few groups:

  • highest similarity – potential synonyms for the query
  • mid-range similarity – potential hypernyms and hyponyms for the query
  • long-tail similarity – other ingredients that pair well with the query

In this example, below a threshold of the 75th percentile for vector embedding similarity, the related ingredients are less about being synonyms and more about other foods that pair well with basil.
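
Here's a minimal sketch of that segmentation using k-means clustering from scikit-learn (an assumption on our part; any 1-D clustering or simple thresholding would serve):

from sklearn.cluster import KMeans

# segment the similarity scores into three groups, roughly: potential
# synonyms, hypernyms/hyponyms, and "pairs well with" ingredients
km = KMeans(n_clusters=3, n_init=10, random_state=1)
df["group"] = km.fit_predict(df[["similarity"]])

# show the similarity range covered by each group
df.groupby("group")["similarity"].agg(["min", "max", "count"])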

Let's define another function rank_related() which ranks the related ingredients based on a combination of these two metrics. This uses a cheap approximation of a Pareto archive for the ranking, which comes in handy for recommender systems and custom search applications that must combine multiple ranking metrics:

from kglab import root_mean_square

def rank_related (df):
    df2 = df.copy(deep=True)
    # combine the two percentile metrics via their root mean square
    df2["related"] = df2.apply(lambda row: root_mean_square([row["sim_pct"], row["lev_pct"]]), axis=1)
    return df2.sort_values(by=["related"], ascending=False)

rank_related(df)
ingredient similarity sim_pct levenshtein lev_pct related
2 dry basil 0.629536 95 0.272727 98 96.511657
4 fresh basil 0.596147 91 0.363636 94 92.512161
0 dried basil leaves 0.711843 99 0.636364 76 88.252479
1 dried rosemary 0.646414 97 0.636364 76 87.134953
8 dried parsley 0.552430 83 0.454545 90 86.570780
5 dried marjoram 0.588682 89 0.636364 76 82.755664
13 basil 0.483062 73 0.545455 88 80.848624
28 dried thyme 0.443561 43 0.454545 90 70.530135
3 dried italian seasoning 0.613078 93 1.272727 28 68.676779
7 italian seasoning 0.554528 85 1.090909 46 68.341056
11 dried whole thyme 0.510558 77 1.000000 58 68.165240
39 dried sage 0.428746 21 0.363636 94 68.106534
12 dried parsley flakes 0.506447 75 1.000000 58 67.041032
14 dried thyme leaves 0.473144 71 1.000000 58 64.826692
9 dried sweet basil leaves 0.543795 81 1.181818 36 62.677747
6 italian herb seasoning 0.565614 87 1.454545 16 62.549980
23 dry oregano 0.452370 53 0.818182 70 62.084620
10 dried rosemary leaves 0.522495 79 1.181818 36 61.388110
29 cooked pasta 0.437754 41 0.636364 76 61.061444
25 parmesan rind 0.450826 49 0.909091 68 59.266348
33 orzo pasta 0.434725 33 0.636364 76 58.587541
18 italian spices 0.460701 63 1.090909 46 55.158861
49 dried tarragon 0.417663 1 0.636364 76 53.744767
42 rubbed sage 0.423643 15 0.727273 72 52.004807
21 dried oregano leaves 0.452838 57 1.090909 46 51.792857
15 mild italian sausage 0.469430 69 1.363636 24 51.657526
47 dry lentils 0.420370 5 0.727273 72 51.034302
20 white kidney beans 0.453172 59 1.181818 36 48.872283
24 dried summer savory 0.451501 51 1.090909 46 48.564390
32 italian sausage 0.434764 35 1.000000 58 47.900939
16 italian-style tomatoes 0.461798 67 1.727273 10 47.900939
19 cooked pearl barley 0.459609 61 1.272727 28 47.460510
17 part-skim mozzarella cheese 0.461220 65 2.090909 0 45.961941
44 fresh basil leaves 0.422796 11 1.000000 58 41.743263
22 reduced-fat mozzarella cheese 0.452544 55 2.090909 0 38.890873
35 italian sausages 0.430821 29 1.090909 46 38.451268
27 canned tomato sauce 0.443889 45 1.272727 28 37.476659
26 quick-cooking barley 0.445577 47 1.454545 16 35.106979
34 button mushroom 0.432255 31 1.181818 36 33.593154
45 fresh mushrooms 0.422383 9 1.090909 46 33.143627
37 lasagna noodles 0.429071 25 1.181818 36 30.991934
30 italian-style tomato paste 0.435926 39 1.909091 6 27.901613
31 italian-style diced tomatoes 0.435351 37 2.090909 0 26.162951
41 hot pepper flakes 0.423853 17 1.272727 28 23.162470
36 dried red pepper flakes 0.429260 27 1.454545 16 22.192341
38 italian seasoning mix 0.428897 23 1.454545 16 19.811613
48 instant minced garlic 0.419762 3 1.363636 24 17.102631
40 sweet italian sausage 0.427985 19 1.545455 14 16.688319
43 italian style breadcrumbs 0.422995 13 1.909091 6 10.124228
46 t-bone type lamb chops 0.420398 7 1.727273 10 8.631338

Notice how the "synonym" cases tend to move up toward the top now, while the "pairs well with" cases fall into the lower half of the ranked list: fresh mushrooms, button mushroom, lasagna noodles, white kidney beans, etc.


Exercises

Exercise 1:

Build a report for a human-in-the-loop reviewer, using the rank_related() function while iterating over vocab to make algorithmic suggestions for possible synonyms.
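
As a starting point, here is a possible scaffold (the 90.0 threshold is an assumption, to be tuned during review):

suggestions = []

# collect the top-ranked related label for each ingredient, flagging
# likely synonyms for a human reviewer
for ind in sorted(vocab):
    df_rel = get_related(model, ind, n=10)

    if not df_rel.empty:
        top = rank_related(df_rel).iloc[0]

        if top["related"] > 90.0:
            suggestions.append((ind, top["ingredient"], top["related"]))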

Exercise 2:

How would you make algorithmic suggestions for a reviewer about which ingredients could be related to a query, e.g., using the skos:broader and skos:narrower relations in the skos vocabulary to represent hypernyms and hyponyms respectively? This could extend the KG to provide a kind of thesaurus about recipe ingredients.
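
For reference, here's a minimal sketch (using rdflib directly, with a hypothetical ind: namespace) of how such relations could be represented in the graph:

from rdflib import Graph, Namespace
from rdflib.namespace import SKOS

IND = Namespace("https://example.com/ind/")  # hypothetical namespace

g = Graph()
g.bind("skos", SKOS)
g.bind("ind", IND)

# "basil" as a hypernym of "dried basil", and vice versa
g.add((IND["basil"], SKOS.narrower, IND["dried_basil"]))
g.add((IND["dried_basil"], SKOS.broader, IND["basil"]))

print(g.serialize(format="turtle"))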

