
Note

To run this notebook in JupyterLab, load examples/ex8_0.ipynb

Vector embedding with gensim

Let's make use of deep learning, through a technique called vector embedding, to analyze the relatedness of the labels used for recipe ingredients.

Among the most closely related ingredients:

  • Some are near-exact synonyms, which should be consolidated to improve data quality
  • Others are distinct ingredients that frequently pair with the query, which is useful for recommendations

On the one hand, this approach is quite helpful for analyzing the NLP annotations that go into a knowledge graph. On the other hand, it can be used along with SKOS or similar vocabularies for ontology-based discovery within the graph, e.g., for an advanced search UI.

Curating annotations

We'll be working with the labels for ingredients that go into our KG. Looking at the raw data, we find many cases where slightly different spellings are used for the same entity.

As a first step let's define a list of synonyms to substitute, prior to running the vector embedding. This will help produce better quality results.

Note that this kind of work comes under the general heading of curating annotations ... which is what we spend so much time doing in KG work. It's similar to how data preparation accounts for roughly 80% of the workload on data science teams, and for good reason.

SYNONYMS = {
    "pepper": "black pepper",
    "black pepper": "black pepper",

    "egg": "egg",
    "eggs": "egg",

    "vanilla": "vanilla",
    "vanilla extract": "vanilla",

    "flour": "flour",
    "all-purpose flour": "flour",

    "onions": "onion",
    "onion": "onion",

    "carrots": "carrot",
    "carrot": "carrot",

    "potatoes": "potato",
    "potato": "potato",

    "tomatoes": "tomato",
    "fresh tomatoes": "tomato",
    "fresh tomato": "tomato",

    "garlic": "garlic",
    "garlic clove": "garlic",
    "garlic cloves": "garlic",
}

Analyze ingredient labels from 250K recipes

import ast
import csv

MAX_ROW = 250000 # 231638

max_context = 0
min_context = 1000

recipes = []
vocab = set()

with open("../dat/all_ind.csv", "r") as f:
    reader = csv.reader(f)
    next(reader, None) # skip the CSV header row

    for i, row in enumerate(reader):
        id = row[0]
        ind_set = set()

        # parse the stringified ingredient list safely, then substitute synonyms
        for ind in set(ast.literal_eval(row[3])):
            ind_set.add(SYNONYMS.get(ind, ind))

        if len(ind_set) > 1:
            recipes.append([id, ind_set])
            vocab.update(ind_set)

            max_context = max(max_context, len(ind_set))
            min_context = min(min_context, len(ind_set))

        if i > MAX_ROW:
            break

print("max context: {} unique ingredients per recipe".format(max_context))
print("min context: {} unique ingredients per recipe".format(min_context))
print("vocab size", len(vocab))
max context: 43 unique ingredients per recipe
min context: 2 unique ingredients per recipe
vocab size 14931

Since we've performed this data preparation work, let's use pickle to save the cleaned-up recipes dataset to the tmp.pkl file:

import pickle

with open("tmp.pkl", "wb") as f:
    pickle.dump(recipes, f)

recipes[:3]
[['137739',
  {'butter',
   'honey',
   'mexican seasoning',
   'mixed spice',
   'olive oil',
   'salt',
   'winter squash'}],
 ['31490',
  {'cheese',
   'egg',
   'milk',
   'prepared pizza crust',
   'salt and pepper',
   'sausage patty'}],
 ['112140',
  {'cheddar cheese',
   'chili powder',
   'diced tomatoes',
   'ground beef',
   'ground cumin',
   'kidney beans',
   'lettuce',
   'rotel tomatoes',
   'salt',
   'tomato paste',
   'tomato soup',
   'water',
   'yellow onions'}]]

Then we can restore the pickled Python data structure for use later in other use cases. The output above shows the first few entries, to illustrate the format.
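Restoring the pickled data later is the mirror image of the dump above. The round trip below uses a throwaway file name (tmp_demo.pkl, not part of the notebook) so the notebook's tmp.pkl stays untouched:

```python
import pickle

# round-trip a small structure, mirroring the tmp.pkl save/restore above
demo = [["137739", {"butter", "honey", "salt"}]]

with open("tmp_demo.pkl", "wb") as f:
    pickle.dump(demo, f)

with open("tmp_demo.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored == demo)  # True
```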

Now reshape this data into a vector of vectors of ingredients per recipe, to use for training a word2vec vector embedding model:

vectors = [
    list(ind_set)
    for id, ind_set in recipes
]

vectors[:3]
[['mexican seasoning',
  'winter squash',
  'butter',
  'honey',
  'olive oil',
  'salt',
  'mixed spice'],
 ['milk',
  'prepared pizza crust',
  'egg',
  'cheese',
  'salt and pepper',
  'sausage patty'],
 ['rotel tomatoes',
  'chili powder',
  'tomato soup',
  'tomato paste',
  'lettuce',
  'water',
  'ground cumin',
  'diced tomatoes',
  'ground beef',
  'yellow onions',
  'cheddar cheese',
  'salt',
  'kidney beans']]

We'll use the Word2Vec implementation in the gensim library (i.e., deep learning) to train an embedding model. This approach tends to work best if the training data has at least 100K rows.

Let's also show how to serialize the word2vec results, saving them to the tmp.w2v file so they could be restored later for other use cases.

NB: work is in progress to replace gensim with PyTorch.

import gensim

MIN_COUNT = 2
model_path = "tmp.w2v"

model = gensim.models.Word2Vec(vectors, min_count=MIN_COUNT, window=max_context)
model.save(model_path)

The get_related() function takes any ingredient as input, using the embedding model to find the most similar other ingredients, along with calculating Levenshtein edit distances (string similarity) among these labels. Then it calculates percentiles for both metrics in numpy and returns the results as a pandas DataFrame.

import numpy as np
import pandas as pd
import pylev

def term_ratio (target, description):
    d_set = set(description.split(" "))
    num_inter = len(d_set.intersection(target))
    return num_inter / float(len(target))


def get_related (model, query, target, n=20, granularity=100):
    """return a DataFrame of the closely related items"""
    try:
        bins = np.linspace(0, 1, num=granularity, endpoint=True)

        v = sorted(
            model.wv.most_similar(positive=[query], topn=n), 
            key=lambda x: x[1], 
            reverse=True
        )

        df = pd.DataFrame(v, columns=["ingredient", "similarity"])

        s = df["similarity"]
        quantiles = s.quantile(bins, interpolation="nearest")
        df["sim_pct"] = np.digitize(s, quantiles) - 1

        df["levenshtein"] = [ pylev.levenshtein(d, query) / len(query) for d in df["ingredient"] ]
        s = df["levenshtein"]
        quantiles = s.quantile(bins, interpolation="nearest")
        df["lev_pct"] = granularity - np.digitize(s, quantiles)

        df["term_ratio"] = [ term_ratio(target, d) for d in df["ingredient"] ]

        return df
    except KeyError:
        # the query is not in the model's vocabulary
        return pd.DataFrame(columns=["ingredient", "similarity", "sim_pct", "levenshtein", "lev_pct", "term_ratio"])

Let's try this with dried basil as the ingredient to query, and review the top 50 most similar other ingredients returned as the DataFrame df:

pd.set_option("display.max_rows", None)

target = set([ "basil" ])

df = get_related(model, "dried basil", target, n=50)
df
ingredient similarity sim_pct levenshtein lev_pct term_ratio
0 dried basil leaves 0.680018 99 0.636364 78 1.0
1 dry basil 0.640774 97 0.272727 98 1.0
2 dried italian seasoning 0.638644 95 1.272727 30 0.0
3 dried rosemary 0.601175 93 0.636364 78 0.0
4 dried sweet basil leaves 0.588954 91 1.181818 32 1.0
5 fresh basil 0.581067 89 0.363636 96 1.0
6 dried marjoram 0.570613 87 0.636364 78 0.0
7 italian seasoning 0.558167 85 1.090909 44 0.0
8 dried parsley 0.543460 83 0.454545 92 0.0
9 dried rosemary leaves 0.519119 81 1.181818 32 0.0
10 italian herb seasoning 0.505750 79 1.454545 20 0.0
11 dried parsley flakes 0.504157 77 1.000000 56 0.0
12 quick-cooking barley 0.480627 75 1.454545 20 0.0
13 dried thyme leaves 0.479252 73 1.000000 56 0.0
14 basil 0.478175 71 0.545455 90 1.0
15 italian sausages 0.478026 69 1.090909 44 0.0
16 italian spices 0.474901 67 1.090909 44 0.0
17 dried italian herb seasoning 0.473978 65 1.636364 14 0.0
18 part-skim mozzarella cheese 0.473019 63 2.090909 0 0.0
19 dry oregano 0.471215 61 0.818182 70 0.0
20 lasagna noodles 0.466840 59 1.181818 32 0.0
21 italian sausage 0.466552 57 1.000000 56 0.0
22 cooked spaghetti 0.460726 55 1.090909 44 0.0
23 fresh basil leaves 0.456533 53 1.000000 56 1.0
24 herbed croutons 0.456307 51 1.000000 56 0.0
25 white kidney beans 0.455640 49 1.181818 32 0.0
26 dried thyme 0.454236 47 0.454545 92 0.0
27 fat-free ricotta cheese 0.453549 45 1.818182 8 0.0
28 ditalini 0.442861 43 0.727273 76 0.0
29 sliced mushrooms 0.442224 41 1.000000 56 0.0
30 cooked pasta 0.438589 39 0.636364 78 0.0
31 sweet italian sausage 0.438178 37 1.545455 16 0.0
32 italian turkey sausage 0.437979 35 1.545455 16 0.0
33 fresh mushrooms 0.437678 33 1.090909 44 0.0
34 dried red pepper flakes 0.436070 31 1.454545 20 0.0
35 reduced-fat ricotta cheese 0.435862 29 1.909091 4 0.0
36 italian style breadcrumbs 0.433100 27 1.909091 4 0.0
37 fresh basil leaf 0.432709 25 0.818182 70 1.0
38 italian-style tomatoes 0.431849 23 1.727273 10 0.0
39 ziti pasta 0.430601 21 0.636364 78 0.0
40 italian-style diced tomatoes 0.430432 19 2.090909 0 0.0
41 hot italian sausage link 0.430252 17 1.727273 10 0.0
42 cannellini beans 0.428287 15 1.181818 32 0.0
43 dried whole thyme 0.428099 13 1.000000 56 0.0
44 small shell pasta 0.426476 11 1.181818 32 0.0
45 thin spaghetti 0.425472 9 1.090909 44 0.0
46 dried tarragon 0.425135 7 0.636364 78 0.0
47 cheese ravioli 0.425128 5 0.818182 70 0.0
48 italian seasoning mix 0.424475 3 1.454545 20 0.0
49 herb seasoning mix 0.424261 1 1.363636 28 0.0

Note how some of the most similar items, based on vector embedding, are synonyms or special forms of our query dried basil ingredient: dried basil leaves, dry basil, dried sweet basil leaves, etc. These tend to rank high in terms of Levenshtein similarity too.

Let's plot the similarity measures:

import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use("ggplot")

df["similarity"].plot(alpha=0.75, rot=0)
plt.show()

(plot: similarity scores for the 50 most similar ingredients, in rank order)

Notice the inflection points at approximately 0.56 and again at 0.47 in that plot. We could use some statistical techniques (e.g., clustering) to segment the similarities into a few groups:

  • highest similarity – potential synonyms for the query
  • mid-range similarity – potential hypernyms and hyponyms for the query
  • long-tail similarity – other ingredients that pair well with the query

In this example, below a threshold of the 75th percentile for vector embedding similarity, the related ingredients are less about being synonyms and more about other foods that pair well with basil.
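With the percentile column in hand, that cutoff is a simple filter. The miniature DataFrame below is a stand-in for the df returned by get_related(), just to show the split:

```python
import pandas as pd

# a tiny stand-in for the DataFrame returned by get_related()
df_demo = pd.DataFrame({
    "ingredient": ["dried basil leaves", "dry basil", "cooked pasta", "lasagna noodles"],
    "sim_pct": [99, 97, 39, 59],
})

# split at the 75th percentile of embedding similarity
synonym_candidates = df_demo.loc[df_demo["sim_pct"] >= 75]
pairings = df_demo.loc[df_demo["sim_pct"] < 75]

print(len(synonym_candidates), len(pairings))  # 2 2
```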

Let's define another function rank_related() which ranks the related ingredients based on a combination of these two metrics. This uses a cheap approximation of a Pareto archive for the ranking, which comes in handy for recommender systems and custom search applications that must combine multiple ranking metrics:

from kglab import root_mean_square

def rank_related (df):
    df2 = df.copy(deep=True)
    df2["related"] = df2.apply(lambda row: root_mean_square([ row["sim_pct"], row["lev_pct"] ]), axis=1)
    return df2.sort_values(by=["related"], ascending=False)

df = rank_related(df)
df
ingredient similarity sim_pct levenshtein lev_pct term_ratio related
1 dry basil 0.640774 97 0.272727 98 1.0 97.501282
5 fresh basil 0.581067 89 0.363636 96 1.0 92.566193
0 dried basil leaves 0.680018 99 0.636364 78 1.0 89.120705
8 dried parsley 0.543460 83 0.454545 92 0.0 87.615638
3 dried rosemary 0.601175 93 0.636364 78 0.0 85.828317
6 dried marjoram 0.570613 87 0.636364 78 0.0 82.622636
14 basil 0.478175 71 0.545455 90 1.0 81.058621
26 dried thyme 0.454236 47 0.454545 92 0.0 73.051352
2 dried italian seasoning 0.638644 95 1.272727 30 0.0 70.445014
4 dried sweet basil leaves 0.588954 91 1.181818 32 1.0 68.209237
7 italian seasoning 0.558167 85 1.090909 44 0.0 67.679391
11 dried parsley flakes 0.504157 77 1.000000 56 0.0 67.323844
19 dry oregano 0.471215 61 0.818182 70 0.0 65.654398
13 dried thyme leaves 0.479252 73 1.000000 56 0.0 65.057667
28 ditalini 0.442861 43 0.727273 76 0.0 61.745445
30 cooked pasta 0.438589 39 0.636364 78 0.0 61.664414
9 dried rosemary leaves 0.519119 81 1.181818 32 0.0 61.583277
15 italian sausages 0.478026 69 1.090909 44 0.0 57.866225
10 italian herb seasoning 0.505750 79 1.454545 20 0.0 57.623780
39 ziti pasta 0.430601 21 0.636364 78 0.0 57.118298
16 italian spices 0.474901 67 1.090909 44 0.0 56.678920
21 italian sausage 0.466552 57 1.000000 56 0.0 56.502212
46 dried tarragon 0.425135 7 0.636364 78 0.0 55.375988
12 quick-cooking barley 0.480627 75 1.454545 20 0.0 54.886246
23 fresh basil leaves 0.456533 53 1.000000 56 1.0 54.520638
24 herbed croutons 0.456307 51 1.000000 56 0.0 53.558379
37 fresh basil leaf 0.432709 25 0.818182 70 1.0 52.559490
22 cooked spaghetti 0.460726 55 1.090909 44 0.0 49.804618
47 cheese ravioli 0.425128 5 0.818182 70 0.0 49.623583
29 sliced mushrooms 0.442224 41 1.000000 56 0.0 49.076471
20 lasagna noodles 0.466840 59 1.181818 32 0.0 47.460510
17 dried italian herb seasoning 0.473978 65 1.636364 14 0.0 47.015955
18 part-skim mozzarella cheese 0.473019 63 2.090909 0 0.0 44.547727
25 white kidney beans 0.455640 49 1.181818 32 0.0 41.382363
43 dried whole thyme 0.428099 13 1.000000 56 0.0 40.650953
33 fresh mushrooms 0.437678 33 1.090909 44 0.0 38.890873
27 fat-free ricotta cheese 0.453549 45 1.818182 8 0.0 32.318725
45 thin spaghetti 0.425472 9 1.090909 44 0.0 31.756889
31 sweet italian sausage 0.438178 37 1.545455 16 0.0 28.504386
32 italian turkey sausage 0.437979 35 1.545455 16 0.0 27.212130
34 dried red pepper flakes 0.436070 31 1.454545 20 0.0 26.086395
42 cannellini beans 0.428287 15 1.181818 32 0.0 24.989998
44 small shell pasta 0.426476 11 1.181818 32 0.0 23.926972
35 reduced-fat ricotta cheese 0.435862 29 1.909091 4 0.0 20.700242
49 herb seasoning mix 0.424261 1 1.363636 28 0.0 19.811613
36 italian style breadcrumbs 0.433100 27 1.909091 4 0.0 19.300259
38 italian-style tomatoes 0.431849 23 1.727273 10 0.0 17.734148
48 italian seasoning mix 0.424475 3 1.454545 20 0.0 14.300350
41 hot italian sausage link 0.430252 17 1.727273 10 0.0 13.946326
40 italian-style diced tomatoes 0.430432 19 2.090909 0 0.0 13.435029

Notice how the "synonym" cases tend to move up to the top now, while the "pairs well with" cases fall into the lower half of the ranked list: fresh mushrooms, italian turkey sausage, cooked spaghetti, white kidney beans, etc.
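As a sanity check on how the related score combines the two percentile columns, here's a minimal stand-in for kglab's root_mean_square (the library's implementation may differ in detail, but the arithmetic reproduces the table values):

```python
import math

def rms (values):
    # square each value, average, then take the square root
    return math.sqrt(sum(v ** 2 for v in values) / len(values))

# "dry basil" scored sim_pct=97 and lev_pct=98 in the table above
print(round(rms([97, 98]), 6))  # 97.501282
```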

df.loc[ (df["related"] >= 50) & (df["term_ratio"] > 0) ]
ingredient similarity sim_pct levenshtein lev_pct term_ratio related
1 dry basil 0.640774 97 0.272727 98 1.0 97.501282
5 fresh basil 0.581067 89 0.363636 96 1.0 92.566193
0 dried basil leaves 0.680018 99 0.636364 78 1.0 89.120705
14 basil 0.478175 71 0.545455 90 1.0 81.058621
4 dried sweet basil leaves 0.588954 91 1.181818 32 1.0 68.209237
23 fresh basil leaves 0.456533 53 1.000000 56 1.0 54.520638
37 fresh basil leaf 0.432709 25 0.818182 70 1.0 52.559490

Exercises

Exercise 1:

Build a report for a human-in-the-loop reviewer, using the rank_related() function while iterating over vocab to make algorithmic suggestions for possible synonyms.

Exercise 2:

How would you make algorithmic suggestions for a reviewer about which ingredients could be related to a query, e.g., using the skos:broader and skos:narrower relations in the SKOS vocabulary to represent hypernyms and hyponyms respectively? This could extend the KG to provide a kind of thesaurus about recipe ingredients.


Last update: 2021-05-09