Note

To run this notebook in JupyterLab, load examples/ex8_0.ipynb

Vector embedding with gensim

Let's make use of deep learning, through a technique called embedding, to analyze the relatedness of the labels used for recipe ingredients.

Among the most closely related ingredients:

  • Some are very close synonyms and should be consolidated to improve data quality
  • Others are ingredients that frequently pair with the query, which is useful for recommendations

On the one hand, this approach is quite helpful for analyzing the NLP annotations that go into a knowledge graph. On the other hand, it can be used along with SKOS or similar vocabularies for ontology-based discovery within the graph, e.g., for an advanced search UI.

Curating annotations

We'll be working with the labels for ingredients that go into our KG. Looking at the raw data, there are many cases where slightly different spellings are being used for the same entity.

As a first step let's define a list of synonyms to substitute, prior to running the vector embedding. This will help produce better quality results.

Note that this kind of work comes under the general heading of curating annotations ... which is what we spend so much time doing in KG work. It's similar to how data preparation is roughly 80% of the workload for data science teams, and for good reason.

SYNONYMS = {
    "pepper": "black pepper",
    "black pepper": "black pepper",

    "egg": "egg",
    "eggs": "egg",

    "vanilla": "vanilla",
    "vanilla extract": "vanilla",

    "flour": "flour",
    "all-purpose flour": "flour",

    "onions": "onion",
    "onion": "onion",

    "carrots": "carrot",
    "carrot": "carrot",

    "potatoes": "potato",
    "potato": "potato",

    "tomatoes": "tomato",
    "fresh tomatoes": "tomato",
    "fresh tomato": "tomato",

    "garlic": "garlic",
    "garlic clove": "garlic",
    "garlic cloves": "garlic",
}

Analyze ingredient labels from 250K recipes

import ast
import csv

MAX_ROW = 250000 # 231638 rows in the full dataset

max_context = 0
min_context = 1000

recipes = []
vocab = set()

with open("../dat/all_ind.csv", "r") as f:
    reader = csv.reader(f)
    next(reader, None) # skip the file header

    for i, row in enumerate(reader):
        id = row[0]
        ind_set = set()

        # substitute synonyms; ast.literal_eval parses the serialized
        # ingredient list more safely than eval
        for ind in set(ast.literal_eval(row[3])):
            if ind in SYNONYMS:
                ind_set.add(SYNONYMS[ind])
            else:
                ind_set.add(ind)

        if len(ind_set) > 1:
            recipes.append([id, ind_set])
            vocab.update(ind_set)

            max_context = max(max_context, len(ind_set))
            min_context = min(min_context, len(ind_set))

        if i > MAX_ROW:
            break

print("max context: {} unique ingredients per recipe".format(max_context))
print("min context: {} unique ingredients per recipe".format(min_context))
print("vocab size", len(list(vocab)))
max context: 43 unique ingredients per recipe
min context: 2 unique ingredients per recipe
vocab size 14931

Since we've performed this data preparation work, let's use pickle to save this larger superset of the recipes dataset to the tmp.pkl file:

import pickle

with open("tmp.pkl", "wb") as f:
    pickle.dump(recipes, f)

recipes[:3]
[['137739',
  {'butter',
   'honey',
   'mexican seasoning',
   'mixed spice',
   'olive oil',
   'salt',
   'winter squash'}],
 ['31490',
  {'cheese',
   'egg',
   'milk',
   'prepared pizza crust',
   'salt and pepper',
   'sausage patty'}],
 ['112140',
  {'cheddar cheese',
   'chili powder',
   'diced tomatoes',
   'ground beef',
   'ground cumin',
   'kidney beans',
   'lettuce',
   'rotel tomatoes',
   'salt',
   'tomato paste',
   'tomato soup',
   'water',
   'yellow onions'}]]

Then we can restore the pickled Python data structure for use later in other use cases. The output above shows the first few entries, to illustrate the format.
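
For instance, restoring the prepared recipes in a later session is just a matter of reading the same tmp.pkl file back:

import pickle

# reload the prepared recipes saved above
with open("tmp.pkl", "rb") as f:
    recipes = pickle.load(f)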

Now reshape this data into a vector of vectors of ingredients per recipe, to use for training a word2vec vector embedding model:

vectors = [list(ind_set) for id, ind_set in recipes]

vectors[:3]
[['mexican seasoning',
  'mixed spice',
  'salt',
  'honey',
  'winter squash',
  'butter',
  'olive oil'],
 ['milk',
  'prepared pizza crust',
  'egg',
  'cheese',
  'sausage patty',
  'salt and pepper'],
 ['ground cumin',
  'water',
  'tomato soup',
  'diced tomatoes',
  'yellow onions',
  'ground beef',
  'lettuce',
  'salt',
  'rotel tomatoes',
  'tomato paste',
  'chili powder',
  'kidney beans',
  'cheddar cheese']]

We'll use the Word2Vec implementation in the gensim library (i.e., deep learning) to train an embedding model. This approach tends to work best when the training data has at least 100K rows.

Let's also show how to serialize the word2vec results, saving them to the tmp.w2v file so they can be restored later for other use cases.

NB: there is work in progress to replace gensim with pytorch.

import gensim

MIN_COUNT = 2
model_path = "tmp.w2v"

model = gensim.models.Word2Vec(vectors, min_count=MIN_COUNT, window=max_context)
model.save(model_path)
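
Restoring the serialized model later is symmetric, using the load() class method (a minimal sketch, assuming the tmp.w2v file saved above):

import gensim

# reload the trained embedding model
model = gensim.models.Word2Vec.load("tmp.w2v")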

The get_related() function takes any ingredient as input, using the embedding model to find the most similar other ingredients, while also calculating the Levenshtein edit distances (string similarity) among these labels. Then it calculates percentiles for both metrics in numpy and returns the results as a pandas DataFrame.

import numpy as np
import pandas as pd
import pylev

def get_related (model, query, n=20, granularity=100):
    """return a DataFrame of the closely related items"""
    try:
        bins = np.linspace(0, 1, num=granularity, endpoint=True)

        # rank the nearest neighbors from the embedding model
        v = sorted(
            model.wv.most_similar(positive=[query], topn=n),
            key=lambda x: x[1],
            reverse=True
        )

        df = pd.DataFrame(v, columns=["ingredient", "similarity"])

        # percentile of each vector similarity score
        s = df["similarity"]
        quantiles = s.quantile(bins, interpolation="nearest")
        df["sim_pct"] = np.digitize(s, quantiles) - 1

        # Levenshtein distance, normalized by the query length, then
        # inverted so that higher percentiles mean closer spellings
        df["levenshtein"] = [ pylev.levenshtein(d, query) / len(query) for d in df["ingredient"] ]
        s = df["levenshtein"]
        quantiles = s.quantile(bins, interpolation="nearest")
        df["lev_pct"] = granularity - np.digitize(s, quantiles)

        return df
    except KeyError:
        # the query term is out of vocabulary
        return pd.DataFrame(columns=["ingredient", "similarity", "sim_pct", "levenshtein", "lev_pct"])

Let's try this with dried basil as the ingredient to query, and review the top 50 most similar other ingredients returned as the DataFrame df:

pd.set_option("display.max_rows", None)

df = get_related(model, "dried basil", n=50)
df
ingredient similarity sim_pct levenshtein lev_pct
0 dried basil leaves 0.711843 99 0.636364 76
1 dried rosemary 0.646414 97 0.636364 76
2 dry basil 0.629536 95 0.272727 98
3 dried italian seasoning 0.613078 93 1.272727 28
4 fresh basil 0.596147 91 0.363636 94
5 dried marjoram 0.588682 89 0.636364 76
6 italian herb seasoning 0.565614 87 1.454545 16
7 italian seasoning 0.554528 85 1.090909 46
8 dried parsley 0.552430 83 0.454545 90
9 dried sweet basil leaves 0.543795 81 1.181818 36
10 dried rosemary leaves 0.522495 79 1.181818 36
11 dried whole thyme 0.510558 77 1.000000 58
12 dried parsley flakes 0.506447 75 1.000000 58
13 basil 0.483062 73 0.545455 88
14 dried thyme leaves 0.473144 71 1.000000 58
15 mild italian sausage 0.469430 69 1.363636 24
16 italian-style tomatoes 0.461798 67 1.727273 10
17 part-skim mozzarella cheese 0.461220 65 2.090909 0
18 italian spices 0.460701 63 1.090909 46
19 cooked pearl barley 0.459609 61 1.272727 28
20 white kidney beans 0.453172 59 1.181818 36
21 dried oregano leaves 0.452838 57 1.090909 46
22 reduced-fat mozzarella cheese 0.452544 55 2.090909 0
23 dry oregano 0.452370 53 0.818182 70
24 dried summer savory 0.451501 51 1.090909 46
25 parmesan rind 0.450826 49 0.909091 68
26 quick-cooking barley 0.445577 47 1.454545 16
27 canned tomato sauce 0.443889 45 1.272727 28
28 dried thyme 0.443561 43 0.454545 90
29 cooked pasta 0.437754 41 0.636364 76
30 italian-style tomato paste 0.435926 39 1.909091 6
31 italian-style diced tomatoes 0.435351 37 2.090909 0
32 italian sausage 0.434764 35 1.000000 58
33 orzo pasta 0.434725 33 0.636364 76
34 button mushroom 0.432255 31 1.181818 36
35 italian sausages 0.430821 29 1.090909 46
36 dried red pepper flakes 0.429260 27 1.454545 16
37 lasagna noodles 0.429071 25 1.181818 36
38 italian seasoning mix 0.428897 23 1.454545 16
39 dried sage 0.428746 21 0.363636 94
40 sweet italian sausage 0.427985 19 1.545455 14
41 hot pepper flakes 0.423853 17 1.272727 28
42 rubbed sage 0.423643 15 0.727273 72
43 italian style breadcrumbs 0.422995 13 1.909091 6
44 fresh basil leaves 0.422796 11 1.000000 58
45 fresh mushrooms 0.422383 9 1.090909 46
46 t-bone type lamb chops 0.420398 7 1.727273 10
47 dry lentils 0.420370 5 0.727273 72
48 instant minced garlic 0.419762 3 1.363636 24
49 dried tarragon 0.417663 1 0.636364 76

Note how some of the most similar items, based on vector embedding, are synonyms or special forms of our query dried basil ingredient: dried basil leaves, dry basil, dried sweet basil leaves, etc. These also tend to rank high on the Levenshtein measure, since their spellings are close to the query.

Let's plot the similarity measures:

import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use("ggplot")

df["similarity"].plot(alpha=0.75, rot=0)
plt.show()

[plot: similarity scores for the 50 related ingredients, in descending rank order]

Notice the inflection points at approximately 0.56 and again at 0.47 in that plot. We could use some statistical techniques (e.g., clustering, as sketched below) to segment the similarities into a few groups:

  • highest similarity – potential synonyms for the query
  • mid-range similarity – potential hypernyms and hyponyms for the query
  • long-tail similarity – other ingredients that pair well with the query

In this example, below a threshold of the 75th percentile for vector embedding similarity, the related ingredients are less about being synonyms and more about other foods that pair well with basil.
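
Here's a minimal sketch of that segmentation using k-means clustering from scikit-learn (an assumption on our part; any 1-D clustering or simple thresholding would serve):

from sklearn.cluster import KMeans

# segment the similarity scores into three groups, roughly: potential
# synonyms, hypernyms/hyponyms, and "pairs well with" ingredients
km = KMeans(n_clusters=3, n_init=10, random_state=1)
df["group"] = km.fit_predict(df[["similarity"]])

# show the similarity range covered by each group
df.groupby("group")["similarity"].agg(["min", "max", "count"])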

Let's define another function rank_related() which ranks the related ingredients based on a combination of these two metrics. This uses a cheap approximation of a Pareto archive for the ranking, which comes in handy for recommender systems and custom search applications that must combine multiple ranking metrics:

from kglab import root_mean_square

def rank_related (df):
    df2 = df.copy(deep=True)
    # combine the two percentile metrics via their root mean square
    df2["related"] = df2.apply(lambda row: root_mean_square([row["sim_pct"], row["lev_pct"]]), axis=1)
    return df2.sort_values(by=["related"], ascending=False)

rank_related(df)
ingredient similarity sim_pct levenshtein lev_pct related
2 dry basil 0.629536 95 0.272727 98 96.511657
4 fresh basil 0.596147 91 0.363636 94 92.512161
0 dried basil leaves 0.711843 99 0.636364 76 88.252479
1 dried rosemary 0.646414 97 0.636364 76 87.134953
8 dried parsley 0.552430 83 0.454545 90 86.570780
5 dried marjoram 0.588682 89 0.636364 76 82.755664
13 basil 0.483062 73 0.545455 88 80.848624
28 dried thyme 0.443561 43 0.454545 90 70.530135
3 dried italian seasoning 0.613078 93 1.272727 28 68.676779
7 italian seasoning 0.554528 85 1.090909 46 68.341056
11 dried whole thyme 0.510558 77 1.000000 58 68.165240
39 dried sage 0.428746 21 0.363636 94 68.106534
12 dried parsley flakes 0.506447 75 1.000000 58 67.041032
14 dried thyme leaves 0.473144 71 1.000000 58 64.826692
9 dried sweet basil leaves 0.543795 81 1.181818 36 62.677747
6 italian herb seasoning 0.565614 87 1.454545 16 62.549980
23 dry oregano 0.452370 53 0.818182 70 62.084620
10 dried rosemary leaves 0.522495 79 1.181818 36 61.388110
29 cooked pasta 0.437754 41 0.636364 76 61.061444
25 parmesan rind 0.450826 49 0.909091 68 59.266348
33 orzo pasta 0.434725 33 0.636364 76 58.587541
18 italian spices 0.460701 63 1.090909 46 55.158861
49 dried tarragon 0.417663 1 0.636364 76 53.744767
42 rubbed sage 0.423643 15 0.727273 72 52.004807
21 dried oregano leaves 0.452838 57 1.090909 46 51.792857
15 mild italian sausage 0.469430 69 1.363636 24 51.657526
47 dry lentils 0.420370 5 0.727273 72 51.034302
20 white kidney beans 0.453172 59 1.181818 36 48.872283
24 dried summer savory 0.451501 51 1.090909 46 48.564390
32 italian sausage 0.434764 35 1.000000 58 47.900939
16 italian-style tomatoes 0.461798 67 1.727273 10 47.900939
19 cooked pearl barley 0.459609 61 1.272727 28 47.460510
17 part-skim mozzarella cheese 0.461220 65 2.090909 0 45.961941
44 fresh basil leaves 0.422796 11 1.000000 58 41.743263
22 reduced-fat mozzarella cheese 0.452544 55 2.090909 0 38.890873
35 italian sausages 0.430821 29 1.090909 46 38.451268
27 canned tomato sauce 0.443889 45 1.272727 28 37.476659
26 quick-cooking barley 0.445577 47 1.454545 16 35.106979
34 button mushroom 0.432255 31 1.181818 36 33.593154
45 fresh mushrooms 0.422383 9 1.090909 46 33.143627
37 lasagna noodles 0.429071 25 1.181818 36 30.991934
30 italian-style tomato paste 0.435926 39 1.909091 6 27.901613
31 italian-style diced tomatoes 0.435351 37 2.090909 0 26.162951
41 hot pepper flakes 0.423853 17 1.272727 28 23.162470
36 dried red pepper flakes 0.429260 27 1.454545 16 22.192341
38 italian seasoning mix 0.428897 23 1.454545 16 19.811613
48 instant minced garlic 0.419762 3 1.363636 24 17.102631
40 sweet italian sausage 0.427985 19 1.545455 14 16.688319
43 italian style breadcrumbs 0.422995 13 1.909091 6 10.124228
46 t-bone type lamb chops 0.420398 7 1.727273 10 8.631338

Notice how the "synonym" cases tend to move up toward the top now, while the "pairs well with" cases fall into the lower half of the ranked list: fresh mushrooms, button mushroom, lasagna noodles, white kidney beans, etc.


Exercises

Exercise 1:

Build a report for a human-in-the-loop reviewer, using the rank_related() function while iterating over vocab to make algorithmic suggestions for possible synonyms.
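
As a starting point, here is a possible scaffold (the 90.0 threshold is an assumption, to be tuned during review):

suggestions = []

# collect the top-ranked related label for each ingredient, flagging
# likely synonyms for a human reviewer
for ind in sorted(vocab):
    df_rel = get_related(model, ind, n=10)

    if not df_rel.empty:
        top = rank_related(df_rel).iloc[0]

        if top["related"] > 90.0:
            suggestions.append((ind, top["ingredient"], top["related"]))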

Exercise 2:

How would you make algorithmic suggestions for a reviewer about which ingredients could be related to a query, e.g., using the skos:broader and skos:narrower relations in the skos vocabulary to represent hypernyms and hyponyms respectively? This could extend the KG to provide a kind of thesaurus about recipe ingredients.
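
For reference, here's a minimal sketch (using rdflib directly, with a hypothetical ind: namespace) of how such relations could be represented in the graph:

from rdflib import Graph, Namespace
from rdflib.namespace import SKOS

IND = Namespace("https://example.com/ind/")  # hypothetical namespace

g = Graph()
g.bind("skos", SKOS)
g.bind("ind", IND)

# "basil" as a hypernym of "dried basil", and vice versa
g.add((IND["basil"], SKOS.narrower, IND["dried_basil"]))
g.add((IND["dried_basil"], SKOS.broader, IND["basil"]))

print(g.serialize(format="turtle"))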

