Note
To run this notebook in JupyterLab, load examples/explain_algo.ipynb
Explain PyTextRank: the algorithm
Let's look at the TextRank algorithm used to build a graph from a raw text, and then from that extract the top-ranked phrases. This work is based on "TextRank: Bringing Order into Text", Rada Mihalcea, Paul Tarau, Empirical Methods in Natural Language Processing (2004).
First we perform some basic housekeeping for Jupyter, then load spaCy with a language model for English...
import warnings
warnings.filterwarnings("ignore")
import spacy
nlp = spacy.load("en_core_web_sm")
Now, to get started, we'll create some text to use.
#text = "When Ada was twelve years old, this future 'Lady Fairy', as Charles Babbage affectionately called her, decided she wanted to fly. Ada Byron went about the project methodically, thoughtfully, with imagination and passion. Her first step, in February 1828, was to construct wings. She investigated different material and sizes. She considered various materials for the wings: paper, oilsilk, wires, and feathers. She examined the anatomy of birds to determine the right proportion between the wings and the body. She decided to write a book, Flyology, illustrating, with plates, some of her findings. She decided what equipment she would need; for example, a compass, to 'cut across the country by the most direct road', so that she could surmount mountains, rivers, and valleys. Her final step was to integrate steam with the 'art of flying."
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."
doc = nlp(text)
How many sentences are in the parsed document and where are their boundaries?
from icecream import ic
for sent in doc.sents:
    ic(sent.start, sent.end)
ic| sent.start: 0, sent.end: 13
ic| sent.start: 13, sent.end: 33
ic| sent.start: 33, sent.end: 61
ic| sent.start: 61, sent.end: 91
What are the raw noun chunks in the parsed document?
for chunk in doc.noun_chunks:
    ic(chunk.text)
ic| chunk.text: 'Compatibility'
ic| chunk.text: 'systems'
ic| chunk.text: 'linear constraints'
ic| chunk.text: 'the set'
ic| chunk.text: 'natural numbers'
ic| chunk.text: 'Criteria'
ic| chunk.text: 'compatibility'
ic| chunk.text: 'a system'
ic| chunk.text: 'linear Diophantine equations'
ic| chunk.text: 'strict inequations'
ic| chunk.text: 'nonstrict inequations'
ic| chunk.text: 'Upper bounds'
ic| chunk.text: 'components'
ic| chunk.text: 'a minimal set'
ic| chunk.text: 'solutions'
ic| chunk.text: 'algorithms'
ic| chunk.text: 'construction'
ic| chunk.text: 'minimal generating sets'
ic| chunk.text: 'solutions'
ic| chunk.text: 'all types'
ic| chunk.text: 'systems'
ic| chunk.text: 'These criteria'
ic| chunk.text: 'the corresponding algorithms'
ic| chunk.text: 'a minimal supporting set'
ic| chunk.text: 'solutions'
ic| chunk.text: 'all the considered types systems'
ic| chunk.text: 'systems'
ic| chunk.text: 'mixed types'
Also, does spaCy detect any named entities?
for ent in doc.ents:
    ic(ent.text, ent.label_, ent.start, ent.end)
ic| ent.text: 'Diophantine'
ent.label_: 'GPE'
ent.start: 21
ent.end: 22
Given those details about the parsed document, next we use NetworkX to manage an in-memory graph...
import networkx as nx

def increment_edge (graph, node0, node1):
    ic(node0, node1)

    if graph.has_edge(node0, node1):
        graph[node0][node1]["weight"] += 1.0
    else:
        graph.add_edge(node0, node1, weight=1.0)
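Before wiring that into the full pipeline, a quick sanity check on a throwaway graph shows the weight accumulating across repeated calls (the ic call inside the function traces each invocation):

# sanity check: the first call creates the edge, the second bumps its weight
g = nx.Graph()
increment_edge(g, "a", "b")
increment_edge(g, "a", "b")

g["a"]["b"]["weight"]  # 2.0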
Then construct a graph, sentence by sentence, based on the spaCy part-of-speech tags, keeping only adjectives, nouns, proper nouns, and verbs:
POS_KEPT = ["ADJ", "NOUN", "PROPN", "VERB"]

def link_sentence (doc, sent, lemma_graph, seen_lemma):
    visited_tokens = []
    visited_nodes = []

    for i in range(sent.start, sent.end):
        token = doc[i]

        if token.pos_ in POS_KEPT:
            key = (token.lemma_, token.pos_)

            if key not in seen_lemma:
                seen_lemma[key] = set([token.i])
            else:
                seen_lemma[key].add(token.i)

            node_id = list(seen_lemma.keys()).index(key)

            if node_id not in lemma_graph:
                lemma_graph.add_node(node_id)

            ic(visited_tokens, visited_nodes)
            ic(list(range(len(visited_tokens) - 1, -1, -1)))

            for prev_token in range(len(visited_tokens) - 1, -1, -1):
                ic(prev_token, (token.i - visited_tokens[prev_token]))

                if (token.i - visited_tokens[prev_token]) <= 3:
                    increment_edge(lemma_graph, node_id, visited_nodes[prev_token])
                else:
                    break

            ic(token.i, token.text, token.lemma_, token.pos_, visited_tokens, visited_nodes)

            visited_tokens.append(token.i)
            visited_nodes.append(node_id)
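To make the window logic concrete: each kept token links back to any previously kept token within three positions of it, scanning backwards and stopping at the first larger gap (which is equivalent to filtering all pairs, since token positions increase within a sentence). In the first sentence the kept tokens sit at positions 0, 2, 4, and 5, so these pairs get edges:

# toy illustration of the 3-token co-occurrence window, using the kept
# token positions from the first sentence
positions = [0, 2, 4, 5]

pairs = [
    (i, j)
    for idx, i in enumerate(positions)
    for j in positions[:idx]
    if (i - j) <= 3
]

pairs  # [(2, 0), (4, 2), (5, 2), (5, 4)]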
Now iterate through the sentences to construct the lemma graph...
lemma_graph = nx.Graph()
seen_lemma = {}

for sent in doc.sents:
    link_sentence(doc, sent, lemma_graph, seen_lemma)
    #break # only test one sentence
ic| visited_tokens: [], visited_nodes: []
ic| list(range(len(visited_tokens) - 1, -1, -1)): []
ic| token.i: 0
token.text: 'Compatibility'
token.lemma_: 'compatibility'
token.pos_: 'NOUN'
visited_tokens: []
visited_nodes: []
ic| visited_tokens: [0], visited_nodes: [0]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [0]
ic| prev_token: 0, token.i - visited_tokens[prev_token]: 2
ic| node0: 1, node1: 0
ic| token.i: 2
token.text: 'systems'
token.lemma_: 'system'
token.pos_: 'NOUN'
visited_tokens: [0]
visited_nodes: [0]
ic| visited_tokens: [0, 2], visited_nodes: [0, 1]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [1, 0]
ic| prev_token: 1, token.i - visited_tokens[prev_token]: 2
ic| node0: 2, node1: 1
ic| prev_token: 0, token.i - visited_tokens[prev_token]: 4
ic| token.i: 4
token.text: 'linear'
token.lemma_: 'linear'
token.pos_: 'ADJ'
visited_tokens: [0, 2]
visited_nodes: [0, 1]
ic| visited_tokens: [0, 2, 4], visited_nodes: [0, 1, 2]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [2, 1, 0]
ic| prev_token: 2, token.i - visited_tokens[prev_token]: 1
ic| node0: 3, node1: 2
ic| prev_token: 1, token.i - visited_tokens[prev_token]: 3
ic| node0: 3, node1: 1
ic| prev_token: 0, token.i - visited_tokens[prev_token]: 5
ic| token.i: 5
token.text: 'constraints'
token.lemma_: 'constraint'
token.pos_: 'NOUN'
visited_tokens: [0, 2, 4]
visited_nodes: [0, 1, 2]
ic| visited_tokens: [0, 2, 4, 5], visited_nodes: [0, 1, 2, 3]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [3, 2, 1, 0]
ic| prev_token: 3, token.i - visited_tokens[prev_token]: 3
ic| node0: 4, node1: 3
ic| prev_token: 2, token.i - visited_tokens[prev_token]: 4
ic| token.i: 8
token.text: 'set'
token.lemma_: 'set'
token.pos_: 'NOUN'
visited_tokens: [0, 2, 4, 5]
visited_nodes: [0, 1, 2, 3]
ic| visited_tokens: [0, 2, 4, 5, 8], visited_nodes: [0, 1, 2, 3, 4]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [4, 3, 2, 1, 0]
ic| prev_token: 4, token.i - visited_tokens[prev_token]: 2
ic| node0: 5, node1: 4
ic| prev_token: 3, token.i - visited_tokens[prev_token]: 5
ic| token.i: 10
token.text: 'natural'
token.lemma_: 'natural'
token.pos_: 'ADJ'
visited_tokens: [0, 2, 4, 5, 8]
visited_nodes: [0, 1, 2, 3, 4]
ic| visited_tokens: [0, 2, 4, 5, 8, 10]
visited_nodes: [0, 1, 2, 3, 4, 5]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [5, 4, 3, 2, 1, 0]
ic| prev_token: 5, token.i - visited_tokens[prev_token]: 1
ic| node0: 6, node1: 5
ic| prev_token: 4, token.i - visited_tokens[prev_token]: 3
ic| node0: 6, node1: 4
ic| prev_token: 3, token.i - visited_tokens[prev_token]: 6
ic| token.i: 11
token.text: 'numbers'
token.lemma_: 'number'
token.pos_: 'NOUN'
visited_tokens: [0, 2, 4, 5, 8, 10]
visited_nodes: [0, 1, 2, 3, 4, 5]
ic| visited_tokens: [], visited_nodes: []
ic| list(range(len(visited_tokens) - 1, -1, -1)): []
ic| token.i: 13
token.text: 'Criteria'
token.lemma_: 'criterion'
token.pos_: 'NOUN'
visited_tokens: []
visited_nodes: []
ic| visited_tokens: [13], visited_nodes: [7]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [0]
ic| prev_token: 0, token.i - visited_tokens[prev_token]: 2
ic| node0: 0, node1: 7
ic| token.i: 15
token.text: 'compatibility'
token.lemma_: 'compatibility'
token.pos_: 'NOUN'
visited_tokens: [13]
visited_nodes: [7]
ic| visited_tokens: [13, 15], visited_nodes: [7, 0]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [1, 0]
ic| prev_token: 1, token.i - visited_tokens[prev_token]: 3
ic| node0: 1, node1: 0
ic| prev_token: 0, token.i - visited_tokens[prev_token]: 5
ic| token.i: 18
token.text: 'system'
token.lemma_: 'system'
token.pos_: 'NOUN'
visited_tokens: [13, 15]
visited_nodes: [7, 0]
ic| visited_tokens: [13, 15, 18], visited_nodes: [7, 0, 1]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [2, 1, 0]
ic| prev_token: 2, token.i - visited_tokens[prev_token]: 2
ic| node0: 2, node1: 1
ic| prev_token: 1, token.i - visited_tokens[prev_token]: 5
ic| token.i: 20
token.text: 'linear'
token.lemma_: 'linear'
token.pos_: 'ADJ'
visited_tokens: [13, 15, 18]
visited_nodes: [7, 0, 1]
ic| visited_tokens: [13, 15, 18, 20], visited_nodes: [7, 0, 1, 2]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [3, 2, 1, 0]
ic| prev_token: 3, token.i - visited_tokens[prev_token]: 1
ic| node0: 8, node1: 2
ic| prev_token: 2, token.i - visited_tokens[prev_token]: 3
ic| node0: 8, node1: 1
ic| prev_token: 1, token.i - visited_tokens[prev_token]: 6
ic| token.i: 21
token.text: 'Diophantine'
token.lemma_: 'Diophantine'
token.pos_: 'PROPN'
visited_tokens: [13, 15, 18, 20]
visited_nodes: [7, 0, 1, 2]
ic| visited_tokens: [13, 15, 18, 20, 21]
visited_nodes: [7, 0, 1, 2, 8]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [4, 3, 2, 1, 0]
ic| prev_token: 4, token.i - visited_tokens[prev_token]: 1
ic| node0: 9, node1: 8
ic| prev_token: 3, token.i - visited_tokens[prev_token]: 2
ic| node0: 9, node1: 2
ic| prev_token: 2, token.i - visited_tokens[prev_token]: 4
ic| token.i: 22
token.text: 'equations'
token.lemma_: 'equation'
token.pos_: 'NOUN'
visited_tokens: [13, 15, 18, 20, 21]
visited_nodes: [7, 0, 1, 2, 8]
ic| visited_tokens: [13, 15, 18, 20, 21, 22]
visited_nodes: [7, 0, 1, 2, 8, 9]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [5, 4, 3, 2, 1, 0]
ic| prev_token: 5, token.i - visited_tokens[prev_token]: 2
ic| node0: 10, node1: 9
ic| prev_token: 4, token.i - visited_tokens[prev_token]: 3
ic| node0: 10, node1: 8
ic| prev_token: 3, token.i - visited_tokens[prev_token]: 4
ic| token.i: 24
token.text: 'strict'
token.lemma_: 'strict'
token.pos_: 'ADJ'
visited_tokens: [13, 15, 18, 20, 21, 22]
visited_nodes: [7, 0, 1, 2, 8, 9]
ic| visited_tokens: [13, 15, 18, 20, 21, 22, 24]
visited_nodes: [7, 0, 1, 2, 8, 9, 10]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 6, token.i - visited_tokens[prev_token]: 1
ic| node0: 11, node1: 10
ic| prev_token: 5, token.i - visited_tokens[prev_token]: 3
ic| node0: 11, node1: 9
ic| prev_token: 4, token.i - visited_tokens[prev_token]: 4
ic| token.i: 25
token.text: 'inequations'
token.lemma_: 'inequation'
token.pos_: 'NOUN'
visited_tokens: [13, 15, 18, 20, 21, 22, 24]
visited_nodes: [7, 0, 1, 2, 8, 9, 10]
ic| visited_tokens: [13, 15, 18, 20, 21, 22, 24, 25]
visited_nodes: [7, 0, 1, 2, 8, 9, 10, 11]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 7, token.i - visited_tokens[prev_token]: 3
ic| node0: 12, node1: 11
ic| prev_token: 6, token.i - visited_tokens[prev_token]: 4
ic| token.i: 28
token.text: 'nonstrict'
token.lemma_: 'nonstrict'
token.pos_: 'ADJ'
visited_tokens: [13, 15, 18, 20, 21, 22, 24, 25]
visited_nodes: [7, 0, 1, 2, 8, 9, 10, 11]
ic| visited_tokens: [13, 15, 18, 20, 21, 22, 24, 25, 28]
visited_nodes: [7, 0, 1, 2, 8, 9, 10, 11, 12]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [8, 7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 8, token.i - visited_tokens[prev_token]: 1
ic| node0: 11, node1: 12
ic| prev_token: 7, token.i - visited_tokens[prev_token]: 4
ic| token.i: 29
token.text: 'inequations'
token.lemma_: 'inequation'
token.pos_: 'NOUN'
visited_tokens: [13, 15, 18, 20, 21, 22, 24, 25, 28]
visited_nodes: [7, 0, 1, 2, 8, 9, 10, 11, 12]
ic| visited_tokens: [13, 15, 18, 20, 21, 22, 24, 25, 28, 29]
visited_nodes: [7, 0, 1, 2, 8, 9, 10, 11, 12, 11]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 9, token.i - visited_tokens[prev_token]: 2
ic| node0: 13, node1: 11
ic| prev_token: 8, token.i - visited_tokens[prev_token]: 3
ic| node0: 13, node1: 12
ic| prev_token: 7, token.i - visited_tokens[prev_token]: 6
ic| token.i: 31
token.text: 'considered'
token.lemma_: 'consider'
token.pos_: 'VERB'
visited_tokens: [13, 15, 18, 20, 21, 22, 24, 25, 28, 29]
visited_nodes: [7, 0, 1, 2, 8, 9, 10, 11, 12, 11]
ic| visited_tokens: [], visited_nodes: []
ic| list(range(len(visited_tokens) - 1, -1, -1)): []
ic| token.i: 33
token.text: 'Upper'
token.lemma_: 'Upper'
token.pos_: 'PROPN'
visited_tokens: []
visited_nodes: []
ic| visited_tokens: [33], visited_nodes: [14]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [0]
ic| prev_token: 0, token.i - visited_tokens[prev_token]: 1
ic| node0: 15, node1: 14
ic| token.i: 34
token.text: 'bounds'
token.lemma_: 'bound'
token.pos_: 'NOUN'
visited_tokens: [33]
visited_nodes: [14]
ic| visited_tokens: [33, 34], visited_nodes: [14, 15]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [1, 0]
ic| prev_token: 1, token.i - visited_tokens[prev_token]: 2
ic| node0: 16, node1: 15
ic| prev_token: 0, token.i - visited_tokens[prev_token]: 3
ic| node0: 16, node1: 14
ic| token.i: 36
token.text: 'components'
token.lemma_: 'component'
token.pos_: 'NOUN'
visited_tokens: [33, 34]
visited_nodes: [14, 15]
ic| visited_tokens: [33, 34, 36], visited_nodes: [14, 15, 16]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [2, 1, 0]
ic| prev_token: 2, token.i - visited_tokens[prev_token]: 3
ic| node0: 17, node1: 16
ic| prev_token: 1, token.i - visited_tokens[prev_token]: 5
ic| token.i: 39
token.text: 'minimal'
token.lemma_: 'minimal'
token.pos_: 'ADJ'
visited_tokens: [33, 34, 36]
visited_nodes: [14, 15, 16]
ic| visited_tokens: [33, 34, 36, 39], visited_nodes: [14, 15, 16, 17]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [3, 2, 1, 0]
ic| prev_token: 3, token.i - visited_tokens[prev_token]: 1
ic| node0: 4, node1: 17
ic| prev_token: 2, token.i - visited_tokens[prev_token]: 4
ic| token.i: 40
token.text: 'set'
token.lemma_: 'set'
token.pos_: 'NOUN'
visited_tokens: [33, 34, 36, 39]
visited_nodes: [14, 15, 16, 17]
ic| visited_tokens: [33, 34, 36, 39, 40]
visited_nodes: [14, 15, 16, 17, 4]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [4, 3, 2, 1, 0]
ic| prev_token: 4, token.i - visited_tokens[prev_token]: 2
ic| node0: 18, node1: 4
ic| prev_token: 3, token.i - visited_tokens[prev_token]: 3
ic| node0: 18, node1: 17
ic| prev_token: 2, token.i - visited_tokens[prev_token]: 6
ic| token.i: 42
token.text: 'solutions'
token.lemma_: 'solution'
token.pos_: 'NOUN'
visited_tokens: [33, 34, 36, 39, 40]
visited_nodes: [14, 15, 16, 17, 4]
ic| visited_tokens: [33, 34, 36, 39, 40, 42]
visited_nodes: [14, 15, 16, 17, 4, 18]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [5, 4, 3, 2, 1, 0]
ic| prev_token: 5, token.i - visited_tokens[prev_token]: 2
ic| node0: 19, node1: 18
ic| prev_token: 4, token.i - visited_tokens[prev_token]: 4
ic| token.i: 44
token.text: 'algorithms'
token.lemma_: 'algorithm'
token.pos_: 'NOUN'
visited_tokens: [33, 34, 36, 39, 40, 42]
visited_nodes: [14, 15, 16, 17, 4, 18]
ic| visited_tokens: [33, 34, 36, 39, 40, 42, 44]
visited_nodes: [14, 15, 16, 17, 4, 18, 19]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 6, token.i - visited_tokens[prev_token]: 2
ic| node0: 20, node1: 19
ic| prev_token: 5, token.i - visited_tokens[prev_token]: 4
ic| token.i: 46
token.text: 'construction'
token.lemma_: 'construction'
token.pos_: 'NOUN'
visited_tokens: [33, 34, 36, 39, 40, 42, 44]
visited_nodes: [14, 15, 16, 17, 4, 18, 19]
ic| visited_tokens: [33, 34, 36, 39, 40, 42, 44, 46]
visited_nodes: [14, 15, 16, 17, 4, 18, 19, 20]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 7, token.i - visited_tokens[prev_token]: 2
ic| node0: 17, node1: 20
ic| prev_token: 6, token.i - visited_tokens[prev_token]: 4
ic| token.i: 48
token.text: 'minimal'
token.lemma_: 'minimal'
token.pos_: 'ADJ'
visited_tokens: [33, 34, 36, 39, 40, 42, 44, 46]
visited_nodes: [14, 15, 16, 17, 4, 18, 19, 20]
ic| visited_tokens: [33, 34, 36, 39, 40, 42, 44, 46, 48]
visited_nodes: [14, 15, 16, 17, 4, 18, 19, 20, 17]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [8, 7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 8, token.i - visited_tokens[prev_token]: 1
ic| node0: 21, node1: 17
ic| prev_token: 7, token.i - visited_tokens[prev_token]: 3
ic| node0: 21, node1: 20
ic| prev_token: 6, token.i - visited_tokens[prev_token]: 5
ic| token.i: 49
token.text: 'generating'
token.lemma_: 'generating'
token.pos_: 'NOUN'
visited_tokens: [33, 34, 36, 39, 40, 42, 44, 46, 48]
visited_nodes: [14, 15, 16, 17, 4, 18, 19, 20, 17]
ic| visited_tokens: [33, 34, 36, 39, 40, 42, 44, 46, 48, 49]
visited_nodes: [14, 15, 16, 17, 4, 18, 19, 20, 17, 21]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 9, token.i - visited_tokens[prev_token]: 1
ic| node0: 4, node1: 21
ic| prev_token: 8, token.i - visited_tokens[prev_token]: 2
ic| node0: 4, node1: 17
ic| prev_token: 7, token.i - visited_tokens[prev_token]: 4
ic| token.i: 50
token.text: 'sets'
token.lemma_: 'set'
token.pos_: 'NOUN'
visited_tokens: [33, 34, 36, 39, 40, 42, 44, 46, 48, 49]
visited_nodes: [14, 15, 16, 17, 4, 18, 19, 20, 17, 21]
ic| visited_tokens: [33, 34, 36, 39, 40, 42, 44, 46, 48, 49, 50]
visited_nodes: [14, 15, 16, 17, 4, 18, 19, 20, 17, 21, 4]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 10, token.i - visited_tokens[prev_token]: 2
ic| node0: 18, node1: 4
ic| prev_token: 9, token.i - visited_tokens[prev_token]: 3
ic| node0: 18, node1: 21
ic| prev_token: 8, token.i - visited_tokens[prev_token]: 4
ic| token.i: 52
token.text: 'solutions'
token.lemma_: 'solution'
token.pos_: 'NOUN'
visited_tokens: [33, 34, 36, 39, 40, 42, 44, 46, 48, 49, 50]
visited_nodes: [14, 15, 16, 17, 4, 18, 19, 20, 17, 21, 4]
ic| visited_tokens: [33, 34, 36, 39, 40, 42, 44, 46, 48, 49, 50, 52]
visited_nodes: [14, 15, 16, 17, 4, 18, 19, 20, 17, 21, 4, 18]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 11, token.i - visited_tokens[prev_token]: 3
ic| node0: 22, node1: 18
ic| prev_token: 10, token.i - visited_tokens[prev_token]: 5
ic| token.i: 55
token.text: 'types'
token.lemma_: 'type'
token.pos_: 'NOUN'
visited_tokens: [33, 34, 36, 39, 40, 42, 44, 46, 48, 49, 50, 52]
visited_nodes: [14, 15, 16, 17, 4, 18, 19, 20, 17, 21, 4, 18]
ic| visited_tokens: [33, 34, 36, 39, 40, 42, 44, 46, 48, 49, 50, 52, 55]
visited_nodes: [14, 15, 16, 17, 4, 18, 19, 20, 17, 21, 4, 18, 22]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 12, token.i - visited_tokens[prev_token]: 2
ic| node0: 1, node1: 22
ic| prev_token: 11, token.i - visited_tokens[prev_token]: 5
ic| token.i: 57
token.text: 'systems'
token.lemma_: 'system'
token.pos_: 'NOUN'
visited_tokens: [33, 34, 36, 39, 40, 42, 44, 46, 48, 49, 50, 52, 55]
visited_nodes: [14, 15, 16, 17, 4, 18, 19, 20, 17, 21, 4, 18, 22]
ic| visited_tokens: [33, 34, 36, 39, 40, 42, 44, 46, 48, 49, 50, 52, 55, 57]
visited_nodes: [14, 15, 16, 17, 4, 18, 19, 20, 17, 21, 4, 18, 22, 1]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 13, token.i - visited_tokens[prev_token]: 2
ic| node0: 23, node1: 1
ic| prev_token: 12, token.i - visited_tokens[prev_token]: 4
ic| token.i: 59
token.text: 'given'
token.lemma_: 'give'
token.pos_: 'VERB'
visited_tokens: [33, 34, 36, 39, 40, 42, 44, 46, 48, 49, 50, 52, 55, 57]
visited_nodes: [14, 15, 16, 17, 4, 18, 19, 20, 17, 21, 4, 18, 22, 1]
ic| visited_tokens: [], visited_nodes: []
ic| list(range(len(visited_tokens) - 1, -1, -1)): []
ic| token.i: 62
token.text: 'criteria'
token.lemma_: 'criterion'
token.pos_: 'NOUN'
visited_tokens: []
visited_nodes: []
ic| visited_tokens: [62], visited_nodes: [7]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [0]
ic| prev_token: 0, token.i - visited_tokens[prev_token]: 3
ic| node0: 24, node1: 7
ic| token.i: 65
token.text: 'corresponding'
token.lemma_: 'correspond'
token.pos_: 'VERB'
visited_tokens: [62]
visited_nodes: [7]
ic| visited_tokens: [62, 65], visited_nodes: [7, 24]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [1, 0]
ic| prev_token: 1, token.i - visited_tokens[prev_token]: 1
ic| node0: 19, node1: 24
ic| prev_token: 0, token.i - visited_tokens[prev_token]: 4
ic| token.i: 66
token.text: 'algorithms'
token.lemma_: 'algorithm'
token.pos_: 'NOUN'
visited_tokens: [62, 65]
visited_nodes: [7, 24]
ic| visited_tokens: [62, 65, 66], visited_nodes: [7, 24, 19]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [2, 1, 0]
ic| prev_token: 2, token.i - visited_tokens[prev_token]: 2
ic| node0: 25, node1: 19
ic| prev_token: 1, token.i - visited_tokens[prev_token]: 3
ic| node0: 25, node1: 24
ic| prev_token: 0, token.i - visited_tokens[prev_token]: 6
ic| token.i: 68
token.text: 'constructing'
token.lemma_: 'construct'
token.pos_: 'VERB'
visited_tokens: [62, 65, 66]
visited_nodes: [7, 24, 19]
ic| visited_tokens: [62, 65, 66, 68], visited_nodes: [7, 24, 19, 25]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [3, 2, 1, 0]
ic| prev_token: 3, token.i - visited_tokens[prev_token]: 2
ic| node0: 17, node1: 25
ic| prev_token: 2, token.i - visited_tokens[prev_token]: 4
ic| token.i: 70
token.text: 'minimal'
token.lemma_: 'minimal'
token.pos_: 'ADJ'
visited_tokens: [62, 65, 66, 68]
visited_nodes: [7, 24, 19, 25]
ic| visited_tokens: [62, 65, 66, 68, 70]
visited_nodes: [7, 24, 19, 25, 17]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [4, 3, 2, 1, 0]
ic| prev_token: 4, token.i - visited_tokens[prev_token]: 1
ic| node0: 26, node1: 17
ic| prev_token: 3, token.i - visited_tokens[prev_token]: 3
ic| node0: 26, node1: 25
ic| prev_token: 2, token.i - visited_tokens[prev_token]: 5
ic| token.i: 71
token.text: 'supporting'
token.lemma_: 'support'
token.pos_: 'VERB'
visited_tokens: [62, 65, 66, 68, 70]
visited_nodes: [7, 24, 19, 25, 17]
ic| visited_tokens: [62, 65, 66, 68, 70, 71]
visited_nodes: [7, 24, 19, 25, 17, 26]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [5, 4, 3, 2, 1, 0]
ic| prev_token: 5, token.i - visited_tokens[prev_token]: 1
ic| node0: 4, node1: 26
ic| prev_token: 4, token.i - visited_tokens[prev_token]: 2
ic| node0: 4, node1: 17
ic| prev_token: 3, token.i - visited_tokens[prev_token]: 4
ic| token.i: 72
token.text: 'set'
token.lemma_: 'set'
token.pos_: 'NOUN'
visited_tokens: [62, 65, 66, 68, 70, 71]
visited_nodes: [7, 24, 19, 25, 17, 26]
ic| visited_tokens: [62, 65, 66, 68, 70, 71, 72]
visited_nodes: [7, 24, 19, 25, 17, 26, 4]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 6, token.i - visited_tokens[prev_token]: 2
ic| node0: 18, node1: 4
ic| prev_token: 5, token.i - visited_tokens[prev_token]: 3
ic| node0: 18, node1: 26
ic| prev_token: 4, token.i - visited_tokens[prev_token]: 4
ic| token.i: 74
token.text: 'solutions'
token.lemma_: 'solution'
token.pos_: 'NOUN'
visited_tokens: [62, 65, 66, 68, 70, 71, 72]
visited_nodes: [7, 24, 19, 25, 17, 26, 4]
ic| visited_tokens: [62, 65, 66, 68, 70, 71, 72, 74]
visited_nodes: [7, 24, 19, 25, 17, 26, 4, 18]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 7, token.i - visited_tokens[prev_token]: 3
ic| node0: 27, node1: 18
ic| prev_token: 6, token.i - visited_tokens[prev_token]: 5
ic| token.i: 77
token.text: 'used'
token.lemma_: 'use'
token.pos_: 'VERB'
visited_tokens: [62, 65, 66, 68, 70, 71, 72, 74]
visited_nodes: [7, 24, 19, 25, 17, 26, 4, 18]
ic| visited_tokens: [62, 65, 66, 68, 70, 71, 72, 74, 77]
visited_nodes: [7, 24, 19, 25, 17, 26, 4, 18, 27]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [8, 7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 8, token.i - visited_tokens[prev_token]: 2
ic| node0: 28, node1: 27
ic| prev_token: 7, token.i - visited_tokens[prev_token]: 5
ic| token.i: 79
token.text: 'solving'
token.lemma_: 'solve'
token.pos_: 'VERB'
visited_tokens: [62, 65, 66, 68, 70, 71, 72, 74, 77]
visited_nodes: [7, 24, 19, 25, 17, 26, 4, 18, 27]
ic| visited_tokens: [62, 65, 66, 68, 70, 71, 72, 74, 77, 79]
visited_nodes: [7, 24, 19, 25, 17, 26, 4, 18, 27, 28]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 9, token.i - visited_tokens[prev_token]: 3
ic| node0: 13, node1: 28
ic| prev_token: 8, token.i - visited_tokens[prev_token]: 5
ic| token.i: 82
token.text: 'considered'
token.lemma_: 'consider'
token.pos_: 'VERB'
visited_tokens: [62, 65, 66, 68, 70, 71, 72, 74, 77, 79]
visited_nodes: [7, 24, 19, 25, 17, 26, 4, 18, 27, 28]
ic| visited_tokens: [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82]
visited_nodes: [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 10, token.i - visited_tokens[prev_token]: 1
ic| node0: 22, node1: 13
ic| prev_token: 9, token.i - visited_tokens[prev_token]: 4
ic| token.i: 83
token.text: 'types'
token.lemma_: 'type'
token.pos_: 'NOUN'
visited_tokens: [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82]
visited_nodes: [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13]
ic| visited_tokens: [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82, 83]
visited_nodes: [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13, 22]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 11, token.i - visited_tokens[prev_token]: 1
ic| node0: 1, node1: 22
ic| prev_token: 10, token.i - visited_tokens[prev_token]: 2
ic| node0: 1, node1: 13
ic| prev_token: 9, token.i - visited_tokens[prev_token]: 5
ic| token.i: 84
token.text: 'systems'
token.lemma_: 'system'
token.pos_: 'NOUN'
visited_tokens: [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82, 83]
visited_nodes: [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13, 22]
ic| visited_tokens: [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82, 83, 84]
visited_nodes: [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13, 22, 1]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 12, token.i - visited_tokens[prev_token]: 2
ic| node0: 1, node1: 1
ic| prev_token: 11, token.i - visited_tokens[prev_token]: 3
ic| node0: 1, node1: 22
ic| prev_token: 10, token.i - visited_tokens[prev_token]: 4
ic| token.i: 86
token.text: 'systems'
token.lemma_: 'system'
token.pos_: 'NOUN'
visited_tokens: [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82, 83, 84]
visited_nodes: [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13, 22, 1]
ic| visited_tokens: [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82, 83, 84, 86]
visited_nodes: [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13, 22, 1, 1]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 13, token.i - visited_tokens[prev_token]: 2
ic| node0: 29, node1: 1
ic| prev_token: 12, token.i - visited_tokens[prev_token]: 4
ic| token.i: 88
token.text: 'mixed'
token.lemma_: 'mixed'
token.pos_: 'ADJ'
visited_tokens: [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82, 83, 84, 86]
visited_nodes: [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13, 22, 1, 1]
ic| visited_tokens: [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82, 83, 84, 86, 88]
visited_nodes: [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13, 22, 1, 1, 29]
ic| list(range(len(visited_tokens) - 1, -1, -1)): [14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
ic| prev_token: 14, token.i - visited_tokens[prev_token]: 1
ic| node0: 22, node1: 29
ic| prev_token: 13, token.i - visited_tokens[prev_token]: 3
ic| node0: 22, node1: 1
ic| prev_token: 12, token.i - visited_tokens[prev_token]: 5
ic| token.i: 89
token.text: 'types'
token.lemma_: 'type'
token.pos_: 'NOUN'
visited_tokens: [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82, 83, 84, 86, 88]
visited_nodes: [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13, 22, 1, 1, 29]
ic(seen_lemma)
ic| seen_lemma: {('Diophantine', 'PROPN'): {21},
('Upper', 'PROPN'): {33},
('algorithm', 'NOUN'): {66, 44},
('bound', 'NOUN'): {34},
('compatibility', 'NOUN'): {0, 15},
('component', 'NOUN'): {36},
('consider', 'VERB'): {82, 31},
('constraint', 'NOUN'): {5},
('construct', 'VERB'): {68},
('construction', 'NOUN'): {46},
('correspond', 'VERB'): {65},
('criterion', 'NOUN'): {13, 62},
('equation', 'NOUN'): {22},
('generating', 'NOUN'): {49},
('give', 'VERB'): {59},
('inequation', 'NOUN'): {25, 29},
('linear', 'ADJ'): {4, 20},
('minimal', 'ADJ'): {48, 70, 39},
('mixed', 'ADJ'): {88},
('natural', 'ADJ'): {10},
('nonstrict', 'ADJ'): {28},
('number', 'NOUN'): {11},
('set', 'NOUN'): {8, 40, 72, 50},
('solution', 'NOUN'): {42, 52, 74},
('solve', 'VERB'): {79},
('strict', 'ADJ'): {24},
('support', 'VERB'): {71},
('system', 'NOUN'): {2, 18, 84, 86, 57},
('type', 'NOUN'): {89, 83, 55},
('use', 'VERB'): {77}}
{('compatibility', 'NOUN'): {0, 15},
('system', 'NOUN'): {2, 18, 57, 84, 86},
('linear', 'ADJ'): {4, 20},
('constraint', 'NOUN'): {5},
('set', 'NOUN'): {8, 40, 50, 72},
('natural', 'ADJ'): {10},
('number', 'NOUN'): {11},
('criterion', 'NOUN'): {13, 62},
('Diophantine', 'PROPN'): {21},
('equation', 'NOUN'): {22},
('strict', 'ADJ'): {24},
('inequation', 'NOUN'): {25, 29},
('nonstrict', 'ADJ'): {28},
('consider', 'VERB'): {31, 82},
('Upper', 'PROPN'): {33},
('bound', 'NOUN'): {34},
('component', 'NOUN'): {36},
('minimal', 'ADJ'): {39, 48, 70},
('solution', 'NOUN'): {42, 52, 74},
('algorithm', 'NOUN'): {44, 66},
('construction', 'NOUN'): {46},
('generating', 'NOUN'): {49},
('type', 'NOUN'): {55, 83, 89},
('give', 'VERB'): {59},
('correspond', 'VERB'): {65},
('construct', 'VERB'): {68},
('support', 'VERB'): {71},
('use', 'VERB'): {77},
('solve', 'VERB'): {79},
('mixed', 'ADJ'): {88}}
Let's visualize the lemma graph; first we need to collect a dictionary of labels, one for each node.
labels = {}
keys = list(seen_lemma.keys())

for i in range(len(seen_lemma)):
    labels[i] = keys[i][0].lower()

labels
{0: 'compatibility',
1: 'system',
2: 'linear',
3: 'constraint',
4: 'set',
5: 'natural',
6: 'number',
7: 'criterion',
8: 'diophantine',
9: 'equation',
10: 'strict',
11: 'inequation',
12: 'nonstrict',
13: 'consider',
14: 'upper',
15: 'bound',
16: 'component',
17: 'minimal',
18: 'solution',
19: 'algorithm',
20: 'construction',
21: 'generating',
22: 'type',
23: 'give',
24: 'correspond',
25: 'construct',
26: 'support',
27: 'use',
28: 'solve',
29: 'mixed'}
Then use matplotlib to visualize the lemma graph:
%matplotlib inline
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(9, 9))
pos = nx.spring_layout(lemma_graph)
nx.draw(lemma_graph, pos=pos, with_labels=False, font_weight="bold")
nx.draw_networkx_labels(lemma_graph, pos, labels);
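Note that spring_layout is stochastic, so the figure will vary from run to run; to make the layout reproducible, NetworkX accepts a seed:

# optional: pin the layout's random state for a reproducible figure
pos = nx.spring_layout(lemma_graph, seed=42)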
Now to run the algorithm, we use PageRank, a form of eigenvector centrality, to calculate ranks for each of the nodes in the lemma graph.
ranks = nx.pagerank(lemma_graph)
ranks
{0: 0.025190055141357165,
1: 0.09709174565608479,
2: 0.03658656272138432,
3: 0.022947339381092696,
4: 0.07548767636963792,
5: 0.01884004785394172,
6: 0.01884004785394172,
7: 0.019767433918161055,
8: 0.03093250736456608,
9: 0.031636552282656216,
10: 0.0250439175297852,
11: 0.03969617593153302,
12: 0.02513276636673567,
13: 0.0390375393827704,
14: 0.02428614673389346,
15: 0.02428614673389346,
16: 0.031629446298645975,
17: 0.06334806476862227,
18: 0.061826419749828485,
19: 0.03201021345587308,
20: 0.02404712231242087,
21: 0.029468555366439973,
22: 0.04816979699201436,
23: 0.010894627731045426,
24: 0.026930354700088012,
25: 0.03165915652710971,
26: 0.029382686223731833,
27: 0.019263849959229938,
28: 0.01982332786706688,
29: 0.016743716826448173}
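For intuition about what just happened: PageRank amounts to power iteration over a damped, weight-normalized transition matrix. Here is a minimal sketch of that computation, purely illustrative and not NetworkX's implementation (which also handles dangling nodes, convergence tolerance, personalization, and so on); its values should land close to the ranks above:

# minimal power-iteration sketch of PageRank on a weighted graph
def pagerank_sketch (graph, alpha=0.85, iters=50):
    n = graph.number_of_nodes()
    rank = {node: 1.0 / n for node in graph.nodes()}

    for _ in range(iters):
        prev = rank
        rank = {}

        for node in graph.nodes():
            # each neighbor passes along rank in proportion to the weight of
            # the shared edge, relative to that neighbor's total edge weight
            flow = sum(
                prev[nbr] * graph[nbr][node]["weight"]
                / sum(attrs["weight"] for attrs in graph[nbr].values())
                for nbr in graph[node]
            )
            rank[node] = (1.0 - alpha) / n + alpha * flow

    return rank

pagerank_sketch(lemma_graph)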
for node_id, rank in sorted(ranks.items(), key=lambda x: x[1], reverse=True):
    ic(node_id, rank, labels[node_id])
ic| node_id: 1, rank: 0.09709174565608479, labels[node_id]: 'system'
ic| node_id: 4, rank: 0.07548767636963792, labels[node_id]: 'set'
ic| node_id: 17, rank: 0.06334806476862227, labels[node_id]: 'minimal'
ic| node_id: 18
rank: 0.061826419749828485
labels[node_id]: 'solution'
ic| node_id: 22, rank: 0.04816979699201436, labels[node_id]: 'type'
ic| node_id: 11
rank: 0.03969617593153302
labels[node_id]: 'inequation'
ic| node_id: 13, rank: 0.0390375393827704, labels[node_id]: 'consider'
ic| node_id: 2, rank: 0.03658656272138432, labels[node_id]: 'linear'
ic| node_id: 19
rank: 0.03201021345587308
labels[node_id]: 'algorithm'
ic| node_id: 25
rank: 0.03165915652710971
labels[node_id]: 'construct'
ic| node_id: 9
rank: 0.031636552282656216
labels[node_id]: 'equation'
ic| node_id: 16
rank: 0.031629446298645975
labels[node_id]: 'component'
ic| node_id: 8
rank: 0.03093250736456608
labels[node_id]: 'diophantine'
ic| node_id: 21
rank: 0.029468555366439973
labels[node_id]: 'generating'
ic| node_id: 26
rank: 0.029382686223731833
labels[node_id]: 'support'
ic| node_id: 24
rank: 0.026930354700088012
labels[node_id]: 'correspond'
ic| node_id: 0
rank: 0.025190055141357165
labels[node_id]: 'compatibility'
ic| node_id: 12
rank: 0.02513276636673567
labels[node_id]: 'nonstrict'
ic| node_id: 10, rank: 0.0250439175297852, labels[node_id]: 'strict'
ic| node_id: 14, rank: 0.02428614673389346, labels[node_id]: 'upper'
ic| node_id: 15, rank: 0.02428614673389346, labels[node_id]: 'bound'
ic| node_id: 20
rank: 0.02404712231242087
labels[node_id]: 'construction'
ic| node_id: 3
rank: 0.022947339381092696
labels[node_id]: 'constraint'
ic| node_id: 28, rank: 0.01982332786706688, labels[node_id]: 'solve'
ic| node_id: 7
rank: 0.019767433918161055
labels[node_id]: 'criterion'
ic| node_id: 27, rank: 0.019263849959229938, labels[node_id]: 'use'
ic| node_id: 5, rank: 0.01884004785394172, labels[node_id]: 'natural'
ic| node_id: 6, rank: 0.01884004785394172, labels[node_id]: 'number'
ic| node_id: 29, rank: 0.016743716826448173, labels[node_id]: 'mixed'
ic| node_id: 23, rank: 0.010894627731045426, labels[node_id]: 'give'
Define a function to collect the top-ranked phrases from the lemma graph.
import math

def collect_phrases (chunk, phrases, counts):
    chunk_len = chunk.end - chunk.start
    sq_sum_rank = 0.0
    non_lemma = 0
    compound_key = set([])

    for i in range(chunk.start, chunk.end):
        token = doc[i]
        key = (token.lemma_, token.pos_)

        if key in seen_lemma:
            node_id = list(seen_lemma.keys()).index(key)
            rank = ranks[node_id]
            sq_sum_rank += rank
            compound_key.add(key)
            ic(token.lemma_, token.pos_, node_id, rank)
        else:
            non_lemma += 1

    # although the noun chunking is greedy, we discount the ranks using a
    # point estimate based on the number of non-lemma tokens within a phrase
    non_lemma_discount = chunk_len / (chunk_len + (2.0 * non_lemma) + 1.0)

    # use the square root of the mean rank to normalize the contributions of all the tokens
    phrase_rank = math.sqrt(sq_sum_rank / (chunk_len + non_lemma))
    phrase_rank *= non_lemma_discount

    # remove spurious punctuation
    phrase = chunk.text.lower().replace("'", "")

    # create a unique key for the phrase based on its lemma components
    compound_key = tuple(sorted(list(compound_key)))

    if compound_key not in phrases:
        phrases[compound_key] = set([ (phrase, phrase_rank) ])
        counts[compound_key] = 1
    else:
        phrases[compound_key].add( (phrase, phrase_rank) )
        counts[compound_key] += 1

    ic(phrase_rank, chunk.text, chunk.start, chunk.end, chunk_len, counts[compound_key])
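As a sanity check on that formula, here is the score for the chunk 'minimal generating sets' computed by hand, using the ranks from above for its three lemma nodes (17 'minimal', 21 'generating', 4 'set'); the result matches the phrase_rank reported in the collected output below:

# hand-computed score: all three tokens are kept lemmas,
# so chunk_len == 3 and non_lemma == 0
sq_sum_rank = ranks[17] + ranks[21] + ranks[4]
non_lemma_discount = 3 / (3 + (2.0 * 0) + 1.0)  # 0.75

math.sqrt(sq_sum_rank / 3) * non_lemma_discount  # ~0.17764305670256655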
Collect the top-ranked phrases based on both the noun chunks and the named entities...
phrases = {}
counts = {}

for chunk in doc.noun_chunks:
    collect_phrases(chunk, phrases, counts)
ic| token.lemma_: 'compatibility'
token.pos_: 'NOUN'
node_id: 0
rank: 0.025190055141357165
ic| phrase_rank: 0.07935687610622845
chunk.text: 'Compatibility'
chunk.start: 0
chunk.end: 1
chunk_len: 1
counts[compound_key]: 1
ic| token.lemma_: 'system'
token.pos_: 'NOUN'
node_id: 1
rank: 0.09709174565608479
ic| phrase_rank: 0.15579774200552843
chunk.text: 'systems'
chunk.start: 2
chunk.end: 3
chunk_len: 1
counts[compound_key]: 1
ic| token.lemma_: 'linear'
token.pos_: 'ADJ'
node_id: 2
rank: 0.03658656272138432
ic| token.lemma_: 'constraint'
token.pos_: 'NOUN'
node_id: 3
rank: 0.022947339381092696
ic| phrase_rank: 0.11502067650110856
chunk.text: 'linear constraints'
chunk.start: 4
chunk.end: 6
chunk_len: 2
counts[compound_key]: 1
ic| token.lemma_: 'set'
token.pos_: 'NOUN'
node_id: 4
rank: 0.07548767636963792
ic| phrase_rank: 0.06345084244027568
chunk.text: 'the set'
chunk.start: 7
chunk.end: 9
chunk_len: 2
counts[compound_key]: 1
ic| token.lemma_: 'natural'
token.pos_: 'ADJ'
node_id: 5
rank: 0.01884004785394172
ic| token.lemma_: 'number'
token.pos_: 'NOUN'
node_id: 6
rank: 0.01884004785394172
ic| phrase_rank: 0.09150603587606598
chunk.text: 'natural numbers'
chunk.start: 10
chunk.end: 12
chunk_len: 2
counts[compound_key]: 1
ic| token.lemma_: 'criterion'
token.pos_: 'NOUN'
node_id: 7
rank: 0.019767433918161055
ic| phrase_rank: 0.07029835332026109
chunk.text: 'Criteria'
chunk.start: 13
chunk.end: 14
chunk_len: 1
counts[compound_key]: 1
ic| token.lemma_: 'compatibility'
token.pos_: 'NOUN'
node_id: 0
rank: 0.025190055141357165
ic| phrase_rank: 0.07935687610622845
chunk.text: 'compatibility'
chunk.start: 15
chunk.end: 16
chunk_len: 1
counts[compound_key]: 2
ic| token.lemma_: 'system'
token.pos_: 'NOUN'
node_id: 1
rank: 0.09709174565608479
ic| phrase_rank: 0.07195989462882217
chunk.text: 'a system'
chunk.start: 17
chunk.end: 19
chunk_len: 2
counts[compound_key]: 2
ic| token.lemma_: 'linear'
token.pos_: 'ADJ'
node_id: 2
rank: 0.03658656272138432
ic| token.lemma_: 'Diophantine'
token.pos_: 'PROPN'
node_id: 8
rank: 0.03093250736456608
ic| token.lemma_: 'equation'
token.pos_: 'NOUN'
node_id: 9
rank: 0.031636552282656216
ic| phrase_rank: 0.13635130800294415
chunk.text: 'linear Diophantine equations'
chunk.start: 20
chunk.end: 23
chunk_len: 3
counts[compound_key]: 1
ic| token.lemma_: 'strict'
token.pos_: 'ADJ'
node_id: 10
rank: 0.0250439175297852
ic| token.lemma_: 'inequation'
token.pos_: 'NOUN'
node_id: 11
rank: 0.03969617593153302
ic| phrase_rank: 0.11994451815672316
chunk.text: 'strict inequations'
chunk.start: 24
chunk.end: 26
chunk_len: 2
counts[compound_key]: 1
ic| token.lemma_: 'nonstrict'
token.pos_: 'ADJ'
node_id: 12
rank: 0.02513276636673567
ic| token.lemma_: 'inequation'
token.pos_: 'NOUN'
node_id: 11
rank: 0.03969617593153302
ic| phrase_rank: 0.12002679543267614
chunk.text: 'nonstrict inequations'
chunk.start: 28
chunk.end: 30
chunk_len: 2
counts[compound_key]: 1
ic| token.lemma_: 'Upper'
token.pos_: 'PROPN'
node_id: 14
rank: 0.02428614673389346
ic| token.lemma_: 'bound'
token.pos_: 'NOUN'
node_id: 15
rank: 0.02428614673389346
ic| phrase_rank: 0.10389342131646997
chunk.text: 'Upper bounds'
chunk.start: 33
chunk.end: 35
chunk_len: 2
counts[compound_key]: 1
ic| token.lemma_: 'component'
token.pos_: 'NOUN'
node_id: 16
rank: 0.031629446298645975
ic| phrase_rank: 0.08892334662315343
chunk.text: 'components'
chunk.start: 36
chunk.end: 37
chunk_len: 1
counts[compound_key]: 1
ic| token.lemma_: 'minimal'
token.pos_: 'ADJ'
node_id: 17
rank: 0.06334806476862227
ic| token.lemma_: 'set'
token.pos_: 'NOUN'
node_id: 4
rank: 0.07548767636963792
ic| phrase_rank: 0.09315167105930662
chunk.text: 'a minimal set'
chunk.start: 38
chunk.end: 41
chunk_len: 3
counts[compound_key]: 1
ic| token.lemma_: 'solution'
token.pos_: 'NOUN'
node_id: 18
rank: 0.061826419749828485
ic| phrase_rank: 0.12432459506251015
chunk.text: 'solutions'
chunk.start: 42
chunk.end: 43
chunk_len: 1
counts[compound_key]: 1
ic| token.lemma_: 'algorithm'
token.pos_: 'NOUN'
node_id: 19
rank: 0.03201021345587308
ic| phrase_rank: 0.08945699169974514
chunk.text: 'algorithms'
chunk.start: 44
chunk.end: 45
chunk_len: 1
counts[compound_key]: 1
ic| token.lemma_: 'construction'
token.pos_: 'NOUN'
node_id: 20
rank: 0.02404712231242087
ic| phrase_rank: 0.07753567293901058
chunk.text: 'construction'
chunk.start: 46
chunk.end: 47
chunk_len: 1
counts[compound_key]: 1
ic| token.lemma_: 'minimal'
token.pos_: 'ADJ'
node_id: 17
rank: 0.06334806476862227
ic| token.lemma_: 'generating'
token.pos_: 'NOUN'
node_id: 21
rank: 0.029468555366439973
ic| token.lemma_: 'set'
token.pos_: 'NOUN'
node_id: 4
rank: 0.07548767636963792
ic| phrase_rank: 0.17764305670256655
chunk.text: 'minimal generating sets'
chunk.start: 48
chunk.end: 51
chunk_len: 3
counts[compound_key]: 1
ic| token.lemma_: 'solution'
token.pos_: 'NOUN'
node_id: 18
rank: 0.061826419749828485
ic| phrase_rank: 0.12432459506251015
chunk.text: 'solutions'
chunk.start: 52
chunk.end: 53
chunk_len: 1
counts[compound_key]: 2
ic| token.lemma_: 'type'
token.pos_: 'NOUN'
node_id: 22
rank: 0.04816979699201436
ic| phrase_rank: 0.050685854432712285
chunk.text: 'all types'
chunk.start: 54
chunk.end: 56
chunk_len: 2
counts[compound_key]: 1
ic| token.lemma_: 'system'
token.pos_: 'NOUN'
node_id: 1
rank: 0.09709174565608479
ic| phrase_rank: 0.15579774200552843
chunk.text: 'systems'
chunk.start: 57
chunk.end: 58
chunk_len: 1
counts[compound_key]: 3
ic| token.lemma_: 'criterion'
token.pos_: 'NOUN'
node_id: 7
rank: 0.019767433918161055
ic| phrase_rank: 0.03246941857043213
chunk.text: 'These criteria'
chunk.start: 61
chunk.end: 63
chunk_len: 2
counts[compound_key]: 2
ic| token.lemma_: 'correspond'
token.pos_: 'VERB'
node_id: 24
rank: 0.026930354700088012
ic| token.lemma_: 'algorithm'
token.pos_: 'NOUN'
node_id: 19
rank: 0.03201021345587308
ic| phrase_rank: 0.0606941966727262
chunk.text: 'the corresponding algorithms'
chunk.start: 64
chunk.end: 67
chunk_len: 3
counts[compound_key]: 1
ic| token.lemma_: 'minimal'
token.pos_: 'ADJ'
node_id: 17
rank: 0.06334806476862227
ic| token.lemma_: 'support'
token.pos_: 'VERB'
node_id: 26
rank: 0.029382686223731833
ic| token.lemma_: 'set'
token.pos_: 'NOUN'
node_id: 4
rank: 0.07548767636963792
ic| phrase_rank: 0.10481265770639074
chunk.text: 'a minimal supporting set'
chunk.start: 69
chunk.end: 73
chunk_len: 4
counts[compound_key]: 1
ic| token.lemma_: 'solution'
token.pos_: 'NOUN'
node_id: 18
rank: 0.061826419749828485
ic| phrase_rank: 0.12432459506251015
chunk.text: 'solutions'
chunk.start: 74
chunk.end: 75
chunk_len: 1
counts[compound_key]: 3
ic| token.lemma_: 'consider'
token.pos_: 'VERB'
node_id: 13
rank: 0.0390375393827704
ic| token.lemma_: 'type'
token.pos_: 'NOUN'
node_id: 22
rank: 0.04816979699201436
ic| token.lemma_: 'system'
token.pos_: 'NOUN'
node_id: 1
rank: 0.09709174565608479
ic| phrase_rank: 0.0811302044403381
chunk.text: 'all the considered types systems'
chunk.start: 80
chunk.end: 85
chunk_len: 5
counts[compound_key]: 1
ic| token.lemma_: 'system'
token.pos_: 'NOUN'
node_id: 1
rank: 0.09709174565608479
ic| phrase_rank: 0.15579774200552843
chunk.text: 'systems'
chunk.start: 86
chunk.end: 87
chunk_len: 1
counts[compound_key]: 4
ic| token.lemma_: 'mixed'
token.pos_: 'ADJ'
node_id: 29
rank: 0.016743716826448173
ic| token.lemma_: 'type'
token.pos_: 'NOUN'
node_id: 22
rank: 0.04816979699201436
ic| phrase_rank: 0.12010505939797736
chunk.text: 'mixed types'
chunk.start: 88
chunk.end: 90
chunk_len: 2
counts[compound_key]: 1
for ent in doc.ents:
    collect_phrases(ent, phrases, counts)
ic| token.lemma_: 'Diophantine'
token.pos_: 'PROPN'
node_id: 8
rank: 0.03093250736456608
ic| phrase_rank: 0.08793819898736567
chunk.text: 'Diophantine'
chunk.start: 21
chunk.end: 22
chunk_len: 1
counts[compound_key]: 1
Since noun chunks can be expressed in different ways (e.g., they may have articles or prepositions), we need to find a minimum span for each phrase based on combinations of lemmas...
import operator

min_phrases = {}

for compound_key, rank_tuples in phrases.items():
    l = list(rank_tuples)
    l.sort(key=operator.itemgetter(1), reverse=True)

    phrase, rank = l[0]
    count = counts[compound_key]
    min_phrases[phrase] = (rank, count)
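For example (peeking at one entry): the chunks 'systems' and 'a system' collapse onto the single compound key ('system', 'NOUN'), so the highest-ranked surface form, 'systems', is the one kept:

ic(phrases[(("system", "NOUN"),)])
ic(min_phrases["systems"])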
Then let's examine the results of TextRank...
for phrase, (rank, count) in sorted(min_phrases.items(), key=lambda x: x[1][0], reverse=True):
    ic(phrase, count, rank)
ic| phrase: 'minimal generating sets'
count: 1
rank: 0.17764305670256655
ic| phrase: 'systems', count: 4, rank: 0.15579774200552843
ic| phrase: 'linear diophantine equations'
count: 1
rank: 0.13635130800294415
ic| phrase: 'solutions', count: 3, rank: 0.12432459506251015
ic| phrase: 'mixed types', count: 1, rank: 0.12010505939797736
ic| phrase: 'nonstrict inequations'
count: 1
rank: 0.12002679543267614
ic| phrase: 'strict inequations', count: 1, rank: 0.11994451815672316
ic| phrase: 'linear constraints', count: 1, rank: 0.11502067650110856
ic| phrase: 'a minimal supporting set'
count: 1
rank: 0.10481265770639074
ic| phrase: 'upper bounds', count: 1, rank: 0.10389342131646997
ic| phrase: 'a minimal set', count: 1, rank: 0.09315167105930662
ic| phrase: 'natural numbers', count: 1, rank: 0.09150603587606598
ic| phrase: 'algorithms', count: 1, rank: 0.08945699169974514
ic| phrase: 'components', count: 1, rank: 0.08892334662315343
ic| phrase: 'diophantine', count: 1, rank: 0.08793819898736567
ic| phrase: 'all the considered types systems'
count: 1
rank: 0.0811302044403381
ic| phrase: 'compatibility', count: 2, rank: 0.07935687610622845
ic| phrase: 'construction', count: 1, rank: 0.07753567293901058
ic| phrase: 'criteria', count: 2, rank: 0.07029835332026109
ic| phrase: 'the set', count: 1, rank: 0.06345084244027568
ic| phrase: 'the corresponding algorithms'
count: 1
rank: 0.0606941966727262
ic| phrase: 'all types', count: 1, rank: 0.050685854432712285
Just for kicks, let's compare these results with the raw ranks of the individual, non-chunked lemma nodes...
for node_id, rank in sorted(ranks.items(), key=lambda x: x[1], reverse=True):
    ic(labels[node_id], rank)
ic| labels[node_id]: 'system', rank: 0.09709174565608479
ic| labels[node_id]: 'set', rank: 0.07548767636963792
ic| labels[node_id]: 'minimal', rank: 0.06334806476862227
ic| labels[node_id]: 'solution', rank: 0.061826419749828485
ic| labels[node_id]: 'type', rank: 0.04816979699201436
ic| labels[node_id]: 'inequation', rank: 0.03969617593153302
ic| labels[node_id]: 'consider', rank: 0.0390375393827704
ic| labels[node_id]: 'linear', rank: 0.03658656272138432
ic| labels[node_id]: 'algorithm', rank: 0.03201021345587308
ic| labels[node_id]: 'construct', rank: 0.03165915652710971
ic| labels[node_id]: 'equation', rank: 0.031636552282656216
ic| labels[node_id]: 'component', rank: 0.031629446298645975
ic| labels[node_id]: 'diophantine', rank: 0.03093250736456608
ic| labels[node_id]: 'generating', rank: 0.029468555366439973
ic| labels[node_id]: 'support', rank: 0.029382686223731833
ic| labels[node_id]: 'correspond', rank: 0.026930354700088012
ic| labels[node_id]: 'compatibility', rank: 0.025190055141357165
ic| labels[node_id]: 'nonstrict', rank: 0.02513276636673567
ic| labels[node_id]: 'strict', rank: 0.0250439175297852
ic| labels[node_id]: 'upper', rank: 0.02428614673389346
ic| labels[node_id]: 'bound', rank: 0.02428614673389346
ic| labels[node_id]: 'construction', rank: 0.02404712231242087
ic| labels[node_id]: 'constraint', rank: 0.022947339381092696
ic| labels[node_id]: 'solve', rank: 0.01982332786706688
ic| labels[node_id]: 'criterion', rank: 0.019767433918161055
ic| labels[node_id]: 'use', rank: 0.019263849959229938
ic| labels[node_id]: 'natural', rank: 0.01884004785394172
ic| labels[node_id]: 'number', rank: 0.01884004785394172
ic| labels[node_id]: 'mixed', rank: 0.016743716826448173
ic| labels[node_id]: 'give', rank: 0.010894627731045426