Note: to run this notebook in JupyterLab, load examples/sample.ipynb
Getting Started
First, we'll import the required libraries and add the PyTextRank component into the spaCy pipeline:
import pytextrank
import spacy
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank");
Let's take a look at this pipeline now...
nlp.pipe_names
['tok2vec',
'tagger',
'parser',
'attribute_ruler',
'lemmatizer',
'ner',
'textrank']
We can examine the spaCy pipeline in much greater detail...
nlp.analyze_pipes(pretty=True)
============================= Pipeline Overview =============================
#   Component         Assigns               Requires   Scores             Retokenizes
-   ---------------   -------------------   --------   ----------------   -----------
0   tok2vec           doc.tensor                                          False
1   tagger            token.tag                        tag_acc            False
2   parser            token.dep                        dep_uas            False
                      token.head                       dep_las
                      token.is_sent_start              dep_las_per_type
                      doc.sents                        sents_p
                                                       sents_r
                                                       sents_f
3   attribute_ruler                                                       False
4   lemmatizer        token.lemma                      lemma_acc          False
5   ner               doc.ents                         ents_f             False
                      token.ent_iob                    ents_p
                      token.ent_type                   ents_r
                                                       ents_per_type
6   textrank                                                              False
✔ No problems found.
{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
'requires': [],
'scores': [],
'retokenizes': False},
'tagger': {'assigns': ['token.tag'],
'requires': [],
'scores': ['tag_acc'],
'retokenizes': False},
'parser': {'assigns': ['token.dep',
'token.head',
'token.is_sent_start',
'doc.sents'],
'requires': [],
'scores': ['dep_uas',
'dep_las',
'dep_las_per_type',
'sents_p',
'sents_r',
'sents_f'],
'retokenizes': False},
'attribute_ruler': {'assigns': [],
'requires': [],
'scores': [],
'retokenizes': False},
'lemmatizer': {'assigns': ['token.lemma'],
'requires': [],
'scores': ['lemma_acc'],
'retokenizes': False},
'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
'requires': [],
'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
'retokenizes': False},
'textrank': {'assigns': [],
'requires': [],
'scores': [],
'retokenizes': False}},
'problems': {'tok2vec': [],
'tagger': [],
'parser': [],
'attribute_ruler': [],
'lemmatizer': [],
'ner': [],
'textrank': []},
'attrs': {'token.is_sent_start': {'assigns': ['parser'], 'requires': []},
'doc.sents': {'assigns': ['parser'], 'requires': []},
'token.head': {'assigns': ['parser'], 'requires': []},
'token.ent_iob': {'assigns': ['ner'], 'requires': []},
'token.tag': {'assigns': ['tagger'], 'requires': []},
'token.lemma': {'assigns': ['lemmatizer'], 'requires': []},
'doc.tensor': {'assigns': ['tok2vec'], 'requires': []},
'doc.ents': {'assigns': ['ner'], 'requires': []},
'token.dep': {'assigns': ['parser'], 'requires': []},
'token.ent_type': {'assigns': ['ner'], 'requires': []}}}
Next, let's load some text from a document:
from icecream import ic
import pathlib
text = pathlib.Path("../dat/mih.txt").read_text()
text
'Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.\n'
Then run the spaCy pipeline...
doc = nlp(text)
len(doc)
92
Now we can access the PyTextRank component within the spaCy pipeline, and use it to get more information for post-processing of the document.
For example, let's see what the elapsed time in milliseconds was for the TextRank processing:
tr = doc._.textrank
ic(tr.elapsed_time);
ic| tr.elapsed_time: 2.915620803833008
Let's examine the top-ranked phrases in the document:
for phrase in doc._.phrases:
    ic(phrase.rank, phrase.count, phrase.text)
    ic(phrase.chunks)
ic| phrase.rank: 0.18359439311764025
phrase.count: 1
phrase.text: 'mixed types'
ic| phrase.chunks: [mixed types]
ic| phrase.rank: 0.1784796193107821
phrase.count: 3
phrase.text: 'systems'
ic| phrase.chunks: [systems, systems, systems]
ic| phrase.rank: 0.15037838042245094
phrase.count: 1
phrase.text: 'minimal generating sets'
ic| phrase.chunks: [minimal generating sets]
ic| phrase.rank: 0.14740065982407313
phrase.count: 1
phrase.text: 'nonstrict inequations'
ic| phrase.chunks: [nonstrict inequations]
ic| phrase.rank: 0.13946027725597837
phrase.count: 1
phrase.text: 'strict inequations'
ic| phrase.chunks: [strict inequations]
ic| phrase.rank: 0.1195023546245721
phrase.count: 1
phrase.text: 'linear Diophantine equations'
ic| phrase.chunks: [linear Diophantine equations]
ic| phrase.rank: 0.11450088293222845
phrase.count: 1
phrase.text: 'natural numbers'
ic| phrase.chunks: [natural numbers]
ic| phrase.rank: 0.10780718173686318
phrase.count: 3
phrase.text: 'solutions'
ic| phrase.chunks: [solutions, solutions, solutions]
ic| phrase.rank: 0.10529828014583348
phrase.count: 1
phrase.text: 'linear constraints'
ic| phrase.chunks: [linear constraints]
ic| phrase.rank: 0.1036960590708142
phrase.count: 1
phrase.text: 'all the considered types systems'
ic| phrase.chunks: [all the considered types systems]
ic| phrase.rank: 0.08812713074893187
phrase.count: 1
phrase.text: 'a minimal supporting set'
ic| phrase.chunks: [a minimal supporting set]
ic| phrase.rank: 0.08444534702772151
phrase.count: 1
phrase.text: 'linear'
ic| phrase.chunks: [linear]
ic| phrase.rank: 0.08243620500315359
phrase.count: 1
phrase.text: 'a system'
ic| phrase.chunks: [a system]
ic| phrase.rank: 0.07944607954086784
phrase.count: 1
phrase.text: 'a minimal set'
ic| phrase.chunks: [a minimal set]
ic| phrase.rank: 0.0763527926213032
phrase.count: 1
phrase.text: 'algorithms'
ic| phrase.chunks: [algorithms]
ic| phrase.rank: 0.07593126037016427
phrase.count: 1
phrase.text: 'all types'
ic| phrase.chunks: [all types]
ic| phrase.rank: 0.07309361902551355
phrase.count: 1
phrase.text: 'Diophantine'
ic| phrase.chunks: [Diophantine]
ic| phrase.rank: 0.0702090100898443
phrase.count: 1
phrase.text: 'construction'
ic| phrase.chunks: [construction]
ic| phrase.rank: 0.05800111772673988
phrase.count: 1
phrase.text: 'the set'
ic| phrase.chunks: [the set]
ic| phrase.rank: 0.054251394765316464
phrase.count: 1
phrase.text: 'components'
ic| phrase.chunks: [components]
ic| phrase.rank: 0.04516904342912139
phrase.count: 1
phrase.text: 'Compatibility'
ic| phrase.chunks: [Compatibility]
ic| phrase.rank: 0.04516904342912139
phrase.count: 1
phrase.text: 'compatibility'
ic| phrase.chunks: [compatibility]
ic| phrase.rank: 0.04435648606848154
phrase.count: 1
phrase.text: 'the corresponding algorithms'
ic| phrase.chunks: [the corresponding algorithms]
ic| phrase.rank: 0.042273783712246285
phrase.count: 1
phrase.text: 'Criteria'
ic| phrase.chunks: [Criteria]
ic| phrase.rank: 0.01952542432474353
phrase.count: 1
phrase.text: 'These criteria'
ic| phrase.chunks: [These criteria]
Stop Words
To show use of the stop words feature, first we'll output a baseline...
text = pathlib.Path("../dat/gen.txt").read_text()
doc = nlp(text)
for phrase in doc._.phrases[:10]:
ic(phrase)
ic| phrase: Phrase(text='words', chunks=[words, words], count=2, rank=0.16404428603296545)
ic| phrase: Phrase(text='sentences', chunks=[sentences], count=1, rank=0.1287826954552565)
ic| phrase: Phrase(text='Mihalcea et al',
chunks=[Mihalcea et al],
count=1,
rank=0.11278365769540494)
ic| phrase: Phrase(text='Barrios et al',
chunks=[Barrios et al],
count=1,
rank=0.10760811592357011)
ic| phrase: Phrase(text='the remaining words',
chunks=[the remaining words],
count=1,
rank=0.09737893962520337)
ic| phrase: Phrase(text='text summarization',
chunks=[text summarization],
count=1,
rank=0.08861074217386355)
ic| phrase: Phrase(text='ranking webpages',
chunks=[ranking webpages],
count=1,
rank=0.07685260919250497)
ic| phrase: Phrase(text='Okapi BM25 function',
chunks=[Okapi BM25 function],
count=1,
rank=0.0756013984034083)
ic| phrase: Phrase(text='gensim implements',
chunks=[gensim implements],
count=1,
rank=0.0748386557231912)
ic| phrase: Phrase(text='every other sentence',
chunks=[every other sentence],
count=1,
rank=0.07031782290622991)
Notice how the top-ranked phrase above is words? Let's add its lemma word to our stop words list, to exclude it from the ranked phrases...
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank", config={ "stopwords": { "word": ["NOUN"] } })
doc = nlp(text)
for phrase in doc._.phrases[:10]:
ic(phrase)
ic| phrase: Phrase(text='sentences', chunks=[sentences], count=1, rank=0.14407118792073048)
ic| phrase: Phrase(text='Mihalcea et al',
chunks=[Mihalcea et al],
count=1,
rank=0.12123026637064825)
ic| phrase: Phrase(text='Barrios et al',
chunks=[Barrios et al],
count=1,
rank=0.11566772028535821)
ic| phrase: Phrase(text='text summarization',
chunks=[text summarization],
count=1,
rank=0.09524776232834677)
ic| phrase: Phrase(text='ranking webpages',
chunks=[ranking webpages],
count=1,
rank=0.08260919223940909)
ic| phrase: Phrase(text='Okapi BM25 function',
chunks=[Okapi BM25 function],
count=1,
rank=0.08125840606728206)
ic| phrase: Phrase(text='gensim implements',
chunks=[gensim implements],
count=1,
rank=0.08043607214961235)
ic| phrase: Phrase(text='every other sentence',
chunks=[every other sentence],
count=1,
rank=0.07915141312258998)
ic| phrase: Phrase(text='original TextRank',
chunks=[original TextRank],
count=1,
rank=0.07013026654397199)
ic| phrase: Phrase(text='TextRank',
chunks=[TextRank, TextRank, TextRank, TextRank, TextRank],
count=5,
rank=0.06686718957926076)
For each entry, you'll need to add a key that is the lemma_ and a value that's a list of its part-of-speech tags.
Note: the lemma_ of a token is the base form of the token, with no inflectional suffixes. It is usually represented in lower-case form, except for proper nouns and named entities. For example, words like ran, runs, and running are lemmatized to run, while London is lemmatized to London without lower-casing. It is suggested to check the designated lemma value for a token before setting it in the stop words config.
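As a quick check before editing the config, we can print each token's lemma_ and pos_ to confirm which lemma/POS pair to list, since the key must be the lemma rather than the surface form. Here's a minimal sketch, using an illustrative sentence of our own:
# confirm which lemma/POS pair belongs in the "stopwords" config;
# note that "words" lemmatizes to "word"
for token in nlp("These words and sentences ran long."):
    ic(token.text, token.lemma_, token.pos_)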
Scrubber
Observe how different variations of "sentence", like "every sentence" and "every other sentence", as well as variations of "sentences", occur in the phrase list. You can normalize such variations by passing a scrubber function in the config.
from spacy.tokens import Span
nlp = spacy.load("en_core_web_sm")
@spacy.registry.misc("prefix_scrubber")
def prefix_scrubber():
def scrubber_func(span: Span) -> str:
while len(span) > 1 and span[0].text in ("a", "the", "their", "every", "other", "two"):
span = span[1:]
return span.lemma_
return scrubber_func
nlp.add_pipe("textrank", config={ "stopwords": { "word": ["NOUN"] }, "scrubber": {"@misc": "prefix_scrubber"}})
doc = nlp(text)
for phrase in doc._.phrases[:10]:
ic(phrase)
ic| phrase: Phrase(text='sentence',
chunks=[sentences,
every sentence,
every other sentence,
the two sentences,
two sentences,
the sentences],
count=6,
rank=0.14407118792073048)
ic| phrase: Phrase(text='Mihalcea et al',
chunks=[Mihalcea et al],
count=1,
rank=0.12123026637064825)
ic| phrase: Phrase(text='Barrios et al',
chunks=[Barrios et al],
count=1,
rank=0.11566772028535821)
ic| phrase: Phrase(text='text summarization',
chunks=[text summarization],
count=1,
rank=0.09524776232834677)
ic| phrase: Phrase(text='rank webpage',
chunks=[ranking webpages],
count=1,
rank=0.08260919223940909)
ic| phrase: Phrase(text='Okapi BM25 function',
chunks=[Okapi BM25 function],
count=1,
rank=0.08125840606728206)
ic| phrase: Phrase(text='gensim implement',
chunks=[gensim implements],
count=1,
rank=0.08043607214961235)
ic| phrase: Phrase(text='original TextRank',
chunks=[original TextRank],
count=1,
rank=0.07013026654397199)
ic| phrase: Phrase(text='TextRank',
chunks=[TextRank, TextRank, TextRank, TextRank],
count=4,
rank=0.06686718957926076)
ic| phrase: Phrase(text='Olavur Mortensen',
chunks=[Olavur Mortensen, Olavur Mortensen],
count=2,
rank=0.06548020385220721)
Different variations of "sentence(s)" are now represented as part of a single entry in the phrase list.
Since the scrubber takes in a Span, we can also use token.pos_ or any other spaCy Token or Span attribute in the scrubbing. The variations of "sentences" differ in their determiners (DET), so we could achieve a similar result with the following scrubber:
@spacy.registry.misc("articles_scrubber")
def articles_scrubber():
def scrubber_func(span: Span) -> str:
for token in span:
if token.pos_ not in ["DET", "PRON"]:
break
span = span[1:]
return span.text
return scrubber_func
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank", config={ "stopwords": { "word": ["NOUN"] }, "scrubber": {"@misc": "articles_scrubber"}})
doc = nlp(text)
for phrase in doc._.phrases[:10]:
ic(phrase)
ic| phrase: Phrase(text='sentences',
chunks=[sentences, the sentences],
count=2,
rank=0.14407118792073048)
ic| phrase: Phrase(text='Mihalcea et al',
chunks=[Mihalcea et al],
count=1,
rank=0.12123026637064825)
ic| phrase: Phrase(text='Barrios et al',
chunks=[Barrios et al],
count=1,
rank=0.11566772028535821)
ic| phrase: Phrase(text='text summarization',
chunks=[text summarization],
count=1,
rank=0.09524776232834677)
ic| phrase: Phrase(text='ranking webpages',
chunks=[ranking webpages],
count=1,
rank=0.08260919223940909)
ic| phrase: Phrase(text='Okapi BM25 function',
chunks=[Okapi BM25 function],
count=1,
rank=0.08125840606728206)
ic| phrase: Phrase(text='gensim implements',
chunks=[gensim implements],
count=1,
rank=0.08043607214961235)
ic| phrase: Phrase(text='other sentence',
chunks=[every other sentence],
count=1,
rank=0.07915141312258998)
ic| phrase: Phrase(text='original TextRank',
chunks=[original TextRank],
count=1,
rank=0.07013026654397199)
ic| phrase: Phrase(text='TextRank',
chunks=[TextRank, TextRank, TextRank, TextRank, TextRank],
count=5,
rank=0.06686718957926076)
We could also use Span labels to filter out entities, or certain types of entities (e.g. "CARDINAL" or "DATE"), if we need to do so for our use case:
@spacy.registry.misc("entity_scrubber")
def articles_scrubber():
def scrubber_func(span: Span) -> str:
if span[0].ent_type_:
# ignore named entities
return "INELIGIBLE_PHRASE"
return span.text
return scrubber_func
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank", config={ "stopwords": { "word": ["NOUN"] }, "scrubber": {"@misc": "entity_scrubber"}})
doc = nlp(text)
for phrase in doc._.phrases[:10]:
    if phrase.text != "INELIGIBLE_PHRASE":
        ic(phrase)
ic| phrase: Phrase(text='sentences', chunks=[sentences], count=1, rank=0.14407118792073048)
ic| phrase: Phrase(text='Barrios et al',
chunks=[Barrios et al],
count=1,
rank=0.11566772028535821)
ic| phrase: Phrase(text='text summarization',
chunks=[text summarization],
count=1,
rank=0.09524776232834677)
ic| phrase: Phrase(text='ranking webpages',
chunks=[ranking webpages],
count=1,
rank=0.08260919223940909)
ic| phrase: Phrase(text='gensim implements',
chunks=[gensim implements],
count=1,
rank=0.08043607214961235)
ic| phrase: Phrase(text='every other sentence',
chunks=[every other sentence],
count=1,
rank=0.07915141312258998)
ic| phrase: Phrase(text='original TextRank',
chunks=[original TextRank],
count=1,
rank=0.07013026654397199)
ic| phrase: Phrase(text='every sentence',
chunks=[every sentence],
count=1,
rank=0.06654363130280233)
ic| phrase: Phrase(text='the sentences',
chunks=[the sentences],
count=1,
rank=0.06654363130280233)
GraphViz Export
Let's generate a GraphViz doc lemma_graph.dot to visualize the lemma graph that PyTextRank produced for the most recent document...
tr = doc._.textrank
tr.write_dot(path="lemma_graph.dot")
!ls -lth lemma_graph.dot
-rw-rw-r-- 1 ankush ankush 18K Aug 9 15:06 lemma_graph.dot
!pip install graphviz
Requirement already satisfied: graphviz in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (0.20.1)
To render this graph, you must first download GraphViz: https://www.graphviz.org/download/
Then you can render a DOT file...
import graphviz as gv
gv.Source.from_file("lemma_graph.dot")
Note that the image which gets rendered in a notebook is probably "squished", but other tools can render these as interesting graphs.
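If the in-notebook rendering is too small, the same graphviz package can also write the graph to a standalone image file, which can then be opened at full size in an external viewer. A minimal sketch, assuming the GraphViz binaries are on the path and using an arbitrary output filename:
# render the DOT file to a standalone PNG file, "lemma_graph.png"
src = gv.Source.from_file("lemma_graph.dot")
src.render("lemma_graph", format="png", cleanup=True)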
Altair Visualisation
Let's generate an interactive altair plot to look at the lemma graph.
!pip install "altair"
Requirement already satisfied: altair in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (5.0.1)
Requirement already satisfied: numpy in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (from altair) (1.25.2)
Requirement already satisfied: jsonschema>=3.0 in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (from altair) (4.19.0)
Requirement already satisfied: toolz in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (from altair) (0.12.0)
Requirement already satisfied: typing-extensions>=4.0.1 in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (from altair) (4.7.1)
Requirement already satisfied: pandas>=0.18 in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (from altair) (2.0.3)
Requirement already satisfied: jinja2 in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (from altair) (3.1.2)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (from jsonschema>=3.0->altair) (2023.7.1)
Requirement already satisfied: referencing>=0.28.4 in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (from jsonschema>=3.0->altair) (0.30.2)
Requirement already satisfied: attrs>=22.2.0 in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (from jsonschema>=3.0->altair) (23.1.0)
Requirement already satisfied: rpds-py>=0.7.1 in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (from jsonschema>=3.0->altair) (0.9.2)
Requirement already satisfied: pytz>=2020.1 in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (from pandas>=0.18->altair) (2023.3)
Requirement already satisfied: tzdata>=2022.1 in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (from pandas>=0.18->altair) (2023.3)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (from pandas>=0.18->altair) (2.8.2)
Requirement already satisfied: MarkupSafe>=2.0 in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (from jinja2->altair) (2.1.3)
Requirement already satisfied: six>=1.5 in /home/ankush/workplace/os_repos/pytextrank/venv/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas>=0.18->altair) (1.16.0)
tr = doc._.textrank
tr.plot_keyphrases()
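Assuming plot_keyphrases() returns the Altair chart object (as used here), we can also hold onto it and save it as a self-contained HTML file to share outside the notebook. A small sketch, with an arbitrary output filename:
# save the interactive keyphrase plot as standalone HTML
chart = tr.plot_keyphrases()
chart.save("keyphrases.html")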
Extractive Summarization
Again, working with the most recent document above, we'll summarize based on its top 15 phrases, yielding its top 5 sentences...
for sent in tr.summary(limit_phrases=15, limit_sentences=5):
ic(sent)
ic| sent: First, a quick description of some popular algorithms & implementations for text summarization that exist today: the summarization module in gensim implements TextRank, an unsupervised algorithm based on weighted-graphs from a paper by Mihalcea et al.
ic| sent: It is an improvement from a paper by Barrios et al.
ic| sent: It is built on top of the popular PageRank algorithm that Google used for ranking webpages.
ic| sent: Create a graph where vertices are sentences.
ic| sent: In original TextRank the weights of an edge between two sentences is the percentage of words appearing in both of them.
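If we'd rather have the summary as one block of text instead of individual sentences, one option is simply to join the sentence texts, since each sent yielded above is a sentence span. A minimal sketch:
# join the selected sentences into a single summary string
summary_text = " ".join(
    sent.text.strip()
    for sent in tr.summary(limit_phrases=15, limit_sentences=5)
)
ic(summary_text);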
Using TopicRank
The TopicRank enhanced algorithm is simple to use in the spaCy pipeline, and it supports the other features described above:
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("topicrank");
Let's load an example text:
text = pathlib.Path("../dat/cfc.txt").read_text()
text
" Chelsea 'opted against' signing Salomon Rondón on deadline day.\n\nChelsea reportedly opted against signing Salomón Rondón on deadline day despite their long search for a new centre forward. With Olivier Giroud expected to leave, the Blues targeted Edinson Cavani, Dries Mertens and Moussa Dembele – only to end up with none of them. According to Telegraph Sport, Dalian Yifang offered Rondón to Chelsea only for them to prefer keeping Giroud at the club. Manchester United were also linked with the Venezuela international before agreeing a deal for Shanghai Shenhua striker Odion Ighalo. Manager Frank Lampard made no secret of his transfer window frustration, hinting that to secure top four football he ‘needed’ signings. Their draw against Leicester on Saturday means they have won just four of the last 13 Premier League matches."
doc = nlp(text)
for phrase in doc._.phrases[:10]:
ic(phrase)
ic| phrase: Phrase(text='Salomon Rondón',
chunks=[Salomon Rondón, Salomón Rondón, Rondón],
count=3,
rank=0.07866221348202057)
ic| phrase: Phrase(text='Chelsea',
chunks=[Chelsea, Chelsea, Chelsea],
count=3,
rank=0.06832817272016853)
ic| phrase: Phrase(text='Olivier Giroud',
chunks=[Olivier Giroud, Giroud],
count=2,
rank=0.05574966582168716)
ic| phrase: Phrase(text='deadline day',
chunks=[deadline day, deadline day],
count=2,
rank=0.05008120527495589)
ic| phrase: Phrase(text='Leicester', chunks=[Leicester], count=1, rank=0.039067778208486274)
ic| phrase: Phrase(text='club', chunks=[club], count=1, rank=0.037625206033098234)
ic| phrase: Phrase(text='Edinson Cavani',
chunks=[Edinson Cavani],
count=1,
rank=0.03759951959121995)
ic| phrase: Phrase(text='draw', chunks=[draw], count=1, rank=0.037353607917351345)
ic| phrase: Phrase(text='Manchester United',
chunks=[Manchester United],
count=1,
rank=0.035757812045215435)
ic| phrase: Phrase(text='Dalian Yifang',
chunks=[Dalian Yifang],
count=1,
rank=0.03570018233618092)
Using Biased TextRank
The Biased TextRank enhanced algorithm is simple to use in the spaCy pipeline, and it supports the other features described above:
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("biasedtextrank");
doc = nlp(text)
focus = "Leicester"
doc._.textrank.change_focus(focus, bias=10.0, default_bias=0.0)
for phrase in doc._.phrases[:10]:
ic(phrase)
ic| phrase: Phrase(text='Leicester',
chunks=[Leicester, Leicester],
count=2,
rank=0.26184834028994514)
ic| phrase: Phrase(text='Saturday',
chunks=[Saturday, Saturday],
count=2,
rank=0.13938186779355857)
ic| phrase: Phrase(text='the last 13 Premier League matches',
chunks=[the last 13 Premier League matches],
count=1,
rank=0.12502820319236171)
ic| phrase: Phrase(text='none', chunks=[none], count=1, rank=1.9498221604845646e-07)
ic| phrase: Phrase(text='Moussa Dembele',
chunks=[Moussa Dembele, Moussa Dembele],
count=2,
rank=8.640024414329197e-08)
ic| phrase: Phrase(text='Dries Mertens',
chunks=[Dries Mertens, Dries Mertens],
count=2,
rank=5.152284728493906e-08)
ic| phrase: Phrase(text='Edinson Cavani',
chunks=[Edinson Cavani],
count=1,
rank=3.076049036231119e-08)
ic| phrase: Phrase(text='a new centre',
chunks=[a new centre],
count=1,
rank=2.7737546970070932e-08)
ic| phrase: Phrase(text='the Blues targeted Edinson Cavani',
chunks=[the Blues targeted Edinson Cavani],
count=1,
rank=1.9405864014707633e-08)
ic| phrase: Phrase(text='deadline day',
chunks=[deadline day, deadline day],
count=2,
rank=1.3752326412669907e-08)
The top-ranked phrases from Biased TextRank are closely related to the "focus" item: Leicester.
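Since change_focus() recalculates the phrase ranks for the document, we could re-bias the same document toward a different term without re-running the pipeline. A hedged sketch, picking "Chelsea" as an alternate focus for this text:
# re-bias the ranking toward a different focus term and inspect the shift
doc._.textrank.change_focus("Chelsea", bias=10.0, default_bias=0.0)
for phrase in doc._.phrases[:5]:
    ic(phrase)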
Using PositionRank
The PositionRank enhanced algorithm is simple to use in the spaCy pipeline, and it supports the other features described above:
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("positionrank");
doc = nlp(text)
for phrase in doc._.phrases[:10]:
ic(phrase)
ic| phrase: Phrase(text='deadline day',
chunks=[deadline day, deadline day],
count=2,
rank=0.1671249044190727)
ic| phrase: Phrase(text='Salomon Rondón',
chunks=[Salomon Rondón, Salomon Rondón],
count=2,
rank=0.14836718147498046)
ic| phrase: Phrase(text='Salomón Rondón',
chunks=[Salomón Rondón, Salomón Rondón],
count=2,
rank=0.14169986334846618)
ic| phrase: Phrase(text='Chelsea',
chunks=[Chelsea, Chelsea, Chelsea, Chelsea],
count=4,
rank=0.13419811872859874)
ic| phrase: Phrase(text='Rondón', chunks=[Rondón], count=1, rank=0.12722264594603172)
ic| phrase: Phrase(text='a new centre',
chunks=[a new centre],
count=1,
rank=0.09181159181129885)
ic| phrase: Phrase(text='Giroud', chunks=[Giroud, Giroud], count=2, rank=0.0783201596831592)
ic| phrase: Phrase(text='Olivier Giroud',
chunks=[Olivier Giroud, Olivier Giroud],
count=2,
rank=0.07805316118093475)
ic| phrase: Phrase(text='none', chunks=[none], count=1, rank=0.07503538984105931)
ic| phrase: Phrase(text='their long search',
chunks=[their long search],
count=1,
rank=0.07449683199895643)
The top-ranked phrases from PositionRank are closely related to the "lead" items: Chelsea, deadline day, Salomon Rondón.
Baseline
Now let's re-run this pipeline with the baseline TextRank algorithm to compare results:
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")
doc = nlp(text)
for phrase in doc._.phrases[:10]:
ic(phrase)
ic| phrase: Phrase(text='Shanghai Shenhua striker Odion Ighalo',
chunks=[Shanghai Shenhua striker Odion Ighalo,
Shanghai Shenhua striker Odion Ighalo],
count=2,
rank=0.11863090071749424)
ic| phrase: Phrase(text='none', chunks=[none], count=1, rank=0.09802416183300769)
ic| phrase: Phrase(text='Moussa Dembele',
chunks=[Moussa Dembele, Moussa Dembele],
count=2,
rank=0.09341044332809736)
ic| phrase: Phrase(text='deadline day',
chunks=[deadline day, deadline day],
count=2,
rank=0.09046182507994752)
ic| phrase: Phrase(text='Dries Mertens',
chunks=[Dries Mertens, Dries Mertens],
count=2,
rank=0.08919649435994934)
ic| phrase: Phrase(text='Edinson Cavani',
chunks=[Edinson Cavani],
count=1,
rank=0.08418633972470349)
ic| phrase: Phrase(text='Salomon Rondón',
chunks=[Salomon Rondón, Salomon Rondón],
count=2,
rank=0.08228367707127111)
ic| phrase: Phrase(text='Salomón Rondón',
chunks=[Salomón Rondón, Salomón Rondón],
count=2,
rank=0.08228367707127111)
ic| phrase: Phrase(text='Rondón', chunks=[Rondón], count=1, rank=0.0750732870664833)
ic| phrase: Phrase(text='Dalian Yifang',
chunks=[Dalian Yifang, Dalian Yifang],
count=2,
rank=0.06681675615287698)
The baseline algorithm also picks up the named entities, although it does not emphasize the order in which those entities were introduced in the text.
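To make that comparison concrete, one option is to list the top phrases from the baseline TextRank and PositionRank runs side by side. A small sketch that rebuilds both pipelines so the cell stands alone; the nlp_base and nlp_pos names are just illustrative:
# compare the top-ranked phrase texts: baseline TextRank vs. PositionRank
nlp_base = spacy.load("en_core_web_sm")
nlp_base.add_pipe("textrank")
nlp_pos = spacy.load("en_core_web_sm")
nlp_pos.add_pipe("positionrank")

base_top = [phrase.text for phrase in nlp_base(text)._.phrases[:10]]
pos_top = [phrase.text for phrase in nlp_pos(text)._.phrases[:10]]

for base_phrase, pos_phrase in zip(base_top, pos_top):
    print(f"{base_phrase:45} | {pos_phrase}")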