
Note

To run this notebook in JupyterLab, load examples/sample.ipynb

Getting Started

First, we'll import the required libraries and add the PyTextRank component to the spaCy pipeline:

import pytextrank
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")
<pytextrank.base.BaseTextRankFactory at 0x118732710>

Let's take a look at this pipeline now...

nlp.pipe_names
['tok2vec',
 'tagger',
 'parser',
 'ner',
 'attribute_ruler',
 'lemmatizer',
 'textrank']

We can examine the spaCy pipeline in much greater detail...

nlp.analyze_pipes(pretty=True)

============================= Pipeline Overview =============================

#   Component         Assigns               Requires   Scores             Retokenizes
-   ---------------   -------------------   --------   ----------------   -----------
0   tok2vec           doc.tensor                                          False

1   tagger            token.tag                        tag_acc            False

2   parser            token.dep                        dep_uas            False      
                      token.head                       dep_las                       
                      token.is_sent_start              dep_las_per_type              
                      doc.sents                        sents_p                       
                                                       sents_r                       
                                                       sents_f

3   ner               doc.ents                         ents_f             False      
                      token.ent_iob                    ents_p                        
                      token.ent_type                   ents_r                        
                                                       ents_per_type

4   attribute_ruler                                                       False

5   lemmatizer        token.lemma                      lemma_acc          False

6   textrank                                                              False

✔ No problems found.

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'textrank': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'ner': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'textrank': []},
 'attrs': {'token.is_sent_start': {'assigns': ['parser'], 'requires': []},
  'token.ent_type': {'assigns': ['ner'], 'requires': []},
  'doc.sents': {'assigns': ['parser'], 'requires': []},
  'doc.ents': {'assigns': ['ner'], 'requires': []},
  'doc.tensor': {'assigns': ['tok2vec'], 'requires': []},
  'token.lemma': {'assigns': ['lemmatizer'], 'requires': []},
  'token.tag': {'assigns': ['tagger'], 'requires': []},
  'token.head': {'assigns': ['parser'], 'requires': []},
  'token.ent_iob': {'assigns': ['ner'], 'requires': []},
  'token.dep': {'assigns': ['parser'], 'requires': []}}}

Next, let's load some text from a document:

from icecream import ic
import pathlib

text = pathlib.Path("../dat/mih.txt").read_text()
text
'Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.\n'

Then run the spaCy pipeline...

doc = nlp(text)
len(doc)
92

Now we can access the PyTextRank component within the spaCy pipeline, and use it to get more information for post-processing of the document. For example, let's see what the elapsed time in milliseconds was for the TextRank processing:

tr = doc._.textrank
ic(tr.elapsed_time);
ic| tr.elapsed_time: 22.101879119873047

Let's examine the top-ranked phrases in the document...

for phrase in doc._.phrases:
    ic(phrase.rank, phrase.count, phrase.text)
    ic(phrase.chunks)
ic| phrase.rank: 0.18359439311764025
    phrase.count: 1
    phrase.text: 'mixed types'
ic| phrase.chunks: [mixed types]
ic| phrase.rank: 0.17847961931078207
    phrase.count: 3
    phrase.text: 'systems'
ic| phrase.chunks: [systems, systems, systems]
ic| phrase.rank: 0.15037838042245094
    phrase.count: 1
    phrase.text: 'minimal generating sets'
ic| phrase.chunks: [minimal generating sets]
ic| phrase.rank: 0.14740065982407316
    phrase.count: 1
    phrase.text: 'nonstrict inequations'
ic| phrase.chunks: [nonstrict inequations]
ic| phrase.rank: 0.13946027725597837
    phrase.count: 1
    phrase.text: 'strict inequations'
ic| phrase.chunks: [strict inequations]
ic| phrase.rank: 0.1195023546245721
    phrase.count: 1
    phrase.text: 'linear Diophantine equations'
ic| phrase.chunks: [linear Diophantine equations]
ic| phrase.rank: 0.11450088293222845
    phrase.count: 1
    phrase.text: 'natural numbers'
ic| phrase.chunks: [natural numbers]
ic| phrase.rank: 0.1078071817368632
    phrase.count: 3
    phrase.text: 'solutions'
ic| phrase.chunks: [solutions, solutions, solutions]
ic| phrase.rank: 0.10529828014583348
    phrase.count: 1
    phrase.text: 'linear constraints'
ic| phrase.chunks: [linear constraints]
ic| phrase.rank: 0.10369605907081418
    phrase.count: 1
    phrase.text: 'all the considered types systems'
ic| phrase.chunks: [all the considered types systems]
ic| phrase.rank: 0.08812713074893187
    phrase.count: 1
    phrase.text: 'a minimal supporting set'
ic| phrase.chunks: [a minimal supporting set]
ic| phrase.rank: 0.08243620500315357
    phrase.count: 1
    phrase.text: 'a system'
ic| phrase.chunks: [a system]
ic| phrase.rank: 0.07944607954086784
    phrase.count: 1
    phrase.text: 'a minimal set'
ic| phrase.chunks: [a minimal set]
ic| phrase.rank: 0.0763527926213032
    phrase.count: 1
    phrase.text: 'algorithms'
ic| phrase.chunks: [algorithms]
ic| phrase.rank: 0.07593126037016427
    phrase.count: 1
    phrase.text: 'all types'
ic| phrase.chunks: [all types]
ic| phrase.rank: 0.07309361902551356
    phrase.count: 1
    phrase.text: 'Diophantine'
ic| phrase.chunks: [Diophantine]
ic| phrase.rank: 0.0702090100898443
    phrase.count: 1
    phrase.text: 'construction'
ic| phrase.chunks: [construction]
ic| phrase.rank: 0.060225391238828516
    phrase.count: 1
    phrase.text: 'Upper bounds'
ic| phrase.chunks: [Upper bounds]
ic| phrase.rank: 0.05800111772673988
    phrase.count: 1
    phrase.text: 'the set'
ic| phrase.chunks: [the set]
ic| phrase.rank: 0.05425139476531647
    phrase.count: 1
    phrase.text: 'components'
ic| phrase.chunks: [components]
ic| phrase.rank: 0.04516904342912139
    phrase.count: 1
    phrase.text: 'Compatibility'
ic| phrase.chunks: [Compatibility]
ic| phrase.rank: 0.04516904342912139
    phrase.count: 1
    phrase.text: 'compatibility'
ic| phrase.chunks: [compatibility]
ic| phrase.rank: 0.04435648606848154
    phrase.count: 1
    phrase.text: 'the corresponding algorithms'
ic| phrase.chunks: [the corresponding algorithms]
ic| phrase.rank: 0.042273783712246285
    phrase.count: 1
    phrase.text: 'Criteria'
ic| phrase.chunks: [Criteria]
ic| phrase.rank: 0.01952542432474353
    phrase.count: 1
    phrase.text: 'These criteria'
ic| phrase.chunks: [These criteria]
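
Since doc._.phrases is simply a Python list of Phrase objects, already sorted by descending rank, it's easy to reshape for downstream analysis. Here's a minimal sketch that collects the phrases into a pandas DataFrame — the pandas dependency is our own assumption here, not something PyTextRank requires:

import pandas as pd

# each Phrase object provides .text, .count, and .rank attributes;
# collect these into a DataFrame for sorting and filtering
df = pd.DataFrame(
    [(phrase.text, phrase.count, phrase.rank) for phrase in doc._.phrases],
    columns=["text", "count", "rank"],
)
df.head()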

Stop Words

To show the use of the stop words feature, first we'll output a baseline...

text = pathlib.Path("../dat/gen.txt").read_text()
doc = nlp(text)

for phrase in doc._.phrases[:10]:
    ic(phrase)
ic| phrase: Phrase(text='words', chunks=[words, words], count=2, rank=0.16291261251069125)
ic| phrase: Phrase(text='sentences', chunks=[sentences], count=1, rank=0.1306067022269487)
ic| phrase: Phrase(text='Mihalcea et al', chunks=[Mihalcea et al], count=1, rank=0.10483404051843853)
ic| phrase: Phrase(text='the remaining words', chunks=[the remaining words], count=1, rank=0.09665514724405054)
ic| phrase: Phrase(text='text summarization', chunks=[text summarization], count=1, rank=0.0890015856562531)
ic| phrase: Phrase(text='gensim implements', chunks=[gensim implements], count=1, rank=0.07756577483242053)
ic| phrase: Phrase(text='Okapi BM25 function', chunks=[Okapi BM25 function], count=1, rank=0.07659857245638771)
ic| phrase: Phrase(text='every other sentence', chunks=[every other sentence], count=1, rank=0.07121330443014147)
ic| phrase: Phrase(text='Okapi BM25', chunks=[Okapi BM25], count=1, rank=0.06637129156118694)
ic| phrase: Phrase(text='webpages', chunks=[webpages], count=1, rank=0.06543620998485791)

Notice how the top-ranked phrase above is "words"? Let's add that lemma to our stop words list, to exclude it from the ranked phrases...

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank", config={ "stopwords": { "word": ["NOUN"] } })

doc = nlp(text)

for phrase in doc._.phrases[:10]:
    ic(phrase)
ic| phrase: Phrase(text='sentences', chunks=[sentences], count=1, rank=0.14589095939167795)
ic| phrase: Phrase(text='Mihalcea et al', chunks=[Mihalcea et al], count=1, rank=0.11251869467335668)
ic| phrase: Phrase(text='text summarization', chunks=[text summarization], count=1, rank=0.09552418217461901)
ic| phrase: Phrase(text='gensim implements', chunks=[gensim implements], count=1, rank=0.08324627350503458)
ic| phrase: Phrase(text='Okapi BM25 function', chunks=[Okapi BM25 function], count=1, rank=0.08221256597903477)
ic| phrase: Phrase(text='every other sentence', chunks=[every other sentence], count=1, rank=0.08004789962776512)
ic| phrase: Phrase(text='Okapi BM25', chunks=[Okapi BM25], count=1, rank=0.07123533671820845)
ic| phrase: Phrase(text='webpages', chunks=[webpages], count=1, rank=0.0702329645089719)
ic| phrase: Phrase(text='TextRank', chunks=[TextRank, TextRank, TextRank, TextRank, TextRank], count=5, rank=0.06770669955429025)
ic| phrase: Phrase(text='every sentence', chunks=[every sentence], count=1, rank=0.06738414774169442)

For each entry, you'll need to add a key that is the lemma ("word" in this case) and a value that is the list of part-of-speech tags for which that lemma should be excluded (["NOUN"] here).
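
For instance, here's a minimal sketch that excludes two lemmas at once — the "sentence" entry is purely hypothetical, added just to illustrate multiple entries:

nlp = spacy.load("en_core_web_sm")

# each stop word entry maps a lemma to the POS tags to exclude
nlp.add_pipe(
    "textrank",
    config={
        "stopwords": {
            "word": ["NOUN"],      # exclude "word" when tagged as a noun
            "sentence": ["NOUN"],  # hypothetical second entry
        }
    },
)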

GraphViz Export

Let's generate a GraphViz DOT file, lemma_graph.dot, to visualize the lemma graph that PyTextRank produced for the most recent document...

tr = doc._.textrank
tr.write_dot(path="lemma_graph.dot")
!ls -lth lemma_graph.dot
-rw-r--r--  1 paco  staff    17K May  4 16:41 lemma_graph.dot
!pip install graphviz
Requirement already satisfied: graphviz in /Users/paco/src/pytextrank/venv/lib/python3.7/site-packages (0.16)

To render this graph, you must first install the GraphViz binaries: https://www.graphviz.org/download/

Then you can render a DOT file...

import graphviz as gv

gv.Source.from_file("lemma_graph.dot")

(the lemma graph renders here as an SVG image)

Note that the image which gets rendered in a notebook is probably "squished", but other tools can render these files as interesting graphs.
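
For example, here's a minimal sketch that renders the DOT file to a PNG on disk instead of displaying it inline — the lemma_graph output filename is arbitrary:

import graphviz as gv

# render lemma_graph.dot out to lemma_graph.png;
# cleanup=True removes the intermediate DOT copy that render() writes
gv.Source.from_file("lemma_graph.dot").render(
    "lemma_graph", format="png", cleanup=True
)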

Altair Visualization

Let's generate an interactive altair plot to explore the ranked phrases.

!pip install "altair"
Requirement already satisfied: altair in /Users/paco/src/pytextrank/venv/lib/python3.7/site-packages (4.1.0)

tr = doc._.textrank
tr.plot_keyphrases()
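
Assuming plot_keyphrases() returns an altair Chart object (which the altair dependency suggests), you can also save the interactive plot as a standalone HTML file — a minimal sketch, with an arbitrary output filename:

# save the interactive chart so it can be shared outside the notebook
chart = tr.plot_keyphrases()
chart.save("keyphrases.html")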

Extractive Summarization

Again, working with the most recent document above, we'll summarize based on its top 15 phrases, yielding its top 5 sentences...

for sent in tr.summary(limit_phrases=15, limit_sentences=5):
    ic(sent)
ic| sent: First, a quick description of some popular algorithms & implementations for text summarization that exist today: the summarization module in gensim implements TextRank, an unsupervised algorithm based on weighted-graphs from a paper by Mihalcea et al.
ic| sent: Create a graph where vertices are sentences.
ic| sent: Gensim’s TextRank uses Okapi BM25 function to see how similar the sentences are.
ic| sent: It is built on top of the popular PageRank algorithm that Google used for ranking webpages.
ic| sent: Connect every sentence to every other sentence by an edge.
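
The summary is an iterator of spaCy Span objects, so assembling the selected sentences into a single summary string is straightforward — a minimal sketch:

# join the selected sentences into one string;
# strip() trims stray whitespace around each sentence
summary_text = " ".join(
    sent.text.strip()
    for sent in tr.summary(limit_phrases=15, limit_sentences=5)
)
print(summary_text)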

Using PositionRank

The enhanced PositionRank algorithm is simple to use in the spaCy pipeline, and it supports all of the other features described above:

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("positionrank")
<pytextrank.positionrank.PositionRankFactory at 0x11a399f10>

Let's load an example text:

text = pathlib.Path("../dat/cfc.txt").read_text()
text
" Chelsea 'opted against' signing Salomon Rondón on deadline day.\n\nChelsea reportedly opted against signing Salomón Rondón on deadline day despite their long search for a new centre forward. With Olivier Giroud expected to leave, the Blues targeted Edinson Cavani, Dries Mertens and Moussa Dembele – only to end up with none of them. According to Telegraph Sport, Dalian Yifang offered Rondón to Chelsea only for them to prefer keeping Giroud at the club. Manchester United were also linked with the Venezuela international before agreeing a deal for Shanghai Shenhua striker Odion Ighalo. Manager Frank Lampard made no secret of his transfer window frustration, hinting that to secure top four football he ‘needed’ signings. Their draw against Leicester on Saturday means they have won just four of the last 13 Premier League matches."
doc = nlp(text)

for phrase in doc._.phrases[:10]:
    ic(phrase)
ic| phrase: Phrase(text='deadline day', chunks=[deadline day, deadline day], count=2, rank=0.16712490441907274)
ic| phrase: Phrase(text='Salomon Rondón', chunks=[Salomon Rondón, Salomon Rondón], count=2, rank=0.14836718147498051)
ic| phrase: Phrase(text='Salomón Rondón', chunks=[Salomón Rondón, Salomón Rondón], count=2, rank=0.14169986334846624)
ic| phrase: Phrase(text='Chelsea', chunks=[Chelsea, Chelsea, Chelsea, Chelsea, Chelsea, Chelsea], count=6, rank=0.13419811872859877)
ic| phrase: Phrase(text='Rondón', chunks=[Rondón], count=1, rank=0.12722264594603178)
ic| phrase: Phrase(text='a new centre', chunks=[a new centre], count=1, rank=0.09181159181129886)
ic| phrase: Phrase(text='Giroud', chunks=[Giroud, Giroud], count=2, rank=0.07832015968315921)
ic| phrase: Phrase(text='Olivier Giroud', chunks=[Olivier Giroud, Olivier Giroud], count=2, rank=0.07805316118093476)
ic| phrase: Phrase(text='none', chunks=[none], count=1, rank=0.07503538984105934)
ic| phrase: Phrase(text='their long search', chunks=[their long search], count=1, rank=0.07449683199895644)

The top-ranked phrases from PositionRank are closely related to the "lead" items: Chelsea, deadline day, Salomon Rondón.

Now let's re-run this pipeline with TextRank to compare results:

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")
doc = nlp(text)

for phrase in doc._.phrases[:10]:
    ic(phrase)
ic| phrase: Phrase(text='Shanghai Shenhua striker Odion Ighalo', chunks=[Shanghai Shenhua striker Odion Ighalo], count=1, rank=0.11863090071749424)
ic| phrase: Phrase(text='none', chunks=[none], count=1, rank=0.09802416183300769)
ic| phrase: Phrase(text='Moussa Dembele', chunks=[Moussa Dembele, Moussa Dembele], count=2, rank=0.09341044332809736)
ic| phrase: Phrase(text='deadline day', chunks=[deadline day, deadline day], count=2, rank=0.09046182507994752)
ic| phrase: Phrase(text='Dries Mertens', chunks=[Dries Mertens, Dries Mertens], count=2, rank=0.08919649435994934)
ic| phrase: Phrase(text='Edinson Cavani', chunks=[Edinson Cavani, Edinson Cavani], count=2, rank=0.08418633972470349)
ic| phrase: Phrase(text='Shanghai Shenhua', chunks=[Shanghai Shenhua], count=1, rank=0.08254442709505863)
ic| phrase: Phrase(text='Salomon Rondón', chunks=[Salomon Rondón, Salomon Rondón], count=2, rank=0.08228367707127111)
ic| phrase: Phrase(text='Salomón Rondón', chunks=[Salomón Rondón, Salomón Rondón], count=2, rank=0.08228367707127111)
ic| phrase: Phrase(text='Premier League', chunks=[Premier League], count=1, rank=0.08198820712767879)

The baseline TextRank algorithm also picks up the named entities, although it does not emphasize the order in which those entities were introduced in the text.
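
To make comparisons like this more direct, here's a minimal sketch that runs both pipelines over the same text and prints their top phrases side by side — the top_phrases helper is our own, purely for illustration:

def top_phrases(pipe_name, text, n=5):
    # build a fresh pipeline with the given rank component,
    # then return the texts of its top-n ranked phrases
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe(pipe_name)
    doc = nlp(text)
    return [phrase.text for phrase in doc._.phrases[:n]]

for base, pos in zip(top_phrases("textrank", text), top_phrases("positionrank", text)):
    print(f"{base:40} {pos}")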

