PyTextRank¶
PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension, for graph-based natural language work and related knowledge graph practices. This includes the following family of textgraph algorithms:
- TextRank by [mihalcea04textrank]
- PositionRank by [florescuc17]
- Biased TextRank by [kazemi-etal-2020-biased]
- TopicRank by [bougouin-etal-2013-topicrank]
Popular use cases for this library include:
- phrase extraction: get the top-ranked phrases from a text document
- low-cost extractive summarization of a text document
- inferring concepts from unstructured text into a more structured representation
The entity linking aspects here are a work-in-progress, based on kglab.
Cut to the Chase¶
- To get started right away, jump to Getting Started
- For a hands-on coding tour through pytextrank, see the Tutorial notebooks
- Check the source code at https://github.com/DerwenAI/pytextrank
Motivations¶
Some modifications in PyTextRank attempt to improve on the base algorithm as originally described in [mihalcea04textrank]:
- fixed a bug: see the Java implementation, 2008
- use lemmatization in place of outdated stemming
- integration with spaCy as a pipeline component factory
- simple extractive summarization based on vector distance from ranked phrases
- leverage preprocessing via noun chunking and named entity recognition
- optionally, include verbs in the graph (although not in the resulting ranked phrases)
The use of graph algorithms within natural language work -- notably, through eigenvector centrality -- helps provide a more flexible and robust basis for integrating additional AI techniques.
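To illustrate the eigenvector centrality idea, here is a minimal, dependency-free sketch: build a co-occurrence graph over tokens, then run power iteration to approximate PageRank-style scores. The whitespace tokenization, window size, and damping factor are simplifying assumptions for illustration, not the library's actual implementation:

```python
def textrank(tokens, window=2, damping=0.85, iters=50):
    # build an undirected co-occurrence graph over the tokens
    neighbors = {}
    for i, tok in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tok != tokens[j]:
                neighbors.setdefault(tok, set()).add(tokens[j])
                neighbors.setdefault(tokens[j], set()).add(tok)

    # power iteration: repeatedly redistribute rank along graph edges,
    # converging toward an eigenvector-centrality-like score per node
    nodes = list(neighbors)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        rank = {
            n: (1 - damping) + damping * sum(
                rank[m] / len(neighbors[m]) for m in neighbors[n]
            )
            for n in nodes
        }
    return sorted(rank.items(), key=lambda kv: -kv[1])

tokens = "graph algorithms rank graph nodes by graph centrality".split()
for term, score in textrank(tokens)[:3]:
    print(term, round(score, 3))
```

Terms that co-occur with many distinct, well-connected terms accumulate higher scores; in the toy input above, the repeated term ends up most central.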
There have been many amazing innovations since late 2017 in the application of deep learning for language models. Most certainly, PyTextRank leverages these kinds of DL models within spaCy 3.x during the earlier stages of processing.
However, using transformers and related DL models throughout all of
the NLP pipeline stages -- while popular -- also tends to imply
certain trade-offs:
- emphasis on predictive power for recognizing sequences
- models which require substantial resources to train, deploy, etc.
- relatively opaque models
- large carbon footprint
- disjoint from leveraging domain expertise
Our experience with textgraphs is that this category of algorithms provides computationally efficient methods which do not require substantial training in advance, and which can import and leverage domain expertise.
Moreover, this approach can be integrated downstream in knowledge graph use cases through embedding methods (deep learning) for complementary, hybrid AI solutions.
Community Resources¶
Links for other open source community resources:
Other good ways to help troubleshoot issues:
- search related discussions on StackOverflow
- tweet to #textrank on Twitter (cc @pacoid)
The Knowledge Graph Conference hosts several community resources where you can post questions and get help about pytextrank and related KG topics.
- community Slack – specifically on the #ask channel
- Graph Data Science group on LinkedIn – join to receive related updates, news, conference coupons, etc.
For related course materials and training, please check for calendar updates in the article "Natural Language Processing in Python".