PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension, for graph-based natural language work -- and related knowledge graph practices. This includes the family of textgraph algorithms:
- TextRank by [mihalcea04textrank]
- PositionRank by [florescuc17]
- Biased TextRank by [kazemi-etal-2020-biased]
- TopicRank by [bougouin-etal-2013-topicrank]
Popular use cases for this library include:
- phrase extraction: get the top-ranked phrases from a text document
- low-cost extractive summarization of a text document
- help infer concepts from unstructured text and map them into a more structured representation
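To make the phrase-ranking use case concrete, here is a minimal sketch of the general TextRank idea in plain Python: link words that co-occur within a small window, then score them with PageRank. It is an illustration under simplifying assumptions, not PyTextRank's implementation: the stopword list, the regex tokenizer, and the function names `build_graph`, `pagerank`, and `rank_words` are all hypothetical, and the library itself relies on spaCy for tokenization, lemmatization, and noun chunks.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "from",
             "in", "is", "of", "on", "the", "to", "with"}


def build_graph(tokens, window=3):
    """Build an undirected graph linking tokens that share a sliding window."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        # Look ahead at the remaining tokens inside the window.
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Score nodes by power iteration over the undirected graph."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbor shares its rank equally among its own neighbors.
            incoming = sum(rank[nbr] / len(graph[nbr]) for nbr in graph[node])
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


def rank_words(text, top_n=5):
    """Return the `top_n` highest-ranked words as (word, score) pairs."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    ranks = pagerank(build_graph(tokens))
    return sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

Calling `rank_words(text)` on a short document returns the words with the highest centrality in the co-occurrence graph. A fuller pipeline would then collapse adjacent high-ranking words into multi-word phrases.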
The entity linking aspects here are a work-in-progress, based on
Cut to the Chase
- To get started right away, jump to Getting Started
- For a hands-on coding tour through pytextrank, see the Tutorial notebooks
- Check the source code at https://github.com/DerwenAI/pytextrank
Some modifications in PyTextRank attempt to improve on the base algorithm as originally described in [mihalcea04textrank]:
- fixed a bug: see Java impl, 2008
- use lemmatization in place of outdated stemming
- integration with spaCy as a pipeline component factory
- simple extractive summarization based on vector distance from ranked phrases
- leverage preprocessing via noun chunking and named entity recognition
- optionally, include verbs in the graph (although not in the resulting ranked phrases)
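The summarization item above can be illustrated with a small sketch: treat the phrase ranks as a vector, then prefer sentences that are "close" to it, where distance grows with the rank of the phrases a sentence is missing. This is a simplified reading of the idea and not necessarily how the library computes it. The function name `summarize`, its parameters, and the substring matching are assumptions made for illustration.

```python
from math import sqrt


def summarize(sentences, ranked_phrases, limit_sentences=2):
    """Pick the sentences nearest to the vector of phrase ranks.

    `sentences` is a list of strings; `ranked_phrases` maps phrase text to a
    rank score. Both are hypothetical inputs for this sketch.
    """
    total = sum(ranked_phrases.values())
    if total == 0:
        # No usable ranks: fall back to the leading sentences.
        return sentences[:limit_sentences]

    # Normalize the ranks so the components sum to 1.
    unit = {phrase: rank / total for phrase, rank in ranked_phrases.items()}

    scored = []
    for index, sentence in enumerate(sentences):
        lowered = sentence.lower()
        # Collect the weights of the phrases this sentence does not contain.
        missing = [w for phrase, w in unit.items()
                   if phrase.lower() not in lowered]
        distance = sqrt(sum(w * w for w in missing))
        scored.append((distance, index))

    # Keep the nearest sentences, then restore document order.
    chosen = sorted(sorted(scored)[:limit_sentences], key=lambda pair: pair[1])
    return [sentences[i] for _, i in chosen]
```

A sentence containing every top-ranked phrase has distance 0, while a sentence containing none sits farthest away, so off-topic sentences drop out of the summary first.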
The use of graph algorithms within natural language work helps provide
a more flexible and robust basis for integrating
additional AI techniques.
There have been many amazing innovations since late 2017
in the application of deep learning for
Most certainly these kinds of DL models get leveraged by
spaCy 3.x during the earlier stages of
However, using transformers and related DL models throughout all of
the NLP pipeline stages -- while popular -- also tends to imply:
- emphasis on predictive power for recognizing sequences
- models which require substantial resources to train, deploy, etc.
- relatively opaque models
- large carbon footprint
- disjoint from leveraging domain expertise
In our experience, textgraph algorithms provide computationally efficient methods that do not require substantial training in advance and that can import and leverage domain expertise.
Moreover, this approach can be integrated downstream in knowledge graph use cases through embedding methods (deep learning) for complementary, hybrid AI solutions.
Links for other open source community resources:
Other good ways to help troubleshoot issues:
- community Slack – specifically on the
- Graph-Based Data Science group on LinkedIn – join to receive related updates, news, conference coupons, etc.
For related course materials and training, please check for calendar updates in the article "Natural Language Processing in Python".