PyTextRank¶
PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension, for graph-based natural language work and related knowledge graph practices. This includes the following family of textgraph algorithms:
- TextRank by [mihalcea04textrank]
- PositionRank by [florescuc17]
- Biased TextRank by [kazemi-etal-2020-biased]
- TopicRank by [bougouin-etal-2013-topicrank]
Popular use cases for this library include:
- phrase extraction: get the top-ranked phrases from a text document
- low-cost extractive summarization of a text document
- inferring concepts from unstructured text into a more structured representation
The entity linking aspects here are a work-in-progress, based on kglab.
Cut to the Chase¶
- To get started right away, jump to Getting Started
- For a hands-on coding tour through pytextrank, see the Tutorial notebooks
- Check the source code at https://github.com/DerwenAI/pytextrank
Motivations¶
Some modifications in PyTextRank attempt to improve on the base algorithm as originally described in [mihalcea04textrank]:
- fixed a bug: see the Java implementation, 2008
- use lemmatization in place of outdated stemming
- integration with spaCy as a pipeline component factory
- simple extractive summarization based on vector distance from ranked phrases
- leverage preprocessing via noun chunking and named entity recognition
- optionally, include verbs in the graph (although not in the resulting ranked phrases)
The use of graph algorithms within natural language work -- notably, through eigenvector centrality -- helps provide a more flexible and robust basis for integrating additional AI techniques.
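To illustrate the eigenvector centrality idea, here is a minimal, dependency-free sketch: build a co-occurrence graph over tokens, then run power iteration to approximate PageRank-style scores. The whitespace tokenization, window size, and damping factor are simplifying assumptions for illustration, not the library's actual implementation:

```python
def textrank(tokens, window=2, damping=0.85, iters=50):
    # build an undirected co-occurrence graph over the tokens
    neighbors = {}
    for i, tok in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tok != tokens[j]:
                neighbors.setdefault(tok, set()).add(tokens[j])
                neighbors.setdefault(tokens[j], set()).add(tok)

    # power iteration: repeatedly redistribute rank along graph edges,
    # converging toward an eigenvector-centrality-like score per node
    nodes = list(neighbors)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        rank = {
            n: (1 - damping) + damping * sum(
                rank[m] / len(neighbors[m]) for m in neighbors[n]
            )
            for n in nodes
        }
    return sorted(rank.items(), key=lambda kv: -kv[1])

tokens = "graph algorithms rank graph nodes by graph centrality".split()
for term, score in textrank(tokens)[:3]:
    print(term, round(score, 3))
```

Terms that co-occur with many distinct, well-connected terms accumulate higher scores; in the toy input above, the repeated term ends up most central.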
There have been many amazing innovations since late 2017 in the application of deep learning for language models. Most certainly, PyTextRank leverages these kinds of DL models within spaCy 3.x during the earlier stages of processing.
However, using transformers and related DL models throughout all of
the NLP pipeline stages -- while popular -- also tends to imply
certain trade-offs:
- emphasis on predictive power for recognizing sequences
- models which require substantial resources to train, deploy, etc.
- relatively opaque models
- large carbon footprint
- disjoint from leveraging domain expertise
Our experience with textgraphs is that this category of algorithms provides computationally efficient methods which do not require substantial training in advance, and which can import and leverage domain expertise.
Moreover, this approach can be integrated downstream in knowledge graph use cases through embedding methods (deep learning) for complementary, hybrid AI solutions.
Community Resources¶
Links for other open source community resources:
Other good ways to help troubleshoot issues:
- search related discussions on StackOverflow
- tweet to #textrank on Twitter (cc @pacoid)
The Knowledge Graph Conference hosts several community resources where you can post questions and get help about pytextrank and related KG topics.
- community Slack – specifically on the #ask channel
- Graph Data Science group on LinkedIn – join to receive related updates, news, conference coupons, etc.
For related course materials and training, please check for calendar updates in the article "Natural Language Processing in Python".