Internally, PyTextRank constructs a lemma graph to represent links among the candidate phrases (e.g., unrecognized entities) and also references within the supporting language of the text.
The results from components in earlier stages of the spaCy pipeline produce two important kinds of annotations for each token in a parsed document: its part-of-speech tag and its lemma.
Note that when you have these two annotations plus the disambiguated word sense (i.e., the meaning of a word based on its context and usage), you can map from a token to a concept.
The gist of the TextRank algorithm is to apply a sliding window across the tokens within a parsed sentence, constructing a graph from the lemmatized tokens in which neighbors within the window get linked. Each lemma is unique within the lemma graph, so repeated instances collect more links.
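The sliding-window step can be illustrated with a minimal sketch in plain Python. This is not PyTextRank's own code: the function name `build_lemma_graph`, the `window` parameter, and the sample sentences are assumptions made for illustration, and the input is assumed to be already lemmatized.

```python
from collections import defaultdict


def build_lemma_graph(sentences, window=3):
    """Link lemmas that co-occur within a sliding window.

    `sentences` is a list of sentences, each given as a list of lemma
    strings (assumed to be lemmatized already). `window` is the span
    in tokens, counting the head token itself.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for lemmas in sentences:
        for i, head in enumerate(lemmas):
            # look ahead at the neighbors that fall inside the window
            for tail in lemmas[i + 1 : i + window]:
                if head == tail:
                    continue
                # undirected edge; repeated co-occurrences add weight
                graph[head][tail] += 1
                graph[tail][head] += 1
    return graph


# hypothetical, pre-lemmatized input
sentences = [
    ["cat", "chase", "mouse"],
    ["cat", "sleep", "sofa"],
]
graph = build_lemma_graph(sentences, window=2)
```

Because each lemma is a single node, `cat` appears once in the graph and collects links from both sentences.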
A centrality measure gets calculated for each node in the graph, then the nouns can be ranked in descending order.
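As a rough illustration of the ranking step, the sketch below assumes PageRank as the centrality measure, as in the original TextRank formulation, and computes it by power iteration over a small hand-built graph. The function name, the damping value, and the sample graph are assumptions for illustration, and the part-of-speech filter that keeps only nouns is omitted.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over an undirected, unweighted graph.

    `graph` maps each node to the set of its neighbors; every node is
    assumed to have at least one neighbor.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # each neighbor shares its rank equally among its own links
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# hypothetical lemma graph
graph = {
    "cat": {"chase", "sleep", "mouse"},
    "chase": {"cat", "mouse"},
    "mouse": {"cat", "chase"},
    "sleep": {"cat"},
}
ranks = pagerank(graph)
# lemmas in descending order of centrality
ranked = sorted(ranks, key=ranks.get, reverse=True)
```

In this toy graph the best-connected lemma, `cat`, ranks first and the least-connected, `sleep`, ranks last.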
An additional pass through the graph uses both noun chunks and named entities to help agglomerate adjacent nouns into ranked phrases.
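One simple way to picture this agglomeration step is to score each candidate phrase from the ranks of the lemmas it contains. The sketch below is an assumption for illustration only: the scoring rule (a plain sum), the function name `rank_phrases`, and the sample ranks and chunks are hypothetical, and PyTextRank's actual scoring may differ.

```python
def rank_phrases(chunks, lemma_rank):
    """Score each candidate phrase by summing the ranks of its lemmas.

    `chunks` is a list of candidate phrases (e.g. from noun chunks or
    named entities), each given as a list of lemmas.
    """
    scores = {}
    for chunk in chunks:
        phrase = " ".join(chunk)
        score = sum(lemma_rank.get(lemma, 0.0) for lemma in chunk)
        # keep the best score if the same phrase shows up more than once
        scores[phrase] = max(score, scores.get(phrase, 0.0))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# hypothetical centrality scores and candidate phrases
lemma_rank = {"knowledge": 0.30, "graph": 0.25, "text": 0.20, "summary": 0.10}
chunks = [["knowledge", "graph"], ["text"], ["text", "summary"]]
phrases = rank_phrases(chunks, lemma_rank)
```

A plain sum favors longer phrases, so a real implementation would likely normalize for phrase length.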
Leveraging Semantic Relations
Generally speaking, any means of enriching the lemma graph prior to phrase ranking will tend to improve results.
For example, WordNet and DBpedia both provide means for inferring links among entities, and purpose-built knowledge graphs can be applied for specific use cases. These can help enrich a lemma graph even in cases where links are not explicit within the text.
Consider a paragraph that mentions cats and kittens in different sentences: an implied semantic relation exists between the two nouns, since the lemma kitten is a hyponym of the lemma cat, such that an inferred link can be added between them.
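A minimal sketch of adding such an inferred link follows. The `HYPERNYMS` dictionary is a hypothetical stand-in for a lookup against a resource such as WordNet, and `add_inferred_links` is an illustrative helper, not part of PyTextRank.

```python
# Hypothetical hyponym -> hypernym lookup, standing in for WordNet.
HYPERNYMS = {"kitten": "cat", "puppy": "dog"}


def add_inferred_links(graph, hypernyms):
    """Add an edge between any two lemmas in the graph related by hypernymy.

    `graph` maps each lemma to the set of its neighbors and is updated
    in place. Returns the list of (hyponym, hypernym) edges added.
    """
    added = []
    for lemma in list(graph):
        parent = hypernyms.get(lemma)
        # only link lemmas that both occur in the text already
        if parent in graph and parent not in graph[lemma]:
            graph[lemma].add(parent)
            graph[parent].add(lemma)
            added.append((lemma, parent))
    return added


# "cat" and "kitten" come from different sentences, so no explicit link yet
graph = {
    "cat": {"sleep"},
    "sleep": {"cat"},
    "kitten": {"play"},
    "play": {"kitten"},
}
added = add_inferred_links(graph, HYPERNYMS)
```

After the call, `cat` and `kitten` are connected even though they never shared a window in the text, which changes how centrality flows through the graph.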
One of the motivations for PyTextRank is to provide support (eventually) for entity linking, in contrast to the more commonplace usage of named entity recognition. These approaches can be used together in complementary ways to improve the results overall.
This has an additional benefit of linking parsed and annotated documents into more structured data, and can also be used to support knowledge graph construction.
Note that much better approaches exist for summarizing text. For instance, see https://primer.ai for a commercial example using state-of-the-art abstractive summarization based on a combination of deep learning and knowledge graph approaches.
Even so, there are engineering and policy trade-offs to consider. Arguably, lower-cost alternatives such as PyTextRank allow for a wider range of trade-offs to suit your use cases.
Let us know if you find this package useful, tell us about use cases, describe what else you would like to see integrated, etc.
We're focused on our community and pay special attention to the business use cases. We're also eager to hear your feedback and suggestions for this open source project.