Package Reference: pytextrank

BaseTextRankFactory class

A factory class that provides the document with its instance of BaseTextRank


__call__ method

[source]

__call__(doc)

Set the extension attributes on a spaCy Doc document to create a pipeline component for BaseTextRank as a stateful component, invoked when the document gets processed.

See: https://spacy.io/usage/processing-pipelines#pipelines

  • doc : spacy.tokens.doc.Doc
    a document container, providing the annotations produced by earlier stages of the spaCy pipeline

__init__ method

[source]

__init__(edge_weight=1.0, pos_kept=None, token_lookback=3, scrubber=None, stopwords=None)

Constructor for a factory used to instantiate the PyTextRank pipeline components.

  • edge_weight : float
    default weight for an edge

  • pos_kept : typing.List[str]
    parts of speech tags to be kept; adjust this if strings representing the POS tags change

  • token_lookback : int
    the window for neighboring tokens – similar to a skip-gram

  • scrubber : typing.Union[typing.Callable, NoneType]
    optional "scrubber" function to clean up punctuation from a token; if None then defaults to pytextrank.default_scrubber; when running, PyTextRank will throw a FutureWarning warning if the configuration uses a deprecated approach for a scrubber function

  • stopwords : typing.Union[str, pathlib.Path, typing.Dict[str, typing.List[str]], NoneType]
    optional dictionary of lemma: [pos] items to define the stop words, where each item has a key as a lemmatized token and a value as a list of POS tags; may be a file name (string) or a pathlib.Path for a JSON file; otherwise throws a TypeError exception
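
Since the `stopwords` argument accepts several types, here is a minimal sketch of how such an argument could be normalized into a `lemma: [pos]` dictionary. This is illustrative only – `load_stopwords` is a hypothetical helper, not the library's internal code:

```python
import json
import pathlib

def load_stopwords (stopwords):
    # hypothetical sketch: normalize a stopwords argument into a
    # `lemma: [pos]` dict, mirroring the accepted input types above
    if stopwords is None:
        return {}
    if isinstance(stopwords, dict):
        return stopwords
    if isinstance(stopwords, (str, pathlib.Path)):
        # a file name or Path pointing at a JSON file of lemma: [pos] items
        with open(stopwords, "r") as f:
            return json.load(f)
    raise TypeError(f"cannot parse the stopwords source: {type(stopwords)}")
```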

BaseTextRank class

Implements the TextRank algorithm defined by [mihalcea04textrank], deployed as a spaCy pipeline component.

This class does not get called directly; instantiate its factory instead.


__init__ method

[source]

__init__(doc, edge_weight, pos_kept, token_lookback, scrubber, stopwords)

Constructor for a TextRank object.

  • doc : spacy.tokens.doc.Doc
    a document container, providing the annotations produced by earlier stages of the spaCy pipeline

  • edge_weight : float
    default weight for an edge

  • pos_kept : typing.List[str]
    parts of speech tags to be kept; adjust this if strings representing the POS tags change

  • token_lookback : int
    the window for neighboring tokens – similar to a skip-gram

  • scrubber : typing.Callable
    optional "scrubber" function to clean up punctuation from a token

  • stopwords : typing.Dict[str, typing.List[str]]
    optional dictionary of lemma: [pos] items to define the stop words, where each item has a key as a lemmatized token and a value as a list of POS tags


calc_sent_dist method

[source]

calc_sent_dist(limit_phrases)

For each sentence in the document, calculate its distance from a unit vector of top-ranked phrases.

  • limit_phrases : int
    maximum number of top-ranked phrases to use in the unit vector

  • returns : typing.List[pytextrank.base.Sentence]
    a list of sentence distance measures
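
The distance idea can be sketched in pure Python: a sentence is compared with the unit vector by keeping the coordinates of the top-ranked phrases the sentence does *not* contain, then taking the Euclidean norm. The `sentence_distance` helper below is a hypothetical simplification, not the library's internals:

```python
import math

def sentence_distance (unit_vector, phrase_in_sent):
    # Euclidean distance from the unit vector, counting only the
    # coordinates of top-ranked phrases the sentence does NOT contain
    return math.sqrt(sum(
        coord * coord
        for coord, present in zip(unit_vector, phrase_in_sent)
        if not present
        ))
```

A sentence containing every top-ranked phrase has distance 0.0 and is the strongest summary candidate.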


calc_textrank method

[source]

calc_textrank()

Iterate through each sentence in the doc, constructing a lemma graph, then returning the top-ranked phrases.

This method represents the heart of the TextRank algorithm.

  • returns : typing.List[pytextrank.base.Phrase]
    list of ranked phrases, in descending order
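
The heart of the algorithm can be sketched in pure Python: build a lemma co-occurrence graph over a sliding `token_lookback` window, then rank nodes with PageRank via power iteration. This is a simplified, unweighted sketch under assumed defaults – the library itself builds a weighted graph and delegates ranking to networkx:

```python
def rank_lemmas (lemmas, token_lookback=3, damping=0.85, iterations=50):
    # build an undirected co-occurrence graph over a sliding window
    neighbors = {lemma: set() for lemma in lemmas}
    for i, lemma in enumerate(lemmas):
        for j in range(i + 1, min(i + token_lookback + 1, len(lemmas))):
            if lemmas[j] != lemma:
                neighbors[lemma].add(lemmas[j])
                neighbors[lemmas[j]].add(lemma)

    # unweighted PageRank via power iteration
    nodes = list(neighbors)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        rank = {
            n: (1.0 - damping) / len(nodes) + damping * sum(
                rank[m] / len(neighbors[m]) for m in neighbors[n]
                )
            for n in nodes
            }
    return rank
```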

get_personalization method

[source]

get_personalization()

Get the node weights for initializing the use of the Personalized PageRank algorithm.

Defaults to a no-op for the base TextRank algorithm.

  • returns : typing.Union[typing.Dict[pytextrank.base.Lemma, float], NoneType]
    None

get_unit_vector method

[source]

get_unit_vector(limit_phrases)

Construct a unit vector representing the top-ranked phrases in a spaCy Doc document. This provides a characteristic for comparing each sentence to the entire document. Taking the ranked phrases in descending order, the unit vector is a normalized list of their calculated ranks, up to the specified limit.

  • limit_phrases : int
    maximum number of top-ranked phrases to use in the unit vector

  • returns : typing.List[pytextrank.base.VectorElem]
    the unit vector, as a list of VectorElem objects
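
The normalization step described above can be sketched simply: take the top-ranked scores in descending order, up to the limit, then divide by their sum. A hypothetical sketch (the real method returns `VectorElem` objects, not bare floats):

```python
def unit_vector (ranks, limit_phrases=10):
    # take the top-ranked phrase scores, then normalize so they sum to 1.0
    top = sorted(ranks, reverse=True)[:limit_phrases]
    total = sum(top)
    return [r / total for r in top]
```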


plot_keyphrases method

[source]

plot_keyphrases()

Plot a document's keyphrase rank profile using altair.Chart

Throws an ImportError if the altair and pandas libraries are not installed.

  • returns : typing.Any
    the altair chart being rendered

reset method

[source]

reset()

Reinitialize the data structures needed for extracting phrases, removing any pre-existing state.


segment_paragraphs method

[source]

segment_paragraphs(sent_dist)

Segment a ranked document into paragraphs.

  • sent_dist : typing.List[pytextrank.base.Sentence]
    a list of ranked Sentence data objects

  • returns : typing.List[pytextrank.base.Paragraph]
    a list of Paragraph data objects


summary method

[source]

summary(limit_phrases=10, limit_sentences=4, preserve_order=False, level="sentence")

Run an extractive summarization, based on the vector distance (per sentence) for each of the top-ranked phrases.

  • limit_phrases : int
    maximum number of top-ranked phrases to use in the distance vectors

  • limit_sentences : int
    total number of sentences to yield for the extractive summarization

  • preserve_order : bool
    flag to preserve the order of sentences as they originally occurred in the source text; defaults to False

  • level : str
    level of extractive summarization, either "sentence" (the default) or "paragraph"; when set to "paragraph", compute the average score per paragraph, then sort the paragraphs to produce the summary

  • yields :
    texts for sentences, in order
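
The sentence-level selection can be sketched as: sort sentences by their distance from the unit vector, keep the closest `limit_sentences`, and optionally restore source order. The `extract_summary` helper below is a hypothetical simplification operating on `(sent_id, distance)` pairs, not the library's internals:

```python
def extract_summary (sent_dist, limit_sentences=4, preserve_order=False):
    # keep the sentences closest to the document's unit vector
    top = sorted(sent_dist, key=lambda sd: sd[1])[:limit_sentences]

    if preserve_order:
        # restore the order in which sentences occurred in the source text
        top = sorted(top, key=lambda sd: sd[0])

    return [sent_id for sent_id, _ in top]
```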


write_dot method

[source]

write_dot(path="graph.dot")

Serialize the lemma graph in the Graphviz DOT file format.

  • path : typing.Union[str, pathlib.Path, NoneType]
    path for the output file; defaults to "graph.dot"

PositionRankFactory class

A factory class that provides the document with its instance of PositionRank


__call__ method

[source]

__call__(doc)

Set the extension attributes on a spaCy Doc document to create a pipeline component for PositionRank as a stateful component, invoked when the document gets processed.

See: https://spacy.io/usage/processing-pipelines#pipelines

  • doc : spacy.tokens.doc.Doc
    a document container, providing the annotations produced by earlier stages of the spaCy pipeline

PositionRank class

Implements the PositionRank algorithm described by [florescuc17], deployed as a spaCy pipeline component.

This class does not get called directly; instantiate its factory instead.


get_personalization method

[source]

get_personalization()

Get the node weights for initializing the use of the Personalized PageRank algorithm.

From the cited reference:

Specifically, we propose to assign a higher probability to a word found on the 2nd position as compared with a word found on the 50th position in the same document. The weight of each candidate word is equal to its inverse position in the document. If the same word appears multiple times in the target document, then we sum all its position weights.

For example, a word v_i occurring in the following positions: 2nd, 5th and 10th, has a weight p(v_i) = 1/2 + 1/5 + 1/10 = 4/5 = 0.8. The weights of words are normalized before they are used in the position-biased PageRank.

  • returns : typing.Union[typing.Dict[pytextrank.base.Lemma, float], NoneType]
    biased restart probabilities to use in the PageRank algorithm
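
The weighting scheme quoted above can be sketched directly: sum the inverse positions (1-indexed) of each lemma's occurrences, then normalize. A hypothetical sketch, not the library's internal code:

```python
from collections import defaultdict

def position_weights (lemmas):
    # sum the inverse positions (1-indexed) of each lemma occurrence,
    # so earlier and more frequent words receive higher weight
    raw = defaultdict(float)
    for pos, lemma in enumerate(lemmas, start=1):
        raw[lemma] += 1.0 / pos

    # normalize before use as restart probabilities
    total = sum(raw.values())
    return {lemma: w / total for lemma, w in raw.items()}
```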

BiasedTextRankFactory class

A factory class that provides the document with its instance of BiasedTextRank


__call__ method

[source]

__call__(doc)

Set the extension attributes on a spaCy Doc document to create a pipeline component for BiasedTextRank as a stateful component, invoked when the document gets processed.

See: https://spacy.io/usage/processing-pipelines#pipelines

  • doc : spacy.tokens.doc.Doc
    a document container, providing the annotations produced by earlier stages of the spaCy pipeline

BiasedTextRank class

Implements the Biased TextRank algorithm described by [kazemi-etal-2020-biased], deployed as a spaCy pipeline component.

This class does not get called directly; instantiate its factory instead.


change_focus method

[source]

change_focus(focus=None, bias=1.0, default_bias=1.0)

Re-runs the Biased TextRank algorithm with the given focus. This approach allows an application to "change focus" without re-running the entire pipeline.

  • focus : str
    optional text (string) with space-delimited tokens to use for the focus set; defaults to None

  • bias : float
    optional bias for node weight values on tokens found within the focus set; defaults to 1.0

  • default_bias : float
    optional bias for node weight values on tokens not found within the focus set; set to 0.0 to enhance the focus, especially in the case of long documents; defaults to 1.0

  • returns : typing.List[pytextrank.base.Phrase]
    list of ranked phrases, in descending order
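
The interplay of `bias` and `default_bias` can be sketched as a weighting over node restart probabilities: tokens in the focus set receive `bias`, all others `default_bias`, and the result is normalized. The `biased_weights` helper below is a hypothetical sketch, not the library's internals:

```python
def biased_weights (lemmas, focus, bias=1.0, default_bias=1.0):
    # tokens in the space-delimited focus text get `bias`,
    # everything else gets `default_bias`
    focus_set = set(focus.lower().split())
    raw = {
        lemma: (bias if lemma in focus_set else default_bias)
        for lemma in set(lemmas)
        }

    # normalize into restart probabilities
    total = sum(raw.values())
    return {lemma: w / total for lemma, w in raw.items()}
```

Setting `default_bias=0.0` zeroes out everything outside the focus set, which sharpens the focus for long documents, as noted above.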


get_personalization method

[source]

get_personalization()

Get the node weights for initializing the use of the Personalized PageRank algorithm.

  • returns : typing.Union[typing.Dict[pytextrank.base.Lemma, float], NoneType]
    biased restart probabilities to use in the PageRank algorithm.

Lemma class

A data class representing one node in the lemma graph.


__delattr__ method

[source]

__delattr__(name)

__eq__ method

[source]

__eq__(other)

__ge__ method

[source]

__ge__(other)

__gt__ method

[source]

__gt__(other)

__hash__ method

[source]

__hash__()

__init__ method

[source]

__init__(lemma, pos)

__le__ method

[source]

__le__(other)

__lt__ method

[source]

__lt__(other)

__repr__ method

[source]

__repr__()

__setattr__ method

[source]

__setattr__(name, value)

label method

[source]

label()

Generates a simpler string representation than repr() provides.

  • returns : str
    string representation

Paragraph class

A data class representing the distance measure for one paragraph.


__eq__ method

[source]

__eq__(other)

__init__ method

[source]

__init__(start, end, para_id, distance)

__repr__ method

[source]

__repr__()

Phrase class

A data class representing one ranked phrase.


__eq__ method

[source]

__eq__(other)

__init__ method

[source]

__init__(text, chunks, count, rank)

__repr__ method

[source]

__repr__()

Sentence class

A data class representing the distance measure for one sentence.


__eq__ method

[source]

__eq__(other)

__init__ method

[source]

__init__(start, end, sent_id, phrases, distance)

__repr__ method

[source]

__repr__()

empty method

[source]

empty()

Test whether this sentence includes any ranked phrases.

  • returns : bool
    True if this sentence contains no ranked phrases.

text method

[source]

text(doc)

Accessor for the text slice of the spaCy Doc document represented by this sentence.

  • doc : spacy.tokens.doc.Doc
    source document

  • returns : str
    the sentence text

VectorElem class

A data class representing one element in the unit vector of the document.


__eq__ method

[source]

__eq__(other)

__init__ method

[source]

__init__(phrase, phrase_id, coord)

__repr__ method

[source]

__repr__()

module functions


default_scrubber method

[source]

default_scrubber(span)

Removes spurious punctuation from the given text. Note: this is intended for documents in English.

  • span : spacy.tokens.span.Span
    input text Span

  • returns : str
    scrubbed text


filter_quotes method

[source]

filter_quotes(text, is_email=True)

Filter the quoted text out of an email message. This handles quoting methods for popular email systems.

  • text : str
    raw text data

  • is_email : bool
    flag for whether the text comes from an email message; defaults to True

  • returns : typing.List[str]
    the filtered text, represented as a list of lines


groupby_apply method

[source]

groupby_apply(data, keyfunc, applyfunc)

GroupBy using a key function and an apply function, without a pandas dependency. See: https://docs.python.org/3/library/itertools.html#itertools.groupby

  • data : typing.Iterable[typing.Any]
    iterable

  • keyfunc : typing.Callable
    callable to define the key by which you want to group

  • applyfunc : typing.Callable
    callable to apply to the group

  • returns : typing.List[typing.Tuple[typing.Any, typing.Any]]
    a list of (key, value) tuples with the accumulated values
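
A minimal sketch of this pattern, following the linked itertools.groupby recipe (note that groupby requires sorted input, so the sketch sorts by the key first; `groupby_apply_sketch` is a hypothetical name):

```python
import itertools

def groupby_apply_sketch (data, keyfunc, applyfunc):
    # itertools.groupby only groups consecutive items, so sort by key first
    data = sorted(data, key=keyfunc)
    return [
        (key, applyfunc(group))
        for key, group in itertools.groupby(data, keyfunc)
        ]
```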


maniacal_scrubber method

[source]

maniacal_scrubber(span)

Applies multiple approaches for aggressively removing garbled Unicode and spurious punctuation from the given text.

OH: "It scrubs the garble from its stream... or it gets the debugger again!"

  • span : spacy.tokens.span.Span
    input text Span

  • returns : str
    scrubbed text


split_grafs method

[source]

split_grafs(lines)

Segments a raw text, given as a list of lines, into paragraphs.

  • lines : typing.List[str]
    the raw text document, split into a list of lines

  • yields :
    text per paragraph
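
The behavior described above can be sketched with blank lines as paragraph separators. This is a hypothetical simplification (`split_grafs_sketch`), not the library's implementation:

```python
def split_grafs_sketch (lines):
    # accumulate non-blank lines; a blank line closes the current paragraph
    graf = []
    for line in lines:
        if line.strip():
            graf.append(line.strip())
        elif graf:
            yield " ".join(graf)
            graf = []
    if graf:
        yield " ".join(graf)
```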


module types

StopWordsLike type

StopWordsLike = typing.Union[str, pathlib.Path, typing.Dict[str, typing.List[str]]]

Last update: 2021-07-24