Text::Categorize::Textrank::En - Find potential keywords in English text.
    use strict;
    use warnings;
    use Text::Categorize::Textrank::En;
    use Data::Dump qw(dump);
    my $textrankerEn = Text::Categorize::Textrank::En->new();
    my $text = 'This is the first sentence. Here is the second sentence.';
    my $results = $textrankerEn->getTextrankInfoOfText(listOfText => [$text]);
    dump $results->{hashOfTextrankValues};
Text::Categorize::Textrank::En provides methods for ranking the words in English text as potential keywords. It implements a version of the textrank algorithm from the report TextRank: Bringing Order into Texts by R. Mihalcea and P. Tarau.
Encoding of all text should be in Perl's internal format; see Text::Iconv or Encode for converting text from various encodings.
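As a minimal sketch of the conversion step, assuming the input arrives as UTF-8 encoded bytes (the sample string is illustrative only), the core Encode module can decode it into Perl's internal character format before it is passed to listOfText:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative input: the UTF-8 bytes for "naive cafe" written with accents
# (an assumption for this sketch).
my $bytes = "na\xC3\xAFve caf\xC3\xA9";

# decode() turns the byte string into Perl's internal character string.
my $text = decode('UTF-8', $bytes);

# The byte string has 12 bytes; the decoded string has 10 characters.
printf "%d bytes, %d characters\n", length($bytes), length($text);
```

Text read from a file in another encoding would be decoded the same way, with the encoding name changed to match the source.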
new
The method new creates an instance of the Text::Categorize::Textrank::En class with the following parameters:
endingSentenceTag
endingSentenceTag => 'PP'
endingSentenceTag is the part-of-speech tag that should be used to indicate the end of a sentence. The default is 'PP'. The value of this tag must be a tag generated by the module Lingua::EN::Tagger.
listOfPOSTypesToKeep
listOfPOSTypesToKeep => [qw(TEXTRANK_WORDS)]
The textrank algorithm preprocesses the text so that only certain parts-of-speech (POS) are retained and used to build the graph representing the text. The module Lingua::EN::Tagger is used to tag the parts-of-speech of the text. The parts-of-speech retained can be specified by word types, where the type is a combination of 'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION', 'TEXTRANK_WORDS', or 'VERBS'. The default is [qw(TEXTRANK_WORDS)], which equates to [qw(ADJECTIVES NOUNS)].
listOfPOSTagsToKeep
listOfPOSTagsToKeep => [...]
listOfPOSTagsToKeep provides finer control over the parts-of-speech to be retained when filtering the tagged text. For a list of all the possible tags call getListOfPartOfSpeechTags().
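The constructor parameters above might be combined as in the following sketch. The option values and the sample sentence are only examples, and the guarded load is an assumption of this sketch so that it also runs where the module is not installed:

```perl
use strict;
use warnings;

# Constructor options named in the parameter descriptions above; the chosen
# values are only an example.
my %options = (
    endingSentenceTag    => 'PP',
    listOfPOSTypesToKeep => [qw(NOUNS VERBS)],
);

# Guarded load (an assumption of this sketch) so it also runs where the
# module is not installed.
if (eval { require Text::Categorize::Textrank::En; 1 }) {
    my $textrankerEn = Text::Categorize::Textrank::En->new(%options);
    my $results = $textrankerEn->getTextrankInfoOfText(
        listOfText => ['The quick brown fox jumps over the lazy dog.']);
    print join(', ', sort keys %{ $results->{hashOfTextrankValues} }), "\n";
}
else {
    print "Text::Categorize::Textrank::En is not installed\n";
}
```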
getTextrankInfoOfText
getTextrankInfoOfText (...)
The method getTextrankInfoOfText returns a data structure (hash-reference) containing all the stemmed words partitioned into their sentences (listOfStemmedTaggedSentences), the subset of words used to compute the textranks (listOfFilteredSentences), and the textrank of the tokens (hashOfTextrankValues) that occur in listOfFilteredSentences. The sum of all the textrank values is one.
More precisely, if $results is the returned hash, then $results->{listOfStemmedTaggedSentences} contains the array reference generated by the getStemmedAndTaggedText method of Text::StemTagPOS, $results->{listOfFilteredSentences} contains the array reference generated by getTaggedTextToKeep of Text::StemTagPOS, and $results->{hashOfTextrankValues} holds the hash of the textrank values computed by getTextrankOfListOfTokens. $results->{useStemmedWords} is also set to the value of useStemmedWords.
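A common next step is to list the tokens from highest to lowest textrank. The sketch below uses a hypothetical stand-in for $results->{hashOfTextrankValues}, assuming it maps each token to its textrank value; the tokens and numbers are invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical stand-in for $results->{hashOfTextrankValues}; the tokens and
# values are invented for illustration and sum to one.
my %hashOfTextrankValues = (sentenc => 0.5, first => 0.25, second => 0.25);

# Sort tokens by descending textrank, breaking ties alphabetically.
my @ranked = sort {
    $hashOfTextrankValues{$b} <=> $hashOfTextrankValues{$a} or $a cmp $b
} keys %hashOfTextrankValues;

printf "%-8s %.2f\n", $_, $hashOfTextrankValues{$_} for @ranked;
```

Taking the first few entries of @ranked gives the strongest keyword candidates.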
listOfStemmedTaggedSentences
listOfStemmedTaggedSentences => [...]
listOfStemmedTaggedSentences is the array reference containing the list of stemmed and part-of-speech tagged sentences from Text::StemTagPOS. If listOfStemmedTaggedSentences is not defined, then the text to be processed should be provided via listOfText.
listOfText
listOfText => [...]
listOfText is an array reference containing the strings of text to be categorized. listOfText is only used if listOfStemmedTaggedSentences is undefined.
edgeCreationSpan
edgeCreationSpan => 1
For each word in the text, edgeCreationSpan is the number of successive words used to make an edge in the textrank token graph. For example, if edgeCreationSpan is two, then given the word sequence "apple orange pear" the edges [apple, orange] and [apple, pear] will be added to the token graph for the word apple. The default is one.
Note that loop edges are ignored. For example, if edgeCreationSpan is two, then given the word sequence "daba daba doo" the edge [daba, daba] is discarded but the edge [daba, doo] is added to the token graph.
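The span and loop-edge rules can be sketched in plain Perl. The helper buildEdges is hypothetical and not part of the module; it only mirrors the behavior described above:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): pair each token with the next
# $span tokens, skipping loop edges where both endpoints are the same token.
sub buildEdges {
    my ($tokens, $span) = @_;
    my @edges;
    for my $i (0 .. $#{$tokens}) {
        for my $j ($i + 1 .. $i + $span) {
            last if $j > $#{$tokens};                # ran past the end of the list
            next if $tokens->[$i] eq $tokens->[$j];  # ignore loop edges
            push @edges, [$tokens->[$i], $tokens->[$j]];
        }
    }
    return \@edges;
}

my $fruit = buildEdges([qw(apple orange pear)], 2);
my $daba  = buildEdges([qw(daba daba doo)], 2);
for my $edges ($fruit, $daba) {
    print join(' ', map { "[$_->[0], $_->[1]]" } @$edges), "\n";
}
```

In this sketch the edge [daba, doo] is produced once for each occurrence of daba; a graph keyed by node pairs would store it as a single edge.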
directedGraph
directedGraph => 0
If directedGraph is true, the textranks are computed from the directed token graph; if false, they are computed from the undirected version of the graph. The default is false.
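The difference between the two graph versions can be illustrated with a small sketch. It assumes the undirected version treats every edge as running both ways; the helper buildAdjacency is hypothetical and not part of the module:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): build an adjacency hash from
# edge pairs; for the undirected case each edge is also stored in reverse.
sub buildAdjacency {
    my ($edges, $directed) = @_;
    my %adjacent;
    for my $edge (@$edges) {
        my ($from, $to) = @$edge;
        $adjacent{$from}{$to} = 1;
        $adjacent{$to}{$from} = 1 unless $directed;
    }
    return \%adjacent;
}

my @edges = (['apple', 'orange'], ['apple', 'pear']);
my $directedGraph   = buildAdjacency(\@edges, 1);
my $undirectedGraph = buildAdjacency(\@edges, 0);

# Only the undirected graph has a link from orange back to apple.
printf "directed back link: %s\n",   $directedGraph->{orange}{apple}   ? 'yes' : 'no';
printf "undirected back link: %s\n", $undirectedGraph->{orange}{apple} ? 'yes' : 'no';
```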
pageRankDampeningFactor
pageRankDampeningFactor => 0.85
When computing the textranks of the token graph, the damping factor specified by pageRankDampeningFactor is used; it should be between zero and one. The default is 0.85.
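To show where the factor enters the computation, here is a minimal PageRank sketch using power iteration. It is a generic illustration, not the module's own code; the helper pageRank, the iteration count, and the handling of nodes without out-links are assumptions:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): PageRank by power iteration
# over an adjacency hash { node => { neighbor => 1, ... } }.
sub pageRank {
    my ($adjacent, $damping, $iterations) = @_;
    my @nodes = sort keys %$adjacent;
    my $n     = scalar @nodes;
    my %rank  = map { $_ => 1 / $n } @nodes;
    for (1 .. $iterations) {
        # Every node receives the teleport share (1 - d) / n.
        my %next = map { $_ => (1 - $damping) / $n } @nodes;
        for my $node (@nodes) {
            my @out = keys %{ $adjacent->{$node} };
            # A node with no out-links spreads its rank over all nodes.
            my @targets = @out ? @out : @nodes;
            my $share   = $damping * $rank{$node} / scalar(@targets);
            $next{$_} += $share for @targets;
        }
        %rank = %next;
    }
    return \%rank;
}

# Undirected graph for "apple orange pear" with a span of one.
my %graph = (
    apple  => { orange => 1 },
    orange => { apple => 1, pear => 1 },
    pear   => { orange => 1 },
);
my $rank = pageRank(\%graph, 0.85, 50);
my $sum  = 0;
$sum += $_ for values %$rank;
printf "%-6s %.3f\n", $_, $rank->{$_} for sort keys %$rank;
printf "sum    %.3f\n", $sum;
```

The ranks sum to one, and orange, which is linked to both other words, receives the highest value.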
addEdgesSpanningLists
addEdgesSpanningLists => 1
If addEdgesSpanningLists is true, then when building the token graph, links between the tokens at the end of a list and the beginning of the next list will be made. For example, for the lists [[qw(This is the first list)], [qw(Here is the second list)]] the edge [list, Here] will be added to the token graph. The default is true.
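The effect of this option can be sketched with a span of one. The helper adjacentEdges is hypothetical and not part of the module; it joins the lists before linking neighbors when spanning is enabled:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): link adjacent tokens (span of
# one); when $spanLists is true the lists are joined first, so an edge also
# links the end of one list to the start of the next.
sub adjacentEdges {
    my ($lists, $spanLists) = @_;
    my @runs = $spanLists ? ([map { @$_ } @$lists]) : @$lists;
    my @edges;
    for my $run (@runs) {
        for my $i (0 .. $#{$run} - 1) {
            push @edges, [$run->[$i], $run->[$i + 1]];
        }
    }
    return \@edges;
}

my $lists   = [[qw(This is the first list)], [qw(Here is the second list)]];
my $with    = adjacentEdges($lists, 1);
my $without = adjacentEdges($lists, 0);
printf "%d edges with spanning, %d without\n", scalar @$with, scalar @$without;
```

The one extra edge produced with spanning enabled is [list, Here].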
useStemmedWords
useStemmedWords => 1
If useStemmedWords is true, then when building the token graph, the stemmed words are used as the id of each node, otherwise the original words are used; in both cases the stemmed or original words are converted to lowercase. The default is true.
To install the module run the following commands:
    perl Makefile.PL
    make
    make test
    make install
If you are on a Windows machine, use 'nmake' rather than 'make'.
Please email bug reports or feature requests to bug-text-categorize-textrank-en@rt.cpan.org, or submit them through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Categorize-Textrank-En. The author will be notified, and you can be automatically notified of progress on the bug fix or feature request.
Jeff Kubina <jeff.kubina@gmail.com>
Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
categorize, english, keywords, keyphrases, nlp, pagerank, textrank
Lingua::EN::Tagger, Lingua::Stem::Snowball, Log::Log4perl, Text::Categorize::Textrank, Text::StemTagPOS