The Article Title
This is the first paragraph of the summary.
This is the second paragraph of the summary.
First paragraph of this section.
Second paragraph of this section.
First paragraph of this section.
Second paragraph of this section.
First paragraph of this sub-section.
Second paragraph of this sub-section.
EXAMPLE
The following example computes and prints the median, mean, and standard
deviation of the fraction of distinct words in a summary (repeats are
ignored) that also occur in the body of the text, taken over every
article in the corpus.
use strict;
use warnings;
use Text::Corpus::Summaries::Wikipedia;
use Statistics::Descriptive;
use File::Slurp;
use Encode;

my $corpus = Text::Corpus::Summaries::Wikipedia->new;
my $statistics = Statistics::Descriptive::Full->new;
foreach my $textFilePair (@{$corpus->getListOfTextFiles})
{
  # Collect the distinct lower-cased words of the summary; grep drops the
  # empty strings that split returns for leading non-letter characters.
  my $summary = lc decode ('utf8', read_file ($textFilePair->{summary}, binmode => ':raw'));
  my %summaryWords = map {($_, 1)} grep {length} split (/\P{Letter}+/, $summary);
  my $totalUniqueSummaryWords = keys %summaryWords;
  next unless $totalUniqueSummaryWords;

  # Delete every word that also occurs in the body; the words left over
  # appear only in the summary.
  my $body = lc decode ('utf8', read_file ($textFilePair->{body}, binmode => ':raw'));
  delete @summaryWords{split (/\P{Letter}+/, $body)};
  my $totalUniqueSummaryWordsNotInBody = keys %summaryWords;
  $statistics->add_data (1 - $totalUniqueSummaryWordsNotInBody / $totalUniqueSummaryWords);
}

# Report the statistics accumulated over all articles.
print 'count: ', $statistics->count(), "\n";
print 'median: ', $statistics->median(), "\n";
print 'mean: ', $statistics->mean(), "\n";
print 'standard deviation: ', $statistics->standard_deviation(), "\n";
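The per-article measure used in the example above can also be tried on in-memory strings, without the corpus files or any non-core modules. The following is only a minimal sketch under that assumption: the subroutine name overlap_fraction is hypothetical and is not part of Text::Corpus::Summaries::Wikipedia.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): returns the fraction of
# distinct summary words that also occur in the body, or nothing when the
# summary contains no words.
sub overlap_fraction
{
  my ($summary, $body) = @_;

  # Distinct lower-cased words; grep drops the empty leading field that
  # split can return when the text starts with a non-letter.
  my %summaryWords = map {($_, 1)} grep {length} split (/\P{Letter}+/, lc $summary);
  my $totalUniqueSummaryWords = keys %summaryWords;
  return unless $totalUniqueSummaryWords;

  my %bodyWords = map {($_, 1)} grep {length} split (/\P{Letter}+/, lc $body);

  # grep in scalar context counts the summary words found in the body.
  my $shared = grep {exists $bodyWords{$_}} keys %summaryWords;
  return $shared / $totalUniqueSummaryWords;
}

# Summary words: the, quick, cat, sat; all but "quick" occur in the body.
print overlap_fraction ('The quick cat sat.', 'A cat sat on the mat.'), "\n"; # prints 0.75
```

Counting the shared words directly, rather than deleting body words from the summary hash, gives the same fraction and leaves both word sets intact for further inspection.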
SCRIPTS
The script create_summary_corpus.pl makes a corpus for summarization
testing using this module.
INSTALLATION
Use CPAN to install the module and all its prerequisites:
perl -MCPAN -e shell
>install Text::Corpus::Summaries::Wikipedia
BUGS
This module creates corpora by parsing Wikipedia pages. The XPath
expressions used to extract links and text will become invalid as the
format of those pages changes, which will cause some corpora not to be
created.
Please email bug reports or feature requests to
"bug-text-corpus-summaries-wikipedia@rt.cpan.org", or submit them through
the web interface at