NAME
Lingua::Stem::UniNE - University of Neuchâtel stemmers
VERSION
This document describes Lingua::Stem::UniNE v0.04.
SYNOPSIS
use Lingua::Stem::UniNE;
# create Bulgarian stemmer
$stemmer = Lingua::Stem::UniNE->new(language => 'bg');
# get stem for word
$stem = $stemmer->stem($word);
# get list of stems for list of words
@stems = $stemmer->stem(@words);
DESCRIPTION
This module contains a collection of stemmers for multiple languages
based on stemming algorithms provided by Jacques Savoy of the University
of Neuchâtel (UniNE). The languages currently implemented are Bulgarian,
Czech, and Persian. Work is ongoing for Arabic, Bengali, Finnish,
French, German, Hindi, Hungarian, Italian, Portuguese, Marathi, Russian,
Spanish, and Swedish. Priority goes to languages for which no stemmer
is yet available on CPAN.
Attributes
language
The following language codes are currently supported.
┌───────────┬────┐
│ Bulgarian │ bg │
│ Czech     │ cs │
│ Persian   │ fa │
└───────────┴────┘
Codes use the two-letter ISO 639-1 format. They are case-insensitive
when set and are always returned in lowercase when requested.
# instantiate a stemmer object
$stemmer = Lingua::Stem::UniNE->new(language => $language);
# get current language
$language = $stemmer->language;
# change language
$stemmer->language($language);
Country codes such as "cz" for the Czech Republic are not supported,
nor are IETF language tags such as "fa-AF" or "fa-IR".
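Because IETF language tags are not accepted, callers that receive tags
such as "fa-IR" need to reduce them to a bare language code first. The
following is a minimal sketch of one way to do that. The helper name
primary_language is hypothetical and is not part of this module. It only
extracts and lowercases the primary subtag, so it does not turn a country
code such as "cz" into "cs".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Stem::UniNE): reduce an IETF
# language tag such as "fa-IR" to its primary language subtag in lowercase.
# Returns undef when the input does not start with a 2-3 letter subtag.
sub primary_language {
    my ($tag) = @_;
    return undef
        unless defined $tag && $tag =~ /\A([A-Za-z]{2,3})(?:[-_].*)?\z/s;
    return lc $1;
}

print primary_language('fa-IR'), "\n";  # prints "fa"
print primary_language('BG'), "\n";     # prints "bg"
```

The result could then be passed as the language attribute, for example
Lingua::Stem::UniNE->new(language => primary_language($tag)), after
confirming that it is one of the supported codes.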
Methods
stem
Accepts a list of words, stems each word, and returns a list of
stems. The list returned will always have the same number of
elements in the same order as the list provided. When no stemming
rules apply to a word, the original word is returned.
@stems = $stemmer->stem(@words);
# get the stem for a single word
$stem = $stemmer->stem($word);
The words should be provided as character strings and the stems are
returned as character strings. Byte strings in arbitrary character
encodings are intentionally not supported.
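When input arrives as encoded bytes (for example, UTF-8 read from a file
or socket without a decoding layer), it has to be decoded before stemming
and the stems encoded again on output. The sketch below shows one way to
wrap that round trip using the core Encode module. The wrapper name
stem_utf8_bytes is hypothetical, and it assumes only that $stemmer has a
stem method that maps a list of words to a list of stems, as described
above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical wrapper (not part of Lingua::Stem::UniNE): decode UTF-8 byte
# strings into character strings, stem them, and encode the stems back into
# UTF-8 byte strings. $stemmer is any object with a list-to-list stem method.
sub stem_utf8_bytes {
    my ($stemmer, @bytes) = @_;

    # bytes -> characters, so the stemmer sees whole letters
    my @words = map { decode('UTF-8', $_) } @bytes;

    my @stems = $stemmer->stem(@words);

    # characters -> bytes for output
    return map { encode('UTF-8', $_) } @stems;
}
```

In most programs it is simpler to decode once at the input boundary, for
example by opening files with an :encoding(UTF-8) layer, and to work with
character strings throughout.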
languages
Returns a list of supported two-letter language codes using
lowercase letters.
# object method
@languages = $stemmer->languages;
# class method
@languages = Lingua::Stem::UniNE->languages;
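The returned list can be used to validate a language code before
constructing a stemmer. The following is a minimal sketch. The helper
name is_supported_language is hypothetical, and the hard-coded list
stands in for the value returned by Lingua::Stem::UniNE->languages so
that the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Stem::UniNE): returns 1 if $code
# matches one of the codes in @supported, ignoring case, and 0 otherwise.
sub is_supported_language {
    my ($code, @supported) = @_;
    return 0 unless defined $code;
    my $wanted = lc $code;
    return scalar(grep { $_ eq $wanted } @supported) ? 1 : 0;
}

# Stand-in for the list returned by Lingua::Stem::UniNE->languages.
my @supported = qw(bg cs fa);

for my $code (qw(BG cz fa-IR)) {
    printf "%s: %s\n", $code,
        is_supported_language($code, @supported)
            ? 'supported'
            : 'not supported';
}
```

With the stand-in list, "BG" is reported as supported, while the country
code "cz" and the IETF tag "fa-IR" are not.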
SEE ALSO
IR Multilingual Resources at UniNE provides the original stemming
algorithms that were implemented in this module.
Lingua::Stem::Any provides a unified interface to any stemmer on CPAN,
including this module, as well as additional features like
normalization, casefolding, and in-place stemming.
Lingua::Stem::Snowball provides alternate stemming algorithms for
Finnish, French, German, Hungarian, Italian, Portuguese, Russian,
Spanish, and Swedish, as well as other languages.
ACKNOWLEDGEMENTS
Jacques Savoy and Ljiljana Dolamic of the University of Neuchâtel
authored the original stemming algorithms that were implemented in this
module.
This module is brought to you by Shutterstock (@ShutterTech).
Additional open source projects from Shutterstock can be found at
code.shutterstock.com.
AUTHOR
Nick Patch
COPYRIGHT AND LICENSE
© 2012–2013 Nick Patch
This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.