NAME
Lingua::JA::NormalizeText - text normalizer
SYNOPSIS
use Lingua::JA::NormalizeText;
use utf8;
my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
my $normalizer = Lingua::JA::NormalizeText->new(@options);
print $normalizer->normalize('鳥が㌧㌦でありんす&hearts;');
# -> 鳥がトンドルです♥
sub dearinsu_to_desu
{
    my $text = shift;
    $text =~ s/でありんす/です/g;
    return $text;
}
# or
use Lingua::JA::NormalizeText qw/nfkc decode_entities/;
use utf8;
my $text = '鳥が㌧㌦でありんす&hearts;';
print dearinsu_to_desu( decode_entities( nfkc($text) ) );
# -> 鳥がトンドルです♥
sub dearinsu_to_desu
{
    my $text = shift;
    $text =~ s/でありんす/です/g;
    return $text;
}
DESCRIPTION
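Lingua::JA::NormalizeText normalizes text, primarily Japanese text, by
applying a user-selected sequence of filters such as Unicode
normalization, HTML entity decoding, full-width/half-width conversion,
and kana conversion.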
METHODS
new(@options)
Creates a new Lingua::JA::NormalizeText instance.
The following options are available.
OPTION                 SAMPLE INPUT        OUTPUT FOR SAMPLE INPUT
---------------------  ------------------  -----------------------
lc                     DdD                 ddd
uc                     DdD                 DDD
nfkc                   ㌦                  ドル (length: 2)
nfkd                   ㌦                  ドル (length: 3)
nfc
nfd
decode_entities        &hearts;            ♥
strip_html             <em>あ</em>         あ
alnum_z2h              ＡＢＣ１２３        ABC123
alnum_h2z              ABC123              ＡＢＣ１２３
space_z2h
space_h2z
katakana_z2h           ハァハァ            ﾊｧﾊｧ
katakana_h2z           ｽｰﾊｰｽｰﾊｰ            スーハースーハー
katakana2hiragana      パンツ              ぱんつ
hiragana2katakana      ぱんつ              パンツ
unify_3dots            はぁ。。。          はぁ…
wave2tilde             〜                  ～
tilde2wave             ～                  〜
wavetilde2long         〜, ～              ー
wave2long              〜                  ー
tilde2long             ～                  ー
fullminus2long         −                   ー
dashes2long            —                   ー
drawing_lines2long     ─                   ー
unify_long_repeats     ヴァーーー          ヴァー
nl2space               (new line)          (space)
unify_long_spaces      (space)(space)      (space)
remove_head_space      (space)あ(space)あ  あ(space)あ
remove_tail_space      ああ(space)(space)  ああ
modernize_kana_usage   ゐヰゑヱ            いイえエ
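The length difference noted for nfkc and nfkd can be checked with the
core Unicode::Normalize module. This is a minimal sketch that calls
Unicode::Normalize directly; it assumes, but does not confirm, that the
nfkc and nfkd options behave like NFKC() and NFKD().

```perl
use strict;
use warnings;
use Unicode::Normalize qw/NFKC NFKD/;

binmode STDOUT, ':encoding(UTF-8)';

# U+3326 SQUARE DORU, written as an escape so the file encoding does not matter.
my $text = "\x{3326}";

# NFKC: compatibility decomposition followed by canonical composition.
my $nfkc = NFKC($text);

# NFKD: compatibility decomposition only, so the voiced sound mark
# stays a separate combining character (U+3099).
my $nfkd = NFKD($text);

printf "NFKC: %s (length: %d)\n", $nfkc, length $nfkc;    # length: 2
printf "NFKD: %s (length: %d)\n", $nfkd, length $nfkd;    # length: 3
```

Both results look the same when displayed, so compare lengths or code
points rather than the rendered text.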
Options are applied in the order in which they appear in @options: the
first element is applied first and the last element is applied last.
You can also pass references to your own functions (see the
dearinsu_to_desu function in the SYNOPSIS section).
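Because filters run in list order, swapping two of them can change the
result. The sketch below illustrates this with plain code references
and core Perl only. The apply_filters helper and both filters are
hypothetical names made up for this illustration; they are not part of
Lingua::JA::NormalizeText and do not show how the module is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): applies each code
# reference to the text in list order and returns the final text.
sub apply_filters {
    my ($text, @filters) = @_;
    $text = $_->($text) for @filters;
    return $text;
}

# Two illustrative filters.
my $lowercase    = sub { return lc shift };
my $hello_to_bye = sub {
    my $text = shift;
    $text =~ s/hello/bye/g;
    return $text;
};

# Lowercasing first lets the substitution match.
my $first = apply_filters('HELLO', $lowercase, $hello_to_bye);

# Substituting first finds nothing to replace in 'HELLO'.
my $second = apply_filters('HELLO', $hello_to_bye, $lowercase);

print "$first\n";     # bye
print "$second\n";    # hello
```

The same reasoning applies to a custom function mixed with built-in
options: put it after any option whose output it depends on.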
normalize($text)
Normalizes $text by applying the options passed to new() in order, and
returns the normalized text.
AUTHOR
pawa
SEE ALSO
LICENSE
This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.