NAME
Lingua::JA::NormalizeText - Text Normalizer
SYNOPSIS
use Lingua::JA::NormalizeText;
use utf8;
my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
my $normalizer = Lingua::JA::NormalizeText->new(@options);
print $normalizer->normalize('鳥が㌧㌦でありんす&hearts;');
# -> 鳥がトンドルです♥
sub dearinsu_to_desu
{
    my $text = shift;
    $text =~ s/でありんす/です/g;
    return $text;
}
# or
use Lingua::JA::NormalizeText qw/old2new_kanji/;
use utf8;
print old2new_kanji('惡の華');
# -> 悪の華
DESCRIPTION
Lingua::JA::NormalizeText normalizes Japanese text by applying a
user-selected sequence of normalization options.
METHODS
new(@options)
Creates a new Lingua::JA::NormalizeText instance.
The following options are available:
OPTION                 SAMPLE INPUT                OUTPUT FOR SAMPLE INPUT
---------------------  --------------------------  -----------------------
lc                     DdD                         ddd
uc                     DdD                         DDD
nfkc                   ㌦                          ドル (length: 2)
nfkd                   ㌦                          ドル (length: 3)
nfc
nfd
decode_entities        &hearts;                    ♥
strip_html             <em>あ</em>                 あ
alnum_z2h              ＡＢＣ１２３                ABC123
alnum_h2z              ABC123                      ＡＢＣ１２３
space_z2h
space_h2z
katakana_z2h           ハァハァ                    ﾊｧﾊｧ
katakana_h2z           ｽｰﾊｰｽｰﾊｰ                    スーハースーハー
katakana2hiragana      パンツ                      ぱんつ
hiragana2katakana      ぱんつ                      パンツ
wave2tilde             〜                          ～
tilde2wave             ～                          〜
wavetilde2long         〜, ～                      ー
wave2long              〜                          ー
tilde2long             ～                          ー
fullminus2long         −                           ー
dashes2long            —                           ー
drawing_lines2long     ─                           ー
unify_long_repeats     ヴァーーー                  ヴァー
nl2space               (LF)(CR)(CRLF)              (space)(space)(space)
unify_nl               (LF)(CR)(CRLF)              \n\n\n
unify_long_spaces      あ(space)(space)あ          あ(space)あ
unify_whitespaces      \x{00A0}                    (space)
trim                   (space)あ(space)あ(space)   あ(space)あ
ltrim                  (space)あ(space)            あ(space)
rtrim                  ああ(space)(space)          ああ
old2new_kana           ゐヰゑヱヸヹ                いイえエイ゙エ゙
old2new_kanji          亞逸鬭                      亜逸闘
tab2space              (tab)(tab)                  (space)(space)
remove_controls        あ\x{0000}あ                ああ
remove_spaces          (space)あ(space)あ(space)   ああ
dakuon_normalize       さ\x{3099}                  ざ
handakuon_normalize    は\x{309A}                  ぱ
all_dakuon_normalize   さ\x{3099}は\x{309A}        ざぱ
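The length difference between the nfkc and nfkd rows comes from how the
voiced sound mark is represented. The sketch below calls Unicode::Normalize
(listed under SEE ALSO) directly instead of this module, on the assumption
that the nfkc and nfkd options behave like its NFKC and NFKD functions.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFKC NFKD);

binmode STDOUT, ':encoding(UTF-8)';

my $input = '㌦';            # U+3326 SQUARE DORU

my $nfkc = NFKC($input);     # ド (U+30C9) + ル (U+30EB)
my $nfkd = NFKD($input);     # ト (U+30C8) + U+3099 + ル (U+30EB)

printf "nfkc: %s (length: %d)\n", $nfkc, length $nfkc;
printf "nfkd: %s (length: %d)\n", $nfkd, length $nfkd;
```

Both results render as ドル, but the decomposed form stores ド as ト followed
by a combining mark, which is why it is one character longer.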
Options are applied in the order in which they appear in @options: the
first element is applied first and the last element is applied last.
You can also pass references to your own functions. (See the
dearinsu_to_desu function in the SYNOPSIS section.)
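Because each step sees the output of the previous one, reordering the list
can change the result. The following is a minimal sketch of that idea using
plain code references; apply_in_order and desu_to_da are hypothetical names
introduced for illustration and are not part of this module.

```perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: apply each code reference to the text in list order.
sub apply_in_order {
    my ($text, @steps) = @_;
    $text = $_->($text) for @steps;
    return $text;
}

# Same function as in the SYNOPSIS.
sub dearinsu_to_desu {
    my $text = shift;
    $text =~ s/でありんす/です/g;
    return $text;
}

# Hypothetical second step, used only to make the ordering visible.
sub desu_to_da {
    my $text = shift;
    $text =~ s/です/だ/g;
    return $text;
}

my $first  = apply_in_order('鳥でありんす', \&dearinsu_to_desu, \&desu_to_da);
my $second = apply_in_order('鳥でありんす', \&desu_to_da, \&dearinsu_to_desu);

print "$first\n";   # 鳥だ
print "$second\n";  # 鳥です
```

In the second call desu_to_da runs before any です exists in the text, so it
has nothing to replace.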
normalize($text)
Normalizes $text and returns the result.
OPTIONS
dashes2long
Note that this option does not convert hyphens into the long vowel mark (ー).
unify_long_spaces
Note that this option unifies only SPACE(U+0020) and IDEOGRAPHIC
SPACE(U+3000).
remove_controls
Note that this option does not remove the following characters:
CHARACTER TABULATION
LINE FEED
CARRIAGE RETURN
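As a rough illustration of the exclusions above, the sketch below removes
control characters while keeping TAB, LF and CR. It assumes that "control
character" means the Unicode general category Cc, which this document does
not state; remove_controls_sketch is a hypothetical stand-in, not the
module's code.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical stand-in for remove_controls: drop every character in the
# general category Cc except TAB, LF and CR (assumed definition).
sub remove_controls_sketch {
    my $text = shift;
    $text =~ s/(?![\t\n\r])\p{Cc}//g;
    return $text;
}

# NUL is removed; the tab and the newline are kept.
my $cleaned = remove_controls_sketch("あ\x{0000}あ\tい\n");
```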
remove_spaces
Note that this option removes only SPACE(U+0020) and IDEOGRAPHIC SPACE(U+3000).
unify_whitespaces
This option converts the following characters into SPACE(U+0020):
LINE TABULATION
FORM FEED
NEXT LINE
NO-BREAK SPACE
OGHAM SPACE MARK
MONGOLIAN VOWEL SEPARATOR
EN QUAD
EM QUAD
EN SPACE
EM SPACE
THREE-PER-EM SPACE
FOUR-PER-EM SPACE
SIX-PER-EM SPACE
FIGURE SPACE
PUNCTUATION SPACE
THIN SPACE
HAIR SPACE
LINE SEPARATOR
PARAGRAPH SEPARATOR
NARROW NO-BREAK SPACE
MEDIUM MATHEMATICAL SPACE
Note that this does not convert the following characters:
CHARACTER TABULATION
LINE FEED
CARRIAGE RETURN
IDEOGRAPHIC SPACE
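The two lists above can be expressed as a single transliteration. This is a
sketch only: the code points are the standard Unicode values for the listed
character names, and unify_whitespaces_sketch is a hypothetical function
rather than the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for unify_whitespaces. The search list covers:
#   U+000B LINE TABULATION, U+000C FORM FEED, U+0085 NEXT LINE,
#   U+00A0 NO-BREAK SPACE, U+1680 OGHAM SPACE MARK,
#   U+180E MONGOLIAN VOWEL SEPARATOR, U+2000..U+200A (EN QUAD to HAIR SPACE),
#   U+2028 LINE SEPARATOR, U+2029 PARAGRAPH SEPARATOR,
#   U+202F NARROW NO-BREAK SPACE, U+205F MEDIUM MATHEMATICAL SPACE.
# TAB, LF, CR and IDEOGRAPHIC SPACE (U+3000) are deliberately left out.
sub unify_whitespaces_sketch {
    my $text = shift;
    # The replacement list is shorter than the search list, so tr/// maps
    # every listed character to its last replacement character, a space.
    $text =~ tr/\x{000B}\x{000C}\x{0085}\x{00A0}\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}/ /;
    return $text;
}

# NO-BREAK SPACE and EM SPACE become U+0020; U+3000 and TAB are untouched.
my $out = unify_whitespaces_sketch("a\x{00A0}b\x{2003}c\x{3000}d\te");
```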
AUTHOR
pawa
SEE ALSO
Table of new and old kanji forms (新旧字体表):
Lingua::JA::Regular::Unicode
Lingua::JA::Dakuon
Lingua::JA::Moji
Unicode::Normalize
HTML::Entities
HTML::Scrubber
LICENSE
This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.