NAME
    Lingua::JA::NormalizeText - text normalizer

SYNOPSIS
      use Lingua::JA::NormalizeText;
      use utf8;

      my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
      my $normalizer = Lingua::JA::NormalizeText->new(@options);

      print $normalizer->normalize('鳥が㌧㌦でありんす&hearts;');
      # -> 鳥がトンドルです♥

      sub dearinsu_to_desu
      {
          my $text = shift;
          $text =~ s/でありんす/です/g;

          return $text;
      }

    # or

      use Lingua::JA::NormalizeText qw/nfkc decode_entities/;
      use utf8;

      my $text = '鳥が㌧㌦でありんす&hearts;';
      print dearinsu_to_desu( decode_entities( nfkc($text) ) );
      # -> 鳥がトンドルです♥

      sub dearinsu_to_desu
      {
          my $text = shift;
          $text =~ s/でありんす/です/g;

          return $text;
      }

DESCRIPTION
    Lingua::JA::NormalizeText normalizes text.

METHODS
  new(@options)
    Creates a new Lingua::JA::NormalizeText instance.

    The following options are available.

      OPTION                 SAMPLE INPUT        OUTPUT FOR SAMPLE INPUT
      ---------------------  ------------------  -----------------------
      lc                     DdD                 ddd
      uc                     DdD                 DDD
      nfkc                   ㌦                  ドル (length: 2)
      nfkd                   ㌦                  ドル (length: 3)
      nfc
      nfd
      decode_entities        &hearts;            ♥
      strip_html             <em>あ</em>         あ
      alnum_z2h              ＡＢＣ１２３        ABC123
      alnum_h2z              ABC123              ＡＢＣ１２３
      space_z2h
      space_h2z
      katakana_z2h           ハァハァ            ﾊｧﾊｧ
      katakana_h2z           ｽｰﾊｰｽｰﾊｰ            スーハースーハー
      katakana2hiragana      パンツ              ぱんつ
      hiragana2katakana      ぱんつ              パンツ
      unify_3dots            はぁ。。。          はぁ…
      wave2tilde             〜                  ～
      tilde2wave             ～                  〜
      wavetilde2long         〜, ～              ー
      wave2long              〜                  ー
      tilde2long             ～                  ー
      fullminus2long         −                   ー
      dashes2long            —                   ー
      drawing_lines2long     ─                   ー
      unify_long_repeats     ヴァーーー          ヴァー
      nl2space               (new line)          (space)
      unify_long_spaces      (space)(space)      (space)
      remove_head_space      (space)あ(space)あ  あ(space)あ
      remove_tail_space      ああ(space)(space)  ああ
      modernize_kana_usage   ゐヰゑヱ            いイえエ

    Options are applied in the order in which they appear in @options:
    the first element is applied first, and the last element is applied
    last.

    External functions can also be added by passing code references
    (see the dearinsu_to_desu function in the SYNOPSIS section).

  normalize($text)
    Normalizes $text by applying the options given to new, and returns
    the result.
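
    Because options run in list order, the same set of options can give
    different results when reordered. The following is a minimal sketch
    of that behaviour using only core Perl; it does not use or reproduce
    this module's internals. The helper apply_in_order and the custom
    step doru_to_usd are hypothetical names introduced for illustration,
    and NFKC from the core Unicode::Normalize module stands in for the
    nfkc option.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFKC);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: run each step on the text, in list order.
sub apply_in_order {
    my ($text, @steps) = @_;
    $text = $_->($text) for @steps;
    return $text;
}

# Hypothetical custom step: rewrite the two-character word ドル as "USD".
sub doru_to_usd {
    my $text = shift;
    $text =~ s/ドル/USD/g;
    return $text;
}

# Stand-in for the nfkc option, built on the core NFKC function.
my $nfkc = sub { my $text = shift; return NFKC($text); };

# The input uses the single-character square form ㌦.
my $input = '100㌦';

# NFKC first: ㌦ is expanded to ドル, so the custom step matches.
my $nfkc_first = apply_in_order($input, $nfkc, \&doru_to_usd);

# Custom step first: nothing matches yet, so only NFKC takes effect.
my $custom_first = apply_in_order($input, \&doru_to_usd, $nfkc);

print "$nfkc_first\n";    # 100USD
print "$custom_first\n";  # 100ドル
```

    With the module itself, the same ordering concern applies to the
    list passed to new: a custom function that expects normalized text
    should come after nfkc in @options.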

AUTHOR
    pawa <pawapawa@cpan.org>

SEE ALSO

LICENSE
    This library is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself.