Regex for Japanese

Most regex flavors are not set up for languages that do not use the Roman alphabet. The lack of built-in character classes, and of spaces marking word boundaries, makes Japanese text a challenge.

However, at least one regex implementation has features for Japanese: the Oniguruma regular expression library by K. Kosako, which is used in TextMate.

github.com/kkos/oniguruma

  [一-龠]   All kanji code points (possibly!)
  [ぁ-ヿ]   All kana code points (hiragana and katakana)
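
As a rough illustration, the two ranges can be tried in Python's built-in `re` module, which accepts literal Unicode ranges in a character class. This is only a sketch: the function name `split_scripts` is made up for the example, and the kanji range is, as noted above, only possibly complete.

```python
import re

# The two ranges listed above, compiled as character classes.
KANJI = re.compile(r"[一-龠]")  # U+4E00..U+9FA0
KANA = re.compile(r"[ぁ-ヿ]")   # U+3041..U+30FF (hiragana and katakana blocks)

def split_scripts(text):
    """Return (kanji, kana) characters found in text, each in original order."""
    kanji = "".join(KANJI.findall(text))
    kana = "".join(KANA.findall(text))
    return kanji, kana

if __name__ == "__main__":
    print(split_scripts("食べる and カタカナ"))
```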

Regex to convert output of lists

These passes convert list output from the “Japanese for iOS” dictionary to 3-column tab-delimited text.

Pass 1 #remove brackets

  Match:（([ぁ-ヿ、]+)）\n(.+)\n\n
  Replace:  \t$1\t$2\n

regex101.com/r/1JGezs/4
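
A minimal Python sketch of Pass 1, assuming each exported entry looks like a headword followed by its reading in full-width brackets, then the definition on the next line, then a blank line. That sample layout is inferred from the pattern, not confirmed by the app. Python writes the replacement groups as `\1` and `\2` rather than `$1` and `$2`.

```python
import re

# Pass 1: the reading in full-width brackets becomes column 2,
# the definition line becomes column 3.
PASS1 = re.compile(r"（([ぁ-ヿ、]+)）\n(.+)\n\n")

def pass1(text):
    return PASS1.sub(r"\t\1\t\2\n", text)

if __name__ == "__main__":
    sample = "食べる（たべる）\nto eat\n\n"  # hypothetical entry
    print(repr(pass1(sample)))
```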

Pass 2 #kana only entries

  Match: ([ぁ-ヿ、]+)\n(.+)\n\n
  Replace:  $1\t$1\t$2\n

regex101.com/r/K8bUFu/5
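
Pass 2 can be sketched the same way in Python. The sample entries are hypothetical and assume a kana-only headword on one line, the definition on the next, then a blank line. The headword is written to both the first and second columns.

```python
import re

# Pass 2: entries whose headword is kana only have no bracketed reading,
# so the headword is repeated as the reading.
PASS2 = re.compile(r"([ぁ-ヿ、]+)\n(.+)\n\n")

def pass2(text):
    return PASS2.sub(r"\1\t\1\t\2\n", text)

if __name__ == "__main__":
    sample = "すし\nsushi\n\nそば\nsoba\n\n"  # hypothetical entries
    print(repr(pass2(sample)))
```

The pattern is not anchored to the start of a line, so it would also match a run of kana at the end of a kanji headword. Running it only after Pass 1 has converted the bracketed entries seems to be the intent.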

Pass 3 #remove extra definitions

  Match: (;(.+)\n)
  Replace:

regex101.com/r/gyGVWJ/2
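
A hedged Python sketch of Pass 3. The Replace field above is blank, and an empty replacement would also remove the newline, joining each trimmed entry to the next line. This sketch therefore assumes a replacement of a single newline, which is a guess rather than something the page states.

```python
import re

# Pass 3: drop everything from the first ";" to the end of the line.
PASS3 = re.compile(r";(.+)\n")

def pass3(text):
    # ASSUMPTION: replace with "\n" so each entry stays on its own line.
    return PASS3.sub("\n", text)

if __name__ == "__main__":
    sample = "すし\tすし\tsushi; vinegared rice\nそば\tそば\tsoba\n"  # hypothetical
    print(repr(pass3(sample)))
```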

Regex to convert character set

Uses the conditional replacement technique described at
www.rexegg.com/regex-trick-conditional-replacement.html

Katakana to Hiragana

  Match: (?sx)([ァ-ヺ])(?=.*:\1=(.))
  Replace: $2

regex101: build, test, and debug regex

The lookahead finds each character's replacement in the dictionary table, so the table needs to be pasted at the bottom of the file for this method to work.

Dictionary:ァ=ぁ:ア=あ:ィ=ぃ:イ=い:ゥ=ぅ:ウ=う:ェ=ぇ:エ=え:ォ=ぉ:オ=お:カ=か:ガ=が:キ=き:ギ=ぎ:ク=く:グ=ぐ:ケ=け:ゲ=げ:コ=こ:ゴ=ご:サ=さ:ザ=ざ:シ=し:ジ=じ:ス=す:ズ=ず:セ=せ:ゼ=ぜ:ソ=そ:ゾ=ぞ:タ=た:ダ=だ:チ=ち:ヂ=ぢ:ッ=っ:ツ=つ:ヅ=づ:テ=て:デ=で:ト=と:ド=ど:ナ=な:ニ=に:ヌ=ぬ:ネ=ね:ノ=の:ハ=は:バ=ば:パ=ぱ:ヒ=ひ:ビ=び:ピ=ぴ:フ=ふ:ブ=ぶ:プ=ぷ:ヘ=へ:ベ=べ:ペ=ぺ:ホ=ほ:ボ=ぼ:ポ=ぽ:マ=ま:ミ=み:ム=む:メ=め:モ=も:ャ=ゃ:ヤ=や:ュ=ゅ:ユ=ゆ:ョ=ょ:ヨ=よ:ラ=ら:リ=り:ル=る:レ=れ:ロ=ろ:ヮ=ゎ:ワ=わ:ヰ=ゐ:ヱ=ゑ:ヲ=を:ン=ん:ヴ=ゔ:ヵ=ヵ:ヶ=ヶ:ヷ=わ゙:ヸ=ゐ゙:ヹ=ゑ゙:ヺ=ヺ
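
The same lookahead trick appears to work unchanged in Python's `re` module, since a group captured inside a positive lookahead is still available to the replacement. The sketch below uses a shortened, hypothetical table and a made-up function name, `kata_to_hira`. It appends the table to the text, runs the substitution, then strips the table off again.

```python
import re

# Same pattern as above: group 1 is the katakana character, group 2 is
# captured inside the lookahead from the ":key=value" entry further on.
KATA_TO_HIRA = re.compile(r"(?sx)([ァ-ヺ])(?=.*:\1=(.))")

# Shortened, hypothetical table; the full one is the Dictionary line above.
TABLE = "Dictionary:カ=か:タ=た:ナ=な"

def kata_to_hira(body, table=TABLE):
    text = body + "\n" + table            # table goes at the bottom
    converted = KATA_TO_HIRA.sub(r"\2", text)
    return converted[: -(len(table) + 1)]  # drop the appended table line

if __name__ == "__main__":
    print(kata_to_hira("カタカナ"))
```

Keys inside the table are left alone because no later `:key=` entry follows them. Since `(.)` captures a single character, table values that appear to be two code points, such as ヷ=わ゙, would likely be cut to their first character.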

Hiragana to Katakana

  Match: (?sx)([ぁ-ゔ])(?=.*:\1=(.))
  Replace: $2

regex101: build, test, and debug regex

This dictionary table needs to be pasted at the bottom of the file for this method to work.

Dictionary:ぁ=ァ:あ=ア:ぃ=ィ:い=イ:ぅ=ゥ:う=ウ:ぇ=ェ:え=エ:ぉ=ォ:お=オ:か=カ:が=ガ:き=キ:ぎ=ギ:く=ク:ぐ=グ:け=ケ:げ=ゲ:こ=コ:ご=ゴ:さ=サ:ざ=ザ:し=シ:じ=ジ:す=ス:ず=ズ:せ=セ:ぜ=ゼ:そ=ソ:ぞ=ゾ:た=タ:だ=ダ:ち=チ:ぢ=ヂ:っ=ッ:つ=ツ:づ=ヅ:て=テ:で=デ:と=ト:ど=ド:な=ナ:に=ニ:ぬ=ヌ:ね=ネ:の=ノ:は=ハ:ば=バ:ぱ=パ:ひ=ヒ:び=ビ:ぴ=ピ:ふ=フ:ぶ=ブ:ぷ=プ:へ=ヘ:べ=ベ:ぺ=ペ:ほ=ホ:ぼ=ボ:ぽ=ポ:ま=マ:み=ミ:む=ム:め=メ:も=モ:ゃ=ャ:や=ヤ:ゅ=ュ:ゆ=ユ:ょ=ョ:よ=ヨ:ら=ラ:り=リ:る=ル:れ=レ:ろ=ロ:ゎ=ヮ:わ=ワ:ゐ=ヰ:ゑ=ヱ:を=ヲ:ん=ン:ゔ=ヴ:ヵ=ヵ:ヶ=ヶ:わ゙=ヷ:ゐ゙=ヸ:ゑ゙=ヹ:ヺ=ヺ
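
When the text is processed in Python rather than in an editor, a different technique avoids the pasted table altogether. Hiragana U+3041..U+3096 and katakana U+30A1..U+30F6 are laid out in parallel, 0x60 code points apart, so `str.translate` with an offset map does the conversion. This is an alternative sketch, not the regex method described on this page, and the function names are made up for the example.

```python
# Parallel blocks: katakana code point = hiragana code point + 0x60.
OFFSET = 0x60
HIRA_TO_KATA = {cp: cp + OFFSET for cp in range(0x3041, 0x3097)}
KATA_TO_HIRA = {cp: cp - OFFSET for cp in range(0x30A1, 0x30F7)}

def to_katakana(text):
    """Convert hiragana to katakana; other characters pass through."""
    return text.translate(HIRA_TO_KATA)

def to_hiragana(text):
    """Convert katakana to hiragana; other characters pass through."""
    return text.translate(KATA_TO_HIRA)

if __name__ == "__main__":
    print(to_katakana("ひらがな"))
    print(to_hiragana("カタカナ"))
```

Unlike the tables above, this maps ヵ and ヶ to the small hiragana ゕ and ゖ, and it leaves ヷ, ヸ, ヹ and ヺ untouched because they have no single-code-point hiragana counterpart.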

I have since found a blog post on Japanese Unicode code points, and Python Unicode regex groups.

