しあわせ

Regex for Japanese

Regex was not designed with non-Latin scripts in mind. The lack of built-in character classes for Japanese and the absence of spaces marking word boundaries both make for challenges.

However, at least one regex implementation has features for Japanese: the Oniguruma regular expression library by K. Kosako, which is used in TextMate.

github.com/kkos/oniguruma

  [一-龠]   Kanji code points U+4E00 to U+9FA0 (the common kanji, possibly not all of them!)
  [ぁ-ヿ]   All kana code points, hiragana and katakana (U+3041 to U+30FF)
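
As a quick check of what these two ranges pick up, here is a minimal sketch using Python's standard `re` module. The sample string and the variable names are made up for illustration.

```python
import re

# The two ranges from above, used as plain character classes.
KANJI = re.compile(r"[一-龠]+")  # U+4E00..U+9FA0
KANA = re.compile(r"[ぁ-ヿ]+")   # U+3041..U+30FF, hiragana and katakana

# Hypothetical sample mixing kanji, kana, and Latin text.
sample = "漢字とカタカナ and ひらがな"

kanji_runs = KANJI.findall(sample)
kana_runs = KANA.findall(sample)

print(kanji_runs)  # ['漢字']
print(kana_runs)   # ['とカタカナ', 'ひらがな']
```

Because hiragana and katakana share one range here, a hiragana particle directly followed by a katakana word comes back as a single run.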

Regex to convert list output from the “Japanese for iOS” dictionary to 3-column tab-delimited text

Pass 1 #remove brackets

  Match: \(([ぁ-ヿ、]+)\)\n(.+)\n\n
  Replace:  \t$1\t$2\n

regex101.com/r/1JGezs/4

Pass 2 #kana only entries

  Match: ([ぁ-ヿ、]+)\n(.+)\n\n
  Replace:  $1\t$1\t$2\n

regex101.com/r/K8bUFu/5

Pass 3 #remove extra definitions

  Match: (;(.+)\n)
  Replace:

regex101.com/r/gyGVWJ/2
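
The three passes can be chained in a script. The following is only a sketch under assumptions: the export layout (a headword with the reading in ASCII parentheses, the definition on the next line, then a blank line) is guessed from the patterns, the function name `to_three_columns` is hypothetical, and Python writes `\1` where the replacements above write `$1`. Pass 3 is written here so that it leaves the newline in place, which keeps one entry per row.

```python
import re

KANA = "[ぁ-ヿ、]+"  # kana plus the Japanese comma, as in the passes above


def to_three_columns(text: str) -> str:
    """Turn the assumed export layout into headword<TAB>reading<TAB>definition rows."""
    # Pass 1: drop the brackets around the reading and pull the definition up.
    text = re.sub(rf"\(({KANA})\)\n(.+)\n\n", r"\t\1\t\2\n", text)
    # Pass 2: kana-only entries have no reading, so the headword is repeated.
    text = re.sub(rf"({KANA})\n(.+)\n\n", r"\1\t\1\t\2\n", text)
    # Pass 3: keep only the first definition; '.' stops before the newline.
    text = re.sub(r";.+", "", text)
    return text


# Hypothetical two-entry export in the assumed layout.
sample = (
    "漢字(かんじ)\n"
    "Chinese character; kanji\n"
    "\n"
    "かな\n"
    "kana; Japanese syllabary\n"
    "\n"
)

print(to_three_columns(sample))
```

The order matters: Pass 2 relies on Pass 1 having already turned the bracketed readings into tab-separated columns, so that only the kana-only entries still have a reading directly followed by a line break.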


Regex to convert character set

Katakana to Hiragana
  Match: (?sx)([ァ-ヺ])(?=.*:\1=(.))
  Replace: $2

This dictionary table needs to be pasted at the bottom of the file for this method to work, because the lookahead searches forward from each matched character for its :key=value entry.

Dictionary:ァ=ぁ:ア=あ:ィ=ぃ:イ=い:ゥ=ぅ:ウ=う:ェ=ぇ:エ=え:ォ=ぉ:オ=お:カ=か:ガ=が:キ=き:ギ=ぎ:ク=く:グ=ぐ:ケ=け:ゲ=げ:コ=こ:ゴ=ご:サ=さ:ザ=ざ:シ=し:ジ=じ:ス=す:ズ=ず:セ=せ:ゼ=ぜ:ソ=そ:ゾ=ぞ:タ=た:ダ=だ:チ=ち:ヂ=ぢ:ッ=っ:ツ=つ:ヅ=づ:テ=て:デ=で:ト=と:ド=ど:ナ=な:ニ=に:ヌ=ぬ:ネ=ね:ノ=の:ハ=は:バ=ば:パ=ぱ:ヒ=ひ:ビ=び:ピ=ぴ:フ=ふ:ブ=ぶ:プ=ぷ:ヘ=へ:ベ=べ:ペ=ぺ:ホ=ほ:ボ=ぼ:ポ=ぽ:マ=ま:ミ=み:ム=む:メ=め:モ=も:ャ=ゃ:ヤ=や:ュ=ゅ:ユ=ゆ:ョ=ょ:ヨ=よ:ラ=ら:リ=り:ル=る:レ=れ:ロ=ろ:ヮ=ゎ:ワ=わ:ヰ=ゐ:ヱ=ゑ:ヲ=を:ン=ん:ヴ=ゔ:ヵ=ヵ:ヶ=ヶ:ヷ=わ゙:ヸ=ゐ゙:ヹ=ゑ゙:ヺ=ヺ
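
The same lookahead trick can be tried outside an editor. This is a rough sketch in Python's `re` module, not a tested port of the regex101 example: the table is cut down to three entries for readability, the function name `katakana_to_hiragana` is hypothetical, and the replacement is written `\2` instead of `$2`.

```python
import re

# Abbreviated, hypothetical table; the full one above would be used in practice.
TABLE = "Dictionary:カ=か:タ=た:ナ=な"

# Group 1 is the katakana character; the lookahead finds ":<char>=" further
# down the text and captures the character after "=" as group 2.
PATTERN = re.compile(r"(?s)([ァ-ヺ])(?=.*:\1=(.))")


def katakana_to_hiragana(body: str, table: str = TABLE) -> str:
    # Append the table so the lookahead has something to find.
    text = body + "\n" + table
    converted = PATTERN.sub(r"\2", text)
    # Each replacement swaps one character for one, so the body keeps its length.
    return converted[: len(body)]


print(katakana_to_hiragana("カタカナ and 漢字"))  # かたかな and 漢字
```

Characters with no entry in the table are left alone, since the lookahead fails for them. The scan to the end of the text happens for every match, so this approach gets slow on long files.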

Hiragana to Katakana
  Match: (?sx)([ぁ-ゔ])(?=.*:\1=(.))
  Replace: $2

This dictionary table needs to be pasted at the bottom of the file for this method to work.

Dictionary:ぁ=ァ:あ=ア:ぃ=ィ:い=イ:ぅ=ゥ:う=ウ:ぇ=ェ:え=エ:ぉ=ォ:お=オ:か=カ:が=ガ:き=キ:ぎ=ギ:く=ク:ぐ=グ:け=ケ:げ=ゲ:こ=コ:ご=ゴ:さ=サ:ざ=ザ:し=シ:じ=ジ:す=ス:ず=ズ:せ=セ:ぜ=ゼ:そ=ソ:ぞ=ゾ:た=タ:だ=ダ:ち=チ:ぢ=ヂ:っ=ッ:つ=ツ:づ=ヅ:て=テ:で=デ:と=ト:ど=ド:な=ナ:に=ニ:ぬ=ヌ:ね=ネ:の=ノ:は=ハ:ば=バ:ぱ=パ:ひ=ヒ:び=ビ:ぴ=ピ:ふ=フ:ぶ=ブ:ぷ=プ:へ=ヘ:べ=ベ:ぺ=ペ:ほ=ホ:ぼ=ボ:ぽ=ポ:ま=マ:み=ミ:む=ム:め=メ:も=モ:ゃ=ャ:や=ヤ:ゅ=ュ:ゆ=ユ:ょ=ョ:よ=ヨ:ら=ラ:り=リ:る=ル:れ=レ:ろ=ロ:ゎ=ヮ:わ=ワ:ゐ=ヰ:ゑ=ヱ:を=ヲ:ん=ン:ゔ=ヴ:ヵ=ヵ:ヶ=ヶ:わ゙=ヷ:ゐ゙=ヸ:ゑ゙=ヹ:ヺ=ヺ
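
In a scripting language the pasted table can be avoided altogether. This is a different technique from the lookahead above, not a translation of it: the hiragana ぁ to ゔ (U+3041 to U+3094) and the katakana ァ to ヴ (U+30A1 to U+30F4) sit exactly 0x60 code points apart, so a translation table can be built from the offset. The names below are made up for this sketch, and the rarer characters at the end of the tables (ヵ, ヶ, ヷ and friends) are left untouched.

```python
# Offset between the hiragana and katakana blocks in Unicode.
OFFSET = 0x60

# Translation tables mapping code point to code point.
HIRAGANA_TO_KATAKANA = {
    cp: cp + OFFSET for cp in range(ord("ぁ"), ord("ゔ") + 1)
}
KATAKANA_TO_HIRAGANA = {
    cp: cp - OFFSET for cp in range(ord("ァ"), ord("ヴ") + 1)
}


def to_katakana(text: str) -> str:
    # str.translate leaves any character without a table entry unchanged.
    return text.translate(HIRAGANA_TO_KATAKANA)


def to_hiragana(text: str) -> str:
    return text.translate(KATAKANA_TO_HIRAGANA)


print(to_katakana("ひらがな"))  # ヒラガナ
print(to_hiragana("カタカナ"))  # かたかな
```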

I have since found a blog post on Japanese Unicode code points, and Python Unicode regex groups.

