
Language detection is surprisingly difficult. Neural-network detectors get basic things wrong. They will say that Korean text is actually Chinese, even though you can obviously see with your own eyes that it's not.

After multiple libraries failed this basic test, I did something brave and implemented a naive regex solution in #Ditto. It does a first pass on the text before moving it on to the neural network.

안녕하세요

For example, if ALL the characters are in Korean script, it must be Korean. Even if it's a nonsensical sequence of Korean characters, it cannot be any language other than Korean, because Korean makes exclusive use of this character set.

There are only a few languages where this is possible: Korean, Greek, and Hebrew.

Again, this is only possible if ALL characters in the text match a target language, so simply using "π" in a text does not make it Greek. So, currently this check is very narrow.
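A minimal sketch of what such an all-or-nothing first pass could look like. This is not Ditto's actual code: the function name, the language codes, and the Unicode ranges chosen for each script are assumptions made for illustration, and it assumes punctuation, emojis, and URLs were already stripped upstream.

```python
import re
from typing import Optional

# Assumed Unicode ranges for each script (illustrative, not Ditto's actual patterns).
EXCLUSIVE_SCRIPTS = {
    "ko": re.compile(r"[\u1100-\u11FF\u3130-\u318F\uAC00-\uD7A3]+"),  # Hangul jamo + syllables
    "el": re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]+"),  # Greek + Greek Extended
    "he": re.compile(r"[\u0590-\u05FF]+"),  # Hebrew
}


def detect_exclusive_script(text: str) -> Optional[str]:
    """Return a language code only if EVERY non-space character is in one script."""
    compact = "".join(text.split())  # drop whitespace; other cleanup assumed done upstream
    if not compact:
        return None
    for lang, pattern in EXCLUSIVE_SCRIPTS.items():
        if pattern.fullmatch(compact):
            return lang
    return None  # undecided: hand the text to the model


print(detect_exclusive_script("안녕하세요"))       # ko
print(detect_exclusive_script("π is a number"))  # None
```

Using fullmatch rather than search is what keeps the check narrow: a single stray character from another script makes it fall through to the model.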

Notes about other languages:

Chinese: it's not possible to do a regex-only solution for Chinese, since Han script is also part of Japanese.

Japanese: we *can* definitively detect Japanese, as long as the text contains at least one Hiragana or Katakana character in addition to 0 or more Han characters. So at least *some* Japanese text can be unambiguously detected just by a regex.

Russian: Cyrillic script is used by a handful of languages besides Russian. BUT, if the text is entirely Cyrillic, that at least narrows down the *possible* languages it could be.
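The Japanese rule above could be sketched roughly like this. The names and the exact Unicode ranges are assumptions for illustration (only the main Hiragana, Katakana, and CJK Unified Ideographs blocks are covered), and punctuation is assumed to be stripped before the check.

```python
import re

KANA = r"\u3040-\u309F\u30A0-\u30FF"  # Hiragana + Katakana blocks
HAN = r"\u3400-\u4DBF\u4E00-\u9FFF"   # CJK Extension A + CJK Unified Ideographs

ONLY_KANA_OR_HAN = re.compile(f"[{KANA}{HAN}]+")
ANY_KANA = re.compile(f"[{KANA}]")


def is_definitely_japanese(text: str) -> bool:
    """True if the text is only kana/Han AND contains at least one kana."""
    compact = "".join(text.split())  # ignore whitespace
    if ONLY_KANA_OR_HAN.fullmatch(compact) is None:
        return False  # something outside kana/Han is present
    return ANY_KANA.search(compact) is not None


print(is_definitely_japanese("日本語です"))  # True: Han plus Hiragana
print(is_definitely_japanese("日本語"))      # False: Han only, could be Chinese
```

A False result here does not mean "not Japanese", only that the regex alone can't decide.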

Next steps:

To optimize this, the regexes will narrow down possible languages of a text before passing it to the neural network.

For example, if a text is entirely Han, we would restrict the model to deciding only between Chinese and Japanese. If it's entirely Cyrillic, we'd do the same thing, but with the 6 or so languages written in Cyrillic.
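One way this narrowing could work, as a rough sketch. Everything here is hypothetical: the Cyrillic language list is just an illustrative guess at "6 or so", and pick_language assumes the model hands back a dict of language code to score, which may not match how any real detector exposes its output.

```python
import re
from typing import Dict, List, Optional

# (pattern, candidate languages) pairs. The Cyrillic list is an illustrative assumption.
SCRIPT_CANDIDATES = [
    (re.compile(r"[\u3400-\u4DBF\u4E00-\u9FFF]+"), ["zh", "ja"]),  # Han only
    (re.compile(r"[\u0400-\u04FF]+"), ["ru", "uk", "bg", "sr", "be", "mk"]),  # Cyrillic only
]


def candidate_languages(text: str) -> Optional[List[str]]:
    """Return the allowed languages if the text is entirely one script, else None."""
    compact = "".join(text.split())
    for pattern, langs in SCRIPT_CANDIDATES:
        if pattern.fullmatch(compact):
            return list(langs)
    return None  # no restriction


def pick_language(scores: Dict[str, float], allowed: Optional[List[str]]) -> Optional[str]:
    """Pick the best-scoring language, ignoring guesses outside the allowed set."""
    if allowed:
        # Fall back to the unfiltered scores if the model scored none of the allowed languages.
        scores = {lang: s for lang, s in scores.items() if lang in allowed} or scores
    if not scores:
        return None
    return max(scores, key=lambda lang: scores[lang])


# Hypothetical model output that wrongly favors Korean for Han-only text.
model_scores = {"ko": 0.6, "zh": 0.3, "ja": 0.1}
print(pick_language(model_scores, candidate_languages("中文")))  # zh
```

The regex never picks the language itself here; it only removes answers that can't be right.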

We could also try to match, say, 90% of the text instead of 100%, to any specific script, to catch outliers like occasional English words used in Japanese, etc. We are already stripping things like punctuation, emojis, and URLs before passing text to the model.
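A sketch of the percentage idea, assuming a simple per-character count. The 0.9 threshold comes from the 90% figure above; the function names and the kana/Han ranges are assumptions for illustration.

```python
import re

# Single-character pattern for kana and Han (illustrative ranges).
JAPANESE_CHAR = re.compile(r"[\u3040-\u30FF\u3400-\u4DBF\u4E00-\u9FFF]")


def script_ratio(text: str, char_pattern: "re.Pattern[str]") -> float:
    """Fraction of non-whitespace characters that match the given script."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if char_pattern.fullmatch(c))
    return hits / len(chars)


def mostly_script(text: str, char_pattern: "re.Pattern[str]", threshold: float = 0.9) -> bool:
    """True if at least `threshold` of the characters belong to the script."""
    return script_ratio(text, char_pattern) >= threshold


print(mostly_script("あ" * 9 + "a", JAPANESE_CHAR))   # True: 9 of 10 characters match
print(mostly_script("あ" * 8 + "ab", JAPANESE_CHAR))  # False: 8 of 10 characters match
```

The tradeoff is that a ratio check is no longer a guarantee the way the 100% check is, so it probably fits better as a way to narrow candidates than as a final answer.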

Finally, this is all so we can use a lightweight, embedded solution for language detection, instead of calling out to some proprietary API, or even a giant self-hosted solution. In that case, I believe a layered solution will always be needed. We have to do these naive checks to put "guardrails" on the model, so its guesses can't stray outside of common sense. Switching to a better model can improve the results, but these naive checks will still hold.

Yohan Yukiya Sese Cuneta 사요한 🦩 10mo ago

Segue: Actually, Hangeul (the Korean script) has not been exclusive to Hangugeo (the Korean language) for a few years now (hmm, I think already a decade). One language using it officially and legally is the Cia-Cia language.

The Latin script cannot fully capture the requirements of the Cia-Cia Language, and can cause confusion even among native speakers. But once Hangeul was adopted and taught in schools, it just worked perfectly. Hangeul can fully represent the Cia-Cia Language in written form.

There are other nations currently working on adopting Hangeul too, inspired by its success with the Cia-Cia Language. Even the dead or forgotten proposals to adopt Hangeul from before, during, and after WW2 are being revisited.

Sadly, mobile technology still hasn't caught up with the changes. For example, the Cia-Cia Language uses two obsolete Jamo that standard Hangeul no longer uses. Then again, even in Korea, there are obsolete Jamo still in use that can't be found on mobile keyboards. And there are not yet any translation services, neural or otherwise, supporting Cia-Cia in Hangeul script.

So, at least for now, it's still safe to assume that Hangeul (Korean script) means Hangugeo (Korean language). Maybe in 2030 or 2040, it'll be felt online. The Latin script is not ideal for many Asian languages but Hangeul is perfect. Even for the 200 languages and dialects here in the Philippines, we can better express our languages in Hangeul than in Latin script.
