I spent a non-trivial amount of time trying, unsuccessfully, to solve it in Damus by extending the hashtag parsing code to pull in the icu4c library for multilocale Unicode parsing. The API was too confusing, so I stopped. Happy to have someone else try; it’s possible my approach isn’t the best.
Discussion
Hey nostr:npub1yaul8k059377u9lsu67de7y637w4jtgeuwcmh5n7788l6xnlnrgs3tvjmf
The following solution may work. I am not an expert in iOS app development (in fact, I've never done it), so please take what I say with a pinch of salt :D
From my preliminary analysis of the code on GitHub, the call flow in Damus goes like this 👇
```
damus.c :: parse_hashtags ==>
cursor.h :: consume_until_boundary ==>
cursor.h :: is_boundary
```
I believe the fix is to change how the `is_boundary` function is implemented in Damus.
Looking at how other clients implement hashtags, their regular-expression-based matching doesn't check for alphanumerics at all; instead it just excludes a set of prohibited characters (a blacklist).
Amethyst's regex is this:
```kotlin
"#([^\\s!@#\$%^&*()=+./,\\[{\\]};:'\"?><]+)(.*)"
```
Snort's regex is this:
```js
/(#[^\s!@#$%^&*()=+.\/,\[{\]};:'"?><]+)/
```
Both of these regexes simply blacklist the characters that can't appear in a hashtag, instead of whitelisting alphanumerics.
If we can come up with a C function that checks for blacklisted characters and returns a boolean, we could use it to implement international hashtags in Damus too.
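As a rough illustration, here is a minimal sketch of what such a blacklist check could look like. The function name and the exact character set (copied from the regexes above) are my own assumptions, not code from the Damus repo:
```c
#include <stdbool.h>
#include <string.h>

// Hypothetical helper (not actual Damus code): returns true if a byte
// should end a hashtag, mirroring the Amethyst/Snort blacklists above.
// Bytes >= 0x80 belong to multi-byte UTF-8 sequences and are never
// treated as boundaries, so non-Latin scripts pass through.
static bool is_hashtag_boundary_char(unsigned char c) {
	static const char blacklist[] = "!@#$%^&*()=+./,[{]};:'\"?><";

	if (c == '\0' || c == ' ' || c == '\t' || c == '\n' || c == '\r')
		return true;

	if (c >= 0x80)
		return false; /* part of a multi-byte UTF-8 character */

	return strchr(blacklist, (char)c) != NULL;
}
```
In principle, `consume_until_boundary` could delegate to a check like this instead of the current `is_boundary` logic, so hashtag consumption only stops at whitespace or one of the prohibited punctuation characters.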
Not sure if my theory is correct.
Maybe Apple’s Natural Language framework could help here. It can tokenize text in many languages and determine word boundaries.
I had a conversation with #[11] a while ago about this problem, and he asserts that Swift parsing code is slow on large notes, which is why it's done in C. nostr:note1r30jgsyepxu7he7zcxr703s7tdpvgeksurl7taldp0r5n5q2v8tql5x7kq