Right, that's basically how Microsoft's product works. They don't release the database because you could then craft images that foil the fuzzy hashing algorithm they use.

Discussion

Right. But FYI, you don't need Microsoft's service: you can roll your own with open-source models that return a confidence score between 0 and 1, and many of them are fully open source -- https://huggingface.co/docs/transformers/en/tasks/image_classification They're just image-classification models that output a value between 0 and 1, and they're pretty fast and efficient, since Google and others have been fighting this problem for 20+ years and have developed very good, efficient models. (Which work 99.5% of the time; I think it's impossible to get to 100%.)
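To make that concrete, here's a minimal sketch of the "roll your own" idea using the transformers image-classification pipeline; the specific model and the "nsfw" label are assumptions based on the example linked further down the thread:

```python
# Minimal sketch: get a 0-1 confidence score from an open-source classifier.
# The model name and the "nsfw" label are assumptions, not a recommendation.
from PIL import Image
from transformers import pipeline

classifier = pipeline("image-classification", model="Falconsai/nsfw_image_detection")

def score_image(path: str) -> float:
    """Return the classifier's confidence (0..1) that the image should be flagged."""
    results = classifier(Image.open(path))
    # The pipeline returns e.g. [{"label": "nsfw", "score": 0.97}, {"label": "normal", ...}];
    # pull out the score for the label we care about.
    return next((r["score"] for r in results if r["label"] == "nsfw"), 0.0)

print(score_image("upload.jpg"))
```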

We use one of these models.

Right. That's what I thought. And it works for you, right? You've been able to measure the benefits in terms of fewer complaints or something?

Yes, it's easier and quicker to identify, and there's less we have to do. We still use PhotoDNA for their hashes, but this is the best solution so far.

It's a totally different approach, but maybe you're right: good LLMs are relatively new and could arguably be considered to supersede fuzzy hashing. But the main problem of reverse-engineering the compression algorithm (which is one way to think about LLMs) still exists. If you're thinking of working on this, I'm happy to see what I can do to help.

We tried CloudFlare’s integrated service and Microsoft’s PhotoDNA. They’re OK, but they only compare against existing hashes and only support images, not videos. AI models scan it all, search existing hashes, and recognize unreported patterns.

Here's an example: https://huggingface.co/Falconsai/nsfw_image_detection ... putting one of these (or actually several, and averaging the results) behind an API endpoint is not too difficult, and I'd be happy to do it for any service which has a **way to measure the effectiveness** ... since I will not be reviewing any images manually (!), and YOU will not be reviewing any images manually (!), and I will be deleting all data a few milliseconds after it hits the model and returns a score, you must have SOME way of deciding if the service is useful. Like, user complaints, or blocks, or something like that... ideally you run a big enough service where you can measure "complaints/blocks per day" and see that the number goes down when you start using the scores that I provide.
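Roughly, the endpoint could look like this; FastAPI and the single-model "ensemble" are just my assumptions, the point is that only a score comes back and nothing gets stored:

```python
# Rough sketch of the "scores only, store nothing" endpoint described above.
# FastAPI is an arbitrary choice; any HTTP framework works the same way.
import statistics
from io import BytesIO

from fastapi import FastAPI, UploadFile
from PIL import Image
from transformers import pipeline

# One or more classifiers; averaging several smooths out individual quirks.
MODELS = [
    pipeline("image-classification", model="Falconsai/nsfw_image_detection"),
    # ...add more models here for a real ensemble.
]

app = FastAPI()

@app.post("/score")
async def score(file: UploadFile) -> dict:
    # Decode in memory only; nothing is ever written to disk.
    image = Image.open(BytesIO(await file.read())).convert("RGB")
    scores = [
        next((r["score"] for r in model(image) if r["label"] == "nsfw"), 0.0)
        for model in MODELS
    ]
    # The image goes out of scope as soon as we return the averaged score.
    return {"score": statistics.mean(scores)}
```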

As discussed in this thread, making these scores public is potentially dangerous, but providing a service that simply scores images, especially one offered only to a small number of entities who can be trusted to use it only to help them delete something, is something Microsoft has been doing for decades; I can't see any particular risk in it.

But I only want to build this if someone can say "yes, I'll be able to measure the effectiveness somehow"... because doing this without measurement of any kind is useless, right?

That's a good approach, but it could be tricky on nostr. Maybe you could scrape NIP-56 reports? I know those still get published by some clients (which is awful, but I can't convince the devs to stop).
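Something like this would pull them off a relay (kind 1984 events per NIP-56); the relay URL below is just a placeholder:

```python
# Hedged sketch: fetch NIP-56 report events (kind 1984) from a relay, e.g. to
# find reported content worth scoring. The relay URL is a placeholder.
import asyncio
import json

import websockets  # pip install websockets

RELAY = "wss://relay.example.com"

async def fetch_reports(limit: int = 100) -> list[dict]:
    reports = []
    async with websockets.connect(RELAY) as ws:
        # Standard NIP-01 subscription, filtered to kind 1984 (NIP-56 reports).
        await ws.send(json.dumps(["REQ", "reports", {"kinds": [1984], "limit": limit}]))
        while True:
            msg = json.loads(await ws.recv())
            if msg[0] == "EVENT":
                reports.append(msg[2])
            elif msg[0] == "EOSE":  # end of stored events
                break
    return reports

print(asyncio.run(fetch_reports()))
```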

It's not awful.

It's illegal

Awful is very different from illegal.

Ok, but legality has bearing on awfulness

You don't have to worry about definitions. These models are very smart and are happy to provide you with a float between zero and one. Then you just set a threshold on what scores you will tolerate. No need to engage further with the question.
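For instance (the 0.9 cutoff here is just an illustration; the threshold is a policy choice, not something the model gives you):

```python
# The model only gives you a float; where you draw the line is up to you.
NSFW_THRESHOLD = 0.9  # example value, not a recommendation

def should_remove(score: float) -> bool:
    """Flag content for removal when the score exceeds the cutoff."""
    return score >= NSFW_THRESHOLD
```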

Semantics aside, I'd use that if I ran a relay. I'm not sure it makes sense to bake it into a client, though? Especially since it would be sending the data to a third-party server.

Right. Actually, baking it into a client would be the most dangerous option too.

Yes, more often than not what is awful is legal and what is good is illegal.