Is this something that implementing exponential backoff for retries would help with on the Damus client? I don’t actually know how it works or if that’s the problem.

Discussion

It may not hurt, but I presume this is caused primarily by spammers, not real users. The volume of their events is just staggering.

Once the relay gets behind, it becomes challenging to catch back up, and it often ends with a freeze and then a mega event dump. Seems like the damus relay is resilient enough to catch back up where others can't recover, but the short freezes are still happening.

Damus will now retry when it doesn't get an OK result, so at least users won't suffer too badly from this.
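
(For the earlier question about exponential backoff: the idea is just to wait progressively longer between retries so a struggling relay doesn't get hammered by everyone at once. A rough sketch of the pattern - illustrative Python, not actual Damus code, and `publish_event` is a hypothetical stand-in for whatever sends the event and waits for the OK:)

```python
import random
import time

def publish_with_backoff(publish_event, event, max_attempts=5, base_delay=0.5):
    """Retry publishing with exponentially growing, jittered delays.

    `publish_event` is a hypothetical callable returning True once the relay
    responds with an OK for this event, False (or nothing in time) otherwise.
    """
    for attempt in range(max_attempts):
        if publish_event(event):
            return True
        # Wait 0.5s, 1s, 2s, 4s, ... plus jitter so many clients don't all
        # retry at the same instant.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))
    return False
```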

Are you doing rate limiting?

We can’t do IP rate limiting (on filter.nostr.wine) since we are reading from other relays directly and don’t have the source IP.

We are doing some spam filtering using the strfry stream plugin, but during the occasional event surges we can’t keep up, since it’s synchronous and we are streaming from the big public relays simultaneously.
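
(For anyone unfamiliar with why that's a bottleneck: a strfry policy plugin is a child process that reads one JSON request per event on stdin and has to write one JSON verdict on stdout before strfry moves on. Roughly this shape - a simplified sketch based on my reading of the plugin docs, with `is_spam` standing in for whatever check actually runs:)

```python
import json
import sys

def is_spam(event: dict) -> bool:
    """Stand-in for the real check (metadata lookups, ML scoring, etc.)."""
    return False

# strfry feeds one JSON request per line on stdin and waits for our one-line
# verdict on stdout -- so every event blocks on the check for the previous one.
for line in sys.stdin:
    req = json.loads(line)
    event = req["event"]
    spam = is_spam(event)
    print(json.dumps({
        "id": event["id"],
        "action": "reject" if spam else "accept",
        "msg": "blocked: spam" if spam else "",
    }), flush=True)
```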

Just moved the plug-in/spam filtering part to a separate box and now streaming only the pre-filtered and de-duped events. Curious if that is enough for us to manage at least temporarily.

Ultimately I think we will just have to handle the spam filtering part asynchronously with our own design. No matter how fast the plug-in is, I think it’ll eventually get behind at scale.

We’ll be improving this by moving to queues with distributed spam workers for better stability and spike handling, but hopefully this will hold while we work on getting that implemented.

What kind of throughput do you need to handle?

I’m at around ~60 spam events/second and ~30 non-spam (rough averages). I was handling 250/second easily for bursts/importing events. Basically I process about 10x the number of unique events left after spam removal. Something like 8-10 million events/day (with spam and dupes).

I’ve got two validation workers (I tried six at one point, but didn’t need them), four persist workers, and basically a Redis stream setup. Validation worker idle time averages 1ms (it batches), and persistence averages around 2-4ms. The validation workers also call the spam ML model.
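
(Roughly this shape, heavily simplified - the stream/group names and batch sizes here are made up for illustration, and the validate/spam/storage functions are stand-ins, not my actual code:)

```python
import json

import redis  # redis-py

r = redis.Redis()

def ensure_group(stream: str, group: str) -> None:
    try:
        r.xgroup_create(stream, group, id="0", mkstream=True)
    except redis.ResponseError:
        pass  # group already exists

def validation_worker(name: str, batch_size: int = 256) -> None:
    """Pull raw events in batches, validate + spam-score, pass survivors on."""
    ensure_group("events:raw", "validators")
    while True:
        resp = r.xreadgroup("validators", name, {"events:raw": ">"},
                            count=batch_size, block=1000)
        for _stream, entries in resp:
            for entry_id, fields in entries:
                event = json.loads(fields[b"event"])
                if validate(event) and not is_spam(event):
                    r.xadd("events:validated", {"event": fields[b"event"]})
                r.xack("events:raw", "validators", entry_id)

def persist_worker(name: str, batch_size: int = 256) -> None:
    """Pull validated events in batches and write each batch in one go."""
    ensure_group("events:validated", "persisters")
    while True:
        resp = r.xreadgroup("persisters", name, {"events:validated": ">"},
                            count=batch_size, block=1000)
        for _stream, entries in resp:
            write_batch(json.loads(f[b"event"]) for _id, f in entries)
            for entry_id, _fields in entries:
                r.xack("events:validated", "persisters", entry_id)

# Stand-ins for the real signature check, spam model, and storage layer.
def validate(event: dict) -> bool: return True
def is_spam(event: dict) -> bool: return False
def write_batch(events) -> None: pass
```

Scaling either stage is just starting another process with a different consumer name in the same group.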

Streams helped a lot. Batching does too. Deferring work as well - I process tags, meta events, zaps, contact lists, db usage, pow, relay meta, all after the fact - but a relay doesn’t need that post-processing.

I also don’t broadcast any events until they have been persisted and validated. Mostly so I can de-dupe too.
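
(The de-dupe itself can be as simple as an atomic SET NX on the event id with a TTL - sketch only, key name and TTL are arbitrary:)

```python
import redis  # redis-py

r = redis.Redis()

def seen_before(event_id: str, ttl_seconds: int = 86400) -> bool:
    """Record the event id atomically; True means we already saw it.

    SET NX succeeds only for the first writer, so concurrent workers agree
    on exactly one "first sighting" without any extra locking.
    """
    first_time = r.set(f"seen:{event_id}", 1, nx=True, ex=ttl_seconds)
    return not first_time
```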

I actually stopped connecting to Damus (for aggregation) a little while back due to spam volume. Basically other relays like nos.lol have the same data but filtered :)

Happy to chat more. What I have has been pretty low-touch for a while now. It has limitations around broadcasting older events if workers are offline for a while or when importing missed events, but that can be addressed or filtered as desired.

Going to let #[4] answer since she will architect the data pipeline, but your setup sounds somewhat similar to what we’ve discussed.

To your comment re: damus, that is what we are seeing as well. I saw 800+ events per second come in earlier when it got backed up for ~2 minutes.

Sounds like a great setup! The main bottleneck we have now is that the plug-in architecture in strfry is synchronous - it waits on our verdict for each message before moving on to the next, and it doesn’t support async at this point (just one instance per relay we are streaming from). It also runs before deduping, and we have no access to network information like IPs that could help us manage it at a network level, since it’s coming from the relays (your setup sounds similar in that way). Right now the spam detection is fairly straightforward and just accesses metadata in Redis, and the rest of the system is light and/or async, but funneling it all through a single linear point just can’t handle bursts.

We also don’t have much control, with the current implementation, over how data is handled as it starts to get backed up during bursts. We actually have RAM to spare and it’s our disk taking a beating - so we would love a bit more control over that, even if it’s just distributing some of the I/O across different mounted disks.

I don’t think we need too much more to handle current-day volume (though the damus relay is definitely making the linear bottleneck an issue), but I’m also concerned about scale if nostr breaks into new markets. We really want to maintain availability as a relay, especially as a premium one people are paying for, so we want to be sure we can handle spikes gracefully! I also anticipate the spam detection growing in complexity and want to make sure we can distribute the processing to prevent latency issues. The queues would take some of the strain off strfry when things get busy and give us the ability to take advantage of autoscaling for efficient infra usage and latency control.

Sounds like we’re doing similar things though - may be worth collaborating, especially if we can design components that have crossover as utilities.
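
(The rough shape we have in mind - purely a sketch, not our actual code: keep only a cheap cached verdict inline, e.g. a Redis set of known-spam pubkeys, and push everything else onto a queue so the heavier scoring happens off the hot path; the workers then update that set so later events from the same source get rejected instantly. Same stdin/stdout plugin interface as the earlier sketch.)

```python
import json
import sys

import redis  # redis-py; assumes Redis is reachable from the plugin box

r = redis.Redis()

# The only inline work is one cached set lookup and one XADD, so the verdict
# returns quickly even during bursts. The expensive scoring runs in separate,
# autoscalable workers that keep the "spam:pubkeys" set up to date and can
# reject a noisy source retroactively for all of its later events.
for line in sys.stdin:
    req = json.loads(line)
    event = req["event"]

    inline_reject = r.sismember("spam:pubkeys", event["pubkey"])
    if not inline_reject:
        # Defer the heavy check to the queue of events awaiting scoring.
        r.xadd("events:to_score", {"event": json.dumps(event)})

    print(json.dumps({
        "id": event["id"],
        "action": "reject" if inline_reject else "accept",
        "msg": "blocked: spam" if inline_reject else "",
    }), flush=True)
```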

🫡 🙏