Is this something that implementing exponential backoff for retries would help with on the Damus client? I don’t actually know how it works or if that’s the problem.

Discussion

It may not hurt, but I presume this is caused primarily by spammers, not real users. The volume of their events is just staggering.

Once the relay gets behind, it becomes challenging to catch back up, and it often ends with a freeze and then a mega event dump. Seems like the damus relay is resilient enough to catch back up where others can't recover, but the short freezes are still happening.

Damus will now retry when it doesn't get an OK result, so at least users won't suffer too badly from this.
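
(For the earlier question about exponential backoff: the idea is just to wait progressively longer between retries so a struggling relay doesn't get hammered by everyone at once. A rough sketch of the pattern - illustrative Python, not actual Damus code, and `publish_event` is a hypothetical stand-in for whatever sends the event and waits for the OK:)

```python
import random
import time

def publish_with_backoff(publish_event, event, max_attempts=5, base_delay=0.5):
    """Retry publishing with exponentially growing, jittered delays.

    `publish_event` is a hypothetical callable returning True once the relay
    responds with an OK for this event, False (or nothing in time) otherwise.
    """
    for attempt in range(max_attempts):
        if publish_event(event):
            return True
        # Wait 0.5s, 1s, 2s, 4s, ... plus jitter so many clients don't all
        # retry at the same instant.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))
    return False
```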

Are you doing rate limiting?

We can’t do IP rate limiting (on filter.nostr.wine) since we are reading from other relays directly and don’t have the source IP.

We are doing some spam filtering using the strfry stream plugin, but during the occasional event surges we can’t keep up, since it’s synchronous and we are streaming from the big public relays simultaneously.
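
(For anyone unfamiliar with why that's a bottleneck: a strfry policy plugin is a child process that reads one JSON request per event on stdin and has to write one JSON verdict on stdout before strfry moves on. Roughly this shape - a simplified sketch based on my reading of the plugin docs, with `is_spam` standing in for whatever check actually runs:)

```python
import json
import sys

def is_spam(event: dict) -> bool:
    """Stand-in for the real check (metadata lookups, ML scoring, etc.)."""
    return False

# strfry feeds one JSON request per line on stdin and waits for our one-line
# verdict on stdout -- so every event blocks on the check for the previous one.
for line in sys.stdin:
    req = json.loads(line)
    event = req["event"]
    spam = is_spam(event)
    print(json.dumps({
        "id": event["id"],
        "action": "reject" if spam else "accept",
        "msg": "blocked: spam" if spam else "",
    }), flush=True)
```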

Just moved the plug-in/spam filtering part to a separate box and now streaming only the pre-filtered and de-duped events. Curious if that is enough for us to manage at least temporarily.

Ultimately I think we will just have to handle the spam filtering part asynchronously with our own design. No matter how fast the plug-in is, I think it’ll eventually get behind at scale.

We’ll be improving this by moving to queues with distributed spam workers for better stability and spike handling, but hopefully this will hold while we work on getting that implemented.

What kind of throughput do you need to handle?

I’m at around ~60 spam events/second and ~30 non-spam (rough averages). I was handling 250/second easily for bursts/importing events. Basically I process about 10x the number of unique events left after spam removal. Something like 8-10 million events/day (with spam and dupes).

I’ve got two validation workers (I tried six at one point, but didn’t need them), four persist workers, and basically a Redis stream setup. Validation worker idle time averages 1ms (it batches), and persistence averages around 2-4ms. The validation workers also call the spam ML model.
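
(Roughly this shape, heavily simplified - the stream/group names and batch sizes here are made up for illustration, and the validate/spam/storage functions are stand-ins, not my actual code:)

```python
import json

import redis  # redis-py

r = redis.Redis()

def ensure_group(stream: str, group: str) -> None:
    try:
        r.xgroup_create(stream, group, id="0", mkstream=True)
    except redis.ResponseError:
        pass  # group already exists

def validation_worker(name: str, batch_size: int = 256) -> None:
    """Pull raw events in batches, validate + spam-score, pass survivors on."""
    ensure_group("events:raw", "validators")
    while True:
        resp = r.xreadgroup("validators", name, {"events:raw": ">"},
                            count=batch_size, block=1000)
        for _stream, entries in resp:
            for entry_id, fields in entries:
                event = json.loads(fields[b"event"])
                if validate(event) and not is_spam(event):
                    r.xadd("events:validated", {"event": fields[b"event"]})
                r.xack("events:raw", "validators", entry_id)

def persist_worker(name: str, batch_size: int = 256) -> None:
    """Pull validated events in batches and write each batch in one go."""
    ensure_group("events:validated", "persisters")
    while True:
        resp = r.xreadgroup("persisters", name, {"events:validated": ">"},
                            count=batch_size, block=1000)
        for _stream, entries in resp:
            write_batch(json.loads(f[b"event"]) for _id, f in entries)
            for entry_id, _fields in entries:
                r.xack("events:validated", "persisters", entry_id)

# Stand-ins for the real signature check, spam model, and storage layer.
def validate(event: dict) -> bool: return True
def is_spam(event: dict) -> bool: return False
def write_batch(events) -> None: pass
```

Scaling either stage is just starting another process with a different consumer name in the same group.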

Streams helped a lot. Batching does too. Deferring work as well - I process tags, meta events, zaps, contact lists, db usage, pow, relay meta, all after the fact - but a relay doesn’t need that post-processing.

I also don’t broadcast any events until they have been persisted and validated. Mostly so I can de-dupe too.
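
(The de-dupe itself can be as simple as an atomic SET NX on the event id with a TTL - sketch only, key name and TTL are arbitrary:)

```python
import redis  # redis-py

r = redis.Redis()

def seen_before(event_id: str, ttl_seconds: int = 86400) -> bool:
    """Record the event id atomically; True means we already saw it.

    SET NX succeeds only for the first writer, so concurrent workers agree
    on exactly one "first sighting" without any extra locking.
    """
    first_time = r.set(f"seen:{event_id}", 1, nx=True, ex=ttl_seconds)
    return not first_time
```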

I actually stopped connecting to Damus (for aggregation) a little while back due to spam volume. Basically other relays like nos.lol have the same data but filtered :)

Happy to chat more. What I have has been pretty low-touch for a while now. It has limitations around broadcasting older events if workers are offline for a while or when importing missed events, but that can be addressed or filtered as desired.

Going to let #[4] answer since she will architect the data pipeline, but your setup sounds somewhat similar to what we’ve discussed.

To your comment re: damus, that is what we are seeing as well. I saw 800+ events per second come in earlier when it got backed up for ~2 minutes.

Sounds like a great setup! The main bottleneck we have now is that the plug-in architecture in strfry is synchronous - it waits on our verdict for each message before moving on to the next, and it doesn’t support async at this point (just one instance per relay we are streaming from). It also runs before deduping, and we have no access to network information like IPs that could help us manage it at a network level, since it’s coming from the relays (your setup sounds similar in that way). Right now the spam detection is fairly straightforward and just accesses metadata in Redis, and the rest of the system is light and/or async, but funneling it all through a single linear point just can’t handle bursts.

We also don’t have much control, with the current implementation, over how data is handled as it starts to get backed up during bursts. We actually have RAM to spare and it’s our disk taking a beating - so we would love a bit more control over that, even if it’s just distributing some of the I/O across different mounted disks.

I don’t think we need too much more to handle current-day volume (though the damus relay is definitely making the linear bottleneck an issue), but I’m also concerned about scale if nostr breaks into new markets. We really want to maintain availability as a relay, especially as a premium one people are paying for, so we want to be sure we can handle spikes gracefully! I also anticipate the spam detection growing in complexity and want to make sure we can distribute the processing to prevent latency issues. The queues would take some of the strain off strfry when things get busy and give us the ability to take advantage of autoscaling for efficient infra usage and latency control.

Sounds like we’re doing similar things though - may be worth collaborating, especially if we can design components that have crossover as utilities.
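
(The rough shape we have in mind - purely a sketch, not our actual code: keep only a cheap cached verdict inline, e.g. a Redis set of known-spam pubkeys, and push everything else onto a queue so the heavier scoring happens off the hot path; the workers then update that set so later events from the same source get rejected instantly. Same stdin/stdout plugin interface as the earlier sketch.)

```python
import json
import sys

import redis  # redis-py; assumes Redis is reachable from the plugin box

r = redis.Redis()

# The only inline work is one cached set lookup and one XADD, so the verdict
# returns quickly even during bursts. The expensive scoring runs in separate,
# autoscalable workers that keep the "spam:pubkeys" set up to date and can
# reject a noisy source retroactively for all of its later events.
for line in sys.stdin:
    req = json.loads(line)
    event = req["event"]

    inline_reject = r.sismember("spam:pubkeys", event["pubkey"])
    if not inline_reject:
        # Defer the heavy check to the queue of events awaiting scoring.
        r.xadd("events:to_score", {"event": json.dumps(event)})

    print(json.dumps({
        "id": event["id"],
        "action": "reject" if inline_reject else "accept",
        "msg": "blocked: spam" if inline_reject else "",
    }), flush=True)
```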

🫡 🙏