1. yes, 2. yes, 3. yes, 4. yes.
Discussion
Thanks again. So, if it’s all yes, then in practice Gossip is also trying to connect to broken relays, right?
What sort of timeout settings and retry logic does Gossip currently use? nostr:nprofile1qqsyvrp9u6p0mfur9dfdru3d853tx9mdjuhkphxuxgfwmryja7zsvhqpz9mhxue69uhkummnw3ezuamfdejj7qgswaehxw309ahx7um5wghx6mmd9uq3wamnwvaz7tmkd96x7u3wdehhxarjxyhxxmmd9ukfdvuv also mentioned that broken relays aren’t as much of a problem in Amethyst as they are in Haven. I get the feeling that each of us may be handling this differently.
Haven tests relays during startup. It also uses a Penalty Box: if a relay fails to connect, it retries after 30 seconds, and then exponential backoff kicks in. Out of curiosity I put a counter on it, and with my current implementation this meant ~110k failed connection attempts in the first 24 hours after restarting the relay (though as exponential backoff kicks in, the number of attempts drops substantially). Of course, I could make the "broken relay detection" converge faster by tweaking the exponential backoff, but that also means that the next time one of the popular relays goes offline, Haven will quickly give up on it.
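To make that concrete, here is a minimal sketch (in Go, with hypothetical names, not Haven's actual code) of a penalty box with a 30-second initial retry and doubling backoff:

```go
package penaltybox

import (
	"sync"
	"time"
)

const (
	baseDelay = 30 * time.Second // first retry after 30 seconds
	maxShift  = 10               // cap the exponent so delays stop growing
)

// penaltyBox tracks failing relays and decides when a reconnect is allowed.
type penaltyBox struct {
	mu      sync.Mutex
	entries map[string]*penaltyEntry
}

type penaltyEntry struct {
	failures  int
	nextRetry time.Time
}

func newPenaltyBox() *penaltyBox {
	return &penaltyBox{entries: make(map[string]*penaltyEntry)}
}

// Fail records a failed connection attempt and schedules the next retry:
// 30s, 60s, 120s, ... doubling on every failure, capped by maxShift.
func (p *penaltyBox) Fail(relayURL string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	e, ok := p.entries[relayURL]
	if !ok {
		e = &penaltyEntry{}
		p.entries[relayURL] = e
	}
	shift := e.failures
	if shift > maxShift {
		shift = maxShift
	}
	e.failures++
	e.nextRetry = time.Now().Add(baseDelay * time.Duration(1<<shift))
}

// CanTry reports whether a new connection attempt is allowed right now.
func (p *penaltyBox) CanTry(relayURL string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	e, ok := p.entries[relayURL]
	return !ok || time.Now().After(e.nextRetry)
}
```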
This is where a good blacklist could be useful. You can do this without fully trusting the list, too. For example, using the Nosotros list above, I would still try to connect to all relays during initialisation, but if the connection fails and the relay is on the blacklist, I would set a much longer reconnect delay for the exponential backoff algorithm (e.g. 1 hour or even 24 hours). So if the list is wrong, or one of the dead relays gets "resurrected", Haven will eventually connect to it again, but the whole process becomes much cheaper.
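Extending the penalty-box sketch above (which already imports "time"), the blacklist-aware delay could be as small as this; it is purely illustrative, and where the blacklist comes from is left open:

```go
// nextDelay picks the backoff delay after `failures` consecutive failures.
// A relay that is both failing and present on the (untrusted) blacklist gets
// a much longer base delay, so it is still retried eventually, just rarely.
func nextDelay(failures int, blacklisted bool) time.Duration {
	base := 30 * time.Second
	if blacklisted {
		base = 1 * time.Hour // or even 24 * time.Hour
	}
	if failures > 10 {
		failures = 10 // cap the exponent
	}
	return base * time.Duration(1<<failures)
}
```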
For most connections I use a 30 second timeout. For advertising relay lists I use a 5 second timeout.
The Gossip client tracks jobs; unfinished jobs with benign errors are retried after a timeout, but there is a long list of error conditions that change the behavior.
For a few random examples, if the relay returned status 4000, we exclude connections to that relay for 600 seconds. If the connection timed out, we exclude it for 60 seconds. If we got a NOT_FOUND, 600 seconds. If we got a websocket error, 15 seconds. Some of them are probably poorly chosen numbers and could use review.
We then randomize the exclusion value a bit so that retries to multiple relays don't all line up together. For jobs that are not marked "persistent" we don't bother trying again. From my reading of the code right now, it looks like if a relay is excluded and a new job wants to talk to it, an error is thrown rather than the job being queued to run after the exclusion expires, which might not be ideal.
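Gossip is written in Rust, so the following is only a Go sketch of that general shape, using the durations mentioned above; the error classification names and the ±20% jitter amount are my own assumptions, not Gossip's actual values:

```go
package exclusion

import (
	"math/rand"
	"time"
)

// errorClass is a simplified classification of relay failures, invented here
// to illustrate mapping error conditions to exclusion windows.
type errorClass int

const (
	errStatus4000 errorClass = iota
	errTimeout
	errNotFound
	errWebsocket
)

// baseExclusion returns the exclusion window for an error class,
// using the durations quoted in the discussion above.
func baseExclusion(e errorClass) time.Duration {
	switch e {
	case errStatus4000, errNotFound:
		return 600 * time.Second
	case errTimeout:
		return 60 * time.Second
	case errWebsocket:
		return 15 * time.Second
	default:
		return 60 * time.Second
	}
}

// jittered randomizes the exclusion a bit (here ±20%, an assumed amount)
// so retries to many relays don't all line up at the same instant.
func jittered(d time.Duration) time.Duration {
	f := 0.8 + 0.4*rand.Float64() // uniform in [0.8, 1.2)
	return time.Duration(float64(d) * f)
}
```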
Relays marked rank=0 are never connected to, so users can disable relays that don't work, but that requires the user to step in, and users usually aren't going to notice these details.
Also, I never assume a relay is broken.... just that it isn't working right now. So it never stops trying. But exclusions of 600 seconds are long enough that this doesn't matter much.
Sometime in the past I coded a way for users to delete (knowledge of) relays that they knew were "dead". The code didn't work and after some debugging I realized that it actually did work, but the relay got recreated because data in my local database referred to it, and any reference to a relay URL makes sure a relay record exists for it and starts collecting statistics on it. So it just gets deleted and recreated by the other logic. Even if I deleted the events in the local database, new events on nostr would recreate it. I would have to mark it with a tombstone saying to never recreate it... which seems wrong because the DNS name might one day be used for a relay again.
I also pick relays judiciously. When I need to read from someone's set of outboxes (or inboxes), I choose the relays that are most likely to be working.
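I don't know the exact scoring used here, but "most likely to be working" can be approximated by something as simple as ranking candidates by recent connection success rate; the sketch below is purely illustrative, not Gossip's actual selection logic:

```go
package routing

import "sort"

// relayStats holds a rough record of recent connection outcomes per relay.
type relayStats struct {
	attempts  int
	successes int
}

// successRate gives unknown relays a neutral score so they still get tried.
func successRate(s relayStats) float64 {
	if s.attempts == 0 {
		return 0.5
	}
	return float64(s.successes) / float64(s.attempts)
}

// pickRelays sorts the candidate outbox/inbox relays by recent success rate
// (in place) and returns the top n.
func pickRelays(candidates []string, stats map[string]relayStats, n int) []string {
	sort.SliceStable(candidates, func(i, j int) bool {
		return successRate(stats[candidates[i]]) > successRate(stats[candidates[j]])
	})
	if len(candidates) > n {
		candidates = candidates[:n]
	}
	return candidates
}
```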
The console log shows all kinds of relay errors and complaints. I just ignore most of it.
Thanks, Mike. That was all very helpful.
The more I read and think about it, the less I believe there’s any obvious solution to distributed systems problems, especially on Nostr, where infrastructure is mostly provided on a best-effort basis.
IMV, what you’re describing is the typical outcome of a “simple protocol” meeting the realities of distributed systems. These are the usual complexities that enterprise-grade distributed systems have to deal with: customisable timeouts and retry policies, backoff with jitter, throttling, debouncing, smart adaptive routing, and ultimately embracing the mess of unreliable relays as part of the design. Add aggregators, Blastr relays ("Broadcaster"), and so on, and it starts looking a lot like the typical stuff I do for a living.
Of course, we can always make incremental improvements (e.g. relays that have been dead for over a year are unlikely to come back online), but ultimately what you’re saying makes sense. What I dislike about this situation is that every developer will have to reinvent the wheel over and over again. Truth be told, this is very difficult to get right, and it will only get harder as the network grows and more relays are decommissioned.
My take is that we need serious work on SDKs and core libraries for Nostr to scale. For example, go-nostr, which is one of the better libraries out there, implements a Penalty Box with naive backoff (no jitter, no custom policies, no parameterisation of timeouts, etc.). It helps, but it's not enough on its own to sanely scale the Outbox model. Several SDKs and libraries out there provide nothing like this out of the box.
In an ideal world, there would be a one-liner to apply default policies to the connection pool so that it is both more robust and less wasteful. This should be available in all major SDKs and libraries across programming languages. If Nostr wants the Outbox model to become a standard, and for clients to eventually deal with thousands or even tens of thousands of relays, it's just not realistic to expect every single client (and specialised relay) to achieve this level of robustness on its own.
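For illustration only, such a one-liner might look roughly like this; BackoffPolicy, RelayPool and ApplyDefaultPolicies are invented names, not an API that go-nostr or any other library currently ships:

```go
package pool

import "time"

// BackoffPolicy is a parameterisable retry policy: configurable base delay,
// cap, growth factor and jitter, instead of a hard-coded naive backoff.
type BackoffPolicy struct {
	Base       time.Duration
	Cap        time.Duration
	Multiplier float64
	Jitter     float64 // fractional spread, e.g. 0.2 for roughly ±20%
}

// RelayPool stands in for an SDK connection pool.
type RelayPool struct {
	DialTimeout time.Duration
	Backoff     BackoffPolicy
	PenaltyBox  bool
}

// ApplyDefaultPolicies is the hypothetical "one-liner": sane defaults for
// timeouts, backoff with jitter and a penalty box, applied in one call.
func ApplyDefaultPolicies(p *RelayPool) {
	p.DialTimeout = 10 * time.Second
	p.Backoff = BackoffPolicy{
		Base:       30 * time.Second,
		Cap:        6 * time.Hour,
		Multiplier: 2,
		Jitter:     0.2,
	}
	p.PenaltyBox = true
}
```

A client would then call ApplyDefaultPolicies(pool) once at startup and get the robust behaviour by default instead of reimplementing it.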
I’ve seen very well-funded projects take years to reach this level of maturity, usually with the help of a strong “core” team (people like Fiatjaf, you, some experienced SREs) providing a Golden Path with robust tooling, architectural guidance, and, importantly, proper funding and time.
I was having a discussion yesterday about Nostr and OSS governance with nostr:nprofile1qqsrhuxx8l9ex335q7he0f09aej04zpazpl0ne2cgukyawd24mayt8gprfmhxue69uhhq7tjv9kkjepwve5kzar2v9nzucm0d5hszxmhwden5te0wfjkccte9emk2um5v4exucn5vvhxxmmd9us2xuyp. This is going to be an interesting challenge for Nostr’s decentralised governance model for sure: how do we get people to come together and work on the less appreciated but core tasks like this?
I think we have been exploring and learning for a while, and some kind of useful guidance for those who don't want to spend years exploring and learning sounds like a good idea. Start from the principle of defending yourself: don't trust, don't rely on the unreliable, and program for the worst case, including active attack.