Replying to Avatar Anthony Accioly

Thanks again. So, if it’s all yes, then in practice Gossip is also trying to connect to broken relays, right?

What sort of timeout settings and retry logic does Gossip currently use? nostr:nprofile1qqsyvrp9u6p0mfur9dfdru3d853tx9mdjuhkphxuxgfwmryja7zsvhqpz9mhxue69uhkummnw3ezuamfdejj7qgswaehxw309ahx7um5wghx6mmd9uq3wamnwvaz7tmkd96x7u3wdehhxarjxyhxxmmd9ukfdvuv also mentioned that broken relays aren’t as much of a problem in Amethyst as they are in Haven. I get the feeling that each of us may be handling this differently.

Haven tests relays during startup. It also uses a Penalty Box: if a relay fails to connect, it retries after 30 seconds, and then exponential backoff kicks in. Out of curiosity I put a counter on it, and with my current implementation this meant ~110k failed connection attempts in the first 24 hours after restarting the relay (though as exponential backoff kicks in the number of connections reduce substantially). Of course that I could make the "broken relay detection" converge faster by tweaking exponential backoff, but this also means that the next time that one of the popular arrays go offline Haven will quickly give up on them.

This is where a good blacklist could be useful. You can do this without fully trusting the list as well. I.e., if I can use Nosotros list above, still try to connect to all relays during initialisation, but if the connection fails and the relay is in the "black list", then I set a much higher time until reconnect for the exponential backoff algorithm (e.g. 1 hour or even 24 hours). So if the list is wrong or one of the dead relays gets "ressurected" Haven will eventually connect to it, but it makes the whole process much cheaper.

For most connections I use a 30 second timeout. For advertising relay lists I use a 5 second timeout.

Gossip client tracks jobs, unfinished jobs with benign errors will try again after a timeout, but there is a long list of error conditions which change the behavior.

For a few random examples, if the relay returned status 4000, we exclude connections to that relay for 600 seconds. If the connection timed out, we exclude it for 60 seconds. If we got a NOT_FOUND, 600 seconds. If we got a websocket error, 15 seconds. Some of them are probably poorly chosen numbers and could use review.

We then randomize the exclusion value a bit so that retries to multiple relays don't all line up together. For jobs that are not marked "persistent" we don't bother trying again. From my reading of the code right now it looks like if a relay is excluded and a new job wants to talk to it, an error is thrown rather than queueing up that job to happen after the exclusion, which might not be ideal.

Relays marked rank=0 are never connected to, so users can disable relays that don't work, but this involves the user who usually isn't going to notice these details.

Reply to this note

Please Login to reply.

Discussion

No replies yet.