Yeah, same problem here (I'm reading Kind 10002 and NIP-05 relays for experimental WoT stuff). I bet we're not the only ones running into this. Calling ensureRelay on each of the formerly popular relays is just a huge waste of time. I bet a lot of Outbox enabled clients are facing the same issue. nostr:nprofile1qqswuyd9ml6qcxd92h6pleptfrcqucvvjy39vg4wx7mv9wm8kakyujgpypmhxue69uhkx6r0wf6hxtndd94k2erfd3nk2u3wvdhk6w35xs6z7qgwwaehxw309ahx7uewd3hkctcpypmhxue69uhkummnw3ezuetfde6kuer6wasku7nfvuh8xurpvdjj7a0nq40, nostr:nprofile1qqsyvrp9u6p0mfur9dfdru3d853tx9mdjuhkphxuxgfwmryja7zsvhqpz9mhxue69uhkummnw3ezuamfdejj7qgswaehxw309ahx7um5wghx6mmd9uq3wamnwvaz7tmkd96x7u3wdehhxarjxyhxxmmd9ukfdvuv, correct me if I'm wrong here. Also, do you have a better solution to this issue than "big shared broken relay list"?

Discussion

Connecting to relays that are not working is pretty cheap. So, I just try to connect and ignore if the relay is offline.

I’m doing the same. The real problem is telling apart flaky or temporarily down relays from ones that are actually dead. E.g., if I treat "fail to connect during startup" as dead, I risk missing relays that are just temporarily offline, which can be catastrophic for long-running processes like relays. But if I don’t, then I need increasingly complex retry algorithms ("penalty box", exponential backoff, etc.), which still waste time and resources retesting truly dead relays. Even worse, some dead relays just hang connections. Right now I’m using a 5-second timeout when testing relays, and even that adds up quickly when combined with retry algorithms across thousands of relays. A good blacklist would be very useful.
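
For reference, the per-relay test I mean is roughly the sketch below (a minimal example using go-nostr; the URL is a placeholder, and the exact RelayConnect options may differ between library versions):

```go
// Minimal sketch of a startup relay test with a hard timeout
// (github.com/nbd-wtf/go-nostr; the URL and 5s value are examples).
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/nbd-wtf/go-nostr"
)

func testRelay(url string) bool {
	// Give the relay at most 5 seconds to complete the websocket handshake.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	relay, err := nostr.RelayConnect(ctx, url)
	if err != nil {
		return false // offline, hanging, or refusing connections
	}
	relay.Close()
	return true
}

func main() {
	fmt.Println(testRelay("wss://relay.example.com"))
}
```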

You can also check the HTTP status code when opening the connection. Many offline relays still respond with 400 or 500 codes. Those you can re-test once a day or so.
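
Roughly like this (a sketch using gorilla/websocket rather than any particular Nostr library, since its dialer exposes the HTTP response of a failed upgrade; the timeout is just an example):

```go
// Sketch of reading the HTTP status of the websocket upgrade attempt
// (github.com/gorilla/websocket; the 5-second timeout is illustrative).
package relayprobe

import (
	"context"
	"time"

	"github.com/gorilla/websocket"
)

// probe returns the HTTP status of the upgrade attempt, or 0 if we never
// got an HTTP response at all (DNS failure, refused, timed out, hung).
func probe(wsURL string) int {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, resp, err := websocket.DefaultDialer.DialContext(ctx, wsURL, nil)
	if conn != nil {
		conn.Close()
	}
	if resp == nil {
		return 0
	}
	if err != nil {
		// The server answered HTTP but the handshake failed; 400/500-range
		// codes here are the "re-test once a day" candidates.
		return resp.StatusCode
	}
	return resp.StatusCode // 101 Switching Protocols on success
}
```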

Significantly less efficient than NIP-66.

I don't like blacklists here because that becomes another point of trust. I'd prefer to know for myself if a relay never works for me. So I want the client to do this and to remember.

The Gossip client keeps statistics on how successful a relay has been, and has fairly aggressive timeouts for poorly performing relays.

Also, for choosing outbox relays I go through a fairly long process to find appropriate relays and then I hardcode them in the next release. This goes like this:

1. Print out the relays from my gossip client sorted by rank (best performing first)

2. Skip if we never connected to it

3. Skip if the URL has a path beyond the hostname

4. Skip if the relay does not have a nip-11

5. Skip if the pubkey in the nip11 is invalid (or a prefix)

6. Skip if nip-11 indicates limitation.restricted_writes

7. Skip if nip-11 indicates fees

8. Remove relays known to be special-purpose

9. Score according to a custom algorithm that uses statistics from gossip client (number of attempts, last connected, success rate, etc).

10. Test with a random keypair to see if it accepts a note and I can read it back.

Fewer and fewer relays pass this gauntlet because of spam filtering measures.
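
For concreteness, the NIP-11 portion of that gauntlet (steps 4 through 7) looks roughly like the sketch below. It's not Gossip's actual code (Gossip is Rust); the JSON field names just follow NIP-11, and the fee check simply looks for any fees object:

```go
// Rough sketch of the NIP-11 checks; error handling is minimal.
package relayvet

import (
	"encoding/json"
	"net/http"
	"strings"
	"time"
)

type relayInfo struct {
	Pubkey     string `json:"pubkey"`
	Limitation struct {
		RestrictedWrites bool `json:"restricted_writes"`
		PaymentRequired  bool `json:"payment_required"`
	} `json:"limitation"`
	Fees map[string]any `json:"fees"`
}

func passesNIP11(wsURL string) bool {
	// NIP-11 documents are served over HTTP on the same host.
	httpURL := strings.Replace(wsURL, "wss://", "https://", 1)

	req, _ := http.NewRequest("GET", httpURL, nil)
	req.Header.Set("Accept", "application/nostr+json")

	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return false // step 4: no NIP-11 document reachable
	}
	defer resp.Body.Close()
	if resp.StatusCode != 200 {
		return false // step 4: no NIP-11 document
	}

	var info relayInfo
	if json.NewDecoder(resp.Body).Decode(&info) != nil {
		return false
	}
	if len(info.Pubkey) != 64 {
		return false // step 5: invalid (or prefix-only) pubkey
	}
	if info.Limitation.RestrictedWrites {
		return false // step 6: restricted writes
	}
	if info.Limitation.PaymentRequired || len(info.Fees) > 0 {
		return false // step 7: fees
	}
	return true
}
```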

Hey Mike,

Thanks for coming back on this one. Just so I’m clear on what we’re discussing, given the sophisticated heuristics here: are you talking about choosing "default" Outbox relays for new keys? Or is Gossip ignoring write relays from users that don’t match the above criteria? For example, if I manually set a write relay that has a broken NIP-11, will Gossip respect my choice, or skip the relay I configured myself when I publish my own events?

Either way, despite your comment below about not wanting shared blacklists, it feels like every client going through this process to compute lists of dead relays is a lot of duplicated work and wasted effort. I understand the value of having stricter criteria for suggesting default relays to new npubs and potential fallbacks, but I wouldn’t expect every client (or relays needing to calculate WoT and other advanced functionality) to handle all of this aggregate data collection and processing.

IMO a shared blacklist (optional for clients to use, and that they are of course free to verify) seems like a more pragmatic solution.

I mean for picking your own relay. I'm not judging other people's choices, but if their relays don't take my reply then that's their problem.

Apologies again, but just to remove any ambiguity: when you say "for picking your own relay", what exactly does that mean? Is this something only for you (Mike) to filter writable relays? How is Gossip actually using the resulting list of vetted relays for its users, if at all?

For example, if I have a brand new, non-NIP-11-compliant relay with a non-standard path tagged in my Kind 10002:

```
["r", "wss://brando-relay.com/inbox", "read"],
["r", "wss://brando-relay.com", "write"]
```

This is the first time that Gossip encounters brando relay. Will it:

1. Write my own kind 1 notes to wss://brando-relay.com/?

2. Write other users’ notes tagging me to wss://brando-relay.com/inbox?

Users must pick relays. But users don't know "by which criteria" they should pick them.

So I scoured >1000 relays and picked about 25 open relays that don't require AUTH or payment and so they work for inbox and outbox for new people. When you use gossip for the first time and set up an account, these are the relays it lists for you to pick from. I am not associated with any of them, they just passed the tests.

ALSO, inside gossip's relay panel for any particular relay, you can press "TEST" and it will do some tests and then let you know if the relay works as an inbox or outbox for you.... so that if it doesn't, you'll know to not add it to your relays list.

There has been a push among the most prominent nostr developers to make more and more custom kinds of relays. Most of those don't work as inbox/outbox relays, so we are in this situation where regular users are required to choose relays for their kind-10002 relay list, but they have no idea how to do it, and even if they knew how, they don't have the tools or information necessary to make good choices.

Got it, thanks. So, back to my questions above (just to be 100% sure that we are on the same page):

1. The selection of relays that meet your criteria above is only used as the default selection for new npubs, correct?

2. If I, as a new user, manually set my read and write relays, Gossip will try to connect to the relays I configured regardless of the criteria above (as per questions 1 and 2 of my previous message), correct?

3. And if I add a dead relay, will it still try to connect to it?

4. After it fails to connect to a manually configured relay once, will it ever try to reconnect? For example, what happens if the relay above goes offline for a couple of hours and then comes back?

1. yes, 2. yes, 3. yes, 4. yes.

Thanks again. So, if it’s all yes, then in practice Gossip is also trying to connect to broken relays, right?

What sort of timeout settings and retry logic does Gossip currently use? nostr:nprofile1qqsyvrp9u6p0mfur9dfdru3d853tx9mdjuhkphxuxgfwmryja7zsvhqpz9mhxue69uhkummnw3ezuamfdejj7qgswaehxw309ahx7um5wghx6mmd9uq3wamnwvaz7tmkd96x7u3wdehhxarjxyhxxmmd9ukfdvuv also mentioned that broken relays aren’t as much of a problem in Amethyst as they are in Haven. I get the feeling that each of us may be handling this differently.

Haven tests relays during startup. It also uses a Penalty Box: if a relay fails to connect, it retries after 30 seconds, and then exponential backoff kicks in. Out of curiosity I put a counter on it, and with my current implementation this meant ~110k failed connection attempts in the first 24 hours after restarting the relay (though as exponential backoff kicks in, the number of attempts drops substantially). Of course, I could make the "broken relay detection" converge faster by tweaking the exponential backoff, but that also means that the next time one of the popular relays goes offline, Haven will quickly give up on it.

This is where a good blacklist could be useful. You can do this without fully trusting the list as well. I.e., I could use the Nosotros list above, still try to connect to all relays during initialisation, but if the connection fails and the relay is on the blacklist, set a much higher time until reconnect for the exponential backoff algorithm (e.g. 1 hour or even 24 hours). So if the list is wrong or one of the dead relays gets "resurrected", Haven will eventually connect to it, but it makes the whole process much cheaper.
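
Concretely, I mean something like the sketch below: the blacklist only changes the starting delay, everything else is the same backoff. This isn't Haven's actual code, and the numbers are illustrative:

```go
// Sketch of a penalty box whose backoff starts much higher for relays that
// appear on a shared dead-relay list.
package penaltybox

import (
	"math/rand"
	"time"
)

type entry struct {
	failures  int
	nextRetry time.Time
}

// PenaltyBox tracks per-relay retry times; blacklisted relays start with a
// much longer delay but are never excluded forever.
type PenaltyBox struct {
	entries   map[string]*entry
	blacklist map[string]bool // e.g. loaded from a shared dead-relay list
}

func New(blacklist map[string]bool) *PenaltyBox {
	return &PenaltyBox{entries: make(map[string]*entry), blacklist: blacklist}
}

// RecordFailure schedules the next allowed retry after a failed connection.
func (pb *PenaltyBox) RecordFailure(url string) {
	e := pb.entries[url]
	if e == nil {
		e = &entry{}
		pb.entries[url] = e
	}
	e.failures++

	base := 30 * time.Second // normal case: retry after 30s, then back off
	if pb.blacklist[url] {
		base = time.Hour // reportedly dead: start much higher
	}

	// Exponential backoff with a capped exponent, a 24h ceiling and jitter.
	shift := e.failures - 1
	if shift > 6 {
		shift = 6
	}
	delay := base * time.Duration(1<<uint(shift))
	if delay > 24*time.Hour {
		delay = 24 * time.Hour
	}
	delay += time.Duration(rand.Int63n(int64(delay / 10)))
	e.nextRetry = time.Now().Add(delay)
}

// ShouldTry reports whether it is time to attempt this relay again.
func (pb *PenaltyBox) ShouldTry(url string) bool {
	e := pb.entries[url]
	return e == nil || time.Now().After(e.nextRetry)
}
```

The point is that the list never fully decides anything: a wrong entry only costs a slower first reconnect.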

For most connections I use a 30 second timeout. For advertising relay lists I use a 5 second timeout.

The Gossip client tracks jobs; unfinished jobs with benign errors will try again after a timeout, but there is a long list of error conditions which change the behavior.

For a few random examples, if the relay returned status 4000, we exclude connections to that relay for 600 seconds. If the connection timed out, we exclude it for 60 seconds. If we got a NOT_FOUND, 600 seconds. If we got a websocket error, 15 seconds. Some of them are probably poorly chosen numbers and could use review.

We then randomize the exclusion value a bit so that retries to multiple relays don't all line up together. For jobs that are not marked "persistent" we don't bother trying again. From my reading of the code right now it looks like if a relay is excluded and a new job wants to talk to it, an error is thrown rather than queueing up that job to happen after the exclusion, which might not be ideal.
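
Translated out of Gossip's Rust into a rough Go sketch of the same idea (the durations are the ones mentioned above; the error classification is simplified):

```go
// Sketch of per-error exclusion windows with jitter, mirroring the numbers
// above; this is just the shape of the idea, not Gossip's actual code.
package exclusion

import (
	"math/rand"
	"time"
)

// FailureKind is a coarse classification of why a relay connection failed.
type FailureKind int

const (
	Status4000 FailureKind = iota
	ConnectTimeout
	NotFound
	WebsocketError
)

// exclusionFor returns how long to avoid a relay after a given failure,
// with roughly ±10% jitter so retries to many relays don't all line up.
func exclusionFor(kind FailureKind) time.Duration {
	var base time.Duration
	switch kind {
	case Status4000, NotFound:
		base = 600 * time.Second
	case ConnectTimeout:
		base = 60 * time.Second
	case WebsocketError:
		base = 15 * time.Second
	}
	jitter := time.Duration(rand.Int63n(int64(base/5))) - base/10
	return base + jitter
}
```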

Relays marked rank=0 are never connected to, so users can disable relays that don't work, but this relies on the user, who usually isn't going to notice these details.

Also, I never assume a relay is broken.... just that it isn't working right now. So it never stops trying. But exclusions of 600 seconds are long enough that this doesn't matter much.

Sometime in the past I coded a way for users to delete (knowledge of) relays that they knew were "dead". The code didn't work and after some debugging I realized that it actually did work, but the relay got recreated because data in my local database referred to it, and any reference to a relay URL makes sure a relay record exists for it and starts collecting statistics on it. So it just gets deleted and recreated by the other logic. Even if I deleted the events in the local database, new events on nostr would recreate it. I would have to mark it with a tombstone saying to never recreate it... which seems wrong because the DNS name might one day be used for a relay again.

I also pick relays judiciously. When I need to read from someone's set of outboxes (or inboxes), I choose the relays that are most likely to be working.
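
As a hypothetical sketch of that selection step (not Gossip's actual scoring, just the kind of statistics it tracks):

```go
// Hypothetical sketch of picking the relays most likely to be working from
// someone's outbox list, based on tracked statistics.
package outboxpick

import (
	"sort"
	"time"
)

type RelayStats struct {
	URL           string
	Attempts      int
	Successes     int
	LastConnected time.Time
}

// pickBest returns up to n relay URLs, preferring high success rates and
// relays that were reachable recently.
func pickBest(candidates []RelayStats, n int) []string {
	score := func(s RelayStats) float64 {
		if s.Attempts == 0 {
			return 0.5 // never tried: neither trusted nor written off
		}
		rate := float64(s.Successes) / float64(s.Attempts)
		// Decay the score for relays we haven't reached in a while.
		daysAgo := time.Since(s.LastConnected).Hours() / 24
		return rate / (1 + daysAgo/30)
	}

	sort.Slice(candidates, func(i, j int) bool {
		return score(candidates[i]) > score(candidates[j])
	})

	var picked []string
	for _, c := range candidates {
		if len(picked) == n {
			break
		}
		picked = append(picked, c.URL)
	}
	return picked
}
```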

The console log shows all kinds of relay errors and complaints. I just ignore most of it.

Thanks, Mike. That was all very helpful.

The more I read and think about it, the less I believe there’s any obvious solution to distributed systems problems, especially on Nostr, where infrastructure is mostly provided on a best-effort basis.

IMV, what you’re describing is the typical outcome of a "simple protocol" meeting the realities of distributed systems. These are the usual complexities that enterprise-grade distributed systems have to deal with: customisable timeouts and retry policies, backoff with jitter, throttling, debouncing, smart adaptive routing, and ultimately embracing the mess of unreliable relays as part of the design. Add aggregators, Blastr relays ("Broadcaster"), and so on, and it starts looking a lot like the typical stuff I do for a living.

Of course, we can always make incremental improvements (e.g. relays that have been dead for over a year are unlikely to come back online), but ultimately what you’re saying makes sense. What I dislike about this situation is that every developer will have to reinvent the wheel over and over again. Truth be told, this is very difficult to get right, and it will only get harder as the network grows and more relays are decommissioned.

My take is that we need serious work on SDKs and core libraries for Nostr to scale. For example, go-nostr, which is one of the better libraries out there, implements Penalty Box with naive backoff (no jitter, no custom policies, no parameterisation of timeouts, etc.). It helps, but it’s not enough on its own to sanely scale the Outbox model. Several SDKs and libs out there provide nothing like this out of the box.

In an ideal world, there should be a one-liner to apply default policies to the connection pool so that it is both more robust and less wasteful. This should be available in all major SDKs and libraries for all programming languages. If Nostr wants the Outbox model to become a standard, and for clients to eventually deal with thousands or even tens of thousands of relays in the future, it’s just not realistic to expect every single client (and specialised relay) to achieve this level of robustness on its own.
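
To make the "one-liner" idea concrete, here is a purely hypothetical sketch; none of these types or options exist in go-nostr or any other SDK today:

```go
// Purely hypothetical sketch of "default policies in one line".
package poolpolicy

import "time"

// RetryPolicy bundles the knobs every client ends up re-implementing.
type RetryPolicy struct {
	ConnectTimeout time.Duration
	InitialBackoff time.Duration
	MaxBackoff     time.Duration
	Jitter         float64  // fraction of each delay, e.g. 0.2 for ±20%
	DeadRelayList  []string // optional shared blacklist: start backoff higher
}

// DefaultRetryPolicy is what most clients could start from without having
// to think about any of this.
func DefaultRetryPolicy() RetryPolicy {
	return RetryPolicy{
		ConnectTimeout: 5 * time.Second,
		InitialBackoff: 30 * time.Second,
		MaxBackoff:     24 * time.Hour,
		Jitter:         0.2,
	}
}

// The imagined one-liner a client would write:
//
//	pool := somesdk.NewPool(ctx, somesdk.WithRetryPolicy(DefaultRetryPolicy()))
```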

I’ve seen very well-funded projects take years to reach this level of maturity, usually with the help of a strong "core" team (people like Fiatjaf, you, some experienced SREs) providing a Golden Path with robust tooling, architectural guidance, and, importantly, proper funding and time.

I was having a discussion yesterday about Nostr and OSS governance with nostr:nprofile1qqsrhuxx8l9ex335q7he0f09aej04zpazpl0ne2cgukyawd24mayt8gprfmhxue69uhhq7tjv9kkjepwve5kzar2v9nzucm0d5hszxmhwden5te0wfjkccte9emk2um5v4exucn5vvhxxmmd9us2xuyp. This is going to be an interesting challenge for Nostr’s decentralised governance model for sure: how do we get people to come together and work on the less appreciated but core tasks like this?

I think we have been exploring and learning for a while, and some kind of useful guidance for those who don't want to spend years exploring and learning sounds like a good idea. Starting from the principle of defending yourself: don't trust, don't rely on the unreliable, program for the worst case, including active attack.