Thanks, Mike. That was all very helpful.
The more I read and think about it, the less I believe there’s any obvious solution to distributed systems problems, especially on Nostr, where infrastructure is mostly provided on a best-effort basis.
IMV, what you’re describing is the typical outcome of a “simple protocol” meeting the realities of distributed systems. These are the usual complexities that enterprise-grade distributed systems have to deal with: customisable timeouts and retry policies, backoff with jitter, throttling, debouncing, smart adaptive routing, and ultimately embracing the mess of unreliable relays as part of the design. Add aggregators, Blastr relays ("Broadcaster"), and so on, and it starts looking a lot like the typical stuff I do for a living.
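To make "backoff with jitter" concrete, here is a minimal sketch of the "full jitter" variant in Python. The function name, parameters, and default values are my own illustration, not taken from any Nostr library.

```python
import random


def backoff_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Return a randomised delay in seconds for a zero-based retry attempt.

    Hypothetical helper: the exponential ceiling doubles with every attempt
    up to `cap`, and the actual delay is drawn uniformly from [0, ceiling]
    so that many clients do not retry a relay in lockstep.
    """
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(attempt, 32)
    ceiling = min(cap, base * (2 ** exponent))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, round(backoff_delay(attempt), 2))
```

Without the random draw, every client that lost the same relay at the same moment would come back at the same moment too, which is exactly the thundering herd that jitter is meant to avoid.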
Of course, we can always make incremental improvements (e.g. relays that have been dead for over a year are unlikely to come back online), but ultimately what you’re saying makes sense. What I dislike about this situation is that every developer will have to reinvent the wheel over and over again. Truth be told, this is very difficult to get right, and it will only get harder as the network grows and more relays are decommissioned.
My take is that we need serious work on SDKs and core libraries for Nostr to scale. For example, go-nostr, which is one of the better libraries out there, implements a Penalty Box with naive backoff (no jitter, no custom policies, no parameterisation of timeouts, etc.). It helps, but it’s not enough on its own to sanely scale the Outbox model. Several other SDKs and libs provide nothing like this out of the box.
In an ideal world, there should be a one-liner to apply default policies to the connection pool so that it is both more robust and less wasteful. This should be available in all major SDKs and libraries for all programming languages. If Nostr wants the Outbox model to become a standard, and for clients to eventually deal with thousands or even tens of thousands of relays, it’s just not realistic to expect every single client (and specialised relay) to achieve this level of robustness on its own.
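As a rough illustration of what such defaults could look like, here is a Python sketch of a parameterised policy plus a penalty box that uses it. Every name here (`RelayPolicy`, `PenaltyBox`, the fields and their default values) is hypothetical and does not mirror go-nostr or any existing SDK.

```python
import random
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class RelayPolicy:
    """Hypothetical knobs an SDK could expose, with assumed defaults."""
    connect_timeout: float = 5.0   # seconds; shown only as a tunable, unused below
    base_delay: float = 1.0        # first backoff ceiling, in seconds
    max_delay: float = 600.0       # upper bound for any backoff ceiling
    give_up_after: int = 10        # consecutive failures before a relay is skipped


class PenaltyBox:
    """Tracks failing relays and decides when they may be retried."""

    def __init__(self, policy=RelayPolicy(), clock=time.monotonic, rng=random):
        self.policy = policy
        self.clock = clock          # injectable so the logic can be tested
        self.rng = rng
        self._failures = {}         # relay URL -> consecutive failure count
        self._retry_at = {}         # relay URL -> earliest retry time

    def record_failure(self, url):
        count = self._failures.get(url, 0) + 1
        self._failures[url] = count
        # Exponential ceiling with full jitter, capped by the policy.
        ceiling = min(self.policy.max_delay,
                      self.policy.base_delay * 2 ** min(count - 1, 32))
        self._retry_at[url] = self.clock() + self.rng.uniform(0.0, ceiling)

    def record_success(self, url):
        # A successful connection clears the relay's record entirely.
        self._failures.pop(url, None)
        self._retry_at.pop(url, None)

    def may_connect(self, url):
        if self._failures.get(url, 0) >= self.policy.give_up_after:
            return False
        retry_at = self._retry_at.get(url)
        return retry_at is None or self.clock() >= retry_at
```

With something like this in a library, the one-liner would just be `box = PenaltyBox()`, and a client with different needs could pass its own `RelayPolicy` instead of reimplementing the bookkeeping.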
I’ve seen very well-funded projects take years to reach this level of maturity, usually with the help of a strong “core” team (people like Fiatjaf, you, some experienced SREs) providing a Golden Path with robust tooling, architectural guidance, and, importantly, proper funding and time.
I was having a discussion yesterday about Nostr and OSS governance with nostr:nprofile1qqsrhuxx8l9ex335q7he0f09aej04zpazpl0ne2cgukyawd24mayt8gprfmhxue69uhhq7tjv9kkjepwve5kzar2v9nzucm0d5hszxmhwden5te0wfjkccte9emk2um5v4exucn5vvhxxmmd9us2xuyp. This is going to be an interesting challenge for Nostr’s decentralised governance model for sure: how do we get people to come together and work on the less appreciated but core tasks like this?