Oh, yeah, negation would be great. Send me these types of things, but not these specific ones, as I already have those.

Darn. Would be so nice. Let's build that.

Fucking hell, that would be so nice. ROFLMAO

Also, tombstones. Why does everyone store the deletion requests? Just don't save those eventIDs again.

whole relays 80% deletion requests OMG LOL

yeah, realy has tombstones... and yeah, it really should not store the deletion requests themselves, but only push them out to relays that are subscribed to them (which would be driven by the REQs of the replicas)

tombstones do eventually need to be cleaned up though. the tombstones in realy have a timestamp on them, i had it in mind eventually to make a GC to clear them out when they get too numerous and prune out the oldest ones that are unlikely to appear again (let's say, after 3 months or something)

Could probably go for at least a year, without clearing, on a local instance. Wouldn't get enough traffic to have a gigantic list.

Should probably go by list-size, rather than date. Has more to do with how hard it is to filter against the list.

idk how it works with other databases but with badger you can use these "batch" streaming functions that automatically run with as many threads as you specify. a mark-and-sweep style GC pass on 18 GB takes about 8 seconds on my machine, probably faster on current-gen NVMe and DDR5 memory
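to make it concrete, here's a minimal single-threaded sketch of that kind of pruning pass (the prefix byte and storing the timestamp as the value are assumptions for illustration, not realy's actual layout, and a real pass would run across threads with the streaming machinery):

```go
package relay

import (
	"encoding/binary"
	"time"

	badger "github.com/dgraph-io/badger/v4"
)

// hypothetical prefix byte for tombstone keys; the real prefix may differ.
const tombstonePrefix = byte(0x07)

// PruneTombstones deletes tombstone entries whose stored timestamp is older
// than maxAge. Single-threaded sketch; a production pass would use badger's
// multi-threaded streaming instead of one iterator.
func PruneTombstones(db *badger.DB, maxAge time.Duration) error {
	cutoff := time.Now().Add(-maxAge).Unix()
	wb := db.NewWriteBatch()
	defer wb.Cancel()

	prefix := []byte{tombstonePrefix}
	err := db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.Prefix = prefix
		it := txn.NewIterator(opts)
		defer it.Close()
		for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
			item := it.Item()
			var ts int64
			if err := item.Value(func(v []byte) error {
				if len(v) >= 8 {
					ts = int64(binary.BigEndian.Uint64(v))
				}
				return nil
			}); err != nil {
				return err
			}
			if ts < cutoff {
				if err := wb.Delete(item.KeyCopy(nil)); err != nil {
					return err
				}
			}
		}
		return nil
	})
	if err != nil {
		return err
	}
	return wb.Flush()
}
```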

the GC can also do multiple types of collection at the same time, so you could set it to prune stuff based on access counters and first-seen timestamps that you keep, as well as snuffing out old tombstones

ah, just to explain how you do things with badger, because it differs from most other key/value stores due to the separation of key and value tables...

because writing values doesn't force any writes on the keys, the keys stay in order a lot more; generally, once compacted, forever compacted (compaction is playing the log out to push it into an easily iterated, pre-sorted array)

as a result, the best strategy with badger for storing any kind of information that won't change and needs to be scanned a lot is, very often, to put the values in the keys; that suits immutable stuff such as tombstones

it's also used for searching, as you would expect, but this is the reason why, when you use badger (properly) to write a database, it's so much faster: it doesn't have to skip past the values when it's scanning, and you don't have to re-compact the keys when you change values. (and yes, it of course has versioning of keys; i don't use this feature, but in theory there is often some number of past versions of a value that can be accessed with a special accessor, and more generally it makes the store more resilient, as you would expect)

so, yeah, the current arrangement for tombstones in realy is that the first (left, most significant) part of the event ID hash is the key. finding it is thus simple and fast: just trim off the last half, prefix with the tombstone key prefix, and you can even just use the "get" function on the transaction instead of making a whole iterator, so it's very neat, and very fast.
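roughly like this in go (a sketch only, reusing the hypothetical tombstonePrefix constant from the GC sketch above; realy's actual constants and helpers differ):

```go
package relay

import badger "github.com/dgraph-io/badger/v4"

// IsTombstoned reports whether an event has a tombstone, keyed by the
// tombstone prefix plus the first (most significant) half of the 32-byte ID.
func IsTombstoned(db *badger.DB, evID [32]byte) (bool, error) {
	key := make([]byte, 0, 1+16)
	key = append(key, tombstonePrefix) // hypothetical prefix, see sketch above
	key = append(key, evID[:16]...)    // trim off the last half of the ID
	found := false
	err := db.View(func(txn *badger.Txn) error {
		_, err := txn.Get(key) // plain Get, no iterator needed
		if err == badger.ErrKeyNotFound {
			return nil
		}
		if err != nil {
			return err
		}
		found = true
		return nil
	})
	return found, err
}
```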

i also exploit these properties of badger key tables with the "return only the ID" functions by creating an index that contains the whole ID after the event's serial number, which means the event itself doesn't have to be decoded for this case, which is a huge performance optimization as well.

yes, that full ID index also contains a truncated hash of the pubkey plus the kind number and timestamp, so you can just pull all of the relevant keys for the result serials found, filter out pubkeys and kinds, slice by range (if the index search didn't already do this), sort them in ascending or descending order of timestamp, and then just return the event IDs in that order.
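as a sketch, the index key packs roughly these fields (field order, widths and the prefix byte here are illustrative, not the exact realy layout):

```go
package relay

import "encoding/binary"

// hypothetical prefix byte for the serial -> full ID index.
const fullIDIndexPrefix = byte(0x0f)

// EncodeFullIDKey packs the event's serial, full 32-byte ID, a truncated
// pubkey hash, the kind, and the created_at timestamp into one index key,
// so IDs-only queries never have to decode the event itself.
func EncodeFullIDKey(serial uint64, id [32]byte, pubkeyHash [8]byte, kind uint16, createdAt int64) []byte {
	k := make([]byte, 0, 1+8+32+8+2+8)
	k = append(k, fullIDIndexPrefix)
	k = binary.BigEndian.AppendUint64(k, serial)
	k = append(k, id[:]...)
	k = append(k, pubkeyHash[:]...)
	k = binary.BigEndian.AppendUint16(k, kind)
	k = binary.BigEndian.AppendUint64(k, uint64(createdAt))
	return k
}
```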

it's a much faster request to process, which means that once the client has this list, it can pull the events for its initial display with a single IDs query, plus a few extra for headroom, and then pull the rest as the display requires them, lazy-loading style.

this is the key reason why i make this index and i designed it so that it is as svelte and sleek as possible for both bandwidth and rendering efficiency

none of this is possible if you don't use a badger key/value store, either. and this is also why they built badger: its main purpose for existing was to serve as the storage engine for a graph database, which is very read-heavy. you know database stuff, so you know that joining tables is the biggest time cost, and a graph database is basically a database that does massive table joins to pull results.

I'm just gonna feed what you just wrote into my AI and tell it to do some magic and see what it suggests.

yeah, storage is cheap... a year would be definitely sufficient to ensure it isn't stored again

The NFDB API will support this

Even request pipelining, because no one has the time to wait for multiple round trips.

Your stuff is mostly the only remote relays I'm bothering with, TBH. Everything else is slow as fuck or full of irrelevant garbage and life is too short to cater to cheapskates and spammy relays, fr. Sick of it.

People can log in and then it adds their mailboxes and that. is. enough.

Totally trimming everything to our own stuff and I got zero fucks to give, this shit gonna be lit. Fuck em. Rage coding FTW.

🔥

Relays like strfry, khatru, realy are great until they are not.

They work great for caching locally but when you get to scale it implodes.

You want to store 1TB of books. You have to rent a single server that can store 1TB.

But what if it goes down? So you buy a few more replicas. Then you try to shard events across servers and fail.

Have fun compacting the database or upgrading it every once in a while.

NFDB fixes this. Just like SQLite is great for small scale, Postgres is better for larger scale.

I'm using them for local caching. Doesn't have to be the only local relay. I currently have 4 local relays running on my laptop and Citrine on mobile. I do a lot of testing, but still. My new kind 10432, which lists all of the localhost relays, means that you can have as many as you want, to do what you want.

I don't think it's appropriate to store everything on the local system; that's actually a risky data strategy. It's for making sure you can transition between online and offline and autosync when you're in a good network. Like OpenDrive and Sharepoint do, but not retarded and dirt-cheap.

I think it's safe to assume that someone handling large or important data stores will have the sense to hire a professional admin or be an admin, themselves, but that's what we have you and nostr:npub10npj3gydmv40m70ehemmal6vsdyfl7tewgvz043g54p0x23y0s8qzztl5h for. Above my pay grade and not my problem.

Good.

For an in-browser use case, use the SQLite in-browser relay I suggested too. You at least have a cache that is better than nothing, compared to no cache until they set up realy or something else.

I'm using indexeddb, as a mandatory cache, since it works on phones. Isn't that one you suggested something that has to be natively installed and _doesn't_ run in the browser? Or did I check the wrong link?

No it’s a web worker that uses OPFS and wasm to run a relay

Hmm... I'll look. What was the link, again?

the whole collecting IDs and comparing before downloading events and then just downloading what is missing, that's what negentropy does, that's why i think it's neat having it built into the protocol...

as for uptime and redundancy, this always comes at at least double the cost. obviously it will take a super long time to compact a 1 TB db, possibly on the order of days... but you can run a replica, and still do a zero-downtime failover once it is complete, as long as you have enough disk space.

i've been spec'ing out some server tiers that could handle it, while also keeping cost in mind. i think having as low a server cost as possible is really important for nostr businesses.

i also like that clients have the distributed mindset here; it should help with uptime by decreasing the odds of both relays experiencing unexpected downtime at the same time.

badger is better because it has split key/value tables. a lot less time is wasted compacting every time values are written, and it's easier and faster to use the key table to store some of the data that has to be scanned a lot.

for whatever stupid reason, nobody else in database development has realised the benefit of the key/value table splitting, even though the tech has been around for 9 years already.

probably similar reasons why so many businesses are stuck with oracle

at least some sane people realized, tbh

FoundationDB’s new Redwood engine, the underlying architecture of S3, and NFDB’s IA store as well

that's great. i think badger is cuter and more mature tho

I’ll have a REST API if you want. You can ask it to only send IDs or the full events.

Want to get the events by ID? Just use an ids filter on the same endpoint, no one has a use for 2 endpoints doing 1 thing.
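Roughly like this, say (the /query path and filter parameter names aren't settled, treat them as placeholders for this sketch):

```go
package client

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// FetchByIDs hits the same query endpoint with an ids filter instead of a
// separate "get event" endpoint. Path and parameter name are placeholders.
func FetchByIDs(base string, ids []string) (*http.Response, error) {
	filter, err := json.Marshal(map[string]any{"ids": ids})
	if err != nil {
		return nil, err
	}
	u := fmt.Sprintf("%s/query?filter=%s", base, url.QueryEscape(string(filter)))
	return http.Get(u)
}
```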

But no SSE. Browsers put a limit on the number of parallel HTTP requests, and SSE connections may exceed that, as they stay open for a long time.

Also, NFDB will support cursors on all interfaces so you can paginate without the pain. Forget created_at based pagination, which can be very unreliable.

If you want to know how to do it: calculate the lowest created_at in each relay's page, and use the highest of those as the next cursor. Otherwise a relay with a very old event will say “nope, nothing left”.
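Roughly (names and shapes here are just a sketch, not NFDB internals):

```go
package pagination

// NextUntil implements the rule above: take the lowest created_at in each
// relay's page, then use the highest of those lows as the next `until`.
// Relays that returned nothing don't move the cursor.
func NextUntil(pagesByRelay map[string][]int64) (next int64, ok bool) {
	for _, createdAts := range pagesByRelay {
		if len(createdAts) == 0 {
			continue
		}
		low := createdAts[0]
		for _, ts := range createdAts[1:] {
			if ts < low {
				low = ts
			}
		}
		if !ok || low > next {
			next, ok = low, true
		}
	}
	return next, ok
}
```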

Okay, that gives us two sane connections to tap, one local, one external. Good.

Feeling cute. Might fix Nostr.

yeah, i think, as always, hybrid is generally better than dedicated. smaller, simpler parts that are built with clean, simple interfaces are much easier to scale per use case

yeah, i've been thinking about how to do SSE properly. it seems to me the right way is to open one stream to handle all subscriptions: the first time the client wants a subscription open, the stream is opened, everything goes through that, and the event format (as in the SSE event) includes the subscription id it relates to.

this avoids the problem of the limits, which i think are typically about 8. but even 8 is plenty; it's just not necessary, because you can multiplex them by using subscription IDs just like the websocket API does. it also simplifies keeping the subscription channel open, and it allows arbitrary other kinds of notifications to get pushed as well, ones we haven't thought of yet, aside from subscriptions to event queries based on newly arrived events.

why do i think SSE instead of a websocket? because it's less complex, basically just an HTTP body that is slowly written to. this pushes everything down to the TCP layer instead of adding a secondary websocket ping/pong layer. the client also knows when the SSE has disconnected and can start it up again, and the subscription processing on the relay side should keep a queue of events, so when a subscription SSE dies it can push the events that came in since the last one sent to the client (which also means there needs to be a top-level subscription connection identifier; an IP address is probably not always going to work for multiple users behind one NAT).
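a rough sketch of that single multiplexed SSE stream on the relay side (the channel and message shape are placeholders, not actual code from realy; the SSE id: field is what would let a reconnecting client resume via Last-Event-ID):

```go
package relay

import (
	"fmt"
	"net/http"
)

// subMsg is a placeholder for whatever the relay's subscription multiplexer
// emits: a monotonic sequence number, the nostr subscription id, and the
// serialized event.
type subMsg struct {
	Seq   uint64
	SubID string
	JSON  []byte
}

// subscribeHandler streams every subscription for one client over a single
// SSE response, tagging each message with its subscription id.
func subscribeHandler(events <-chan subMsg) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "text/event-stream")
		w.Header().Set("Cache-Control", "no-cache")

		for {
			select {
			case <-r.Context().Done():
				return // client went away; it can reconnect and resume
			case ev := <-events:
				fmt.Fprintf(w, "id: %d\n", ev.Seq)      // resume point for Last-Event-ID
				fmt.Fprintf(w, "event: %s\n", ev.SubID) // which subscription this belongs to
				fmt.Fprintf(w, "data: %s\n\n", ev.JSON)
				flusher.Flush()
			}
		}
	}
}
```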

also, just keep in mind that the websocket API by default creates a subscription management thing for every single socket that is opened, whereas if you do the queries primarily by http requests, this is slimmed down to a single subscription multiplexer, which will make it more memory efficient as well.

i don't think there is enough of a clear benefit in using websockets for everything; their only real utility is for highly interactive connections. compared to multiplexing them at the application layer, doing one-shot requests most of the time plus one subscription stream for pushing data back is a huge reduction in the data required for managing a single client connection

Yeah, been researching and Twitter and ChatGPT actually use SSE, not websockets. They're streaming out, not in.

You only need one-way streaming, as the client knows when it's writing. There is normally no use case on Nostr for two-way streaming, in fact, since we don't really have live chats.

my primary concern comes from the number of parallel streams. WS excels at this, as it is one connection according to the browser and in the underlying stack.

In Nostr you can have many active subscriptions for new events.

Any non-subscription queries go straight to HTTP though. Helps a ton with load balancing and it means a software upgrade doesn’t interrupt queries.

Under the hood it’s all REST between internal NFDB services.

Yeah, but we don't actually need that, as most relays are now aggregators.

haha, yeah... this is the flaw with the bigger-is-better strategy when you can get a lot of mileage out of horizontal replication, especially when the database has GC to contain disk usage and the concomitant search cost of a larger database

You can, but we're going Big Data and need something more enterprise-level.

We're going after Sharepoint and Oracle, Kafka, and the Internet Archive. Need the really big guns.

Or when you manage thousands of customers. The NFDB relay is optimized for thousands of small ones as well.

It has a medium fixed cost, but the marginal costs are lower.

Can it handle a nation-state-sized customer? Asking for a friend.

Probably. If it doesn't, it can easily be upgraded to do that.

There are some intentional design choices made by NFDB that could be changed for larger scale, at the cost of requiring more time to implement.

But at large scale you have the resources to do that.

So the answer is yes.

Good. Don't need that, yet, but maybe think through the design choices, so that you could explain the possibilities to someone.

There is already a scaling plan.

I’m not doing it yet because there is a chance no one will end up getting to that scale, or it will prove insufficient. There are constant architectural changes happening to NFDB because of new information from deploying it.

Doing it when the time comes, if ever, is more effective.

Yeah, but it's important to have that possibility. If they ask, I can say, Oh, we can do that...

You could scale even more if we could constrain queries.

Say you are developing Alexandria and want to access wiki pages. But all you do is look up wiki pages by their d tag.

You don’t need anything else. People mostly search for articles by content, and rarely by things like author.

You don’t need Nostr queries for that, you can use the semantic search engine.

Congrats this will allow scaling 10x more with no changes, I am not kidding.

yeah, the shitty filter query thing is shitty

Well, we look up by naddr and d-tag, but yeah.

having indexes for those makes the searches pretty fast

"Interesting perspective! 🤔 It's always a balance between design choices and scalability. Sometimes the best innovations come from unexpected paths. Excited to see how it all unfolds! #Innovation #DesignThinking"

decentralization and small business/orgs need the small stuff, but big business needs bigger stuff. the approaches are complementary. small, simple relays that are easy to set up and manage have a lower marginal cost in HR for a small user, and for the overall picture they give a more resilient network and more decentralization/censorship resistance (plus a smaller attack surface: taking down many small targets is harder, while big systems are easier to do more damage to).

the way it's played out over the history of internet services has been very clear. the more you centralize, the more brittle the system becomes.

We're trying to offer a system that has both, and stores the information people need most often closer to the individual person's computer, with the information getting more and more concentrated the closer you get to the central servers.

That means that all data is still available on SOME RELAY SOMEWHERE IN THE SYSTEM, even if the middle drops out, and people wouldn't notice the central server going down for a while, as data could be repopulated from the outside-in.

If you look at it, for any system, large or small, the probability of failure relative to scale becomes lower as the system gets larger.

Small-scale systems individually seem more reliable, but if you put 500 customers on 500 separate systems instead of 1 system, the 500 systems will have a higher overall failure rate.
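A quick back-of-the-envelope (the 0.1% per-system downtime figure is just an assumed example):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	p := 0.001 // assumed chance that any one system is down at a given moment
	n := 500.0 // number of independent small systems

	// One big system: a customer sees downtime with probability p (~0.1%).
	// 500 small systems: each customer still only sees p, but the chance that
	// *something* in the fleet is down right now is far higher.
	someDown := 1 - math.Pow(1-p, n)
	fmt.Printf("one big system down: %.1f%%\n", p*100)
	fmt.Printf("at least one of 500 small systems down: %.1f%%\n", someDown*100) // ~39%
}
```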

The probability of experiencing downtime as an individual with a large system is not much different than a small system. But if a big system fails, more people notice, and it feels bigger.

With small systems, the frustration is spread out over a large time period, and so it feels like it never happens.

The distributed systems world has figured this out ages ago and this is why there is fault isolation, so that failures are contained, and become another ignored blip.

Yes, but Nostr takes it a step further with the signed, atomic events. Adds another layer of possibility.

What a local relay does is allow you to work offline and create a local hub. I'm the test case, as I commute on a 2.5-hour train ride through internet black holes, so I need an app that automatically syncs when the Internet "turns on" and then switches back to local-only when we go through a tunnel or something.

Also, just fuck the constant wss streaming. Nostrudel or Amethyst are polling over 100 relays simultaneously, and Primal connects so horribly that my browser crashes. GBs and GBs of text messages, every damn day. My mobile plan is empty within a week, and then I've got 3 weeks of snail-mail Internet connections. Great.

AND THE BATTERY USAGE OMG

Nostr is an AP system. And for many things, who the fuck cares? Nostr is not meant to handle financial TXs or other OLTP workloads anyway; if you want that, go use a database.

Nostr is a communications protocol, so you always have a database. The data has to be parked, someplace, before it can be relayed or fetched, after all.

This is about the efficiency of moving the information around.

even WITH chats it still makes no sense to add all that extra complexity

if it was streaming audio/video, different story. but then you would use RTP instead anyway.

I don't think we need SSE from external sources, but it makes hella-sense from a local relay to a local client. The relay can be syncing/polling/streaming from other relays in the background, after all.

Those two connections can be different protocols.