Replying to 21823843...

In strfry there's currently no way to do read-time filtering, but I'm going to put some thought into how to do this flexibly (see my sibling reply). Your ruleset description will be quite helpful as a solid use-case - thanks!

Also, your new relay design sounds very cool - I'm looking forward to seeing it.

About using a parallel mmap'ed file to store the event data, I have some experience with this approach and frankly I would not do that except as a last resort.

The biggest downside IMO is that you won't be guaranteeing transactional consistency between the event data and the indices. For instance, what if a write update to the sled DB succeeds, but writing the event to the mmap fails? This could happen when you're in low-diskspace conditions, or maybe there's an application bug or server failure and you crash at a bad time. In this case, you'd have a corrupted DB, because there would be pointers into invalid offsets in the mmap. Similarly, deleting an event and recovering the space will become much harder without transactional consistency.

OTOH, one really great thing you can do is directly `writev()` from the mmap to the socket. I have had success with this in the past, because the data never needs to be copied into a userspace buffer and userspace page-tables don't need updating. In fact, with `sendfile()` and good enough network card drivers, it can actually be copied directly from the kernel's page cache to the network card's memory.
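For illustration, the `sendfile()` path looks roughly like this in Rust (a sketch assuming Linux and the `libc` crate; short writes and most error handling are glossed over):

```rust
use std::fs::File;
use std::net::TcpStream;
use std::os::unix::io::AsRawFd;

// Sketch only: stream `len` bytes of an on-disk event log, starting at
// `offset`, straight into a connected socket. The kernel copies from its
// page cache to the socket; no userspace buffer is involved.
fn send_range(log: &File, sock: &TcpStream, mut offset: libc::off_t, len: usize) -> std::io::Result<usize> {
    let n = unsafe { libc::sendfile(sock.as_raw_fd(), log.as_raw_fd(), &mut offset, len) };
    if n < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(n as usize) // may be a short write; a real caller would loop
}
```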

Although possible, I don't actually do this in strfry for a bunch of reasons. First of all, nostr events are usually too small to really make this worthwhile. Also, I don't want to deal with the complexity of long-running transactions in the case of a slow-reading client. Lastly, this usually isn't possible anyway because of websocket compression and strfry's feature where events can be stored zstd compressed on disk.

Anyway, on balance I think it's preferable to store the event data in the same DB as the indices. With LMDB you get pretty much all the benefits (and downsides) of a mmap() anyway. I don't know about sled, I haven't looked into that before.

About serialisation, I quickly looked at speedy and from what I can tell it seems like it's pretty simple and concatenates the struct's fields directly in-order (?). That may work pretty well in some cases, but let me elaborate a bit on my choice of flatbuffers.

flatbuffers have an offsets table (like a "vtable" if you're familiar with the C++ terminology). This lets you directly access a single field from the struct without having to touch the rest of the struct. For instance, if you wanted to extract a single string from a large object, you look up its offset in the vtable and directly access that memory. This has the advantage that none of the rest of the memory of the struct needs to be touched. If you're using a mmap like LMDB, then some of the record may not even need to be paged in.

Typically you will never fully "deserialise" the record - that doesn't even really make sense in flatbuffers because there is no deserialised format (or rather, the encoded format is the same as the "deserialised" one). This means that getting some information out of a flatbuffer does not require any heap memory allocations. Additionally, small fixed-size fields are typically stored together (even if there is some long string "in between" them), so loading one will often as a side-effect cause the others to be paged in/loaded as well, and live in the same CPU cache line.
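To make that concrete, here is a toy sketch of the access pattern in Rust. This is not the real flatbuffers wire format, just the shape of the idea: the record starts with a small offsets table, so a reader can jump straight to one field without decoding or copying anything else.

```rust
// Toy illustration only -- not the actual flatbuffers layout. The record
// begins with a field count and one u16 offset per field; each field's
// bytes sit somewhere later in the record, in field order.
fn read_field(record: &[u8], field: usize) -> &[u8] {
    let count = u16::from_le_bytes([record[0], record[1]]) as usize;
    assert!(field < count, "no such field");
    let at = 2 + field * 2;
    let start = u16::from_le_bytes([record[at], record[at + 1]]) as usize;
    let end = if field + 1 < count {
        u16::from_le_bytes([record[at + 2], record[at + 3]]) as usize
    } else {
        record.len()
    };
    // A borrowed slice straight out of the record: no decoding, no heap
    // allocation, and the bytes of the other fields are never touched.
    &record[start..end]
}
```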

I may be wrong, but speedy looks more like protobufs where you have to traverse and decode the entire struct, and allocate memory for each field before you can access a single field in the record. In the case where you wanted to read just a couple fields (pretty common for a DB record), this could be pretty wasteful. Again, the data required for indexing nostr events is fairly small so this may not be a big deal -- my choice of default is from experience with much larger records.
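If speedy does work the way I described, reading one field out of a stored record would look roughly like this (made-up struct, purely to illustrate the shape of the cost):

```rust
use speedy::{Readable, Writable};

// Made-up example record. If decoding is all-or-nothing, the whole struct
// is parsed into an owned value before any single field can be read.
#[derive(Readable, Writable)]
struct IndexedEvent {
    id: Vec<u8>,
    pubkey: Vec<u8>,
    created_at: u64,
    kind: u64,
    content: String, // allocated on every read, even if we only want `kind`
}

fn kind_of(bytes: &[u8]) -> Option<u64> {
    let ev = IndexedEvent::read_from_buffer(bytes).ok()?;
    Some(ev.kind)
}
```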

One more thing: flatbuffers is quite interesting (and I think unique?) in that it allows you to separate the error checking from the access/deserialisation. For example, when you deserialise you typically assume this input is untrusted, so you'll be doing bounds-checking, field validation, etc. However, in the case where the records are purely created and consumed internally, these checks are unnecessary. This is one reason flatbuffers is so effective for the use-case of database records.

Sorry for this big wall of text!

Very cool. Thanks for the explanations.

On the separate mmap, I'm writing the serialized OpaqueEvent first, then writing the indexes. On a crash we might have a useless event that didn't get indexed, but it wouldn't be corrupt. The writer takes a write lock, appends, updates the 'end' pointer at the beginning of the map, then releases the lock.

Also, these events are opaque (customer-ready, pre-validated). I never access the fields.

I noticed strfry does an index scan of these flatbuffers (of stripped down events) when necessary, using the fields to filter. I'm instead trying to use multiple btree-like indexes:

id -> offset

(created_at, id) -> offset

(pubkey, created_at, id) -> offset

(event_kind, created_at, id) -> offset

(tagdata, created_at, id) -> offset

Each key has to be unique per event, hence the id on each. And the created_at is there so I can scan a since/until range. It's a lot of data to store, but I think that is okay on a personal relay. I don't think it would cause more page faults or cache misses.
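As a sketch of what I mean by the key layout (sled assumed, names made up; big-endian integers so byte order matches numeric order):

```rust
// Sketch of one possible encoding for the (pubkey, created_at, id) index
// in sled -- illustrative names, not my actual code.
fn pubkey_time_key(pubkey: &[u8; 32], created_at: u64, id: &[u8; 32]) -> Vec<u8> {
    let mut key = Vec::with_capacity(32 + 8 + 32);
    key.extend_from_slice(pubkey);
    key.extend_from_slice(&created_at.to_be_bytes());
    key.extend_from_slice(id);
    key
}

// index entry: composite key -> offset of the event in the mmap'ed file
fn index_event(tree: &sled::Tree, pubkey: &[u8; 32], created_at: u64,
               id: &[u8; 32], offset: u64) -> sled::Result<()> {
    tree.insert(pubkey_time_key(pubkey, created_at, id), &offset.to_be_bytes()[..])?;
    Ok(())
}

// all events by one pubkey between `since` and `until`, inclusive
fn scan_pubkey(tree: &sled::Tree, pubkey: &[u8; 32], since: u64, until: u64) -> sled::Iter {
    let lo = pubkey_time_key(pubkey, since, &[0u8; 32]);
    let hi = pubkey_time_key(pubkey, until, &[0xffu8; 32]);
    tree.range(lo..=hi)
}
```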

I'm finding the set of events that match the filter by doing a functional intersection over the results from these index scans. I say functional because I don't collect each result into a set and then intersect them at the end - I intersect as I go. First index: collect the offsets into a hashset. Second index: check each offset against the previous hashset and, if it's there, copy it into the next hashset (thus functionally being a set intersection); otherwise drop it.
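In code the idea is roughly this shape (a sketch over plain iterators of offsets; the real thing would pull from the index scans):

```rust
use std::collections::HashSet;

// Sketch of the "intersect as I go" idea: each scan yields candidate
// offsets from one index; every later scan only keeps offsets that
// survived all the previous ones.
fn intersect_scans<I: Iterator<Item = u64>>(scans: Vec<I>) -> HashSet<u64> {
    let mut scans = scans.into_iter();
    let mut survivors: HashSet<u64> = match scans.next() {
        Some(first) => first.collect(), // first index: collect everything
        None => return HashSet::new(),
    };
    for scan in scans {
        let next: HashSet<u64> = scan.filter(|off| survivors.contains(off)).collect();
        survivors = next; // running intersection with the previous set
        if survivors.is_empty() {
            break; // no event can match the filter any more
        }
    }
    survivors
}
```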

Maybe this is a dumb algorithm. Databases are more my brother's thing, but I'm stubbornly trying to reinvent them myself (I would never learn as well otherwise).

I think deleted events will just be marked as such in another hash table, and a periodic rewrite will filter them out. I haven't really worked that out and now that I'm thinking about it, it could be quite expensive.

A lot of these OS-level features like sendfile() or pwritev() become very difficult to use when relying on an upstream library that doesn't expose its API in a way that lets you use them. If I use tungstenite for websockets, I don't think I can write into its outbound buffers at a low level.

At this point I'm really still exploring. It will change.


Got it, makes sense. I think if you make sure the event is written before updating the DB then that would be safe, as you describe. It still might be worthwhile to check whether there is any performance difference between the parallel file and storing the validated event JSON in the sled DB along with the indices. If it works like LMDB then retrieving it will be functionally identical (an offset into an mmap), and an event write will require only one fsync, not two. It also might save you a lot of trouble later on, when it comes to backups and the periodic rewrites you mention.
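I haven't used sled myself, but from its docs something like this should give you the all-or-nothing property (a sketch with made-up names, not a schema recommendation):

```rust
use sled::Transactional;
use sled::transaction::TransactionResult;

// Sketch: the validated event JSON and its index entry are committed
// together, so a crash can't leave a dangling index entry behind.
fn store_event(
    events: &sled::Tree,   // id -> validated event JSON
    by_time: &sled::Tree,  // (created_at, id) -> id
    id: &[u8; 32],
    created_at: u64,
    event_json: &[u8],
) -> TransactionResult<(), ()> {
    (events, by_time).transaction(|(ev_tx, time_tx)| {
        ev_tx.insert(&id[..], event_json)?;
        let mut key = Vec::with_capacity(8 + 32);
        key.extend_from_slice(&created_at.to_be_bytes());
        key.extend_from_slice(&id[..]);
        time_tx.insert(key, &id[..])?;
        Ok(())
    })
}
```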

About the indices: No, I don't think it's a dumb algorithm. It's actually a really difficult problem, to figure out which index or indices to use in a scan. strfry has all the indices you mention, also composite with created_at. Additionally, there is a (pubkey,kind,created_at) index, which I think is useful because I believe queries like "get contact list of user X" are quite common.

strfry currently never joins two indices together or does a set intersection as you describe -- its algorithm is even dumber. It only ever uses one index. If the query can be fully satisfied by the index then it doesn't need to load the flatbuffers (most queries are like this). But if it does need to load the flatbuffers, that's when it's important that accessing them and applying filters are fast. Optimisations for this code path also benefit the live event filtering (i.e. post-EOSE).

Your approach sounds like it might be quite effective. The only thing I'd watch out for is whether your IO patterns become excessively random. The nice thing about working on one index at a time is that the access is largely over contiguous data. Even several contiguous streams are probably OK, but some queries have hundreds of pubkeys in them (for instance).

I'm sure you are right about just using sled instead of doing my own mmap. But I feel pride that it seems to work so I'm gonna leave that code for a while. Once it bites me, I'll do what you suggest. I've got a compiler fence so it doesn't reorder the writes, but I had forgotten to put in the msyncs!

Sorry, I meant a full fence, not just a compile-time one.
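Roughly the shape I'm aiming for (a simplified sketch using the memmap2 crate, not my actual code; capacity checks and the write lock are left out):

```rust
use memmap2::MmapMut;
use std::sync::atomic::{fence, Ordering};

// Sketch: the first 8 bytes of the map hold the 'end' offset, events are
// appended after it. The event bytes are flushed (msync) before the header
// is updated, and a full fence keeps the publish from being reordered
// ahead of the copy as seen by other threads mapping the same region.
fn append_event(map: &mut MmapMut, event: &[u8]) -> std::io::Result<u64> {
    let end = u64::from_le_bytes(map[0..8].try_into().unwrap()) as usize;
    map[end..end + event.len()].copy_from_slice(event);
    map.flush_range(end, event.len())?;        // msync the event bytes first
    fence(Ordering::SeqCst);                   // full fence, not just a compiler fence
    let new_end = (end + event.len()) as u64;
    map[0..8].copy_from_slice(&new_end.to_le_bytes());
    map.flush_range(0, 8)?;                    // then msync the header
    Ok(new_end)
}
```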