Very cool. Thanks for the explanations.

On the separate mmap, I'm writing the serialized OpaqueEvent first, then writing the indexes. On a crash we might end up with a useless event that never got indexed, but nothing would be corrupt. The writer takes a write lock, appends the event, updates the 'end' pointer at the beginning of the map, and then releases the lock.
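
Roughly the ordering I mean, as a sketch (this assumes a memmap2-style MmapMut whose first 8 bytes hold the 'end' pointer; names are made up, and locking and error handling are left out):

    use std::sync::atomic::{fence, Ordering};

    // Sketch only: the real layout and names differ.
    fn append_event(map: &mut memmap2::MmapMut, end: usize, event: &[u8]) -> std::io::Result<usize> {
        let new_end = end + event.len();

        // 1. Write the serialized OpaqueEvent past the current end.
        map[end..new_end].copy_from_slice(event);

        // 2. Make sure the event bytes are durable before the pointer is.
        map.flush_range(end, event.len())?;

        // 3. Only then publish it by bumping the 'end' pointer in the header.
        fence(Ordering::Release);
        map[0..8].copy_from_slice(&(new_end as u64).to_le_bytes());
        map.flush_range(0, 8)?;

        Ok(new_end)
    }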

Also, these events are opaque (customer-ready, pre-validated). I never access the fields.

I noticed strfry does an index scan of these flatbuffers (of stripped-down events) when necessary, using the fields to filter. I'm instead trying to use multiple btree-like indexes:

id -> offset

(created_at, id) -> offset

(pubkey, created_at, id) -> offset

(event_kind, created_at, id) -> offset

(tagdata, created_at, id) -> offset

Each key has to be unique per event, hence the id on each, and the created_at is there so I can scan a since/until range. It's a lot of data to store, but I think that's okay on a personal relay, and I don't think it would cause more page faults or cache misses.
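
For what it's worth, the keys are just byte strings built so that the btree's lexicographic order does the right thing; something like this (field sizes are illustrative, not my real types):

    // Illustrative only: assumes a 32-byte pubkey/id and a u64 created_at.
    fn pubkey_index_key(pubkey: &[u8; 32], created_at: u64, id: &[u8; 32]) -> Vec<u8> {
        let mut key = Vec::with_capacity(32 + 8 + 32);
        key.extend_from_slice(pubkey);
        // Big-endian so lexicographic key order matches numeric time order,
        // which turns a since/until filter into a plain key-range scan.
        key.extend_from_slice(&created_at.to_be_bytes());
        key.extend_from_slice(id);
        key
    }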

I find the set of events that match the filter by doing a functional intersection over the results from these indexes. I say functional because I don't collect each result into its own set and then intersect them at the end; I intersect as I go. For the first index I collect the offsets into a hashset; for the second index I check each offset against the previous hashset and, if it's there, copy it into the next hashset (which is effectively a set intersection), otherwise I drop it.
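
In sketch form, with offsets as u64 and each index scan reduced to an iterator (not my actual code):

    use std::collections::HashSet;

    // Running intersection over several index scans: the first scan seeds the
    // set, and each later scan only keeps offsets that survived so far.
    fn intersect_scans(scans: Vec<Box<dyn Iterator<Item = u64>>>) -> HashSet<u64> {
        let mut scans = scans.into_iter();
        let mut survivors: HashSet<u64> = match scans.next() {
            Some(first) => first.collect(),
            None => return HashSet::new(),
        };
        for scan in scans {
            survivors = scan.filter(|off| survivors.contains(off)).collect();
            if survivors.is_empty() {
                break; // nothing can match the whole filter
            }
        }
        survivors
    }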

Maybe this is a dumb algorithm. Databases are more my brother's thing, but I'm stubbornly trying to reinvent them myself (I would never learn as well otherwise).

I think deleted events will just be marked as such in another hash table, and a periodic rewrite will filter them out. I haven't really worked that out, and now that I'm thinking about it, it could be quite expensive.
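
Something like this is the rewrite pass I have in mind, though again I haven't worked it out; everything here is a placeholder:

    use std::collections::HashSet;

    // Placeholder sketch: walk the old log in offset order, drop anything
    // whose id is tombstoned, and hand survivors to whatever rebuilds the
    // new map and its indexes.
    fn compact(
        tombstones: &HashSet<[u8; 32]>,
        old_events: impl Iterator<Item = ([u8; 32], Vec<u8>)>, // (id, serialized event)
        mut keep: impl FnMut(&[u8; 32], &[u8]),
    ) {
        for (id, bytes) in old_events {
            if tombstones.contains(&id) {
                continue; // deleted: skipped, so its index entries disappear too
            }
            keep(&id, &bytes);
        }
    }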

A lot of these OS-level features like sendfile() or pwritev() become very difficult to use when relying on an upstream library that doesn't expose its API in a way that lets you take advantage of them. If I use tungstenite for websockets, I don't think I can write into its outbound buffers at a low level.

At this point I'm really still exploring. It will change.

Got it, makes sense. I think if you make sure the event is written before updating the DB, then that would be safe, as you describe. It still might be worthwhile to check whether there is any performance difference between the parallel file and storing the validated event JSON in the sled DB along with the indices. If it works like LMDB then retrieving it will be functionally identical (an offset into an mmap), and an event write will require only one fsync, not two. It might also save you a lot of trouble later on when it comes to the backups and periodic rewrites you mention.
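
To make that concrete, a rough sketch of what I mean with sled (tree names and key layout are just for illustration, and I haven't benchmarked this):

    // Keep the validated event JSON in sled next to the indices, keyed by id,
    // with index trees pointing back at the id. You'd cache the tree handles
    // in practice rather than opening them per write.
    fn store_event(db: &sled::Db, id: &[u8; 32], created_at: u64, json: &[u8]) -> sled::Result<()> {
        let events = db.open_tree("events")?;         // id -> event JSON
        let by_time = db.open_tree("created_at_id")?; // (created_at, id) -> id

        events.insert(id, json)?;

        let mut key = Vec::with_capacity(8 + 32);
        key.extend_from_slice(&created_at.to_be_bytes());
        key.extend_from_slice(id);
        by_time.insert(key, &id[..])?;

        // The trees share one log, so a single flush makes the whole write durable.
        db.flush()?;
        Ok(())
    }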

About the indices: No, I don't think it's a dumb algorithm. It's actually a really difficult problem, to figure out which index or indices to use in a scan. strfry has all the indices you mention, also composite with created_at. Additionally, there is a (pubkey,kind,created_at) index, which I think is useful because I believe queries like "get contact list of user X" are quite common.

strfry currently never joins two indices together or does a set intersection as you describe -- its algorithm is even dumber: it only ever uses one index. If the query can be fully satisfied by the index then it doesn't need to load the flatbuffers (most queries are like this). But when it does need to load the flatbuffers, that's where it's important that accessing them and applying filters is fast. Optimisations on this code path also benefit live event filtering (i.e. post-EOSE).
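
If it helps to see the shape of it, this is not strfry's actual code, just the single-index plan restated in the terms of this thread, with made-up names:

    // One index, one range scan: cheap key-level checks first, and the stored
    // event is only loaded when the filter has conditions the key can't answer.
    fn scan_one_index(
        index: impl Iterator<Item = (Vec<u8>, u64)>, // (composite key, offset)
        key_matches: impl Fn(&[u8]) -> bool,         // since/until, pubkey, kind from the key
        needs_event: bool,                           // does the filter go beyond the key?
        event_matches: impl Fn(u64) -> bool,         // load the event at offset, check the rest
    ) -> Vec<u64> {
        let mut out = Vec::new();
        for (key, offset) in index {
            if !key_matches(key.as_slice()) {
                continue;
            }
            // In the common case needs_event is false and no event is ever loaded.
            if !needs_event || event_matches(offset) {
                out.push(offset);
            }
        }
        out
    }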

Your approach sounds like it might be quite effective. The only thing I'd watch out for is whether your IO patterns become excessively random. The nice thing about working on one index at a time is that the access is largely to contiguous data. Even several contiguous streams are probably OK, but some queries have hundreds of pubkeys in them, for instance.


I'm sure you are right about just using sled instead of doing my own mmap. But I feel pride that it seems to work, so I'm gonna leave that code alone for a while. Once it bites me, I'll do what you suggest. I've got a compiler fence so it doesn't reorder the writes, but I had forgotten to put in the msyncs!

Sorry, I meant a full fence, not just a compile-time one.
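
For my own notes, the distinction I was getting at:

    use std::sync::atomic::{compiler_fence, fence, Ordering};

    // My understanding: compiler_fence only stops the compiler from reordering
    // memory accesses across it, while fence also orders them with respect to
    // other threads. Neither controls when dirty mmap pages reach disk -- that
    // still takes the msync / flush calls.
    fn fence_example() {
        compiler_fence(Ordering::Release); // compile-time barrier only
        fence(Ordering::Release);          // full memory-model fence
    }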