How many events/edges are processed in 15 sec?

Zero events are processed in those 15 seconds, because the graph database is updated continuously as new kind 3 and kind 10000 events are detected. The graph right now has 204k pubkeys and 6.5 million relationships, mostly follows, some mutes.

By comparison, it takes me about a minute to calculate personalized GrapeRank for the entire graph, using scripts I custom coded. I'm guessing it could be faster if I used neo4j to calculate GrapeRank for me, but that's a hill I have not yet climbed.
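GrapeRank itself is custom, but for what it's worth, neo4j's Graph Data Science plugin ships personalized PageRank, which is probably the closest built-in analogue. A minimal sketch with the official Python driver, assuming :NostrUser nodes keyed by a pubkey property and linked by :FOLLOW edges (connection details are placeholders):

```python
# Sketch: personalized PageRank via neo4j GDS (an analogue, not GrapeRank itself).
# Assumes :NostrUser nodes with a pubkey property, linked by :FOLLOW edges.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Project the follow graph into GDS's in-memory catalog.
    session.run("CALL gds.graph.project('wot', 'NostrUser', 'FOLLOW')")

    # Personalized PageRank seeded from a single pubkey.
    result = session.run("""
        MATCH (me:NostrUser {pubkey: $pubkey})
        CALL gds.pageRank.stream('wot', {sourceNodes: [me], dampingFactor: 0.85})
        YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).pubkey AS pubkey, score
        ORDER BY score DESC LIMIT 100
    """, pubkey="<hex pubkey>")
    for record in result:
        print(record["pubkey"], record["score"])

driver.close()
```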

I use more event kinds and links between them, so my dataset is roughly 100x bigger. Do you have any idea if neo4j could handle that? How much RAM/disk/time would that take?

Neo4j is definitely resource-hungry.

But it is designed to handle very large datasets, with real-world examples in the hundreds of billions of nodes. My understanding is that there is no theoretical upper limit to the number of nodes and edges other than those imposed by hardware.

I envision two applications for neo4j and nostr: 1) central services like nostr.band, and 2) personalized WoT relays, with personal relays using an intelligent Knowledge Graph-based WoT to curate the local dataset so it doesn't get too unwieldy and expensive.

So what would it take to put the entire nostr.band dataset into a neo4j graph? Perhaps nostr:npub10mtatsat7ph6rsq0w8u8npt8d86x4jfr2nqjnvld2439q6f8ugqq0x27hf you can help us do a back-of-the-envelope calculation of the RAM and other requirements to support this database. Suppose we were to host it on AWS. How much would it cost?

I imagine we would start with 3 node types: NostrUser, NostrEvent, and NostrRelay.

Then add relationships:

user to user: FOLLOW, MUTE, REPORT, etc.

user to event: AUTHOR

Various event to event relationships: IS_A_REPLY_TO, etc.

Various other relationships and node types as we turn it into a bona fide Knowledge Graph.
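A rough Cypher sketch of that starter schema, run through the official Python driver. The labels and properties are just the ones proposed above, plus placeholder values; a sketch, not a final model:

```python
# Sketch: uniqueness constraints plus example merges for the proposed schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

constraints = [
    "CREATE CONSTRAINT IF NOT EXISTS FOR (u:NostrUser)  REQUIRE u.pubkey  IS UNIQUE",
    "CREATE CONSTRAINT IF NOT EXISTS FOR (e:NostrEvent) REQUIRE e.eventId IS UNIQUE",
    "CREATE CONSTRAINT IF NOT EXISTS FOR (r:NostrRelay) REQUIRE r.url     IS UNIQUE",
]

with driver.session() as session:
    for stmt in constraints:
        session.run(stmt)

    # One user-to-user edge and one user-to-event edge, as examples.
    session.run("""
        MERGE (a:NostrUser {pubkey: $follower})
        MERGE (b:NostrUser {pubkey: $followee})
        MERGE (a)-[:FOLLOW]->(b)
    """, follower="<hex pubkey A>", followee="<hex pubkey B>")

    session.run("""
        MERGE (u:NostrUser {pubkey: $author})
        MERGE (e:NostrEvent {eventId: $event_id})
          ON CREATE SET e.created_at = $created_at
        MERGE (u)-[:AUTHOR]->(e)
    """, author="<hex pubkey A>", event_id="<hex event id>", created_at=1700000000)

driver.close()
```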

How big is your current dataset? Do you filter anything out or keep every event that you come across?

Just on the RAM requirements, Neo4j's guidance is to have enough RAM to hold the entire dataset. If you don't have that much, it's fine, but you'll be limited by your storage IO speed.

Their minimum recommended cloud configuration is 2 CPUs/2 GB RAM, though 16+ CPUs are suggested. You can get by with 4 CPUs in my experience, but big queries will take longer. I don't believe the free license imposes any limits beyond the lack of enterprise features/optimizations.

Let’s calculate how much it would cost to put all of nostr into neo4j 🤓

We can use nostr.band for reference.

Assume the events are kept in a performant key-value db like LMDB. We don't need to load the content field into neo4j. Maybe not the tags either.

I would store the event JSONs in nodes at first, but it would be worth comparing how storing them separately affects performance.

Suppose each :NostrEvent node were limited to just a few properties: eventId, created_at, author, plus or minus tags. That would reduce the db size by quite a lot.
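A quick back-of-envelope along those lines. The per-record sizes are the commonly cited figures for Neo4j's fixed-size record store; the node and relationship counts below are made-up placeholders, not nostr.band's actual numbers:

```python
# Rough store sizing. ~15 B/node, ~34 B/relationship, ~41 B/property record
# are the commonly cited Neo4j record-store sizes; long strings (like a
# 64-char eventId) spill into the dynamic string store, so treat this as a floor.
NODE_B, REL_B, PROP_B = 15, 34, 41

nodes = 1_000_000_000    # placeholder: users + slim :NostrEvent nodes
rels = 5_000_000_000     # placeholder: FOLLOW/MUTE/AUTHOR/reply edges
props_per_node = 3       # eventId, created_at, author

raw_bytes = nodes * (NODE_B + props_per_node * PROP_B) + rels * REL_B
print(f"~{raw_bytes / 1e9:.0f} GB raw store")  # ~308 GB with these counts,
# before indexes and the page-cache headroom you'd want for all-in-RAM speed
```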

Yeah. Strfry does the heavy lifting for storing events already. Neo4j just needs to store the relationships between them to run WoT algorithms.

Yup. Pairing LMDB and neo4j together makes a lot of sense. If you want to do keyword search through content, pull from LMDB. Then to filter and sort the results, that’s what neo4j is for.
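A sketch of that split, assuming the lmdb Python package, event JSONs keyed by event id, and the node/relationship names proposed earlier. A real deployment would use a proper text index instead of a full scan:

```python
# LMDB for content search, neo4j for WoT filtering.
import json
import lmdb
from neo4j import GraphDatabase

env = lmdb.open("./events.lmdb", readonly=True)
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def keyword_hits(keyword: str) -> list[str]:
    """Full scan for illustration only; index content search in production."""
    ids = []
    with env.begin() as txn:
        for key, value in txn.cursor():
            if keyword in json.loads(value)["content"]:
                ids.append(key.decode())
    return ids

def filter_by_wot(my_pubkey: str, event_ids: list[str]):
    """Keep only hits authored by someone within two follow-hops of me."""
    with driver.session() as session:
        return session.run("""
            MATCH (me:NostrUser {pubkey: $me})-[:FOLLOW*1..2]->(a:NostrUser)
            MATCH (a)-[:AUTHOR]->(e:NostrEvent)
            WHERE e.eventId IN $ids
            RETURN DISTINCT e.eventId AS id, e.created_at AS created_at
            ORDER BY created_at DESC
        """, me=my_pubkey, ids=event_ids).data()

print(filter_by_wot("<hex pubkey>", keyword_hits("bitcoin")))
```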

I plan on adding compact wot graph support to nostrdb (which is lmdb). Seems like an obvious win for this state to be embedded in nostr clients.

neo4j? Some other FOSS graph db? Or custom code the graph?

nostrdb is a custom embeddable nostr database/relay built on lmdb. It’s the engine that powers damus ios and damus notedeck. I want it to automatically calculate follow graphs/web of trust when processing contact events, so this state is readily available and queryable.

I listened to your recent talk with nostr:npub1jlrs53pkdfjnts29kveljul2sm0actt6n8dxrrzqcersttvcuv3qdjynqn on nostr:npub1mlcas7pe55hrnlaxd7trz0u3kzrnf49vekwwe3ca0r7za2n3jcaqhz8jpa and I’m excited to see where you go with notedeck 💜

If you want notedeck to calculate follow graphs / WoT, then you're going to want to use a graph db like neo4j, which is what I use to calculate your personalized Follows Network in under half a minute. It spits out a JSON file with 180k pubkeys and the number of follow-hops between you and each pubkey. This is separate from calculating personalized PageRank, which it also does in about 15 seconds.

https://fountain.fm/episode/A49mWX658yQqbvOU9M98
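Roughly what that Follows Network export could look like, assuming the :NostrUser/:FOLLOW schema from earlier in the thread. The hop bound is string-formatted because Cypher doesn't allow parameters in variable-length patterns:

```python
# Sketch: dump {pubkey: follow-hops} for everyone reachable from one pubkey.
import json
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def follows_network(my_pubkey: str, max_hops: int = 6) -> dict[str, int]:
    query = """
        MATCH (me:NostrUser {pubkey: $me})
        MATCH p = shortestPath((me)-[:FOLLOW*..%d]->(u:NostrUser))
        WHERE u <> me
        RETURN u.pubkey AS pubkey, length(p) AS hops
    """ % max_hops
    with driver.session() as session:
        return {r["pubkey"]: r["hops"] for r in session.run(query, me=my_pubkey)}

print(json.dumps(follows_network("<hex pubkey>")))
```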

I have lmdb, and I can build up these relations myself using lmdb indices. It doesn't really make sense to pull in a new database just for this feature.

Actually, it doesn't make sense to recreate Neo4j's graph data structure and query language. It's more mature than you think.

That seems like a lot of work; maybe eventually. Just gonna keep the API relatively simple: a function that gives a WoT score between two pubkeys.
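One possible shape for that minimal API: a follow-overlap heuristic over an in-memory follow map, standing in here for lmdb indices. Just a sketch of the interface, not how nostrdb will actually define the score:

```python
# follows maps pubkey -> set of followed pubkeys, built from kind 3 events.
def wot_score(follows: dict[str, set[str]], a: str, b: str) -> float:
    """Score b from a's perspective: 1.0 if followed directly, else the
    fraction of a's follows who follow b."""
    mine = follows.get(a, set())
    if not mine:
        return 0.0
    if b in mine:
        return 1.0
    return sum(1 for f in mine if b in follows.get(f, set())) / len(mine)
```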

I agree. Neo4j has a nontrivial learning curve and we can’t all do all the things.

I think the ideal scenario would be devs who are already familiar with Neo4j to put together some ETL tools to sync LMDB with Neo4j. Something that can be readily paired with notedeck, strfry, khatru, etc.

I'd need to do some research on ingesting events into neo4j. Storage-wise it shouldn't take much more than running a relay. Where you'll pay the most is CPUs. On Amazon, the difference between 2 and 4 CPUs looks like $50/mo. I'm sure you could get good performance for a small or medium-sized web of trust on their medium 2 CPU/4 GB instances for $15-20/mo.

For all of Nostr, a Primal-style centralized cache with lots of users, I'd guess you could spend $50-100 or more a month.

I just looked, and the free edition allows 4 CPUs max, 34 billion nodes, and no clustering. Makes sense price-wise, because as soon as you're paying for more than 4 CPUs in hosting, you're big enough not to gawk at the price of an enterprise license.

Imagine building a personalized WoT relay using community edition neo4j. Fully FOSS. Personalized reputation scores. Knowledge graph to organize the content that you care about the most.

One could do a lot with 34 billion nodes.

I got strfry and neo4j playing nice on a server together. Next step is an ETL pipeline from strfry's LMDB into neo4j.

Of course, lots of relays use LMDB, so the above pipeline could easily be applied to khatru, nostrdb, etc.
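A minimal sketch of that pipeline, assuming strfry's export command (which dumps the DB as line-delimited JSON) and the follow-graph schema from above:

```python
# ETL: stream events out of strfry's LMDB and upsert kind 3 follows into neo4j.
import json
import subprocess
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
proc = subprocess.Popen(["strfry", "export"], stdout=subprocess.PIPE, text=True)

with driver.session() as session:
    for line in proc.stdout:
        event = json.loads(line)
        if event["kind"] != 3:
            continue  # only contact lists in this sketch
        follows = [t[1] for t in event["tags"] if len(t) > 1 and t[0] == "p"]
        # A fuller pipeline would also drop stale FOLLOW edges when a
        # newer kind 3 replaces an older contact list.
        session.run("""
            MERGE (a:NostrUser {pubkey: $author})
            WITH a
            UNWIND $follows AS pk
            MERGE (b:NostrUser {pubkey: pk})
            MERGE (a)-[:FOLLOW]->(b)
        """, author=event["pubkey"], follows=follows)
```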

Perhaps nostr:npub1fvmadl0mch39c3hlr9jaewh7uwyul2mlf2hsmkafhgcs3dra6dzqg6szfu + neo4j … 🤔

nostr:npub10npj3gydmv40m70ehemmal6vsdyfl7tewgvz043g54p0x23y0s8qzztl5h

neo4j can definitely handle everything that nostr can throw at it.

This was in 2021:

“Behind the Scenes of Creating the World’s Biggest Graph Database”

280 TB, 1 trillion relationships

https://neo4j.com/developer-blog/behind-the-scenes-worlds-biggest-graph-database/