well, see, that's the thing

ok, my current task: i'm building some pieces that dramatically reduce the storage overhead for social interactions - mentions, tagging users, probably DMs, follows, mutes, all that social graph stuff will be dramatically more efficient once the thing i'm working on right now is fully working, mainly because it cuts both the storage cost and the primary retrieval latency from the DB

but event to event linking is a second kind of problem, and i'm interested in giving some good thought to how to make it work better, though i think what i've already built is pretty fast at it... i've tried a few approaches so far; my first version tried to shrink data sizes by recognising tag values that actually represent binary data, but the complexity of that logic made it very hard to get right
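
for concreteness, here is a rough sketch (in Go, with made-up names, not the actual code) of the kind of recognition that first version was attempting: a 64-character hex tag value is assumed to be a 32-byte id or pubkey and is stored decoded, everything else stays a string

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// compactValue holds a tag value either as raw binary (when it was
// hex-encoded on the wire) or as the original string.
type compactValue struct {
	bin []byte // 32-byte decoded form, nil if not hex
	str string // original string, kept only when bin is nil
}

// compactTagValue recognises values that are really 32-byte binary
// (e.g. event ids and pubkeys in "e"/"p" tags, which are 64 hex chars)
// and stores them decoded, halving their size.
func compactTagValue(v string) compactValue {
	if len(v) == 64 {
		if b, err := hex.DecodeString(v); err == nil {
			return compactValue{bin: b}
		}
	}
	return compactValue{str: v}
}

func main() {
	fmt.Println(compactTagValue("3bf0c63fcb93463407af97a5e5ee64fa883d107ef9e558472c4eb9aaaefa459d"))
	fmt.Println(compactTagValue("wss://relay.example.com"))
}
```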

where it's probably practical to create an index scheme for publishers, indexing actual events is a separate problem, the main thing being that you can't do it for everything or you open up a major resource exhaustion attack vector. i have the bounds zipped up for identities because of the web-of-relations scheme used for access control, so there is a limit on how many npubs end up in the index; the whole effort is void if there are more npubs than references to them, since the cost of an identity index entry is a query plus the identity plus 9 bytes of key type and index. events, on the other hand, can much more easily spiral out of the bounds of even 64 bits of index space, and that is a lot of overhead
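
a minimal sketch of what such a bounded identity index could look like, assuming a 1-byte key type plus an 8-byte serial to make up the 9-byte reference; the names and layout here are hypothetical, not the actual scheme

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// A hypothetical identity index: each distinct pubkey in the
// (access-controlled, therefore bounded) set of npubs gets a small
// serial, and references elsewhere store a 9-byte key instead of the
// full 32-byte pubkey.
const keyTypeIdentity byte = 0x01 // 1 byte key type + 8 byte serial = 9 bytes

type identityIndex struct {
	serials map[[32]byte]uint64
	next    uint64
}

func newIdentityIndex() *identityIndex {
	return &identityIndex{serials: make(map[[32]byte]uint64)}
}

// serialFor returns the compact serial for a pubkey, allocating one on
// first sight; the win only exists while serials are reused, i.e. while
// references outnumber distinct npubs.
func (ix *identityIndex) serialFor(pk [32]byte) uint64 {
	if s, ok := ix.serials[pk]; ok {
		return s
	}
	s := ix.next
	ix.next++
	ix.serials[pk] = s
	return s
}

// indexKey is the 9-byte on-disk reference: key type prefix + serial.
func indexKey(serial uint64) [9]byte {
	var k [9]byte
	k[0] = keyTypeIdentity
	binary.BigEndian.PutUint64(k[1:], serial)
	return k
}

func main() {
	ix := newIdentityIndex()
	var alice [32]byte
	alice[31] = 0xaa
	fmt.Printf("ref: %x\n", indexKey(ix.serialFor(alice)))
}
```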

so, yeah, i think you're capable of following my thinking here: the obvious naive answers don't apply for optimizing this kind of data access pattern. i will of course be thinking about it a lot because i want to help with this, but i'm just not interested in offering a half-baked solution that turns out to be a net slowdown once the data set grows beyond a certain scale or proportion of the total

and i fully expect that you will take that information in and return some good points that help me narrow down which direction to probe this problem from

Discussion

You are speaking of data retrieval, but I was thinking of data uploading over one npub nostr:nprofile1qqsruxks7wja8sfzghdh0zz5d3p6mc7e03hqgmzefasp0ntv6stydyqprdmhxue69uhhg6r9vehhyetnwshxummnw3erztnrdakj7qguwaehxw309ankjarrd96xzer9dshxummnw3erztnrdakj7qguwaehxw309a6xsetrd96xzer9dshxummnw3erztnrdakj7ef5npc , to move data from PDFs and websites to 30040 events, and parse them.

Like I did here, manually. It took hours. AI and some good scripting should be able to do the same in minutes.

https://next-alexandria.gitcitadel.eu/publication?d=less-partnering-less-children-or-both-by-j.i.s.-hellstrand-v-1
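
To make the 30040 idea concrete, here is a rough sketch of assembling such an index event; it assumes the Alexandria/NKBIP-01 convention of a "d" identifier, a "title" tag, and "a" tags pointing at kind 30041 section events, so treat the exact tag set as an assumption rather than a spec.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// event is the standard (unsigned) Nostr event shape.
type event struct {
	Kind      int        `json:"kind"`
	CreatedAt int64      `json:"created_at"`
	Tags      [][]string `json:"tags"`
	Content   string     `json:"content"`
}

// buildPublicationIndex sketches a kind 30040 index whose "a" tags point
// at kind 30041 section events; the tag conventions here are assumptions,
// check NKBIP-01 for the authoritative layout.
func buildPublicationIndex(pubkey, d, title string, sectionDs []string) event {
	tags := [][]string{{"d", d}, {"title", title}}
	for _, sd := range sectionDs {
		tags = append(tags, []string{"a", fmt.Sprintf("30041:%s:%s", pubkey, sd)})
	}
	return event{Kind: 30040, CreatedAt: time.Now().Unix(), Tags: tags}
}

func main() {
	ev := buildPublicationIndex("deadbeef...", "my-paper-v-1", "My Paper",
		[]string{"my-paper-v-1-ch-1", "my-paper-v-1-ch-2"})
	out, _ := json.MarshalIndent(ev, "", "  ")
	fmt.Println(string(out))
}
```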

write side is not the bottleneck

also, parsing data ... you are raising the spectre of the need to build ... dun dun dun ... state machines, the most difficult bitches in all of software engineering to make work

We know.

state machine specialist at your service

i'm only barely a journeyman at the task, but i revel in every chance i get to do it... i've done tree walkers and parsers and all this kind of stuff... makes me happy thinking about it, even more than streamlining concurrent processes

Well, AsciiDoctor built the text parser. I extended it to layer some Nostr event-specific state on top of AsciiDoctor's Abstract Syntax Tree (AST).

It definitely was fun to whip out some good ol'-fashioned tree-walking algorithms.
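
As a generic illustration of that kind of traversal (not the actual Alexandria code, and with a made-up node shape far simpler than AsciiDoctor's AST):

```go
package main

import "fmt"

// node is a made-up stand-in for an AST node; it is only here to show
// the shape of a depth-first tree walk.
type node struct {
	kind     string // "document", "section", "paragraph", ...
	title    string
	text     string
	children []*node
}

// walk visits every node depth-first and calls fn on it, the classic
// recursive tree-walking pattern.
func walk(n *node, depth int, fn func(*node, int)) {
	if n == nil {
		return
	}
	fn(n, depth)
	for _, c := range n.children {
		walk(c, depth+1, fn)
	}
}

func main() {
	doc := &node{kind: "document", title: "Paper", children: []*node{
		{kind: "section", title: "Intro", children: []*node{
			{kind: "paragraph", text: "Hello."},
		}},
	}}
	walk(doc, 0, func(n *node, depth int) {
		fmt.Printf("%*s%s %q\n", depth*2, "", n.kind, n.title)
	})
}
```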

Famous last words.

AI text manglers are not the most efficient tool for the job of transforming one semantic graph into another; that is what state machines are for, and ...

ok, here's a good little bit of info about why each linguistic transformation task really is best served by a specific implementation built for the task

https://stackoverflow.com/questions/20355112/why-do-we-use-intermediate-languages-instead-of-ast

every data decomposition, storage and recomposition task involves this stuff, and the absolute best tool for the job is a human brain that has spent a lot of time specialising in how to build state machines. there is lexical decomposition, then you have to design an intermediate data format, and then an engine that takes that intermediate representation and uses it to derive your output formatting
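
a toy end-to-end version of that pipeline, with an invented *bold* markup purely to keep it short: a two-state lexer produces tokens (the intermediate representation), and a separate engine derives one possible output format from them

```go
package main

import (
	"fmt"
	"strings"
)

type tokKind int

const (
	tokText tokKind = iota
	tokBold
)

// token is the intermediate representation between lexing and rendering.
type token struct {
	kind tokKind
	text string
}

// lex is a two-state machine: outside or inside a *bold* span.
func lex(in string) []token {
	var out []token
	var buf strings.Builder
	inBold := false
	flush := func(k tokKind) {
		if buf.Len() > 0 {
			out = append(out, token{kind: k, text: buf.String()})
			buf.Reset()
		}
	}
	for _, r := range in {
		if r == '*' {
			if inBold {
				flush(tokBold)
			} else {
				flush(tokText)
			}
			inBold = !inBold
			continue
		}
		buf.WriteRune(r)
	}
	flush(tokText)
	return out
}

// renderHTML is the engine that turns the IR into one target format;
// a second engine could derive a different format from the same tokens.
func renderHTML(ts []token) string {
	var b strings.Builder
	for _, t := range ts {
		switch t.kind {
		case tokBold:
			b.WriteString("<b>" + t.text + "</b>")
		default:
			b.WriteString(t.text)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(renderHTML(lex("plain and *emphasised* text")))
}
```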

the AIs can sorta do it, but they are hella inefficient at this in production, i mean, thousands if not millions of times more processing time, watts of energy and bandwidth to achieve it

there is a lot of existing work relating to the decomposition and synthesis steps of this process for specific encoding systems, and what i want you to do in order to help solve this problem is this:

enumerate the most common sources and targets of the data you want to transform, and how you want it to be accessed (this influences the best form to store it in); once you have figured out a reasonable skeleton of those things, we have the map of how to speed up your job

Yeah, nostr:nprofile1qqs82et8gqsfjcx8fl3h8e55879zr2ufdzyas6gjw6nqlp42m0y0j2spz9mhxue69uhkummnw3ezuamfdejj7kpr32f was saying he might have a parser for books, but we'd probably need to adjust it and build a second one for research papers.

I talked to my cousin about it, and he is just converting the HTML5 versions of the Project Gutenberg ebooks into EPUB3 after doing some clean-up.

Hm. They usually already come in EPUB3 format.

Using HTML has the benefit of tags and images, of course. I should probably switch to those and write a parser based upon the HTML tags.
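
A rough sketch of what tag-based parsing could look like, using the third-party golang.org/x/net/html package: paragraph text and image sources are pulled out of the node tree. This is illustrative only, not the planned parser.

```go
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// extract walks the parsed HTML node tree: <p> elements become text
// blocks and <img> tags are kept so images survive the conversion.
func extract(n *html.Node, out *[]string) {
	if n.Type == html.ElementNode {
		switch n.Data {
		case "p":
			var b strings.Builder
			collectText(n, &b)
			if s := strings.TrimSpace(b.String()); s != "" {
				*out = append(*out, s)
			}
			return // the paragraph's text has already been collected
		case "img":
			for _, a := range n.Attr {
				if a.Key == "src" {
					*out = append(*out, "[image: "+a.Val+"]")
				}
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		extract(c, out)
	}
}

// collectText concatenates all text nodes beneath n.
func collectText(n *html.Node, b *strings.Builder) {
	if n.Type == html.TextNode {
		b.WriteString(n.Data)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		collectText(c, b)
	}
}

func main() {
	doc, err := html.Parse(strings.NewReader(
		`<html><body><p>Once upon a time.</p><img src="cover.jpg"></body></html>`))
	if err != nil {
		panic(err)
	}
	var blocks []string
	extract(doc, &blocks)
	fmt.Println(strings.Join(blocks, "\n"))
}
```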

Is he just scraping their website?

I am not sure how he handles the initial download. But he wanted to put in a bunch of heuristics to make a more presentable epub without mis-sized images, unfortunate line breaks, etc.

That's probably where an LLM would come in useful, for refining.