Like I did here, manually. It took hours. AI and some good scripting should be able to do the same in minutes.
You are speaking of data retrieval, but I was thinking of data uploading over one npub nostr:nprofile1qqsruxks7wja8sfzghdh0zz5d3p6mc7e03hqgmzefasp0ntv6stydyqprdmhxue69uhhg6r9vehhyetnwshxummnw3erztnrdakj7qguwaehxw309ankjarrd96xzer9dshxummnw3erztnrdakj7qguwaehxw309a6xsetrd96xzer9dshxummnw3erztnrdakj7ef5npc, to move data from PDFs and websites into 30040 events, and to parse them.
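For reference, a rough sketch of what one of those 30040 events could look like, assuming the NKBIP-01-style convention where a kind 30040 index references kind 30041 section events; the tag layout and names here are illustrative, not authoritative:

```python
import time
import json

def build_30040_index(pubkey_hex, d_identifier, title, section_d_tags):
    """Sketch of an unsigned kind 30040 publication index event.

    Assumes the NKBIP-01-style convention: the index carries no content
    itself, and each "a" tag points at a kind 30041 section event that
    holds one parsed chunk of the source PDF or web page.
    """
    return {
        "kind": 30040,
        "pubkey": pubkey_hex,
        "created_at": int(time.time()),
        "content": "",
        "tags": [
            ["d", d_identifier],
            ["title", title],
            # one "a" tag per section, in reading order
            *[["a", f"30041:{pubkey_hex}:{d}"] for d in section_d_tags],
        ],
    }

# hypothetical usage: index for a three-section paper
event = build_30040_index(
    pubkey_hex="owner-pubkey-hex",  # placeholder, not a real key
    d_identifier="example-paper-v1",
    title="Example Paper",
    section_d_tags=["example-paper-v1-abstract",
                    "example-paper-v1-introduction",
                    "example-paper-v1-methods"],
)
print(json.dumps(event, indent=2))
```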
Discussion
write side is not the bottleneck
also, parsing data ... you are raising the spectre of the need to build ... dun dun dun ... state machines, the most difficult bitches in all of software engineering to make work
We know.
state machine specialist at your service
i'm only barely a journeyman at the task, i revel at every chance i get to do it... i've done tree walkers and parsers and all this kind of stuff... makes me happy thinking about it, even more than streamlining concurrent processes
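to make the state-machine talk concrete, here is a toy sketch of the pattern: an explicit finite state machine that pulls quoted strings (with escapes) out of a character stream. purely illustrative, not anything from an existing parser.

```python
from enum import Enum, auto

class State(Enum):
    TEXT = auto()      # outside a quoted string
    STRING = auto()    # inside a quoted string
    ESCAPE = auto()    # just saw a backslash inside a string

def extract_quoted(chars):
    """Toy FSM: return every double-quoted string found in the input."""
    state, current, found = State.TEXT, [], []
    for ch in chars:
        if state is State.TEXT:
            if ch == '"':
                state, current = State.STRING, []
        elif state is State.STRING:
            if ch == "\\":
                state = State.ESCAPE
            elif ch == '"':
                found.append("".join(current))
                state = State.TEXT
            else:
                current.append(ch)
        elif state is State.ESCAPE:
            current.append(ch)      # keep the escaped character literally
            state = State.STRING
    return found

print(extract_quoted('say "hello \\"world\\"" and "bye"'))
# ['hello "world"', 'bye']
```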
nostr:nprofile1qqs8qy3p9qnnhhq847d7wujl5hztcr7pg6rxhmpc63pkphztcmxp3wgpz9mhxue69uhkummnw3ezuamfdejj7qgmwaehxw309a6xsetxdaex2um59ehx7um5wgcjucm0d5hsz9nhwden5te0dehhxarjv4kxjar9wvhx7un89uqaujaz built the Alexandria parser and he seemed to have fun with it.
Well, AsciiDoctor built the text parser. I extended it to layer some Nostr event-specific state on top of AsciiDoctor's Abstract Syntax Tree (AST).
It definitely was fun to whip out some good ol'-fashioned tree-walking algorithms.
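Not the actual Alexandria code, but a minimal sketch of that kind of tree walking, assuming a generic node shape (type, text, children) rather than Asciidoctor's real AST API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Generic AST node; a stand-in for whatever the real parser emits."""
    type: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)

def collect_sections(node):
    """Depth-first walk that flattens section nodes into (title, body) pairs,
    roughly the shape you want before turning each one into its own event."""
    sections = []
    if node.type == "section":
        body = "\n".join(c.text for c in node.children if c.type == "paragraph")
        sections.append((node.text, body))
    for child in node.children:
        sections.extend(collect_sections(child))
    return sections

# hypothetical document: a book with two sections
doc = Node("document", children=[
    Node("section", "Chapter 1", [Node("paragraph", "First paragraph."),
                                  Node("paragraph", "Second paragraph.")]),
    Node("section", "Chapter 2", [Node("paragraph", "Another paragraph.")]),
])
for title, body in collect_sections(doc):
    print(title, "->", body)
```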
Famous last words.
There's some AsciiMath in there, and footnotes, etc., but they're not rendering. You can see them in the JSON.
AI text manglers are not the most efficient tools for transforming one semantic graph into another; that is what state machines are for, and ...
ok, here's a good little bit of info about why each linguistic transformation task really is best served by a specific implementation built for the task
https://stackoverflow.com/questions/20355112/why-do-we-use-intermediate-languages-instead-of-ast
every data decomposition, storage, and recomposition task involves this stuff, and the absolute best tool for the job is a human brain that has spent a lot of time specialising in how to build state machines: there is lexical decomposition, then you have to design an intermediate data format, and then an engine that takes that intermediate representation and uses it to derive your product formatting
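as a toy illustration of that three-step shape (lexical decomposition, an intermediate representation, then an emitter that derives the target formatting), with completely made-up markup rules:

```python
# 1. lexical decomposition: break raw text into tagged tokens
def lex(source):
    for line in source.splitlines():
        if line.startswith("= "):
            yield ("HEADING", line[2:].strip())
        elif line.strip():
            yield ("TEXT", line.strip())

# 2. intermediate representation: group tokens into heading + paragraph blocks
def to_ir(tokens):
    blocks, current = [], None
    for kind, value in tokens:
        if kind == "HEADING":
            current = {"heading": value, "paragraphs": []}
            blocks.append(current)
        elif current is not None:
            current["paragraphs"].append(value)
    return blocks

# 3. emitter: derive one target formatting (here, HTML) from the IR
def emit_html(blocks):
    out = []
    for block in blocks:
        out.append(f"<h2>{block['heading']}</h2>")
        out.extend(f"<p>{p}</p>" for p in block["paragraphs"])
    return "\n".join(out)

sample = "= Intro\nfirst line\n\n= Details\nsecond line"
print(emit_html(to_ir(lex(sample))))
```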
the AIs can sorta do it, but they are hella inefficient at this in production, i mean, thousands if not millions of times more processing time, energy, and bandwidth to achieve the same result
there is a lot of existing work relating to the decomposition and synthesis steps of this process for specific encoding systems, and what i want you to do, in order to help solve this problem, is this:
enumerate the most common sources and targets of the data you want to transform, and how you want it to be accessed (this influences the best form to store it in); when you have figured out a reasonable skeleton of those things, then we have the map of how to speed up your job
Yeah, nostr:nprofile1qqs82et8gqsfjcx8fl3h8e55879zr2ufdzyas6gjw6nqlp42m0y0j2spz9mhxue69uhkummnw3ezuamfdejj7kpr32f was saying he might have a parser for books, but we'd probably need to adjust it and build a second one for research papers.
I talked to my cousin about it, and he is just converting the HTML5 version of the Project Gutenberg ebooks into epub3 after doing some cleanup.
Hm. They usually already have an epub3 version available.
Using HTML has the benefit of tags and images, of course. I should probably switch to those and write a parser based upon the HTML tags.
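A rough sketch of what a tag-driven parser over those HTML files could look like, using only the Python standard library; the assumption that each chapter starts at an <h2> is a guess about the markup, not verified:

```python
from html.parser import HTMLParser

class ChapterSplitter(HTMLParser):
    """Collect (chapter title, text) pairs from an HTML ebook,
    treating each <h2> as a chapter boundary (an assumption about the markup)."""

    def __init__(self):
        super().__init__()
        self.chapters = []          # list of [title, text-so-far]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
            self.chapters.append(["", ""])

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_heading = False

    def handle_data(self, data):
        if not self.chapters:
            return                  # ignore front matter before the first <h2>
        if self._in_heading:
            self.chapters[-1][0] += data
        else:
            self.chapters[-1][1] += data

# hypothetical usage with a downloaded Gutenberg HTML file
splitter = ChapterSplitter()
splitter.feed("<h1>Book</h1><h2>Chapter I</h2><p>It was a dark night.</p>")
for title, text in splitter.chapters:
    print(title.strip(), "->", text.strip())
```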
Is he just scraping their website?