Like I did here, manually. It took hours. AI and some good scripting should be able to do the same in minutes.
You are speaking of data retrieval, but I was thinking of data uploading over one npub nostr:nprofile1qqsruxks7wja8sfzghdh0zz5d3p6mc7e03hqgmzefasp0ntv6stydyqprdmhxue69uhhg6r9vehhyetnwshxummnw3erztnrdakj7qguwaehxw309ankjarrd96xzer9dshxummnw3erztnrdakj7qguwaehxw309a6xsetrd96xzer9dshxummnw3erztnrdakj7ef5npc, to move data from PDFs and websites into 30040 events, and to parse them.
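For reference, a rough sketch of what one of those 30040 events could look like, assuming the NKBIP-01-style convention where a kind 30040 index references kind 30041 section events; the tag layout and names here are illustrative, not authoritative:

```python
import time
import json

def build_30040_index(pubkey_hex, d_identifier, title, section_d_tags):
    """Sketch of an unsigned kind 30040 publication index event.

    Assumes the NKBIP-01-style convention: the index carries no content
    itself, and each "a" tag points at a kind 30041 section event that
    holds one parsed chunk of the source PDF or web page.
    """
    return {
        "kind": 30040,
        "pubkey": pubkey_hex,
        "created_at": int(time.time()),
        "content": "",
        "tags": [
            ["d", d_identifier],
            ["title", title],
            # one "a" tag per section, in reading order
            *[["a", f"30041:{pubkey_hex}:{d}"] for d in section_d_tags],
        ],
    }

# hypothetical usage: index for a three-section paper
event = build_30040_index(
    pubkey_hex="owner-pubkey-hex",  # placeholder, not a real key
    d_identifier="example-paper-v1",
    title="Example Paper",
    section_d_tags=["example-paper-v1-abstract",
                    "example-paper-v1-introduction",
                    "example-paper-v1-methods"],
)
print(json.dumps(event, indent=2))
```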
Discussion
write side is not the bottleneck
also, parsing data ... you are raising the spectre of the need to build ... dun dun dun ... state machines, the most difficult bitches in all of software engineering to make work
We know.
state machine specialist at your service
i'm only barely a journeyman at the task, i revel at every chance i get to do it... i've done tree walkers and parsers and all this kind of stuff... makes me happy thinking about it, even more than streamlining concurrent processes
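to make the state-machine talk concrete, here is a toy sketch of the pattern: an explicit finite state machine that pulls quoted strings (with escapes) out of a character stream. purely illustrative, not anything from an existing parser.

```python
from enum import Enum, auto

class State(Enum):
    TEXT = auto()      # outside a quoted string
    STRING = auto()    # inside a quoted string
    ESCAPE = auto()    # just saw a backslash inside a string

def extract_quoted(chars):
    """Toy FSM: return every double-quoted string found in the input."""
    state, current, found = State.TEXT, [], []
    for ch in chars:
        if state is State.TEXT:
            if ch == '"':
                state, current = State.STRING, []
        elif state is State.STRING:
            if ch == "\\":
                state = State.ESCAPE
            elif ch == '"':
                found.append("".join(current))
                state = State.TEXT
            else:
                current.append(ch)
        elif state is State.ESCAPE:
            current.append(ch)      # keep the escaped character literally
            state = State.STRING
    return found

print(extract_quoted('say "hello \\"world\\"" and "bye"'))
# ['hello "world"', 'bye']
```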
nostr:nprofile1qqs8qy3p9qnnhhq847d7wujl5hztcr7pg6rxhmpc63pkphztcmxp3wgpz9mhxue69uhkummnw3ezuamfdejj7qgmwaehxw309a6xsetxdaex2um59ehx7um5wgcjucm0d5hsz9nhwden5te0dehhxarjv4kxjar9wvhx7un89uqaujaz built the Alexandria parser and he seemed to have fun with it.
Well, AsciiDoctor built the text parser. I extended it to layer some Nostr event-specific state on top of AsciiDoctor's Abstract Syntax Tree (AST).
It definitely was fun to whip out some good ol'-fashioned tree-walking algorithms.
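Not the actual Alexandria code, but a minimal sketch of that kind of tree walking, assuming a generic node shape (type, text, children) rather than Asciidoctor's real AST API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Generic AST node; a stand-in for whatever the real parser emits."""
    type: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)

def collect_sections(node):
    """Depth-first walk that flattens section nodes into (title, body) pairs,
    roughly the shape you want before turning each one into its own event."""
    sections = []
    if node.type == "section":
        body = "\n".join(c.text for c in node.children if c.type == "paragraph")
        sections.append((node.text, body))
    for child in node.children:
        sections.extend(collect_sections(child))
    return sections

# hypothetical document: a book with two sections
doc = Node("document", children=[
    Node("section", "Chapter 1", [Node("paragraph", "First paragraph."),
                                  Node("paragraph", "Second paragraph.")]),
    Node("section", "Chapter 2", [Node("paragraph", "Another paragraph.")]),
])
for title, body in collect_sections(doc):
    print(title, "->", body)
```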
Famous last words.
There's some AsciiMath in there, and footnotes, etc., but they're not rendering. You can see them in the JSON.
AI text manglers are not the most efficient tools for transforming one semantic graph into another; that is what state machines are for, and ...
ok, here's a good little bit of info about why each linguistic transformation task really is best served by a specific implementation built for the task
https://stackoverflow.com/questions/20355112/why-do-we-use-intermediate-languages-instead-of-ast
every data decomposition, storage, and recomposition task involves this stuff, and the absolute best tool for the job is a human brain that has spent a lot of time specialising in how to build state machines: there is lexical decomposition, then you have to design an intermediate data format, and then an engine that takes that intermediate representation and uses it to derive your product formatting
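as a toy illustration of that three-step shape (lexical decomposition, an intermediate representation, then an emitter that derives the target formatting), with completely made-up markup rules:

```python
# 1. lexical decomposition: break raw text into tagged tokens
def lex(source):
    for line in source.splitlines():
        if line.startswith("= "):
            yield ("HEADING", line[2:].strip())
        elif line.strip():
            yield ("TEXT", line.strip())

# 2. intermediate representation: group tokens into heading + paragraph blocks
def to_ir(tokens):
    blocks, current = [], None
    for kind, value in tokens:
        if kind == "HEADING":
            current = {"heading": value, "paragraphs": []}
            blocks.append(current)
        elif current is not None:
            current["paragraphs"].append(value)
    return blocks

# 3. emitter: derive one target formatting (here, HTML) from the IR
def emit_html(blocks):
    out = []
    for block in blocks:
        out.append(f"<h2>{block['heading']}</h2>")
        out.extend(f"<p>{p}</p>" for p in block["paragraphs"])
    return "\n".join(out)

sample = "= Intro\nfirst line\n\n= Details\nsecond line"
print(emit_html(to_ir(lex(sample))))
```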
the AIs can sorta do it, but they are hella inefficient at this in production, i mean, thousands if not millions of times more processing time, energy, and bandwidth to achieve the same result
there is a lot of existing work relating to the decomposition and synthesis steps of this process for specific encoding systems, and what i want you to do, in order to help solve this problem, is this:
enumerate the most common sources and targets of the data you want to transform, and how you want it to be accessed (this influences the best form to store it in); when you have figured out a reasonable skeleton of those things, then we have the map of how to speed up your job
Yeah, nostr:nprofile1qqs82et8gqsfjcx8fl3h8e55879zr2ufdzyas6gjw6nqlp42m0y0j2spz9mhxue69uhkummnw3ezuamfdejj7kpr32f was saying he might have a parser for books, but we'd probably need to adjust it and build a second one for research papers.
I talked to my cousin about it, and he is just converting the HTML5 version of the Project Gutenberg ebooks into epub3 after doing some cleanup.
Hm. They usually already have an epub3 version available.
Using HTML has the benefit of tags and images, of course. I should probably switch to those and write a parser based upon the HTML tags.
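A rough sketch of what a tag-driven parser over those HTML files could look like, using only the Python standard library; the assumption that each chapter starts at an <h2> is a guess about the markup, not verified:

```python
from html.parser import HTMLParser

class ChapterSplitter(HTMLParser):
    """Collect (chapter title, text) pairs from an HTML ebook,
    treating each <h2> as a chapter boundary (an assumption about the markup)."""

    def __init__(self):
        super().__init__()
        self.chapters = []          # list of [title, text-so-far]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
            self.chapters.append(["", ""])

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_heading = False

    def handle_data(self, data):
        if not self.chapters:
            return                  # ignore front matter before the first <h2>
        if self._in_heading:
            self.chapters[-1][0] += data
        else:
            self.chapters[-1][1] += data

# hypothetical usage with a downloaded Gutenberg HTML file
splitter = ChapterSplitter()
splitter.feed("<h1>Book</h1><h2>Chapter I</h2><p>It was a dark night.</p>")
for title, text in splitter.chapters:
    print(title.strip(), "->", text.strip())
```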
Is he just scraping their website?