AI text manglers are not the most efficient tool for the job of transforming one semantic graph into another; that is what state machines are for, and ...

OK, here's a good little bit of info about why each linguistic transformation task is really best served by a specific implementation built for that task:

https://stackoverflow.com/questions/20355112/why-do-we-use-intermediate-languages-instead-of-ast

Every data decomposition, storage and recomposition task involves this stuff, and the absolute best tool for the job is a human brain that has spent a lot of time specialising in how to build state machines. There is lexical decomposition, then you have to design an intermediate data format, and then an engine that takes that intermediate representation and uses it to derive your product formatting.
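
A toy sketch of that three-stage shape (decompose, hold an intermediate representation, recompose), where the "key: value" input format, the Markdown-ish output target and all the names are invented purely for illustration:

```python
# Toy sketch of the decompose -> intermediate representation -> recompose pipeline.
# The input format (simple "key: value" lines) and the output target (Markdown-ish)
# are both made up; a real pipeline would swap in its own stages.

from dataclasses import dataclass

@dataclass
class Field:
    """Intermediate representation: one decoded field of the source document."""
    name: str
    value: str

def lex(text: str) -> list[Field]:
    """Lexical decomposition: break the raw text into IR records."""
    fields = []
    for line in text.splitlines():
        if ":" in line:
            name, value = line.split(":", 1)
            fields.append(Field(name.strip(), value.strip()))
    return fields

def emit_markdown(fields: list[Field]) -> str:
    """Recomposition engine: derive the product formatting from the IR."""
    return "\n".join(f"**{f.name}**: {f.value}" for f in fields)

if __name__ == "__main__":
    source = "title: Frankenstein\nauthor: Mary Shelley"
    print(emit_markdown(lex(source)))
```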

The AIs can sorta do it, but they are hella inefficient at this in production; I mean thousands, if not millions, of times more processing time, watts of energy and bandwidth to achieve the same result.

There is a lot of existing work relating to the decomposition and synthesis steps of this process for specific encoding systems, and what I want you to do, in order to help solve this problem, is this:

enumerate the most common sources and targets of the data you want to transform, and how you want it to be accessed (this influences the best form to store it in). When you have figured out a reasonable skeleton of those things, then we have the map of how to speed up your job.
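
A first pass at that skeleton could just be a couple of tables in code; the entries below are placeholders to show the shape, not a claim about what your actual sources, targets or access patterns are:

```python
# Placeholder enumeration of sources, targets, and access patterns.
# Nothing here is prescriptive; it only shows the kind of skeleton to fill in.

SOURCES = {
    "gutenberg_html": {"format": "HTML5", "access": "bulk download"},
    "research_paper": {"format": "PDF/LaTeX", "access": "per-document fetch"},
}

TARGETS = {
    "epub3": {"consumer": "e-reader", "access": "sequential reading"},
    "search_index": {"consumer": "query engine", "access": "random lookup by term"},
}
```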

Yeah, nostr:nprofile1qqs82et8gqsfjcx8fl3h8e55879zr2ufdzyas6gjw6nqlp42m0y0j2spz9mhxue69uhkummnw3ezuamfdejj7kpr32f was saying he might have a parser for books, but we'd probably need to adjust it and build a second one for research papers.


Discussion

I talked to my cousin about it and he is just converting the HTML5 version of the Project Gutenberg ebooks into EPUB3 after doing some clean-up.
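
I don't know exactly what tooling he uses, but one common way to sketch that conversion step is to shell out to pandoc, something like:

```python
# Sketch only: not his actual workflow. This just shows one common way to turn a
# cleaned-up HTML5 file into EPUB3, by shelling out to pandoc.
import subprocess

def html_to_epub3(html_path: str, epub_path: str) -> None:
    subprocess.run(
        ["pandoc", html_path, "--to=epub3", "-o", epub_path],
        check=True,
    )

# html_to_epub3("frankenstein.html", "frankenstein.epub")
```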

Hm. They usually already come in EPUB3 format.

Using HTML has the benefit of tags and images, of course. I should probably switch to those and write a parser based upon the HTML tags.
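
Something like this could be a starting point for that tag-based parser; BeautifulSoup is just an example library here, and which tags matter depends on how the Gutenberg HTML is actually structured:

```python
# Minimal sketch of tag-based parsing with BeautifulSoup; the tag list is a guess,
# not a description of the real source files.
from bs4 import BeautifulSoup

def extract_structure(html: str) -> list[tuple[str, str]]:
    """Return (tag, text) pairs for headings, paragraphs, and image sources."""
    soup = BeautifulSoup(html, "html.parser")
    out = []
    for node in soup.find_all(["h1", "h2", "h3", "p", "img"]):
        if node.name == "img":
            out.append(("img", node.get("src", "")))
        else:
            out.append((node.name, node.get_text(strip=True)))
    return out
```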

Is he just scraping their website?

I am not sure how he handles the initial download, but he wanted to put in a bunch of heuristics to make a more presentable EPUB without mis-sized images, unfortunate line breaks, etc.
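
No idea which heuristics he actually settled on, but the general shape might look like this; the width threshold and the tag choices are invented for illustration:

```python
# Invented heuristics, only to illustrate the kind of clean-up being described:
# clamp oversized images and drop hard line breaks inside paragraphs.
from bs4 import BeautifulSoup

MAX_IMG_WIDTH = 600  # arbitrary threshold, not a real recommendation

def clean_up(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        width = img.get("width")
        if width and width.isdigit() and int(width) > MAX_IMG_WIDTH:
            img["width"] = str(MAX_IMG_WIDTH)   # shrink mis-sized images
    for br in soup.select("p br"):
        br.replace_with(" ")                    # drop unfortunate line breaks
    return str(soup)
```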

That's probably where an LLM would come in useful, for refining.