Training data

Discussion

LLMs can only statistically predict logic, rather than run and validate logic.

Statistically predicted logic tends to converge on correct logic through lots of examples.

Can't you just train it on literally all binaries that exist?

Is there a Stack Overflow of people asking questions about binaries?

Is that required? I'd argue that directly ingesting all the software that exists avoids all the garbage that comes with human language. Maybe there's not enough data, though.

As @calle points out, there is no vast repository of assembly language projects to train the models on. Obviously there is plenty of compiled code, but I'm not sure there would be enough context for it to be useful.