yes, it is dangerous to fully trust a model trained by a big corp!
training is two steps: 1) curation of data, 2) the actual updating of weights. the second step is pretty automated; there are tools like llama-factory for it.
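to make that second step concrete, here is a rough sketch of a LoRA fine-tune done by hand with hugging face transformers + peft, which is roughly the kind of loop llama-factory wraps behind a config file. the model id, dataset path and hyperparameters are just placeholders, not a recommendation:

    # rough sketch of the "changing of weights" step using transformers + peft.
    # model id, dataset path and hyperparameters below are placeholders.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              Trainer, TrainingArguments,
                              DataCollatorForLanguageModeling)
    from peft import LoraConfig, get_peft_model

    model_id = "meta-llama/Llama-3.2-1B"     # placeholder base model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(model_id)
    model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                             task_type="CAUSAL_LM"))

    # curated notes from step 1, one text field per example
    data = load_dataset("json", data_files="curated_notes.jsonl")["train"]
    data = data.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                         max_length=1024),
                    remove_columns=data.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3,
                               per_device_train_batch_size=2,
                               learning_rate=2e-4, logging_steps=10),
        train_dataset=data,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model.save_pretrained("out/lora-adapter")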
the first step is python scripts that go through my notes and decide what is knowledge and what is chat, removing things like news and llm-generated content. i don't want other llms' output to influence my model.
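something like this is all that first step really is; the markers and paths here are made-up placeholders, the real filters depend entirely on how your own notes look:

    # rough sketch of the curation step: walk a folder of notes, keep "knowledge",
    # drop chat logs, news clippings and anything that smells llm-generated.
    # the heuristics and paths are placeholders, not the actual filters i use.
    import json
    import re
    from pathlib import Path

    NOTES_DIR = Path("notes")               # placeholder input folder
    OUT_FILE = Path("curated_notes.jsonl")

    CHAT_MARKERS = re.compile(r"^(user:|assistant:|>>>)", re.I | re.M)
    LLM_MARKERS = re.compile(r"as an ai language model|certainly! here", re.I)
    NEWS_MARKERS = re.compile(r"\b(breaking news|reuters|associated press)\b", re.I)

    def is_knowledge(text: str) -> bool:
        """keep only notes that look like my own writing, not chat or news."""
        if CHAT_MARKERS.search(text):
            return False
        if LLM_MARKERS.search(text) or NEWS_MARKERS.search(text):
            return False
        return len(text.split()) > 30        # skip trivially short notes

    with OUT_FILE.open("w", encoding="utf-8") as out:
        for path in sorted(NOTES_DIR.rglob("*.md")):
            text = path.read_text(encoding="utf-8", errors="ignore")
            if is_knowledge(text):
                out.write(json.dumps({"text": text}) + "\n")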
that's another danger: bigger corps' llms get kind of accepted as ground truth when training little models. that's very scary.