nah, i just found a better optimization by luck lol.
on getting verification to go faster: the Strauss-wNAF curve multiplication optimization accounts for a 2-4x verification speedup in libsecp256k1, which is precisely the performance ratio between the two implementations.
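for anyone wondering what Strauss-wNAF actually does, here's a rough sketch. verification needs a*G + b*P, and the trick is to recode both scalars into width-w NAF digits (sparse, odd, signed) and walk them together so a single chain of doublings is shared between the two multiplications. big caveat: this is a toy, not libsecp256k1's code and not mine. the "group" here is just integers mod a made-up prime `m` under addition so the result is easy to check, and all the names (`wnaf`, `oddMultiples`, `strauss`) are hypothetical. on a real curve `add` would be point addition and negating a point is nearly free, which is why signed digits pay off.

```go
package main

import "fmt"

// m is the size of a toy additive group (integers mod m). Hypothetical value,
// chosen only so results are easy to verify; a real curve uses point arithmetic.
const m = 1000003

// add is the group operation; it also normalizes negative inputs into [0, m).
func add(a, b int64) int64 { return ((a+b)%m + m) % m }

// wnaf returns the width-w NAF digits of k, least significant first.
// Nonzero digits are odd with |d| < 2^(w-1) and are followed by at least
// w-1 zeros. Assumes 0 <= k < 2^62 and w >= 2.
func wnaf(k int64, w uint) []int {
	var digits []int
	mod := int64(1) << w
	for k > 0 {
		var d int64
		if k&1 == 1 {
			d = k & (mod - 1)
			if d >= mod>>1 {
				d -= mod // pick the signed residue so the next w-1 bits become zero
			}
			k -= d
		}
		digits = append(digits, int(d))
		k >>= 1
	}
	return digits
}

// oddMultiples precomputes [1p, 3p, 5p, ...], 2^(w-2) entries in total.
func oddMultiples(p int64, w uint) []int64 {
	tbl := make([]int64, 1<<(w-2))
	tbl[0] = p
	twoP := add(p, p)
	for i := 1; i < len(tbl); i++ {
		tbl[i] = add(tbl[i-1], twoP)
	}
	return tbl
}

// lookup returns d*p from the table; negative digits use the negated entry.
func lookup(tbl []int64, d int) int64 {
	if d > 0 {
		return tbl[d/2]
	}
	return -tbl[-d/2]
}

// strauss computes a*g + b*p with one shared doubling per digit position.
func strauss(a, g, b, p int64, w uint) int64 {
	da, db := wnaf(a, w), wnaf(b, w)
	tg, tp := oddMultiples(g, w), oddMultiples(p, w)
	n := len(da)
	if len(db) > n {
		n = len(db)
	}
	var acc int64 // group identity
	for i := n - 1; i >= 0; i-- {
		acc = add(acc, acc) // the doubling both scalars share
		if i < len(da) && da[i] != 0 {
			acc = add(acc, lookup(tg, da[i]))
		}
		if i < len(db) && db[i] != 0 {
			acc = add(acc, lookup(tp, db[i]))
		}
	}
	return acc
}

func main() {
	a, g, b, p := int64(123456789), int64(5), int64(987654321), int64(7)
	got := strauss(a, g, b, p, 4)
	want := (a*g + b*p) % m
	fmt.Println(got, want, got == want)
}
```

the saving comes from two places: the doublings are done once instead of twice, and wNAF digits are mostly zero, so there are far fewer additions than with plain binary double-and-add.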
for some reason, actually getting that optimization to work is extremely slow going.
the fact that all the rest of the 4 essential functions now run faster is interesting though, given that this isn't a hand-coded, neckbeard C version. some of the difference could be Go's better optimization of generated assembler, and the rest could be a more efficient memory scheme. on every single operation the Go code uses less memory, mostly a lot less. my attempt at making it faster in pure Go has already yielded something that would suit a more constrained device, letting it sign/verify more stuff in less time.