But I assume no GPU has specialized SHA-256 instructions, so even if you run many hashes in parallel it might not be faster.
Discussion
Depends on how many cores you have. If you think about it, that's why ASICs exist.
If someone managed to break down the SHA-256 update into a matrix multiplication, then a GPU would be MUCH quicker.
But afaik nobody has done that.
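To make the "not a matrix multiplication" point concrete, here is a rough pure-Python sketch of the SHA-256 update (the compression function). It is only an illustration, not how any GPU or ASIC miner is actually written. The helper names (`first_primes`, `icbrt`, `rotr`, `compress`, `sha256`) are my own, and the round constants are derived from the first 64 primes instead of being typed in as a table. What it shows is that each round is 32-bit rotates, shifts, XOR/AND and additions mod 2^32, with no multiply-accumulate step anywhere.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF  # keep every value in 32 bits


def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes


def icbrt(n):
    """Integer floor of the cube root of n (binary search)."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


PRIMES = first_primes(64)
# Round constants: first 32 fractional bits of the cube roots of the primes.
K = [icbrt(p << 96) & MASK for p in PRIMES]
# Initial state: first 32 fractional bits of the square roots of the first 8 primes.
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def compress(state, block):
    """One SHA-256 update: mix a 64-byte block into the 8-word state."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):  # message schedule
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)

    a, b, c, d, e, f, g, h = state
    for t in range(64):  # 64 rounds of bitwise logic and modular adds
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[t] + w[t]) & MASK
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g

    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]


def sha256(message):
    """Pad the message, run compress() over each block, return the digest."""
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack(">Q", 8 * len(message))
    state = list(H0)
    for i in range(0, len(padded), 64):
        state = compress(state, padded[i:i + 64])
    return struct.pack(">8I", *state)


if __name__ == "__main__":
    # Sanity check against the standard library.
    assert sha256(b"abc") == hashlib.sha256(b"abc").digest()
```

Each call to `compress` depends on the state from the previous block, so a single hash is inherently sequential. The parallelism a GPU or ASIC can exploit comes from running many independent hashes side by side.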
Maybe it's not actually faster, but just much more energy efficient. Combined with massive parallelism, the end result is more hashes per joule.
But hopefully we don't have to guess, and someone who has actually implemented this can explain it all on Stack Exchange.