Transforming scalar code into vector instructions is an extraordinarily difficult problem. The only way to get even passable output from a vectorizing compiler is to write the code as vectors to begin with, for example with a cross-platform assembly tool like Orc.
And even then you'll often end up significantly worse off than if you wrote the assembly by hand.
When we ran Intel's compiler on the C versions of our DSP functions, it vectorized a grand total of one of them, and did that one terribly.
The problem is that you used C, which doesn't have any syntax to represent meta-information about the problem you're trying to solve. When you write out C code to, say, add a list of numbers, the compiler only sees a loop with a fixed order of operations and has to guess at your intent, so it's hard for it to optimize. But it's very easy for the compiler when you tell it "sum this list of numbers".
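To make the "sum this list of numbers" point concrete, here is a rough sketch in plain C. The function names `sum_scalar` and `sum_lanes` are made up for illustration and don't come from any codebase mentioned above. The first version is the loop you'd normally write; since float addition isn't associative, a compiler generally has to keep that exact order unless you relax the floating-point rules (GCC's `-ffast-math`, for instance). The second version spells out the reordering by hand as four independent lanes, which is roughly what "writing the code as vectors" means. Whether a given compiler actually turns it into SIMD is not guaranteed.

```c
#include <stddef.h>

/* Hypothetical example: the obvious scalar reduction.
 * Every addition depends on the previous one, so under strict
 * floating-point semantics the order cannot be changed. */
float sum_scalar(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Hypothetical example: the same reduction written as four lanes.
 * The reordering is now explicit in the source, so the compiler does
 * not need to prove (or be told) that it is allowed. */
float sum_lanes(const float *x, size_t n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;

    /* Main loop: each lane accumulates every fourth element. */
    for (; i + 4 <= n; i += 4)
        for (int k = 0; k < 4; k++)
            lane[k] += x[i + k];

    /* Horizontal add of the four partial sums. */
    float acc = (lane[0] + lane[1]) + (lane[2] + lane[3]);

    /* Scalar tail for the leftover 0 to 3 elements. */
    for (; i < n; i++)
        acc += x[i];
    return acc;
}
```

Note that the two functions can return slightly different results on real data, because they add in a different order. That difference is exactly the information the compiler is missing when all it has is the first loop.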