Not a chip designer, but I think it could lead to a more efficient use of the functional units.
With more instructions decoded per clock and a larger reorder buffer, the core should be able to keep its units busier, so less energy is burned without producing useful output.
This efficiency gain of course needs to outweigh the power consumption of the additional decoders. That is easier to achieve on ARM, since x86's variable-length instructions make wide parallel decoding complicated.
In addition to higher unit utilization, the increased parallelism should also help with the "race to sleep" power management strategy: finish the work as fast as possible, then drop into a low-power state.
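
To make the "race to sleep" trade-off concrete, here is a back-of-the-envelope sketch. All wattages and timings are made-up numbers for illustration, not measurements of any real core, and the function name is hypothetical. It only shows that a wider core can draw more power while active and still use less energy overall, provided it finishes early enough.

```python
def energy_joules(active_power_w, active_time_s, idle_power_w, window_s):
    """Energy used over a fixed time window: run the work, then sleep for the rest."""
    if active_time_s > window_s:
        raise ValueError("work does not fit in the window")
    sleep_time_s = window_s - active_time_s
    return active_power_w * active_time_s + idle_power_w * sleep_time_s


# Hypothetical numbers, chosen only to illustrate the trade-off.
WINDOW_S = 1.0  # length of the period we compare over
IDLE_W = 0.1    # assumed power draw while asleep

# Narrow core: lower active power, but busy for longer.
narrow = energy_joules(active_power_w=2.0, active_time_s=0.8,
                       idle_power_w=IDLE_W, window_s=WINDOW_S)

# Wide core: extra decoders cost power, but the same work finishes sooner.
wide = energy_joules(active_power_w=2.6, active_time_s=0.5,
                     idle_power_w=IDLE_W, window_s=WINDOW_S)

print(f"narrow: {narrow:.2f} J, wide: {wide:.2f} J")
```

With these assumed numbers the wide core comes out ahead. If the speedup were smaller, or the extra decoders more expensive, the result would flip, which is the "needs to outweigh" condition above.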