We can’t say much for certain yet, because the announcement itself doesn’t tell us much.
That said, improved branch prediction, generally speaking, prevents the pipeline stalls that occur when there is a conditional branch in your code (for example: if x < 20, do something). When this happens, modern CPUs guess whether the branch will be taken or not. The alternative would be to wait until the CPU determines whether x < 20, but that can take a while, and in the meantime it’s better to guess what code comes next and start speculatively executing it. If the guess is wrong, you flush everything that happened after the branch and start over. That means you did work for no reason (bad for power), and you could have been executing useful instructions but weren’t (slowing down execution).
Typically, branch predictors get it right over 90% of the time. But improving the prediction, as long as you don’t increase complexity too much, can improve overall performance and power.
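You can actually see the cost of misprediction from ordinary code. Here’s a minimal sketch (a hypothetical benchmark, not from any vendor material): the same loop runs over the same values twice, but when the data is unsorted the `v >= 128` branch is essentially a coin flip and the predictor misses constantly; once the data is sorted, the branch becomes all-false then all-true, and the loop typically runs noticeably faster on modern CPUs.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <vector>

// Sum only the "large" elements. The if is a conditional branch the
// CPU must predict on every iteration.
static int64_t sum_large(const std::vector<int>& data) {
    int64_t sum = 0;
    for (int v : data) {
        if (v >= 128)  // unpredictable on random data, predictable on sorted
            sum += v;
    }
    return sum;
}

static double time_ms(const std::vector<int>& data) {
    auto start = std::chrono::steady_clock::now();
    volatile int64_t sink = sum_large(data);  // volatile: keep the work alive
    (void)sink;
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

int main() {
    std::vector<int> data(1 << 24);
    for (int& v : data) v = std::rand() % 256;

    double unsorted_ms = time_ms(data);
    std::sort(data.begin(), data.end());  // same values, predictable branch now
    double sorted_ms = time_ms(data);

    std::cout << "unsorted: " << unsorted_ms << " ms\n"
              << "sorted:   " << sorted_ms   << " ms\n";
}
```

The exact gap depends on the chip and compiler settings (a smart compiler may turn the branch into a conditional move, which hides the effect), but on a plain build the unsorted pass usually takes several times longer, and all of that extra time is flushed speculative work.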
Wider decode and execution engines likely mean each core has more ALUs operating in parallel, and the instruction scheduler can issue more instructions per cycle. This means the CPU can, on average, do more work per clock cycle (assuming it can keep all those ALUs busy). The old cores were already extraordinarily wide, but if the new ones are wider still, you would expect more work to be done each clock cycle. The benefit gets smaller and smaller as you go wider, though, because it gets harder to find instructions that can execute in parallel. For example, if A = B + C and D = A + F, you have to compute A before you can compute D; those two instructions can’t run in parallel.
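Here’s a small sketch of that dependency effect (again hypothetical, and compiler-dependent: aggressive auto-vectorization can blur the result, so compare with a plain optimized build). The first loop is one long dependency chain, like the A-before-D case above: every add waits on the previous one, so extra ALUs sit idle. The second loop keeps four independent running sums, which a wide core can issue in parallel.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    std::vector<int64_t> data(1 << 24, 1);

    // One dependency chain: each add needs the previous sum first,
    // so the adds execute serially no matter how wide the core is.
    auto t0 = std::chrono::steady_clock::now();
    int64_t sum = 0;
    for (int64_t v : data) sum += v;
    auto t1 = std::chrono::steady_clock::now();

    // Four independent chains: these adds don't depend on each other,
    // so a wide core can keep several ALUs busy and merge at the end.
    int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (std::size_t i = 0; i + 3 < data.size(); i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "one chain:   " << ms(t1 - t0).count()
              << " ms (sum=" << sum << ")\n";
    std::cout << "four chains: " << ms(t2 - t1).count()
              << " ms (sum=" << (s0 + s1 + s2 + s3) << ")\n";
}
```

This is exactly the limit a wider core runs into: the hardware can only exploit parallelism the instruction stream actually contains, which is why each extra unit of width buys less than the one before it.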