Too complicated. Branching has to be handled by a separate unit that holds the canonical program counter (and various contingent program counters) and interfaces with the instruction fetch hardware. It also has to be closely coupled to the scheduler. The ALUs, by contrast, receive input operands, perform a function, and produce output results. You don’t want them to do more than that; otherwise your critical path gets much longer and your clock speed plummets. And if each ALU had its own branch unit, you’d still need some sort of arbiter to sort it all out, since multiple in-flight instructions may each decide to branch, or not, to different instruction addresses.
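To make the split concrete, here is a rough Python sketch of the idea, not a model of any real core. The names (`alu_add`, `BranchResolution`, `BranchUnit`) are made up for illustration, and the "oldest mispredicted branch wins" policy is my assumption about what the arbiter would do, not something taken from a particular design.

```python
from dataclasses import dataclass
from typing import List, Optional


def alu_add(a: int, b: int) -> int:
    """A pure ALU op: operands in, result out. It never sees the PC."""
    return (a + b) & 0xFFFFFFFFFFFFFFFF  # wrap to 64 bits


@dataclass(frozen=True)
class BranchResolution:
    """Outcome of one in-flight branch (hypothetical structure)."""
    age: int               # program-order sequence number; lower = older
    predicted_target: int  # where fetch was steered
    actual_target: int     # where the branch really goes

    @property
    def mispredicted(self) -> bool:
        return self.predicted_target != self.actual_target


class BranchUnit:
    """Owns the canonical PC and arbitrates between in-flight branches."""

    def __init__(self, pc: int) -> None:
        self.pc = pc

    def resolve(self, resolutions: List[BranchResolution]) -> Optional[int]:
        """Redirect to the oldest mispredicted branch, if there is one.

        Returns the age of the branch that caused the redirect, or None.
        Assumed policy: younger branches are on the wrong path anyway,
        so only the oldest misprediction matters.
        """
        wrong = [r for r in resolutions if r.mispredicted]
        if not wrong:
            return None
        oldest = min(wrong, key=lambda r: r.age)
        self.pc = oldest.actual_target
        return oldest.age
```

The point is just that the arbitration and the PC state live in one place, while the ALU function stays a stateless operands-to-result mapping.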
Well, that’s what Dougall describes, and it’s also what Apple’s patents seem to suggest. Also, if it’s a different unit, how does compare-and-branch fusion work?