Thing is, the only way x86 can go wider is to go as out-of-order as Apple's CPUs go. I believe the decoder does not look at instructions but scans across a text-like stream of bytes in a way similar to how we scan text. It reads an opcode, sets up a μcode template, and then goes on to scan in the arg specs until that template is full. Sometimes the arg specs will just drop as-is into the template; sometimes they will be links to discrete μops that will fill the template arg value slot when they complete. But a single instruction may have internal dependencies that limit its ability to spread its μops out very much, and while you could have the μops of several instructions running alongside each other, avoiding entangling interdependencies is pretty hard on x86.
That's not really a good characterization of why x86 decode is hard - I think you've got some misconceptions about x86 which are misleading you.
The big problem is that the opcode doesn't always come first. There are zero to N prefix bytes, and the way prefixes are encoded means the CPU can't tell how many are present just by looking at the first one, so it has to scan byte by byte to find the start of the main opcode. There are also bytes after the opcode, used for addressing modes and immediate values, but once you've decoded the opcode (plus, for some forms, the ModRM/SIB bytes that follow it in a fixed order), you know exactly how many trailing bytes there are, so scanning is only needed for the prefixes.
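To make that concrete, here's a toy sketch in C of the prefix-scanning part. It's not a real x86 length decoder: the names (is_legacy_prefix, find_opcode) and the simplifications are mine, it assumes 64-bit mode, and it waves away the two/three-byte opcode maps, VEX/EVEX, and everything after the opcode byte.

    #include <stddef.h>
    #include <stdint.h>

    /* Toy model: skip prefix bytes to locate the main opcode. */
    static int is_legacy_prefix(uint8_t b) {
        switch (b) {
        case 0xF0: case 0xF2: case 0xF3:              /* LOCK, REPNE, REP */
        case 0x2E: case 0x36: case 0x3E:              /* segment overrides */
        case 0x26: case 0x64: case 0x65:
        case 0x66: case 0x67:                         /* operand/address size */
            return 1;
        default:
            return 0;
        }
    }

    static int is_rex_prefix(uint8_t b) {             /* 64-bit mode only */
        return (b & 0xF0) == 0x40;
    }

    /* Returns the offset of the opcode byte, or -1 if the buffer ran out.
       The point: there's no way to know this offset without walking the
       bytes one at a time. */
    static ptrdiff_t find_opcode(const uint8_t *p, size_t avail) {
        size_t i = 0;
        while (i < avail && is_legacy_prefix(p[i]))
            i++;
        if (i < avail && is_rex_prefix(p[i]))         /* REX sits right before the opcode */
            i++;
        return (i < avail) ? (ptrdiff_t)i : -1;
    }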
This sucks because if you want to decode many instructions in parallel, you end up with a serializing dependency chain. You may know the address of the first instruction you want to decode, but you can't know where the second instruction begins until the first decoder finds and decodes its opcode.
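In code, the serial version of decode looks like this (still the toy sketch; toy_insn_length is an assumed helper built on find_opcode above plus the usual opcode/ModRM length tables, not a real API):

    /* Assumed helper from the sketch above: total length of the instruction
       starting at p, or 0 if it can't be determined from 'avail' bytes. */
    extern size_t toy_insn_length(const uint8_t *p, size_t avail);

    /* Serial decode: where instruction i+1 starts is an output of decoding
       instruction i, so every iteration waits on the previous one. */
    static size_t decode_serial(const uint8_t *buf, size_t n) {
        size_t pc = 0, count = 0;
        while (pc < n) {
            size_t len = toy_insn_length(buf + pc, n - pc);
            if (len == 0)
                break;          /* truncated or undecodable */
            pc += len;          /* the next start depends on this result */
            count++;
        }
        return count;
    }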
Many x86 chips with wide execution paths solve this decode serialization problem with brute force. They just provide a ton of decoders, enough to dedicate one to every byte offset in the entire chunk of memory provided by the instruction fetch unit each cycle. They're all started in parallel, and the serial part (picking the winning decoders which started on real instruction boundaries) is deferred until all of them have found an "opcode" and used it to determine a "length".
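As a sketch (again my toy model, not any particular core's pipeline), the brute-force version runs a length decode at every byte offset of the fetch block and then does a cheap serial pass that just chains the precomputed lengths; FETCH_BLOCK and the helper names are made up for illustration:

    #define FETCH_BLOCK 16    /* bytes delivered by fetch per cycle; varies by core */

    /* Assumed helper from the earlier sketch. */
    extern size_t toy_insn_length(const uint8_t *p, size_t avail);

    /* Phase 1 (parallel in hardware): speculatively length-decode at every
       byte offset, whether or not an instruction really starts there.
       Requires n <= FETCH_BLOCK. */
    static void length_decode_all(const uint8_t *blk, size_t n,
                                  size_t len_at[FETCH_BLOCK]) {
        for (size_t i = 0; i < n; i++)        /* in silicon these all run at once */
            len_at[i] = toy_insn_length(blk + i, n - i);
    }

    /* Phase 2 (serial, but trivial): starting from the known first boundary,
       hop through the precomputed lengths. Every offset we land on was a real
       instruction start; the work done at all the other offsets is discarded. */
    static size_t pick_winners(const size_t len_at[FETCH_BLOCK], size_t n,
                               size_t starts[FETCH_BLOCK]) {
        size_t pc = 0, count = 0;
        while (pc < n && len_at[pc] != 0) {
            starts[count++] = pc;
            pc += len_at[pc];
        }
        return count;
    }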
That's why x86 decode is so power hungry: if you want high clock speed and reasonable decode width at the same time, you end up being driven to do lots of useless work that has to be discarded.
None of this requires the machinery you're imagining, a lot of which I'm not even sure how to interpret. The closest thing I know of is that in Intel CPUs (probably AMD's too), only one of the N decoders is capable of handling "complex" instructions, meaning those which issue more than two uops to the execution backend. You only need one complex decoder because the vast majority of x86 instructions are "simple" ones which translate to at most two uops.
The complex decoder is mostly to handle things like the x86 "rep" (repeat) prefix. Not every x86 instruction can be repeated, but the handful which can are things like movsb, a byte copy instruction, or stosb, a byte store. I picked these as examples because by adding the rep prefix to them, you can implement memcpy() or memset() in a single instruction. Neat trick, but these self-looping instructions require a state machine which loops, emitting uops until the loop's halting condition is satisfied. You don't need to decode multiple of these in parallel because a single looping, long-running instruction is going to keep the backend fed with uops for quite a while all on its own.
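For instance, here's memset() in a single instruction, using GCC/Clang extended inline asm on x86-64 (just a sketch of the trick; whether it's actually fast depends on how a given CPU implements the rep loop):

    #include <stddef.h>

    /* rep stosb: a microcoded loop that stores AL to [RDI] and decrements
       RCX until it reaches zero, i.e. memset() in one instruction. */
    static void *memset_rep_stosb(void *dst, int c, size_t n) {
        void *d = dst;
        __asm__ volatile("rep stosb"
                         : "+D"(d), "+c"(n)       /* RDI and RCX are read and updated */
                         : "a"((unsigned char)c)  /* AL holds the byte to store */
                         : "memory");
        return dst;
    }

The memcpy() equivalent is the same idea with rep movsb and a source pointer in RSI.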
There aren't limits on spreading uops from a single instruction out across execution units (other than the type of the uop determining where it can go), nor are there limits on running uops from multiple instructions at the same time. There isn't a tendency to have tons of "entangling interdependencies" either. I think you've invented a headcanon where x86 is a super complicated CISC along the lines of every instruction containing fifty kitchen sinks, but it actually isn't that way at all. Just the opposite - it was always, by accident, the RISCiest of the CISCs. Most instructions are quite simple and straightforward in terms of effects and dependencies; it's the details of how they're encoded and all the other legacy stuff that make x86 kind of a mess. Just not so much of a mess that you can't build a fast x86, which is why it's survived and the other CISCs from its era haven't.