Re: multiple cycles, I worded that awkwardly, didn't mean to imply they weren't pipelined at all. Just that you don't want to increase L1 pipeline depth a ton.

> I've never designed a CPU where that was the case. The cache always takes multiple cycles. Typically 1 or more to generate and transmit the address, 1 for the read, and 1 or more to return the result. The read (where the address is already latched and everything is ready to go) has never been a thing where we come close to using the whole cycle. In every single CPU I've designed it is some random logic path that you would never have thought about which ends up setting the cycle time.
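To make the pipelining point concrete, here's a minimal sketch (mine, not from either poster) of a fully pipelined 3-stage L1 access matching the stages described above: address generate/transmit, array read, result return. The stage count and timing are illustrative assumptions; the point is only that multi-cycle latency doesn't cost throughput when a new access can issue every cycle.

```python
def simulate(requests, stages=3):
    """Return (issue_cycle, result_cycle) for each request through a
    fully pipelined cache with `stages` pipeline stages.

    Illustrative model only: one new access issues per cycle, and each
    result returns `stages` cycles after issue.
    """
    timeline = []
    for i, _ in enumerate(requests):
        issue = i               # one new access enters the pipe per cycle
        done = i + stages       # result emerges `stages` cycles later
        timeline.append((issue, done))
    return timeline

# Three back-to-back loads: each has 3 cycles of latency, but results
# still complete one per cycle, so sustained throughput is unaffected.
print(simulate(["lw a", "lw b", "lw c"]))  # [(0, 3), (1, 4), (2, 5)]
```

Deepening the pipeline (raising `stages`) shifts every `done` cycle later without changing the one-per-cycle completion rate, which is why the cost of extra L1 pipeline depth shows up as load-use latency rather than bandwidth.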
As for whether it's usually the critical path, I can't claim direct experience. I was going by what I was told about the test program for the <REDACTED> core my employer used in an ASIC many years ago - it used an L1 cache read to validate timing. I was also told this approach was common, since the L1 dcache was often the critical path.
This core was much higher-performance than an embedded microcontroller, and it was hardened for the process node, but it wasn't high-performance relative to contemporary desktop and server cores. Maybe that's a factor?