With 4K pages, you can skip the last level of translation and map a contiguous, aligned region of memory as a single 2M "super-page" (normally called a "block"). With 16K pages the block size becomes 32M. Seems to me that if you are using UMA, the larger block size is probably more practical when working with complex GPU datasets, though I could be all wrong about that.
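For anyone wondering where 2M and 32M come from, here is a quick sketch of the arithmetic. It assumes the usual 8-byte descriptors, so one table (which is one page in size) holds granule/8 entries, and a block at the second-to-last level covers a full last-level table's worth of pages. The function names are just mine for illustration.

```python
DESCRIPTOR_BYTES = 8  # AArch64 translation table descriptors are 64 bits wide


def entries_per_table(granule_bytes):
    # A translation table occupies exactly one page (granule).
    return granule_bytes // DESCRIPTOR_BYTES


def level2_block_bytes(granule_bytes):
    # Skipping the last level maps everything one last-level table
    # would have covered: (entries per table) * (page size).
    return entries_per_table(granule_bytes) * granule_bytes


for granule in (4 * 1024, 16 * 1024):
    block_mib = level2_block_bytes(granule) // (1024 * 1024)
    print(f"{granule // 1024}K granule: {block_mib}M block")
```

So 512 entries of 4K give 2M, and 2048 entries of 16K give 32M.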
One other advantage to 16K pages is that, with 48-bit virtual addresses, the first level of page lookup is indexed by a single bit. This means that, in theory, the first level could consist of a pair of internal registers – 4K paging requires a trip to the top-level page table to fetch one of 512 entries, so 16K paging could save some memory visits and TLB slots (each level, IIUC, eats a TLB slot). ARM does not promote the paired-level-zero register strategy, but if that is not what Apple is doing, why not? You have to have a table root register anyway – if it points to a 2-entry table, you might as well just have 2 semi-root registers.
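To show where the one bit comes from, here is a rough sketch that splits a virtual address into per-level index widths. It assumes a 48-bit VA and 8-byte descriptors (my assumptions, and the helper name is made up); other VA sizes would give a different top-level width.

```python
def level_index_bits(granule_bytes, va_bits=48):
    """Return index widths in bits per level, top level first."""
    offset_bits = granule_bytes.bit_length() - 1  # log2(page size)
    bits_per_level = offset_bits - 3              # granule/8 entries per table
    remaining = va_bits - offset_bits             # bits left for table indices
    widths = []
    # Fill levels from the bottom up; the top level gets whatever is left.
    while remaining > 0:
        take = min(bits_per_level, remaining)
        widths.append(take)
        remaining -= take
    return widths[::-1]


for granule in (4 * 1024, 16 * 1024):
    widths = level_index_bits(granule)
    print(f"{granule // 1024}K: index bits {widths}, "
          f"top-level entries {2 ** widths[0]}")
```

With 4K pages the 36 index bits split evenly into four 9-bit levels (512 entries each). With 16K pages the 34 index bits split into three 11-bit levels with 1 bit left over, which is the 2-entry top level.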
As far as tricks go, programmers have been using tricks and shortcuts since the days of stone knives and bear skins. They work nicely today, but we have reached a point where tricks must needs be abandoned in favor of by-the-book techniques that will still work next year.