Two notes I wanted to add:

1) In addition to parallel inference and training, prompt encoding is also parallelizable even at batch_size=1, because the prompt tokens can be encoded by the LLM in parallel instead of decoded serially one by one. The token inputs into LLMs always have shape (B, T): batch by time. Parallel inference decoding is (high B, T=1), training is (high B, high T), and a long prompt is (B=1, high T), as illustrated in the sketch after these notes. So this workload can also become compute-bound (e.g. above 160 tokens) and the A100 would shine again. As your prompts get longer, your MacBook will fall farther behind the A100.

2) The M2 chips from Apple are actually quite an amazing lineup and come in much larger shapes and sizes. The M2 Pro and M2 Max have 200 and 400 GB/s of memory bandwidth respectively (you can get these in a MacBook Pro!), and the M2 Ultra (in the Mac Studio) has 800 GB/s. So the M2 Ultra is the smallest, prettiest, out-of-the-box easiest, most powerful personal LLM node today.
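To make the shapes in note 1 concrete, here is a minimal PyTorch sketch. The tiny model, its dimensions, and the 160-token prompt are all illustrative assumptions, not any real setup; the point is only that the whole prompt is handled in a single (B=1, high T) forward pass, while generation has to call the model once per new token.

```python
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    # Toy causal language model, purely for shape illustration.
    def __init__(self, vocab=256, dim=64, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, idx):                      # idx: (B, T) token ids
        B, T = idx.shape
        mask = nn.Transformer.generate_square_subsequent_mask(T)  # causal mask
        x = self.blocks(self.embed(idx), mask=mask)
        return self.head(x)                      # (B, T, vocab) logits

model = TinyCausalLM().eval()

# Prompt encoding: one parallel pass over all prompt positions, shape (B=1, high T).
prompt = torch.randint(0, 256, (1, 160))
with torch.no_grad():
    logits = model(prompt)                       # all 160 positions computed at once

# Decoding, by contrast, is serial: one model call per generated token.
# (This toy loop re-feeds the whole sequence; with a KV cache each step
# would effectively be a (B=1, T=1) input.)
tokens = prompt
with torch.no_grad():
    for _ in range(20):
        next_logits = model(tokens)[:, -1, :]    # logits for the last position only
        next_tok = next_logits.argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
```

The prefill pass is one big matrix-multiply-heavy workload over many tokens (compute-bound territory where the A100 shines), while the generation loop does a small amount of compute per step and is dominated by how fast weights can be streamed from memory.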