Update: Added context length scaling (2k-32k) and a CPU baseline (R9 7945HX, 64 GB)

| num_ctx | Vulkan | ROCm | Native (CPU) |
|---|---|---|---|
| 2048 | 56.8 t/s | 49.7 t/s | 24.6 t/s |
| 8192 | 52.6 t/s | 49.1 t/s | 24.8 t/s |
| 16384 | 46.1 t/s | 46.5 t/s | 24.7 t/s |
| 32768 | 40.3 t/s | 43.4 t/s | 24.5 t/s |

Context scaling (2k → 32k performance loss):
- Vulkan: -29%
- ROCm: -13%
- Native: -0.4%

Power consumption:
- Vulkan: ~65 W (0.7-0.9 t/s per W)
- ROCm: ~150 W (0.3 t/s per W)
- Native: ~11 W (2.2 t/s per W)

Takeaways:
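- Small contexts (≤8k): Vulkan wins on speed and is the more efficient of the two GPU backends
- Large contexts (≥16k): ROCm catches up and is faster at 32k (43.4 vs 40.3 t/s), but draws about 2.3x the power
- CPU throughput is essentially flat across context sizes, but it is roughly 1.6-2.3x slower than the GPU backends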
- If power/heat matters more than speed: CPU is surprisingly viable at 25 t/s
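
If you want to recheck the scaling and efficiency numbers yourself, here is a minimal Python sketch that derives them from the table and the power figures above. The names (`TOKENS_PER_SEC`, `POWER_W`, `summarize`) are just illustrative, and it assumes power draw stays constant across context sizes, which I did not measure per context, so the efficiency range it prints is only an approximation.

```python
# Throughput in tokens/s, copied from the table (num_ctx 2048, 8192, 16384, 32768).
TOKENS_PER_SEC = {
    "Vulkan": [56.8, 52.6, 46.1, 40.3],
    "ROCm": [49.7, 49.1, 46.5, 43.4],
    "Native": [24.6, 24.8, 24.7, 24.5],
}

# Approximate power draw in watts, as listed above.
# Assumption: the same draw at every context size.
POWER_W = {"Vulkan": 65, "ROCm": 150, "Native": 11}


def scaling_loss_pct(rates):
    """Percent change in throughput from the smallest to the largest context."""
    return (rates[-1] - rates[0]) / rates[0] * 100


def efficiency_range(rates, watts):
    """(lowest, highest) throughput per watt, in t/s per W."""
    return min(rates) / watts, max(rates) / watts


def summarize():
    """Collect loss and efficiency per backend, rounded for display."""
    rows = {}
    for backend, rates in TOKENS_PER_SEC.items():
        low, high = efficiency_range(rates, POWER_W[backend])
        rows[backend] = {
            "loss_pct": round(scaling_loss_pct(rates), 1),
            "eff_low": round(low, 2),
            "eff_high": round(high, 2),
        }
    return rows


if __name__ == "__main__":
    for backend, row in summarize().items():
        print(
            f"{backend}: {row['loss_pct']:+.1f}% from 2k to 32k, "
            f"{row['eff_low']:.2f}-{row['eff_high']:.2f} t/s per W"
        )
```

Because "t/s per W" is the same thing as tokens per joule, you can also read the efficiency column as how many tokens you get for each joule spent.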