# AMD RX 7900 XTX ROCm Benchmarks for llama.cpp
Benchmark results for dual AMD Radeon RX 7900 XTX GPUs using the ROCm backend with llama.cpp.
Results formatted for the ROCm Scoreboard discussion #15021.
## About This Benchmark
The `benchmark-rocm.sh` script runs the canonical `llama-bench` tests used by the llama.cpp community to compare GPU performance across different hardware.
### What It Tests
| Test | Description |
|---|---|
| `pp512` | Prompt processing speed (512 tokens) - measures how fast the model processes input |
| `tg128` | Token generation speed (128 tokens) - measures inference/output speed |
| `fa=0` | Flash Attention disabled |
| `fa=1` | Flash Attention enabled |
### Key Flags
| Flag | Purpose |
|---|---|
| `-ngl 99` | Offload all layers to the GPU |
| `-fa 0,1` | Test both with and without Flash Attention |
| `-sm none -mg N` | Use only GPU N (for single-GPU tests) |
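Combined, these flags produce the canonical scoreboard run. A minimal sketch (the binary and model paths are assumptions; adjust to your build):

```bash
# Canonical 7B run on the default GPU, testing both Flash Attention states
./build/bin/llama-bench -m models/llama-7b-q4_0.gguf -ngl 99 -fa 0,1

# The same test pinned to a single GPU (here GPU 1)
./build/bin/llama-bench -m models/llama-7b-q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 1
```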
## Usage
### Basic (7B Canonical Test)
```bash
./benchmark-rocm.sh
```
### With 70B Model
```bash
MODEL_70B=/path/to/your/70B-model.gguf ./benchmark-rocm.sh
```
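For reference, a sketch of how the script might branch on `MODEL_70B` (the variable name comes from the usage above; the internals shown here are an assumption, not the script's exact code):

```bash
# Assumption: run the 70B benchmark only if a model path was supplied and exists
if [ -n "${MODEL_70B:-}" ] && [ -f "$MODEL_70B" ]; then
    "$LLAMA_BENCH" -m "$MODEL_70B" -ngl 99 -fa 0,1
else
    echo "MODEL_70B not set or not found; skipping 70B benchmark" >&2
fi
```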
## Customization
Edit `benchmark-rocm.sh` to modify:

- `LLAMA_CPP_DIR` - Path to your llama.cpp installation
- `MODEL_DIR` - Where to store/find models
- `MODEL_7B` - Path to the 7B Q4_0 model (downloads automatically if missing)
- Additional test configurations in the benchmark sections
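A sketch of what the configuration block at the top of the script could look like (paths are illustrative assumptions, not the script's actual defaults):

```bash
# Illustrative defaults; the actual values in the script may differ
LLAMA_CPP_DIR="${LLAMA_CPP_DIR:-$HOME/llama.cpp}"
MODEL_DIR="${MODEL_DIR:-$HOME/models}"
MODEL_7B="${MODEL_7B:-$MODEL_DIR/llama-7b-q4_0.gguf}"
LLAMA_BENCH="$LLAMA_CPP_DIR/build/bin/llama-bench"
```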
To test different prompt/generation lengths, add flags like:
```bash
$LLAMA_BENCH -m "$MODEL_7B" -ngl 99 -fa 0,1 -p 512,1024,2048,4096 -n 128,256,512
```
## System Configuration
| Component | Details |
|---|---|
| OS | Arch Linux 6.12.61-1-lts |
| ROCm | 7.1.1 |
| GPUs | 2x AMD Radeon RX 7900 XTX (24GB each) |
| Architecture | gfx1100, Wave Size: 32 |
| llama.cpp build | 34ce48d97 (7356) |
Device Info:

```
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
Device 1: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
```
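To confirm what the ROCm runtime sees on your own machine before benchmarking, something like this works (assuming the standard ROCm tools are installed):

```bash
# List GPU marketing names and gfx targets known to the ROCm runtime
rocminfo | grep -E 'Marketing Name|gfx'
```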
## Benchmark Results
### Canonical 7B Q4_0 Results (for Scoreboard)
**GPU 0:**
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp512 | 3487.15 ± 39.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | tg128 | 118.82 ± 0.05 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 | 3785.07 ± 32.27 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 | 126.09 ± 0.28 |
**GPU 1:**
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp512 | 3550.31 ± 52.29 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | tg128 | 130.67 ± 0.24 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 | 3800.57 ± 50.29 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 | 138.57 ± 0.17 |
### Extra: Dual-GPU Results (7B - not for main scoreboard)
When splitting across both GPUs (the model is too small to benefit):
| fa | test | t/s |
|---|---|---|
| 0 | pp512 | 3040.87 ± 20.61 |
| 0 | tg128 | 87.75 ± 0.06 |
| 1 | pp512 | 3291.03 ± 8.93 |
| 1 | tg128 | 94.13 ± 0.21 |
As expected, the 7B model is too small for multi-GPU to help — single GPU is faster.
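The dual-GPU numbers above come from splitting the model's layers across both cards; a sketch of the invocation (`-sm layer` is llama-bench's default split mode, written out explicitly here for clarity):

```bash
# Layer split across both GPUs (-sm layer is also llama-bench's default)
$LLAMA_BENCH -m "$MODEL_7B" -ngl 99 -fa 0,1 -sm layer
```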
### Extra: 70B Q4_K_M Results (Dual-GPU)
Llama 3.1 70B Q4_K_M (~40 GB) exceeds a single card's 24 GB and requires both GPUs. This is where dual-GPU shines:
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | ROCm | 99 | 0 | pp512 | 330.06 ± 0.38 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | ROCm | 99 | 0 | tg128 | 13.06 ± 0.01 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | ROCm | 99 | 1 | pp512 | 341.41 ± 0.23 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | ROCm | 99 | 1 | tg128 | 13.37 ± 0.01 |
Summary:
- `pp512` (FA=1): 341.41 t/s, solid prompt processing for a 70B model
- `tg128` (FA=1): 13.37 t/s, fast enough for interactive use (~800 tokens/min)
- Flash Attention gives a ~3% improvement on this model
## License
MIT License