Benchmark script for llama.cpp & results for AMD RX 7900 XTX

AMD RX 7900 XTX ROCm Benchmarks for llama.cpp

Benchmark results for dual AMD Radeon RX 7900 XTX GPUs using the ROCm backend with llama.cpp.

Results are formatted for the llama.cpp ROCm Scoreboard discussion (#15021).

About This Benchmark

The benchmark-rocm.sh script runs the canonical llama-bench tests used by the llama.cpp community to compare GPU performance across different hardware.

What It Tests

| Test | Description |
| --- | --- |
| pp512 | Prompt processing speed (512 tokens); measures how fast the model processes input |
| tg128 | Token generation speed (128 tokens); measures inference/output speed |
| fa=0 | Flash Attention disabled |
| fa=1 | Flash Attention enabled |

Key Flags

| Flag | Purpose |
| --- | --- |
| -ngl 99 | Offload all layers to GPU |
| -fa 0,1 | Test both with and without Flash Attention |
| -sm none -mg N | Use only GPU N (for single-GPU tests) |
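Putting these flags together, a canonical single-GPU run looks roughly like the following sketch; both paths are placeholders, not the script's actual values:

```shell
# Placeholder paths; adjust to your own setup.
LLAMA_BENCH="$HOME/llama.cpp/build/bin/llama-bench"
MODEL_7B="$HOME/models/llama-7b.Q4_0.gguf"

# GPU 0 only, all layers offloaded, Flash Attention off and on
"$LLAMA_BENCH" -m "$MODEL_7B" -ngl 99 -fa 0,1 -sm none -mg 0
```

Repeating the run with `-mg 1` produces the GPU 1 numbers in the results below.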

Usage

Basic (7B Canonical Test)

./benchmark-rocm.sh

With 70B Model

MODEL_70B=/path/to/your/70B-model.gguf ./benchmark-rocm.sh

Customization

Edit benchmark-rocm.sh to modify:

  • LLAMA_CPP_DIR - Path to your llama.cpp installation
  • MODEL_DIR - Where to store/find models
  • MODEL_7B - Path to the 7B Q4_0 model (downloads automatically if missing)
  • Add additional test configurations in the benchmark sections
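One common way to make those variables overridable from the environment is the shell default-value pattern; the paths below are hypothetical defaults, not the script's real ones:

```shell
# Hypothetical defaults; the real values live in benchmark-rocm.sh.
LLAMA_CPP_DIR="${LLAMA_CPP_DIR:-$HOME/llama.cpp}"
MODEL_DIR="${MODEL_DIR:-$HOME/models}"
MODEL_7B="${MODEL_7B:-$MODEL_DIR/llama-7b.Q4_0.gguf}"
LLAMA_BENCH="$LLAMA_CPP_DIR/build/bin/llama-bench"
```

This is the same `VAR=value ./benchmark-rocm.sh` override pattern shown above for `MODEL_70B`.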

To test different prompt/generation lengths, add flags like:

$LLAMA_BENCH -m "$MODEL_7B" -ngl 99 -fa 0,1 -p 512,1024,2048,4096 -n 128,256,512

System Configuration

| Component | Details |
| --- | --- |
| OS | Arch Linux 6.12.61-1-lts |
| ROCm | 7.1.1 |
| GPUs | 2x AMD Radeon RX 7900 XTX (24 GB each) |
| Architecture | gfx1100, Wave Size: 32 |
| llama.cpp build | 34ce48d97 (7356) |

Device Info:

Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
Device 1: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
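To check the equivalent device information on your own machine, the stock ROCm tools can be queried directly (no llama.cpp build required):

```shell
# List GPU product names and the gfx target each device reports
rocm-smi --showproductname
rocminfo | grep -E 'gfx[0-9]+'
```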

Benchmark Results

Canonical 7B Q4_0 Results (for Scoreboard)

GPU 0:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp512 | 3487.15 ± 39.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | tg128 | 118.82 ± 0.05 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 | 3785.07 ± 32.27 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 | 126.09 ± 0.28 |

GPU 1:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp512 | 3550.31 ± 52.29 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | tg128 | 130.67 ± 0.24 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 | 3800.57 ± 50.29 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 | 138.57 ± 0.17 |

Extra: Dual-GPU Results (7B - not for main scoreboard)

When splitting across both GPUs (model too small to benefit):

| fa | test | t/s |
| --- | --- | --- |
| 0 | pp512 | 3040.87 ± 20.61 |
| 0 | tg128 | 87.75 ± 0.06 |
| 1 | pp512 | 3291.03 ± 8.93 |
| 1 | tg128 | 94.13 ± 0.21 |

As expected, the 7B model is too small for multi-GPU to help — single GPU is faster.
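For reference, a layer-split dual-GPU run like the one above can be invoked roughly as follows; the paths are placeholders for your own setup:

```shell
# Placeholder paths; adjust to your own setup.
LLAMA_BENCH="$HOME/llama.cpp/build/bin/llama-bench"
MODEL_7B="$HOME/models/llama-7b.Q4_0.gguf"

# Split layers across all visible GPUs (llama.cpp's default split mode)
"$LLAMA_BENCH" -m "$MODEL_7B" -ngl 99 -fa 0,1 -sm layer
```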


Extra: 70B Q4_K_M Results (Dual-GPU)

Llama 3.1 70B Q4_K_M (~40GB) requires both GPUs. This is where dual-GPU shines:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | ROCm | 99 | 0 | pp512 | 330.06 ± 0.38 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | ROCm | 99 | 0 | tg128 | 13.06 ± 0.01 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | ROCm | 99 | 1 | pp512 | 341.41 ± 0.23 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | ROCm | 99 | 1 | tg128 | 13.37 ± 0.01 |

Summary:

  • pp512 (FA=1): 341.41 t/s, solid prompt processing for a 70B model
  • tg128 (FA=1): 13.37 t/s, fast enough for interactive use (~800 tokens/min)
  • Flash Attention yields a ~3% improvement on this model
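The tokens-per-minute figure follows directly from the tg128 rate:

```shell
# 13.37 tokens/s sustained over one minute
awk 'BEGIN { printf "%.0f tokens/min\n", 13.37 * 60 }'
# prints: 802 tokens/min
```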

License

MIT License