# AMD RX 7900 XTX ROCm Benchmarks for llama.cpp
Benchmark results for dual AMD Radeon RX 7900 XTX GPUs using the ROCm backend with llama.cpp.
Results formatted for the ROCm Scoreboard discussion #15021.
## About This Benchmark
The `benchmark-rocm.sh` script runs the canonical `llama-bench` tests used by the llama.cpp community to compare GPU performance across different hardware.
### What It Tests
| Test | Description |
|---|---|
| `pp512` | Prompt processing speed (512 tokens) - measures how fast the model processes input |
| `tg128` | Token generation speed (128 tokens) - measures inference/output speed |
| `fa=0` | Flash Attention disabled |
| `fa=1` | Flash Attention enabled |
### Key Flags
| Flag | Purpose |
|---|---|
| `-ngl 99` | Offload all layers to the GPU |
| `-fa 0,1` | Test both with and without Flash Attention |
| `-sm none -mg N` | Use only GPU N (for single-GPU tests) |
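Combined, these flags produce the canonical scoreboard run. A minimal sketch (the binary and model paths are assumptions; adjust to your build):

```bash
# Canonical 7B run on the default GPU, testing both Flash Attention states
./build/bin/llama-bench -m models/llama-7b-q4_0.gguf -ngl 99 -fa 0,1

# The same test pinned to a single GPU (here GPU 1)
./build/bin/llama-bench -m models/llama-7b-q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 1
```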
## Usage
### Basic (7B Canonical Test)
```bash
./benchmark-rocm.sh
```
### With 70B Model
```bash
MODEL_70B=/path/to/your/70B-model.gguf ./benchmark-rocm.sh
```
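For reference, a sketch of how the script might branch on `MODEL_70B` (the variable name comes from the usage above; the internals shown here are an assumption, not the script's exact code):

```bash
# Assumption: run the 70B benchmark only if a model path was supplied and exists
if [ -n "${MODEL_70B:-}" ] && [ -f "$MODEL_70B" ]; then
    "$LLAMA_BENCH" -m "$MODEL_70B" -ngl 99 -fa 0,1
else
    echo "MODEL_70B not set or not found; skipping 70B benchmark" >&2
fi
```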
## Customization
Edit `benchmark-rocm.sh` to modify:

- `LLAMA_CPP_DIR` - Path to your llama.cpp installation
- `MODEL_DIR` - Where to store/find models
- `MODEL_7B` - Path to the 7B Q4_0 model (downloads automatically if missing)
- Additional test configurations in the benchmark sections
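A sketch of what the configuration block at the top of the script could look like (paths are illustrative assumptions, not the script's actual defaults):

```bash
# Illustrative defaults; the actual values in the script may differ
LLAMA_CPP_DIR="${LLAMA_CPP_DIR:-$HOME/llama.cpp}"
MODEL_DIR="${MODEL_DIR:-$HOME/models}"
MODEL_7B="${MODEL_7B:-$MODEL_DIR/llama-7b-q4_0.gguf}"
LLAMA_BENCH="$LLAMA_CPP_DIR/build/bin/llama-bench"
```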
To test different prompt/generation lengths, add flags like:
```bash
$LLAMA_BENCH -m "$MODEL_7B" -ngl 99 -fa 0,1 -p 512,1024,2048,4096 -n 128,256,512
```
## System Configuration
| Component | Details |
|---|---|
| OS | Arch Linux 6.12.61-1-lts |
| ROCm | 7.1.1 |
| GPUs | 2x AMD Radeon RX 7900 XTX (24GB each) |
| Architecture | gfx1100, Wave Size: 32 |
| llama.cpp build | 34ce48d97 (7356) |
Device Info:

```
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
Device 1: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
```
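To confirm what the ROCm runtime sees on your own machine before benchmarking, something like this works (assuming the standard ROCm tools are installed):

```bash
# List GPU marketing names and gfx targets known to the ROCm runtime
rocminfo | grep -E 'Marketing Name|gfx'
```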
## Benchmark Results
### Canonical 7B Q4_0 Results (for Scoreboard)
**GPU 0:**
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp512 | 3487.15 ± 39.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | tg128 | 118.82 ± 0.05 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 | 3785.07 ± 32.27 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 | 126.09 ± 0.28 |
**GPU 1:**
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp512 | 3550.31 ± 52.29 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | tg128 | 130.67 ± 0.24 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 | 3800.57 ± 50.29 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 | 138.57 ± 0.17 |
### Extra: Dual-GPU Results (7B - not for main scoreboard)
When splitting across both GPUs (the model is too small to benefit):
| fa | test | t/s |
|---|---|---|
| 0 | pp512 | 3040.87 ± 20.61 |
| 0 | tg128 | 87.75 ± 0.06 |
| 1 | pp512 | 3291.03 ± 8.93 |
| 1 | tg128 | 94.13 ± 0.21 |
As expected, the 7B model is too small for multi-GPU to help — single GPU is faster.
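The dual-GPU numbers above come from splitting the model's layers across both cards; a sketch of the invocation (`-sm layer` is llama-bench's default split mode, written out explicitly here for clarity):

```bash
# Layer split across both GPUs (-sm layer is also llama-bench's default)
$LLAMA_BENCH -m "$MODEL_7B" -ngl 99 -fa 0,1 -sm layer
```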
### Extra: 70B Q4_K_M Results (Dual-GPU)
Llama 3.1 70B Q4_K_M (~40 GB) exceeds a single card's 24 GB and requires both GPUs. This is where dual-GPU shines:
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | ROCm | 99 | 0 | pp512 | 330.06 ± 0.38 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | ROCm | 99 | 0 | tg128 | 13.06 ± 0.01 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | ROCm | 99 | 1 | pp512 | 341.41 ± 0.23 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | ROCm | 99 | 1 | tg128 | 13.37 ± 0.01 |
Summary:
- `pp512` (FA=1): 341.41 t/s, solid prompt processing for a 70B model
- `tg128` (FA=1): 13.37 t/s, fast enough for interactive use (~800 tokens/min)
- Flash Attention gives a ~3% improvement on this model
## License
MIT License