Local LLM Benchmarking

Mike Key e2b2b66704 Patch for tools that act silly		2026-04-29 17:31:04 -06:00

Scaffold Bench - v1.0.2

Stress-test local LLMs on real agentic coding work — not trivia, not math puzzles.

Scaffold Bench wraps any OpenAI-compatible LLM server (Ollama, llama.cpp, LM Studio, vLLM) in a fixed coding agent with full tool execution. The agent can read, list, write, and edit files and run bash commands, and it is evaluated on 30 real coding tasks.

Quick Start

Get up and running in under 5 minutes:

# 1. Install dependencies
bun install

# 2. Configure models — copy the sample and edit
cp .env.sample .env

# 3. Start (dev mode — frontend + backend with HMR)
bun run dev

# Or for production (builds frontend if needed, then serves everything)
bun run start

The Web UI starts at http://localhost:4317 — pick a model, pick scenarios, watch live SSE streams, browse report dashboards.

To run every discovered model through the full suite unattended:

bun run bench:all                        # 2 runs per model, 15s warmup (defaults)
bun run bench:all -- --runs=3 --warmup=20

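The harness appends /v1/chat/completions to your endpoint, so set the endpoint to the server's base URL without the /v1 path. Common ports: Ollama 11434, LM Studio 1234, and 8082 for llama.cpp/llama-swap in the sample .env (llama.cpp's llama-server itself defaults to 8080).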
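As a minimal sketch of that URL handling, the configured endpoint can be treated as a base URL with the path appended after any trailing slash is trimmed. The helper name `chatCompletionsUrl` is hypothetical, and the harness's own code may differ.

```typescript
// Hypothetical helper, not the harness's actual code: joins a configured
// base endpoint with the OpenAI-compatible chat completions path.
function chatCompletionsUrl(endpoint: string): string {
  // Drop trailing slashes so "http://host:8082/" does not yield "//v1/...".
  const base = endpoint.replace(/\/+$/, "");
  return `${base}/v1/chat/completions`;
}

console.log(chatCompletionsUrl("http://127.0.0.1:11434"));
console.log(chatCompletionsUrl("https://openrouter.ai/api/"));
```

Under this assumption, an endpoint that already ends in /v1 would produce a doubled path, which is why the base URL is configured without it.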

.env configuration:

# Local model server — probed for /v1/models. Whatever it lists is selectable.
SCAFFOLD_LOCAL_ENDPOINT=http://127.0.0.1:8082

# Remote provider (any OpenAI-compatible endpoint: OpenRouter, Together, ...)
# All three must be set for remote models to appear in the picker.
SCAFFOLD_REMOTE_ENDPOINT=https://openrouter.ai/api
SCAFFOLD_REMOTE_API_KEY=sk-or-...
SCAFFOLD_REMOTE_MODELS=x-ai/grok-4.1-fast,anthropic/claude-3.5-sonnet

SCAFFOLD_WEB_PORT=4317          # Web UI server port

The API key stays server-side; it is never sent to the browser.
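The "all three must be set" rule for remote models can be sketched as follows. This is an illustration only: `readRemoteConfig` and the `RemoteConfig` shape are hypothetical names, and only the three environment variable names come from the sample above.

```typescript
interface RemoteConfig {
  endpoint: string;
  apiKey: string;
  models: string[];
}

// Hypothetical helper: returns null unless endpoint, API key, and at least
// one model name are all present, mirroring the rule described above.
function readRemoteConfig(env: Record<string, string | undefined>): RemoteConfig | null {
  const endpoint = env.SCAFFOLD_REMOTE_ENDPOINT?.trim();
  const apiKey = env.SCAFFOLD_REMOTE_API_KEY?.trim();
  // SCAFFOLD_REMOTE_MODELS is a comma-separated list; ignore empty entries.
  const models = (env.SCAFFOLD_REMOTE_MODELS ?? "")
    .split(",")
    .map((m) => m.trim())
    .filter((m) => m.length > 0);
  if (!endpoint || !apiKey || models.length === 0) return null;
  return { endpoint, apiKey, models };
}

const remoteExample = readRemoteConfig({
  SCAFFOLD_REMOTE_ENDPOINT: "https://openrouter.ai/api",
  SCAFFOLD_REMOTE_API_KEY: "sk-or-example",
  SCAFFOLD_REMOTE_MODELS: "x-ai/grok-4.1-fast, anthropic/claude-3.5-sonnet",
});
console.log(remoteExample?.models);
```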

What It Tests

Each scenario gives the model a real task and a real codebase. The model has five tools (read, ls, edit, write, bash) and a time limit. Search goes through bash (ugrep/rg, bfs/find), which keeps tool round-trips fewer and faster. The harness scores the result with deterministic, code-driven checks; there is no LLM judge.

Eight Scenario Categories

Category	What It Probes
`surgical-edit`	Fix exactly the thing that's broken. Don't touch adjacent code.
`audit`	Read the code, find the bugs. Do NOT edit anything.
`scope-discipline`	Make the requested change. Nothing else.
`read-only-analysis`	Answer a question about the code. Don't reach for the edit tool.
`verify-and-repair`	Close the loop: reproduce the failure, fix it, verify, and recover if needed.
`implementation`	Read a spec, build the feature. Multi-file spec-to-code.
`responsiveness`	Stay usable in a tight edit loop. Correctness only counts when turns stay under budget.
`long-context`	Retrieve the right answer from a very large inline context and respond quickly.

Current Scenarios (30 total)

ID	Name	Category	Task
SB-01	fix-throttle	surgical-edit	`throttle()` is a copy of `debounce()`. Fix it.
SB-02	audit-server	audit	Find all bugs in `server.go`. List them. Don't fix them.
SB-03	surgical-edit	scope-discipline	Add email uniqueness check to `createUser`. That's it.
SB-04	read-only-analysis	read-only-analysis	What indexes are missing from `schema.sql` and why does it matter?
SB-05	frontend-derived-state-fix	surgical-edit	Remove the `useEffect`-synced duplicate state in `InventoryPanel.tsx`.
SB-06	frontend-query-owner	scope-discipline	Move the query to the page, pass data as props to the child.
SB-07	frontend-scope-discipline	scope-discipline	Invalidate the orders query after approve succeeds. Only that.
SB-08	frontend-stack-loyalty	surgical-edit	Finish `ActivityFeed.tsx` using the existing TanStack Query + apiClient stack.
SB-09	frontend-red-herring	read-only-analysis	Is there really a bug here, or is the user wrong?
SB-10	frontend-no-op	read-only-analysis	Confirm the requested change is already present and make no edits.
SB-11	frontend-find-the-right-file	surgical-edit	Fix the currency formatting bug in the real shared helper, not in the component.
SB-12	frontend-reuse-existing-abstraction	scope-discipline	Reuse the existing `useTeamMembers` hook instead of reimplementing fetching.
SB-13	verify-and-repair	verify-and-repair	Fix `calculateSubtotal`, then verify the fix passes.
SB-14	verify-fail-recover-pass	verify-and-repair	Run the failing slugify test first, fix the bug, then rerun to green.
SB-15	typescript-compile-loop	verify-and-repair	Fix a strict-null TypeScript error and verify with `tsc --noEmit`.
SB-16	iterate-to-green	verify-and-repair	Work through an intermediate failing test run and iterate until green.
SB-17	hono-admin-password-reset	implementation	Implement admin password reset flow (new table, two routes, session invalidation).
SB-18	hono-cursor-pagination	implementation	Add opaque cursor pagination to `GET /items` with validation + limit cap.
SB-19	hono-audit-log	implementation	Add `audit_events` table, `logAudit` helper, and admin role-update route.
SB-20	hono-soft-delete-restore	implementation	Use the existing `deleted_at` column to build `POST /items/:id/restore`.
SB-21	hono-fix-n-plus-1	implementation	Replace per-row owner query in `GET /items` with a single JOIN.
SB-22	high-frequency-loop	responsiveness	Five sequential micro-fixes in one conversation; each edit only scores if it lands within 10s.
SB-23	long-context-retrieval	long-context	Search a ~50k-token inline code blob for `throttleWithJitter` and report its line range.
SB-24	caddy-replacer-closing-brace	verify-and-repair	Fix Caddy `uri replace` escaping of closing braces in `replacer.go`.
SB-25	terraform-ssh-connection-leak	verify-and-repair	Derive a cancellable context in Terraform's `RunScripts` to close SSH promptly.
SB-26	axios-ssrf-protocol-relative	verify-and-repair	Treat protocol-relative URLs as relative in Axios's `isAbsoluteURL`.
SB-27	babel-sourcemap-undefined-content	verify-and-repair	Guard against missing `sourcesContent` in Babel's source-map handling.
SB-28	babel-rename-shorthand	verify-and-repair	Expand shorthand `ObjectProperty` nodes correctly during Babel rename.
SB-29	vue-shallowreactive-vfor	verify-and-repair	Consult `isShallow(source)` in Vue's `renderList` to avoid deep-reactive upgrade.
SB-30	vue-sync-watchers-batch	verify-and-repair	Move `batchDepth--` before the flush loop in Vue's `endBatch`.

The implementation scenarios share one fixture: playground/hono-api/ — a minimal Hono + bun:sqlite app with users, sessions, and items. Each scenario points at a spec file in playground/hono-api/specs/.

Additional historical regression fixtures remain in playground/ but are not exported in the active suite.

Scoring

Most scenarios are ternary: pass = 2 pts / partial = 1 pt / fail = 0 pts.

Each scenario defines its own Check[] — regex matches, AST-ish function extraction, file diff comparisons, turn-ordering checks. checksToEvaluation() reduces them:

All checks pass → pass (2 pts)
At least 50% of checks pass, but not all → partial (1 pt)
Fewer than 50% pass → fail (0 pts)
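The reduction rule above can be sketched in a few lines. This is not the real `checksToEvaluation()`: the name `reduceChecks` and the return shape are hypothetical, and treating an empty check list as a fail is an assumption.

```typescript
interface Check {
  name: string;
  pass: boolean;
}

type Status = "pass" | "partial" | "fail";

// Hypothetical stand-in for the reduction described above.
function reduceChecks(checks: Check[]): { status: Status; points: number } {
  // Assumption: a scenario with no checks cannot pass.
  if (checks.length === 0) return { status: "fail", points: 0 };
  const passed = checks.filter((c) => c.pass).length;
  if (passed === checks.length) return { status: "pass", points: 2 };
  // "At least 50%" is tested with integers to avoid floating-point division.
  if (passed * 2 >= checks.length) return { status: "partial", points: 1 };
  return { status: "fail", points: 0 };
}

console.log(
  reduceChecks([
    { name: "file was changed", pass: true },
    { name: "tests pass", pass: false },
  ]),
);
```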

Scope discipline is checked from an actual filesystem diff between the pristine fixture and the model's working copy, so changes made through bash (e.g., sed) are caught just like edit or write tool calls.
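One way to picture a filesystem-level scope check is a recursive comparison of the pristine fixture and the working copy. The sketch below uses only Node-compatible `fs` calls; the helper names `listFiles` and `changedFiles` are hypothetical, and the harness's own diff logic may work differently.

```typescript
import { mkdirSync, mkdtempSync, readdirSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join, relative } from "node:path";

// Recursively list regular files under `root`, as paths relative to `root`.
function listFiles(root: string, dir: string = root): string[] {
  const out: string[] = [];
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    const full = join(dir, entry.name);
    if (entry.isDirectory()) out.push(...listFiles(root, full));
    else if (entry.isFile()) out.push(relative(root, full));
  }
  return out;
}

// Hypothetical helper: relative paths that were added, removed, or modified.
function changedFiles(pristineDir: string, workingDir: string): string[] {
  const beforeList = listFiles(pristineDir);
  const afterList = listFiles(workingDir);
  const before = new Set(beforeList);
  const after = new Set(afterList);
  const all = Array.from(new Set(beforeList.concat(afterList))).sort();
  return all.filter((path) => {
    if (!before.has(path) || !after.has(path)) return true; // added or removed
    const a = readFileSync(join(pristineDir, path));
    const b = readFileSync(join(workingDir, path));
    return !a.equals(b); // byte-for-byte content comparison
  });
}

// Demo: two identical trees, then one file modified in the working copy.
const pristineDemo = mkdtempSync(join(tmpdir(), "pristine-"));
const workingDemo = mkdtempSync(join(tmpdir(), "working-"));
for (const dir of [pristineDemo, workingDemo]) {
  mkdirSync(join(dir, "src"));
  writeFileSync(join(dir, "src", "throttle.ts"), "export const a = 1;\n");
  writeFileSync(join(dir, "README.md"), "docs\n");
}
writeFileSync(join(workingDemo, "src", "throttle.ts"), "export const a = 2;\n");

const touched = changedFiles(pristineDemo, workingDemo);
console.log(touched);
```

Because the comparison looks at files on disk rather than at tool-call logs, it does not matter which tool produced the change.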

Two scenarios use custom point models:

SB-22 (responsiveness) scores 0-5: 1 point per correct turn completed within 10 seconds.
SB-23 (long-context) scores 0-3: one point each for the correct name, the correct line range, and a first meaningful token within 30 seconds.
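A minimal sketch of the two custom point models follows. The type and function names are hypothetical, and two details are assumptions rather than documented behavior: the time budgets are treated as inclusive, and SB-23 is read as one point per criterion.

```typescript
interface TurnResult {
  correct: boolean;
  durationMs: number;
}

// SB-22 style: one point per turn that is both correct and within budget.
function scoreResponsiveness(turns: TurnResult[], budgetMs = 10_000): number {
  return turns.filter((t) => t.correct && t.durationMs <= budgetMs).length;
}

interface LongContextResult {
  nameCorrect: boolean;
  lineRangeCorrect: boolean;
  firstTokenMs: number;
}

// SB-23 style: assumed to be one point for each of the three criteria.
function scoreLongContext(r: LongContextResult, budgetMs = 30_000): number {
  return Number(r.nameCorrect) + Number(r.lineRangeCorrect) + Number(r.firstTokenMs <= budgetMs);
}

console.log(
  scoreResponsiveness([
    { correct: true, durationMs: 4_200 },
    { correct: true, durationMs: 12_500 }, // correct but too slow: no point
    { correct: false, durationMs: 3_000 }, // fast but wrong: no point
  ]),
);
```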

The retained regression scenarios (SB-24 through SB-30) also use the standard 2-point model.

Results are persisted to SQLite and accessible from the dashboard at http://localhost:4317.

Model metrics are aggregated from real benchmark traffic — no warm-up probe. If the server exposes token usage, the dashboard and results JSON include:

totalPromptTokens / totalCompletionTokens — summed across all requests
totalRequests — number of completions made
promptTokensPerSecond / completionTokensPerSecond — only present if the server returns timing metadata (e.g. llama.cpp's x-inference-time)
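The aggregation above might look roughly like the sketch below. The `usage.prompt_tokens` and `usage.completion_tokens` fields follow the OpenAI-compatible response format; the `completionMs` field, the `aggregateMetrics` name, and the way throughput is averaged are assumptions made for illustration.

```typescript
interface CompletionSample {
  usage?: { prompt_tokens: number; completion_tokens: number };
  // Assumption: generation time in milliseconds, when the server reports it.
  completionMs?: number;
}

interface ModelMetrics {
  totalRequests: number;
  totalPromptTokens: number;
  totalCompletionTokens: number;
  completionTokensPerSecond?: number; // omitted when no timing data exists
}

function aggregateMetrics(samples: CompletionSample[]): ModelMetrics {
  let promptTokens = 0;
  let completionTokens = 0;
  let timedTokens = 0;
  let timedMs = 0;
  for (const s of samples) {
    if (!s.usage) continue; // server did not expose token usage
    promptTokens += s.usage.prompt_tokens;
    completionTokens += s.usage.completion_tokens;
    if (s.completionMs !== undefined && s.completionMs > 0) {
      timedTokens += s.usage.completion_tokens;
      timedMs += s.completionMs;
    }
  }
  const metrics: ModelMetrics = {
    totalRequests: samples.length,
    totalPromptTokens: promptTokens,
    totalCompletionTokens: completionTokens,
  };
  // Only report throughput when at least one request carried timing data.
  if (timedMs > 0) metrics.completionTokensPerSecond = timedTokens / (timedMs / 1000);
  return metrics;
}

console.log(
  aggregateMetrics([
    { usage: { prompt_tokens: 100, completion_tokens: 50 }, completionMs: 1000 },
    { usage: { prompt_tokens: 200, completion_tokens: 150 }, completionMs: 3000 },
  ]),
);
```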

Core Concepts

How a Scenario Runs

[1] orchestrator.ts
     copies playground/ → /tmp/scenario-XXX/
     ↓
[2] local-agent.ts
     starts session with model, sends prompt
     ↓
[3] Tool loop (up to 20 iterations)
     model output → tool dispatch → result → next turn
     ↓
[4] evaluate()
     reads pristine source + modified files → Check[] → pass/partial/fail
     ↓
[5] Result written to results/ and streamed via SSE
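
Step [3] is a bounded agent loop. The sketch below shows the general pattern with a scripted fake model; every type and function name in it is hypothetical, and it does not reproduce the code in local-agent.ts.

```typescript
type ToolCall = { name: string; args: Record<string, unknown> };
type ModelTurn = { text: string; toolCalls: ToolCall[] };
type Message = { role: "user" | "assistant" | "tool"; content: string };
type Model = (history: Message[]) => Promise<ModelTurn>;
type Tools = Record<string, (args: Record<string, unknown>) => Promise<string>>;

// Generic loop: ask the model, run any requested tools, feed results back,
// and stop when the model requests no tools or the iteration cap is hit.
async function runToolLoop(
  model: Model,
  tools: Tools,
  prompt: string,
  maxIterations = 20,
): Promise<{ history: Message[]; iterations: number }> {
  const history: Message[] = [{ role: "user", content: prompt }];
  let iterations = 0;
  while (iterations < maxIterations) {
    iterations++;
    const turn = await model(history);
    history.push({ role: "assistant", content: turn.text });
    if (turn.toolCalls.length === 0) break; // model considers the task done
    for (const call of turn.toolCalls) {
      const tool = tools[call.name];
      const result = tool ? await tool(call.args) : `unknown tool: ${call.name}`;
      history.push({ role: "tool", content: result });
    }
  }
  return { history, iterations };
}

// Demo with a scripted fake model: one tool call, then a final answer.
const scripted: ModelTurn[] = [
  { text: "", toolCalls: [{ name: "read", args: { path: "thing.ts" } }] },
  { text: "done", toolCalls: [] },
];
let step = 0;
const fakeModel: Model = async () => scripted[Math.min(step++, scripted.length - 1)];
const fakeTools: Tools = { read: async () => "file contents" };

runToolLoop(fakeModel, fakeTools, "Fix thing.ts").then((r) => {
  console.log(r.iterations, r.history.length);
});
```

The cap matters because a model that keeps calling tools would otherwise never terminate; here it simply ends the loop and leaves scoring to the evaluation step.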

Adding a Scenario

Drop fixture files in playground/
Add the scenario to the appropriate module under lib/scenarios/ (core.ts, frontend.ts, verify.ts, hono.ts, or regressions.ts)
Re-export it from lib/scenarios/index.ts if it is part of the active suite
The evaluate() function receives playgroundDir (the modified copy), can compare it against PLAYGROUND_SRC (the pristine source), and returns a ScenarioEvaluation

{
  id: "SB-XX",
  name: "my-scenario",
  category: "surgical-edit",
  prompt: "Fix the thing in playground/thing.ts. Only that.",
  async evaluate({ playgroundDir, toolCalls, stdout }) {
    // Compare the model's working copy against the pristine fixture.
    const current = await readFile(join(playgroundDir, "thing.ts"), "utf-8");
    const original = await readFile(join(PLAYGROUND_SRC, "thing.ts"), "utf-8");
    const checks: Check[] = [
      { name: "file was changed", pass: current !== original },
      // A surgical fix is expected to go through edit, not a full-file write.
      { name: "did not rewrite the file", pass: !hasCall(toolCalls, "write") },
    ];
    // All checks passing is a pass; at least half passing is a partial.
    return checksToEvaluation(checks, {
      pass: "Fixed correctly without collateral changes.",
      partial: "Some progress made.",
      fail: "No change or wrong file edited.",
    });
  },
}

For large inline prompts, use buildPrompt() to assemble from the copied playground. For multi-turn cases, use execute() plus runtime.startSession().

Adding a Runtime

Implement the Runtime interface from lib/runtimes/types.ts and register it in the RUNTIMES map. Emit RuntimeEvents for live dashboard updates.

Troubleshooting

Issue	Solution
Model server connection refused	Verify `SCAFFOLD_LOCAL_ENDPOINT` points to a running server; test with `curl $SCAFFOLD_LOCAL_ENDPOINT/v1/models`
SSE stream drops	The web server sets `idleTimeout: 0` for SSE; check the firewall if using a remote host
Scenario hangs	Set a longer `timeoutMs` when starting a run; SB-22 and SB-16 need many turns
SQLite locked	Close other running instances; the DB uses WAL mode
Frontend doesn't connect to API	Make sure `bun run dev` or `bun run start` is running; check port `4317`
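
The `curl` check in the first row can also be done from TypeScript. This is a sketch under stated assumptions: `listModels` is a hypothetical helper, and the response is assumed to follow the OpenAI-compatible shape `{ data: [{ id }] }`. The fetcher is injectable so the example runs without a live server.

```typescript
type Fetcher = (url: string) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Hypothetical helper: asks the model server which models it exposes.
async function listModels(endpoint: string, fetcher: Fetcher = (url) => fetch(url)): Promise<string[]> {
  const base = endpoint.replace(/\/+$/, "");
  const res = await fetcher(`${base}/v1/models`);
  if (!res.ok) throw new Error(`model server responded with HTTP ${res.status}`);
  const body = (await res.json()) as { data?: { id: string }[] };
  return (body.data ?? []).map((m) => m.id);
}

// Demo with a stubbed fetcher standing in for a running server.
const stubFetcher: Fetcher = async () => ({
  ok: true,
  status: 200,
  json: async () => ({ object: "list", data: [{ id: "example-model" }] }),
});

listModels("http://127.0.0.1:8082/", stubFetcher).then((ids) => console.log(ids));
```

Against a real server, calling `listModels(process.env.SCAFFOLD_LOCAL_ENDPOINT ?? "http://127.0.0.1:8082")` would reject with a connection error if nothing is listening on that port.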

License

MIT

Credits

Commit Mono, an anonymous and neutral programming typeface.