pi-chrome browser-control benchmark
This benchmark measures how well Chrome-control tools let agents do real browser work: DOM discovery, trusted input, keyboard/focus, scroll, drag/drop, files, frames, clipboard, and observability.
Open a row, drive it with chrome_* tools, then return here. Verdicts are read from localStorage. Expected outcomes come from manifest.json.
| id | gate | category | verdict | expected | baseline | risk | goal / notes | open |
|---|---|---|---|---|---|---|---|---|
Long-horizon hermetic tasks
These WebArena/BrowserGym-inspired tasks use deterministic in-page graders and fresh $RUN_ID state. They test multi-step browser work beyond event fidelity.
| id | difficulty | category | intent | requires | open |
|---|---|---|---|---|---|
Agent run loop
python3 -m http.server 8765
# For each manifest entry:
# 1. chrome_navigate http://127.0.0.1:8765/<file>
# 2. chrome_snapshot before acting; prefer uid selectors.
# 3. Execute recipe with selected mode, adapting descriptive frame/shadow selectors to tool uids.
# 4. chrome_evaluate JSON.stringify({v:window.__verdict,r:window.__reason,e:window.__events?.slice(-20)})
# 5. Compare verdict to manifest.expected[mode]; CONDITIONAL means inspect prerequisites/notes.
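Step 5 of the run loop can be sketched as a small helper. This is a minimal illustration, not the benchmark's own harness: it assumes each manifest entry carries an `expected` mapping keyed by mode, as `manifest.expected[mode]` suggests. The function name `compare_verdict`, the status labels, the mode name `trusted`, and the verdict strings `PASS` and `FAIL` are hypothetical. Because `JSON.stringify` omits properties whose value is `undefined`, the helper treats missing `v`, `r`, and `e` keys as absent rather than as errors.

```python
import json


def compare_verdict(evaluate_output: str, manifest_entry: dict, mode: str) -> dict:
    """Classify one row from the step-4 JSON string and its manifest entry.

    Assumes manifest_entry["expected"] maps a mode name to an expected verdict.
    """
    observed = json.loads(evaluate_output)
    verdict = observed.get("v")  # absent if window.__verdict was undefined
    expected = manifest_entry.get("expected", {}).get(mode)

    if expected is None:
        status = "no-expectation"  # manifest has nothing for this mode
    elif expected == "CONDITIONAL":
        status = "inspect"  # check prerequisites/notes by hand
    elif verdict == expected:
        status = "match"
    else:
        status = "mismatch"

    return {
        "status": status,
        "verdict": verdict,
        "expected": expected,
        "reason": observed.get("r"),
        "events": observed.get("e") or [],  # last events, if any were recorded
    }


# Hypothetical manifest entry and chrome_evaluate output, for illustration only.
entry = {"id": "example-row", "expected": {"trusted": "PASS"}}
raw = '{"v": "FAIL", "r": "no click observed", "e": []}'
result = compare_verdict(raw, entry, "trusted")
print(result["status"], "-", result["reason"])
```

Keeping the reason and the recent events next to the status means a mismatch can be triaged without re-running the row.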