# Improvement Queue Dedup

The improvement queue accumulates candidate gate adjustments for operator
review. Without semantic dedup the same suggestion appears multiple times
whenever its wording shifts. `scripts/queue_dedup.py` collapses
near-duplicates.

## Backends

- `trigram` (default, stdlib-only): character-trigram Sørensen-Dice over the
  candidate's text. Threshold defaults to 0.80. The function was
  historically named `jaccard`, but the formula it computes is Dice,
  `2·|A∩B| / (|A|+|B|)`, which separates near-paraphrases more cleanly than
  true Jaccard at the same threshold. Dice is never lower than Jaccard for
  the same pair, so thresholds are not interchangeable between the two.
- `embed`: `sentence-transformers` embeddings (`BAAI/bge-small-en-v1.5`)
  compared by cosine similarity. Falls back to `trigram` with a stderr
  notice if the dependency is missing. The `--threshold` flag has a single
  default (0.80) shared by both backends. Cosine similarity at 0.80 is
  roughly comparable to trigram-Dice at 0.80 on short paraphrases; if you
  switch backends and want a more conservative cut, pass `--threshold 0.85`.
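
As a rough illustration of the `trigram` formula above, the following sketch computes Dice and true Jaccard over character-trigram sets. The function names and the normalisation step (lowercasing, whitespace collapsing) are assumptions made for this example; `queue_dedup.py` may tokenise differently.

```python
def trigrams(text: str) -> set[str]:
    # Lowercasing and whitespace collapsing are assumptions for this sketch,
    # not confirmed behaviour of queue_dedup.py.
    s = " ".join(text.lower().split())
    return {s[i:i + 3] for i in range(len(s) - 2)}


def dice(a: str, b: str) -> float:
    """Sørensen-Dice over trigram sets: 2*|A∩B| / (|A| + |B|)."""
    ta, tb = trigrams(a), trigrams(b)
    denom = len(ta) + len(tb)
    # Strings shorter than three characters yield no trigrams; score them 0.0.
    return 2 * len(ta & tb) / denom if denom else 0.0


def jaccard(a: str, b: str) -> float:
    """True Jaccard, |A∩B| / |A∪B|, shown for comparison only."""
    ta, tb = trigrams(a), trigrams(b)
    union = len(ta | tb)
    return len(ta & tb) / union if union else 0.0


if __name__ == "__main__":
    x = "Raise the coverage gate to 85%"
    y = "Raise coverage gate to 85 percent"
    d, j = dice(x, y), jaccard(x, y)
    print(f"dice={d:.3f} jaccard={j:.3f}")
    # For sets, Dice = 2J / (1 + J), so Dice >= Jaccard for every pair.
    assert abs(d - 2 * j / (1 + j)) < 1e-9
```

Because Dice equals `2J / (1 + J)` for sets, a pair that clears 0.80 under Dice may sit well below 0.80 under Jaccard, which is why the historical name is misleading when choosing a threshold.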

## Invocation

```bash
python3 scripts/queue_dedup.py \
  --queue "<state>/repos/<repo-id>/improvement-queue.jsonl" \
  --backend trigram \
  --threshold 0.80 \
  --keep oldest
```

`--keep oldest` (the default) preserves the row with the earliest `ts` in
each cluster; `--keep newest` preserves the latest instead. `--dry-run`
reports `would_remove=N kept=N backend=...` and writes nothing.
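
The interaction between `--threshold` and `--keep` can be sketched with a greedy pass: sort rows by `ts`, then keep a row only if it is not similar to any row already kept. This is a minimal sketch under stated assumptions, not the script's confirmed algorithm. The `text` field name, the `dedup` function, and the greedy clustering strategy are all hypothetical; only `ts` and the two keep policies come from the description above.

```python
def _dice(a: str, b: str) -> float:
    """Trigram Dice, used here as a stand-in similarity function."""
    ta = {a[i:i + 3] for i in range(len(a) - 2)}
    tb = {b[i:i + 3] for i in range(len(b) - 2)}
    denom = len(ta) + len(tb)
    return 2 * len(ta & tb) / denom if denom else 0.0


def dedup(rows, threshold=0.80, keep="oldest", sim=_dice):
    """Return (kept_rows, removed_count).

    Assumes each row is a dict with a sortable `ts` and a hypothetical
    `text` field holding the candidate's wording.
    """
    # Visiting rows in preference order means the first member of each
    # cluster seen is the one the keep policy wants to survive.
    ordered = sorted(rows, key=lambda r: r["ts"], reverse=(keep == "newest"))
    kept, removed = [], 0
    for row in ordered:
        if any(sim(row["text"], k["text"]) >= threshold for k in kept):
            removed += 1  # near-duplicate of a row we already kept
        else:
            kept.append(row)
    return kept, removed


if __name__ == "__main__":
    rows = [
        {"ts": "2024-01-01T00:00:00Z", "text": "raise the coverage gate to 85 percent"},
        {"ts": "2024-01-02T00:00:00Z", "text": "raise the coverage gate to 85 percent."},
        {"ts": "2024-01-03T00:00:00Z", "text": "disable the lint gate on generated files"},
    ]
    kept, removed = dedup(rows, keep="oldest")
    print(f"would_remove={removed} kept={len(kept)}")
```

A greedy pass is order-dependent: a row is compared only against rows already kept, so chains of gradually drifting wording may or may not collapse into one cluster depending on the threshold.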

`refresh_learning_state.py` invokes dedup after appending new candidates, so
operators rarely need to run `queue_dedup.py` directly.

## Safety

- Refuses to operate on a `--queue` that is a symlink or non-regular file
  (exit 2), reusing the `assert_regular_file_destination` helper from
  `bin/collect_hook_event`.
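- Acquires an exclusive `fcntl` lock (`fcntl.LOCK_EX`) before reading or
  writing.
- Rewrites the queue in place while holding the lock. The rewrite is not
  crash-atomic (an interruption mid-write can leave a partial file), but it
  is serialized against other processes that take the same lock.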
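
The three safeguards above can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: `assert_regular_file` is a hypothetical stand-in for `assert_regular_file_destination`, `rewrite_locked` is an invented name, and `fcntl.flock` is assumed as the locking call (the script may use `fcntl.lockf` instead).

```python
import fcntl
import os
import stat
import sys


def assert_regular_file(path: str) -> None:
    """Hypothetical stand-in for assert_regular_file_destination."""
    st = os.lstat(path)  # lstat inspects the link itself, not its target
    if not stat.S_ISREG(st.st_mode):
        print(f"refusing non-regular queue: {path}", file=sys.stderr)
        raise SystemExit(2)


def rewrite_locked(path: str, transform) -> int:
    """Apply `transform` to the queue's lines in place; return rows removed."""
    assert_regular_file(path)
    with open(path, "r+", encoding="utf-8") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until the exclusive lock is held
        try:
            lines = fh.read().splitlines()
            kept = transform(lines)
            # In-place rewrite: a crash between seek and truncate can leave
            # a partially written file, hence "not crash-atomic".
            fh.seek(0)
            fh.write("".join(line + "\n" for line in kept))
            fh.truncate()
            fh.flush()
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
    return len(lines) - len(kept)
```

`fcntl` locks are advisory, so the serialization only holds against writers that also take the lock. A crash-atomic variant would write a temporary file and `os.replace` it over the queue, at the cost of changing the file's inode under any process that already holds it open.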

## Tuning

For trigram-Dice, a threshold of 0.70 is aggressive and 0.85 is
conservative; start at 0.80. With the `embed` backend (bge-small-en), 0.85
is a comparably conservative cut. Sample the diff before lowering
thresholds: overly aggressive dedup hides distinct gates.
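
One way to sample the effect of a lower threshold is to list the pairs whose score falls between the proposed and current cut, since those are the pairs that would newly merge. The sketch below assumes trigram-Dice over lowercased text; `newly_merged` is a hypothetical helper and is not part of `queue_dedup.py`.

```python
from itertools import combinations


def dice(a: str, b: str) -> float:
    """Trigram Dice over lowercased text (normalisation is an assumption)."""
    a, b = a.lower(), b.lower()
    ta = {a[i:i + 3] for i in range(len(a) - 2)}
    tb = {b[i:i + 3] for i in range(len(b) - 2)}
    denom = len(ta) + len(tb)
    return 2 * len(ta & tb) / denom if denom else 0.0


def newly_merged(texts, current=0.80, proposed=0.70):
    """Pairs scoring in [proposed, current): merged only after lowering."""
    hits = []
    for a, b in combinations(texts, 2):
        score = dice(a, b)
        if proposed <= score < current:
            hits.append((round(score, 3), a, b))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    candidates = [
        "Raise the coverage gate to 85%",
        "Raise coverage gate to 85 percent",
        "Disable the lint gate on generated files",
    ]
    for score, a, b in newly_merged(candidates, current=0.80, proposed=0.60):
        print(f"{score:.3f}  {a!r}  <->  {b!r}")
```

Reading the listed pairs by eye shows whether the lower threshold collapses true paraphrases or starts merging distinct gates. The pairwise scan is quadratic in queue length, which is acceptable for a spot check on a small queue.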
