# Closure-cycle reclamation

Status: **implemented.** The env↔fn cycle is reclaimed by a cycle
collector built on honest `Env` refcounts. This note records the
original investigation (kept below — it explains *why* the design looks
the way it does), the as-built design, and the invariants maintainers
must preserve.

## The leak (historical)

A closure captures its entire defining `Env`; that env, in turn, holds a
reference to the closure (the `define`/lambda binding lives *in* the env
the closure captures). The two kept each other's refcount above zero, so
neither was ever reclaimed:

```
define outer as:
    x is 5
    define inner as:   # inner captures outer's env E (uses x)
        return x
    return inner       # E binds `inner`; `inner`.closure == E  → cycle
f is outer of null
```

It **accumulated** — measured before the collector under
`ASAN_OPTIONS=detect_leaks=1`:

| program | leaked allocations (before) | now |
|---|---|---|
| 1 escaping closure | ~12 | 0 |
| 500 closures kept in a list | 5,553 | 0 |
| 500 closures created & discarded in a loop | 6,004 | 0 (collected mid-run) |

Beyond the leak itself, the unbounded growth cost time: a 100k-iteration
closure-churn loop ran ~40% faster after the collector (0.075s vs 0.12s)
with peak RSS dropping from ~124 MB to ~4 MB.

## Why the cheap fixes were wrong (still true)

The cycle is two owning edges: `E → fn` (the binding) and `fn → E` (the
closure's `env_refcount`). Breaking *either* edge unconditionally is
unsafe:

- **Weak self-binding** (make `E`'s binding to its own capturing fn
  non-owning): UAF in the **non-escaping** case. When the binding is the
  *only* reference to the fn (e.g. internal mutual recursion that never
  escapes), weakening it frees the fn while it is still bound and
  callable — the next recursive lookup dereferences freed memory.
- **Weaken the `fn → E` edge** (a self-bound fn doesn't hold
  `env_refcount`): UAF in the **escaping** case. A returned closure
  outlives its frame; with the edge weakened, `E` is freed while the
  live, returned fn still points into it.

Both directions are individually load-bearing, so the fix had to be a
real collector, not an edge-weakening patch.

## The as-built design

Two stages, exactly as the original investigation recommended.

### Stage 1 — honest `Env` refcounts

`env_refcount` is a true owner count. The owners:

1. **Creator** — the call frame (`owns_env`) or the C caller
   (`thread_entry`, `call_eigs_fn`, the `dispatch` builtin, ext_http
   handlers, `OP_IMPORT`'s module env). `env_new` returns with
   refcount 1; the creator's release is `env_decref`.
2. **Closures** — `make_fn` increfs; `free_value(VAL_FN)` decrefs.
3. **Child envs** — `Env::parent` is an owned edge: `env_new` increfs
   the parent, the destructor decrefs it. Loop envs
   (`OP_LOOP_ENV_FRESH`/`END`) move the frame's ref explicitly: FRESH
   transfers the frame's ref into the child's parent edge (and flips
   `owns_env` to 1 for borrowed base-frame envs); END re-takes a frame
   ref on the parent before releasing the child.
4. **Parked recycled call envs** — `chunk->env_cache` holds the single
   ref (transferred from the returning frame; `vm_park_call_env`
   requires `env_refcount == 1` exactly). The parked env keeps its owned
   parent ref, so the take-side `parent == closure` compare can never
   see a recycled pointer.

`env_free`'s conditional, captured-gated semantics are gone; every
release site is a plain `env_decref`. `env_destroy_final` remains only
as a force-destroy escape hatch (no current callers in main).

### Stage 2 — the collector (`gc_collect_cycles`, eigenscript.c)

- **Registry.** `OP_CLOSURE` calls `env_mark_captured`, which links the
  env into a per-thread intrusive list (`gc_next`/`gc_prev`/
  `in_gc_list` on `Env`). `g_global_env` is never registered. The env
  destructor and `env_destroy_final` unlink.
- **Trigger.** Registration is the only way the candidate universe
  grows, so the threshold check lives there — zero cost on the
  dispatch/call hot paths. Threshold: collect when the registry reaches
  `max(64, 2 × live-after-last-collect)`. One more collection runs at
  exit (below).
- **Universe.** Everything reachable from registered envs over **owned
  edges only**: env value slots, `env->parent`, `fn->closure`,
  `fn->chunk` (the OP_CLOSURE ref), `chunk->functions[]`,
  `chunk->env_cache`, list items, dict values. Three node kinds: values
  (LIST/DICT/FN — everything else is a leaf), envs, chunks. Chunks are
  on real cycles: `fn → chunk → env_cache → parent → E → fn` is exactly
  the shape a recycled call env creates. `g_global_env` is a stop node —
  it is permanently rooted by its creator ref, and traversing into it
  would drag the entire heap into every collection.
- **Roots without root enumeration.** For each node, count the
  references arriving from inside the universe. Any node whose refcount
  exceeds that count has an external holder — a VM stack slot, a frame's
  env ref, a C caller's ref, the trace tape, a route table — *every one
  of which is itself a counted ref*, so no VM-state walking is needed
  and a "missed root" is impossible by construction. If accounting ever
  finds more internal references than refcount (an uncounted edge got
  traversed — a bug), the whole collection aborts: the failure mode is a
  leak, never a free.
- **Sweep.** Mark from roots; unmarked nodes are cyclic garbage. Pin
  them (+1 each), clear their outgoing edges with ordinary decrefs, then
  unpin — each node frees through its normal destructor
  (`free_value`/`env_decref`/`chunk_decref`), so observer-alias
  clearing (`g_last_observer`) and freelist recycling behave as usual.
- **Exit.** `gc_collect_at_exit(global)` (main.c, both REPL and script
  paths): snapshot the global scope's container values with one pinning
  ref apiece, `env_clear(global)`, collect with the pins accounted
  (root iff `rc > internal + pinned`), release the pins, then
  `env_decref(global)` drops the creator ref. The snapshot is what
  reclaims pure value→value cycles bound at global scope (a list
  appended to itself) that are unreachable from the captured-env
  registry.
- **Threads.** `builtin_spawn` calls `gc_disable_for_threads()` before
  flipping `g_vm_multithreaded`: the registry is drained on the main
  thread and registration stops. Spawned programs keep the old leak
  behavior (their leaks stay *visible* to LSan — drained envs are
  unreachable at exit, nothing is masked). Cross-thread refcounts stay
  correct (atomic); only the collector is off.

### Maintainer invariants (violations are UAF or silent leaks)

- **Every owning edge into an `Env` must go through
  `env_incref`/`env_decref`.** A new uncounted holder (a global Env
  pointer that outlives its creator, a cache that stashes an env) breaks
  the root derivation in the *free* direction. If you must hold an env,
  take a ref.
- **`GC_FOR_EACH_CHILD` and `gc_clear_node` move in lockstep** with the
  runtime's ownership model. A new owning edge out of Value/Env/Chunk
  (a new container type, a new chunk field that holds values or envs)
  must be added to both — or deliberately left untraversed, which is
  safe (the target counts as externally rooted and leaks) but should be
  a recorded decision.
- **Only traverse counted edges.** Adding a borrowed/uncounted pointer
  to the walker makes `internal > rc` possible; the sanity abort
  catches it at runtime, but that means collection silently stops
  working.
- **The frame's env ref follows `frame->env`.** LOOP_ENV_FRESH/END are
  a ref *move*, not a leak-fix opportunity; returning mid-loop relies on
  the child's parent chain cascading the release up to `fn_env`.
- `tests/test_closure_cycles.eigs` is gated **strictly** by
  run_all_tests.sh section [87]: any LeakSanitizer exit there fails the
  suite (rc_ok's leak tolerance does not apply). Keep it that way.

## What still leaks (known, tolerated)

- **Spawned-thread programs** — the collector is disabled at first
  `spawn`. Cross-thread collection needs a stop-the-world handshake or
  per-thread heaps; out of scope.
- **Mid-run pure value cycles in function scopes** (a self-referential
  list built and discarded inside a function, never touching a captured
  env). These never enter the registry. They were leak-equivalent before
  the collector; the exit snapshot only covers global bindings.
- A **parked recycled call env pins its parent** (the callee's closure
  env) until the chunk dies or the cycle through the chunk is collected
  — at most one retained env per chunk, released at chunk death.

The suite's tolerated-leak tally (the `NOTE:` line in run_all_tests.sh)
dropped from 28 to 13 with the collector; the remainder are the classes
above plus a handful of pre-existing non-closure shapes, all
byte-identical to the pre-collector baseline.
