Fix warm-started sandbox pods killed immediately by cleanup loop by udithishanka · Pull Request #5392 · jaseci-labs/jaseci

udithishanka · 2026-03-30T18:20:13Z

Summary

Warm pool pods are pre-created and sit idle until claimed for a sandbox. cleanup_expired(), which runs every 60s, was using the pod's K8s creation_timestamp to calculate age, but that timestamp records when the warm pod was originally created (often 30+ minutes earlier), not when it was claimed. As a result, a freshly claimed sandbox appeared expired immediately and was killed within a minute of starting to serve.

Changes

TTL calculation uses claim time for warm-started pods: cleanup_expired() now checks the registry's created_at (set at claim time) for pods with jac-sandbox-pool=active, so TTL counts from when the sandbox actually started serving.
Restart policy changed to Always: Cold-start and warm-pool pods now use restart_policy="Always" instead of "Never", so if jac start --dev exits unexpectedly, K8s restarts the container instead of permanently killing the pod.
Release notes and tests added: 3 new tests covering restart policy and TTL claim-time logic.

Test plan

test_cold_start_pod_spec_uses_restart_policy_always — passes
test_warm_pod_spec_uses_restart_policy_always — passes
test_cleanup_expired_uses_registry_created_at_for_claimed_warm_pods — passes
Full test suite (37 tests) — passes
Verified on jac-builder-dev: sandbox survives multiple AI chat turns without being killed

jac start --dev exits the process during HMR recompilation. With restart_policy=Never, the pod dies permanently and the sandbox is lost. With Always, K8s restarts the container and the sandbox survives file changes from AI agents. Applies to both cold-start and warm-pool pod specs.
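As a rough illustration of the restart policy change, the sketch below builds a minimal Pod manifest as a plain Python dict. The helper name, the image name, the container name, and the "warm" label value are assumptions made for this example; only the jac-sandbox-pool label key, the "active" value, the jac start --dev command, and the Always policy come from this PR. In a raw manifest the field is spelled restartPolicy.

```python
def build_sandbox_pod_manifest(name: str, image: str, pool_state: str) -> dict:
    """Return a minimal Pod manifest as a dict (hypothetical helper, not the PR's code)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # "active" appears in this PR; "warm" is an assumed value for idle pods.
            "labels": {"jac-sandbox-pool": pool_state},
        },
        "spec": {
            # With "Never", a container exit leaves the pod permanently dead.
            # With "Always", the kubelet restarts the container in the same pod.
            "restartPolicy": "Always",
            "containers": [
                {
                    "name": "sandbox",  # assumed container name
                    "image": image,
                    "command": ["jac", "start", "--dev"],
                }
            ],
        },
    }


cold = build_sandbox_pod_manifest("sbx-cold", "example/sandbox:dev", "active")
warm = build_sandbox_pod_manifest("sbx-warm", "example/sandbox:dev", "warm")
```

Both the cold-start and the warm-pool manifest go through the same helper here, which mirrors the statement that the change applies to both pod specs.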

cleanup_expired() was using the pod's K8s creation_timestamp to determine if a sandbox had expired. For warm-started sandboxes, the pod was pre-created in the warm pool (30+ minutes ago), so it appeared expired immediately after being claimed for active use. Now uses the registry's created_at (set at claim time) for pods with jac-sandbox-pool=active, so TTL counts from when the sandbox actually started serving, not when the warm pod was pre-created.
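The TTL change can be sketched as follows. This is not the code from the PR: the function names, the 30-minute TTL, and the example timestamps are assumptions chosen to make the arithmetic visible. Only the jac-sandbox-pool=active label and the idea of using the registry's created_at come from the description above.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def ttl_start(
    labels: Dict[str, str],
    pod_created_at: datetime,
    registry_created_at: Optional[datetime],
) -> datetime:
    """Pick the timestamp the TTL should count from (hypothetical helper)."""
    # Claimed pods count from claim time, as recorded in the registry.
    if labels.get("jac-sandbox-pool") == "active" and registry_created_at is not None:
        return registry_created_at
    # Otherwise fall back to the pod's own creation time. A later commit
    # in this PR changes how a missing registry entry is handled.
    return pod_created_at


def is_expired(now: datetime, start: datetime, ttl: timedelta) -> bool:
    """True once more than `ttl` has elapsed since `start`."""
    return now - start > ttl


# Assumed numbers: warm pod created 45 min ago, claimed 1 min ago, 30 min TTL.
now = datetime(2026, 3, 30, 18, 0, tzinfo=timezone.utc)
pod_created = now - timedelta(minutes=45)
claimed = now - timedelta(minutes=1)
ttl = timedelta(minutes=30)
active = {"jac-sandbox-pool": "active"}

old_behaviour = is_expired(now, pod_created, ttl)
new_behaviour = is_expired(now, ttl_start(active, pod_created, claimed), ttl)
```

With these assumed numbers the old calculation sees a 45-minute-old pod and treats it as expired, while the new one sees a sandbox that has been serving for one minute.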

- Release notes for warm-start TTL fix and restart_policy change
- Tests: cold/warm pod restart_policy=Always, registry created_at used for claimed warm pod TTL calculation

In multi-pod deployments, cleanup_expired() runs on each pod but the sandbox registry is in-memory. When an active sandbox was created on Pod A, Pod B's cleanup loop has no registry entry for it. Previously Pod B fell back to the pod's K8s creation_timestamp (from the warm pool, 30+ min old), so the sandbox appeared expired and was deleted. The effect was that everyone's sandboxes were deleted simultaneously every 60 seconds. Now if an active (claimed) pod has no local registry entry, it is skipped entirely: only the pod that created the sandbox can decide when to delete it.
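A minimal sketch of the skip rule, assuming the registry is a dict from sandbox id to claim time in epoch seconds. The function name, the registry shape, and the 1800-second TTL are illustrative assumptions, not the PR's actual data structures.

```python
from typing import Dict


def should_delete_active_pod(
    sandbox_id: str,
    local_registry: Dict[str, float],
    now: float,
    ttl_seconds: float,
) -> bool:
    """Decide whether this instance may delete an active (claimed) pod."""
    created_at = local_registry.get(sandbox_id)
    if created_at is None:
        # No local entry: another instance claimed this sandbox, so this
        # instance has no reliable claim time and must leave the pod alone.
        return False
    return (now - created_at) > ttl_seconds


# Pod A claimed "sbx-1" at t=1000; Pod B has an empty in-memory registry.
registry_a = {"sbx-1": 1000.0}
registry_b: Dict[str, float] = {}

a_early = should_delete_active_pod("sbx-1", registry_a, now=1060.0, ttl_seconds=1800.0)
b_early = should_delete_active_pod("sbx-1", registry_b, now=1060.0, ttl_seconds=1800.0)
a_late = should_delete_active_pod("sbx-1", registry_a, now=2801.0, ttl_seconds=1800.0)
b_late = should_delete_active_pod("sbx-1", registry_b, now=2801.0, ttl_seconds=1800.0)
```

In this sketch only Pod A ever returns True for sbx-1, and only after the TTL has elapsed since the claim.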

When a warm pod is returned to the pool and claimed by another user, the old user's files persisted in /app since tar extraction overlays without clearing first. Add rm -rf /app/* before extraction to ensure a clean slate.
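The overlay behaviour is easy to reproduce outside the sandbox. The sketch below uses Python's tarfile module on a temporary directory standing in for /app; the file names and helper functions are made up for the demonstration, and the real fix runs rm -rf /app/* inside the pod rather than Python code.

```python
import io
import shutil
import tarfile
import tempfile
from pathlib import Path
from typing import Dict, List, Set


def make_tar(files: Dict[str, bytes]) -> bytes:
    """Build an in-memory tar archive from a name -> content mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def extract(archive: bytes, dest: Path, clear_first: bool) -> Set[str]:
    """Extract into dest, optionally wiping it first; return the resulting names."""
    if clear_first:
        # Note: unlike the shell glob /app/*, iterdir() also yields dotfiles.
        for entry in dest.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        tar.extractall(dest)
    return {p.name for p in dest.iterdir()}


def demo() -> List[Set[str]]:
    """Run the extraction without and then with the pre-clear step."""
    archive = make_tar({"main.jac": b"new user's code"})
    results = []
    for clear_first in (False, True):
        with tempfile.TemporaryDirectory() as tmp:
            app = Path(tmp)
            (app / "old_user_notes.txt").write_bytes(b"previous user's file")
            results.append(extract(archive, app, clear_first))
    return results


without_clear, with_clear = demo()
```

Without the clear step the previous user's file is still present next to main.jac; with it, only the newly extracted file remains.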

Add stateless methods alongside existing v1 API for use by jac-builder's graph-based sandbox reconciler. V1 methods untouched for backward compat.

New methods on KubernetesSandbox:
- create_with_rollback(): K8s resource creation with automatic cleanup on failure
- cleanup_resources(): Stateless cleanup without touching internal registry
- query_pod_status(): Direct K8s API pod status query
- list_all_sandbox_pods(): List all pods for drift detection
- count_warm_pods(): K8s label-based warm pool count
- claim_warm_pod_v2(): Atomic label patch for warm pod claims

Updated SandboxEnvironment interface with v2 method signatures.
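The "creation with automatic cleanup on failure" idea behind create_with_rollback() can be sketched generically. The signature below is not the one on KubernetesSandbox: the function name, the Step tuple, and the pod/service/ingress resource names are assumptions used to show the pattern, with plain callables standing in for K8s API calls.

```python
from typing import Callable, List, Tuple

# (resource name, create callable, delete callable); a hypothetical shape.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def create_all_or_rollback(steps: List[Step]) -> List[str]:
    """Run each create in order; on failure, delete what was created, newest first."""
    created: List[Tuple[str, Callable[[], None]]] = []
    try:
        for name, create, delete in steps:
            create()
            created.append((name, delete))
    except Exception:
        for _name, delete in reversed(created):
            try:
                delete()
            except Exception:
                pass  # best-effort cleanup; keep rolling back the rest
        raise
    return [name for name, _ in created]


log: List[str] = []


def recorded(name: str) -> Step:
    """Build a step that only records what it would do."""
    return (name, lambda: log.append("create " + name), lambda: log.append("delete " + name))


def failing_create() -> None:
    raise RuntimeError("ingress creation failed")


steps = [recorded("pod"), recorded("service"), ("ingress", failing_create, lambda: None)]
try:
    create_all_or_rollback(steps)
    rolled_back = False
except RuntimeError:
    rolled_back = True
```

Because the function keeps no state between calls, the caller (here, a reconciler) stays the only owner of bookkeeping, which fits the stateless intent described above.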

In horizontally-scaled jac-builder deployments, two pods could independently claim the same warm pod because _ensure_warm_pool() re-added claimed pods from stale K8s list results and _claim_warm_pod() had no label verification.

Changes:
- Proxy: skip warm pods without jac-sandbox-id label (no phantom routes)
- _ensure_warm_pool: fresh individual K8s read per pod + rebuild local pool to remove pods claimed by other instances
- _claim_warm_pod: pre-patch label check rejects already-claimed pods, post-patch verify detects concurrent claim overwrites, return pod to pool on failure instead of leaking
- _inject_code: kill stale Vite/jac processes and wipe caches before injecting new user's code into a recycled warm pod

…ailure
- _inject_code: use SIGKILL (-9) instead of SIGTERM for immediate process death, then wait for port 8000/8001 to be free before injecting code. Wipe all of /tmp/ (not just /tmp/code.* and /tmp/.vite*).
- create() warm path: on injection failure, delete the claimed pod directly by name instead of via _cleanup_k8s_resources, which can't find it (the registry entry doesn't exist yet at that point).

Replace per-pod _warm_pool list with direct K8s queries. Each claim now lists warm pods from K8s, iterates candidates with fresh reads, patches labels, and verifies; no shared mutable state between pods.
- _claim_warm_pod: query K8s directly, iterate candidates on failure
- _ensure_warm_pool: simplified to count + create deficit (no pool tracking)
- cleanup_expired: removed _warm_pool references
- Removed _warm_pool field from class declaration

Net result: 59 additions, 132 deletions. Simpler, no cross-pod races.
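The claim flow described above (list, fresh read, patch, verify, move on to the next candidate) can be sketched against an in-memory stand-in for the K8s API. FakeCluster and its methods are hypothetical, the "warm" label value is an assumption, and a single-threaded fake cannot reproduce a concurrent overwrite, so the post-patch check is shown only for its shape.

```python
from typing import Dict, List, Optional


class FakeCluster:
    """In-memory stand-in for the K8s API (hypothetical, illustration only)."""

    def __init__(self, pods: Dict[str, Dict[str, str]]) -> None:
        self.pods = pods

    def list_pods(self, label: str, value: str) -> List[str]:
        return sorted(n for n, labels in self.pods.items() if labels.get(label) == value)

    def read_labels(self, name: str) -> Dict[str, str]:
        return dict(self.pods[name])

    def patch_labels(self, name: str, labels: Dict[str, str]) -> None:
        self.pods[name].update(labels)


def claim_warm_pod(cluster: FakeCluster, sandbox_id: str) -> Optional[str]:
    """Try each warm candidate in turn; return the claimed pod name or None."""
    for name in cluster.list_pods("jac-sandbox-pool", "warm"):
        labels = cluster.read_labels(name)  # fresh read; the list may be stale
        if labels.get("jac-sandbox-pool") != "warm" or "jac-sandbox-id" in labels:
            continue  # pre-patch check: already claimed by someone else
        cluster.patch_labels(
            name, {"jac-sandbox-pool": "active", "jac-sandbox-id": sandbox_id}
        )
        # Post-patch verify: confirm our id is the one that stuck.
        if cluster.read_labels(name).get("jac-sandbox-id") == sandbox_id:
            return name
    return None


def warm_pool_deficit(cluster: FakeCluster, target: int) -> int:
    """How many warm pods to create: target minus the live count, never negative."""
    return max(0, target - len(cluster.list_pods("jac-sandbox-pool", "warm")))


cluster = FakeCluster(
    {"warm-a": {"jac-sandbox-pool": "warm"}, "warm-b": {"jac-sandbox-pool": "warm"}}
)
first = claim_warm_pod(cluster, "sbx-1")
second = claim_warm_pod(cluster, "sbx-2")
third = claim_warm_pod(cluster, "sbx-3")
```

Since every call derives its candidates from the cluster itself, no instance holds a private pool list that could drift out of date.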

…_dist

jac-scale 0.2.12 ships pre-built admin assets in admin/_dist/ (PR jaseci-labs#5433). build_admin_client() copies them instantly on first /admin/ access; no runtime build needed, works as non-root.

Removed:
- Dockerfile admin pre-build RUN block (no longer needed)
- find -not -path /app/.jac/admin exclusion in cleanup (no longer needed)

Reverted to simple rm -rf /app/* since admin regenerates from _dist

udithishanka added 3 commits March 30, 2026 23:17

add release notes and tests for sandbox restart policy + TTL fix

16a4d18

udithishanka changed the title ~~Fix/sandbox restart policy~~ Fix warm-started sandbox pods killed immediately by cleanup loop Mar 30, 2026

udithishanka and others added 16 commits March 31, 2026 13:11

Fix cross-user file leak in warm pod recycling

fc82d58

fix: move docstrings before method definitions (Jac syntax)

dd5f645

Merge pull request #3 from udithishanka/fix/stateless-warm-pool

8bdf4b1

Stateless warm pool — remove in-memory _warm_pool list

fix: remove docstrings from function bodies (Jac syntax)

2433783

fix: update admin client build process to use writable directory for …

2e7b02b

…artifacts

fix: remove dead port-free wait loop (ss not installed in sandbox image)

445be71

fix: pre-build admin dashboard in Dockerfile and preserve during warm…

343d33d

… pod cleanup
- Pre-build admin UI at image build time so non-root sandboxes can serve /admin
- Preserve /app/.jac/admin during warm pod reclaim cleanup

Revert "fix: update admin client build process to use writable direct…

6637f5b

…ory for artifacts" This reverts commit 2e7b02b.

fix: preserve admin dashboard in code injection cleanup path

47f0c3c

Second rm -rf /app/* in tar extraction was also wiping pre-built admin files.

Revert "fix: preserve admin dashboard in code injection cleanup path"

740a4f7

This reverts commit 47f0c3c.

udithishanka force-pushed the fix/sandbox-restart-policy branch from 4a1cbfe to 7122bd8 Compare April 3, 2026 18:01

Source .env in sandbox pod startup for user env vars

7399eca

udithishanka wants to merge 20 commits into jaseci-labs:main from udithishanka:fix/sandbox-restart-policy
