Fix warm-started sandbox pods killed immediately by cleanup loop#5392
Open
udithishanka wants to merge 20 commits intojaseci-labs:mainfrom
Open
Fix warm-started sandbox pods killed immediately by cleanup loop#5392udithishanka wants to merge 20 commits intojaseci-labs:mainfrom
udithishanka wants to merge 20 commits intojaseci-labs:mainfrom
Conversation
jac start --dev exits the process during HMR recompilation. With restart_policy=Never, the pod dies permanently and the sandbox is lost. With Always, K8s restarts the container and the sandbox survives file changes from AI agents. Applies to both cold-start and warm-pool pod specs.
cleanup_expired() was using the pod's K8s creation_timestamp to determine if a sandbox had expired. For warm-started sandboxes, the pod was pre-created in the warm pool (30+ minutes ago), so it appeared expired immediately after being claimed for active use. Now uses the registry's created_at (set at claim time) for pods with jac-sandbox-pool=active, so TTL counts from when the sandbox actually started serving, not when the warm pod was pre-created.
- Release notes for warm-start TTL fix and restart_policy change - Tests: cold/warm pod restart_policy=Always, registry created_at used for claimed warm pod TTL calculation
In multi-pod deployments, cleanup_expired() runs on each pod but the sandbox registry is in-memory. When an active sandbox was created on Pod A, Pod B's cleanup loop has no registry entry for it. Previously it fell back to the pod's K8s creation_timestamp (from warm pool, 30+ min old) → appeared expired → deleted everyone's sandboxes simultaneously every 60 seconds. Now if an active (claimed) pod has no local registry entry, it's skipped entirely — only the pod that created the sandbox can decide when to delete it.
When a warm pod is returned to the pool and claimed by another user, the old user's files persisted in /app since tar extraction overlays without clearing first. Add rm -rf /app/* before extraction to ensure a clean slate.
Add stateless methods alongside existing v1 API for use by jac-builder's graph-based sandbox reconciler. V1 methods untouched for backward compat. New methods on KubernetesSandbox: - create_with_rollback(): K8s resource creation with automatic cleanup on failure - cleanup_resources(): Stateless cleanup without touching internal registry - query_pod_status(): Direct K8s API pod status query - list_all_sandbox_pods(): List all pods for drift detection - count_warm_pods(): K8s label-based warm pool count - claim_warm_pod_v2(): Atomic label patch for warm pod claims Updated SandboxEnvironment interface with v2 method signatures.
In horizontally-scaled jac-builder deployments, two pods could independently claim the same warm pod because _ensure_warm_pool() re-added claimed pods from stale K8s list results and _claim_warm_pod() had no label verification. Changes: - Proxy: skip warm pods without jac-sandbox-id label (no phantom routes) - _ensure_warm_pool: fresh individual K8s read per pod + rebuild local pool to remove pods claimed by other instances - _claim_warm_pod: pre-patch label check rejects already-claimed pods, post-patch verify detects concurrent claim overwrites, return pod to pool on failure instead of leaking - _inject_code: kill stale Vite/jac processes and wipe caches before injecting new user's code into a recycled warm pod
…ailure - _inject_code: use SIGKILL (-9) instead of SIGTERM for immediate process death, then wait for port 8000/8001 to be free before injecting code. Wipe all of /tmp/ (not just /tmp/code.* and /tmp/.vite*). - create() warm path: on injection failure, delete the claimed pod directly by name instead of via _cleanup_k8s_resources which can't find it (registry entry doesn't exist yet at that point).
Replace per-pod _warm_pool list with direct K8s queries. Each claim now lists warm pods from K8s, iterates candidates with fresh reads, patches labels, and verifies — no shared mutable state between pods. - _claim_warm_pod: query K8s directly, iterate candidates on failure - _ensure_warm_pool: simplified to count + create deficit (no pool tracking) - cleanup_expired: removed _warm_pool references - Removed _warm_pool field from class declaration Net result: 59 additions, 132 deletions. Simpler, no cross-pod races.
Stateless warm pool — remove in-memory _warm_pool list
… pod cleanup - Pre-build admin UI at image build time so non-root sandboxes can serve /admin - Preserve /app/.jac/admin during warm pod reclaim cleanup
…ory for artifacts" This reverts commit 2e7b02b.
Second rm -rf /app/* in tar extraction was also wiping pre-built admin files.
This reverts commit 47f0c3c.
…_dist jac-scale 0.2.12 ships pre-built admin assets in admin/_dist/ (PR jaseci-labs#5433). build_admin_client() copies them instantly on first /admin/ access — no runtime build needed, works as non-root. Removed: - Dockerfile admin pre-build RUN block (no longer needed) - find -not -path /app/.jac/admin exclusion in cleanup (no longer needed) Reverted to simple rm -rf /app/* since admin regenerates from _dist
4a1cbfe to
7122bd8
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Warm pool pods are pre-created and sit idle until claimed for a sandbox. When claimed,
cleanup_expired()(runs every 60s) was using the pod's K8screation_timestampto calculate age — but this timestamp is from when the warm pod was originally created (often 30+ minutes ago), not when it was claimed. So a freshly claimed sandbox appeared expired immediately and was killed within a minute of starting to serve.Changes
TTL calculation uses claim time for warm-started pods:
cleanup_expired()now checks the registry'screated_at(set at claim time) for pods withjac-sandbox-pool=active, so TTL counts from when the sandbox actually started serving.Restart policy changed to Always: Cold-start and warm-pool pods now use
restart_policy="Always"instead of"Never", so ifjac start --devexits unexpectedly, K8s restarts the container instead of permanently killing the pod.Release notes and tests added: 3 new tests covering restart policy and TTL claim-time logic.
Test plan
test_cold_start_pod_spec_uses_restart_policy_always— passestest_warm_pod_spec_uses_restart_policy_always— passestest_cleanup_expired_uses_registry_created_at_for_claimed_warm_pods— passes