
Harden pass 1 runtime and lock Ubuntu 22.04 artifacts#5

Merged
dawsonblock merged 3 commits into main from codex/pass1-artifacts-and-hardening
Mar 28, 2026

Conversation

Owner

@dawsonblock dawsonblock commented Mar 28, 2026

This pull request introduces several improvements and clarifications to the XBOOT project, focusing on deployment safety, artifact pinning, environment configuration, and documentation. The most significant changes are the addition of explicit artifact compatibility and verification steps, enhanced environment variable handling for resource limits, and updated documentation reflecting these changes.

Deployment and Release Process Improvements:

  • The deployment script (deploy/deploy.sh) and systemd service (deploy/zeroboot.service) now run a staged startup verification (verify-startup) before switching releases, ensuring all templates and resources are valid before going live.
  • New environment variables (ZEROBOOT_MIN_FREE_BYTES, ZEROBOOT_MIN_FREE_INODES) are set in both the deployment and systemd service files to enforce minimum disk space and inode availability.
  • The deployment script now sets all relevant environment variables for the service, including the new resource-limit and artifact-validation variables, and rollback now includes a full health and execution check.
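A minimal sketch of the admission check these variables drive, assuming POSIX statvfs semantics; the function name and default values are illustrative, not the project's actual API:

```python
import os

# Illustrative disk/inode watermark check driven by
# ZEROBOOT_MIN_FREE_BYTES / ZEROBOOT_MIN_FREE_INODES.
def admission_ok(path: str, min_free_bytes: int, min_free_inodes: int) -> bool:
    st = os.statvfs(path)
    free_bytes = st.f_bavail * st.f_frsize   # bytes available to unprivileged users
    free_inodes = st.f_favail                # inodes available to unprivileged users
    return free_bytes >= min_free_bytes and free_inodes >= min_free_inodes

# Defaults here (1 GiB, 64k inodes) are examples only.
min_bytes = int(os.environ.get("ZEROBOOT_MIN_FREE_BYTES", "1073741824"))
min_inodes = int(os.environ.get("ZEROBOOT_MIN_FREE_INODES", "65536"))
print(admission_ok("/tmp", min_bytes, min_inodes))
```

Reading the thresholds from the environment keeps the same knobs usable from both the deploy script and the systemd unit.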

Artifact Pinning and Compatibility:

  • A new compatibility matrix in docs/COMPATIBILITY.md specifies exact versions and sources for Firecracker, the guest kernel, rootfs, Python, and Node.js runtimes.
  • README.md now includes explicit artifact pinning, instructions for fetching official artifacts, and updated build instructions for the Python and Node.js guest images.

Environment and Build System Enhancements:

  • The CI workflow (.github/workflows/ci.yml) now syntax-checks the new supervisor/child worker scripts, includes the new shell scripts in syntax checks, and adds a release-bundle step that excludes build artifacts via .gitattributes.
  • Rust dependencies for KVM and related libraries are now included only for Linux targets, improving cross-platform builds.

Documentation and Usability Updates:

  • README.md clarifies the intended use (controlled internal use, offline-only), the artifact matrix, and the deployment process, including the new environment variables and the startup verification step.
  • API documentation (docs/API.md) now describes the new hashed API key format, bearer authentication, rate limits, overload responses, and how to generate API key records.
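A rough sketch of what such a hashed API key record could look like, using the field names documented in docs/API.md (id, prefix, hash, created_at, disabled_at, label); the token format and helper names are assumptions:

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timezone

# Illustrative HMAC-SHA256 key record with a server-side pepper,
# loosely modeled on what the PR describes for scripts/make_api_keys.py.
def make_key_record(pepper: bytes, label: str) -> tuple[str, dict]:
    token = secrets.token_urlsafe(32)            # shown to the operator once
    digest = hmac.new(pepper, token.encode(), hashlib.sha256).hexdigest()
    record = {
        "id": secrets.token_hex(8),
        "prefix": token[:8],                     # stored for lookup/display only
        "hash": digest,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "disabled_at": None,
        "label": label,
    }
    return token, record

def verify_token(pepper: bytes, token: str, record: dict) -> bool:
    digest = hmac.new(pepper, token.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(digest, record["hash"])

pepper = b"example-pepper"
token, record = make_key_record(pepper, "ci-bot")
print(verify_token(pepper, token, record))   # → True
```

Because only the HMAC digest is stored, a leaked record file cannot be replayed as a bearer token without the pepper and the original token.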

Other Notable Changes:

  • .gitattributes now excludes build and temporary directories from release bundles.
  • Minor terminology and feature clarifications in documentation (e.g., "production-hardened" to "controlled-internal", "supervisor/child subprocess isolation").
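The bundle exclusion relies on git archive honoring export-ignore attributes, so tagged paths never enter the generated tarball. A minimal sketch with illustrative directory names, not the repository's actual entries:

```
# .gitattributes sketch: paths excluded from `git archive` output
# (directory names here are examples only)
build/   export-ignore
target/  export-ignore
.tmp/    export-ignore
```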

These changes collectively improve the safety, clarity, and reproducibility of XBOOT deployments and make the artifact requirements and system expectations explicit.

Summary by CodeRabbit

  • New Features

    • Added startup verification before service launch to ensure safe deployment
    • Introduced disk/inode watermark thresholds for resource admission control
    • Updated API authentication to use hashed API key records instead of plain bearer tokens
    • Added rate limiting and overload rejection handling with 429/503 status codes
  • Documentation

    • Updated README reflecting controlled internal use and offline-only scope (Ubuntu 22.04, Firecracker 1.12.0)
    • Added COMPATIBILITY.md documenting pinned artifact matrix and fail-closed startup conditions
    • Extended API documentation with authentication record format and overload behavior
  • Chores

    • Pinned runtime versions: Firecracker 1.12.0, Python 3.10.12, Node.js 20.20.2, kernel 5.10.223
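The bounded request-log queue with drop tracking mentioned above can be sketched as follows; the real implementation is a bounded channel in Rust, and the names here are illustrative:

```python
import queue

# Illustrative bounded log queue: under backpressure, entries are
# dropped and counted (surfaced as a metric) instead of growing
# the queue without limit or blocking the request path.
class BoundedLog:
    def __init__(self, capacity: int):
        self.q = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def push(self, entry) -> bool:
        try:
            self.q.put_nowait(entry)
            return True
        except queue.Full:
            self.dropped += 1
            return False

log = BoundedLog(capacity=2)
results = [log.push(i) for i in range(4)]
print(results, log.dropped)   # → [True, True, False, False] 2
```

Dropping with a counter trades log completeness for predictable memory use, which matches the fail-closed posture of the rest of the PR.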

Copilot AI review requested due to automatic review settings March 28, 2026 07:02

coderabbitai bot commented Mar 28, 2026

Warning

Rate limit exceeded

@dawsonblock has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 26 minutes and 18 seconds before requesting another review.


⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 69bc71db-db0f-4741-8c36-5e52f877fe43

📥 Commits

Reviewing files that changed from the base of the PR and between 87694e0 and e7ac163.

📒 Files selected for processing (2)
  • README.md
  • docs/DEPLOYMENT.md
📝 Walkthrough

This PR introduces a production hardening update for a Firecracker-based microVM system, transitioning from a development build to a "first hardened release." Key changes include refactoring guest workers into supervisor/child models, pinning all runtime artifacts (Firecracker 1.12.0, Python 3.10.12, Node.js 20.20.2), implementing startup verification and disk/inode watermark admission checks, updating API key handling to use hashed records, adopting symlink-based release management, and adding cross-platform stubs for non-Linux builds.

Changes

  • Guest Worker Architecture Refactoring (guest/worker_supervisor.py, guest/worker_supervisor.js, guest/worker_child.py, guest/worker_child.js, guest/init.c): Split single-entrypoint workers into a supervisor/child model. Supervisors now spawn child processes via spawnSync/subprocess with JSON payloads and structured response parsing; children receive execution params (code, stdin, timeout, limits) from stdin instead of environment variables, enforce per-request temp directories and resource quotas, and support configurable stdout/stderr/tmp truncation with markers.
  • Artifact & Manifest Pinning (manifests/firecracker.lock.json, manifests/kernel.manifest, manifests/python-build.lock.json, manifests/node-build.lock.json, manifests/python-guest.manifest, manifests/node-guest.manifest, manifests/runtime-artifacts.lock.json): Replace REQUIRED placeholders with concrete pinned versions: Firecracker 1.12.0, kernel 5.10.223, Ubuntu 22.04 rootfs, Python 3.10.12, Node.js 20.20.2. New runtime-artifacts.lock.json provides a comprehensive artifact matrix with download URLs and SHA-256 checksums. Update worker paths from worker.py/worker_node.js to worker_supervisor.py/worker_supervisor.js.
  • Artifact Fetching & Installation (scripts/fetch_official_artifacts.sh, scripts/install_node_runtime.sh, scripts/create_release_bundle.sh, scripts/make_api_keys.py, scripts/build_guest_rootfs.sh, scripts/preflight.sh): Add new scripts to fetch/verify pinned artifacts and install the Node runtime into rootfs templates. Update make_api_keys.py to generate hashed API key records instead of plain tokens (HMAC-SHA256 with pepper). Enhance preflight.sh with disk/inode watermark checks and Firecracker binary SHA-256 validation. Glob-pattern copy guest files in build_guest_rootfs.sh.
  • Deployment & Service Configuration (deploy/deploy.sh, deploy/zeroboot.service): Introduce a staged startup verification step before smoke tests. Add environment variables for resource thresholds (ZEROBOOT_MIN_FREE_BYTES, ZEROBOOT_MIN_FREE_INODES), API key/pepper files, and Firecracker version/binary allowlists. Switch to /var/lib/zeroboot/current/ symlink-based release paths. Extend rollback verification to include both /v1/ready and /v1/exec checks. Change the request-log queue from an unbounded to a bounded channel in the systemd service.
  • Rust Startup Verification & Configuration (src/startup.rs, src/config.rs, src/main.rs): Introduce a new startup module with template spec parsing, Firecracker binary verification, template artifact validation, and disk watermark enforcement. Add a StorageConfig struct for min-free-bytes/inodes and request-log queue capacity. Add a new verify-startup CLI command. Switch the request-log channel to bounded with drop/write-failure tracking.
  • Rust API & Admission Control (src/api/handlers.rs): Integrate runtime admission enforcement before auth/rate-limiting. Add check_runtime_admission(), which verifies disk/inode watermarks and returns 503 if violated. Add new metrics for disk pressure rejections and log write failures. Update health classification to use lowercased substring matching. Refactor metrics_handler with a helper function for Prometheus output.
  • Rust VMM Cross-Platform Support (src/vmm/mod.rs, src/vmm/kvm_stub.rs, src/vmm/vmstate_stub.rs, src/vmm/vmstate.rs): Conditionally expose KVM and vmstate modules on Linux only; add non-Linux stub implementations that return errors indicating the features are Linux-only. Minor cleanup of unused imports and trailing commas in error handling.
  • Testing & Documentation Validation (tests/test_docs_sanity.py, tests/test_guest_worker_subprocess.py, tests/test_repo_sanity.py, tests/test_worker_protocol.py): Add new test_docs_sanity.py to validate documentation consistency (current paths, hashed API records, offline-only scope). Update worker subprocess tests to use JSON stdin payloads and frame-based response parsing. Strengthen manifest validation to disallow any unresolved =REQUIRED placeholders. Update the worker supervisor test to verify the new supervisor/child model.
  • CI, Configuration & Documentation (.github/workflows/ci.yml, .gitignore, .gitattributes, README.md, docs/API.md, docs/COMPATIBILITY.md, docs/DEPLOYMENT.md, src/signing.rs, src/template_manifest.rs): Expand CI sanity checks to compile/syntax-check supervisor/child scripts; add a release-bundle verification step. Update .gitignore and add .gitattributes for artifact exclusion. Revise README from "production-ready" to "controlled internal use, first hardened release" with a pinned artifact matrix. Document the API key record schema (id, prefix, hash, created_at, disabled_at, label). Add COMPATIBILITY.md detailing the supported environment and fail-closed conditions. Update DEPLOYMENT.md with official artifact fetch, Node runtime install, and symlink-based release layout. Minor formatting/test refactors in signing and manifest validation modules.
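The supervisor/child protocol summarized in the cohorts above (a JSON payload on the child's stdin, a structured response on its stdout) can be sketched as follows. The real workers use framed binary responses and enforce resource limits; this sketch uses a plain JSON line for brevity, and the inline child source merely stands in for worker_child.py:

```python
import json
import subprocess
import sys

# Stand-in child: reads a JSON payload from stdin, "executes" the code
# (here: just uppercases it), and writes a JSON response to stdout.
CHILD_SRC = """
import json, sys
payload = json.load(sys.stdin)
out = {"rid": payload["rid"], "stdout": payload["code"].upper(), "exit_code": 0}
print(json.dumps(out))
"""

def run_request(payload: dict, timeout_s: float = 5.0) -> dict:
    # Supervisor side: fresh child per request, wall-clock timeout enforced here.
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SRC],
        input=json.dumps(payload).encode(),
        capture_output=True,
        timeout=timeout_s,
    )
    return json.loads(proc.stdout)

resp = run_request({"rid": "r1", "code": "hello", "timeout_ms": 1000})
print(resp)   # → {'rid': 'r1', 'stdout': 'HELLO', 'exit_code': 0}
```

Spawning a fresh child per request is what lets the supervisor guarantee clean temp directories and per-request quotas regardless of what the executed code did.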

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #1: Introduces the initial worker supervisor/child architecture and corresponding Rust core changes (startup verification, admission control, config/API handler updates) that this PR builds upon and refines with artifact pinning and deployment integration.
  • PR #2: Implements overlapping production hardening infrastructure (manifest/signing policy, startup verification, vmm stubs, API key record generation) that directly parallels and complements the same subsystems modified here.
  • PR #3: Covers the same foundational production-hardening changes (supervisor/child model, startup template verification, disk watermark checks, API key hashing, vmm cross-platform stubs) indicating highly related code-level architecture alignment.

Poem

🐰 A rabbit hops through artifacts pinned,
supervisor and child now spin the wind,
watermarks guard the disk so fair,
startup verifies with utmost care.
From REQUIRED to concrete, we've hardened the flight—
first release secured, burning bright!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 38.79%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check: skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: the title directly describes the main change, a hardening pass and artifact locking for Ubuntu 22.04.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the guest worker architecture into a supervisor/child model to improve isolation and introduces a comprehensive startup verification process, including disk and inode watermark checks. It also pins specific versions for Firecracker, the guest kernel, and the rootfs to ensure environment consistency. Review feedback identifies a regression in the rootfs build script due to renamed worker files and points out absolute local file paths used in documentation links that need to be converted to relative paths.

Comment thread scripts/build_guest_rootfs.sh
Comment thread README.md Outdated
Comment thread docs/DEPLOYMENT.md Outdated
Contributor

Copilot AI left a comment


Pull request overview

This PR hardens the XBOOT runtime and deployment flow by adding staged startup verification, enforcing disk/inode admission checks, pinning Ubuntu 22.04 artifact compatibility, and updating worker execution to a supervisor/child subprocess model.

Changes:

  • Add verify-startup command + shared startup checks (KVM/cgroup/firecracker/template verification + disk watermarks) and wire it into systemd + deploy flow.
  • Introduce disk/inode watermark enforcement (ZEROBOOT_MIN_FREE_BYTES, ZEROBOOT_MIN_FREE_INODES) and request-log backpressure/metrics.
  • Pin and document runtime artifacts (new lock file + fetch/install scripts) and update manifests/docs/tests accordingly.
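The fetch-and-verify flow described above (pinned artifacts checked against recorded SHA-256 digests, failing closed on any mismatch) can be sketched like this; the lock-file layout and helper names are assumptions, not the project's actual schema:

```python
import hashlib
import pathlib

def sha256_of(path: pathlib.Path) -> str:
    # Stream the file in chunks so large artifacts do not load into memory.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_artifacts(lock: dict, root: pathlib.Path) -> None:
    # Fail closed: any missing file or digest mismatch aborts startup.
    for name, entry in lock["artifacts"].items():
        p = root / entry["path"]
        if not p.exists():
            raise SystemExit(f"{name}: missing {p}")
        if sha256_of(p) != entry["sha256"]:
            raise SystemExit(f"{name}: sha256 mismatch")

# Demo with a throwaway artifact; a real lock file would carry the
# pinned Firecracker/kernel/rootfs digests.
root = pathlib.Path("/tmp")
art = root / "demo-artifact.bin"
art.write_bytes(b"pinned bytes")
lock = {"artifacts": {"demo": {"path": "demo-artifact.bin",
                               "sha256": sha256_of(art)}}}
verify_artifacts(lock, root)
print("artifacts verified")
```

The same check can back both the fetch script (verify after download) and verify-startup (verify again before going live).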

Reviewed changes

Copilot reviewed 41 out of 43 changed files in this pull request and generated 4 comments.

Show a summary per file:
  • tests/test_worker_protocol.py: Update worker protocol test to run the Python supervisor + child via env.
  • tests/test_repo_sanity.py: Tighten manifest/lock placeholder checks and add runtime-artifacts.lock.json to lock validation.
  • tests/test_guest_worker_subprocess.py: Replace conceptual isolation checks with real child-protocol tests and resource limit payloads.
  • tests/test_docs_sanity.py: Add sanity checks to keep docs aligned with release layout + hashed API key docs.
  • src/vmm/vmstate_stub.rs: Add non-Linux stub for vmstate validation.
  • src/vmm/vmstate.rs: Minor formatting/cleanup in vmstate validation errors/imports.
  • src/vmm/mod.rs: Gate KVM/vmstate modules by target OS and re-export stubs on non-Linux.
  • src/vmm/kvm_stub.rs: Add non-Linux stub types/APIs for KVM restore/serial/exec paths.
  • src/template_manifest.rs: Policy construction tweaks, clippy allow, and expanded manifest test fixtures.
  • src/startup.rs: New startup verification + disk watermark enforcement and Firecracker binary validation.
  • src/signing.rs: Formatting-only changes in canonicalization + tests.
  • src/main.rs: Add verify-startup command; run startup verification during serve; add log backpressure + template health values.
  • src/config.rs: Add StorageConfig (watermarks + log queue capacity) and integrate into config/env parsing + tests.
  • src/api/handlers.rs: Add runtime admission checks on exec/batch/ready; bounded request-log sender + new metrics and template health classification.
  • scripts/preflight.sh: Add watermark, Firecracker sha, and cgroup v2 checks to preflight.
  • scripts/make_api_keys.py: Switch to hashed API key records (HMAC-SHA256) and print bearer tokens once.
  • scripts/install_node_runtime.sh: New script to install pinned Node runtime into rootfs template with sha verification.
  • scripts/fetch_official_artifacts.sh: New script to fetch and verify pinned Ubuntu 22.04 / FC 1.12.0 artifact set.
  • scripts/create_release_bundle.sh: New script to create a source bundle (excluding build artifacts) for releases/CI.
  • scripts/build_guest_rootfs.sh: Copy all worker assets and update manifest sha fields to supervisor worker names.
  • manifests/runtime-artifacts.lock.json: New lock file documenting the pinned artifact/host matrix.
  • manifests/python-guest.manifest: Replace REQUIRED placeholders with pinned Ubuntu rootfs + Python version; switch worker path to supervisor.
  • manifests/python-build.lock.json: Pin python build inputs and supervisor worker path.
  • manifests/node-guest.manifest: Replace REQUIRED placeholders with pinned Ubuntu rootfs + Node version; switch worker path to supervisor.
  • manifests/node-build.lock.json: Pin node build inputs including runtime source URL/sha; switch worker path to supervisor.
  • manifests/kernel.manifest: Fill pinned Firecracker/kernel versions and sha256.
  • manifests/firecracker.lock.json: Fill pinned Firecracker/kernel versions and sha256 + source URLs.
  • guest/worker_supervisor.py: Rewrite Python worker into supervisor that spawns a fresh child per request with resource limits and truncation markers.
  • guest/worker_supervisor.js: Rewrite Node worker into supervisor using spawnSync per request and a child JSON protocol.
  • guest/worker_child.py: Rewrite Python child to read JSON payload, apply limits, isolate environment/tmp, and emit framed responses.
  • guest/worker_child.js: Rewrite Node child to read JSON payload, run code in VM context, enforce tmp quota post-run, and emit framed responses.
  • guest/init.c: Update guest init to launch supervisor workers instead of legacy worker scripts.
  • docs/DEPLOYMENT.md: Document pinned artifact fetch/install flow, new release layout, and startup verification/watermarks.
  • docs/COMPATIBILITY.md: New compatibility matrix documenting the single supported hardened host/artifact set.
  • docs/API.md: Document hashed API key record schema, bearer format, generation, and overload behaviors.
  • deploy/zeroboot.service: Add ExecStartPre=verify-startup, switch to /current template paths, and set watermark env vars.
  • deploy/deploy.sh: Add staged startup verification step, propagate new env vars, and strengthen rollback smoke checks.
  • README.md: Update positioning, document pinned artifact matrix, supervisor/child model, and new env vars/deploy behavior.
  • Cargo.toml: Make KVM-related deps Linux-only for better cross-platform builds.
  • Cargo.lock: Update lockfile for dependency graph changes.
  • .gitignore: Ignore additional build/release staging directories.
  • .github/workflows/ci.yml: Expand syntax checks, add release bundle validation, and run cargo commands with --locked.
  • .gitattributes: Add export-ignore rules to exclude build artifacts from source bundles.


Comment thread docs/DEPLOYMENT.md Outdated
Comment thread README.md Outdated
Comment on lines +135 to +141
let flags = 0;
let stdout;
let stderr;
[stdout, stdoutTruncated] = truncateWithMarker(result.stdout || Buffer.alloc(0), MAX_STDOUT);
[stderr, stderrTruncated] = truncateWithMarker(result.stderr || Buffer.alloc(0), MAX_STDERR);
if (stdoutTruncated) flags |= FLAG_STDOUT_TRUNCATED;
if (stderrTruncated) flags |= FLAG_STDERR_TRUNCATED;

Copilot AI Mar 28, 2026


stdoutTruncated and stderrTruncated are assigned without being declared, which creates implicit globals in non-strict mode (and throws in strict mode). Declare them (e.g., with let) before assignment to avoid leaking globals / runtime errors if strict mode is enabled later.

Copilot uses AI. Check for mistakes.
Comment on lines +116 to +130
const [command, args] = childCommand(timeoutMs);
const result = spawnSync(command, args, {
  input: payload,
  env: minimalChildEnv(),
  encoding: null,
  timeout: Math.max(timeoutMs + 2000, 5000),
  maxBuffer: MAX_STDOUT + MAX_STDERR + 4096,
});

if (result.error && result.error.code === 'ETIMEDOUT') {
  return [-1, 'timeout', Buffer.alloc(0), Buffer.from('execution timed out\n', 'utf8'), 0];
}
if (result.error) {
  return [-1, 'internal', Buffer.alloc(0), Buffer.from(String(result.error.message || result.error), 'utf8'), 0];
}

Copilot AI Mar 28, 2026


spawnSync can set result.error when stdout/stderr exceed maxBuffer (e.g. ERR_CHILD_PROCESS_STDIO_MAXBUFFER / ENOBUFS). Currently that is returned as an internal error, losing the child output and bypassing the truncation logic below. Handle the maxBuffer overflow case explicitly by returning a normal runtime response with truncated stdout/stderr (and setting the truncation flags), or increase buffering and rely on your own truncation logic.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 87694e0131


Comment thread src/startup.rs
Comment thread src/startup.rs

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 19

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
guest/worker_child.js (1)

52-58: ⚠️ Potential issue | 🟠 Major

Apply stdout/stderr caps while capturing output.

stdoutParts and stderrParts buffer everything in memory and only truncate at the end. A noisy script can exceed stdout_bytes/stderr_bytes and OOM the child before truncateWithMarker() ever runs.

Also applies to: 106-114

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@guest/worker_child.js` around lines 52 - 58, The current in-memory capture
buffers stdoutParts and stderrParts (filled by sandbox.console.log and
sandbox.console.error) never enforce stdout_bytes/stderr_bytes limits until
truncateWithMarker() runs, which can OOM; modify the capture so each push
respects the byte cap by tracking the accumulated byte count per stream and only
appending up to the remaining allowed bytes, and when the cap is reached append
the truncation marker immediately and stop adding further content; implement
this behavior via a small helper used by sandbox.console.log and
sandbox.console.error (and replicate the same capped-append logic for the other
capture block around lines 106-114) so truncation markers are applied
incrementally rather than only at the end.
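A minimal sketch of the incremental capping this prompt asks for, with illustrative names: each append respects the remaining byte budget, and the marker is attached as soon as the cap is hit rather than after all output has been buffered.

```python
MARKER = b"\n[truncated]\n"

class CappedStream:
    """Accumulates chunks up to a byte cap; drops everything after the cap."""

    def __init__(self, cap: int):
        self.cap = cap
        self.parts = []
        self.size = 0
        self.truncated = False

    def write(self, chunk: bytes) -> None:
        if self.truncated:
            return                       # already capped; drop further output
        room = self.cap - self.size
        if len(chunk) <= room:
            self.parts.append(chunk)
            self.size += len(chunk)
        else:
            # Keep only what fits, then attach the marker once.
            self.parts.append(chunk[:room])
            self.parts.append(MARKER)
            self.truncated = True

    def value(self) -> bytes:
        return b"".join(self.parts)

s = CappedStream(cap=8)
s.write(b"hello ")
s.write(b"world!!!")     # exceeds the remaining budget; truncated mid-chunk
print(s.value(), s.truncated)
```

Because memory use is bounded by the cap plus one marker, a noisy script can no longer grow the capture buffers without limit.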
tests/test_worker_protocol.py (1)

60-66: ⚠️ Potential issue | 🟠 Major

Rename test to clarify behavior — timeout does not request recycle in supervisor architecture.

The test name test_python_timeout_requests_recycle is misleading. The assertion flags & 4 == 0 correctly verifies that the recycle flag is not set on timeout in the supervisor architecture, which is the intended behavior. However, other worker implementations (worker.py, worker_node.js) do request recycle on timeout.

Rename the test to test_python_timeout_no_recycle or similar to accurately reflect that the supervisor intentionally does not request recycling on timeout.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_worker_protocol.py` around lines 60 - 66, Rename the test function
test_python_timeout_requests_recycle to a name that reflects the supervisor
behavior (e.g., test_python_timeout_no_recycle) and update any references; keep
the body and assertions (including the flags & 4 check and assertions on rid,
exit_code, error_type, and stderr) unchanged so the test continues to verify
that timeouts do not set the recycle flag in the supervisor architecture.
manifests/node-build.lock.json (1)

1-13: ⚠️ Potential issue | 🟠 Major

This lock file will fail the manifest-schema gate in CI.

.github/workflows/ci.yml Lines 47-59 validate every top-level manifests/*.json as a template manifest and require schema_version, template_id, and promotion_channel. This new *.lock.json intentionally does not carry that schema, so artifact-verify will exit non-zero as soon as it reaches this file. Either move build-lock files out of that glob or teach the workflow to skip *.lock.json.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifests/node-build.lock.json` around lines 1 - 13, The CI manifest-schema
validation is failing because build lock files matching *.lock.json are being
validated as template manifests; update the artifact-verify/manifest-schema
check so it skips *.lock.json (or change the manifest glob from manifests/*.json
to exclude *.lock.json), or move node-build.lock.json out of the
manifests/*.json glob; adjust the artifact-verify step (the job that runs the
manifest-schema gate) to explicitly ignore files matching *.lock.json or add a
separate glob for template manifests that requires schema_version, template_id
and promotion_channel.
src/api/handlers.rs (1)

1100-1110: ⚠️ Potential issue | 🟡 Minor

Classify failed probes from the detail you actually return.

Line 1100 derives health from startup_status.detail, but Lines 1105-1109 return probe.stderr when the probe fails. That can produce /health responses where detail and health describe different failure modes. Build the final detail first, then classify from that same string.

Suggested fix
-        let health = classify_template_health(ready, &startup_status.detail);
+        let detail = if ready {
+            "probe ok".into()
+        } else {
+            probe.stderr
+        };
+        let health = classify_template_health(ready, &detail);
         templates.insert(
             name.clone(),
             TemplateStatus {
                 ready,
-                detail: if ready {
-                    "probe ok".into()
-                } else {
-                    probe.stderr
-                },
+                detail,
                 health,
             },
         );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/handlers.rs` around lines 1100 - 1110, Create the final detail string
before computing health and then call classify_template_health on that same
string; specifically, compute a local variable (e.g., detail) set to "probe ok"
when ready is true or to probe.stderr when not, use that variable for
TemplateStatus.detail, and pass it to classify_template_health instead of
startup_status.detail so health and detail reflect the same failure text; update
the code paths around classify_template_health, TemplateStatus { detail: ... },
and any use of startup_status.detail to use the new detail variable.
🧹 Nitpick comments (11)
scripts/build_guest_rootfs.sh (1)

61-62: Line 61 may fail if legacy files are removed.

Line 61 explicitly references worker.py and worker_node.js, but the architecture is moving to supervisor scripts. If these legacy files are eventually removed from guest/, this line will cause the build to fail.

Consider removing line 61 since line 62 already handles all .py and .js files:

♻️ Proposed fix
-chmod 0644 "$STAGING_DIR/zeroboot/worker.py" "$STAGING_DIR/zeroboot/worker_node.js"
 chmod 0644 "$STAGING_DIR"/zeroboot/*.py "$STAGING_DIR"/zeroboot/*.js
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/build_guest_rootfs.sh` around lines 61 - 62, The explicit chmod call
for "$STAGING_DIR/zeroboot/worker.py" and "$STAGING_DIR/zeroboot/worker_node.js"
is brittle and will fail if those legacy files are removed; remove the specific
chmod invocation (the first chmod line) and rely on the subsequent glob chmod
("chmod 0644 \"$STAGING_DIR\"/zeroboot/*.py \"$STAGING_DIR\"/zeroboot/*.js") to
set permissions for present .py/.js files, or alternatively guard the explicit
chmod with an existence check for worker.py and worker_node.js if you prefer to
keep it.
manifests/runtime-artifacts.lock.json (1)

14-17: Consider HTTPS for artifact downloads.

The kernel and rootfs URLs use HTTP instead of HTTPS. While SHA256 verification provides integrity checking, HTTPS adds transport-level security and authenticity verification. For a production hardening PR, consider using HTTPS endpoints if available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifests/runtime-artifacts.lock.json` around lines 14 - 17, Update the
artifact download URLs under the "kernel" entry (the "url" value) — and any
similar entries such as "rootfs" if present — to use HTTPS instead of HTTP so
transport-level security and authenticity are enforced; keep the existing sha256
verification intact and verify the HTTPS endpoints serve the same artifacts
before committing the URL changes (adjust only the "url" string values for
kernel/rootfs entries).
scripts/preflight.sh (1)

47-50: cgroup v2 requirement is already documented in COMPATIBILITY.md — the prerequisite is listed in the startup verification section at line 43. However, the documentation could be expanded to explain which systems lack v2 support (e.g., older Ubuntu 18.04/20.04, certain container environments) and the migration implications for affected deployments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/preflight.sh` around lines 47 - 50, The preflight cgroup v2 check in
scripts/preflight.sh is fine but COMPATIBILITY.md needs expanded docs: update
the startup verification section (where cgroup v2 is already mentioned) to list
typical systems lacking v2 (e.g., older Ubuntu 18.04/20.04 kernels, some minimal
container hosts, certain cloud images) and explain migration implications and
remediation steps (kernel upgrades, enabling systemd cgroup v2, container
runtime configuration, and fallbacks). Reference the existing cgroup v2 check
from scripts/preflight.sh when adding the guidance so readers can correlate the
runtime check with the documented prerequisites.
src/vmm/mod.rs (1)

6-8: Avoid exposing stub module names as public API.

Lines 6 and 13 can be private modules; the canonical public surface is already provided through pub use ... as vmstate/kvm.

Suggested fix
-#[cfg(not(target_os = "linux"))]
-pub mod vmstate_stub;
+#[cfg(not(target_os = "linux"))]
+mod vmstate_stub;
 #[cfg(not(target_os = "linux"))]
 pub use vmstate_stub as vmstate;
@@
-#[cfg(not(target_os = "linux"))]
-pub mod kvm_stub;
+#[cfg(not(target_os = "linux"))]
+mod kvm_stub;
 #[cfg(not(target_os = "linux"))]
 pub use kvm_stub as kvm;

Also applies to: 13-15

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/vmm/mod.rs` around lines 6-8: The stub modules are currently declared
as public (pub mod vmstate_stub) exposing their internal names; make the stub
modules private by changing the declaration to mod vmstate_stub (and likewise
mod kvm_stub for the similar block) while keeping the public re-exports (pub use
vmstate_stub as vmstate and pub use kvm_stub as kvm) so the canonical public API
remains vmstate/kvm without exposing the stub identifiers.
scripts/fetch_official_artifacts.sh (1)

43-43: Harden download behavior with retries and bounded timeouts.

Line 43 currently has no retry or timeout policy, so transient network issues cause spurious fetch failures and slow endpoints can hang the script indefinitely.

Suggested fix
-  curl -fsSL "$url" -o "$dest"
+  curl --fail --show-error --silent --location \
+    --retry 5 --retry-all-errors --connect-timeout 15 --max-time 900 \
+    "$url" -o "$dest"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/fetch_official_artifacts.sh` at line 43: Replace the fragile curl
invocation `curl -fsSL "$url" -o "$dest"` with a hardened command that includes
bounded timeouts and retries: ensure you add options like --retry (e.g. 5),
--retry-delay (e.g. 2), --retry-connrefused, --connect-timeout (e.g. 5), and an
overall --max-time (e.g. 30) while preserving -f/-s/-S/-L behavior; update the
curl call in the script (the line containing `curl -fsSL "$url" -o "$dest"`) so
transient network errors are retried and long hangs are avoided.
scripts/create_release_bundle.sh (1)

6-8: Add temp directory cleanup on exit.

Line 6 allocates a temp directory but never removes it. Add an EXIT trap so repeated runs don't leak directories under /tmp.

Suggested fix
 TMPDIR="$(mktemp -d)"
 STAGE="$TMPDIR/zeroboot"
+cleanup() { rm -rf "$TMPDIR"; }
+trap cleanup EXIT
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/create_release_bundle.sh` around lines 6-8: The script creates a
temporary directory via TMPDIR="$(mktemp -d)" and STAGE="$TMPDIR/zeroboot" but
never removes it; add an EXIT trap to remove TMPDIR when the script exits
(success or failure). Modify the top of create_release_bundle.sh to register a
trap 'trap "rm -rf \"$TMPDIR\"" EXIT' after TMPDIR is created (and ensure TMPDIR
is non-empty/valid before installing the trap), so cleanup always runs;
reference TMPDIR and STAGE to locate where to add the trap and perform any
necessary null-checking before removal.
scripts/install_node_runtime.sh (1)

19-21: Prefer one source of truth for the pinned Node artifact.

NODE_VERSION and NODE_SHA256 are now duplicated here, in scripts/fetch_official_artifacts.sh, and in manifests/runtime-artifacts.lock.json. The next Node bump can easily update one copy and miss the others, breaking fetch/install parity. Pull these values from the lock file, or generate both scripts from it, so the pin only lives in one place.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/install_node_runtime.sh` around lines 19-21: The NODE_VERSION and
NODE_SHA256 constants in scripts/install_node_runtime.sh are duplicated across
scripts/fetch_official_artifacts.sh and manifests/runtime-artifacts.lock.json;
change install_node_runtime.sh to read the Node pin from the single
source-of-truth lock file instead of hardcoding values: parse
manifests/runtime-artifacts.lock.json at runtime (or add a small helper that
reads the JSON and exports NODE_VERSION and NODE_SHA256), and then use those
parsed values in place of the existing NODE_VERSION/NODE_SHA256/NODE_TARBALL
assignments so the pin only lives in runtime-artifacts.lock.json and both
install_node_runtime.sh and fetch_official_artifacts.sh reference that same
source.
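One way to realize the single source of truth is a tiny helper that both scripts invoke (e.g. via `python3 -c`) to pull the pin out of the lock file. A sketch, assuming a `node` entry with `version` and `sha256` keys — the real lock schema may name these differently:

```python
import json
from pathlib import Path

def read_node_pin(lock_path):
    """Return (version, sha256) for the pinned Node artifact.

    Shell callers can eval this helper's output so NODE_VERSION and
    NODE_SHA256 are derived from the lock file rather than hardcoded
    in each script.
    """
    data = json.loads(Path(lock_path).read_text())
    node = data["node"]
    return node["version"], node["sha256"]
```

A missing or malformed entry raises immediately, which is the desired failure mode for a pinned build.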
src/startup.rs (1)

111-121: KVM permission check may not correctly detect write access.

metadata.permissions().readonly() checks if the file's permission bits have the read-only flag set, but this doesn't account for:

  • Effective user permissions (ownership, group membership)
  • ACLs or SELinux/AppArmor policies

Consider using std::fs::OpenOptions to actually attempt opening /dev/kvm for writing:

Proposed fix
 fn verify_kvm() -> Result<()> {
     let kvm = Path::new("/dev/kvm");
     if !kvm.exists() {
         bail!("/dev/kvm is missing");
     }
-    let metadata = std::fs::metadata(kvm).context("stat /dev/kvm")?;
-    if metadata.permissions().readonly() {
-        bail!("/dev/kvm is not writable by the current user");
+    // Actually try to open for write to verify effective permissions
+    if std::fs::OpenOptions::new()
+        .write(true)
+        .open(kvm)
+        .is_err()
+    {
+        bail!("/dev/kvm is not writable by the current user");
     }
     Ok(())
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/startup.rs` around lines 111-121: The verify_kvm function's permission
check should attempt an actual write-open instead of relying on
metadata.permissions(); replace the readonly() check in verify_kvm with a real
open attempt using std::fs::OpenOptions (e.g.,
OpenOptions::new().write(true).open("/dev/kvm")) and return a contextualized
error (using .context or bail!) if opening for write fails; keep the initial
existence check and ensure the function still returns Ok(()) on success.
src/config.rs (2)

411-415: Test assertion is overly permissive and may mask regressions.

The test prod_without_release_channel_fails now accepts either KEYRING or channel errors, but validation runs in order and should fail on keyring first (since keyring_path points to a non-existent file). This creates confusion about what the test is actually verifying.

Consider tightening the assertion to match the actual expected failure:

-        assert!(
-            err.contains("KEYRING") || err.contains("channel"),
-            "expected channel-or-keyring error: {}",
-            err
-        );
+        assert!(
+            err.contains("KEYRING"),
+            "expected keyring error (file doesn't exist): {}",
+            err
+        );

Alternatively, if the intent is to test the release channel validation, provide a valid keyring file path that exists in the test environment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config.rs` around lines 411-415: The test
prod_without_release_channel_fails is too permissive: it asserts
err.contains("KEYRING") || err.contains("channel"), masking which validation
failed; either tighten the assertion to expect the keyring failure (make the
assertion require err.contains("KEYRING") and remove the "channel" branch) since
keyring_path points to a non-existent file, or adjust the test to supply a valid
keyring file (create a temp keyring artifact and set keyring_path to it) and
then assert that the error contains "channel" to specifically test
release-channel validation; update the assertion and test setup around
keyring_path and the assert! call accordingly.

447-451: Same permissive assertion issue as above.

The test prod_with_log_code_true_fails accepts either KEYRING or log_code errors. Since the keyring file /etc/zeroboot/keyring likely doesn't exist during tests, this will always match on KEYRING rather than testing the log_code validation path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config.rs` around lines 447-451: The test prod_with_log_code_true_fails
is too permissive (accepts either "KEYRING" or "log_code") and thus often passes
due to a missing /etc/zeroboot/keyring; change the assertion to require the
"log_code" validation error only (e.g., assert!(err.contains("log_code"),
"...")) so the test specifically verifies the log_code failure path, or
alternatively set up a controlled keyring fixture (create a temporary keyring
file or mock the keyring lookup in the test) so the error cannot come from
KEYRING and the existing assertion can remain.
src/main.rs (1)

113-125: Minor: Redundant variable assignment.

The firecracker_version variable at line 113-114 is assigned but then the same value is re-extracted inside the if let block at line 116. Consider simplifying:

-    let firecracker_version =
-        config.and_then(|c| c.artifacts.allowed_firecracker_version.as_deref());
-    if let Some(cfg) = config {
-        if let Some(firecracker_version) = cfg.artifacts.allowed_firecracker_version.as_deref() {
+    if let Some(cfg) = config {
+        if let Some(firecracker_version) = cfg.artifacts.allowed_firecracker_version.as_deref() {

The outer firecracker_version variable appears unused.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.rs` around lines 113-125: Remove the redundant extraction by
reusing the earlier firecracker_version variable: delete the inner nested
extraction and change the conditional to check the already-computed let
firecracker_version = config.and_then(|c|
c.artifacts.allowed_firecracker_version.as_deref()); then replace the nested
if-let with if let Some(firecracker_version) = firecracker_version { call
vmstate::pre_restore_validate(&state_data, Some(firecracker_version),
manifest.vcpu_count) and handle the Err path as before }, so you no longer
re-extract from config while still calling vmstate::pre_restore_validate with
the same symbols (firecracker_version, state_data, manifest.vcpu_count).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deploy/deploy.sh`:
- Line 137: The deploy script currently calls /v1/ready and /v1/exec without
auth which will 401 in ZEROBOOT_AUTH_MODE=prod; change the probe sequence to
call /ready for readiness (matching docs/CI) and replace the direct /v1/exec
probe with a call to the CLI helper that provides a valid probe token (use
`zeroboot test-exec`) or otherwise supply an explicit probe bearer token,
ensuring the auth header is included when invoking the execution probe so
hardened/prod installs don't false-fail; update the SSH curl invocation in
deploy.sh to use /ready and to invoke `zeroboot test-exec` (or include the probe
token) instead of raw /v1/exec.
- Around line 83-84: The staged verification call to verify-startup is forcing
ZEROBOOT_KEYRING_PATH and unconditionally setting
ZEROBOOT_ALLOWED_FIRECRACKER_VERSION, ZEROBOOT_ALLOWED_FC_BINARY_SHA256 and
ZEROBOOT_RELEASE_CHANNEL to empty values, which can make the check run with a
different policy than the service; update the ssh invocation used for
verify-startup to mirror the same conditional export/persistence logic used
later (the lines that persist knobs) so you only pass ZEROBOOT_KEYRING_PATH and
the ZEROBOOT_ALLOWED_* / ZEROBOOT_RELEASE_CHANNEL environment variables when
they are actually configured, or source the same persisted config before
invoking verify-startup to ensure both the staged check and the service run with
the identical config surface.

In `@docs/DEPLOYMENT.md`:
- Line 26: Replace the hardcoded absolute local path to the lockfile with a
repository-relative link: change the sentence that references
"/Users/dawsonblock/.../manifests/runtime-artifacts.lock.json" so it points to
the relative file path "manifests/runtime-artifacts.lock.json" (or a repo-root
URL) and update the markdown link target for "runtime-artifacts.lock.json"
accordingly to avoid exposing local filesystem paths.

In `@guest/init.c`:
- Around line 233-238: main() currently starts workers with legacy script paths
("worker.py" and "worker_node.js") while restart_worker() uses supervisor paths
("worker_supervisor.py" and "worker_supervisor.js"), causing inconsistent
behavior; update the initial argv arrays in main() for the Python and Node
workers to match restart_worker() by using "python3",
"/zeroboot/worker_supervisor.py" for the Python worker and "node",
"/zeroboot/worker_supervisor.js" for the Node worker (locate where main() calls
start_worker() for worker->name "python" and "node") so startup and restart use
the same supervisor scripts.

In `@guest/worker_child.js`:
- Around line 63-64: The sandbox currently exposes setTimeout, clearTimeout, and
Promise while relying on vm.runInNewContext(..., {timeout}) (symbols:
setTimeout, clearTimeout, Promise, runInNewContext, timeout), which lets
untrusted code schedule async callbacks that escape the VM timeout; remove these
async primitives from the sandbox to prevent scheduling (delete setTimeout,
clearTimeout, Promise from the context) or, if you must allow async work, run
the untrusted script in a separate worker/child process and enforce an external
timeout that kills/terminates the worker (use Worker.terminate() or kill the
child) instead of relying on runInNewContext's timeout. Ensure you update the
sandbox object creation where setTimeout/clearTimeout/Promise are injected and
add external process/worker termination logic if choosing the second option.

In `@guest/worker_supervisor.js`:
- Around line 125-142: Before composing the fallback response, explicitly handle
result.signal: detect if result.signal is set (the spawnSync result), truncate
stdout/stderr with truncateWithMarker as done below, compute flags using
FLAG_STDOUT_TRUNCATED / FLAG_STDERR_TRUNCATED, and then return a non-zero exit
code and a 'signaled' status (include the signal name or numeric mapping, e.g.,
use 128 + signal number if available, or -1 if not) along with the truncated
stdout, stderr, and flags; place this branch before the final return that
currently uses result.status || 0 and references parseChildResponse,
truncateWithMarker, FLAG_STDOUT_TRUNCATED, and FLAG_STDERR_TRUNCATED.

In `@manifests/firecracker.lock.json`:
- Around line 7-8: The manifest currently pins the kernel artifact with an
insecure transport by using the "kernel_url" value starting with "http://";
update the "kernel_url" to use "https://" (preserving the same host/path) so
consumers fetch the pinned kernel over TLS, and leave the "kernel_sha256" intact
to continue verifying integrity (i.e., change the scheme of the value referenced
by kernel_url to https).

In `@manifests/node-guest.manifest`:
- Line 7: The manifest pins the supervisor at /zeroboot/worker_supervisor.js but
guest/init.c still hardcodes first-boot worker paths (/zeroboot/worker_node.js
and /zeroboot/worker.py), causing a mismatch; update guest/init.c to start the
same supervisor-specified worker path used for restarts by either reading the
manifest entry or using /zeroboot/worker_supervisor.js (replace references to
/zeroboot/worker_node.js and /zeroboot/worker.py), and ensure the restart logic
and any spawn/exec calls reference the same symbol so initial boot and runtime
use the identical supervisor binary.

In `@README.md`:
- Around line 105-106: Replace the hardcoded absolute path
"/Users/dawsonblock/Downloads/XBOOT-main-2/manifests/runtime-artifacts.lock.json"
in README.md with a repository-relative path (for example
"manifests/runtime-artifacts.lock.json" or
"./manifests/runtime-artifacts.lock.json"); update the markdown link text so the
referenced file is relative (e.g.,
[runtime-artifacts.lock.json](manifests/runtime-artifacts.lock.json)) to ensure
the link works for other users cloning the repo.

In `@scripts/build_guest_rootfs.sh`:
- Around line 70-71: The initial startup paths in init.c's main() still point to
/zeroboot/worker.py and /zeroboot/worker_node.js but the manifest and
restart_worker() use the supervisor entrypoints; update main() to launch the
supervisor entrypoints instead by replacing those initial paths with
/zeroboot/worker_supervisor.py and /zeroboot/worker_supervisor.js so they match
the SHA256-written files and the restart_worker() behavior.

In `@scripts/create_release_bundle.sh`:
- Around line 13-25: The fallback branch that runs rsync (the else block
invoking rsync -a "$ROOT" "$STAGE" with many --exclude flags) assumes rsync
exists; add a preflight check before calling rsync by testing command -v rsync
(or type rsync >/dev/null 2>&1) and if not found, print a clear error message
and exit non‑zero (or implement a safe fallback like using tar/cp with the same
excludes); ensure the check appears immediately before the rsync invocation and
that the error message references rsync so users know what to install.

In `@scripts/fetch_official_artifacts.sh`:
- Around line 30-35: The preflight checks currently validate curl and tar but
miss verifying a SHA-256 tool; add a check alongside the existing curl/tar
checks to ensure either sha256sum or shasum is available and fail early with a
clear error message if neither exists. Update the preflight logic (the same area
that checks for curl and tar) to run command -v sha256sum >/dev/null 2>&1 ||
command -v shasum >/dev/null 2>&1 and call exit 1 with a descriptive error if it
fails; this ensures the sha256_file() function (which expects sha256sum or
shasum) will always have a valid tool.

In `@scripts/make_api_keys.py`:
- Around line 14-16: scripts/make_api_keys.py currently allows a custom --prefix
that may contain '.' which breaks the verifier expecting tokens in the form
prefix.secret; add validation where the prefix is composed (variables
prefix_prefix and prefix and when building token) to reject or sanitize any
prefix containing a '.' and raise a clear error/exit if found (e.g., check the
user-provided prefix before concatenating with uuid and also before creating
tokens at the other occurrences referenced around the prefix_prefix/prefix/token
construction) so no generated token contains a dot in its prefix portion.
- Around line 36-45: Remove the inline --pepper argument and require the pepper
be supplied via --pepper-file or stdin: delete the
parser.add_argument("--pepper") line and the code path that reads args.pepper;
instead read pepper from Path(args.pepper_file).read_text().strip(), and allow
reading from stdin when args.pepper_file == "-" or when no file arg is given but
data is piped in (use sys.stdin.read().strip()); if neither a pepper file nor
stdin content is present then raise SystemExit("provide --pepper-file or pipe
pepper via stdin"). Keep references to parser, args, pepper_file and the pepper
variable so the rest of the script uses the loaded pepper.

In `@src/api/handlers.rs`:
- Around line 510-515: The current try_send call on state.request_log_tx treats
all errors the same; update the error handling around
state.request_log_tx.try_send(line.to_string()) to match on Tokio's TrySendError
variants and increment the appropriate metric: if the error is
TrySendError::Full increment state.metrics.request_log_drops.fetch_add(1,
Ordering::Relaxed), but if it is TrySendError::Closed increment
state.metrics.request_log_write_failures.fetch_add(1, Ordering::Relaxed) (or
create request_log_write_failures if missing); ensure you still drop/ignore the
message after handling each variant so the logging path distinguishes
backpressure (Full) from logger shutdown (Closed).

In `@src/startup.rs`:
- Around line 236-238: The explicit u64 casts on statvfs fields are redundant;
update the allocation where free_bytes and free_inodes are set to use the stat
fields directly (replace (stat.f_bavail as u64).saturating_mul(stat.f_frsize as
u64) with stat.f_bavail.saturating_mul(stat.f_frsize) and replace stat.f_favail
as u64 with stat.f_favail) in the code that constructs the struct containing
free_bytes and free_inodes (references: free_bytes, free_inodes, stat).
- Around line 152-162: The code currently compares the raw version_output stdout
(variable version) against config.artifacts.allowed_firecracker_version, but
firecracker prints a "Firecracker v" prefix and possible git suffixes; parse the
version string from version_output.stdout (e.g., strip a leading "Firecracker v"
or apply a regex to capture the semantic version and optional git suffix like
"-695-g...-dirty") and normalize it before comparing to
allowed_firecracker_version. Update the logic around version (the
String::from_utf8_lossy -> version variable) to extract and trim the actual
numeric version token, then compare that parsed_version to
config.artifacts.allowed_firecracker_version.as_deref() in the existing if let
Some(...) block.

In `@tests/test_guest_worker_subprocess.py`:
- Around line 70-80: The test test_temp_directory_is_scoped_per_request
currently only writes a file and checks it exists in a single run; modify it to
run the child twice using self.run_child (call it twice with the same small
snippet that prints os.environ['ZEROBOOT_TMPDIR'] and optionally creates a
file), capture both stdout paths, assert the two printed ZEROBOOT_TMPDIR values
are different (ensuring per-request isolation), and additionally ensure the
first request's temp directory is cleaned up after exit by asserting the first
path no longer exists (or that the file created in the first run is absent) when
the second run starts; reference the test_temp_directory_is_scoped_per_request
function, self.run_child helper, and the ZEROBOOT_TMPDIR env var to locate where
to change the test.

In `@tests/test_repo_sanity.py`:
- Around line 20-22: The placeholder check in the loop currently only looks for
the literal token "=REQUIRED" and misses variants like "key = REQUIRED"; update
the conditional that filters lines before self.fail to detect "REQUIRED" with
optional spaces around the '=' (e.g. use a regex search for patterns like
"\b=\s*REQUIRED\b" or check for both '=' and 'REQUIRED' spaced) so any "key =
REQUIRED" or "key= REQUIRED" variants trigger the unresolved-placeholder
failure; adjust the condition around the variables line and path.name
accordingly.
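The version-normalization item for `src/startup.rs` above reduces to extracting the semver token from the banner. A regex sketch in Python — the exact `firecracker --version` banner format is an assumption here, so treat the patterns as illustrative:

```python
import re

_VERSION_RE = re.compile(r"v?(\d+\.\d+\.\d+)")

def parse_firecracker_version(banner):
    """Pull the bare semver out of output like "Firecracker v1.7.0-695-g...-dirty".

    Returns None when no version token is present, so callers can fail
    closed instead of string-comparing against raw banner text.
    """
    m = _VERSION_RE.search(banner)
    return m.group(1) if m else None
```

The Rust side can do the same with a prefix strip plus a split on `-`, or a regex crate pattern; the point is comparing the parsed token, not the whole stdout line.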
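For the `guest/worker_supervisor.js` signal-handling item above, the `128 + signal number` suggestion follows the shell convention for processes killed by a signal. A small Python sketch of the mapping (the fallback value of -1 mirrors the comment's suggestion; names are illustrative):

```python
import signal

def exit_code_for_signal(sig_name):
    """Map a spawnSync-style signal name (e.g. "SIGKILL") to a shell-style
    exit code of 128 + signum, falling back to -1 for unknown names."""
    if not sig_name:
        return -1
    try:
        return 128 + int(getattr(signal, sig_name))
    except (AttributeError, TypeError, ValueError):
        return -1
```

This keeps a signaled child distinguishable from both a clean exit (0) and an ordinary nonzero status, which `result.status || 0` currently erases.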

---

Outside diff comments:
In `@guest/worker_child.js`:
- Around line 52-58: The current in-memory capture buffers stdoutParts and
stderrParts (filled by sandbox.console.log and sandbox.console.error) never
enforce stdout_bytes/stderr_bytes limits until truncateWithMarker() runs, which
can OOM; modify the capture so each push respects the byte cap by tracking the
accumulated byte count per stream and only appending up to the remaining allowed
bytes, and when the cap is reached append the truncation marker immediately and
stop adding further content; implement this behavior via a small helper used by
sandbox.console.log and sandbox.console.error (and replicate the same
capped-append logic for the other capture block around lines 106-114) so
truncation/makers are applied incrementally rather than only at the end.

In `@manifests/node-build.lock.json`:
- Around line 1-13: The CI manifest-schema validation is failing because build
lock files matching *.lock.json are being validated as template manifests;
update the artifact-verify/manifest-schema check so it skips *.lock.json (or
change the manifest glob from manifests/*.json to exclude *.lock.json), or move
node-build.lock.json out of the manifests/*.json glob; adjust the
artifact-verify step (the job that runs the manifest-schema gate) to explicitly
ignore files matching *.lock.json or add a separate glob for template manifests
that requires schema_version, template_id and promotion_channel.

In `@src/api/handlers.rs`:
- Around line 1100-1110: Create the final detail string before computing health
and then call classify_template_health on that same string; specifically,
compute a local variable (e.g., detail) set to "probe ok" when ready is true or
to probe.stderr when not, use that variable for TemplateStatus.detail, and pass
it to classify_template_health instead of startup_status.detail so health and
detail reflect the same failure text; update the code paths around
classify_template_health, TemplateStatus { detail: ... }, and any use of
startup_status.detail to use the new detail variable.

In `@tests/test_worker_protocol.py`:
- Around line 60-66: Rename the test function
test_python_timeout_requests_recycle to a name that reflects the supervisor
behavior (e.g., test_python_timeout_no_recycle) and update any references; keep
the body and assertions (including the flags & 4 check and assertions on rid,
exit_code, error_type, and stderr) unchanged so the test continues to verify
that timeouts do not set the recycle flag in the supervisor architecture.
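The capped-append behavior described for `guest/worker_child.js` above is easy to get subtly wrong (double markers, off-by-one budgets). A language-neutral sketch in Python of the intended invariant — byte budget enforced on every append, truncation marker emitted exactly once, nothing buffered past the cap; class and marker names are illustrative:

```python
class CappedBuffer:
    """Accumulate output up to max_bytes; once the budget is exhausted,
    append the truncation marker exactly once and drop all later chunks."""

    def __init__(self, max_bytes, marker="...[truncated]"):
        self.max_bytes = max_bytes
        self.marker = marker
        self.parts = []
        self.used = 0
        self.truncated = False

    def append(self, chunk):
        if self.truncated:
            return  # budget already spent; never grow past the cap
        data = chunk.encode()
        remaining = self.max_bytes - self.used
        if len(data) <= remaining:
            self.parts.append(chunk)
            self.used += len(data)
        else:
            # Keep only the bytes that fit, then mark truncation once.
            self.parts.append(data[:remaining].decode(errors="ignore"))
            self.parts.append(self.marker)
            self.truncated = True

    def value(self):
        return "".join(self.parts)
```

Applying the same capped-append at each `console.log`/`console.error` call bounds memory regardless of how much the untrusted script prints.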

---

Nitpick comments:
In `@scripts/build_guest_rootfs.sh`:
- Around line 61-62: The explicit chmod call for
"$STAGING_DIR/zeroboot/worker.py" and "$STAGING_DIR/zeroboot/worker_node.js" is
brittle and will fail if those legacy files are removed; remove the specific
chmod invocation (the first chmod line) and rely on the subsequent glob chmod
("chmod 0644 \"$STAGING_DIR\"/zeroboot/*.py \"$STAGING_DIR\"/zeroboot/*.js") to
set permissions for present .py/.js files, or alternatively guard the explicit
chmod with an existence check for worker.py and worker_node.js if you prefer to
keep it.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 34318ed8-b85b-4d7d-9a25-cf463da71f6e

📥 Commits

Reviewing files that changed from the base of the PR and between f4584be and 87694e0.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (42)
  • .gitattributes
  • .github/workflows/ci.yml
  • .gitignore
  • Cargo.toml
  • README.md
  • deploy/deploy.sh
  • deploy/zeroboot.service
  • docs/API.md
  • docs/COMPATIBILITY.md
  • docs/DEPLOYMENT.md
  • guest/init.c
  • guest/worker_child.js
  • guest/worker_child.py
  • guest/worker_supervisor.js
  • guest/worker_supervisor.py
  • manifests/firecracker.lock.json
  • manifests/kernel.manifest
  • manifests/node-build.lock.json
  • manifests/node-guest.manifest
  • manifests/python-build.lock.json
  • manifests/python-guest.manifest
  • manifests/runtime-artifacts.lock.json
  • scripts/build_guest_rootfs.sh
  • scripts/create_release_bundle.sh
  • scripts/fetch_official_artifacts.sh
  • scripts/install_node_runtime.sh
  • scripts/make_api_keys.py
  • scripts/preflight.sh
  • src/api/handlers.rs
  • src/config.rs
  • src/main.rs
  • src/signing.rs
  • src/startup.rs
  • src/template_manifest.rs
  • src/vmm/kvm_stub.rs
  • src/vmm/mod.rs
  • src/vmm/vmstate.rs
  • src/vmm/vmstate_stub.rs
  • tests/test_docs_sanity.py
  • tests/test_guest_worker_subprocess.py
  • tests/test_repo_sanity.py
  • tests/test_worker_protocol.py

Comment threads:
  • deploy/deploy.sh (2 threads)
  • docs/DEPLOYMENT.md (outdated)
  • guest/init.c
  • guest/worker_child.js
  • src/api/handlers.rs
  • src/startup.rs (2 threads)
  • tests/test_guest_worker_subprocess.py
  • tests/test_repo_sanity.py
dawsonblock and others added 2 commits March 28, 2026 02:37
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@dawsonblock merged commit 1ce73a7 into main on Mar 28, 2026.
5 of 7 checks passed
