Harden pass 1 runtime and lock Ubuntu 22.04 artifacts #5
Conversation
📝 Walkthrough
This PR introduces a production hardening update for a Firecracker-based microVM system, transitioning from a development build to a "first hardened release." Key changes include refactoring guest workers into supervisor/child models, pinning all runtime artifacts (Firecracker 1.12.0, Python 3.10.12, Node.js 20.20.2), implementing startup verification and disk/inode watermark admission checks, updating API key handling to use hashed records, adopting symlink-based release management, and adding cross-platform stubs for non-Linux builds.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Code Review
This pull request refactors the guest worker architecture into a supervisor/child model to improve isolation and introduces a comprehensive startup verification process, including disk and inode watermark checks. It also pins specific versions for Firecracker, the guest kernel, and the rootfs to ensure environment consistency. Review feedback identifies a regression in the rootfs build script due to renamed worker files and points out absolute local file paths used in documentation links that need to be converted to relative paths.
Pull request overview
This PR hardens the XBOOT runtime and deployment flow by adding staged startup verification, enforcing disk/inode admission checks, pinning Ubuntu 22.04 artifact compatibility, and updating worker execution to a supervisor/child subprocess model.
Changes:
- Add `verify-startup` command + shared startup checks (KVM/cgroup/firecracker/template verification + disk watermarks) and wire it into systemd + deploy flow.
- Introduce disk/inode watermark enforcement (`ZEROBOOT_MIN_FREE_BYTES`, `ZEROBOOT_MIN_FREE_INODES`) and request-log backpressure/metrics.
- Pin and document runtime artifacts (new lock file + fetch/install scripts) and update manifests/docs/tests accordingly.
Reviewed changes
Copilot reviewed 41 out of 43 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_worker_protocol.py | Update worker protocol test to run the Python supervisor + child via env. |
| tests/test_repo_sanity.py | Tighten manifest/lock placeholder checks and add runtime-artifacts.lock.json to lock validation. |
| tests/test_guest_worker_subprocess.py | Replace conceptual isolation checks with real child-protocol tests and resource limit payloads. |
| tests/test_docs_sanity.py | Add sanity checks to keep docs aligned with release layout + hashed API key docs. |
| src/vmm/vmstate_stub.rs | Add non-Linux stub for vmstate validation. |
| src/vmm/vmstate.rs | Minor formatting/cleanup in vmstate validation errors/imports. |
| src/vmm/mod.rs | Gate KVM/vmstate modules by target OS and re-export stubs on non-Linux. |
| src/vmm/kvm_stub.rs | Add non-Linux stub types/APIs for KVM restore/serial/exec paths. |
| src/template_manifest.rs | Policy construction tweaks, clippy allow, and expanded manifest test fixtures. |
| src/startup.rs | New startup verification + disk watermark enforcement and Firecracker binary validation. |
| src/signing.rs | Formatting-only changes in canonicalization + tests. |
| src/main.rs | Add verify-startup command; run startup verification during serve; add log backpressure + template health values. |
| src/config.rs | Add StorageConfig (watermarks + log queue capacity) and integrate into config/env parsing + tests. |
| src/api/handlers.rs | Add runtime admission checks on exec/batch/ready; bounded request-log sender + new metrics and template health classification. |
| scripts/preflight.sh | Add watermark, Firecracker sha, and cgroup v2 checks to preflight. |
| scripts/make_api_keys.py | Switch to hashed API key records (HMAC-SHA256) and print bearer tokens once. |
| scripts/install_node_runtime.sh | New script to install pinned Node runtime into rootfs template with sha verification. |
| scripts/fetch_official_artifacts.sh | New script to fetch and verify pinned Ubuntu 22.04 / FC 1.12.0 artifact set. |
| scripts/create_release_bundle.sh | New script to create a source bundle (excluding build artifacts) for releases/CI. |
| scripts/build_guest_rootfs.sh | Copy all worker assets and update manifest sha fields to supervisor worker names. |
| manifests/runtime-artifacts.lock.json | New lock file documenting the pinned artifact/host matrix. |
| manifests/python-guest.manifest | Replace REQUIRED placeholders with pinned Ubuntu rootfs + Python version; switch worker path to supervisor. |
| manifests/python-build.lock.json | Pin python build inputs and supervisor worker path. |
| manifests/node-guest.manifest | Replace REQUIRED placeholders with pinned Ubuntu rootfs + Node version; switch worker path to supervisor. |
| manifests/node-build.lock.json | Pin node build inputs including runtime source URL/sha; switch worker path to supervisor. |
| manifests/kernel.manifest | Fill pinned Firecracker/kernel versions and sha256. |
| manifests/firecracker.lock.json | Fill pinned Firecracker/kernel versions and sha256 + source URLs. |
| guest/worker_supervisor.py | Rewrite Python worker into supervisor that spawns a fresh child per request with resource limits and truncation markers. |
| guest/worker_supervisor.js | Rewrite Node worker into supervisor using spawnSync per request and a child JSON protocol. |
| guest/worker_child.py | Rewrite Python child to read JSON payload, apply limits, isolate environment/tmp, and emit framed responses. |
| guest/worker_child.js | Rewrite Node child to read JSON payload, run code in VM context, enforce tmp quota post-run, and emit framed responses. |
| guest/init.c | Update guest init to launch supervisor workers instead of legacy worker scripts. |
| docs/DEPLOYMENT.md | Document pinned artifact fetch/install flow, new release layout, and startup verification/watermarks. |
| docs/COMPATIBILITY.md | New compatibility matrix documenting the single supported hardened host/artifact set. |
| docs/API.md | Document hashed API key record schema, bearer format, generation, and overload behaviors. |
| deploy/zeroboot.service | Add ExecStartPre=verify-startup, switch to /current template paths, and set watermark env vars. |
| deploy/deploy.sh | Add staged startup verification step, propagate new env vars, and strengthen rollback smoke checks. |
| README.md | Update positioning, document pinned artifact matrix, supervisor/child model, and new env vars/deploy behavior. |
| Cargo.toml | Make KVM-related deps Linux-only for better cross-platform builds. |
| Cargo.lock | Update lockfile for dependency graph changes. |
| .gitignore | Ignore additional build/release staging directories. |
| .github/workflows/ci.yml | Expand syntax checks, add release bundle validation, and run cargo commands with --locked. |
| .gitattributes | Add export-ignore rules to exclude build artifacts from source bundles. |
```js
let flags = 0;
let stdout;
let stderr;
[stdout, stdoutTruncated] = truncateWithMarker(result.stdout || Buffer.alloc(0), MAX_STDOUT);
[stderr, stderrTruncated] = truncateWithMarker(result.stderr || Buffer.alloc(0), MAX_STDERR);
if (stdoutTruncated) flags |= FLAG_STDOUT_TRUNCATED;
if (stderrTruncated) flags |= FLAG_STDERR_TRUNCATED;
```
`stdoutTruncated` and `stderrTruncated` are assigned without being declared, which creates implicit globals in non-strict mode (and throws in strict mode). Declare them (e.g., with `let`) before assignment to avoid leaking globals / runtime errors if strict mode is enabled later.
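A minimal sketch of that fix, with the flags declared before the destructuring assignments; `truncateWithMarker` here is a stand-in for the worker's helper, and the flag constants are illustrative:

```javascript
'use strict';

// Stand-in for the worker's truncateWithMarker helper: caps a buffer
// at `max` bytes and reports whether truncation happened.
function truncateWithMarker(buf, max) {
  if (buf.length <= max) return [buf, false];
  return [Buffer.concat([buf.subarray(0, max), Buffer.from('...[truncated]')]), true];
}

// Illustrative flag values, not the PR's actual constants.
const FLAG_STDOUT_TRUNCATED = 1;
const FLAG_STDERR_TRUNCATED = 2;

let flags = 0;
// Declaring every assignment target up front avoids implicit globals
// (and a ReferenceError once 'use strict' is in effect, as it is here).
let stdout, stdoutTruncated;
let stderr, stderrTruncated;
[stdout, stdoutTruncated] = truncateWithMarker(Buffer.from('x'.repeat(20)), 8);
[stderr, stderrTruncated] = truncateWithMarker(Buffer.alloc(0), 8);
if (stdoutTruncated) flags |= FLAG_STDOUT_TRUNCATED;
if (stderrTruncated) flags |= FLAG_STDERR_TRUNCATED;
console.log(flags); // → 1 (only stdout exceeded its cap)
```

Note that bare array destructuring onto pre-declared variables needs no parentheses, unlike object destructuring.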
```js
const [command, args] = childCommand(timeoutMs);
const result = spawnSync(command, args, {
  input: payload,
  env: minimalChildEnv(),
  encoding: null,
  timeout: Math.max(timeoutMs + 2000, 5000),
  maxBuffer: MAX_STDOUT + MAX_STDERR + 4096,
});
// …
const stderrStart = stdoutStart + stdoutLen;
const stderrEnd = Math.min(data.length, stderrStart + stderrLen);
if (stderrStart < data.length && stderrLen > 0) {
  stderr = data.subarray(stderrStart, stderrEnd);
} else {
  stderr = Buffer.alloc(0);
}
// …
if (result.error && result.error.code === 'ETIMEDOUT') {
  return [-1, 'timeout', Buffer.alloc(0), Buffer.from('execution timed out\n', 'utf8'), 0];
}
if (result.error) {
  return [-1, 'internal', Buffer.alloc(0), Buffer.from(String(result.error.message || result.error), 'utf8'), 0];
}
return [exitCode, errorType, stdout, stderr, flags];
```
spawnSync can set result.error when stdout/stderr exceed maxBuffer (e.g. ERR_CHILD_PROCESS_STDIO_MAXBUFFER / ENOBUFS). Currently that is returned as an internal error, losing the child output and bypassing the truncation logic below. Handle the maxBuffer overflow case explicitly by returning a normal runtime response with truncated stdout/stderr (and setting the truncation flags), or increase buffering and rely on your own truncation logic.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 87694e0131
Actionable comments posted: 19
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
guest/worker_child.js (1)
52-58: ⚠️ Potential issue | 🟠 Major: Apply stdout/stderr caps while capturing output.
`stdoutParts` and `stderrParts` buffer everything in memory and only truncate at the end. A noisy script can exceed `stdout_bytes`/`stderr_bytes` and OOM the child before `truncateWithMarker()` ever runs. Also applies to: 106-114
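The capped-append approach can be sketched as a small helper class; `CappedCapture` and the marker text are illustrative names, not taken from the PR:

```javascript
'use strict';

// Incrementally-capped capture stream: appends stop once the byte
// budget is exhausted and the truncation marker is added exactly once,
// so a noisy child can never grow the buffer past maxBytes + marker.
class CappedCapture {
  constructor(maxBytes, marker = '\n[truncated]\n') {
    this.maxBytes = maxBytes;
    this.marker = Buffer.from(marker, 'utf8');
    this.parts = [];
    this.used = 0;
    this.truncated = false;
  }
  push(chunk) {
    if (this.truncated) return; // budget already spent, drop the rest
    const buf = Buffer.isBuffer(chunk) ? chunk : Buffer.from(String(chunk), 'utf8');
    const remaining = this.maxBytes - this.used;
    if (buf.length <= remaining) {
      this.parts.push(buf);
      this.used += buf.length;
    } else {
      this.parts.push(buf.subarray(0, remaining), this.marker);
      this.used = this.maxBytes;
      this.truncated = true;
    }
  }
  bytes() {
    return Buffer.concat(this.parts);
  }
}

const out = new CappedCapture(10);
out.push('hello ');
out.push('world, this will not all fit');
out.push('dropped entirely');
console.log(out.truncated, out.bytes().length);
```

Wiring `sandbox.console.log` and `sandbox.console.error` through one instance each would give per-stream caps without buffering unbounded output first.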
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@guest/worker_child.js` around lines 52 - 58, The current in-memory capture buffers stdoutParts and stderrParts (filled by sandbox.console.log and sandbox.console.error) never enforce stdout_bytes/stderr_bytes limits until truncateWithMarker() runs, which can OOM; modify the capture so each push respects the byte cap by tracking the accumulated byte count per stream and only appending up to the remaining allowed bytes, and when the cap is reached append the truncation marker immediately and stop adding further content; implement this behavior via a small helper used by sandbox.console.log and sandbox.console.error (and replicate the same capped-append logic for the other capture block around lines 106-114) so truncation/markers are applied incrementally rather than only at the end.
tests/test_worker_protocol.py (1)
60-66: ⚠️ Potential issue | 🟠 Major: Rename test to clarify behavior — timeout does not request recycle in supervisor architecture.
The test name `test_python_timeout_requests_recycle` is misleading. The assertion `flags & 4 == 0` correctly verifies that the recycle flag is not set on timeout in the supervisor architecture, which is the intended behavior. However, other worker implementations (worker.py, worker_node.js) do request recycle on timeout.
Rename the test to `test_python_timeout_no_recycle` or similar to accurately reflect that the supervisor intentionally does not request recycling on timeout.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_worker_protocol.py` around lines 60 - 66, Rename the test function test_python_timeout_requests_recycle to a name that reflects the supervisor behavior (e.g., test_python_timeout_no_recycle) and update any references; keep the body and assertions (including the flags & 4 check and assertions on rid, exit_code, error_type, and stderr) unchanged so the test continues to verify that timeouts do not set the recycle flag in the supervisor architecture.
manifests/node-build.lock.json (1)
1-13: ⚠️ Potential issue | 🟠 Major: This lock file will fail the manifest-schema gate in CI.
`.github/workflows/ci.yml` lines 47-59 validate every top-level `manifests/*.json` as a template manifest and require `schema_version`, `template_id`, and `promotion_channel`. This new `*.lock.json` intentionally does not carry that schema, so `artifact-verify` will exit non-zero as soon as it reaches this file. Either move build-lock files out of that glob or teach the workflow to skip `*.lock.json`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@manifests/node-build.lock.json` around lines 1 - 13, The CI manifest-schema validation is failing because build lock files matching *.lock.json are being validated as template manifests; update the artifact-verify/manifest-schema check so it skips *.lock.json (or change the manifest glob from manifests/*.json to exclude *.lock.json), or move node-build.lock.json out of the manifests/*.json glob; adjust the artifact-verify step (the job that runs the manifest-schema gate) to explicitly ignore files matching *.lock.json or add a separate glob for template manifests that requires schema_version, template_id and promotion_channel.
src/api/handlers.rs (1)
1100-1110: ⚠️ Potential issue | 🟡 Minor: Classify failed probes from the detail you actually return.
Line 1100 derives `health` from `startup_status.detail`, but lines 1105-1109 return `probe.stderr` when the probe fails. That can produce `/health` responses where `detail` and `health` describe different failure modes. Build the final `detail` first, then classify from that same string.
Suggested fix
```diff
-            let health = classify_template_health(ready, &startup_status.detail);
+            let detail = if ready {
+                "probe ok".into()
+            } else {
+                probe.stderr
+            };
+            let health = classify_template_health(ready, &detail);
             templates.insert(
                 name.clone(),
                 TemplateStatus {
                     ready,
-                    detail: if ready {
-                        "probe ok".into()
-                    } else {
-                        probe.stderr
-                    },
+                    detail,
                     health,
                 },
             );
```
Verify each finding against the current code and only fix it if needed. In `@src/api/handlers.rs` around lines 1100 - 1110, Create the final detail string before computing health and then call classify_template_health on that same string; specifically, compute a local variable (e.g., detail) set to "probe ok" when ready is true or to probe.stderr when not, use that variable for TemplateStatus.detail, and pass it to classify_template_health instead of startup_status.detail so health and detail reflect the same failure text; update the code paths around classify_template_health, TemplateStatus { detail: ... }, and any use of startup_status.detail to use the new detail variable.
🧹 Nitpick comments (11)
scripts/build_guest_rootfs.sh (1)
61-62: Line 61 may fail if legacy files are removed.
Line 61 explicitly references `worker.py` and `worker_node.js`, but the architecture is moving to supervisor scripts. If these legacy files are eventually removed from `guest/`, this line will cause the build to fail.
Consider removing line 61 since line 62 already handles all `.py` and `.js` files:
♻️ Proposed fix
```diff
-chmod 0644 "$STAGING_DIR/zeroboot/worker.py" "$STAGING_DIR/zeroboot/worker_node.js"
 chmod 0644 "$STAGING_DIR"/zeroboot/*.py "$STAGING_DIR"/zeroboot/*.js
```
Verify each finding against the current code and only fix it if needed. In `@scripts/build_guest_rootfs.sh` around lines 61 - 62, The explicit chmod call for "$STAGING_DIR/zeroboot/worker.py" and "$STAGING_DIR/zeroboot/worker_node.js" is brittle and will fail if those legacy files are removed; remove the specific chmod invocation (the first chmod line) and rely on the subsequent glob chmod ("chmod 0644 \"$STAGING_DIR\"/zeroboot/*.py \"$STAGING_DIR\"/zeroboot/*.js") to set permissions for present .py/.js files, or alternatively guard the explicit chmod with an existence check for worker.py and worker_node.js if you prefer to keep it.
manifests/runtime-artifacts.lock.json (1)
14-17: Consider HTTPS for artifact downloads.
The kernel and rootfs URLs use HTTP instead of HTTPS. While SHA256 verification provides integrity checking, HTTPS adds transport-level security and authenticity verification. For a production hardening PR, consider using HTTPS endpoints if available.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@manifests/runtime-artifacts.lock.json` around lines 14 - 17, Update the artifact download URLs under the "kernel" entry (the "url" value) — and any similar entries such as "rootfs" if present — to use HTTPS instead of HTTP so transport-level security and authenticity are enforced; keep the existing sha256 verification intact and verify the HTTPS endpoints serve the same artifacts before committing the URL changes (adjust only the "url" string values for kernel/rootfs entries).
scripts/preflight.sh (1)
47-50: cgroup v2 requirement is already documented in COMPATIBILITY.md — the prerequisite is listed in the startup verification section at line 43. However, the documentation could be expanded to explain which systems lack v2 support (e.g., older Ubuntu 18.04/20.04, certain container environments) and the migration implications for affected deployments.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/preflight.sh` around lines 47 - 50, The preflight cgroup v2 check in scripts/preflight.sh is fine but COMPATIBILITY.md needs expanded docs: update the startup verification section (where cgroup v2 is already mentioned) to list typical systems lacking v2 (e.g., older Ubuntu 18.04/20.04 kernels, some minimal container hosts, certain cloud images) and explain migration implications and remediation steps (kernel upgrades, enabling systemd cgroup v2, container runtime configuration, and fallbacks). Reference the existing cgroup v2 check from scripts/preflight.sh when adding the guidance so readers can correlate the runtime check with the documented prerequisites.
src/vmm/mod.rs (1)
6-8: Avoid exposing stub module names as public API.
Lines 6 and 13 can be private modules; the canonical public surface is already provided through `pub use ... as vmstate`/`kvm`.
Suggested fix
```diff
-#[cfg(not(target_os = "linux"))]
-pub mod vmstate_stub;
+#[cfg(not(target_os = "linux"))]
+mod vmstate_stub;
 #[cfg(not(target_os = "linux"))]
 pub use vmstate_stub as vmstate;
@@
-#[cfg(not(target_os = "linux"))]
-pub mod kvm_stub;
+#[cfg(not(target_os = "linux"))]
+mod kvm_stub;
 #[cfg(not(target_os = "linux"))]
 pub use kvm_stub as kvm;
```
Also applies to: 13-15
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/vmm/mod.rs` around lines 6 - 8, The stub modules are currently declared as public (pub mod vmstate_stub) exposing their internal names; make the stub modules private by changing the declaration to mod vmstate_stub (and likewise mod kvm_stub for the similar block) while keeping the public re-exports (pub use vmstate_stub as vmstate and pub use kvm_stub as kvm) so the canonical public API remains vmstate/kvm without exposing the stub identifiers.
scripts/fetch_official_artifacts.sh (1)
43-43: Harden download behavior with retries and bounded timeouts.
Line 43 currently has no retry/timeout policy; transient network issues can cause brittle fetches.
Suggested fix
```diff
-  curl -fsSL "$url" -o "$dest"
+  curl --fail --show-error --silent --location \
+    --retry 5 --retry-all-errors --connect-timeout 15 --max-time 900 \
+    "$url" -o "$dest"
```
Verify each finding against the current code and only fix it if needed. In `@scripts/fetch_official_artifacts.sh` at line 43, Replace the fragile curl invocation `curl -fsSL "$url" -o "$dest"` with a hardened command that includes bounded timeouts and retries: ensure you add options like --retry (e.g. 5), --retry-delay (e.g. 2), --retry-connrefused, --connect-timeout (e.g. 5), and an overall --max-time (e.g. 30) while preserving -f/-s/-S/-L behavior; update the curl call in the script (the line containing `curl -fsSL "$url" -o "$dest"`) so transient network errors are retried and long hangs are avoided.
scripts/create_release_bundle.sh (1)
6-8: Add temp directory cleanup on exit.
Line 6 allocates a temp directory but it is never removed. Add an `EXIT` trap to avoid leak buildup in repeated runs.
Suggested fix
```diff
 TMPDIR="$(mktemp -d)"
 STAGE="$TMPDIR/zeroboot"
+cleanup() { rm -rf "$TMPDIR"; }
+trap cleanup EXIT
```
Verify each finding against the current code and only fix it if needed. In `@scripts/create_release_bundle.sh` around lines 6 - 8, The script creates a temporary directory via TMPDIR="$(mktemp -d)" and STAGE="$TMPDIR/zeroboot" but never removes it; add an EXIT trap to remove TMPDIR when the script exits (success or failure). Modify the top of create_release_bundle.sh to register a trap 'trap "rm -rf \"$TMPDIR\"" EXIT' after TMPDIR is created (and ensure TMPDIR is non-empty/valid before installing the trap), so cleanup always runs; reference TMPDIR and STAGE to locate where to add the trap and perform any necessary null-checking before removal.
scripts/install_node_runtime.sh (1)
19-21: Prefer one source of truth for the pinned Node artifact.
`NODE_VERSION` and `NODE_SHA256` are now duplicated here, in `scripts/fetch_official_artifacts.sh`, and in `manifests/runtime-artifacts.lock.json`. The next Node bump can easily update one copy and miss the others, breaking fetch/install parity. Pull these values from the lock file, or generate both scripts from it, so the pin only lives in one place.
Verify each finding against the current code and only fix it if needed. In `@scripts/install_node_runtime.sh` around lines 19 - 21, The NODE_VERSION and NODE_SHA256 constants in scripts/install_node_runtime.sh are duplicated across scripts/fetch_official_artifacts.sh and manifests/runtime-artifacts.lock.json; change install_node_runtime.sh to read the Node pin from the single source-of-truth lock file instead of hardcoding values: parse manifests/runtime-artifacts.lock.json at runtime (or add a small helper that reads the JSON and exports NODE_VERSION and NODE_SHA256), and then use those parsed values in place of the existing NODE_VERSION/NODE_SHA256/NODE_TARBALL assignments so the pin only lives in runtime-artifacts.lock.json and both install_node_runtime.sh and fetch_official_artifacts.sh reference that same source.
src/startup.rs (1)
111-121: KVM permission check may not correctly detect write access.
`metadata.permissions().readonly()` checks if the file's permission bits have the read-only flag set, but this doesn't account for:
- Effective user permissions (ownership, group membership)
- ACLs or SELinux/AppArmor policies
Consider using `std::fs::OpenOptions` to actually attempt opening `/dev/kvm` for writing:
Proposed fix
```diff
 fn verify_kvm() -> Result<()> {
     let kvm = Path::new("/dev/kvm");
     if !kvm.exists() {
         bail!("/dev/kvm is missing");
     }
-    let metadata = std::fs::metadata(kvm).context("stat /dev/kvm")?;
-    if metadata.permissions().readonly() {
-        bail!("/dev/kvm is not writable by the current user");
+    // Actually try to open for write to verify effective permissions
+    if std::fs::OpenOptions::new()
+        .write(true)
+        .open(kvm)
+        .is_err()
+    {
+        bail!("/dev/kvm is not writable by the current user");
     }
     Ok(())
 }
```
Verify each finding against the current code and only fix it if needed. In `@src/startup.rs` around lines 111 - 121, The verify_kvm function's permission check should attempt an actual write-open instead of relying on metadata.permissions(); replace the readonly() check in verify_kvm with a real open attempt using std::fs::OpenOptions (e.g., OpenOptions::new().write(true).open("/dev/kvm")) and return a contextualized error (using .context or bail!) if opening for write fails; keep the initial existence check and ensure the function still returns Ok(()) on success.
src/config.rs (2)
411-415: Test assertion is overly permissive and may mask regressions.
The test `prod_without_release_channel_fails` now accepts either `KEYRING` or `channel` errors, but validation runs in order and should fail on keyring first (since `keyring_path` points to a non-existent file). This creates confusion about what the test is actually verifying.
Consider tightening the assertion to match the actual expected failure:
```diff
-    assert!(
-        err.contains("KEYRING") || err.contains("channel"),
-        "expected channel-or-keyring error: {}",
-        err
-    );
+    assert!(
+        err.contains("KEYRING"),
+        "expected keyring error (file doesn't exist): {}",
+        err
+    );
```
Alternatively, if the intent is to test the release channel validation, provide a valid keyring file path that exists in the test environment.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 411 - 415, The test prod_without_release_channel_fails is too permissive: it asserts err.contains("KEYRING") || err.contains("channel"), masking which validation failed; either tighten the assertion to expect the keyring failure (make the assertion require err.contains("KEYRING") and remove the "channel" branch) since keyring_path points to a non-existent file, or adjust the test to supply a valid keyring file (create a temp keyring artifact and set keyring_path to it) and then assert that the error contains "channel" to specifically test release-channel validation; update the assertion and test setup around keyring_path and the assert! call accordingly.
447-451: Same permissive assertion issue as above.
The test `prod_with_log_code_true_fails` accepts either `KEYRING` or `log_code` errors. Since the keyring file `/etc/zeroboot/keyring` likely doesn't exist during tests, this will always match on `KEYRING` rather than testing the `log_code` validation path.
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 447 - 451, The test prod_with_log_code_true_fails is too permissive (accepts either "KEYRING" or "log_code") and thus often passes due to a missing /etc/zeroboot/keyring; change the assertion to require the "log_code" validation error only (e.g., assert!(err.contains("log_code"), "...")) so the test specifically verifies the log_code failure path, or alternatively set up a controlled keyring fixture (create a temporary keyring file or mock the keyring lookup in the test) so the error cannot come from KEYRING and the existing assertion can remain.
src/main.rs (1)
113-125: Minor: Redundant variable assignment.
The `firecracker_version` variable at lines 113-114 is assigned but then the same value is re-extracted inside the `if let` block at line 116. Consider simplifying:
```diff
-    let firecracker_version =
-        config.and_then(|c| c.artifacts.allowed_firecracker_version.as_deref());
-    if let Some(cfg) = config {
-        if let Some(firecracker_version) = cfg.artifacts.allowed_firecracker_version.as_deref() {
+    if let Some(cfg) = config {
+        if let Some(firecracker_version) = cfg.artifacts.allowed_firecracker_version.as_deref() {
```
The outer `firecracker_version` variable appears unused.
Verify each finding against the current code and only fix it if needed. In `@src/main.rs` around lines 113 - 125, Remove the redundant extraction by reusing the earlier firecracker_version variable: delete the inner nested extraction and change the conditional to check the already-computed let firecracker_version = config.and_then(|c| c.artifacts.allowed_firecracker_version.as_deref()); then replace the nested if-let with if let Some(firecracker_version) = firecracker_version { call vmstate::pre_restore_validate(&state_data, Some(firecracker_version), manifest.vcpu_count) and handle the Err path as before }, so you no longer re-extract from config while still calling vmstate::pre_restore_validate with the same symbols (firecracker_version, state_data, manifest.vcpu_count).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deploy/deploy.sh`:
- Line 137: The deploy script currently calls /v1/ready and /v1/exec without
auth which will 401 in ZEROBOOT_AUTH_MODE=prod; change the probe sequence to
call /ready for readiness (matching docs/CI) and replace the direct /v1/exec
probe with a call to the CLI helper that provides a valid probe token (use
`zeroboot test-exec`) or otherwise supply an explicit probe bearer token,
ensuring the auth header is included when invoking the execution probe so
hardened/prod installs don't false-fail; update the SSH curl invocation in
deploy.sh to use /ready and to invoke `zeroboot test-exec` (or include the probe
token) instead of raw /v1/exec.
- Around line 83-84: The staged verification call to verify-startup is forcing
ZEROBOOT_KEYRING_PATH and unconditionally setting
ZEROBOOT_ALLOWED_FIRECRACKER_VERSION, ZEROBOOT_ALLOWED_FC_BINARY_SHA256 and
ZEROBOOT_RELEASE_CHANNEL to empty values, which can make the check run with a
different policy than the service; update the ssh invocation used for
verify-startup to mirror the same conditional export/persistence logic used
later (the lines that persist knobs) so you only pass ZEROBOOT_KEYRING_PATH and
the ZEROBOOT_ALLOWED_* / ZEROBOOT_RELEASE_CHANNEL environment variables when
they are actually configured, or source the same persisted config before
invoking verify-startup to ensure both the staged check and the service run with
the identical config surface.
In `@docs/DEPLOYMENT.md`:
- Line 26: Replace the hardcoded absolute local path to the lockfile with a
repository-relative link: change the sentence that references
"/Users/dawsonblock/.../manifests/runtime-artifacts.lock.json" so it points to
the relative file path "manifests/runtime-artifacts.lock.json" (or a repo-root
URL) and update the markdown link target for "runtime-artifacts.lock.json"
accordingly to avoid exposing local filesystem paths.
In `@guest/init.c`:
- Around line 233-238: main() currently starts workers with legacy script paths
("worker.py" and "worker_node.js") while restart_worker() uses supervisor paths
("worker_supervisor.py" and "worker_supervisor.js"), causing inconsistent
behavior; update the initial argv arrays in main() for the Python and Node
workers to match restart_worker() by using "python3",
"/zeroboot/worker_supervisor.py" for the Python worker and "node",
"/zeroboot/worker_supervisor.js" for the Node worker (locate where main() calls
start_worker() for worker->name "python" and "node") so startup and restart use
the same supervisor scripts.
In `@guest/worker_child.js`:
- Around line 63-64: The sandbox currently exposes setTimeout, clearTimeout, and
Promise while relying on vm.runInNewContext(..., {timeout}) (symbols:
setTimeout, clearTimeout, Promise, runInNewContext, timeout), which lets
untrusted code schedule async callbacks that escape the VM timeout; remove these
async primitives from the sandbox to prevent scheduling (delete setTimeout,
clearTimeout, Promise from the context) or, if you must allow async work, run
the untrusted script in a separate worker/child process and enforce an external
timeout that kills/terminates the worker (use Worker.terminate() or kill the
child) instead of relying on runInNewContext's timeout. Ensure you update the
sandbox object creation where setTimeout/clearTimeout/Promise are injected and
add external process/worker termination logic if choosing the second option.
In `@guest/worker_supervisor.js`:
- Around line 125-142: Before composing the fallback response, explicitly handle
result.signal: detect if result.signal is set (the spawnSync result), truncate
stdout/stderr with truncateWithMarker as done below, compute flags using
FLAG_STDOUT_TRUNCATED / FLAG_STDERR_TRUNCATED, and then return a non-zero exit
code and a 'signaled' status (include the signal name or numeric mapping, e.g.,
use 128 + signal number if available, or -1 if not) along with the truncated
stdout, stderr, and flags; place this branch before the final return that
currently uses result.status || 0 and references parseChildResponse,
truncateWithMarker, FLAG_STDOUT_TRUNCATED, and FLAG_STDERR_TRUNCATED.
In `@manifests/firecracker.lock.json`:
- Around line 7-8: The manifest currently pins the kernel artifact with an
insecure transport by using the "kernel_url" value starting with "http://";
update the "kernel_url" to use "https://" (preserving the same host/path) so
consumers fetch the pinned kernel over TLS, and leave the "kernel_sha256" intact
to continue verifying integrity (i.e., change the scheme of the value referenced
by kernel_url to https).
In `@manifests/node-guest.manifest`:
- Line 7: The manifest pins the supervisor at /zeroboot/worker_supervisor.js but
guest/init.c still hardcodes first-boot worker paths (/zeroboot/worker_node.js
and /zeroboot/worker.py), causing a mismatch; update guest/init.c to start the
same supervisor-specified worker path used for restarts by either reading the
manifest entry or using /zeroboot/worker_supervisor.js (replace references to
/zeroboot/worker_node.js and /zeroboot/worker.py), and ensure the restart logic
and any spawn/exec calls reference the same symbol so initial boot and runtime
use the identical supervisor binary.
In `@README.md`:
- Around line 105-106: Replace the hardcoded absolute path
"/Users/dawsonblock/Downloads/XBOOT-main-2/manifests/runtime-artifacts.lock.json"
in README.md with a repository-relative path (for example
"manifests/runtime-artifacts.lock.json" or
"./manifests/runtime-artifacts.lock.json"); update the markdown link text so the
referenced file is relative (e.g.,
[runtime-artifacts.lock.json](manifests/runtime-artifacts.lock.json)) to ensure
the link works for other users cloning the repo.
In `@scripts/build_guest_rootfs.sh`:
- Around line 70-71: The initial startup paths in init.c's main() still point to
/zeroboot/worker.py and /zeroboot/worker_node.js but the manifest and
restart_worker() use the supervisor entrypoints; update main() to launch the
supervisor entrypoints instead by replacing those initial paths with
/zeroboot/worker_supervisor.py and /zeroboot/worker_supervisor.js so they match
the SHA256-written files and the restart_worker() behavior.
In `@scripts/create_release_bundle.sh`:
- Around line 13-25: The fallback branch that runs rsync (the else block
invoking rsync -a "$ROOT" "$STAGE" with many --exclude flags) assumes rsync
exists; add a preflight check before calling rsync by testing command -v rsync
(or type rsync >/dev/null 2>&1) and if not found, print a clear error message
and exit non‑zero (or implement a safe fallback like using tar/cp with the same
excludes); ensure the check appears immediately before the rsync invocation and
that the error message references rsync so users know what to install.
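A small preflight helper along the lines suggested above; `require_tool` is a hypothetical name:

```shell
# Fail fast with an actionable message when a required tool is missing,
# instead of letting rsync fail mid-copy.
require_tool() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "error: $1 is required to stage the release bundle; install $1 and retry" >&2
    return 1
  fi
}

# Immediately before the fallback branch:
# require_tool rsync || exit 1
# rsync -a "$ROOT" "$STAGE" --exclude target --exclude .git
```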
In `@scripts/fetch_official_artifacts.sh`:
- Around line 30-35: The preflight checks currently validate curl and tar but
miss verifying a SHA-256 tool; add a check alongside the existing curl/tar
checks to ensure either sha256sum or shasum is available and fail early with a
clear error message if neither exists. Update the preflight logic (the same area
that checks for curl and tar) to run command -v sha256sum >/dev/null 2>&1 ||
command -v shasum >/dev/null 2>&1 and call exit 1 with a descriptive error if it
fails; this ensures the sha256_file() function (which expects sha256sum or
shasum) will always have a valid tool.
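The extended preflight could be sketched as:

```shell
# Require a SHA-256 tool alongside the existing curl/tar checks, so
# sha256_file() never discovers the missing tool mid-download.
have_sha256_tool() {
  command -v sha256sum >/dev/null 2>&1 || command -v shasum >/dev/null 2>&1
}

# In the existing preflight block:
# have_sha256_tool || { echo "error: need sha256sum or shasum to verify downloads" >&2; exit 1; }
```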
In `@scripts/make_api_keys.py`:
- Around line 14-16: scripts/make_api_keys.py currently allows a custom --prefix
that may contain '.' which breaks the verifier expecting tokens in the form
prefix.secret; add validation where the prefix is composed (variables
prefix_prefix and prefix and when building token) to reject or sanitize any
prefix containing a '.' and raise a clear error/exit if found (e.g., check the
user-provided prefix before concatenating with uuid and also before creating
tokens at the other occurrences referenced around the prefix_prefix/prefix/token
construction) so no generated token contains a dot in its prefix portion.
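A sketch of the validation; `make_token` and its exact token layout are illustrative, not the script's real construction:

```python
import secrets
import uuid

def make_token(prefix: str) -> str:
    # Reject any prefix containing '.', since the verifier splits tokens on
    # the first '.' into prefix and secret.
    if "." in prefix:
        raise SystemExit(f"invalid --prefix {prefix!r}: '.' is reserved as the token separator")
    # uuid hex and token_urlsafe never contain '.', so the only dot is the separator.
    return f"{prefix}-{uuid.uuid4().hex[:8]}.{secrets.token_urlsafe(32)}"
```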
- Around line 36-45: Remove the inline --pepper argument and require the pepper
be supplied via --pepper-file or stdin: delete the
parser.add_argument("--pepper") line and the code path that reads args.pepper;
instead read pepper from Path(args.pepper_file).read_text().strip(), and allow
reading from stdin when args.pepper_file == "-" or when no file arg is given but
data is piped in (use sys.stdin.read().strip()); if neither a pepper file nor
stdin content is present then raise SystemExit("provide --pepper-file or pipe
pepper via stdin"). Keep references to parser, args, pepper_file and the pepper
variable so the rest of the script uses the loaded pepper.
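A sketch of the loader described above; the flag names follow the comment, and the tty check is one way to detect piped input:

```python
import sys
from pathlib import Path

def load_pepper(pepper_file):
    # Read the pepper from a file or stdin so the secret never shows up in
    # process listings (argv is world-readable via ps).
    if pepper_file and pepper_file != "-":
        return Path(pepper_file).read_text().strip()
    if pepper_file == "-" or not sys.stdin.isatty():
        data = sys.stdin.read().strip()
        if data:
            return data
    raise SystemExit("provide --pepper-file or pipe pepper via stdin")
```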
In `@src/api/handlers.rs`:
- Around line 510-515: The current try_send call on state.request_log_tx treats
all errors the same; update the error handling around
state.request_log_tx.try_send(line.to_string()) to match on Tokio's TrySendError
variants and increment the appropriate metric: if the error is
TrySendError::Full increment state.metrics.request_log_drops.fetch_add(1,
Ordering::Relaxed), but if it is TrySendError::Closed increment
state.metrics.request_log_write_failures.fetch_add(1, Ordering::Relaxed) (or
create request_log_write_failures if missing); ensure you still drop/ignore the
message after handling each variant so the logging path distinguishes
backpressure (Full) from logger shutdown (Closed).
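A dependency-free sketch of the split using std's channel error, whose `Full`/`Disconnected` variants mirror tokio's `TrySendError::Full`/`Closed`; the metric names follow the comment:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::TrySendError;

struct Metrics {
    request_log_drops: AtomicU64,
    request_log_write_failures: AtomicU64,
}

fn record_send_error(metrics: &Metrics, err: TrySendError<String>) {
    match err {
        // Channel full: backpressure, count a drop and discard the line.
        TrySendError::Full(_) => {
            metrics.request_log_drops.fetch_add(1, Ordering::Relaxed);
        }
        // Receiver gone: the logger task has shut down.
        TrySendError::Disconnected(_) => {
            metrics.request_log_write_failures.fetch_add(1, Ordering::Relaxed);
        }
    }
}
```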
In `@src/startup.rs`:
- Around line 236-238: The explicit u64 casts on statvfs fields are redundant;
update the allocation where free_bytes and free_inodes are set to use the stat
fields directly (replace (stat.f_bavail as u64).saturating_mul(stat.f_frsize as
u64) with stat.f_bavail.saturating_mul(stat.f_frsize) and replace stat.f_favail
as u64 with stat.f_favail) in the code that constructs the struct containing
free_bytes and free_inodes (references: free_bytes, free_inodes, stat).
- Around line 152-162: The code currently compares the raw version_output stdout
(variable version) against config.artifacts.allowed_firecracker_version, but
firecracker prints a "Firecracker v" prefix and possible git suffixes; parse the
version string from version_output.stdout (e.g., strip a leading "Firecracker v"
or apply a regex to capture the semantic version and optional git suffix like
"-695-g...-dirty") and normalize it before comparing to
allowed_firecracker_version. Update the logic around version (the
String::from_utf8_lossy -> version variable) to extract and trim the actual
numeric version token, then compare that parsed_version to
config.artifacts.allowed_firecracker_version.as_deref() in the existing if let
Some(...) block.
In `@tests/test_guest_worker_subprocess.py`:
- Around line 70-80: The test test_temp_directory_is_scoped_per_request
currently only writes a file and checks it exists in a single run; modify it to
run the child twice using self.run_child (call it twice with the same small
snippet that prints os.environ['ZEROBOOT_TMPDIR'] and optionally creates a
file), capture both stdout paths, assert the two printed ZEROBOOT_TMPDIR values
are different (ensuring per-request isolation), and additionally ensure the
first request's temp directory is cleaned up after exit by asserting the first
path no longer exists (or that the file created in the first run is absent) when
the second run starts; reference the test_temp_directory_is_scoped_per_request
function, self.run_child helper, and the ZEROBOOT_TMPDIR env var to locate where
to change the test.
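The behavior the strengthened test should assert can be modeled in isolation; `run_request` below is a stand-in for the real child invocation, not the project's `self.run_child` helper:

```python
import os
import shutil
import tempfile

def run_request():
    # Each "request" gets a fresh ZEROBOOT_TMPDIR-style directory that is
    # removed when the request ends; the path is returned for inspection.
    tmpdir = tempfile.mkdtemp(prefix="zb-req-")
    try:
        return tmpdir  # the child would observe this path via the env var
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)

first = run_request()
second = run_request()
assert first != second            # per-request isolation
assert not os.path.exists(first)  # prior request's dir is cleaned up
```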
In `@tests/test_repo_sanity.py`:
- Around line 20-22: The placeholder check in the loop currently only looks for
the literal token "=REQUIRED" and misses variants like "key = REQUIRED"; update
the conditional that filters lines before self.fail to detect "REQUIRED" with
optional spaces around the '=' (e.g. use a regex search for patterns like
"\b=\s*REQUIRED\b" or check for both '=' and 'REQUIRED' spaced) so any "key =
REQUIRED" or "key= REQUIRED" variants trigger the unresolved-placeholder
failure; adjust the condition around the variables line and path.name
accordingly.
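The widened check could use a regex like the following (`has_unresolved_placeholder` is an illustrative name):

```python
import re

# Catch "=REQUIRED" with optional whitespace around '=', while ignoring
# longer identifiers such as REQUIRED_BY_POLICY via the trailing \b.
PLACEHOLDER = re.compile(r"=\s*REQUIRED\b")

def has_unresolved_placeholder(line: str) -> bool:
    return bool(PLACEHOLDER.search(line))
```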
---
Outside diff comments:
In `@guest/worker_child.js`:
- Around line 52-58: The current in-memory capture buffers stdoutParts and
stderrParts (filled by sandbox.console.log and sandbox.console.error) never
enforce stdout_bytes/stderr_bytes limits until truncateWithMarker() runs, which
can OOM; modify the capture so each push respects the byte cap by tracking the
accumulated byte count per stream and only appending up to the remaining allowed
bytes, and when the cap is reached append the truncation marker immediately and
stop adding further content; implement this behavior via a small helper used by
sandbox.console.log and sandbox.console.error (and replicate the same
capped-append logic for the other capture block around lines 106-114) so
truncation/makers are applied incrementally rather than only at the end.
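A sketch of such a capped sink; `makeCappedSink` and the marker handling are assumptions, not the worker_child.js API:

```javascript
// Cap captured output per stream as it arrives instead of truncating at the
// end, so a chatty script cannot exhaust memory before the final response.
function makeCappedSink(maxBytes, marker) {
  const parts = [];
  let used = 0;
  let truncated = false;
  return {
    push(chunk) {
      if (truncated) return; // cap already reached, drop further output
      const buf = Buffer.from(chunk, 'utf8');
      const remaining = maxBytes - used;
      if (buf.length <= remaining) {
        parts.push(buf);
        used += buf.length;
      } else {
        // Append only up to the cap, then the truncation marker, then stop.
        parts.push(buf.subarray(0, remaining));
        parts.push(Buffer.from(marker, 'utf8'));
        used = maxBytes;
        truncated = true;
      }
    },
    value() { return Buffer.concat(parts).toString('utf8'); },
    get truncated() { return truncated; },
  };
}
```

One sink per stream (stdout, stderr) replaces the unbounded `stdoutParts`/`stderrParts` arrays, and the `truncated` getter feeds the existing truncation flags.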
In `@manifests/node-build.lock.json`:
- Around line 1-13: The CI manifest-schema validation is failing because build
lock files matching *.lock.json are being validated as template manifests;
update the artifact-verify/manifest-schema check so it skips *.lock.json (or
change the manifest glob from manifests/*.json to exclude *.lock.json), or move
node-build.lock.json out of the manifests/*.json glob; adjust the
artifact-verify step (the job that runs the manifest-schema gate) to explicitly
ignore files matching *.lock.json or add a separate glob for template manifests
that requires schema_version, template_id and promotion_channel.
In `@src/api/handlers.rs`:
- Around line 1100-1110: Create the final detail string before computing health
and then call classify_template_health on that same string; specifically,
compute a local variable (e.g., detail) set to "probe ok" when ready is true or
to probe.stderr when not, use that variable for TemplateStatus.detail, and pass
it to classify_template_health instead of startup_status.detail so health and
detail reflect the same failure text; update the code paths around
classify_template_health, TemplateStatus { detail: ... }, and any use of
startup_status.detail to use the new detail variable.
In `@tests/test_worker_protocol.py`:
- Around line 60-66: Rename the test function
test_python_timeout_requests_recycle to a name that reflects the supervisor
behavior (e.g., test_python_timeout_no_recycle) and update any references; keep
the body and assertions (including the flags & 4 check and assertions on rid,
exit_code, error_type, and stderr) unchanged so the test continues to verify
that timeouts do not set the recycle flag in the supervisor architecture.
---
Nitpick comments:
In `@manifests/runtime-artifacts.lock.json`:
- Around line 14-17: Update the artifact download URLs under the "kernel" entry
(the "url" value) — and any similar entries such as "rootfs" if present — to use
HTTPS instead of HTTP so transport-level security and authenticity are enforced;
keep the existing sha256 verification intact and verify the HTTPS endpoints
serve the same artifacts before committing the URL changes (adjust only the
"url" string values for kernel/rootfs entries).
In `@scripts/build_guest_rootfs.sh`:
- Around line 61-62: The explicit chmod call for
"$STAGING_DIR/zeroboot/worker.py" and "$STAGING_DIR/zeroboot/worker_node.js" is
brittle and will fail if those legacy files are removed; remove the specific
chmod invocation (the first chmod line) and rely on the subsequent glob chmod
("chmod 0644 \"$STAGING_DIR\"/zeroboot/*.py \"$STAGING_DIR\"/zeroboot/*.js") to
set permissions for present .py/.js files, or alternatively guard the explicit
chmod with an existence check for worker.py and worker_node.js if you prefer to
keep it.
In `@scripts/create_release_bundle.sh`:
- Around line 6-8: The script creates a temporary directory via TMPDIR="$(mktemp
-d)" and STAGE="$TMPDIR/zeroboot" but never removes it; add an EXIT trap to
remove TMPDIR when the script exits (success or failure). Modify the top of
create_release_bundle.sh to register a trap 'trap "rm -rf \"$TMPDIR\"" EXIT'
after TMPDIR is created (and ensure TMPDIR is non-empty/valid before installing
the trap), so cleanup always runs; reference TMPDIR and STAGE to locate where to
add the trap and perform any necessary null-checking before removal.
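A sketch of the trap placement; the variable is renamed `STAGE_TMP` here to avoid shadowing the standard `TMPDIR` environment variable, which `mktemp` itself consults:

```shell
# Register the cleanup trap immediately after mktemp succeeds so the staging
# tree is removed on any exit path, success or failure.
STAGE_TMP="$(mktemp -d)"
[ -n "$STAGE_TMP" ] && [ -d "$STAGE_TMP" ] || { echo "mktemp failed" >&2; exit 1; }
trap 'rm -rf "$STAGE_TMP"' EXIT
STAGE="$STAGE_TMP/zeroboot"
mkdir -p "$STAGE"
```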
In `@scripts/fetch_official_artifacts.sh`:
- Line 43: Replace the fragile curl invocation `curl -fsSL "$url" -o "$dest"`
with a hardened command that includes bounded timeouts and retries: ensure you
add options like --retry (e.g. 5), --retry-delay (e.g. 2), --retry-connrefused,
--connect-timeout (e.g. 5), and an overall --max-time (e.g. 30) while preserving
-f/-s/-S/-L behavior; update the curl call in the script (the line containing
`curl -fsSL "$url" -o "$dest"`) so transient network errors are retried and long
hangs are avoided.
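The hardened invocation could be wrapped as follows; `fetch_artifact` is a hypothetical name and the numeric values are illustrative defaults, not project policy:

```shell
# Bounded retries and timeouts around the artifact download, preserving the
# original -f/-s/-S/-L behavior.
fetch_artifact() {
  url="$1"; dest="$2"
  curl -fsSL \
    --retry 5 --retry-delay 2 --retry-connrefused \
    --connect-timeout 5 --max-time 300 \
    "$url" -o "$dest"
}
```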
In `@scripts/install_node_runtime.sh`:
- Around line 19-21: The NODE_VERSION and NODE_SHA256 constants in
scripts/install_node_runtime.sh are duplicated across
scripts/fetch_official_artifacts.sh and manifests/runtime-artifacts.lock.json;
change install_node_runtime.sh to read the Node pin from the single
source-of-truth lock file instead of hardcoding values: parse
manifests/runtime-artifacts.lock.json at runtime (or add a small helper that
reads the JSON and exports NODE_VERSION and NODE_SHA256), and then use those
parsed values in place of the existing NODE_VERSION/NODE_SHA256/NODE_TARBALL
assignments so the pin only lives in runtime-artifacts.lock.json and both
install_node_runtime.sh and fetch_official_artifacts.sh reference that same
source.
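One way to read the pin from the lock file at runtime; the JSON key names (`node`, `version`, `sha256`) are assumptions about the lock file's layout:

```shell
# read_lock <lock.json> <artifact> <field>: extract one pinned value so the
# pin lives only in runtime-artifacts.lock.json.
read_lock() {
  python3 - "$1" "$2" "$3" <<'PY'
import json, sys
path, key, field = sys.argv[1:4]
print(json.load(open(path))[key][field])
PY
}

# NODE_VERSION="$(read_lock manifests/runtime-artifacts.lock.json node version)"
# NODE_SHA256="$(read_lock manifests/runtime-artifacts.lock.json node sha256)"
# NODE_TARBALL="node-v${NODE_VERSION}-linux-x64.tar.xz"
```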
In `@scripts/preflight.sh`:
- Around line 47-50: The preflight cgroup v2 check in scripts/preflight.sh is
fine but COMPATIBILITY.md needs expanded docs: update the startup verification
section (where cgroup v2 is already mentioned) to list typical systems lacking
v2 (e.g., older Ubuntu 18.04/20.04 kernels, some minimal container hosts,
certain cloud images) and explain migration implications and remediation steps
(kernel upgrades, enabling systemd cgroup v2, container runtime configuration,
and fallbacks). Reference the existing cgroup v2 check from scripts/preflight.sh
when adding the guidance so readers can correlate the runtime check with the
documented prerequisites.
In `@src/config.rs`:
- Around line 411-415: The test prod_without_release_channel_fails is too
permissive: it asserts err.contains("KEYRING") || err.contains("channel"),
masking which validation failed; either tighten the assertion to expect the
keyring failure (make the assertion require err.contains("KEYRING") and remove
the "channel" branch) since keyring_path points to a non-existent file, or
adjust the test to supply a valid keyring file (create a temp keyring artifact
and set keyring_path to it) and then assert that the error contains "channel" to
specifically test release-channel validation; update the assertion and test
setup around keyring_path and the assert! call accordingly.
- Around line 447-451: The test prod_with_log_code_true_fails is too permissive
(accepts either "KEYRING" or "log_code") and thus often passes due to a missing
/etc/zeroboot/keyring; change the assertion to require the "log_code" validation
error only (e.g., assert!(err.contains("log_code"), "...")) so the test
specifically verifies the log_code failure path, or alternatively set up a
controlled keyring fixture (create a temporary keyring file or mock the keyring
lookup in the test) so the error cannot come from KEYRING and the existing
assertion can remain.
In `@src/main.rs`:
- Around line 113-125: Remove the redundant extraction by reusing the earlier
firecracker_version variable: delete the inner nested extraction and change the
conditional to check the already-computed let firecracker_version =
config.and_then(|c| c.artifacts.allowed_firecracker_version.as_deref()); then
replace the nested if-let with if let Some(firecracker_version) =
firecracker_version { call vmstate::pre_restore_validate(&state_data,
Some(firecracker_version), manifest.vcpu_count) and handle the Err path as
before }, so you no longer re-extract from config while still calling
vmstate::pre_restore_validate with the same symbols (firecracker_version,
state_data, manifest.vcpu_count).
In `@src/startup.rs`:
- Around line 111-121: The verify_kvm function's permission check should attempt
an actual write-open instead of relying on metadata.permissions(); replace the
readonly() check in verify_kvm with a real open attempt using
std::fs::OpenOptions (e.g., OpenOptions::new().write(true).open("/dev/kvm")) and
return a contextualized error (using .context or bail!) if opening for write
fails; keep the initial existence check and ensure the function still returns
Ok(()) on success.
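A sketch of the probe; in `verify_kvm` this would be called with `Path::new("/dev/kvm")`, and error context would be attached at the call site:

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

// Probe writability with a real open(2) call rather than
// metadata.permissions().readonly(), which ignores ownership and ACLs.
fn verify_writable(path: &Path) -> io::Result<()> {
    OpenOptions::new().write(true).open(path).map(|_| ())
}
```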
In `@src/vmm/mod.rs`:
- Around line 6-8: The stub modules are currently declared as public (pub mod
vmstate_stub) exposing their internal names; make the stub modules private by
changing the declaration to mod vmstate_stub (and likewise mod kvm_stub for the
similar block) while keeping the public re-exports (pub use vmstate_stub as
vmstate and pub use kvm_stub as kvm) so the canonical public API remains
vmstate/kvm without exposing the stub identifiers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 34318ed8-b85b-4d7d-9a25-cf463da71f6e
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (42)
.gitattributes, .github/workflows/ci.yml, .gitignore, Cargo.toml, README.md, deploy/deploy.sh, deploy/zeroboot.service, docs/API.md, docs/COMPATIBILITY.md, docs/DEPLOYMENT.md, guest/init.c, guest/worker_child.js, guest/worker_child.py, guest/worker_supervisor.js, guest/worker_supervisor.py, manifests/firecracker.lock.json, manifests/kernel.manifest, manifests/node-build.lock.json, manifests/node-guest.manifest, manifests/python-build.lock.json, manifests/python-guest.manifest, manifests/runtime-artifacts.lock.json, scripts/build_guest_rootfs.sh, scripts/create_release_bundle.sh, scripts/fetch_official_artifacts.sh, scripts/install_node_runtime.sh, scripts/make_api_keys.py, scripts/preflight.sh, src/api/handlers.rs, src/config.rs, src/main.rs, src/signing.rs, src/startup.rs, src/template_manifest.rs, src/vmm/kvm_stub.rs, src/vmm/mod.rs, src/vmm/vmstate.rs, src/vmm/vmstate_stub.rs, tests/test_docs_sanity.py, tests/test_guest_worker_subprocess.py, tests/test_repo_sanity.py, tests/test_worker_protocol.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This pull request introduces several improvements and clarifications to the XBOOT project, focusing on deployment safety, artifact pinning, environment configuration, and documentation. The most significant changes are the addition of explicit artifact compatibility and verification steps, enhanced environment variable handling for resource limits, and updated documentation reflecting these changes.
Deployment and Release Process Improvements:
- The deployment script (deploy/deploy.sh) and systemd service (deploy/zeroboot.service) now run a staged startup verification (verify-startup) before switching releases, ensuring all templates and resources are valid before going live. [1] [2]
- New watermark environment variables (ZEROBOOT_MIN_FREE_BYTES, ZEROBOOT_MIN_FREE_INODES) are introduced and set in both deployment and systemd service files to enforce minimum disk space and inode availability. [1] [2]

Artifact Pinning and Compatibility:
- A compatibility matrix is documented in docs/COMPATIBILITY.md, specifying exact versions and sources for Firecracker, the guest kernel, rootfs, Python, and Node.js runtimes.
- README.md now includes explicit artifact pinning, instructions for fetching official artifacts, and updated build instructions for Python and Node.js guest images. [1] [2]

Environment and Build System Enhancements:
- The CI workflow (.github/workflows/ci.yml) now checks syntax for the new supervisor/child worker scripts, includes the new shell scripts in syntax checks, and adds a release bundle step that excludes build artifacts using .gitattributes. [1] [2]

Documentation and Usability Updates:
- README.md has been updated to clarify the intended use (controlled internal use, offline-only), the artifact matrix, and the deployment process, including new environment variables and the startup verification step. [1] [2] [3] [4] [5]
- The API documentation (docs/API.md) now describes the new hashed API key format, bearer authentication, rate limits, overload responses, and how to generate API key records. [1] [2]

Other Notable Changes:
- .gitattributes is updated to exclude build and temporary directories from release bundles.

These changes collectively improve the safety, clarity, and reproducibility of XBOOT deployments and make the artifact requirements and system expectations explicit.