refactor(locomo): use clack prompts, add full context, partial run #51

Merged
kwaa merged 16 commits into main from refactor/locomo on Mar 28, 2026

Conversation

@kwaa
Member

@kwaa kwaa commented Mar 28, 2026

No description provided.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the LoCoMo benchmark tool, introducing an interactive CLI powered by @clack/prompts, a checkpointing system for resuming interrupted runs, and a "Full Context" baseline comparison mode. The orchestration logic has been moved to a new runner.ts, and state persistence is handled via checkpoint.ts. Feedback suggests enhancing type safety by utilizing Zod for JSON parsing in checkpoint.ts and cli.ts, restoring progress reporting during the ingestion phase, and deduplicating the runWithConcurrency utility across the codebase.
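
The deduplication suggestion above amounts to keeping one shared worker-pool helper. A minimal sketch of what such a helper might look like follows; the signature and the claim-next-index approach are assumptions, since the actual `runWithConcurrency` in this repository is not shown in the thread.

```typescript
// Hypothetical signature: the real runWithConcurrency is not shown in this PR thread.
async function runWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1)
    throw new RangeError('limit must be a positive integer')

  const results = new Array<R>(items.length)
  let next = 0 // index of the next unclaimed item

  // Each runner repeatedly claims the next index until none remain,
  // so at most one worker call is in flight per runner.
  const runner = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++
      results[index] = await worker(items[index], index)
    }
  }

  // Start at most `limit` runners, and never more than there are items.
  const runners = Array.from(
    { length: Math.min(limit, items.length) },
    () => runner(),
  )
  await Promise.all(runners)
  return results // same order as `items`, regardless of completion order
}
```

Because results are written by index, callers get output in input order even when later items finish first. A rejected worker rejects the whole call; a real benchmark runner may prefer to collect per-item errors instead.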

@kwaa
Member Author

kwaa commented Mar 28, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the LoCoMo benchmark into an interactive CLI tool using @clack/prompts, replacing the previous argument-based execution. Key enhancements include a new checkpointing system for resuming interrupted runs, the introduction of a 'Full Context' baseline for comparison against plast-mem, and a modularized architecture that separates benchmark orchestration, statistics calculation, and schema validation. The README has been updated to document the new interactive options and the revised results JSON structure. Feedback is provided regarding the README's JSON example, suggesting a more descriptive representation of the nested statistics to improve clarity.

kwaa and others added 2 commits March 28, 2026 16:59
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@kwaa
Member Author

kwaa commented Mar 28, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the LoCoMo benchmark into an interactive CLI using @clack/prompts, adding a checkpointing system for resuming runs and a 'Full Context' baseline comparison. The logic is modularized into specialized files for runner orchestration, checkpointing, and Zod-based schema validation. Feedback suggests removing an unused function and adopting a more idiomatic Array.from syntax for task initialization.
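
For reference, the `Array.from` idiom mentioned above builds and fills an array in one step. The sketch below is illustrative only: `createTasks` and the `Task` shape are hypothetical names, not code from this PR.

```typescript
// Hypothetical task shape, for illustration only.
interface Task { id: number, status: 'pending' }

function createTasks(count: number): Task[] {
  // With a length-only source the mapper's first argument is undefined,
  // so only the index (second argument) is used.
  return Array.from({ length: count }, (_, id) => ({ id, status: 'pending' as const }))
}
```

Compared with `new Array(count).fill(null).map(...)`, this avoids the intermediate filled array and the placeholder value.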

@kwaa kwaa merged commit b80669d into main Mar 28, 2026
@kwaa kwaa deleted the refactor/locomo branch March 28, 2026 09:13
@kwaa
Member Author

kwaa commented Mar 28, 2026

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request refactors the LoCoMo benchmark into an interactive evaluation tool. It adds interactive prompts for configuring a run, a full context baseline for comparison against plast-mem, and a checkpointing system so that interrupted runs can resume. The changes also reorganize the code into dedicated modules.

Highlights

  • Interactive CLI with Clack Prompts: The benchmark CLI has been refactored to use @clack/prompts for an interactive user experience, allowing dynamic selection of models, sample scopes, comparison modes, and LLM judge scoring at runtime.
  • Full Context Baseline: A new 'Full Context Baseline' has been introduced as an optional comparison mode, which feeds the entire chronological conversation transcript directly to the answer model without retrieval or reranking, providing a non-RAG style evaluation.
  • Checkpointing and Partial Runs: A robust checkpointing system has been implemented, enabling the benchmark to save its progress at granular stages (ingest, eval, score) and resume from the latest compatible checkpoint, facilitating partial runs and recovery from interruptions.
  • Improved Benchmark Output Structure: The benchmark output JSON structure has been enhanced to clearly separate results and statistics for different variants (plast-mem, full context) and include a dedicated comparison section, providing a more comprehensive overview.
  • Code Refactoring and Modularity: Key functionalities like concurrency management, checkpoint handling, full context building, and data schema validation have been extracted into dedicated modules (concurrency.ts, checkpoint.ts, full-context.ts, schemas.ts), improving code organization and maintainability.
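
The checkpointing highlight above describes per-stage progress (ingest, eval, score) and resuming only from a compatible checkpoint. One way this could work is sketched below. Every name here (`RunConfig`, `Checkpoint`, `fingerprint`, `remainingStages`) is hypothetical, and the fingerprint is a plain sorted-key serialization rather than whatever `checkpoint.ts` actually computes.

```typescript
// All names and shapes are assumptions; the real checkpoint.ts is not shown.
type Stage = 'ingest' | 'eval' | 'score'
const STAGES: readonly Stage[] = ['ingest', 'eval', 'score']

interface RunConfig {
  model: string
  sampleIds: string[]
  useLlmJudge: boolean
  compareFullContext: boolean
}

interface Checkpoint {
  fingerprint: string
  // Last completed stage per sample ID; a missing key means nothing is done yet.
  completed: Record<string, Stage>
}

// Serialize with sorted keys so equal configs always yield the same string.
function fingerprint(config: RunConfig): string {
  const entries = Object.keys(config)
    .sort()
    .map(key => [key, config[key as keyof RunConfig]])
  return JSON.stringify(entries)
}

// Stages still to run for one sample. A missing checkpoint, or one written
// for a different configuration, means starting from the first stage.
function remainingStages(
  checkpoint: Checkpoint | undefined,
  config: RunConfig,
  sampleId: string,
): Stage[] {
  if (checkpoint === undefined || checkpoint.fingerprint !== fingerprint(config))
    return [...STAGES]
  const done = checkpoint.completed[sampleId]
  return done === undefined
    ? [...STAGES]
    : STAGES.slice(STAGES.indexOf(done) + 1)
}
```

Comparing fingerprints before reuse keeps results from one model or sample scope from leaking into a run configured differently.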

Changelog

  • benchmarks/locomo/README.md
    • Updated usage instructions to reflect the new interactive CLI.
    • Revised the pipeline description to include optional Full Context QA and checkpoint persistence.
    • Added detailed sections for 'Interactive Options', 'Resume / Checkpoint', and 'Full Context Baseline'.
    • Updated the output JSON structure example to include variants and comparison fields.
    • Expanded the 'Source Files' table to list newly added modules like checkpoint.ts, full-context.ts, and runner.ts.
  • benchmarks/locomo/package.json
    • Added @clack/prompts dependency for interactive CLI features.
    • Removed picospinner dependency, as its functionality is now handled by @clack/prompts.
  • benchmarks/locomo/src/checkpoint.ts
    • Added a new module to define interfaces and functions for managing benchmark checkpoints, including creation, loading, saving, and fingerprinting of run configurations and sample progress.
  • benchmarks/locomo/src/cli.ts
    • Rewrote the CLI entry point to use @clack/prompts for interactive configuration and flow control.
    • Integrated checkpoint loading and saving logic, allowing users to resume previous benchmark runs.
    • Removed old argument parsing logic and direct QA/evaluation stages, delegating execution to the new runner.ts module.
    • Added error handling for missing OPENAI_CHAT_MODEL and OPENAI_API_KEY environment variables.
  • benchmarks/locomo/src/concurrency.ts
    • Added a new utility module to encapsulate the runWithConcurrency function, promoting code reuse and reducing duplication.
  • benchmarks/locomo/src/full-context.ts
    • Added a new module to construct a full chronological transcript of a conversation for the 'Full Context Baseline' evaluation.
  • benchmarks/locomo/src/ingest.ts
    • Updated to use @clack/prompts spinner for progress reporting during ingestion.
    • Refactored to import runWithConcurrency from the new shared concurrency.ts module.
    • Modified getOrderedSessions to be exported for use by other modules.
    • Removed direct file I/O for conversation IDs, as this is now managed by the checkpoint system.
  • benchmarks/locomo/src/runner.ts
    • Added a new module to orchestrate the benchmark execution, including sample ingestion, variant evaluation (plast-mem and full context), scoring, and checkpoint persistence.
    • Implemented functions to build the final benchmark output structure and print summary statistics.
  • benchmarks/locomo/src/schemas.ts
    • Added a new module to define Zod schemas for robust validation of LoCoMo samples, QA pairs, benchmark run configurations, and checkpoint structures.
  • benchmarks/locomo/src/stats.ts
    • Modified to use @clack/prompts for logging and improved table rendering for statistics.
    • Introduced new functions (computeComparison, renderComparison) to calculate and display comparison metrics between benchmark variants.
    • Refactored stat printing functions for better readability and modularity.
  • benchmarks/locomo/src/types.ts
    • Updated with new interfaces for BenchmarkComparisonMetric, BenchmarkComparisonSummary, BenchmarkMeta, BenchmarkOutput, BenchmarkVariant, BenchmarkVariantOutput, and PendingQAResult to support the new output structure and checkpointing.
  • benchmarks/locomo/src/wait.ts
    • Modified to use @clack/prompts for logging status messages.
    • Refactored internal helper functions (collectStatuses, flushReadyConversations, removeCompletedConversations) for better organization and clarity.
  • cspell.config.yaml
    • Added new words 'antfu', 'clack', 'conv', and 'vitejs' to the cspell configuration.
  • docs/architecture/daily_memory_optimizations.md
    • Removed trailing newline at the end of the file.
  • docs/architecture/longmemeval_optimizations.md
    • Removed trailing newline at the end of the file.
  • docs/architecture/retrieve_memory.md
    • Corrected indentation for the 'content' field in a JSON example.
  • docs/todo/README.md
    • Removed trailing newline at the end of the file.
  • pnpm-lock.yaml
    • Updated various dependency versions, including @antfu/eslint-config and eslint-plugin-jsonc.
    • Added @clack/prompts and zod as direct dependencies for benchmarks/locomo.
    • Removed picospinner dependency.
    • Added new transitive dependencies like eslint-compat-utils, espree, graphemer, and jsonc-eslint-parser.
  • pnpm-workspace.yaml
    • Added shellEmulator, trustPolicy, and trustPolicyExclude configurations.
    • Updated @antfu/eslint-config version in the catalog.
    • Added @clack/prompts to the catalog.
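
The `full-context.ts` entry in the changelog above describes building one chronological transcript per conversation. A rough sketch of that idea follows; the `Turn` and `Session` shapes, the header format, and the function name `buildFullContext` are assumptions rather than the module's real API.

```typescript
// Hypothetical shapes; the real LoCoMo sample schema is not shown in this thread.
interface Turn { speaker: string, text: string }
interface Session { index: number, dateTime: string, turns: Turn[] }

function buildFullContext(sessions: readonly Session[]): string {
  return [...sessions] // copy first so the caller's array is not reordered
    .sort((a, b) => a.index - b.index) // chronological by session index
    .map((session) => {
      const lines = session.turns.map(turn => `${turn.speaker}: ${turn.text}`)
      return [`[Session ${session.index} | ${session.dateTime}]`, ...lines].join('\n')
    })
    .join('\n\n') // blank line between sessions
}
```

The resulting string would be passed to the answer model as-is, with no retrieval or reranking step in between.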

Activity

  • kwaa requested a review of the pull request.
  • kwaa requested a summary of the pull request.
  • The bot identified an unsafe use of JSON.parse in loadCheckpoint and suggested using Zod for robust parsing, which was addressed by the addition of src/schemas.ts and src/checkpoint.ts.
  • The bot noted an unsafe use of JSON.parse in loadSamples and recommended Zod for validation, which was addressed by the addition of src/schemas.ts.
  • The bot pointed out the removal of progress reporting during sample ingestion and suggested restoring it, which was likely addressed by the new @clack/prompts spinner implementation in src/ingest.ts.
  • The bot highlighted the duplication of runWithConcurrency and recommended extracting it to a shared utility, which was addressed by the creation of src/concurrency.ts.
  • The bot suggested clarifying the stats JSON output example in the documentation, which was addressed by the updated benchmarks/locomo/README.md and src/stats.ts changes.
  • The bot identified an unused isCheckpointCompatible function, which was likely removed or refactored as part of the new checkpointing system.
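
The `JSON.parse` items above share one concern: parsed JSON has no guaranteed shape until it is checked. The PR addresses this with Zod schemas. The dependency-free sketch below swaps in a hand-written type guard to show the same principle; `SampleFile` and its fields are hypothetical and do not reflect the actual schemas in `src/schemas.ts`.

```typescript
// Hypothetical shape, for illustration only.
interface SampleFile { sampleId: string, questions: string[] }

// Narrow an unknown value to SampleFile by checking every field.
function isSampleFile(value: unknown): value is SampleFile {
  if (typeof value !== 'object' || value === null)
    return false
  const record = value as Record<string, unknown>
  return typeof record.sampleId === 'string'
    && Array.isArray(record.questions)
    && record.questions.every(question => typeof question === 'string')
}

// Returns undefined for malformed JSON or a wrong shape instead of throwing,
// so the caller decides how to report or recover.
function parseSampleFile(raw: string): SampleFile | undefined {
  let parsed: unknown
  try {
    parsed = JSON.parse(raw)
  }
  catch {
    return undefined
  }
  return isSampleFile(parsed) ? parsed : undefined
}
```

A schema library gives the same guarantee with less code and better error messages, which is presumably why the review recommended it.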
