refactor(locomo): use clack prompts, add full context, partial run by kwaa · Pull Request #51 · moeru-ai/plast-mem

kwaa · 2026-03-28T08:35:22Z

No description provided.

gemini-code-assist

Code Review

This pull request refactors the LoCoMo benchmark tool, introducing an interactive CLI powered by @clack/prompts, a checkpointing system for resuming interrupted runs, and a "Full Context" baseline comparison mode. The orchestration logic has been moved to a new runner.ts, and state persistence is handled via checkpoint.ts. Feedback suggests enhancing type safety by utilizing Zod for JSON parsing in checkpoint.ts and cli.ts, restoring progress reporting during the ingestion phase, and deduplicating the runWithConcurrency utility across the codebase.
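The Zod suggestion amounts to never trusting the raw result of `JSON.parse`. A minimal sketch of that idea follows. To stay dependency-free it uses a hand-written type guard in place of a Zod schema, and the `CheckpointSketch` shape, its field names, and the `parseCheckpoint` helper are hypothetical rather than taken from checkpoint.ts.

```typescript
// Hypothetical checkpoint shape; the real one in checkpoint.ts may differ.
interface CheckpointSketch {
  version: number
  fingerprint: string
  completedSampleIds: string[]
}

// Type guard standing in for a Zod schema: checks every field at runtime.
function isCheckpoint(value: unknown): value is CheckpointSketch {
  if (typeof value !== 'object' || value === null)
    return false
  const v = value as Record<string, unknown>
  return typeof v.version === 'number'
    && typeof v.fingerprint === 'string'
    && Array.isArray(v.completedSampleIds)
    && v.completedSampleIds.every(id => typeof id === 'string')
}

// Returns a validated checkpoint, or null when the text is malformed JSON
// or does not match the expected shape (instead of an unchecked cast).
function parseCheckpoint(text: string): CheckpointSketch | null {
  let raw: unknown
  try {
    raw = JSON.parse(text)
  }
  catch {
    return null
  }
  return isCheckpoint(raw) ? raw : null
}
```

With Zod, a schema's `safeParse` would take the place of the guard, and the static type could be inferred from the schema instead of being declared separately.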

benchmarks/locomo/src/checkpoint.ts

benchmarks/locomo/src/cli.ts

benchmarks/locomo/src/ingest.ts

benchmarks/locomo/src/runner.ts

kwaa · 2026-03-28T08:49:04Z

/gemini review

gemini-code-assist

Code Review

This pull request refactors the LoCoMo benchmark into an interactive CLI tool using @clack/prompts, replacing the previous argument-based execution. Key enhancements include a new checkpointing system for resuming interrupted runs, the introduction of a 'Full Context' baseline for comparison against plast-mem, and a modularized architecture that separates benchmark orchestration, statistics calculation, and schema validation. The README has been updated to document the new interactive options and the revised results JSON structure. Feedback is provided regarding the README's JSON example, suggesting a more descriptive representation of the nested statistics to improve clarity.
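The nested statistics mentioned in the README feedback can be pictured as per-variant scores plus a comparison block derived from them. The sketch below is an assumption about that layout: `VariantStats`, `ComparisonSketch`, and `compareVariants` are illustrative names, not the types in types.ts or the logic in stats.ts.

```typescript
// Hypothetical per-variant statistics keyed by metric name.
type VariantStats = Record<string, number>

interface ComparisonSketch {
  metric: string
  plastMem: number
  fullContext: number
  delta: number // plastMem minus fullContext
}

// Builds one comparison row per metric that both variants report.
function compareVariants(plastMem: VariantStats, fullContext: VariantStats): ComparisonSketch[] {
  return Object.entries(plastMem).flatMap(([metric, score]) => {
    const baseline = fullContext[metric]
    // Skip metrics the baseline variant did not produce.
    return baseline === undefined
      ? []
      : [{ metric, plastMem: score, fullContext: baseline, delta: score - baseline }]
  })
}
```

A positive `delta` in this sketch means plast-mem scored higher than the full context baseline on that metric.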

benchmarks/locomo/README.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

kwaa · 2026-03-28T09:02:18Z

/gemini review

gemini-code-assist

Code Review

This pull request refactors the LoCoMo benchmark into an interactive CLI using @clack/prompts, adding a checkpointing system for resuming runs and a 'Full Context' baseline comparison. The logic is modularized into specialized files for runner orchestration, checkpointing, and Zod-based schema validation. Feedback suggests removing an unused function and adopting a more idiomatic Array.from syntax for task initialization.
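For context, the `Array.from` idiom applies naturally to a bounded worker pool. The following is a sketch of what a shared concurrency helper might look like, not the code in concurrency.ts: the signature, the ordering guarantee, and the pool-size clamp are all assumptions.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight.
// Results keep the input order. The signature is an assumption.
async function runWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length)
  let next = 0

  // Each runner claims the next unprocessed index until none remain.
  const runner = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++
      results[index] = await worker(items[index], index)
    }
  }

  // Array.from with a length object replaces a manual push loop.
  const poolSize = Math.max(1, Math.min(limit, items.length))
  await Promise.all(Array.from({ length: poolSize }, () => runner()))
  return results
}
```

Because `next++` runs synchronously between awaits, two runners never claim the same index.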

benchmarks/locomo/src/checkpoint.ts

benchmarks/locomo/src/concurrency.ts

kwaa · 2026-03-28T09:13:47Z

/gemini summary

gemini-code-assist · 2026-03-28T09:26:02Z

Summary of Changes

This pull request refactors the LoCoMo benchmark into an interactive, resumable evaluation tool. It adds interactive prompts for configuring a run, a full context baseline for comparison against plast-mem, and a checkpointing system for recovering from interruptions. It also splits the code into smaller modules so that runs are easier to configure, execute, and analyze.

Highlights

- Interactive CLI with Clack Prompts: The benchmark CLI has been refactored to use @clack/prompts for an interactive user experience, allowing dynamic selection of models, sample scopes, comparison modes, and LLM judge scoring at runtime.
- Full Context Baseline: A new 'Full Context Baseline' has been introduced as an optional comparison mode, which feeds the entire chronological conversation transcript directly to the answer model without retrieval or reranking, providing a non-RAG style evaluation.
- Checkpointing and Partial Runs: A robust checkpointing system has been implemented, enabling the benchmark to save its progress at granular stages (ingest, eval, score) and resume from the latest compatible checkpoint, facilitating partial runs and recovery from interruptions.
- Improved Benchmark Output Structure: The benchmark output JSON structure has been enhanced to clearly separate results and statistics for different variants (plast-mem, full context) and include a dedicated comparison section, providing a more comprehensive overview.
- Code Refactoring and Modularity: Key functionalities like concurrency management, checkpoint handling, full context building, and data schema validation have been extracted into dedicated modules (concurrency.ts, checkpoint.ts, full-context.ts, schemas.ts), improving code organization and maintainability.
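The checkpointing highlight describes per-stage progress (ingest, eval, score) and resuming only from a compatible checkpoint. One way to picture that is sketched below. The stage names come from the summary. The `RunConfigSketch` and `SampleProgress` shapes, the sorted-key fingerprint, and the `nextStage` helper are hypothetical.

```typescript
// Stage names come from the summary; everything else here is hypothetical.
const STAGES = ['ingest', 'eval', 'score'] as const
type Stage = typeof STAGES[number]

interface RunConfigSketch {
  model: string
  sampleIds: string[]
  compareFullContext: boolean
}

interface SampleProgress {
  fingerprint: string
  completed: Stage[]
}

// Serializes the config with sorted keys so equal configs yield equal strings.
// A real implementation might hash this string; this sketch compares it as is.
function fingerprint(config: RunConfigSketch): string {
  const sorted = Object.keys(config)
    .sort()
    .map(key => [key, config[key as keyof RunConfigSketch]])
  return JSON.stringify(sorted)
}

// Returns the first stage still to run, or null when the sample is finished.
// A missing or incompatible checkpoint restarts the sample from ingest.
function nextStage(config: RunConfigSketch, saved: SampleProgress | undefined): Stage | null {
  if (saved === undefined || saved.fingerprint !== fingerprint(config))
    return 'ingest'
  const done = saved.completed
  return STAGES.find(stage => !done.includes(stage)) ?? null
}
```

Comparing fingerprints before reusing progress keeps results produced under a different model or sample selection from leaking into the resumed run.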

Changelog

benchmarks/locomo/README.md
- Updated usage instructions to reflect the new interactive CLI.
- Revised the pipeline description to include optional Full Context QA and checkpoint persistence.
- Added detailed sections for 'Interactive Options', 'Resume / Checkpoint', and 'Full Context Baseline'.
- Updated the output JSON structure example to include variants and comparison fields.
- Expanded the 'Source Files' table to list newly added modules like checkpoint.ts, full-context.ts, and runner.ts.
benchmarks/locomo/package.json
- Added @clack/prompts dependency for interactive CLI features.
- Removed picospinner dependency, as its functionality is now handled by @clack/prompts.
benchmarks/locomo/src/checkpoint.ts
- Added a new module to define interfaces and functions for managing benchmark checkpoints, including creation, loading, saving, and fingerprinting of run configurations and sample progress.
benchmarks/locomo/src/cli.ts
- Rewrote the CLI entry point to use @clack/prompts for interactive configuration and flow control.
- Integrated checkpoint loading and saving logic, allowing users to resume previous benchmark runs.
- Removed old argument parsing logic and direct QA/evaluation stages, delegating execution to the new runner.ts module.
- Added error handling for missing OPENAI_CHAT_MODEL and OPENAI_API_KEY environment variables.
benchmarks/locomo/src/concurrency.ts
- Added a new utility module to encapsulate the runWithConcurrency function, promoting code reuse and reducing duplication.
benchmarks/locomo/src/full-context.ts
- Added a new module to construct a full chronological transcript of a conversation for the 'Full Context Baseline' evaluation.
benchmarks/locomo/src/ingest.ts
- Updated to use @clack/prompts spinner for progress reporting during ingestion.
- Refactored to import runWithConcurrency from the new shared concurrency.ts module.
- Modified getOrderedSessions to be exported for use by other modules.
- Removed direct file I/O for conversation IDs, as this is now managed by the checkpoint system.
benchmarks/locomo/src/runner.ts
- Added a new module to orchestrate the benchmark execution, including sample ingestion, variant evaluation (plast-mem and full context), scoring, and checkpoint persistence.
- Implemented functions to build the final benchmark output structure and print summary statistics.
benchmarks/locomo/src/schemas.ts
- Added a new module to define Zod schemas for robust validation of LoCoMo samples, QA pairs, benchmark run configurations, and checkpoint structures.
benchmarks/locomo/src/stats.ts
- Modified to use @clack/prompts for logging and improved table rendering for statistics.
- Introduced new functions (computeComparison, renderComparison) to calculate and display comparison metrics between benchmark variants.
- Refactored stat printing functions for better readability and modularity.
benchmarks/locomo/src/types.ts
- Updated with new interfaces for BenchmarkComparisonMetric, BenchmarkComparisonSummary, BenchmarkMeta, BenchmarkOutput, BenchmarkVariant, BenchmarkVariantOutput, and PendingQAResult to support the new output structure and checkpointing.
benchmarks/locomo/src/wait.ts
- Modified to use @clack/prompts for logging status messages.
- Refactored internal helper functions (collectStatuses, flushReadyConversations, removeCompletedConversations) for better organization and clarity.
cspell.config.yaml
- Added new words 'antfu', 'clack', 'conv', and 'vitejs' to the cspell configuration.
docs/architecture/daily_memory_optimizations.md
- Removed trailing newline at the end of the file.
docs/architecture/longmemeval_optimizations.md
- Removed trailing newline at the end of the file.
docs/architecture/retrieve_memory.md
- Corrected indentation for the 'content' field in a JSON example.
docs/todo/README.md
- Removed trailing newline at the end of the file.
pnpm-lock.yaml
- Updated various dependency versions, including @antfu/eslint-config and eslint-plugin-jsonc.
- Added @clack/prompts and zod as direct dependencies for benchmarks/locomo.
- Removed picospinner dependency.
- Added new transitive dependencies like eslint-compat-utils, espree, graphemer, and jsonc-eslint-parser.
pnpm-workspace.yaml
- Added shellEmulator, trustPolicy, and trustPolicyExclude configurations.
- Updated @antfu/eslint-config version in the catalog.
- Added @clack/prompts to the catalog.
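Of the modules listed in the changelog, full-context.ts is the easiest to illustrate: it turns a conversation into one chronological transcript for the baseline. The sketch below shows the general idea only. The `Turn` and `Session` shapes, the header format, and `buildFullContext` are assumptions, and the actual LoCoMo sample format may differ.

```typescript
// Hypothetical session and turn shapes; the LoCoMo sample format may differ.
interface Turn {
  speaker: string
  text: string
}

interface Session {
  index: number // position of the session within the conversation
  dateTime: string
  turns: Turn[]
}

// Orders sessions by index and flattens them into one plain-text transcript
// that can be placed directly in the answer model's prompt.
function buildFullContext(sessions: readonly Session[]): string {
  return [...sessions]
    .sort((a, b) => a.index - b.index)
    .map((session) => {
      const header = `Session ${session.index} (${session.dateTime})`
      const lines = session.turns.map(turn => `${turn.speaker}: ${turn.text}`)
      return [header, ...lines].join('\n')
    })
    .join('\n\n')
}
```

Copying the input before sorting leaves the caller's session array untouched, which matters if the same sample is also ingested for the plast-mem variant.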

Activity

- kwaa requested a review of the pull request.
- kwaa requested a summary of the pull request.
- The bot identified an unsafe use of JSON.parse in loadCheckpoint and suggested using Zod for robust parsing, which was addressed by the addition of src/schemas.ts and src/checkpoint.ts.
- The bot noted an unsafe use of JSON.parse in loadSamples and recommended Zod for validation, which was addressed by the addition of src/schemas.ts.
- The bot pointed out the removal of progress reporting during sample ingestion and suggested restoring it, which was likely addressed by the new @clack/prompts spinner implementation in src/ingest.ts.
- The bot highlighted the duplication of runWithConcurrency and recommended extracting it to a shared utility, which was addressed by the creation of src/concurrency.ts.
- The bot suggested clarifying the stats JSON output example in the documentation, which was addressed by the updated benchmarks/locomo/README.md and src/stats.ts changes.
- The bot identified an unused isCheckpointCompatible function, which was likely removed or refactored as part of the new checkpointing system.

kwaa added 11 commits March 28, 2026 15:01

refactor(locomo): use clack prompts, add full context

827923d

chore(deps): revert antfu config

4246eb0

chore: lint code

6f73219

refactor: clean

73666f1

refactor(locomo): subset by default

cb29d4d

refactor(locomo): simplify

a95007a

refactor(locomo): more option

6be29eb

refactor(locomo): clean

a3d725e

fix(locomo): checkpoint path

8fb929c

refactor: replace console

6b27139

refactor(locomo): improve log

2483910

gemini-code-assist bot reviewed Mar 28, 2026

benchmarks/locomo/src/checkpoint.ts (resolved)

benchmarks/locomo/src/cli.ts (resolved)

benchmarks/locomo/src/ingest.ts (outdated, resolved)

benchmarks/locomo/src/runner.ts (outdated, resolved)

kwaa added 2 commits March 28, 2026 16:43

refactor(locomo): improve log

a940c6e

refactor: improve

032616d

gemini-code-assist bot reviewed Mar 28, 2026

benchmarks/locomo/README.md (outdated, resolved)

kwaa and others added 2 commits March 28, 2026 16:59

fix(locomo): skip category 5

7e77775

Update benchmarks/locomo/README.md

e0017f5

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

gemini-code-assist bot reviewed Mar 28, 2026

benchmarks/locomo/src/checkpoint.ts (outdated, resolved)

benchmarks/locomo/src/concurrency.ts (outdated, resolved)

refactor: apply review suggest

f83201c

kwaa merged commit b80669d into main Mar 28, 2026

kwaa deleted the refactor/locomo branch March 28, 2026 09:13

kwaa merged 16 commits into main from refactor/locomo
