docs: expand seed dataset docs for filesystem sources by stepwise-ai-dev · Pull Request #452 · NVIDIA-NeMo/DataDesigner

stepwise-ai-dev · 2026-03-23T16:39:58Z

Summary

- expand the seed dataset concept docs to cover the shipped filesystem-backed seed sources
- document DirectorySeedSource and FileContentsSeedSource, including file_pattern, recursive, encoding, and the seeded columns they expose
- fix the broken preview example to use data_designer.preview(...)

Closes #448.

Validation

- cross-checked the docs against packages/data-designer-config/src/data_designer/config/seed_source.py
- cross-checked the docs against packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
- cross-checked the documented behavior against existing interface and seed reader tests
- did not run make test locally because this machine has uv 0.5.11, which cannot parse the repo's newer tool.uv.required-version / uv.lock format

github-actions · 2026-03-23T16:40:09Z

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

greptile-apps · 2026-03-23T16:42:28Z

Greptile Summary

This PR expands the seed dataset concept docs to cover the two shipped filesystem-backed seed sources (DirectorySeedSource and FileContentsSeedSource), and fixes a broken variable name in the preview example (designer → data_designer).

- Column schemas for both new sources are verified correct against DirectorySeedReader and FileContentsSeedReader in seed_reader.py: source_kind, source_path, relative_path, file_name for directory, plus content for file contents, with the correct string literals ("directory_file" / "file_contents")
- The file_pattern, recursive, and encoding parameter defaults and behaviors documented here match their FileSystemSeedSource field definitions in seed_source.py
- The hard-coded source count ("three") has been replaced with open-ended phrasing ("multiple ways … including:"), avoiding stale-count issues as more sources are added
- The plugin tip link (../plugins/example.md) resolves to an existing file
- One minor documentation gap: the file_pattern note doesn't mention that matching is case-sensitive (the engine calls fnmatchcase) or that the default pattern is "*"
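To make the case-sensitivity gap concrete, here is a minimal sketch using the standard library's `fnmatch.fnmatchcase`, which the review says the engine calls. The `matches` helper is hypothetical and is not taken from `seed_reader.py`; it only mirrors the documented behaviour (basename-only matching, default pattern `"*"`).

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def matches(relative_path: str, file_pattern: str = "*") -> bool:
    """Hypothetical helper: match the basename only, case-sensitively."""
    return fnmatchcase(PurePosixPath(relative_path).name, file_pattern)


# Case-sensitive on every platform: "*.MD" does not match "readme.md".
assert matches("docs/readme.md", "*.md")
assert not matches("docs/readme.md", "*.MD")

# The default pattern "*" matches every file name.
assert matches("docs/nested/notes.txt")

# Only the basename is compared, so a pattern containing a directory never matches.
assert not matches("docs/readme.md", "docs/*.md")
```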

Confidence Score: 5/5

Safe to merge; all documented behaviours are verified correct against the implementation.

The only finding is a P2 documentation gap (missing case-sensitivity and default-value callout in the filesystem matching note). All column schemas, default values, and code examples have been cross-checked against the implementation and are accurate. The preview fix is correct.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/concepts/seed-datasets.md | Expands seed dataset docs with DirectorySeedSource and FileContentsSeedSource sections; column schemas verified correct against implementation; preview fix is accurate |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    U([User config]) --> LS[LocalFileSeedSource]
    U --> HF[HuggingFaceSeedSource]
    U --> DF[DataFrameSeedSource]
    U --> DS[DirectorySeedSource]
    U --> FC[FileContentsSeedSource]
    U --> AR[AgentRolloutSeedSource]

    DS -->|file_pattern + recursive| DM["build_manifest()\n→ one row per matched file"]
    FC -->|file_pattern + recursive + encoding| FM["build_manifest()\n→ one row per matched file"]
    FM --> HY["hydrate_row()\n→ reads file bytes → content column"]

    DM --> DCols["source_kind · source_path\nrelative_path · file_name"]
    HY --> FCols["source_kind · source_path\nrelative_path · file_name · content"]

    DCols --> Engine[DataDesigner Engine]
    FCols --> Engine
    LS --> Engine
    HF --> Engine
    DF --> Engine
    AR --> Engine

    Engine --> Jinja["Jinja2 template rendering\n{{ relative_path }}, {{ content }}, …"]
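
The two filesystem branches of the flowchart can be approximated in plain Python. This is a rough sketch, not the engine's code: `build_manifest` and `hydrate_row` below are stand-ins that borrow the names from the diagram, their signatures are assumptions, and the `utf-8` default for `encoding` is also an assumption. Only the column names and the `source_kind` literals come from the review.

```python
import tempfile
from fnmatch import fnmatchcase
from pathlib import Path


def build_manifest(root, file_pattern="*", recursive=True, source_kind="directory_file"):
    """Stand-in: emit one metadata row per matched file under root."""
    root = Path(root)
    candidates = root.rglob("*") if recursive else root.iterdir()
    rows = []
    for path in sorted(candidates):
        # Match on the file name only, case-sensitively.
        if path.is_file() and fnmatchcase(path.name, file_pattern):
            rows.append({
                "source_kind": source_kind,
                "source_path": str(path),
                "relative_path": path.relative_to(root).as_posix(),
                "file_name": path.name,
            })
    return rows


def hydrate_row(row, encoding="utf-8"):
    """Stand-in: read the file bytes and add the decoded text as content."""
    content = Path(row["source_path"]).read_bytes().decode(encoding)
    return {**row, "content": content}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.md").write_text("alpha", encoding="utf-8")
    (root / "sub" / "b.md").write_text("beta", encoding="utf-8")
    manifest = build_manifest(root, "*.md", source_kind="file_contents")
    for row in map(hydrate_row, manifest):
        print(row["relative_path"], row["source_kind"], row["content"])
```

Splitting the manifest from hydration means the directory-only case never has to open a file, which matches the diagram: only the file-contents branch passes through the hydration step.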

Prompt To Fix All With AI

This is a comment left during a code review.
Path: docs/concepts/seed-datasets.md
Line: 130-132

Comment:
**Case-sensitivity and default `file_pattern` not mentioned**

The note documents that `file_pattern` matches basenames only and that `recursive=True` is the default, but omits two useful facts:

1. **Case-sensitivity**: The engine uses `fnmatchcase` (see `seed_reader.py`, line 256), which is case-sensitive on all platforms. A pattern like `"*.MD"` will not match `readme.md`. This is worth calling out since most OS-level file search tools are case-insensitive on macOS and Windows.

2. **Default value**: The `file_pattern` field defaults to `"*"` (matches all files). The note already covers the `recursive` default — symmetrically documenting the `file_pattern` default would help users who omit it.

```suggestion
!!! note "Filesystem matching"
    `file_pattern` matches file names only, not relative paths, and the match is **case-sensitive** on all platforms (e.g. `"*.MD"` will not match `readme.md`). The default pattern is `"*"` (matches every file). `recursive=True` is the default, so nested subdirectories are searched unless you turn it off.
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (3): Last reviewed commit: "docs: avoid stale seed source count"

greptile-apps · 2026-03-23T16:42:32Z

docs/concepts/seed-datasets.md

+`FileContentsSeedSource` adds one extra seeded column:
+
+- `content` - decoded text contents of the matched file


source_kind value undocumented, implicitly misleading

The section says FileContentsSeedSource exposes "the same metadata as DirectorySeedSource". The DirectorySeedSource section explicitly documents source_kind as always "directory_file", so a reader will naturally infer that FileContentsSeedSource also emits source_kind = "directory_file".

In fact, the implementation sets a different value:

    # packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py, line 544
    _build_metadata_record(
        context=context,
        relative_path=relative_path,
        source_kind="file_contents",  # ← different from "directory_file"
    )

Any user who filters or branches on source_kind (e.g. {% if source_kind == "directory_file" %}) would get silent wrong behaviour when using FileContentsSeedSource.
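
As an illustration of that failure mode, the sketch below branches on `source_kind` over hand-written rows. The `describe` function and the sample rows are hypothetical; only the two string literals come from the review. Raising on an unexpected value is one way to turn a silent mismatch into a visible error.

```python
def describe(row: dict) -> str:
    """Hypothetical consumer that branches on source_kind."""
    kind = row["source_kind"]
    if kind == "directory_file":
        return f"{row['file_name']}: metadata only"
    if kind == "file_contents":
        return f"{row['file_name']}: {len(row['content'])} characters"
    # Fail loudly instead of silently falling through.
    raise ValueError(f"unexpected source_kind: {kind!r}")


directory_row = {"source_kind": "directory_file", "file_name": "a.md"}
contents_row = {"source_kind": "file_contents", "file_name": "b.md", "content": "beta"}

print(describe(directory_row))
print(describe(contents_row))
```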

Please list the full column schema explicitly, matching the runtime implementation (output_columns on FileContentsSeedReader):

Suggested change

`FileContentsSeedSource` adds one extra seeded column:

- `content` - decoded text contents of the matched file

`FileContentsSeedSource` exposes these seeded columns:

- `source_kind` - always `"file_contents"`

- `source_path` - full path to the matched file

- `relative_path` - path relative to the configured directory

- `file_name` - basename of the matched file

- `content` - decoded text contents of the matched file


Fixed in b7ecc5d.

stepwise-ai-dev · 2026-03-23T16:44:36Z

I have read the DCO document and I hereby sign the DCO.

greptile-apps · 2026-03-23T16:57:46Z

docs/concepts/seed-datasets.md

 ## Seed Sources

-Data Designer supports three ways to provide seed data:
+Data Designer supports five ways to provide seed data:


Source count is off by one

The PR increments the count from three to five, but AgentRolloutSeedSource is also a shipped seed source — it has its own recipe at docs/recipes/trace_ingestion/agent_rollout_distillation.md and is featured on docs/recipes/cards.md. That makes six sources in total, so the sentence is factually wrong.

Either update the count to "six" and add a brief entry for AgentRolloutSeedSource, or rephrase to avoid embedding a hard count (e.g., "Data Designer supports multiple ways to provide seed data:").

Suggested change

Data Designer supports five ways to provide seed data:

Data Designer supports six ways to provide seed data:


Fixed in e62977b by rephrasing the sentence to avoid a stale hardcoded count.

docs: expand seed dataset docs for filesystem sources

86234c7

stepwise-ai-dev requested a review from a team as a code owner March 23, 2026 16:39

greptile-apps bot reviewed Mar 23, 2026


stepwise-ai-dev and others added 2 commits March 23, 2026 09:44

Merge branch 'main' into stepwise-ai-dev/docs/448-seed-dataset-docs

0d18601

docs: clarify file contents seed source columns

b7ecc5d

greptile-apps bot reviewed Mar 23, 2026


docs: avoid stale seed source count

e62977b

stepwise-ai-dev wants to merge 4 commits into NVIDIA-NeMo:main from stepwise-ai-dev:stepwise-ai-dev/docs/448-seed-dataset-docs
