
docs: expand seed dataset docs for filesystem sources #452

Open
stepwise-ai-dev wants to merge 4 commits into NVIDIA-NeMo:main from stepwise-ai-dev:stepwise-ai-dev/docs/448-seed-dataset-docs

Conversation


@stepwise-ai-dev stepwise-ai-dev commented Mar 23, 2026

Summary

  • expand the seed dataset concept docs to cover the shipped filesystem-backed seed sources
  • document DirectorySeedSource and FileContentsSeedSource, including file_pattern, recursive, encoding, and the seeded columns they expose
  • fix the broken preview example to use data_designer.preview(...)

Closes #448.

Validation

  • cross-checked the docs against packages/data-designer-config/src/data_designer/config/seed_source.py
  • cross-checked the docs against packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
  • cross-checked the documented behavior against existing interface and seed reader tests
  • did not run make test locally because this machine has uv 0.5.11, which cannot parse the repo's newer tool.uv.required-version / uv.lock format

@stepwise-ai-dev stepwise-ai-dev requested a review from a team as a code owner March 23, 2026 16:39
Contributor

github-actions bot commented Mar 23, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

Contributor

greptile-apps bot commented Mar 23, 2026

Greptile Summary

This PR expands the seed dataset concept docs to cover the two shipped filesystem-backed seed sources (DirectorySeedSource and FileContentsSeedSource), and fixes a broken variable name in the preview example (designer → data_designer).

  • Column schemas for both new sources are verified correct against DirectorySeedReader and FileContentsSeedReader in seed_reader.py: source_kind, source_path, relative_path, file_name for directory, plus content for file contents, with the correct string literals ("directory_file" / "file_contents")
  • The file_pattern, recursive, and encoding parameter defaults and behaviors documented here match their FileSystemSeedSource field definitions in seed_source.py
  • The hard-coded source count ("three") has been replaced with open-ended phrasing ("multiple ways … including:"), avoiding stale-count issues as more sources are added
  • The plugin tip link (../plugins/example.md) resolves to an existing file
  • One minor documentation gap: the file_pattern note doesn't mention that matching is case-sensitive (the engine calls fnmatchcase) or that the default pattern is "*"
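The case-sensitivity and basename-only behaviour noted in the last bullet can be reproduced with the standard library alone. The sketch below is illustrative: `matches` is a hypothetical helper, not part of Data Designer, and it only assumes what the review states, namely that the engine applies `fnmatchcase` to the file name and that the default pattern is `"*"`.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def matches(relative_path: str, file_pattern: str = "*") -> bool:
    """Hypothetical helper: match the pattern against the basename only, case-sensitively."""
    return fnmatchcase(PurePosixPath(relative_path).name, file_pattern)


print(matches("docs/readme.md", "*.md"))    # True
print(matches("docs/readme.md", "*.MD"))    # False: fnmatchcase never folds case
print(matches("docs/readme.md", "docs/*"))  # False: the pattern only sees "readme.md"
print(matches("docs/readme.md"))            # True: the default "*" matches every file
```

Unlike `fnmatch.fnmatch`, which normalizes case according to the host operating system, `fnmatchcase` behaves the same on every platform.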

Confidence Score: 5/5

Safe to merge; all documented behaviours are verified correct against the implementation.

The only finding is a P2 documentation gap (missing case-sensitivity and default-value callout in the filesystem matching note). All column schemas, default values, and code examples have been cross-checked against the implementation and are accurate. The preview fix is correct.

No files require special attention.

Important Files Changed

Filename Overview
docs/concepts/seed-datasets.md Expands seed dataset docs with DirectorySeedSource and FileContentsSeedSource sections; column schemas verified correct against implementation; preview fix is accurate

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    U([User config]) --> LS[LocalFileSeedSource]
    U --> HF[HuggingFaceSeedSource]
    U --> DF[DataFrameSeedSource]
    U --> DS[DirectorySeedSource]
    U --> FC[FileContentsSeedSource]
    U --> AR[AgentRolloutSeedSource]

    DS -->|file_pattern + recursive| DM["build_manifest()\n→ one row per matched file"]
    FC -->|file_pattern + recursive + encoding| FM["build_manifest()\n→ one row per matched file"]
    FM --> HY["hydrate_row()\n→ reads file bytes → content column"]

    DM --> DCols["source_kind · source_path\nrelative_path · file_name"]
    HY --> FCols["source_kind · source_path\nrelative_path · file_name · content"]

    DCols --> Engine[DataDesigner Engine]
    FCols --> Engine
    LS --> Engine
    HF --> Engine
    DF --> Engine
    AR --> Engine

    Engine --> Jinja["Jinja2 template rendering\n{{ relative_path }}, {{ content }}, …"]
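The two-phase flow in the flowchart (a manifest of matched files, then a hydration step that reads file bytes into `content`) can be sketched roughly as follows. This is a simplified illustration, not the code in `seed_reader.py`: the function names are borrowed from the flowchart labels, their signatures are hypothetical, and the `"utf-8"` default for `encoding` is an assumption.

```python
import tempfile
from fnmatch import fnmatchcase
from pathlib import Path


def build_manifest(root: Path, file_pattern: str = "*", recursive: bool = True) -> list[dict]:
    """Hypothetical phase 1: one metadata row per matched file, without reading any file."""
    candidates = root.rglob("*") if recursive else root.glob("*")
    rows = []
    for path in sorted(candidates):
        # The pattern is applied to the basename only, case-sensitively.
        if path.is_file() and fnmatchcase(path.name, file_pattern):
            rows.append({
                "source_kind": "file_contents",
                "source_path": str(path),
                "relative_path": path.relative_to(root).as_posix(),
                "file_name": path.name,
            })
    return rows


def hydrate_row(row: dict, encoding: str = "utf-8") -> dict:
    """Hypothetical phase 2: read the file bytes and decode them into `content`."""
    data = Path(row["source_path"]).read_bytes()
    return {**row, "content": data.decode(encoding)}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.md").write_text("alpha", encoding="utf-8")
    (root / "nested" / "b.md").write_text("beta", encoding="utf-8")
    (root / "skip.txt").write_text("ignored", encoding="utf-8")

    # Recursive search finds both Markdown files; skip.txt does not match "*.md".
    rows = [hydrate_row(r) for r in build_manifest(root, file_pattern="*.md")]
    # With recursive=False only the top-level file is listed.
    flat = build_manifest(root, file_pattern="*.md", recursive=False)

print([r["relative_path"] for r in rows])
print([r["file_name"] for r in flat])
```

Keeping the manifest free of file reads means a directory-style source can stop after phase 1, while a file-contents source pays the I/O cost only for rows it actually hydrates.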
Prompt To Fix All With AI
This is a comment left during a code review.
Path: docs/concepts/seed-datasets.md
Line: 130-132

Comment:
**Case-sensitivity and default `file_pattern` not mentioned**

The note documents that `file_pattern` matches basenames only and that `recursive=True` is the default, but omits two useful facts:

1. **Case-sensitivity**: The engine uses `fnmatchcase` (see `seed_reader.py`, line 256), which is case-sensitive on all platforms. A pattern like `"*.MD"` will not match `readme.md`. This is worth calling out since most OS-level file search tools are case-insensitive on macOS and Windows.

2. **Default value**: The `file_pattern` field defaults to `"*"` (matches all files). The note already covers the `recursive` default — symmetrically documenting the `file_pattern` default would help users who omit it.

```suggestion
!!! note "Filesystem matching"
    `file_pattern` matches file names only, not relative paths, and the match is **case-sensitive** on all platforms (e.g. `"*.MD"` will not match `readme.md`). The default pattern is `"*"` (matches every file). `recursive=True` is the default, so nested subdirectories are searched unless you turn it off.
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (3): Last reviewed commit: "docs: avoid stale seed source count" | Re-trigger Greptile

Comment on lines +161 to +163
`FileContentsSeedSource` adds one extra seeded column:

- `content` - decoded text contents of the matched file
Contributor


P1 source_kind value undocumented, implicitly misleading

The section says FileContentsSeedSource exposes "the same metadata as DirectorySeedSource". The DirectorySeedSource section explicitly documents source_kind as always "directory_file", so a reader will naturally infer that FileContentsSeedSource also emits source_kind = "directory_file".

In fact, the implementation sets a different value:

# packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py, line 544
_build_metadata_record(
    context=context,
    relative_path=relative_path,
    source_kind="file_contents",   # ← different from "directory_file"
)

Any user who filters or branches on source_kind (e.g. {% if source_kind == "directory_file" %}) would get silent wrong behaviour when using FileContentsSeedSource.

Please list the full column schema explicitly, matching the runtime implementation (output_columns on FileContentsSeedReader):

Suggested change

Replace:

`FileContentsSeedSource` adds one extra seeded column:

- `content` - decoded text contents of the matched file

With:

`FileContentsSeedSource` exposes these seeded columns:

- `source_kind` - always `"file_contents"`
- `source_path` - full path to the matched file
- `relative_path` - path relative to the configured directory
- `file_name` - basename of the matched file
- `content` - decoded text contents of the matched file
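To make the `source_kind` pitfall concrete, here is a small plain-Python sketch of the kind of branch a template might take. The rows are hypothetical and hand-written to match the column schemas listed in this thread; `describe` merely stands in for a template conditional such as `{% if source_kind == "directory_file" %}`.

```python
# Hypothetical rows shaped like the documented column schemas.
directory_row = {"source_kind": "directory_file", "file_name": "readme.md"}
contents_row = {
    "source_kind": "file_contents",
    "file_name": "readme.md",
    "content": "# Hello",
}


def describe(row: dict) -> str:
    """Stand-in for a template that branches on the source_kind column."""
    if row["source_kind"] == "directory_file":
        return f"metadata only: {row['file_name']}"
    if row["source_kind"] == "file_contents":
        return f"with content: {row['file_name']}"
    return "unknown source"


print(describe(directory_row))  # metadata only: readme.md
print(describe(contents_row))   # with content: readme.md
```

A template that checks only for `"directory_file"` would skip every `FileContentsSeedSource` row without raising an error, which is why the full schema is worth spelling out.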

Author


Fixed in b7ecc5d.

@stepwise-ai-dev
Author

I have read the DCO document and I hereby sign the DCO.

## Seed Sources

Before: Data Designer supports three ways to provide seed data:
After: Data Designer supports five ways to provide seed data:
Contributor


P1 Source count is off by one

The PR increments the count from three to five, but AgentRolloutSeedSource is also a shipped seed source — it has its own recipe at docs/recipes/trace_ingestion/agent_rollout_distillation.md and is featured on docs/recipes/cards.md. That makes six sources in total, so the sentence is factually wrong.

Either update the count to "six" and add a brief entry for AgentRolloutSeedSource, or rephrase to avoid embedding a hard count (e.g., "Data Designer supports multiple ways to provide seed data:").

Suggested change

Replace: Data Designer supports five ways to provide seed data:
With: Data Designer supports six ways to provide seed data:

Author


Fixed in e62977b by rephrasing the sentence to avoid a stale hardcoded count.

Development

Successfully merging this pull request may close these issues.

Docs: expand seed dataset docs for filesystem seed sources and fix preview example typo

1 participant