docs: expand seed dataset docs for filesystem sources #452
stepwise-ai-dev wants to merge 4 commits into NVIDIA-NeMo:main from
All contributors have signed the DCO ✍️ ✅
Greptile Summary

This PR expands the seed dataset concept docs to cover the two shipped filesystem-backed seed sources (
| Filename | Overview |
|---|---|
| docs/concepts/seed-datasets.md | Expands seed dataset docs with DirectorySeedSource and FileContentsSeedSource sections; column schemas verified correct against implementation; preview fix is accurate |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
U([User config]) --> LS[LocalFileSeedSource]
U --> HF[HuggingFaceSeedSource]
U --> DF[DataFrameSeedSource]
U --> DS[DirectorySeedSource]
U --> FC[FileContentsSeedSource]
U --> AR[AgentRolloutSeedSource]
DS -->|file_pattern + recursive| DM["build_manifest()\n→ one row per matched file"]
FC -->|file_pattern + recursive + encoding| FM["build_manifest()\n→ one row per matched file"]
FM --> HY["hydrate_row()\n→ reads file bytes → content column"]
DM --> DCols["source_kind · source_path\nrelative_path · file_name"]
HY --> FCols["source_kind · source_path\nrelative_path · file_name · content"]
DCols --> Engine[DataDesigner Engine]
FCols --> Engine
LS --> Engine
HF --> Engine
DF --> Engine
AR --> Engine
Engine --> Jinja["Jinja2 template rendering\n{{ relative_path }}, {{ content }}, …"]
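The filesystem branch of the flowchart can be approximated with a short standard-library sketch. This is not the engine's code: `build_manifest_sketch` and `hydrate_row_sketch` are hypothetical stand-ins for `build_manifest()` and `hydrate_row()`, the POSIX-style `relative_path` and the sorted traversal order are assumptions, and only the column names and `source_kind` values come from the review comments on this PR.

```python
import tempfile
from fnmatch import fnmatchcase
from pathlib import Path


def build_manifest_sketch(root, file_pattern="*", recursive=True, source_kind="directory_file"):
    """Hypothetical stand-in for build_manifest(): one metadata row per matched file."""
    root = Path(root)
    candidates = root.rglob("*") if recursive else root.iterdir()
    rows = []
    for path in sorted(candidates):
        # The pattern is applied to the basename only, case-sensitively.
        if path.is_file() and fnmatchcase(path.name, file_pattern):
            rows.append(
                {
                    "source_kind": source_kind,
                    "source_path": str(path),
                    "relative_path": path.relative_to(root).as_posix(),  # assumption: POSIX-style
                    "file_name": path.name,
                }
            )
    return rows


def hydrate_row_sketch(row, encoding="utf-8"):
    """Hypothetical stand-in for hydrate_row(): read the file bytes and add a content column."""
    content = Path(row["source_path"]).read_bytes().decode(encoding)
    return {**row, "content": content}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.md").write_text("alpha", encoding="utf-8")
    (root / "nested" / "b.md").write_text("beta", encoding="utf-8")
    (root / "c.txt").write_text("gamma", encoding="utf-8")

    # FileContentsSeedSource-like path: manifest first, then hydration.
    manifest = build_manifest_sketch(root, "*.md", source_kind="file_contents")
    hydrated = [hydrate_row_sketch(row) for row in manifest]
    relative_paths = [row["relative_path"] for row in hydrated]
    contents = [row["content"] for row in hydrated]

    # With recursive=False only the top-level directory is searched.
    top_level = [r["relative_path"] for r in build_manifest_sketch(root, "*.md", recursive=False)]
```

Splitting the work into a cheap manifest pass and a separate hydration pass mirrors the two boxes in the diagram: the directory source stops after the manifest, while the file-contents source goes on to read each file.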
Prompt To Fix All With AI
This is a comment left during a code review.
Path: docs/concepts/seed-datasets.md
Line: 130-132
Comment:
**Case-sensitivity and default `file_pattern` not mentioned**
The note documents that `file_pattern` matches basenames only and that `recursive=True` is the default, but omits two useful facts:
1. **Case-sensitivity**: The engine uses `fnmatchcase` (see `seed_reader.py`, line 256), which is case-sensitive on all platforms. A pattern like `"*.MD"` will not match `readme.md`. This is worth calling out since most OS-level file search tools are case-insensitive on macOS and Windows.
2. **Default value**: The `file_pattern` field defaults to `"*"` (matches all files). The note already covers the `recursive` default — symmetrically documenting the `file_pattern` default would help users who omit it.
```suggestion
!!! note "Filesystem matching"
`file_pattern` matches file names only, not relative paths, and the match is **case-sensitive** on all platforms (e.g. `"*.MD"` will not match `readme.md`). The default pattern is `"*"` (matches every file). `recursive=True` is the default, so nested subdirectories are searched unless you turn it off.
```
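The matching behaviour described in this comment can be checked with the standard library alone. The sketch below assumes, as the comment states, that the pattern is compared against the basename with `fnmatch.fnmatchcase`; the `matches` helper is hypothetical and is not part of Data Designer.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def matches(relative_path: str, file_pattern: str = "*") -> bool:
    """Hypothetical helper: apply the pattern to the basename only, case-sensitively."""
    return fnmatchcase(PurePosixPath(relative_path).name, file_pattern)


print(matches("docs/readme.md", "*.md"))    # True
print(matches("docs/readme.md", "*.MD"))    # False: the match is case-sensitive
print(matches("docs/readme.md", "docs/*"))  # False: the directory part is never seen
print(matches("docs/readme.md"))            # True: the default "*" matches every file
```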
How can I resolve this? If you propose a fix, please make it concise.

Reviews (3): Last reviewed commit: "docs: avoid stale seed source count"
docs/concepts/seed-datasets.md
Outdated
| `FileContentsSeedSource` adds one extra seeded column: |
| - `content` - decoded text contents of the matched file |
source_kind value undocumented, implicitly misleading
The section says FileContentsSeedSource exposes "the same metadata as DirectorySeedSource". The DirectorySeedSource section explicitly documents source_kind as always "directory_file", so a reader will naturally infer that FileContentsSeedSource also emits source_kind = "directory_file".
In fact, the implementation sets a different value:
# packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py, line 544
_build_metadata_record(
context=context,
relative_path=relative_path,
source_kind="file_contents", # ← different from "directory_file"
)
Any user who filters or branches on source_kind (e.g. {% if source_kind == "directory_file" %}) would get silent wrong behaviour when using FileContentsSeedSource.
Please list the full column schema explicitly, matching the runtime implementation (output_columns on FileContentsSeedReader):
| `FileContentsSeedSource` adds one extra seeded column: | |
| - `content` - decoded text contents of the matched file | |
| `FileContentsSeedSource` exposes these seeded columns: | |
| - `source_kind` - always `"file_contents"` | |
| - `source_path` - full path to the matched file | |
| - `relative_path` - path relative to the configured directory | |
| - `file_name` - basename of the matched file | |
| - `content` - decoded text contents of the matched file |
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/concepts/seed-datasets.md
Line: 161-163
Comment:
**`source_kind` value undocumented, implicitly misleading**
The section says `FileContentsSeedSource` exposes "the same metadata as `DirectorySeedSource`". The `DirectorySeedSource` section explicitly documents `source_kind` as always `"directory_file"`, so a reader will naturally infer that `FileContentsSeedSource` also emits `source_kind = "directory_file"`.
In fact, the implementation sets a different value:
```python
# packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py, line 544
_build_metadata_record(
context=context,
relative_path=relative_path,
source_kind="file_contents", # ← different from "directory_file"
)
```
Any user who filters or branches on `source_kind` (e.g. `{% if source_kind == "directory_file" %}`) would get silent wrong behaviour when using `FileContentsSeedSource`.
Please list the full column schema explicitly, matching the runtime implementation (`output_columns` on `FileContentsSeedReader`):
```suggestion
`FileContentsSeedSource` exposes these seeded columns:
- `source_kind` - always `"file_contents"`
- `source_path` - full path to the matched file
- `relative_path` - path relative to the configured directory
- `file_name` - basename of the matched file
- `content` - decoded text contents of the matched file
```
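To make the risk concrete, here is a small sketch of code that branches on `source_kind`. The column names and the two `source_kind` values are taken from this comment; the `describe` function and the sample rows are hypothetical and only illustrate why the two sources must not be assumed to share a value.

```python
# Column schemas as described in this review comment.
DIRECTORY_COLUMNS = ("source_kind", "source_path", "relative_path", "file_name")
FILE_CONTENTS_COLUMNS = DIRECTORY_COLUMNS + ("content",)


def describe(row: dict) -> str:
    """Hypothetical consumer that handles both source_kind values explicitly."""
    kind = row["source_kind"]
    if kind == "directory_file":
        return f"path only: {row['relative_path']}"
    if kind == "file_contents":
        return f"{row['relative_path']} ({len(row['content'])} chars)"
    # Fail loudly instead of silently skipping an unexpected kind.
    raise ValueError(f"unexpected source_kind: {kind!r}")


directory_row = {
    "source_kind": "directory_file",
    "source_path": "/data/docs/readme.md",  # illustrative path
    "relative_path": "docs/readme.md",
    "file_name": "readme.md",
}
file_contents_row = {**directory_row, "source_kind": "file_contents", "content": "hello"}

print(describe(directory_row))      # path only: docs/readme.md
print(describe(file_contents_row))  # docs/readme.md (5 chars)
```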
How can I resolve this? If you propose a fix, please make it concise.

I have read the DCO document and I hereby sign the DCO.
docs/concepts/seed-datasets.md
Outdated
| ## Seed Sources |
| Data Designer supports three ways to provide seed data: |
| Data Designer supports five ways to provide seed data: |
The PR increments the count from three to five, but AgentRolloutSeedSource is also a shipped seed source — it has its own recipe at docs/recipes/trace_ingestion/agent_rollout_distillation.md and is featured on docs/recipes/cards.md. That makes six sources in total, so the sentence is factually wrong.
Either update the count to "six" and add a brief entry for AgentRolloutSeedSource, or rephrase to avoid embedding a hard count (e.g., "Data Designer supports multiple ways to provide seed data:").
| Data Designer supports five ways to provide seed data: | |
| Data Designer supports six ways to provide seed data: |
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/concepts/seed-datasets.md
Line: 57
Comment:
**Source count is off by one**
The PR increments the count from three to five, but `AgentRolloutSeedSource` is also a shipped seed source — it has its own recipe at `docs/recipes/trace_ingestion/agent_rollout_distillation.md` and is featured on `docs/recipes/cards.md`. That makes six sources in total, so the sentence is factually wrong.
Either update the count to "six" and add a brief entry for `AgentRolloutSeedSource`, or rephrase to avoid embedding a hard count (e.g., "Data Designer supports multiple ways to provide seed data:").
```suggestion
Data Designer supports six ways to provide seed data:
```
How can I resolve this? If you propose a fix, please make it concise.
Fixed in e62977b by rephrasing the sentence to avoid a stale hardcoded count.
Summary
- `DirectorySeedSource` and `FileContentsSeedSource`, including `file_pattern`, `recursive`, `encoding`, and the seeded columns they expose
- `data_designer.preview(...)`

Closes #448.

Validation
- `packages/data-designer-config/src/data_designer/config/seed_source.py`
- `packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py`
- `make test` locally because this machine has `uv 0.5.11`, which cannot parse the repo's newer `tool.uv.required-version` / `uv.lock` format