Add embedding recipe: build domain-specific embeddings from raw documents #85

Draft
oliverholworthy wants to merge 3 commits into main from oholworthy/embed-recipe

oliverholworthy commented Mar 10, 2026

Summary

  • Adds a complete embedding recipe with 6 stages: SDG, data prep, fine-tuning, evaluation, export, and deployment
  • Includes CLI commands under nemotron embed (sdg, prep, finetune, eval, export, deploy, run)
  • Adds docker executor support to nemo_runspec for local-docker execution
  • Adds pydantic-based config loading and config model introspection for --help
  • Includes sample data, tests, and comprehensive README documentation
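
As a rough illustration of the config-introspection idea in the summary above, the sketch below builds `--help`-style text from a typed config class. It is not the PR's code: it uses stdlib `dataclasses` as a stand-in for the pydantic models the PR describes, and `FinetuneConfig`, its fields, `describe_config`, and `load_config` are all hypothetical names chosen for the example.

```python
from dataclasses import MISSING, dataclass, field, fields


@dataclass
class FinetuneConfig:
    # Hypothetical fields; the real recipe's config options are not shown in this PR page.
    base_model: str = field(
        default="example/base-embedder", metadata={"help": "Model to fine-tune"}
    )
    learning_rate: float = field(
        default=1e-5, metadata={"help": "Optimizer learning rate"}
    )
    num_epochs: int = field(
        default=3, metadata={"help": "Number of training epochs"}
    )


def describe_config(config_cls) -> str:
    """Render one help line per config field by introspecting the class."""
    lines = []
    for f in fields(config_cls):
        type_name = getattr(f.type, "__name__", str(f.type))
        # Fields without a plain default are reported as required.
        default = "required" if f.default is MISSING else f"default: {f.default!r}"
        help_text = f.metadata.get("help", "")
        lines.append(f"  {f.name} ({type_name}, {default}): {help_text}")
    return "\n".join(lines)


def load_config(config_cls, overrides: dict):
    """Build a config from a dict of overrides, rejecting unknown keys."""
    known = {f.name for f in fields(config_cls)}
    unknown = set(overrides) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return config_cls(**overrides)


if __name__ == "__main__":
    print(describe_config(FinetuneConfig))
    print(load_config(FinetuneConfig, {"num_epochs": 5}))
```

With pydantic, the same loop would presumably walk the model's field metadata instead of `dataclasses.fields`, and type validation would come from the model itself rather than from a hand-written key check.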

Test plan

  • nemotron embed finetune --run local-docker launches and streams logs
  • nemotron embed finetune --dry-run shows config without executing
  • nemotron embed finetune --help displays config options from pydantic model
  • Unit tests pass: pytest tests/recipes/embed/
  • No regressions in existing nemotron nano3 commands

oliverholworthy force-pushed the oholworthy/embed-recipe branch from e90e6b4 to f0c63a4 on March 10, 2026 17:58
oliverholworthy self-assigned this on Mar 10, 2026
oliverholworthy force-pushed the oholworthy/embed-recipe branch from f0c63a4 to 4fd435d on March 10, 2026 18:00
oliverholworthy changed the title from "Add embedding recipe for fine-tuning, evaluation, and deployment" to "Add embedding recipe: build domain-specific embeddings from raw documents" on Mar 11, 2026
bernardwin requested a review from marcromeyn on March 23, 2026 18:39
@@ -0,0 +1,18 @@
For weeks, the Amazon rainforest has been burning at a startling rate. Tens of thousands of fires have been recorded this year largely started by humans clearing land for logging, ranching or mining.

I wonder if we should publish the dummy data on huggingface hub?

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
oliverholworthy force-pushed the oholworthy/embed-recipe branch from 4fd435d to 67e75af on March 25, 2026 20:54
Move detailed documentation from the recipe README into
docs/nemotron/embed/ to follow the nano3/super3 pattern.
Add grid card and toctree entry in docs/index.md.

Signed-off-by: Oliver Holworthy <oholworthy@nvidia.com>
Remove bundled sample data from the repo and download it on demand from
HuggingFace (nvidia/Retrieval-Synthetic-NVDocs-v1). The SDG stage now
supports hf:// URIs in corpus_dir config, e.g.:

  hf://nvidia/Retrieval-Synthetic-NVDocs-v1@<sha>/sample_corpus/nv_pp_random

This keeps the repo lightweight while preserving zero-config quick start
— the default config auto-downloads the sample corpus on first run.

Signed-off-by: Oliver Holworthy <oholworthy@nvidia.com>
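
To make the `hf://` URI form in the commit message above concrete, here is a minimal sketch of how such a string could be split into a repository id, a revision, and a sub-path. The grammar (`hf://<org>/<repo>[@<revision>]/<subpath>`) is inferred from the single example in the commit message, and `HfUri` and `parse_hf_uri` are hypothetical names; the PR's actual parsing may differ.

```python
from typing import NamedTuple, Optional


class HfUri(NamedTuple):
    repo_id: str             # "<org>/<repo>"
    revision: Optional[str]  # commit SHA, branch, or tag; None if omitted
    subpath: str             # path inside the repo; "" for the repo root


def parse_hf_uri(uri: str) -> HfUri:
    """Split an hf:// URI into its parts (assumed grammar, see lead-in)."""
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an hf:// URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected hf://<org>/<repo>[@<revision>]/..., got {uri!r}")
    org, repo_and_rev = parts[0], parts[1]
    # The revision, when present, is attached to the repo name with "@".
    repo, sep, revision = repo_and_rev.partition("@")
    if not repo or (sep and not revision):
        raise ValueError(f"malformed repo or revision in {uri!r}")
    return HfUri(f"{org}/{repo}", revision or None, "/".join(parts[2:]))


if __name__ == "__main__":
    # "abc123" is a placeholder revision, not a real commit SHA.
    print(parse_hf_uri("hf://example-org/example-repo@abc123/sample_corpus/docs"))
```

The parsed fields map naturally onto the `repo_id`, `revision`, and `allow_patterns` arguments of `huggingface_hub.snapshot_download`, which is one plausible way to fetch only the sample corpus on first run; whether the SDG stage does this is not shown here.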