feat: add fr_FR locale to nemotron personas datasets#468

Open
johnnygreco wants to merge 4 commits into main from add-fr-fr-locale

Conversation

@johnnygreco

Summary

  • Register the France locale (fr_FR, 2.71 GB) in NEMOTRON_PERSONAS_DATASET_SIZES, which auto-propagates to LOCALES_WITH_MANAGED_DATASETS, PersonaRepository, PersonSamplerParams validation, and the download service
  • Add 7 France-specific PII fields to dataset_based_person_fields.py: first_name_heritage, name_heritage, is_first_gen_immigrant, household_type, monthly_income_eur, commune, departement
  • Update person sampling docs with fr_FR locale listing, NGC download example, and field reference
  • Update persona repository tests for 8 locales

Register the France locale (fr_FR, 2.71 GB) in NEMOTRON_PERSONAS_DATASET_SIZES
and add 7 France-specific PII fields: first_name_heritage, name_heritage,
is_first_gen_immigrant, household_type, monthly_income_eur, commune, departement.
@johnnygreco johnnygreco requested a review from a team as a code owner March 25, 2026 20:52
@greptile-apps

greptile-apps bot commented Mar 25, 2026

Greptile Summary

This PR registers the fr_FR (France) locale in the Nemotron Personas dataset ecosystem and adds 7 France-specific PII fields, completing the full integration path from configuration to CLI to documentation.

Key changes:

  • NEMOTRON_PERSONAS_DATASET_SIZES in constants.py gets "fr_FR": "2.71 GB", which auto-propagates to LOCALES_WITH_MANAGED_DATASETS, PersonSamplerParams validation, the download service, and the persona repository — no other registration steps are needed by design.
  • A new LOCALES_WITH_MANAGED_DATASETS_STR constant is introduced as a bonus clean-up, replacing two inline ', '.join(...) calls and fixing a pre-existing staleness bug in the CLI --locale help text (which previously omitted en_SG and pt_BR).
  • Seven France-specific PII fields (commune, departement, household_type, monthly_income_eur, first_name_heritage, name_heritage, is_first_gen_immigrant) are appended to PII_FIELDS in locale-alphabetical order.
  • All four affected test files are updated with the new locale count (7 → 8) and explicit fr_FR assertions.
  • Documentation covers the NGC download snippet, field reference, and supported-locale parameter table.
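
The propagation described above can be sketched roughly as follows. This is a minimal illustration, not the contents of constants.py: the dictionary shows only locales named in this PR, the sizes other than fr_FR are placeholders, and validate_locale is a hypothetical stand-in for the PersonSamplerParams check.

```python
# Sketch only: the real mapping contains more locales.
NEMOTRON_PERSONAS_DATASET_SIZES: dict[str, str] = {
    "en_SG": "<size>",   # placeholder, size not stated in this PR
    "fr_FR": "2.71 GB",  # the entry added by this PR
    "pt_BR": "<size>",   # placeholder, size not stated in this PR
}

# Both values are derived from the dict keys, so adding a key above
# is the only registration step.
LOCALES_WITH_MANAGED_DATASETS: list[str] = list(NEMOTRON_PERSONAS_DATASET_SIZES)
LOCALES_WITH_MANAGED_DATASETS_STR: str = ", ".join(LOCALES_WITH_MANAGED_DATASETS)


def validate_locale(locale: str) -> str:
    """Hypothetical stand-in for the locale validation in PersonSamplerParams."""
    if locale not in LOCALES_WITH_MANAGED_DATASETS:
        raise ValueError(
            f"Unsupported locale {locale!r}. "
            f"Supported locales: {LOCALES_WITH_MANAGED_DATASETS_STR}"
        )
    return locale
```

Because the list and the joined string are both computed from the same dict, the error message and the validation cannot drift apart.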

Confidence Score: 5/5

  • This PR is safe to merge — it follows the established locale-addition pattern exactly, includes all necessary test updates, and includes a net improvement to the CLI help text.
  • The change is mechanical and well-scoped: one dict entry drives the full propagation, the new LOCALES_WITH_MANAGED_DATASETS_STR constant is a clean DRY improvement, all test counts and assertions are updated consistently, and documentation is thorough. No logic changes, no new failure modes.
  • No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer-config/src/data_designer/config/utils/constants.py Adds fr_FR to NEMOTRON_PERSONAS_DATASET_SIZES in alphabetical order and introduces LOCALES_WITH_MANAGED_DATASETS_STR to DRY up repeated join calls; clean and consistent.
packages/data-designer-engine/src/data_designer/engine/sampling_gen/entities/dataset_based_person_fields.py Adds 7 France-specific PII fields (commune, departement, household_type, monthly_income_eur, first_name_heritage, name_heritage, is_first_gen_immigrant) in the correct locale-ordered position within PII_FIELDS.
packages/data-designer-config/src/data_designer/config/sampler_params.py Replaces two inline ', '.join(LOCALES_WITH_MANAGED_DATASETS) calls with the pre-computed LOCALES_WITH_MANAGED_DATASETS_STR constant; purely mechanical refactor with no logic change.
packages/data-designer/src/data_designer/cli/commands/download.py Replaces a stale hardcoded locale list in the CLI help text (previously missing en_SG and pt_BR) with the dynamic LOCALES_WITH_MANAGED_DATASETS_STR constant — a net improvement over the original.
docs/concepts/person_sampling.md Adds fr_FR to the supported-locales list, NGC download snippet, France-specific field reference table, and the parameter-table locale enum; documentation is comprehensive.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["NEMOTRON_PERSONAS_DATASET_SIZES\n(constants.py)\n+ fr_FR: 2.71 GB"] --> B["LOCALES_WITH_MANAGED_DATASETS\n(list of locale keys)"]
    A --> C["LOCALES_WITH_MANAGED_DATASETS_STR\n(comma-joined string) NEW"]
    B --> D["PersonSamplerParams.locale\nvalidation & field description"]
    B --> E["PersonaRepository._registry\n(locales list)"]
    B --> F["DownloadService\nget_available_locales()"]
    C --> D
    C --> G["CLI --locale help text\n(download.py)"]
    H["dataset_based_person_fields.py\n+ 7 fr_FR PII fields"] --> I["PII_FIELDS list\nused by sampling engine"]

Reviews (4): Last reviewed commit: "refactor: add LOCALES_WITH_MANAGED_DATAS..."

Update hardcoded locale counts from 7 to 8 and add fr_FR assertions
in download controller and download service tests.

The --locale help text was hardcoded and already stale (missing en_SG,
pt_BR, fr_FR). Build it from LOCALES_WITH_MANAGED_DATASETS so it stays
in sync automatically.

Centralise the comma-joined locale list so it is defined once in
constants and reused in the CLI help text, PersonSamplerParams field
description, and locale validation error message.
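
As a rough illustration of the help-text change in the commits above, the sketch below builds a --locale option from the shared constant. It uses argparse from the standard library purely as a stand-in; the actual CLI framework used by download.py is not shown in this PR, and the locale list is an illustrative subset.

```python
import argparse

# Illustrative subset; the real list is derived from NEMOTRON_PERSONAS_DATASET_SIZES.
LOCALES_WITH_MANAGED_DATASETS = ["en_SG", "fr_FR", "pt_BR"]
LOCALES_WITH_MANAGED_DATASETS_STR = ", ".join(LOCALES_WITH_MANAGED_DATASETS)


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: help text is generated, never hand-maintained."""
    parser = argparse.ArgumentParser(prog="download-personas")
    parser.add_argument(
        "--locale",
        action="append",
        choices=LOCALES_WITH_MANAGED_DATASETS,
        help=f"Locale to download. Supported: {LOCALES_WITH_MANAGED_DATASETS_STR}",
    )
    return parser


help_text = build_parser().format_help()
args = build_parser().parse_args(["--locale", "fr_FR"])
```

With this shape, registering a new locale updates the help output without touching the command definition.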

@nabinchha nabinchha left a comment

🥖 🇫🇷 🚀

@nabinchha nabinchha left a comment

See the updated review below!

@nabinchha

Nice work on this one, @johnnygreco — clean addition with a great opportunistic refactor. Here are my thoughts.

Summary

This PR registers the France locale (fr_FR, 2.71 GB) in the Nemotron Personas system, adds 7 France-specific PII fields, and centralizes the comma-joined locale string into LOCALES_WITH_MANAGED_DATASETS_STR — eliminating the previously hardcoded (and already stale) locale list in the CLI help text. The implementation matches the stated intent cleanly.

Findings

Warnings — Worth addressing

packages/data-designer/tests/cli/controllers/test_download_controller.py:91-99 — Missing fr_FR assertion in test_run_personas_with_all_flag

  • What: The test bumps the expected count from 7 to 8 and asserts every other locale is in downloaded_locales, but never asserts "fr_FR" in downloaded_locales.
  • Why: The count check (== 8) would catch a missing locale, but the per-locale assertions are there to make failures readable. Omitting fr_FR breaks the pattern, and a run that downloads the wrong eighth locale in place of fr_FR would still pass. The sibling test test_determine_locales_with_all_flag (line 226) does assert fr_FR, so this looks like an oversight.
  • Suggestion: Add assert "fr_FR" in downloaded_locales between the en_SG and hi_Deva_IN assertions.
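
To make the reasoning concrete, here is a small self-contained sketch of why a count check alone is not enough. The helper name and the locale lists are hypothetical and are not taken from test_download_controller.py; the expected list is an illustrative subset of the 8 locales.

```python
def assert_all_locales_downloaded(downloaded: list[str], expected: list[str]) -> None:
    """Check the count, then check each expected locale by name."""
    assert len(downloaded) == len(expected)
    for locale in expected:
        # The per-locale assertion names the missing locale in the failure.
        assert locale in downloaded, f"{locale} was not downloaded"


expected = ["en_SG", "fr_FR", "hi_Deva_IN", "pt_BR"]  # illustrative subset

# Same length as expected, but fr_FR was replaced by a wrong locale.
# A count-only check would accept this list.
wrong = ["en_SG", "xx_XX", "hi_Deva_IN", "pt_BR"]
count_only_passes = len(wrong) == len(expected)
```

Calling assert_all_locales_downloaded(wrong, expected) raises an AssertionError that names fr_FR, which is the readability benefit the review refers to.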

Suggestions — Take it or leave it

packages/data-designer/tests/cli/repositories/test_persona_repository.py:53-59: test_get_by_code_all_locales doesn't cover fr_FR, en_SG, or pt_BR

  • What: The test cases only cover 5 of the 8 locales. This is pre-existing (en_SG and pt_BR were already missing), but since the test name says "all locales" and this PR adds another one, it's a natural time to fill the gap.
  • Why: The test name implies exhaustive coverage — adding the missing entries would make it match reality.
  • Suggestion: Could add fr_FR, en_SG, and pt_BR to the test cases. Totally fine to skip or do in a follow-up since it's pre-existing.
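
One possible way to keep an "all locales" test honest is to compare its cases against the registry. The sketch below assumes hypothetical names (TEST_CASE_LOCALES, missing_locales) and illustrative subsets of locales; it is not the structure of the existing test.

```python
# Illustrative subset of the registered locales.
LOCALES_WITH_MANAGED_DATASETS = ["en_SG", "fr_FR", "hi_Deva_IN", "pt_BR"]

# Hypothetical: the locales the existing test cases happen to cover.
TEST_CASE_LOCALES = ["hi_Deva_IN"]


def missing_locales(registered: list[str], covered: list[str]) -> list[str]:
    """Return registered locales that have no test case, sorted for stable output."""
    return sorted(set(registered) - set(covered))


gaps = missing_locales(LOCALES_WITH_MANAGED_DATASETS, TEST_CASE_LOCALES)
```

A guard such as asserting that gaps is empty would fail as soon as a locale is registered without a matching test case.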

What Looks Good

  • The LOCALES_WITH_MANAGED_DATASETS_STR refactor is a great improvement. The CLI help text was already stale (missing en_SG, pt_BR, fr_FR), and centralizing the string eliminates this entire class of staleness bugs. Nice proactive fix.

  • Clean, minimal changeset. Adding a locale touches exactly the files it should — the constant, the PII fields list, the docs, and the tests. No unnecessary changes, no scope creep.

  • Thorough test updates. All three test files are updated with the new count and fr_FR assertions. The coverage for the new locale is solid.

Verdict

Needs changes — One warning: the missing fr_FR assertion in test_run_personas_with_all_flag. Quick one-liner fix.
