feat: add fr_FR locale to nemotron personas datasets #468
johnnygreco wants to merge 4 commits into main
Register the France locale (fr_FR, 2.71 GB) in NEMOTRON_PERSONAS_DATASET_SIZES and add 7 France-specific PII fields: first_name_heritage, name_heritage, is_first_gen_immigrant, household_type, monthly_income_eur, commune, departement.
Greptile Summary
This PR registers the fr_FR locale. Key changes:
| Filename | Overview |
|---|---|
| packages/data-designer-config/src/data_designer/config/utils/constants.py | Adds fr_FR to NEMOTRON_PERSONAS_DATASET_SIZES in alphabetical order and introduces LOCALES_WITH_MANAGED_DATASETS_STR to DRY up repeated join calls; clean and consistent. |
| packages/data-designer-engine/src/data_designer/engine/sampling_gen/entities/dataset_based_person_fields.py | Adds 7 France-specific PII fields (commune, departement, household_type, monthly_income_eur, first_name_heritage, name_heritage, is_first_gen_immigrant) in the correct locale-ordered position within PII_FIELDS. |
| packages/data-designer-config/src/data_designer/config/sampler_params.py | Replaces two inline ', '.join(LOCALES_WITH_MANAGED_DATASETS) calls with the pre-computed LOCALES_WITH_MANAGED_DATASETS_STR constant; purely mechanical refactor with no logic change. |
| packages/data-designer/src/data_designer/cli/commands/download.py | Replaces a stale hardcoded locale list in the CLI help text (previously missing en_SG and pt_BR) with the dynamic LOCALES_WITH_MANAGED_DATASETS_STR constant — a net improvement over the original. |
| docs/concepts/person_sampling.md | Adds fr_FR to the supported-locales list, NGC download snippet, France-specific field reference table, and the parameter-table locale enum; documentation is comprehensive. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["NEMOTRON_PERSONAS_DATASET_SIZES\n(constants.py)\n+ fr_FR: 2.71 GB"] --> B["LOCALES_WITH_MANAGED_DATASETS\n(list of locale keys)"]
A --> C["LOCALES_WITH_MANAGED_DATASETS_STR\n(comma-joined string) NEW"]
B --> D["PersonSamplerParams.locale\nvalidation & field description"]
B --> E["PersonaRepository._registry\n(locales list)"]
B --> F["DownloadService\nget_available_locales()"]
C --> D
C --> G["CLI --locale help text\n(download.py)"]
H["dataset_based_person_fields.py\n+ 7 fr_FR PII fields"] --> I["PII_FIELDS list\nused by sampling engine"]
Reviews (4): Last reviewed commit: "refactor: add LOCALES_WITH_MANAGED_DATAS..."
Update hardcoded locale counts from 7 to 8 and add fr_FR assertions in download controller and download service tests.
The --locale help text was hardcoded and already stale (missing en_SG, pt_BR, fr_FR). Build it from LOCALES_WITH_MANAGED_DATASETS so it stays in sync automatically.
Centralise the comma-joined locale list so it is defined once in constants and reused in the CLI help text, PersonSamplerParams field description, and locale validation error message.
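The three commits above all rely on one idea: derive everything locale-related from a single registry. The following is a minimal sketch of that pattern, not the actual contents of constants.py. Only the fr_FR size (2.71 GB) and the constant names come from this PR. The other registry entries use placeholder sizes, and `validate_locale` and `LOCALE_HELP_TEXT` are hypothetical names introduced for illustration.

```python
# Sketch only: the real registry lives in constants.py and lists more locales.
# Sizes marked "placeholder" are NOT real values.
NEMOTRON_PERSONAS_DATASET_SIZES: dict[str, str] = {
    "en_SG": "placeholder",
    "fr_FR": "2.71 GB",  # size stated in this PR
    "pt_BR": "placeholder",
}

# Derived once, so adding a key above updates every consumer.
LOCALES_WITH_MANAGED_DATASETS: list[str] = list(NEMOTRON_PERSONAS_DATASET_SIZES)
LOCALES_WITH_MANAGED_DATASETS_STR: str = ", ".join(LOCALES_WITH_MANAGED_DATASETS)

# Hypothetical help string, standing in for the CLI --locale help text.
LOCALE_HELP_TEXT: str = f"Locale to download. Available: {LOCALES_WITH_MANAGED_DATASETS_STR}"


def validate_locale(locale: str) -> str:
    """Hypothetical validator mirroring a locale check with a shared error message."""
    if locale not in LOCALES_WITH_MANAGED_DATASETS:
        raise ValueError(
            f"Unsupported locale {locale!r}. "
            f"Supported locales: {LOCALES_WITH_MANAGED_DATASETS_STR}"
        )
    return locale
```

With this layout, a new locale needs one new dictionary entry. The help text and the validation message pick it up without further edits, which is why a hardcoded list could go stale while the derived one cannot.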
Nice work on this one, @johnnygreco: clean addition with a great opportunistic refactor. Here are my thoughts.
Summary
This PR registers the France locale (fr_FR).
Findings
Warnings: worth addressing
Suggestions: take it or leave it
What Looks Good
Verdict
Needs changes. One warning: the missing ...
Summary
- Register the France locale (fr_FR, 2.71 GB) in NEMOTRON_PERSONAS_DATASET_SIZES, which auto-propagates to LOCALES_WITH_MANAGED_DATASETS, PersonaRepository, PersonSamplerParams validation, and the download service
- Add 7 France-specific PII fields to dataset_based_person_fields.py: first_name_heritage, name_heritage, is_first_gen_immigrant, household_type, monthly_income_eur, commune, departement
- Update the docs with the fr_FR locale listing, NGC download example, and field reference
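To make the field additions concrete, here is one plausible way a list of PII field names could be used to separate sensitive columns from the rest of a sampled record. This is an assumption for illustration only; the PR does not show how the sampling engine consumes PII_FIELDS. The seven field names come from the PR, while `FR_FR_PII_FIELDS`, `split_pii`, and the sample record are hypothetical.

```python
# Field names taken from the PR description; the list name is hypothetical.
FR_FR_PII_FIELDS: list[str] = [
    "first_name_heritage",
    "name_heritage",
    "is_first_gen_immigrant",
    "household_type",
    "monthly_income_eur",
    "commune",
    "departement",
]


def split_pii(
    record: dict[str, object], pii_fields: list[str]
) -> tuple[dict[str, object], dict[str, object]]:
    """Return (pii, non_pii) views of a record, keyed by membership in pii_fields."""
    pii_set = set(pii_fields)
    pii = {key: value for key, value in record.items() if key in pii_set}
    non_pii = {key: value for key, value in record.items() if key not in pii_set}
    return pii, non_pii


# Made-up sample record, for illustration only.
sample = {"age": 42, "commune": "Lyon", "monthly_income_eur": 2500}
pii_part, public_part = split_pii(sample, FR_FR_PII_FIELDS)
```

Keeping the field names in a single list means any such filter picks up newly registered locale fields automatically.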