
SFT and PEFT with NeMo 1.0

source

docker run -d --gpus all -it --rm \
    --shm-size=16g \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    nvcr.io/nvidia/nemo:24.07

note: make sure there is enough writable space inside the container runtime, or mount a host volume with -v <host/path>:<container/path> (see the example below)
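
For example, the same container started with a host directory mounted as the working area (the host path below is a placeholder); since the container runs detached (-d), attach a shell to it with docker exec:

# placeholder host path; point it at a disk with enough free space
docker run -d --gpus all -it --rm \
    --shm-size=16g \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /data/nemo-work:/workspace \
    nvcr.io/nvidia/nemo:24.07

docker exec -it <container-id> bash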

git clone https://huggingface.co/vuiseng9/nemo1-sft-peft-gemma-7b
cd nemo1-sft-peft-gemma-7b

# PEFT:
./run_peft.sh

# Eval:
./run_eval.sh

# SFT:
./run_sft.sh
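
These helper scripts wrap the standard NeMo 1.0 tuning entrypoint shipped in the container. As a rough sketch only (not a copy of run_peft.sh; the devices, precision, batch sizes, PEFT scheme, and validation file name below are illustrative assumptions), a LoRA PEFT launch looks roughly like:

# sketch of a NeMo 1.0 PEFT launch; values are illustrative, see run_peft.sh for the real ones
torchrun --nproc_per_node=8 \
    /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    trainer.devices=8 \
    trainer.precision=bf16 \
    model.restore_from_path=gemma-7b.nemo \
    model.peft.peft_scheme=lora \
    model.micro_batch_size=1 \
    model.global_batch_size=128 \
    model.data.train_ds.file_names=[databricks-dolly-15k/training.jsonl] \
    model.data.validation_ds.file_names=[databricks-dolly-15k/validation.jsonl]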

Dataset (the output of this section has been pushed to this repo)

git clone https://huggingface.co/datasets/databricks/databricks-dolly-15k

# script in container
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/nemo_launcher/collections/dataprep_scripts/dolly_dataprep/preprocess.py --input databricks-dolly-15k/databricks-dolly-15k.jsonl

# split into train/val/test
python3 split_train_val.py
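
A quick sanity check on the split output (training.jsonl is the name consumed by the packing step below; the validation/test file names are assumptions based on the standard NeMo Dolly playbook):

wc -l databricks-dolly-15k/training.jsonl \
      databricks-dolly-15k/validation.jsonl \
      databricks-dolly-15k/test.jsonl

# each line is one preprocessed JSON record (prompt/response pair)
head -n 1 databricks-dolly-15k/training.jsonl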

Prepare Packed Dataset (optional; the prep script may fail with a tokens-not-on-CUDA error, in which case move the tensor to the CUDA device yourself in the script)

# need the tokenizer
git clone https://huggingface.co/google/gemma-7b

HYDRA_FULL_ERROR=1 python /opt/NeMo/scripts/nlp_language_modeling/prepare_packed_ft_dataset.py \
       model.data.train_ds.file_names=[databricks-dolly-15k/training.jsonl] \
       model.data.train_ds.max_seq_length=2048 \
       +tokenizer_path=gemma-7b/tokenizer.model \
       +output_dir=databricks-dolly-15k/ \
       +pack_sizes=[2048,4096,8192]

# [NeMo I 2025-08-17 03:02:05 prepare_packed_ft_dataset:148] Done, output written to databricks-dolly-15k/packed_8192_seed0.npy
# [NeMo I 2025-08-17 03:02:05 prepare_packed_ft_dataset:150] 
#     ✅ Packed datasets with pack sizes [2048, 4096, 8192] are prepared successfully.
#     To train with packed sequences, you need to change three things in the SFT/PEFT config file
#     1. Turn on the packed_sequence flag 
#        > +model.data.train_ds.packed_sequence=True
#     2. Use the new dataset file instead of the original jsonl file
#        > model.data.train_ds.file_names=/path/to/packed_dataset.npy
#     3. Specify the packed sequence length. This should be one of the ``pack_sizes`` you specified during data preparation.
#        > model.data.train_ds.max_seq_length=<pack_size>
#     4. Adjust the batch sizes. 
#        Micro batch size has to be set to 1 as a nominal constraint. This is because batches are now concatenated 
#        in the preprocessing step. You can increase the pack_size to achieve the same purpose of increasing micro batch size.
#        Global batch size has to be reduced by the average number of sequences per pack `n`, 
#        where n = total number of sequences / total number of packs. This ensures that each gradient iteration 
#        sees (on average) the same number of sequences so that the recipe is maintained.
#        Please scroll up to see the value of n for each of your pack sizes.
#        > model.micro_batch_size=1
#        > model.global_batch_size=<previous GBS divided by n> 
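
Putting the log's instructions together, a run against the 4096-token packs would add overrides along these lines (the .npy name follows the pattern printed by the prep script; the global batch size is only illustrative, divide your previous GBS by the reported n):

# illustrative packed-sequence overrides, appended to the usual SFT/PEFT command
    +model.data.train_ds.packed_sequence=True \
    model.data.train_ds.file_names=databricks-dolly-15k/packed_4096_seed0.npy \
    model.data.train_ds.max_seq_length=4096 \
    model.micro_batch_size=1 \
    model.global_batch_size=16   # previous GBS divided by n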

Convert the Hugging Face Gemma model to a .nemo checkpoint (already done)

git clone https://huggingface.co/google/gemma-7b

python3 /opt/NeMo/scripts/checkpoint_converters/convert_gemma_hf_to_nemo.py \
    --input_name_or_path gemma-7b/ \
    --output_path gemma-7b.nemo \
    --tokenizer_path gemma-7b/tokenizer.model
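
A quick check that the conversion produced the checkpoint the fine-tuning runs restore from (model.restore_from_path is the standard NeMo 1.0 override for this; it is assumed the run scripts in this repo point at the same file):

ls -lh gemma-7b.nemo

# the SFT/PEFT configs then restore from it, e.g.
#   model.restore_from_path=gemma-7b.nemo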