MethFormer is a masked regression transformer model trained to learn local and long-range patterns in DNA methylation (5mC and 5hmC) across genomic regions. Pretrained on binned methylation data, it is designed for downstream fine-tuning on tasks such as predicting MLL binding or chromatin state.
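As a rough, hypothetical sketch of the masked-regression idea (the actual masking logic lives in the data collator in scripts/methformer.py; the function name and 15% mask ratio here are assumptions):

```python
import torch

def mask_bins(input_values, mask_ratio=0.15):
    """Hide a random subset of bins; the model learns to reconstruct them.

    input_values: float tensor of shape (batch, 32 bins, 2 marks: 5mC, 5hmC)
    """
    labels = input_values.clone()                            # regression targets = original signal
    mask = torch.rand(input_values.shape[:2]) < mask_ratio   # select ~15% of bins (assumed ratio)
    corrupted = input_values.clone()
    corrupted[mask] = 0.0                                    # one possible corruption scheme
    return corrupted, labels, mask
```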
```
.
├── config/                     # Config files
├── data/                       # Binned methylation datasets (Hugging Face format)
├── output/                     # Pretrained models, logs, and checkpoints
├── scripts/
│   ├── methformer.py           # Model classes, data collator
│   ├── pretrain_methformer.py  # Main training script
│   └── finetune_mll.py         # (optional) downstream fine-tuning
├── requirements.txt
└── README.md
```
Preprocess 5mC and 5hmC data into 1024 bp windows, binned into 32 bins × 2 features. Save using Hugging Face's datasets.DatasetDict format:
```
DatasetDict({
    train: Dataset({
        features: ['input_values', 'attention_mask', 'labels']
    }),
    validation: Dataset(...)
})
```
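A minimal sketch of assembling such a DatasetDict with random placeholder data (the save path and split sizes are assumptions; the actual windowing and binning steps are omitted):

```python
import numpy as np
from datasets import Dataset, DatasetDict

N_BINS, N_MARKS = 32, 2  # 32 bins x 2 features (5mC, 5hmC) per 1024 bp window

def make_split(n_windows):
    vals = np.random.rand(n_windows, N_BINS, N_MARKS).astype("float32")  # placeholder signal
    return Dataset.from_dict({
        "input_values": vals.tolist(),
        "attention_mask": np.ones((n_windows, N_BINS), dtype="int64").tolist(),
        "labels": vals.tolist(),  # reconstruction targets for masked pretraining
    })

dataset = DatasetDict({"train": make_split(1000), "validation": make_split(100)})
dataset.save_to_disk("data/methylation_binned")  # hypothetical path under data/
```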
```bash
python scripts/pretrain_methformer.py
```
Options can be customized directly inside the script or overridden for sweep tuning. This will:
- Train on the binned methylation data with a held-out validation chromosome (chr8)
- Log masked_mse: mean squared error over unmasked positions
- Log masked_mae: mean absolute error
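For concreteness, these metrics could be computed roughly as follows (a sketch assuming predictions and labels of shape (batch, bins, 2) and an attention mask marking valid bins; the script's exact implementation may differ):

```python
import numpy as np

def masked_metrics(preds, labels, attention_mask):
    """MSE/MAE restricted to unmasked (attention_mask == 1) positions."""
    valid = attention_mask.astype(bool)    # (batch, bins); ignore padded-out bins
    diff = preds[valid] - labels[valid]    # (n_valid, 2)
    return {
        "masked_mse": float(np.mean(diff ** 2)),
        "masked_mae": float(np.mean(np.abs(diff))),
    }
```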
After pretraining, use the Hugging Face Trainer to fine-tune on log1p-transformed MLL-N RPKM values, averaged over 1 kb regions. See scripts/finetune_mll.py for an example.
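A hedged sketch of what that fine-tuning setup might look like (the wrapper class, pooling choice, and paths are assumptions; the repo's actual code is in scripts/finetune_mll.py):

```python
import torch.nn as nn
from transformers import Trainer, TrainingArguments

class MethFormerForRegression(nn.Module):
    """Hypothetical wrapper: pretrained encoder + scalar regression head."""
    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, 1)  # predicts log1p(MLL-N RPKM) per region

    def forward(self, input_values, attention_mask=None, labels=None):
        hidden = self.encoder(input_values, attention_mask=attention_mask)  # (batch, bins, hidden)
        preds = self.head(hidden.mean(dim=1)).squeeze(-1)                   # mean-pool over bins
        loss = nn.functional.mse_loss(preds, labels) if labels is not None else None
        return {"loss": loss, "logits": preds}

# Datasets should carry log1p-transformed RPKM targets, e.g.:
# train_ds = train_ds.map(lambda ex: {"labels": float(np.log1p(ex["rpkm"]))})

args = TrainingArguments(output_dir="output/mll_finetune", num_train_epochs=3)
# trainer = Trainer(model=MethFormerForRegression(encoder, 256), args=args,
#                   train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```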
You can run Captum or SHAP for model interpretability, e.g. attributing predictions to individual bins or methylation marks.
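For instance, Captum's IntegratedGradients can attribute a fine-tuned model's prediction back to individual bins and marks (a sketch; the model wrapper and shapes follow the hypothetical fine-tuning example above):

```python
import torch
from captum.attr import IntegratedGradients

# model: a fine-tuned MethFormer returning {"logits": scalar per window}
ig = IntegratedGradients(lambda x: model(x)["logits"])

window = torch.rand(1, 32, 2, requires_grad=True)  # one 1024 bp window (placeholder)
attributions = ig.attribute(window, baselines=torch.zeros_like(window))
print(attributions.shape)  # (1, 32, 2): per-bin, per-mark attribution scores
```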
Key packages:

- transformers
- datasets
- wandb
- torch
- anndata
- scikit-learn
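Install everything with the provided requirements file:

```bash
pip install -r requirements.txt
```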