🧚 MethFormer: A Transformer for DNA Methylation

MethFormer is a masked regression transformer model trained to learn local and long-range patterns in DNA methylation (5mC and 5hmC) across genomic regions. Pretrained on binned methylation data, it is designed for downstream fine-tuning on tasks such as predicting MLL binding or chromatin state.


🚀 Overview

  • Inputs: Binned methylation values (5mC, 5hmC) over 1024bp windows (32 bins × 2 channels); see the shape sketch after this list
  • Pretraining objective: Masked methylation imputation (per-bin regression)
  • Architecture: Transformer encoder with linear projection head
  • Downstream tasks: MLL binding prediction, chromatin state inference, or enhancer classification
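
The snippet below is a minimal sketch of the tensor shapes the model expects, using a toy batch of random values; the exact field names and batch format are defined in scripts/methformer.py.

import torch

# Toy batch: 4 windows of 1024 bp, each binned into 32 bins x 2 channels
# (5mC, 5hmC). Values are methylation fractions in [0, 1].
batch = {
    "input_values": torch.rand(4, 32, 2),                    # (batch, bins, channels)
    "attention_mask": torch.ones(4, 32, dtype=torch.long),   # 1 = bin has coverage
}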

πŸ“ Project Structure

.
├── config/                            # Configuration files
├── data/                              # Binned methylation datasets (HuggingFace format)
├── output/                            # Pretrained models, logs, and checkpoints
├── scripts/
│   ├── methformer.py                  # Model classes and data collator
│   ├── pretrain_methformer.py         # Main training script
│   └── finetune_mll.py                # (optional) downstream fine-tuning
├── requirements.txt
└── README.md

👩‍💻 Pretraining MethFormer

Step 1: Prepare Dataset

Preprocess 5mC and 5hmC data into 1024bp windows, binned into 32 bins × 2 channels. Save using Hugging Face's datasets.DatasetDict format:

DatasetDict({
  train: Dataset({
    features: ['input_values', 'attention_mask', 'labels']
  }),
  validation: Dataset(...)
})
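
A minimal sketch of building such a DatasetDict is shown below. The column names follow the schema above, but the helper and its random inputs are only illustrative; real preprocessing would compute per-bin 5mC/5hmC fractions from your methylation calls, and the output path is an example.

import numpy as np
from datasets import Dataset, DatasetDict

def make_split(mc, hmc):
    """Stack per-bin 5mC and 5hmC values into (n_windows, 32, 2) inputs."""
    input_values = np.stack([mc, hmc], axis=-1).astype("float32")
    attention_mask = np.ones(input_values.shape[:2], dtype="int64")
    return Dataset.from_dict({
        "input_values": input_values.tolist(),
        "attention_mask": attention_mask.tolist(),
        "labels": input_values.tolist(),  # targets for masked imputation
    })

# Random values stand in for real binned methylation fractions.
rng = np.random.default_rng(0)
dataset = DatasetDict({
    "train": make_split(rng.random((1000, 32)), rng.random((1000, 32))),
    "validation": make_split(rng.random((200, 32)), rng.random((200, 32))),
})
dataset.save_to_disk("data/methformer_pretrain")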

Step 2: Run Pretraining

python scripts/pretrain_methformer.py

Training options can be customized inside the script or adapted for hyperparameter sweeps. Running it will:

  • Train the model using masked regression loss (see the masking sketch after this list)
  • Evaluate on a held-out chromosome (e.g., chr8)
  • Log metrics to Weights & Biases
  • Save the best model checkpoint
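
The masking itself is handled by the data collator in scripts/methformer.py; the sketch below only illustrates the idea, and the mask probability and sentinel value are assumptions, not the actual defaults.

import torch

def mask_bins(input_values, mask_prob=0.15, mask_value=-1.0):
    """Mask a random subset of bins so the model must impute them.

    input_values: (batch, bins, channels) float tensor.
    Returns the masked inputs and a (batch, bins) boolean mask; the
    regression loss is computed only where the mask is True.
    """
    bin_mask = torch.rand(input_values.shape[:2]) < mask_prob
    masked = input_values.clone()
    masked[bin_mask] = mask_value  # masks both channels of the chosen bins
    return masked, bin_mask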

📊 Metrics

  • masked_mse: Mean squared error at masked bins
  • masked_mae: Mean absolute error at masked bins (both are sketched below)
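
A sketch of how these can be computed from model predictions, assuming a boolean bin mask like the one produced in the masking sketch above (the actual metric code lives in the training script):

import torch

def masked_regression_metrics(preds, labels, bin_mask):
    """MSE/MAE restricted to bins that were masked during pretraining.

    preds, labels: (batch, bins, channels); bin_mask: (batch, bins) bool.
    """
    diff = preds[bin_mask] - labels[bin_mask]
    return {
        "masked_mse": diff.pow(2).mean().item(),
        "masked_mae": diff.abs().mean().item(),
    }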

🧪 Fine-tuning on MLL Binding

After pretraining:

  1. Replace the regression head with a scalar head for MLL prediction.
  2. Use a Trainer to fine-tune on log1p-transformed MLL-N RPKM values averaged over 1kb regions.

See scripts/finetune_mll.py for an example.
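
A hedged sketch of step 1 is shown below: the pretrained encoder is wrapped with a scalar regression head. The class and attribute names here are placeholders, so refer to scripts/finetune_mll.py for the actual implementation.

import torch.nn as nn

class MethformerForMLL(nn.Module):
    """Pretrained encoder + scalar head for per-window MLL-N prediction."""

    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder                 # pretrained MethFormer encoder
        self.head = nn.Linear(hidden_size, 1)  # replaces the per-bin regression head

    def forward(self, input_values, attention_mask=None, labels=None):
        hidden = self.encoder(input_values, attention_mask=attention_mask)  # (batch, bins, hidden)
        pooled = hidden.mean(dim=1)            # average over bins
        preds = self.head(pooled).squeeze(-1)  # one value per window
        loss = None
        if labels is not None:
            # labels: log1p-transformed MLL-N RPKM averaged over the 1kb window
            loss = nn.functional.mse_loss(preds, labels)
        return {"loss": loss, "logits": preds}

Step 2 is then a standard transformers Trainer loop: pass this module, the fine-tuning datasets, and TrainingArguments of your choice, and call trainer.train().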


πŸ” Visualizations & Interpretability

You can use Captum or SHAP (see the sketch after this list) for:

  • Per-bin attribution of 5mC/5hmC to MLL binding
  • Visualizing what MethFormer attends to during fine-tuning
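
For example, a minimal Integrated Gradients sketch with Captum, assuming a fine-tuned scalar-output model like the MethformerForMLL sketch above; the helper name and baseline choice are illustrative.

import torch
from captum.attr import IntegratedGradients

def mll_attributions(model, inputs):
    """Per-bin attribution of 5mC/5hmC inputs to the predicted MLL signal.

    model: fine-tuned scalar-output MethFormer variant (assumed).
    inputs: (batch, 32, 2) float tensor of binned methylation.
    """
    def forward_fn(input_values):
        return model(input_values)["logits"]

    ig = IntegratedGradients(forward_fn)
    # Zero methylation as the baseline; returns one score per bin per channel.
    return ig.attribute(inputs, baselines=torch.zeros_like(inputs))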

🛠️ Dependencies

Key packages:

  • transformers
  • datasets
  • wandb
  • torch
  • anndata
  • scikit-learn

🧠 Acknowledgements

  • Built with inspiration from DNABERT, Grover, and vision transformers