Training Machine Learning Potentials
This tutorial explains how to train machine learning potentials using CURATOR. CURATOR uses Hydra for managing hyperparameters and configuration files, allowing you to leverage a wide range of Hydra features for flexible experimentation.
Training and evaluation are powered by the PyTorch Lightning framework, enabling the use of advanced deep learning features with minimal effort.
Preparing Data
CURATOR ships with default values for every hyperparameter except the dataset path (data.datapath), so the only input required to train an equivariant neural network potential is an ab initio dataset.
CURATOR uses ASE’s API to read atomistic structures, so any format supported by ASE is compatible. For example, you can use an OUTCAR file from a VASP calculation or an ASE-compatible .traj file.
A minimal training task can be launched with:
curator-train data.datapath=water.traj
This employs CURATOR’s default settings, using the PaiNN model architecture. If you want to customize hyperparameters, you can either:
Use the command-line override:
curator-train model/representation=nequip
Or specify them in a separate YAML configuration file:
curator-train cfg=user_cfg.yaml
You can define any parameters in the config file. For reference, the meanings of different parameters are documented under curator/configs/. Once a job is run, a complete config.yaml file will be generated. This file contains the effective configuration and can be reused or modified for future tasks.
All other CURATOR tasks (e.g., selection, evaluation) can be initiated in the same way.
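To make the relationship between the two styles concrete, the sketch below shows how a dotted override such as data.datapath=water.traj corresponds to nested keys in a YAML file. It only illustrates the naming scheme and is not Hydra's actual parser: the helper overrides_to_nested is a hypothetical name, values are kept as strings, and slash-style group overrides such as model/representation=nequip (which select a whole config file) are deliberately not handled.

```python
def overrides_to_nested(overrides):
    """Turn ["a.b=value", ...] into a nested dict {"a": {"b": "value"}}.

    Illustration only: values stay as strings, and Hydra's real grammar
    (type conversion, lists, config groups) is not reproduced here.
    """
    config = {}
    for item in overrides:
        dotted_key, value = item.split("=", 1)  # split on the first "=" only
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})  # descend, creating levels as needed
        node[leaf] = value
    return config


nested = overrides_to_nested(["data.datapath=water.traj", "data.batch_size=32"])
print(nested)
```

In a YAML file the same settings would appear as a data: block containing datapath and batch_size entries.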
Using Hydra’s Defaults List
CURATOR supports multiple equivariant GNN architectures, including PaiNN, NequIP, and MACE. Switching between them usually requires updating many hyperparameters, which can be cumbersome.
Hydra’s defaults list allows you to easily switch architectures by referencing a pre-defined configuration:
defaults:
- model/representation: mace
Similarly, you can configure other components such as loss functions, loggers, and evaluation metrics. For example:
defaults:
- task/outputs: energy_force_virial
- trainer/logger: wandb
Instantiating Objects in Python
You may wish to create models or datasets programmatically from a config file. CURATOR supports this via Hydra’s instantiate function.
from hydra.utils import instantiate
from curator.utils import read_user_config

cfg = read_user_config("user_cfg.yaml", config_name="train.yaml")
model = instantiate(cfg.model)
data = instantiate(cfg.data)
data.setup()
Hydra supports recursive instantiation and parameter overrides:
data = instantiate(cfg.data, batch_size=32)
More examples can be found in the Hydra instantiate documentation.
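For readers new to Hydra, the following standard-library sketch mimics the idea behind instantiate: a config node names a class under the _target_ key, the remaining keys become constructor arguments, keyword overrides take precedence, and nested nodes with their own _target_ are built first. The function instantiate_sketch is a hypothetical stand-in written for illustration, not Hydra's implementation, and the targets used (datetime.timedelta, builtins.dict) are arbitrary examples unrelated to CURATOR.

```python
import importlib


def instantiate_sketch(cfg, **overrides):
    """Build the object described by a dict that has a "_target_" key.

    Illustration only, not Hydra's implementation.
    """
    params = {**cfg, **overrides}  # keyword overrides win over config values
    module_name, _, attr = params.pop("_target_").rpartition(".")
    target = getattr(importlib.import_module(module_name), attr)
    # Recurse into nested configs that carry their own "_target_".
    kwargs = {
        key: instantiate_sketch(value)
        if isinstance(value, dict) and "_target_" in value
        else value
        for key, value in params.items()
    }
    return target(**kwargs)


cfg = {"_target_": "datetime.timedelta", "days": 1, "hours": 2}
delta = instantiate_sketch(cfg)                      # uses the config as written
delta_overridden = instantiate_sketch(cfg, hours=5)  # override one argument

nested_cfg = {
    "_target_": "builtins.dict",
    "step": {"_target_": "datetime.timedelta", "minutes": 30},
}
nested = instantiate_sketch(nested_cfg)  # inner timedelta is built first
print(delta, "|", delta_overridden, "|", nested)
```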
Saving Models and Restarting Training
CURATOR automatically saves model checkpoints during training using PyTorch Lightning’s built-in ModelCheckpoint callback. By default, it saves the latest and best-performing checkpoints according to validation loss.
The checkpoints will be saved to model_path/best_model_{epoch}_{step}_{val_total_loss:.2f}.ckpt.
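The placeholders in this pattern are filled in by PyTorch Lightning when a checkpoint is written. As a rough guide to what the resulting file names look like, the sketch below expands the template by hand, assuming Lightning's default behaviour of prefixing each value with its metric name (for example epoch=10). The helper expand_checkpoint_name and the metric values are hypothetical and only imitate the naming scheme.

```python
import re


def expand_checkpoint_name(template, metrics):
    """Expand "{name}" / "{name:spec}" placeholders as "name=value".

    Imitates the default naming style of Lightning checkpoints; this is
    an illustration, not Lightning's own formatting code.
    """
    def fill(match):
        name, spec = match.group(1), match.group(2) or ""
        return f"{name}=" + format(metrics[name], spec)

    return re.sub(r"\{(\w+)(?::([^}]*))?\}", fill, template) + ".ckpt"


# Hypothetical metric values, chosen only to show the formatting.
name = expand_checkpoint_name(
    "best_model_{epoch}_{step}_{val_total_loss:.2f}",
    {"epoch": 10, "step": 500, "val_total_loss": 0.1234},
)
print(name)
```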
You can also customize checkpointing behavior in the configuration file, for example to save a checkpoint every N epochs:
trainer:
  callbacks:
    - _target_: pytorch_lightning.callbacks.ModelCheckpoint
      dirpath: model_path
      save_top_k: -1
      every_n_epochs: 10 # any positive integer
To resume training from a specific checkpoint:
curator-train model_path=model_path/best_model_epoch=10.ckpt
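When many checkpoints have accumulated, it can be convenient to pick the file to resume from programmatically. The sketch below is one possible approach that selects the most recently modified .ckpt file in a directory; the helper latest_checkpoint is a hypothetical name and is not part of CURATOR, and the directory name model_path simply follows the example above.

```python
from pathlib import Path


def latest_checkpoint(directory):
    """Return the most recently modified .ckpt file in `directory`, or None."""
    checkpoints = list(Path(directory).glob("*.ckpt"))
    if not checkpoints:
        return None
    return max(checkpoints, key=lambda path: path.stat().st_mtime)


ckpt = latest_checkpoint("model_path")
if ckpt is None:
    print("No checkpoint found; start a fresh run.")
else:
    # Pass this path as the model_path value when resuming.
    print(ckpt)
```

Selecting by modification time picks the latest checkpoint, which is not necessarily the one with the lowest validation loss.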
Multi-GPU Training
PyTorch Lightning makes multi-GPU training easy and scalable. To enable multi-GPU training, modify the following options in your YAML file:
1. Specify Number of GPUs
Set the number of GPUs using the trainer.devices option.
2. Choose the Distributed Strategy (Recommended: ddp)
Use the Distributed Data Parallel (DDP) strategy by setting trainer.strategy to "ddp". This is the most commonly used and recommended strategy for multi-GPU training.
3. Adjust Batch Size
Multiply your original batch size by the number of GPUs so that each GPU processes the same amount of data as in single-GPU training.
4. Enable Mixed Precision (Optional)
For faster training and lower memory usage, enable automatic mixed precision by setting trainer.precision to "16-mixed" (for PyTorch >= 1.6).
trainer:
  devices: <number_of_gpus>
  strategy: ddp
  precision: 16-mixed # Optional: enables mixed precision
data:
  batch_size: <batch_size_per_gpu> * <number_of_gpus>
Note
For example, if you previously used batch_size: 32 on a single GPU, and now want to use 4 GPUs, set batch_size: 128.
ddp ensures each GPU handles a portion of the data in parallel and synchronizes gradients efficiently. Mixed precision further speeds up training while reducing GPU memory usage.
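The four steps above can be collected into a small helper that derives the multi-GPU settings from a single-GPU setup. This is only a sketch of the arithmetic described in this section: the function multi_gpu_settings is a hypothetical name, and the dotted keys mirror the YAML options shown above.

```python
def multi_gpu_settings(single_gpu_batch_size, num_gpus, mixed_precision=False):
    """Derive the settings from this section for a multi-GPU run."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    settings = {
        "trainer.devices": num_gpus,
        "trainer.strategy": "ddp",
        # Scale the batch size so each GPU sees as much data as before.
        "data.batch_size": single_gpu_batch_size * num_gpus,
    }
    if mixed_precision:
        settings["trainer.precision"] = "16-mixed"
    return settings


settings = multi_gpu_settings(32, 4, mixed_precision=True)
for key, value in settings.items():
    print(f"{key}={value}")
```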
cuEquivariance Acceleration
cuEquivariance is an NVIDIA Python library that provides accelerated GPU kernels for equivariant graph neural networks. It can significantly speed up training and inference of MACE and NequIP models. To enable cuEquivariance acceleration in CURATOR, set:
model:
  representation:
    use_cueq: true
This automatically applies cuEquivariance-optimized layers where they are available.