Training Machine Learning Potentials

This tutorial explains how to train machine learning potentials using CURATOR. CURATOR uses Hydra for managing hyperparameters and configuration files, allowing you to leverage a wide range of Hydra features for flexible experimentation.

Training and evaluation are built on the PyTorch Lightning framework, which provides features such as checkpointing, logging, and multi-GPU training with little extra configuration.

Preparing Data

CURATOR ships with default values for every hyperparameter except the dataset path (data.datapath), so the only input you need to train an equivariant neural network potential is an ab initio dataset.

CURATOR uses ASE’s API to read atomistic structures, so any format supported by ASE is compatible. For example, you can use an OUTCAR file from a VASP calculation or an ASE-compatible .traj file.

A minimal training task can be launched with:

curator-train data.datapath=water.traj

This runs with CURATOR’s default settings, which use the PaiNN model architecture. To customize hyperparameters, you can either:

Use a command-line override:

curator-train model/representation=nequip

Or specify them in a separate YAML configuration file:
```
curator-train cfg=user_cfg.yaml
```

You can set any parameter in the config file; the available parameters and their meanings are documented under curator/configs/. Each run also writes a complete config.yaml file containing the effective configuration, which you can reuse or modify for later tasks.

All other CURATOR tasks (e.g., selection, evaluation) are launched in the same way.
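
When you script many runs, the command-line overrides shown above can be assembled in Python. The following is a minimal sketch, not part of CURATOR: `build_command` and `launch` are hypothetical helpers, and the only override keys used are the ones that appear in this section. Combining both overrides in a single call is an assumption based on standard Hydra behaviour.

```python
import shlex
import shutil
import subprocess


def build_command(overrides, executable="curator-train"):
    """Turn a dict of Hydra-style overrides into an argv list.

    Hypothetical helper for illustration; it only formats strings.
    """
    return [executable] + [f"{key}={value}" for key, value in overrides.items()]


def launch(argv):
    """Run the command, failing early if the CLI is not on PATH."""
    if shutil.which(argv[0]) is None:
        raise FileNotFoundError(f"{argv[0]} was not found on PATH")
    return subprocess.run(argv, check=True)


# Override keys taken from the examples in this section.
argv = build_command({
    "data.datapath": "water.traj",
    "model/representation": "nequip",
})

# Show the equivalent shell command without running it.
command_line = shlex.join(argv)
print(command_line)
```

Calling `launch(argv)` would start the job; the sketch stops at printing the command so it stays safe to run anywhere.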

Using Hydra’s Defaults List

CURATOR supports multiple equivariant GNN architectures, including PaiNN, NequIP, and MACE. Switching between them usually requires updating many hyperparameters, which can be cumbersome.

Hydra’s defaults list allows you to easily switch architectures by referencing a pre-defined configuration:

defaults:
  - model/representation: mace

Similarly, you can configure other components such as loss functions, loggers, and evaluation metrics. For example:

defaults:
  - task/outputs: energy_force_virial
  - trainer/logger: wandb
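
Conceptually, each entry in the defaults list pulls in a whole pre-defined block of settings, and any value you set explicitly is applied on top. The sketch below imitates that idea with plain dictionaries. It is not Hydra's implementation, the actual precedence rules are Hydra's, and the option contents (`num_features`, `max_ell`, and their numbers) are made-up placeholders rather than CURATOR's real defaults.

```python
import copy


def deep_merge(base, update):
    """Return a new dict with `update` merged into `base`, key by key."""
    merged = copy.deepcopy(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def nest(path, value):
    """Wrap `value` in nested dicts following a slash-separated path."""
    for part in reversed(path.split("/")):
        value = {part: value}
    return value


# Hypothetical option blocks for one config group (placeholder values).
OPTIONS = {
    "model/representation": {
        "painn": {"name": "painn", "num_features": 128},
        "mace": {"name": "mace", "num_features": 64, "max_ell": 2},
    },
}


def compose(defaults, explicit_values):
    """Merge the selected option blocks, then apply explicit values last."""
    cfg = {}
    for group, option in defaults:
        cfg = deep_merge(cfg, nest(group, OPTIONS[group][option]))
    return deep_merge(cfg, explicit_values)


cfg = compose(
    defaults=[("model/representation", "mace")],
    explicit_values={
        "model": {"representation": {"num_features": 32}},
        "data": {"datapath": "water.traj"},
    },
)
print(cfg)
```

One selection swaps in the whole `mace` block, while the single explicit `num_features` value replaces only that key and leaves the rest of the block intact.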

Instantiating Objects in Python

You may wish to create models or datasets programmatically from a config file. CURATOR supports this via Hydra’s instantiate function.

from hydra.utils import instantiate
from curator.utils import read_user_config

cfg = read_user_config("user_cfg.yaml", config_name="train.yaml")
model = instantiate(cfg.model)
data = instantiate(cfg.data)
data.setup()

instantiate builds nested _target_ entries recursively, and keyword arguments passed to it override the corresponding values in the config:

data = instantiate(cfg.data, batch_size=32)

More examples can be found in the Hydra instantiate documentation.
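
To make the mechanism concrete, here is a simplified, standard-library-only stand-in for what instantiate does with a `_target_` entry: import the dotted path, build nested entries first, then call the target with the remaining keys as keyword arguments. `instantiate_sketch` and `locate` are hypothetical names, the sketch works on plain dictionaries only, and it omits most of what the real function supports.

```python
import importlib


def locate(dotted_path):
    """Import `package.module.Name` and return the `Name` object."""
    module_name, _, attribute = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attribute)


def instantiate_sketch(cfg, **overrides):
    """Simplified illustration of `_target_`-based instantiation."""
    if isinstance(cfg, list):
        return [instantiate_sketch(item) for item in cfg]
    if not isinstance(cfg, dict):
        return cfg  # plain values are passed through unchanged
    merged = {**cfg, **overrides}  # keyword overrides win over config values
    # Recurse first so nested `_target_` nodes become objects.
    kwargs = {
        key: instantiate_sketch(value)
        for key, value in merged.items()
        if key != "_target_"
    }
    if "_target_" not in merged:
        return kwargs
    return locate(merged["_target_"])(**kwargs)


# Standard-library classes stand in for models and datasets here.
cfg = {
    "_target_": "types.SimpleNamespace",
    "name": "demo",
    "ratio": {"_target_": "fractions.Fraction", "numerator": 1, "denominator": 3},
}

obj = instantiate_sketch(cfg)
renamed = instantiate_sketch(cfg, name="override")
print(obj, renamed)
```

The nested `ratio` entry is built before the outer object, which mirrors recursive instantiation, and passing `name="override"` plays the same role as `batch_size=32` in the example above.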

Saving Models and Restarting Training

CURATOR automatically saves model checkpoints during training using PyTorch Lightning’s built-in ModelCheckpoint callback. By default, it saves the latest and best-performing checkpoints according to validation loss.

The checkpoints will be saved to model_path/best_model_{epoch}_{step}_{val_total_loss:.2f}.ckpt.

You can also customize checkpointing behavior in the configuration file, for example to save a checkpoint every N epochs:

trainer:
  callbacks:
    - _target_: pytorch_lightning.callbacks.ModelCheckpoint
      dirpath: model_path
      save_top_k: -1
      every_n_epochs: 10  # any positive integer

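To resume training from a specific checkpoint, pass its path as model_path:

curator-train model_path=model_path/best_model_epoch=10.ckpt

Hydra’s override grammar treats = as a special character in unquoted values, so if the override is rejected, quote the value or escape the = with a backslash.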
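
If you want to resume from whichever checkpoint was written last without typing its filename, you can look it up by modification time. This is a minimal sketch under the assumption that checkpoints live in a single directory and end in `.ckpt`; `latest_checkpoint` is a hypothetical helper, and the demo uses empty placeholder files rather than real checkpoints.

```python
import os
import tempfile
from pathlib import Path


def latest_checkpoint(directory, pattern="*.ckpt"):
    """Return the most recently modified matching file, or None."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


# Self-contained demo: two empty files with explicit modification times.
with tempfile.TemporaryDirectory() as tmp:
    for age, name in enumerate(["old.ckpt", "new.ckpt"]):
        path = Path(tmp) / name
        path.touch()
        os.utime(path, (1_000_000 + age, 1_000_000 + age))
    newest_name = latest_checkpoint(tmp).name

print(newest_name)
```

The returned path can then be passed as the model_path override. Modification time identifies the latest checkpoint, not necessarily the best-performing one.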

Multi-GPU Training

PyTorch Lightning makes multi-GPU training easy and scalable. To enable multi-GPU training, modify the following options in your YAML file:

1. Specify Number of GPUs

Set the number of GPUs using the trainer.devices option.

2. Choose the Distributed Strategy (Recommended: `ddp`)

Use the Distributed Data Parallel (DDP) strategy by setting trainer.strategy to "ddp". This is the most commonly used and recommended strategy for multi-GPU training.

3. Adjust Batch Size

Multiply your original batch size by the number of GPUs so that each GPU processes the same amount of data as in single-GPU training.

4. Enable Mixed Precision (Optional)

For faster training and lower memory usage, enable automatic mixed precision by setting trainer.precision to "16-mixed" (for PyTorch >= 1.6).

trainer:
  devices: <number_of_gpus>
  strategy: ddp
  precision: 16-mixed  # Optional: enables mixed precision

data:
  batch_size: <batch_size_per_gpu * number_of_gpus>  # enter the computed value, e.g. 128

Note: If you previously used batch_size: 32 on a single GPU and now want to use 4 GPUs, set batch_size: 128.

With ddp, each GPU processes its own share of the data in parallel, and gradients are synchronized across GPUs at every training step. Mixed precision further speeds up training and reduces GPU memory usage.
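
Because YAML does not evaluate arithmetic, the batch size has to be computed before it goes into the config. The sketch below does that calculation and formats the four settings from this section as command-line overrides. Both function names are hypothetical, and expressing these settings as overrides instead of YAML entries is an assumption based on the override syntax shown earlier in this tutorial.

```python
def global_batch_size(batch_size_per_gpu, num_gpus):
    """Batch size to configure so each GPU keeps its single-GPU workload."""
    if batch_size_per_gpu < 1 or num_gpus < 1:
        raise ValueError("batch size and GPU count must be positive")
    return batch_size_per_gpu * num_gpus


def multi_gpu_overrides(batch_size_per_gpu, num_gpus, mixed_precision=False):
    """Format the multi-GPU settings as Hydra-style override strings."""
    overrides = [
        f"trainer.devices={num_gpus}",
        "trainer.strategy=ddp",
        f"data.batch_size={global_batch_size(batch_size_per_gpu, num_gpus)}",
    ]
    if mixed_precision:
        overrides.append("trainer.precision=16-mixed")
    return overrides


# The example from the note: 32 per GPU on 4 GPUs.
overrides = multi_gpu_overrides(32, 4, mixed_precision=True)
print(" ".join(overrides))
```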

cuEquivariance Acceleration

cuEquivariance is an NVIDIA Python library that provides accelerated GPU kernels for equivariant neural networks. It can significantly speed up training and inference of MACE and NequIP models. To enable cuEquivariance acceleration in CURATOR, simply set:

model:
  representation:
    use_cueq: true

This automatically applies cuEquivariance-optimized layers where available.
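
Before turning the flag on, it can help to check that the library is importable in your environment. The sketch below does this with the standard library only. The module name `cuequivariance_torch` is an assumption about how the PyTorch bindings are imported, both function names are hypothetical, and the override key simply mirrors the YAML above.

```python
import importlib.util


def cueq_available(module_name="cuequivariance_torch"):
    """Return True if the module can be found without importing it."""
    # The default module name is an assumption; adjust it to your install.
    return importlib.util.find_spec(module_name) is not None


def cueq_override(available):
    """Format the use_cueq setting as a Hydra-style override string."""
    return f"model.representation.use_cueq={'true' if available else 'false'}"


override = cueq_override(cueq_available())
print(override)
```

This check only tells you whether the package is installed; it does not verify that a compatible GPU is present.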