Quickstart#

Downloading datasets#

Datasets can either be built from the source structures or downloaded from Zenodo. A dataset is built from source the first time it is used in a run (or when the setup() method of the corresponding datamodule is called). We provide a CLI tool for downloading datasets:

workshop download <DATASET_NAME>
workshop download pdb
workshop download cath
workshop download afdb_rep_v4
# etc..

If you wish to build datasets from source, we recommend first downloading the entire PDB (in MMTF format, c. 24 GB) so that shared PDB data is reused as much as possible:

workshop download pdb
# or
python proteinworkshop/scripts/download_pdb_mmtf.py

Training a model#

Launching an experiment minimally requires specification of a dataset, structural encoder, and task (devices can be specified with trainer=cpu/gpu):

workshop train dataset=cath encoder=egnn task=inverse_folding trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/train.py dataset=cath encoder=egnn task=inverse_folding trainer=cpu # or trainer=gpu

This command uses the default configuration in proteinworkshop/config/train.yaml, which can be overridden by equivalently named options. For instance, you can select a different input featurisation with the features option, or set the display name of your experiment on wandb with the name option:

workshop train dataset=cath encoder=egnn task=inverse_folding features=ca_bb name=MY-EXPT-NAME trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/train.py dataset=cath encoder=egnn task=inverse_folding features=ca_bb name=MY-EXPT-NAME trainer=cpu # or trainer=gpu
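
When scripting several runs, the key=value overrides shown above can also be assembled programmatically. The following is a minimal sketch, not part of proteinworkshop: `build_train_command` is a hypothetical helper that only builds the argument list for the `workshop train` command and does not launch anything.

```python
from typing import Dict, List


def build_train_command(overrides: Dict[str, str]) -> List[str]:
    """Build an argument list for `workshop train` from key=value overrides.

    Hypothetical helper for illustration; it is not part of proteinworkshop.
    """
    return ["workshop", "train"] + [f"{key}={value}" for key, value in overrides.items()]


# Same options as the command-line example above (data path omitted).
cmd = build_train_command(
    {
        "dataset": "cath",
        "encoder": "egnn",
        "task": "inverse_folding",
        "features": "ca_bb",
        "trainer": "cpu",
    }
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to start the run from Python.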

Finetuning a model#

Finetuning a model additionally requires a checkpoint, specified with the ckpt_path option:

workshop finetune dataset=cath encoder=egnn task=inverse_folding ckpt_path=PATH/TO/CHECKPOINT trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/finetune.py dataset=cath encoder=egnn task=inverse_folding ckpt_path=PATH/TO/CHECKPOINT trainer=cpu # or trainer=gpu

Running a sweep/experiment#

We can use the Hydra wandb sweeper plugin to configure experiments as sweeps, allowing searches over hyperparameters, architectures, pre-training/auxiliary tasks and datasets.

See proteinworkshop/config/sweeps/ for examples.

  1. Create the sweep with Weights & Biases

    wandb sweep proteinworkshop/config/sweeps/my_new_sweep_config.yaml
    
  2. Launch job workers

With wandb:

wandb agent mywandbgroup/proteinworkshop/2wwtt7oy --count 8

Or an example SLURM submission script:

#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --array=0-32

source ~/.bashrc
source $(conda info --base)/envs/proteinworkshop/bin/activate

wandb agent mywandbgroup/proteinworkshop/2wwtt7oy --count 1

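To reproduce the sweeps performed in the manuscript (the sweep IDs below are placeholders; pass wandb agent the ID printed by the preceding wandb sweep command):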

# reproduce the baseline tasks sweep (i.e., those performed without pre-training each model)
wandb sweep proteinworkshop/config/sweeps/baseline_fold.yaml
wandb agent mywandbgroup/proteinworkshop/2awtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/baseline_ppi.yaml
wandb agent mywandbgroup/proteinworkshop/2bwtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/baseline_inverse_folding.yaml
wandb agent mywandbgroup/proteinworkshop/2cwtt7oy --count 8

# reproduce the model pre-training sweep
wandb sweep proteinworkshop/config/sweeps/pre_train.yaml
wandb agent mywandbgroup/proteinworkshop/2dwtt7oy --count 8

# reproduce the pre-trained tasks sweep (i.e., those performed after pre-training each model)
wandb sweep proteinworkshop/config/sweeps/pt_fold.yaml
wandb agent mywandbgroup/proteinworkshop/2ewtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/pt_ppi.yaml
wandb agent mywandbgroup/proteinworkshop/2fwtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/pt_inverse_folding.yaml
wandb agent mywandbgroup/proteinworkshop/2gwtt7oy --count 8
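
The seven sweep configs above follow a regular naming pattern (a baseline and a pre-trained variant per task, plus one pre-training sweep), so the `wandb sweep` commands can be generated in a loop. This is a sketch only: `manuscript_sweep_configs` is a hypothetical helper, and the file names are taken from the commands listed above rather than discovered from the repository.

```python
from pathlib import PurePosixPath
from typing import List

# Directory and task names as they appear in the commands above.
SWEEP_DIR = PurePosixPath("proteinworkshop/config/sweeps")
TASKS = ["fold", "ppi", "inverse_folding"]


def manuscript_sweep_configs() -> List[str]:
    """Return the sweep config paths in the order used above (hypothetical helper)."""
    names = [f"baseline_{task}" for task in TASKS]  # sweeps without pre-training
    names.append("pre_train")                       # the pre-training sweep
    names += [f"pt_{task}" for task in TASKS]       # sweeps after pre-training
    return [str(SWEEP_DIR / f"{name}.yaml") for name in names]


for path in manuscript_sweep_configs():
    print(f"wandb sweep {path}")
```

Each printed command still has to be run, and the sweep ID it reports passed to `wandb agent` as in the commands above.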

Embedding a dataset#

We provide a utility in proteinworkshop/embed.py for embedding a dataset using a pre-trained model. To run it:

python proteinworkshop/embed.py ckpt_path=PATH/TO/CHECKPOINT collection_name=COLLECTION_NAME

See the embed section of proteinworkshop/config/embed.yaml for additional parameters.

Visualising pre-trained model embeddings for a given dataset#

We provide a utility in proteinworkshop/visualise.py for visualising the UMAP embeddings of a pre-trained model for a given dataset. To run it:

python proteinworkshop/visualise.py ckpt_path=PATH/TO/CHECKPOINT plot_filepath=VISUALISATION/FILEPATH.png

See the visualise section of proteinworkshop/config/visualise.yaml for additional parameters.

Performing attribution of a pre-trained model#

We provide a utility in proteinworkshop/explain.py for performing attribution of a pre-trained model using integrated gradients.

This will write PDB files for all the structures in a dataset for a supervised task with residue-level attributions in the b_factor column. To visualise the attributions, we recommend using the Protein Viewer VSCode extension and changing the 3D representation to colour by Uncertainty/Disorder.

To run the attribution:

python proteinworkshop/explain.py ckpt_path=PATH/TO/CHECKPOINT output_dir=ATTRIBUTION/DIRECTORY

See the explain section of proteinworkshop/config/explain.yaml for additional parameters.
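
Because the attributions are stored in the B-factor column, they can also be read back without a viewer. The sketch below parses the fixed-width ATOM records of the PDB format (chain ID in column 22, residue number in columns 23-26, B-factor in columns 61-66). `read_residue_b_factors` and `atom_line` are hypothetical helpers, the input is synthetic, and the code assumes that every atom of a residue carries the same value, which may not match the files explain.py writes.

```python
from typing import Dict, Tuple


def read_residue_b_factors(pdb_text: str) -> Dict[Tuple[str, int], float]:
    """Return {(chain ID, residue number): B-factor} from PDB ATOM records.

    Assumes every atom of a residue carries the same value, so only the
    first atom seen for each residue is kept.
    """
    values: Dict[Tuple[str, int], float] = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        chain_id = line[21]            # column 22
        res_seq = int(line[22:26])     # columns 23-26
        b_factor = float(line[60:66])  # columns 61-66
        values.setdefault((chain_id, res_seq), b_factor)
    return values


def atom_line(serial: int, name: str, res_name: str, res_seq: int, b_factor: float) -> str:
    """Format a synthetic fixed-width ATOM record on chain A (illustrative data only)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} A{res_seq:>4}    "
        f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}{1.0:>6.2f}{b_factor:>6.2f}"
    )


# Two residues; the values stand in for residue-level attributions.
example_pdb = "\n".join(
    [
        atom_line(1, "N", "MET", 1, 0.12),
        atom_line(2, "CA", "MET", 1, 0.12),
        atom_line(3, "CA", "GLY", 2, 0.87),
    ]
)
attributions = read_residue_b_factors(example_pdb)
print(attributions)
```

For a real output file, replace `example_pdb` with the contents of one of the PDB files written to output_dir.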

Verifying a config#

python proteinworkshop/validate_config.py dataset=cath features=full_atom task=inverse_folding

Using proteinworkshop modules functionally#

The modules of proteinworkshop (e.g., datasets, models, featurisers, and utilities) can be used functionally by importing them directly. When the package is installed from PyPI, this makes building on top of the assets of proteinworkshop straightforward and convenient.

For example, to use any datamodule available in proteinworkshop:

from proteinworkshop.datasets.cath import CATHDataModule

datamodule = CATHDataModule(path="data/cath/", pdb_dir="data/pdb/", format="mmtf", batch_size=32)
datamodule.download()

train_dl = datamodule.train_dataloader()

To use any model or featuriser available in proteinworkshop:

from proteinworkshop.models.graph_encoders.dimenetpp import DimeNetPPModel
from proteinworkshop.features.factory import ProteinFeaturiser
from proteinworkshop.datasets.utils import create_example_batch

model = DimeNetPPModel(hidden_channels=64, num_layers=3)
ca_featuriser = ProteinFeaturiser(
    representation="CA",
    scalar_node_features=["amino_acid_one_hot"],
    vector_node_features=[],
    edge_types=["knn_16"],
    scalar_edge_features=["edge_distance"],
    vector_edge_features=[],
)

example_batch = create_example_batch()
batch = ca_featuriser(example_batch)

# Pass the featurised batch to the encoder
model_outputs = model(batch)