Quickstart#
Downloading datasets#
Datasets can either be built from the source structures or downloaded from Zenodo. A dataset is built from source the first time it is used in a run (or by calling the setup() method of the corresponding datamodule). We provide a CLI tool for downloading datasets:
workshop download <DATASET_NAME>
workshop download pdb
workshop download cath
workshop download afdb_rep_v4
# etc.
If you wish to build datasets from source, we recommend first downloading the entire PDB (in MMTF format, c. 24 GB) so that shared PDB data is reused as much as possible:
workshop download pdb
# or
python proteinworkshop/scripts/download_pdb_mmtf.py
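Alternatively, a dataset can be built from source in Python by instantiating its datamodule and calling setup() directly. A minimal sketch, reusing the CATHDataModule arguments shown later on this page:
from proteinworkshop.datasets.cath import CATHDataModule

# Instantiating the datamodule and calling setup() triggers download and
# processing of the source structures on first use. Note: some Lightning
# versions require a stage argument, e.g. datamodule.setup("fit").
datamodule = CATHDataModule(path="data/cath/", pdb_dir="data/pdb/", format="mmtf", batch_size=32)
datamodule.setup()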
Training a model#
Launching an experiment minimally requires specification of a dataset, structural encoder, and task (devices can be specified with trainer=cpu or trainer=gpu):
workshop train dataset=cath encoder=egnn task=inverse_folding trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/train.py dataset=cath encoder=egnn task=inverse_folding trainer=cpu # or trainer=gpu
This command uses the default configurations in configs/train.yaml, which can be overwritten by equivalently named options. For instance, you can use a different input featurisation via the features option, or set the display name of your experiment on wandb via the name option:
workshop train dataset=cath encoder=egnn task=inverse_folding features=ca_bb name=MY-EXPT-NAME trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/train.py dataset=cath encoder=egnn task=inverse_folding features=ca_bb name=MY-EXPT-NAME trainer=cpu # or trainer=gpu
Finetuning a model#
Finetuning a model additionally requires specification of a checkpoint.
workshop finetune dataset=cath encoder=egnn task=inverse_folding ckpt_path=PATH/TO/CHECKPOINT trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/finetune.py dataset=cath encoder=egnn task=inverse_folding ckpt_path=PATH/TO/CHECKPOINT trainer=cpu # or trainer=gpu
Running a sweep/experiment#
We can make use of the Hydra wandb sweeper plugin to configure experiments as sweeps, allowing searches over hyperparameters, architectures, pre-training/auxiliary tasks and datasets. See proteinworkshop/config/sweeps/ for examples.
Create the sweep with Weights & Biases:
wandb sweep proteinworkshop/config/sweeps/my_new_sweep_config.yaml
Launch job workers
With wandb:
wandb agent mywandbgroup/proteinworkshop/2wwtt7oy --count 8
Or an example SLURM submission script:
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --array=0-32
source ~/.bashrc
conda activate proteinworkshop
wandb agent mywandbgroup/proteinworkshop/2wwtt7oy --count 1
Reproduce the sweeps performed in the manuscript:
# reproduce the baseline tasks sweep (i.e., those performed without pre-training each model)
wandb sweep proteinworkshop/config/sweeps/baseline_fold.yaml
wandb agent mywandbgroup/proteinworkshop/2awtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/baseline_ppi.yaml
wandb agent mywandbgroup/proteinworkshop/2bwtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/baseline_inverse_folding.yaml
wandb agent mywandbgroup/proteinworkshop/2cwtt7oy --count 8
# reproduce the model pre-training sweep
wandb sweep proteinworkshop/config/sweeps/pre_train.yaml
wandb agent mywandbgroup/proteinworkshop/2dwtt7oy --count 8
# reproduce the pre-trained tasks sweep (i.e., those performed after pre-training each model)
wandb sweep proteinworkshop/config/sweeps/pt_fold.yaml
wandb agent mywandbgroup/proteinworkshop/2ewtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/pt_ppi.yaml
wandb agent mywandbgroup/proteinworkshop/2fwtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/pt_inverse_folding.yaml
wandb agent mywandbgroup/proteinworkshop/2gwtt7oy --count 8
Embedding a dataset#
We provide a utility in proteinworkshop/embed.py for embedding a dataset using a pre-trained model. To run it:
python proteinworkshop/embed.py ckpt_path=PATH/TO/CHECKPOINT collection_name=COLLECTION_NAME
See the embed section of proteinworkshop/config/embed.yaml for additional parameters.
Visualising pre-trained model embeddings for a given dataset#
We provide a utility in proteinworkshop/visualise.py for visualising the UMAP embeddings of a pre-trained model for a given dataset. To run it:
python proteinworkshop/visualise.py ckpt_path=PATH/TO/CHECKPOINT plot_filepath=VISUALISATION/FILEPATH.png
See the visualise section of proteinworkshop/config/visualise.yaml for additional parameters.
Performing attribution of a pre-trained model#
We provide a utility in proteinworkshop/explain.py for performing attribution of a pre-trained model using integrated gradients. This will write PDB files for all the structures in a dataset for a supervised task, with residue-level attributions in the b_factor column. To visualise the attributions, we recommend using the Protein Viewer VSCode extension and changing the 3D representation to colour by Uncertainty/Disorder.
To run the attribution:
python proteinworkshop/explain.py ckpt_path=PATH/TO/CHECKPOINT output_dir=ATTRIBUTION/DIRECTORY
See the explain section of proteinworkshop/config/explain.yaml for additional parameters.
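The written attributions can also be inspected programmatically. A minimal sketch, assuming BioPandas is installed and using a hypothetical output filename:
from biopandas.pdb import PandasPdb

# Read one attribution PDB (the filename here is hypothetical) and pull the
# per-residue attribution scores stored in the B-factor column.
ppdb = PandasPdb().read_pdb("ATTRIBUTION/DIRECTORY/structure.pdb")
atoms = ppdb.df["ATOM"]
per_residue = atoms.groupby("residue_number")["b_factor"].first()
print(per_residue.head())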
Verifying a config#
To check that a configuration composes correctly before launching a run:
python proteinworkshop/validate_config.py dataset=cath features=full_atom task=inverse_folding
Using proteinworkshop modules functionally#
One may use the modules (e.g., datasets, models, featurisers, and utilities) of proteinworkshop functionally by importing them directly. When the package is installed from PyPI, this makes building on top of the assets of proteinworkshop straightforward and convenient. For example, to use any datamodule available in proteinworkshop:
from proteinworkshop.datasets.cath import CATHDataModule

# Instantiate the CATH datamodule, download its data, and construct the
# training dataloader.
datamodule = CATHDataModule(path="data/cath/", pdb_dir="data/pdb/", format="mmtf", batch_size=32)
datamodule.download()
train_dl = datamodule.train_dataloader()
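One can then, for example, pull a single batch from the dataloader to inspect its contents (a minimal sketch; the exact batch type depends on the datamodule):
batch = next(iter(train_dl))  # fetch one batch from the training set
print(batch)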
To use any model or featuriser available in proteinworkshop:
from proteinworkshop.models.graph_encoders.dimenetpp import DimeNetPPModel
from proteinworkshop.features.factory import ProteinFeaturiser
from proteinworkshop.datasets.utils import create_example_batch

# Build a DimeNet++ encoder and a C-alpha featuriser.
model = DimeNetPPModel(hidden_channels=64, num_layers=3)
ca_featuriser = ProteinFeaturiser(
    representation="CA",
    scalar_node_features=["amino_acid_one_hot"],
    vector_node_features=[],
    edge_types=["knn_16"],
    scalar_edge_features=["edge_distance"],
    vector_edge_features=[],
)

# Featurise an example batch, then run the encoder forward pass on it.
example_batch = create_example_batch()
batch = ca_featuriser(example_batch)
model_outputs = model(batch)
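These pieces compose naturally. A hedged end-to-end sketch combining the datamodule from the previous example with the featuriser and encoder above (assuming the dataloader yields batches the featuriser accepts):
# Iterate over the CATH training set, featurising and encoding each batch.
for batch in train_dl:
    batch = ca_featuriser(batch)
    outputs = model(batch)
    break  # one batch is enough for illustration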