Dataset#
To switch between different datasets, simply change the dataset argument in the launch command. For example:
workshop train encoder=gear_net dataset=<DATASET_NAME> task=inverse_folding trainer=cpu
# or
python proteinworkshop/train.py encoder=gear_net dataset=<DATASET_NAME> task=inverse_folding trainer=cpu # or trainer=gpu
Where <DATASET_NAME> is given by the bracketed name in the listing below. For example, the dataset name for CATH is cath.
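As a rough illustration of how the `dataset` override slots into the launch command, the sketch below assembles the command line for a few dataset names. `build_train_command` is a hypothetical helper, not part of proteinworkshop; it only formats the arguments shown above and does not launch anything.

```python
import shlex


def build_train_command(dataset, encoder="gear_net", task="inverse_folding", trainer="cpu"):
    """Return the `workshop train` argument list for one dataset (hypothetical helper)."""
    return [
        "workshop",
        "train",
        f"encoder={encoder}",
        f"dataset={dataset}",
        f"task={task}",
        f"trainer={trainer}",
    ]


# Dataset names taken from the bracketed names in the listing.
for name in ["cath", "pdb", "astral"]:
    print(shlex.join(build_train_command(name)))
```

Only the `dataset=` argument changes between runs; the other overrides stay as in the example command.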
Note
If you have pip-installed proteinworkshop, you can download pre-training or processed downstream datasets from Zenodo with:
workshop download <DATASET_NAME>
Unlabelled Datasets#
Structure-based Pre-training Corpuses#
Pre-training corpuses (with the exception of pdb, cath, and astral) are provided in FoldComp database format. This format is highly compressed, so the corpuses need little disk space despite their size. pdb is provided as a collection of MMTF files, which are significantly smaller than conventional .pdb or .cif files.
| Name | Description | Source | Size | Disk Size | License |
|---|---|---|---|---|---|
| astral | SCOPe domain structures | | | 1 - 2.2 GB | |
| afdb_rep_v4 | Representative structures identified from the AlphaFold database by FoldSeek structural clustering | | 2.27M Chains | 9.6 GB | |
| afdb_rep_dark_v4 | Dark proteome structures identified by structural clustering of the AlphaFold database | | ~800k | 2.2 GB | |
| afdb_swissprot_v4 | AlphaFold2 predictions for SwissProt/UniProtKB | | 542k Chains | 2.9 GB | |
| afdb_uniprot_v4 | AlphaFold2 predictions for UniProt | | 214M Chains | 1 TB | |
| cath | CATH 4.2 40% split by CATH topologies | | ~18k Chains | 4.3 GB | |
| | ESMAtlas predictions (full) | | | 1 TB | |
| esmatlas_v2023_02 | ESMAtlas predictions (v2023_02 release) | | | 137 GB | |
| highquality_clust30 | ESMAtlas High Quality predictions | | 37M Chains | 114 GB | |
| | IGFold predictions for Paired OAS | | 104,994 paired Ab chains | | |
| | IGFold predictions for Jaffe2022 data | | 1,340,180 paired Ab chains | | |
| pdb | Experimental structures deposited in the RCSB Protein Data Bank | | ~800k Chains | 23 GB | |
Additionally, we provide several species-specific compilations (mostly reference species):
| Name | Description | Source | Size |
| --- | --- | --- | --- |
| `a_thaliana` | _Arabidopsis thaliana_ (thale cress) proteome | AlphaFold2 | |
| `c_albicans` | _Candida albicans_ (a fungus) proteome | AlphaFold2 | |
| `c_elegans` | _Caenorhabditis elegans_ (roundworm) proteome | AlphaFold2 | |
| `d_discoideum` | _Dictyostelium discoideum_ (slime mold) proteome | AlphaFold2 | |
| `d_melanogaster` | [_Drosophila melanogaster_](https://www.uniprot.org/taxonomy/7227) (fruit fly) proteome | AlphaFold2 | |
| `d_rerio` | [_Danio rerio_](https://www.uniprot.org/taxonomy/7955) (zebrafish) proteome | AlphaFold2 | |
| `e_coli` | _Escherichia coli_ (a bacterium) proteome | AlphaFold2 | |
| `g_max` | _Glycine max_ (soybean) proteome | AlphaFold2 | |
| `h_sapiens` | _Homo sapiens_ (human) proteome | AlphaFold2 | |
| `m_jannaschii` | _Methanocaldococcus jannaschii_ (an archaeon) proteome | AlphaFold2 | |
| `m_musculus` | _Mus musculus_ (mouse) proteome | AlphaFold2 | |
| `o_sativa` | _Oryza sativa_ (rice) proteome | AlphaFold2 | |
| `r_norvegicus` | _Rattus norvegicus_ (brown rat) proteome | AlphaFold2 | |
| `s_cerevisiae` | _Saccharomyces cerevisiae_ (brewer's yeast) proteome | AlphaFold2 | |
| `s_pombe` | _Schizosaccharomyces pombe_ (a fungus) proteome | AlphaFold2 | |
| `z_mays` | _Zea mays_ (corn) proteome | AlphaFold2 | |
ASTRAL (astral)#
ASTRAL provides compendia of protein domain structures, regions of proteins that can maintain their structure and function independently of the rest of the protein. Domains typically exhibit highly-specific functions and can be considered structural building blocks of proteins.
datamodule:
_target_: "proteinworkshop.datasets.astral.AstralDataModule"
path: ${env.paths.data}/Astral/ # Directory where the dataset is stored
release: "2.08" # Version of ASTRAL to use
identity: "95" # Percent identity clustering threshold to use
batch_size: 32 # Batch size for dataloader
pin_memory: True # Pin memory for dataloader
num_workers: 4 # Number of workers for dataloader
dataset_fraction: 1.0 # Fraction of dataset to use
transforms: ${transforms} # Transforms to apply to dataset examples
overwrite: False # Whether to overwrite cached dataset example files
train_val_test: [0.8, 0.1, 0.1] # Cross-validation ratios to use for train, val, and test splits
num_classes: null # Number of classes
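The `train_val_test` option above gives three fractions that partition the dataset. The following is a minimal sketch of what such a ratio-based split does, assuming a seeded random shuffle; `split_by_ratio` is a hypothetical function and is not taken from `AstralDataModule`, whose actual splitting logic may differ.

```python
import random


def split_by_ratio(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and cut them into train/val/test lists by the given ratios."""
    if abs(sum(ratios) - 1.0) > 1e-8:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_val = int(ratios[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no example is dropped
    return train, val, test


train, val, test = split_by_ratio(range(1000))
print(len(train), len(val), len(test))
```

Giving the remainder to the test split is a design choice in this sketch; it keeps every example in exactly one split when the fractions do not divide the dataset evenly.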
CATH (cath)#
datamodule:
_target_: "proteinworkshop.datasets.cath.CATHDataModule"
path: ${env.paths.data}/cath/ # Directory where the dataset is stored
pdb_dir: ${env.paths.data}/pdb/ # Directory where raw PDB/mmtf files are stored
format: "mmtf" # Format of the raw PDB/MMTF files
num_workers: 4 # Number of workers for dataloader
pin_memory: True # Pin memory for dataloader
batch_size: 32 # Batch size for dataloader
dataset_fraction: 1.0 # Fraction of the dataset to use
transforms: ${transforms} # Transforms to apply to dataset examples
overwrite: False # Whether to overwrite the dataset if it already exists
in_memory: True # Whether to load the entire dataset into memory
num_classes: 23 # Number of classes
PDB (pdb)#
See also
proteinworkshop.datasets.pdb_dataset.PDBData
datamodule:
_target_: "proteinworkshop.datasets.pdb_dataset.PDBDataModule"
path: ${env.paths.data}/pdb/ # Directory where the dataset is stored
batch_size: 32 # Batch size for dataloader
num_workers: 4 # Number of workers for dataloader
pin_memory: True # Pin memory for dataloader
transforms: ${transforms} # Transforms to apply to dataset examples
overwrite: False # Whether to overwrite existing dataset files
pdb_dataset:
_target_: "proteinworkshop.datasets.pdb_dataset.PDBData"
fraction: 1.0 # Fraction of dataset to use
molecule_type: "protein" # Type of molecule for which to select
experiment_types: ["diffraction", "NMR", "EM", "other"] # All experiment types
max_length: 1000 # Exclude polypeptides greater than length 1000
min_length: 10 # Exclude peptides shorter than length 10
oligomeric_min: 1 # Include proteins with at least 1 chain (monomers and larger)
oligomeric_max: 5 # Include up to 5-meric proteins
best_resolution: 0.0 # Include only proteins with resolution >= 0.0
worst_resolution: 8.0 # Include only proteins with resolution <= 8.0
has_ligands: ["ZN"] # Include only proteins containing the ligand `ZN`
remove_ligands: [] # Exclude specific ligands from any available protein-ligand complexes
remove_non_standard_residues: True # Include only proteins containing standard amino acid residues
remove_pdb_unavailable: True # Include only proteins that are available to download
split_sizes: [0.8, 0.1, 0.1] # Cross-validation ratios to use for train, val, and test splits
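To make the combined effect of the `pdb_dataset` filters concrete, here is a small sketch that applies the length, oligomeric state, resolution, and ligand criteria to a few made-up records. `Entry` and `keep` are hypothetical names, the records are invented, and treating every bound as inclusive is an assumption; the real selection is performed by `PDBData`.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for the metadata of one PDB entry."""
    pdb_id: str
    length: int
    resolution: float  # in Å; lower values mean better resolution
    n_chains: int
    ligands: tuple


def keep(entry, min_length=10, max_length=1000, oligomeric_min=1, oligomeric_max=5,
         best_resolution=0.0, worst_resolution=8.0, has_ligands=("ZN",)):
    """Return True if the entry passes every filter (bounds assumed inclusive)."""
    return (
        min_length <= entry.length <= max_length
        and oligomeric_min <= entry.n_chains <= oligomeric_max
        and best_resolution <= entry.resolution <= worst_resolution
        and all(lig in entry.ligands for lig in has_ligands)
    )


entries = [
    Entry("demo_a", length=250, resolution=2.1, n_chains=1, ligands=("ZN",)),
    Entry("demo_b", length=5, resolution=1.8, n_chains=1, ligands=("ZN",)),    # too short
    Entry("demo_c", length=300, resolution=2.5, n_chains=8, ligands=("ZN",)),  # too many chains
    Entry("demo_d", length=300, resolution=2.5, n_chains=2, ligands=()),       # no ZN ligand
]
kept = [e.pdb_id for e in entries if keep(e)]
print(kept)
```

The filters are conjunctive in this sketch: an entry must satisfy all of them to be retained.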
AFdb Rep. v4 (afdb_rep_v4)#
This is a dataset of approximately 3 million protein structures from the AlphaFold database, structurally clustered using FoldSeek.
datamodule:
_target_: graphein.ml.datasets.foldcomp_dataset.FoldCompLightningDataModule
data_dir: ${env.paths.data}/afdb_rep_v4/
database: "afdb_rep_v4"
batch_size: 32
num_workers: 4
train_split: 0.98
val_split: 0.01
test_split: 0.01
pin_memory: True
use_graphein: True
transform: ${transforms}
dataset_name: "afdb_rep_v4"
num_classes: None # number of classes
AFdb Dark Proteome (afdb_rep_dark_v4)#
datamodule:
_target_: graphein.ml.datasets.foldcomp_dataset.FoldCompLightningDataModule
data_dir: ${env.paths.data}/afdb_rep_dark_v4/
database: "afdb_rep_dark_v4"
batch_size: 32
num_workers: 4
train_split: 0.8
val_split: 0.1
test_split: 0.1
pin_memory: True
use_graphein: True
transform: ${transforms}
dataset_name: "afdb_rep_dark_v4"
num_classes: None # number of classes
ESM Atlas (esmatlas_v2023_02)#
datamodule:
_target_: graphein.ml.datasets.foldcomp_dataset.FoldCompLightningDataModule
data_dir: ${env.paths.data}/esmatlas_v2023_02/
database: "esmatlas_v2023_02"
batch_size: 32
num_workers: 4
train_split: 0.8
val_split: 0.1
test_split: 0.1
pin_memory: True
use_graphein: True
transform: ${transforms}
dataset_name: "esmatlas_v2023_02"
num_classes: None # number of classes
ESM Atlas (High Quality) (highquality_clust30)#
datamodule:
_target_: graphein.ml.datasets.foldcomp_dataset.FoldCompLightningDataModule
data_dir: ${env.paths.data}/highquality_clust30/
database: "highquality_clust30"
batch_size: 32
num_workers: 4
train_split: 0.8
val_split: 0.1
test_split: 0.1
pin_memory: True
use_graphein: True
transform: ${transforms}
dataset_name: "highquality_clust30"
num_classes: None # number of classes
UniProt (AlphaFold) (afdb_uniprot_v4)#
datamodule:
_target_: graphein.ml.datasets.foldcomp_dataset.FoldCompLightningDataModule
data_dir: ${env.paths.data}/afdb_uniprot_v4/
database: "afdb_uniprot_v4"
batch_size: 32
num_workers: 4
train_split: 0.8
val_split: 0.1
test_split: 0.1
pin_memory: True
use_graphein: True
transform: ${transforms}
dataset_name: "afdb_uniprot_v4"
num_classes: None # number of classes
SwissProt (AlphaFold) (afdb_swissprot_v4)#
datamodule:
_target_: graphein.ml.datasets.foldcomp_dataset.FoldCompLightningDataModule
data_dir: ${env.paths.data}/afdb_swissprot_v4/
database: "afdb_swissprot_v4"
batch_size: 32
num_workers: 32
train_split: 0.8
val_split: 0.1
test_split: 0.1
pin_memory: True
use_graphein: True
transform: ${transforms}
dataset_name: "afdb_swissprot_v4"
num_classes: None # number of classes
Species-Specific Datasets#
Stay tuned!
Graph-level Datasets#
Antibody Developability (antibody_developability)#
Therapeutic antibodies must be optimised for favourable physicochemical properties in addition to target binding affinity and specificity to be viable development candidates. Consequently, this task frames prediction of antibody developability as a binary graph classification task indicating whether a given antibody is developable.
Dataset: We adopt the antibody developability dataset originally curated from SabDab by TDC.
Impact: From a benchmarking perspective, this task is important as it enables targeted performance assessment of models on a specific (immunoglobulin) fold, providing insight into whether general-purpose structure-based encoders are applicable to fold-specific tasks.
datamodule:
_target_: proteinworkshop.datasets.antibody_developability.AntibodyDevelopabilityDataModule
path: ${env.paths.data}/AntibodyDevelopability # Directory where the dataset is stored
pdb_dir: ${env.paths.data}/pdb/ # Path to all downloaded PDB files
batch_size: 32 # Batch size for dataloader
pin_memory: True # Pin memory for dataloader
num_workers: 4 # Number of workers for dataloader
in_memory: False # Load the dataset in memory
format: "mmtf" # Format of the structure files
obsolete_strategy: "drop" # What to do with obsolete PDB entries
transforms: ${transforms} # Transforms to apply to dataset examples
overwrite: False
num_classes: 2 # Number of classes
Atom3D Mutation Stability Prediction (atom3d_msp)#
This task is defined in the Atom3D benchmark.
As per their documentation:
Impact: Identifying mutations that stabilize a protein’s interactions is a key task in designing new proteins. Experimental techniques for probing these are labor intensive, motivating the development of efficient computational methods.
Dataset description: We derive a novel dataset by collecting single-point mutations from the SKEMPI database (Jankauskaitė et al., 2019) and model each mutation into the structure to produce mutated structures.
Task: We formulate this as a binary classification task where we predict whether the stability of the complex increases as a result of the mutation.
Splitting criteria: We split protein complexes by sequence identity at 30%.
Downloads: The full dataset, split data, and split indices are available for download via Zenodo (doi:10.5281/zenodo.4962515)
datamodule:
_target_: proteinworkshop.datasets.atom3d_datamodule.ATOM3DDataModule
task: MSP
data_dir: ${env.paths.data}/ATOM3D
max_units: 0
unit: edge
batch_size: 1
num_workers: 4
pin_memory: false
num_classes: 2
Atom3D Protein Structure Ranking (atom3d_psr)#
This task is defined in the Atom3D benchmark.
As per their documentation:
Impact: Proteins are one of the primary workhorses of the cell, and knowing their structure is often critical to understanding (and engineering) their function.
Dataset description: The Critical Assessment of Structure Prediction (CASP) (Kryshtafovych et al., 2019) is a blind international competition for predicting protein structure.
Task: We formulate this as a regression task, where we predict the global distance test (GDT_TS) from the true structure for each of the predicted structures submitted in the last 18 years of CASP.
Splitting criteria: We split structures temporally by competition year.
Downloads: The full dataset, split data, and split indices are available for download via Zenodo (doi:10.5281/zenodo.4915648)
datamodule:
_target_: proteinworkshop.datasets.atom3d_datamodule.ATOM3DDataModule
task: PSR
data_dir: ${env.paths.data}/ATOM3D
max_units: 0
unit: edge
batch_size: 1
num_workers: 4
pin_memory: false
num_classes: 1
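The regression target in this task is GDT_TS. As a simplified sketch of how that score is commonly defined, the function below averages the fraction of residues whose deviation from the true structure falls within 1, 2, 4, and 8 Å. It assumes the per-residue Cα deviations have already been computed after a single superposition; the full GDT procedure searches over superpositions for each cutoff, which this sketch does not do. `gdt_ts` and the example deviations are illustrative and are not taken from Atom3D.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average, over the cutoffs, of the fraction of residues within each cutoff."""
    if not distances:
        raise ValueError("need at least one residue distance")
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return sum(fractions) / len(fractions)


# Hypothetical per-residue Cα deviations (Å) between a model and the true structure.
deviations = [0.5, 1.5, 3.0, 6.0, 10.0]
score = gdt_ts(deviations)
print(round(score, 3))
```

The score here lies between 0 and 1; it is often reported as a percentage by multiplying by 100.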
Deep Sea Protein Classification (deep_sea_proteins)#
datamodule:
_target_: proteinworkshop.datasets.deep_sea_proteins.DeepSeaProteinsDataModule
path: ${env.paths.data}/deep-sea-proteins/ # Directory where the dataset is stored
pdb_dir: ${env.paths.data}/pdb/ # Directory where raw PDB/mmtf files are stored
validation_fold: "4" # Fold to use for validation (one of '1', '2', '3', '4', 'PM_group')
batch_size: 32 # Batch size for dataloader
pin_memory: True # Pin memory for dataloader
num_workers: 8 # Number of workers for dataloader
obsolete_strategy: "drop"
format: "mmtf" # Format of the raw PDB/MMTF files
transforms: ${transforms}
overwrite: False
num_classes: 2
Enzyme Commission Number Prediction (ec_reaction)#
datamodule:
_target_: proteinworkshop.datasets.ec_reaction.EnzymeCommissionReactionDataset
path: ${env.paths.data}/ECReaction/ # Directory where the dataset is stored
pdb_dir: ${env.paths.data}/pdb/ # Directory where raw PDB/mmtf files are stored
format: "mmtf" # Format of the raw PDB/MMTF files
batch_size: 32 # Batch size for dataloader
pin_memory: True # Pin memory for dataloader
num_workers: 8 # Number of workers for dataloader
dataset_fraction: 1.0 # Fraction of the dataset to use
shuffle_labels: False # Whether to shuffle labels for permutation testing
transforms: ${transforms}
overwrite: False
in_memory: True
num_classes: 384
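The `shuffle_labels` option is described above as being for permutation testing. A minimal sketch of the idea, independent of how the datamodule implements it: permuting the labels breaks the pairing between structures and labels while preserving the label distribution, so a model trained on the permuted labels gives a chance-level baseline. `shuffle_labels` below is a hypothetical function, and the label values are invented.

```python
import random


def shuffle_labels(labels, seed=0):
    """Return a permuted copy of labels, breaking the structure-label pairing."""
    permuted = list(labels)
    random.Random(seed).shuffle(permuted)
    return permuted


# Hypothetical class labels for six proteins.
labels = [0, 0, 1, 2, 2, 2]
permuted = shuffle_labels(labels)
print(permuted)
```

A model that scores similarly on true and permuted labels has likely not learned a real structure-label relationship.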
Fold Classification (fold-family, fold-superfamily, fold-fold)#
This is a multiclass graph classification task where each protein, G, is mapped to a label y ∈ {1, … , 1195} denoting the fold class.
Dataset: We adopt the fold classification dataset originally curated from SCOP 1.75 by Hermosilla et al. In particular, this dataset contains three distinct test splits across which we average a method’s results.
Impact: The utility of this task is that it serves as a litmus test for the ability of a model to distinguish different structural folds. It stands to reason that models that perform poorly on distinguishing fold classes likely learn limited or low-quality structural representations.
Splitting Criteria:
datamodule:
_target_: "proteinworkshop.datasets.fold_classification.FoldClassificationDataModule"
path: ${env.paths.data}/FoldClassification/ # Directory where the dataset is stored
split: "family" # Level of fold classification to perform (`family`, `superfamily`, or `fold`)
batch_size: 32 # Batch size for dataloader
pin_memory: True # Pin memory for dataloader
num_workers: 4 # Number of workers for dataloader
dataset_fraction: 1.0 # Fraction of dataset to use
shuffle_labels: False # Whether to shuffle labels for permutation testing
transforms: ${transforms} # Transforms to apply to dataset examples
overwrite: False # Whether to overwrite existing dataset files
in_memory: True # Whether to load the entire dataset into memory
num_classes: 1195 # Number of classes
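The dataset description above mentions averaging a method's results across the three test splits. A minimal sketch of that aggregation follows, assuming an unweighted mean over the `fold`, `superfamily`, and `family` splits; the weighting is an assumption, `average_over_splits` is a hypothetical helper, and the accuracies are made-up numbers rather than measured results.

```python
from statistics import mean


def average_over_splits(per_split_scores):
    """Average one metric across the test splits (unweighted mean)."""
    return mean(per_split_scores.values())


# Hypothetical accuracies, for illustration only (not measured results).
scores = {"fold": 0.30, "superfamily": 0.50, "family": 0.70}
overall = average_over_splits(scores)
print(round(overall, 3))
```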
Gene Ontology (go-bp, go-cc, go-mf)#
datamodule:
_target_: proteinworkshop.datasets.go.GeneOntologyDataset
path: ${env.paths.data}/GeneOntology/ # Directory where the dataset is stored
pdb_dir: ${env.paths.data}/pdb/ # Directory where raw PDB/mmtf files are stored
format: "mmtf" # Format of the raw PDB/MMTF files
batch_size: 32 # Batch size for dataloader
dataset_fraction: 1.0 # Fraction of the dataset to use
shuffle_labels: False # Whether to shuffle labels for permutation testing
pin_memory: True # Pin memory for dataloader
num_workers: 8 # Number of workers for dataloader
split: "BP" # Split of the dataset to use (`BP`, `MF`, `CC`)
transforms: ${transforms} # Transforms to apply to dataset examples
overwrite: False # Whether to overwrite existing dataset files
in_memory: True
num_classes: 1943 # Number of classes
Node-level Datasets#
Atom3D Residue Identity Prediction (atom3d_res)#
This task is defined in the Atom3D benchmark.
As per their documentation:
Impact: Understanding the structural role of individual amino acids is important for engineering new proteins. We can understand this role by predicting the substitutabilities of different amino acids at a given protein site based on the surrounding structural environment.
Dataset description: We generate a novel dataset consisting of atomic environments extracted from nonredundant structures in the PDB.
Task: We formulate this as a classification task where we predict the identity of the amino acid in the center of the environment based on all other atoms.
Splitting criteria: We split residue environments by domain-level CATH protein topology class.
datamodule:
_target_: proteinworkshop.datasets.atom3d_datamodule.ATOM3DDataModule
task: RES
data_dir: ${env.paths.data}/ATOM3D
res_split: cath-topology
max_units: 0
unit: edge
batch_size: 1
num_workers: 4
pin_memory: false
num_classes: 20
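The splitting criterion above (`res_split: cath-topology`) keeps residue environments from the same CATH topology in the same split. The sketch below shows one way a group-wise split of this kind can be done: whole groups, not individual examples, are assigned to train, val, or test. `split_by_group`, the ratios, and the topology labels are all hypothetical; the Atom3D splits are precomputed and are not produced by this code.

```python
import random
from collections import defaultdict


def split_by_group(examples, group_of, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole groups to train/val/test so no group spans two splits."""
    groups = defaultdict(list)
    for ex in examples:
        groups[group_of(ex)].append(ex)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_train = int(ratios[0] * len(keys))
    n_val = int(ratios[1] * len(keys))
    parts = (keys[:n_train], keys[n_train:n_train + n_val], keys[n_train + n_val:])
    return tuple([ex for k in part for ex in groups[k]] for part in parts)


# Hypothetical (environment id, topology label) pairs: 10 topologies, 3 environments each.
examples = [(f"env_{t}_{i}", f"topology_{t}") for t in range(10) for i in range(3)]
train, val, test = split_by_group(examples, group_of=lambda ex: ex[1])
print(len(train), len(val), len(test))
```

Because the ratios apply to groups rather than examples, the resulting split sizes only approximate the requested fractions when group sizes vary.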
CCPDB Ligand Binding (ccpdb_ligand)#
datamodule:
_target_: proteinworkshop.datasets.cc_pdb.CCPDBDataModule
path: ${env.paths.data}/ccpdb/ligands/ # Path to the dataset
pdb_dir: ${env.paths.data}/pdb/ # Path to the PDB files
name: "ligands" # Name of the ccPDB dataset
batch_size: 32 # Batch size
pin_memory: True # Pin memory for the dataloader
num_workers: 4 # Number of workers for the dataloader
format: "mmtf" # Format of the structure files
obsolete_strategy: "drop" # What to do with obsolete PDB entries
split_strategy: "random" # (or 'stratified') How to split the dataset
train_fraction: 0.8 # Fraction of the dataset to use for training
val_fraction: 0.1 # Fraction of the dataset to use for validation
test_fraction: 0.1 # Fraction of the dataset to use for testing
transforms: ${transforms}
overwrite: False # Whether to overwrite the dataset if it already exists
num_classes: 7 # Number of classes
CCPDB Metal Binding (ccpdb_metal)#
datamodule:
_target_: proteinworkshop.datasets.cc_pdb.CCPDBDataModule
path: ${env.paths.data}/ccpdb/metal/ # Path to the dataset
pdb_dir: ${env.paths.data}/pdb/ # Path to the PDB files
name: "metal" # Name of the ccPDB dataset
batch_size: 32 # Batch size
pin_memory: True # Pin memory for the dataloader
num_workers: 4 # Number of workers for the dataloader
format: "mmtf" # Format of the structure files
obsolete_strategy: "drop" # What to do with obsolete PDB entries
split_strategy: "random" # (or 'stratified') How to split the dataset
train_fraction: 0.8 # Fraction of the dataset to use for training
val_fraction: 0.1 # Fraction of the dataset to use for validation
test_fraction: 0.1 # Fraction of the dataset to use for testing
transforms: ${transforms}
overwrite: False # Whether to overwrite the dataset if it already exists
num_classes: 7 # Number of classes
CCPDB Nucleic Acid Binding (ccpdb_nucleic)#
datamodule:
_target_: proteinworkshop.datasets.cc_pdb.CCPDBDataModule
path: ${env.paths.data}/ccpdb/nucleic/ # Path to the dataset
pdb_dir: ${env.paths.data}/pdb/ # Path to the PDB files
name: "nucleic" # Name of the ccPDB dataset
batch_size: 32 # Batch size
pin_memory: True # Pin memory for the dataloader
num_workers: 4 # Number of workers for the dataloader
format: "mmtf" # Format of the structure files
obsolete_strategy: "drop" # What to do with obsolete PDB entries
split_strategy: "random" # (or 'stratified') How to split the dataset
train_fraction: 0.8 # Fraction of the dataset to use for training
val_fraction: 0.1 # Fraction of the dataset to use for validation
test_fraction: 0.1 # Fraction of the dataset to use for testing
transforms: ${transforms}
overwrite: False # Whether to overwrite the dataset if it already exists
num_classes: 2 # Number of classes
CCPDB Nucleotide Binding (ccpdb_nucleotides)#
datamodule:
_target_: proteinworkshop.datasets.cc_pdb.CCPDBDataModule
path: ${env.paths.data}/ccpdb/nucleotides/ # Path to the dataset
pdb_dir: ${env.paths.data}/pdb/ # Path to the PDB files
name: "nucleotides" # Name of the ccPDB dataset
batch_size: 32 # Batch size
pin_memory: True # Pin memory for the dataloader
num_workers: 4 # Number of workers for the dataloader
format: "mmtf" # Format of the structure files
obsolete_strategy: "drop" # What to do with obsolete PDB entries
split_strategy: "random" # (or 'stratified') How to split the dataset
train_fraction: 0.8 # Fraction of the dataset to use for training
val_fraction: 0.1 # Fraction of the dataset to use for validation
test_fraction: 0.1 # Fraction of the dataset to use for testing
transforms: ${transforms}
overwrite: False # Whether to overwrite the dataset if it already exists
num_classes: 8 # Number of classes
Post Translational Modifications (ptm)#
datamodule:
_target_: "proteinworkshop.datasets.ptm.PTMDataModule"
dataset_name: "ptm_13" # Options currently include (`ptm_13`, `optm`)
path: ${env.paths.data}/PostTranslationalModification/ # Directory where the dataset is stored
batch_size: 32 # Batch size for dataloader
in_memory: False # Load the dataset in memory
pin_memory: True # Pin memory for dataloader
num_workers: 16 # Number of workers for dataloader
transforms: ${transforms} # Transforms to apply to dataset examples
overwrite: False # Whether to overwrite existing dataset files
num_classes: 13 # Number of classes
PPI Site Prediction (masif_site)#
We use the dataset of experimental structures curated from the PDB by Gainza et al. and retain the original splits, though we modify the labelling scheme to be based on inter-atomic proximity (3.5 Å), which can be user-defined, rather than solvent exclusion.
The dataset is composed of PPI pairs selected from the PRISM list of nonredundant proteins, the ZDock benchmark, PDBBind and SabDab. Sequence-based splits are performed using CD-HIT and structural splits are performed using TM-align.
datamodule:
_target_: proteinworkshop.datasets.masif_site.MaSIFPPISP
path: ${env.paths.data}/masif_site/ # Directory where the dataset is stored
pdb_dir: ${env.paths.data}/pdb/ # Directory where raw PDB/mmtf files are stored
format: "mmtf" # Format of the raw PDB/MMTF files
batch_size: 32 # Batch size for dataloader
pin_memory: True # Pin memory for dataloader
num_workers: 8 # Number of workers for dataloader
dataset_fraction: 1.0 # Fraction of the dataset to use
shuffle_labels: False # Whether to shuffle labels for permutation testing
transforms: ${transforms} # Transforms to apply to dataset examples
overwrite: False # Whether to overwrite existing dataset files
num_classes: 2 # Number of classes
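The proximity-based labelling described for this dataset can be sketched as follows: an atom is labelled as part of the interface if any atom of the partner chain lies within the distance threshold (3.5 Å by default). This is an illustrative brute-force version under stated assumptions, not the code used by `MaSIFPPISP`; `label_interface_atoms`, the inclusive comparison, the atom-level granularity, and the coordinates are all assumptions.

```python
import math


def label_interface_atoms(chain_atoms, partner_atoms, cutoff=3.5):
    """Label each atom 1 if any partner-chain atom lies within `cutoff` Å, else 0."""
    return [
        int(any(math.dist(a, b) <= cutoff for b in partner_atoms))
        for a in chain_atoms
    ]


# Hypothetical coordinates in Å.
chain = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
partner = [(3.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
labels = label_interface_atoms(chain, partner)
print(labels)
```

Raising `cutoff` marks more atoms as interface, which is the user-definable threshold mentioned in the description.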