proteinworkshop.datasets#
Base Classes#
Base classes for protein structure datamodules and datasets.
- class proteinworkshop.datasets.base.ProteinDataModule[source]#
Base class for Protein datamodules.
See also
L.LightningDataModule
- compose_transforms(transforms: Iterable[Callable]) Compose [source]#
Compose an iterable of Transforms into a single transform.
- Parameters:
transforms (Iterable[Callable]) – An iterable of transforms.
- Raises:
ValueError – If transforms is not a list or dict.
- Returns:
A single transform.
- Return type:
T.Compose
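A minimal sketch of what composition amounts to, assuming T here is torch_geometric.transforms (suggested by the T.Compose return type); the transform names are illustrative:

```python
from torch_geometric import transforms as T

# Illustrative transforms; any callables accepted by T.Compose work.
transform_list = [T.Center(), T.NormalizeScale()]

# Equivalent to what compose_transforms produces for a list input:
composed = T.Compose(transform_list)
```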
- abstract download()[source]#
Implement downloading of raw data.
Typically this will be an index file of structure identifiers (for datasets derived from the PDB) but may contain structures too.
- property obsolete_pdbs: Dict[str, str]#
Returns a mapping of obsolete PDB codes to their updated replacement.
- abstract parse_dataset(split: str) DataFrame [source]#
Implement the parsing of the raw dataset to a dataframe.
Override this method to implement custom parsing of raw data.
- Parameters:
split (str) – The split to parse (e.g. train/val/test)
- Returns:
The parsed dataset as a dataframe.
- Return type:
pd.DataFrame
- abstract parse_labels() Any [source]#
Optional method to parse labels from the dataset.
Labels may or may not be present in the dataframe returned by parse_dataset.
- Returns:
The parsed labels in any format. We'd recommend Dict[id, Tensor].
- Return type:
Any
- setup(stage: str | None = None)[source]#
Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.
- Parameters:
stage – either 'fit', 'validate', 'test', or 'predict'
Example:

```python
class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()
        # don't do this (state set here is not shared across processes)
        self.something = ...

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)
```
- abstract test_dataloader() ProteinDataLoader [source]#
Implement the construction of the test dataloader.
- Returns:
The test dataloader.
- Return type:
ProteinDataLoader
- abstract test_dataset() Dataset [source]#
Implement the construction of the test dataset.
- Returns:
The test dataset.
- Return type:
Dataset
- abstract train_dataloader() ProteinDataLoader [source]#
Implement the construction of the training dataloader.
- Returns:
The training dataloader.
- Return type:
ProteinDataLoader
- abstract train_dataset() Dataset [source]#
Implement the construction of the training dataset.
- Returns:
The training dataset.
- Return type:
Dataset
- class proteinworkshop.datasets.base.ProteinDataset(pdb_codes: List[str], root: str | None = None, pdb_dir: str | None = None, processed_dir: str | None = None, pdb_paths: List[str] | None = None, chains: List[str] | None = None, graph_labels: List[Tensor] | None = None, node_labels: List[Tensor] | None = None, transform: List[Callable] | None = None, pre_transform: Callable | None = None, pre_filter: Callable | None = None, log: bool = True, overwrite: bool = False, format: Literal['mmtf', 'pdb', 'ent'] = 'pdb', in_memory: bool = False, store_het: bool = False, out_names: List[str] | None = None)[source]#
Dataset for loading protein structures.
- Parameters:
pdb_codes (List[str]) – List of PDB codes to load. This can also be a list of identifiers specific to your filenames if you have pre-downloaded structures.
root (Optional[str], optional) – Path to the root directory, defaults to None.
pdb_dir (Optional[str], optional) – Path to the directory containing raw PDB files, defaults to None.
processed_dir (Optional[str], optional) – Directory in which to store processed data, defaults to None.
pdb_paths (Optional[List[str]], optional) – If specified, the dataset will load structures from these paths instead of downloading them from the RCSB PDB or using the identifiers in pdb_codes. This is useful if you have already downloaded structures and want to use them. Defaults to None.
chains (Optional[List[str]], optional) – List of chains to load for each PDB code, defaults to None.
graph_labels (Optional[List[torch.Tensor]], optional) – List of tensors to set as graph labels for each example. If not specified, no graph labels will be set. Defaults to None.
node_labels (Optional[List[torch.Tensor]], optional) – List of tensors to set as node labels for each example. If not specified, no node labels will be set. Defaults to None.
transform (Optional[List[Callable]], optional) – List of transforms to apply to each example, defaults to None.
pre_transform (Optional[Callable], optional) – Transform to apply to each example before processing, defaults to None.
pre_filter (Optional[Callable], optional) – Filter to apply to each example before processing, defaults to None.
log (bool, optional) – Whether to log. If True, logs will be printed to stdout. Defaults to True.
overwrite (bool, optional) – Whether to overwrite existing files, defaults to False.
format (Literal["mmtf", "pdb", "ent"], optional) – Format in which to save structures, defaults to "pdb".
in_memory (bool, optional) – Whether to load data into memory, defaults to False.
store_het (bool, optional) – Whether to store heteroatoms in the graph, defaults to False.
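A minimal construction sketch based on the signature above; the PDB codes and root directory are illustrative:

```python
from proteinworkshop.datasets.base import ProteinDataset

# Hypothetical example: a small dataset of two structures fetched
# from the RCSB PDB and processed into .pt files under `root`.
dataset = ProteinDataset(
    pdb_codes=["4hhb", "1ubq"],  # example PDB identifiers
    root="data/example",         # example root directory
    format="pdb",
    in_memory=False,
)
```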
- download()[source]#
Download structure files not present in the raw directory (raw_dir). Structures are downloaded from the RCSB PDB using the Graphein multiprocessed downloader.
Structure files are downloaded in self.format format (mmtf or pdb). Downloading files in mmtf format is strongly recommended as it is both faster and produces smaller files than pdb format.
Downloaded files are stored in self.raw_dir.
- get(idx: int) Data [source]#
Return PyTorch Geometric Data object for a given index.
- Parameters:
idx (int) – Index to retrieve.
- Returns:
PyTorch Geometric Data object.
- process()[source]#
Process raw data into PyTorch Geometric Data objects with Graphein.
Processed data are stored in
self.processed_dir
as.pt
files.
- property processed_file_names: str | List[str] | Tuple#
Returns the processed file names.
This will either be a list of [{pdb_code}.pt] filenames or a list of [{pdb_code}_{chain(s)}.pt] filenames.
- proteinworkshop.datasets.base.pair_data(a: Data, b: Data) Data [source]#
Pairs two graphs together in a single Data instance.
The first graph is accessed via data.a (e.g. data.a.coords) and the second via data.b.
- Parameters:
a (torch_geometric.data.Data) – The first graph.
b (torch_geometric.data.Data) – The second graph.
- Returns:
The paired graph.
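A short usage sketch; the coords attributes follow the access pattern described above, and the tensor shapes are illustrative:

```python
import torch
from torch_geometric.data import Data
from proteinworkshop.datasets.base import pair_data

# Two toy graphs with coordinate attributes.
a = Data(coords=torch.randn(10, 3))
b = Data(coords=torch.randn(12, 3))

pair = pair_data(a, b)
print(pair.a.coords.shape)  # first graph, via data.a
print(pair.b.coords.shape)  # second graph, via data.b
```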
Pre-Training Datasets#
- class proteinworkshop.datasets.cath.CATHDataModule(path: str, batch_size: int, format: Literal['mmtf', 'pdb'] = 'mmtf', pdb_dir: str | None = None, pin_memory: bool = True, in_memory: bool = False, num_workers: int = 16, dataset_fraction: float = 1.0, transforms: Iterable[Callable] | None = None, overwrite: bool = False)[source]#
Data module for CATH dataset.
- Parameters:
path (str) – Path to store data.
batch_size (int) – Batch size for dataloaders.
format (Literal["mmtf", "pdb"]) – Format to load PDB files in.
pdb_dir (str) – Path to directory containing PDB files.
pin_memory (bool) – Whether to pin memory for dataloaders.
in_memory (bool) – Whether to load the entire dataset into memory.
num_workers (int) – Number of workers for dataloaders.
dataset_fraction (float) – Fraction of dataset to use.
transforms (Optional[List[Callable]]) – List of transforms to apply to dataset.
overwrite (bool) – Whether to overwrite existing data. Defaults to False.
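A minimal usage sketch following the standard Lightning datamodule convention; the storage path and hyperparameters are illustrative:

```python
from proteinworkshop.datasets.cath import CATHDataModule

datamodule = CATHDataModule(
    path="data/cath",  # example storage path
    batch_size=32,
    format="mmtf",
    num_workers=4,
)
datamodule.setup()
train_loader = datamodule.train_dataloader()
```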
- parse_dataset() Dict[str, List[str]] [source]#
Parses the dataset index file.
Returns a dictionary with keys "train", "validation", and "test", whose values are lists of PDB IDs.
- test_dataloader() ProteinDataLoader [source]#
Returns the test dataloader.
- Returns:
Test dataloader
- Return type:
ProteinDataLoader
- test_dataset() ProteinDataset [source]#
Returns the test dataset.
- Returns:
Test dataset
- Return type:
ProteinDataset
- train_dataloader() ProteinDataLoader [source]#
Returns the training dataloader.
- Returns:
Training dataloader
- Return type:
ProteinDataLoader
- train_dataset() ProteinDataset [source]#
Returns the training dataset.
- Returns:
Training dataset
- Return type:
ProteinDataset
- val_dataloader() ProteinDataLoader [source]#
Implement the construction of the validation dataloader.
- Returns:
The validation dataloader.
- Return type:
ProteinDataLoader
- val_dataset() ProteinDataset [source]#
Returns the validation dataset.
- Returns:
Validation dataset
- Return type:
ProteinDataset
- class proteinworkshop.datasets.astral.AstralDataModule(path: str, batch_size: int, pin_memory: bool, num_workers: int, release: str = '1.75', identity: Literal['40', '95'] = '95', dataset_fraction: float = 1.0, transforms: Iterable[Callable] | None = None, in_memory: bool = False, train_val_test: List[float] = [0.8, 0.1, 0.1], overwrite: bool = False)[source]#
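A construction sketch based on the signature above; the storage path and settings are illustrative:

```python
from proteinworkshop.datasets.astral import AstralDataModule

datamodule = AstralDataModule(
    path="data/astral",  # example storage path
    batch_size=32,
    pin_memory=True,
    num_workers=4,
    identity="95",       # 95% sequence-identity clustering
)
datamodule.setup()
```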
- parse_dataset(split: Literal['train', 'val', 'test']) List[str] [source]#
Parses the ASTRAL dataset index and returns the list of IDs for the requested split.
- Parameters:
split (Literal["train", "val", "test"]) – Split to parse.
- Returns:
List of IDs for split.
- Return type:
List[str]
- setup(stage: str | None = None)[source]#
Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.
- Parameters:
stage – either 'fit', 'validate', 'test', or 'predict'
Example:

```python
class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()
        # don't do this (state set here is not shared across processes)
        self.something = ...

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)
```
- test_dataloader() ProteinDataLoader [source]#
Returns the test dataloader.
- Returns:
Test dataloader.
- Return type:
ProteinDataLoader
- test_dataset() ProteinDataset [source]#
Implement the construction of the test dataset.
- Returns:
The test dataset.
- Return type:
Dataset
- train_dataloader() ProteinDataLoader [source]#
Returns the training dataloader.
- Returns:
Training dataloader.
- Return type:
ProteinDataLoader
- train_dataset() ProteinDataset [source]#
Returns the training dataset.
- Returns:
Training dataset.
- Return type:
ProteinDataset
- val_dataloader() ProteinDataLoader [source]#
Returns the validation dataloader.
- Returns:
Validation dataloader.
- Return type:
ProteinDataLoader
- val_dataset() ProteinDataset [source]#
Returns the validation dataset.
- Returns:
Validation dataset.
- Return type:
ProteinDataset
Node-level Datasets#
Graph-level Datasets#
- class proteinworkshop.datasets.go.GOLabeller(label_df: DataFrame)[source]#
This labeller applies the graph labels to each example as a transform.
This is required because the same chain can be used across tasks (e.g. CC, BP, or MF) with different labels.
- class proteinworkshop.datasets.go.GeneOntologyDataset(path: str, batch_size: int, split: str = 'BP', obsolete='drop', pdb_dir: str | None = None, format: Literal['mmtf', 'pdb'] = 'mmtf', in_memory: bool = False, dataset_fraction: float = 1.0, shuffle_labels: bool = False, pin_memory: bool = True, num_workers: int = 16, transforms: Iterable[Callable] | None = None, overwrite: bool = False)[source]#
- Statistics (test_cutoff=0.95):
#Train: 27,496
#Valid: 3,053
#Test: 2,991
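A construction sketch for the BP task, based on the signature above; the storage path is illustrative:

```python
from proteinworkshop.datasets.go import GeneOntologyDataset

datamodule = GeneOntologyDataset(
    path="data/go",  # example storage path
    batch_size=32,
    split="BP",      # biological process task
)
datamodule.setup()
```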
- download()[source]#
Implement downloading of raw data.
Typically this will be an index file of structure identifiers (for datasets derived from the PDB) but may contain structures too.
- parse_dataset(split: Literal['training', 'validation', 'testing']) DataFrame [source]#
Parses the raw dataset files to Pandas DataFrames. Maps classes to numerical values.
- test_dataloader() ProteinDataLoader [source]#
Implement the construction of the test dataloader.
- Returns:
The test dataloader.
- Return type:
ProteinDataLoader
- test_dataset() ProteinDataset [source]#
Implement the construction of the test dataset.
- Returns:
The test dataset.
- Return type:
Dataset
- train_dataloader() ProteinDataLoader [source]#
Implement the construction of the training dataloader.
- Returns:
The training dataloader.
- Return type:
ProteinDataLoader
- train_dataset() ProteinDataset [source]#
Implement the construction of the training dataset.
- Returns:
The training dataset.
- Return type:
Dataset
- val_dataloader() ProteinDataLoader [source]#
Implement the construction of the validation dataloader.
- Returns:
The validation dataloader.
- Return type:
ProteinDataLoader
- val_dataset() ProteinDataset [source]#
Implement the construction of the validation dataset.
- Returns:
The validation dataset.
- Return type:
Dataset
- class proteinworkshop.datasets.fold_classification.FoldClassificationDataModule(path: str, split: str, batch_size: int, pin_memory: bool, num_workers: int, dataset_fraction: float = 1.0, shuffle_labels: bool = False, transforms: Iterable[Callable] | None = None, in_memory: bool = False, overwrite: bool = False)[source]#
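A construction sketch based on the signature above; the storage path is illustrative and the split name is an assumption based on the standard fold/superfamily/family benchmark splits:

```python
from proteinworkshop.datasets.fold_classification import FoldClassificationDataModule

datamodule = FoldClassificationDataModule(
    path="data/fold",  # example storage path
    split="fold",      # assumed split name
    batch_size=32,
    pin_memory=True,
    num_workers=4,
)
datamodule.setup()
```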
- download()[source]#
Implement downloading of raw data.
Typically this will be an index file of structure identifiers (for datasets derived from the PDB) but may contain structures too.
- parse_dataset(split: str) DataFrame [source]#
Parses the raw dataset files to Pandas DataFrames. Maps classes to numerical values.
- parse_labels()[source]#
Optional method to parse labels from the dataset.
Labels may or may not be present in the dataframe returned by parse_dataset.
- Returns:
The parsed labels in any format. We'd recommend Dict[id, Tensor].
- Return type:
Any
- setup(stage: str | None = None)[source]#
Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.
- Parameters:
stage – either 'fit', 'validate', 'test', or 'predict'
Example:

```python
class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()
        # don't do this (state set here is not shared across processes)
        self.something = ...

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)
```
- test_dataloader() ProteinDataLoader [source]#
Implement the construction of the test dataloader.
- Returns:
The test dataloader.
- Return type:
ProteinDataLoader
- test_dataset() ProteinDataset [source]#
Implement the construction of the test dataset.
- Returns:
The test dataset.
- Return type:
Dataset
- train_dataloader() ProteinDataLoader [source]#
Implement the construction of the training dataloader.
- Returns:
The training dataloader.
- Return type:
ProteinDataLoader
- train_dataset() ProteinDataset [source]#
Implement the construction of the training dataset.
- Returns:
The training dataset.
- Return type:
Dataset
- val_dataloader() ProteinDataLoader [source]#
Implement the construction of the validation dataloader.
- Returns:
The validation dataloader.
- Return type:
ProteinDataLoader
- val_dataset() ProteinDataset [source]#
Implement the construction of the validation dataset.
- Returns:
The validation dataset.
- Return type:
Dataset
FLIP#
- class proteinworkshop.datasets.flip_datamodule.FLIPDatamodule(root: str, dataset_name: str, split: str)[source]#
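A construction sketch based on the signature above; the root directory is illustrative, and the dataset and split names are assumptions drawn from the FLIP benchmark (they are not confirmed by this page):

```python
from proteinworkshop.datasets.flip_datamodule import FLIPDatamodule

datamodule = FLIPDatamodule(
    root="data/flip",     # example root directory
    dataset_name="aav",   # assumed FLIP dataset name
    split="one_vs_many",  # assumed FLIP split name
)
loader = datamodule.train_dataloader()
```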
- download(overwrite: bool = False)[source]#
Implement downloading of raw data.
Typically this will be an index file of structure identifiers (for datasets derived from the PDB) but may contain structures too.
- parse_dataset(split: str) DataFrame [source]#
Implement the parsing of the raw dataset to a dataframe.
Override this method to implement custom parsing of raw data.
- Parameters:
split (str) – The split to parse (e.g. train/val/test)
- Returns:
The parsed dataset as a dataframe.
- Return type:
pd.DataFrame
- parse_labels(split: str)[source]#
Optional method to parse labels from the dataset.
Labels may or may not be present in the dataframe returned by parse_dataset.
- Returns:
The parsed labels in any format. We'd recommend Dict[id, Tensor].
- Return type:
Any
- test_dataloader() DataLoader [source]#
Implement the construction of the test dataloader.
- Returns:
The test dataloader.
- Return type:
ProteinDataLoader
- test_dataset()[source]#
Implement the construction of the test dataset.
- Returns:
The test dataset.
- Return type:
Dataset
- train_dataloader() DataLoader [source]#
Implement the construction of the training dataloader.
- Returns:
The training dataloader.
- Return type:
ProteinDataLoader
- train_dataset()[source]#
Implement the construction of the training dataset.
- Returns:
The training dataset.
- Return type:
Dataset
- val_dataloader() DataLoader [source]#
Implement the construction of the validation dataloader.
- Returns:
The validation dataloader.
- Return type:
ProteinDataLoader
Utils#
- proteinworkshop.datasets.utils.create_example_batch(n: int = 4) ProteinBatch [source]#
Returns a batch of random proteins.
- Parameters:
n (int, optional) – Number of proteins to include in batch.
- Returns:
Batch of random proteins.
- Return type:
ProteinBatch
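This is handy for smoke-testing a model's forward pass; a one-line usage example:

```python
from proteinworkshop.datasets.utils import create_example_batch

# A quick synthetic input: a batch of 4 random proteins.
batch = create_example_batch(n=4)
print(batch)  # ProteinBatch containing 4 random proteins
```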
- proteinworkshop.datasets.utils.download_pdb_mmtf(mmtf_dir: Path, ids: List[str] | None = None, create_tar: bool = False)[source]#
Download PDB files in MMTF format from the RCSB PDB and optionally create an archive. MMTF files are downloaded into a new directory under mmtf_dir and the .tar archive is created there. If ids is not provided, all PDB IDs are obtained using a query that includes all entries.
- Parameters:
mmtf_dir (pathlib.Path) – Path to directory to store MMTF files.
ids (Optional[List[str]]) – List of PDB IDs to download.
create_tar (bool) – Whether to create a .tar archive from the downloaded files.
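A short usage sketch; the directory and PDB IDs are illustrative:

```python
from pathlib import Path
from proteinworkshop.datasets.utils import download_pdb_mmtf

# Download two example structures and bundle them into a .tar archive.
download_pdb_mmtf(Path("data/mmtf"), ids=["4hhb", "1ubq"], create_tar=True)
```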
- proteinworkshop.datasets.utils.flatten_dir(dir: PathLike)[source]#
Flattens a nested directory structure into a single level.
- Parameters:
dir (os.PathLike) – Path to directory
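A one-line usage sketch; the path is illustrative:

```python
from pathlib import Path
from proteinworkshop.datasets.utils import flatten_dir

# Move files from nested subdirectories up into the given directory.
flatten_dir(Path("data/raw"))  # example path
```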