proteinworkshop.datasets#
Base Classes#
Base classes for protein structure datamodules and datasets.
- class proteinworkshop.datasets.base.ProteinDataModule[source]#
Base class for Protein datamodules.
See also
L.LightningDataModule
- compose_transforms(transforms: Iterable[Callable]) Compose [source]#
Compose an iterable of Transforms into a single transform.
- Parameters:
transforms (Iterable[Callable]) – An iterable of transforms.
- Raises:
ValueError – If transforms is not a list or dict.
- Returns:
A single transform.
- Return type:
T.Compose
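A minimal sketch of what composition amounts to, assuming T here is torch_geometric.transforms (suggested by the T.Compose return type); the transform names are illustrative:

```python
from torch_geometric import transforms as T

# Illustrative transforms; any callables accepted by T.Compose work.
transform_list = [T.Center(), T.NormalizeScale()]

# Equivalent to what compose_transforms produces for a list input:
composed = T.Compose(transform_list)
```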
- abstract download()[source]#
Implement downloading of raw data.
Typically this will be an index file of structure identifiers (for datasets derived from the PDB) but may contain structures too.
- property obsolete_pdbs: Dict[str, str]#
Returns a mapping of obsolete PDB codes to their updated replacement.
- abstract parse_dataset(split: str) DataFrame [source]#
Implement the parsing of the raw dataset to a dataframe.
Override this method to implement custom parsing of raw data.
- Parameters:
split (str) – The split to parse (e.g. train/val/test)
- Returns:
The parsed dataset as a dataframe.
- Return type:
pd.DataFrame
- abstract parse_labels() Any [source]#
Optional method to parse labels from the dataset.
Labels may or may not be present in the dataframe returned by parse_dataset.
- Returns:
The parsed labels in any format. We'd recommend Dict[id, Tensor].
- Return type:
Any
- setup(stage: str | None = None)[source]#
Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.
- Parameters:
stage – either 'fit', 'validate', 'test', or 'predict'
Example:

```python
class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()
        # don't do this (state set here is not shared across processes)
        self.something = ...

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)
```
- abstract test_dataloader() ProteinDataLoader [source]#
Implement the construction of the test dataloader.
- Returns:
The test dataloader.
- Return type:
ProteinDataLoader
- abstract test_dataset() Dataset [source]#
Implement the construction of the test dataset.
- Returns:
The test dataset.
- Return type:
Dataset
- abstract train_dataloader() ProteinDataLoader [source]#
Implement the construction of the training dataloader.
- Returns:
The training dataloader.
- Return type:
ProteinDataLoader
- abstract train_dataset() Dataset [source]#
Implement the construction of the training dataset.
- Returns:
The training dataset.
- Return type:
Dataset
- class proteinworkshop.datasets.base.ProteinDataset(pdb_codes: List[str], root: str | None = None, pdb_dir: str | None = None, processed_dir: str | None = None, pdb_paths: List[str] | None = None, chains: List[str] | None = None, graph_labels: List[Tensor] | None = None, node_labels: List[Tensor] | None = None, transform: List[Callable] | None = None, pre_transform: Callable | None = None, pre_filter: Callable | None = None, log: bool = True, overwrite: bool = False, format: Literal['mmtf', 'pdb', 'ent'] = 'pdb', in_memory: bool = False, store_het: bool = False, out_names: List[str] | None = None)[source]#
Dataset for loading protein structures.
- Parameters:
pdb_codes (List[str]) – List of PDB codes to load. This can also be a list of identifiers specific to your filenames if you have pre-downloaded structures.
root (Optional[str], optional) – Path to the root directory, defaults to None.
pdb_dir (Optional[str], optional) – Path to the directory containing raw PDB files, defaults to None.
processed_dir (Optional[str], optional) – Directory in which to store processed data, defaults to None.
pdb_paths (Optional[List[str]], optional) – If specified, the dataset will load structures from these paths instead of downloading them from the RCSB PDB or using the identifiers in pdb_codes. This is useful if you have already downloaded structures and want to use them. Defaults to None.
chains (Optional[List[str]], optional) – List of chains to load for each PDB code, defaults to None.
graph_labels (Optional[List[torch.Tensor]], optional) – List of tensors to set as graph labels for each example. If not specified, no graph labels will be set. Defaults to None.
node_labels (Optional[List[torch.Tensor]], optional) – List of tensors to set as node labels for each example. If not specified, no node labels will be set. Defaults to None.
transform (Optional[List[Callable]], optional) – List of transforms to apply to each example, defaults to None.
pre_transform (Optional[Callable], optional) – Transform to apply to each example before processing, defaults to None.
pre_filter (Optional[Callable], optional) – Filter to apply to each example before processing, defaults to None.
log (bool, optional) – Whether to log. If True, logs will be printed to stdout. Defaults to True.
overwrite (bool, optional) – Whether to overwrite existing files, defaults to False.
format (Literal["mmtf", "pdb", "ent"], optional) – Format in which to save structures, defaults to "pdb".
in_memory (bool, optional) – Whether to load data into memory, defaults to False.
store_het (bool, optional) – Whether to store heteroatoms in the graph, defaults to False.
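A minimal construction sketch based on the signature above; the PDB codes and root directory are illustrative:

```python
from proteinworkshop.datasets.base import ProteinDataset

# Hypothetical example: a small dataset of two structures fetched
# from the RCSB PDB and processed into .pt files under `root`.
dataset = ProteinDataset(
    pdb_codes=["4hhb", "1ubq"],  # example PDB identifiers
    root="data/example",         # example root directory
    format="pdb",
    in_memory=False,
)
```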
- download()[source]#
Download structure files not present in the raw directory (raw_dir). Structures are downloaded from the RCSB PDB using the Graphein multiprocessed downloader.
Structure files are downloaded in self.format format (mmtf or pdb). Downloading files in mmtf format is strongly recommended as it is both faster and produces smaller files than pdb format.
Downloaded files are stored in self.raw_dir.
- get(idx: int) Data [source]#
Return PyTorch Geometric Data object for a given index.
- Parameters:
idx (int) – Index to retrieve.
- Returns:
PyTorch Geometric Data object.
- process()[source]#
Process raw data into PyTorch Geometric Data objects with Graphein.
Processed data are stored in
self.processed_dir
as.pt
files.
- property processed_file_names: str | List[str] | Tuple#
Returns the processed file names.
This will either be a list of [{pdb_code}.pt] filenames or a list of [{pdb_code}_{chain(s)}.pt] filenames.
- proteinworkshop.datasets.base.pair_data(a: Data, b: Data) Data [source]#
Pairs two graphs together in a single Data instance.
The first graph is accessed via data.a (e.g. data.a.coords) and the second via data.b.
- Parameters:
a (torch_geometric.data.Data) – The first graph.
b (torch_geometric.data.Data) – The second graph.
- Returns:
The paired graph.
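A short usage sketch; the coords attributes follow the access pattern described above, and the tensor shapes are illustrative:

```python
import torch
from torch_geometric.data import Data
from proteinworkshop.datasets.base import pair_data

# Two toy graphs with coordinate attributes.
a = Data(coords=torch.randn(10, 3))
b = Data(coords=torch.randn(12, 3))

pair = pair_data(a, b)
print(pair.a.coords.shape)  # first graph, via data.a
print(pair.b.coords.shape)  # second graph, via data.b
```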
Pre-Training Datasets#
- class proteinworkshop.datasets.cath.CATHDataModule(path: str, batch_size: int, format: Literal['mmtf', 'pdb'] = 'mmtf', pdb_dir: str | None = None, pin_memory: bool = True, in_memory: bool = False, num_workers: int = 16, dataset_fraction: float = 1.0, transforms: Iterable[Callable] | None = None, overwrite: bool = False)[source]#
Data module for CATH dataset.
- Parameters:
path (str) – Path to store data.
batch_size (int) – Batch size for dataloaders.
format (Literal["mmtf", "pdb"]) – Format to load PDB files in.
pdb_dir (str) – Path to directory containing PDB files.
pin_memory (bool) – Whether to pin memory for dataloaders.
in_memory (bool) – Whether to load the entire dataset into memory.
num_workers (int) – Number of workers for dataloaders.
dataset_fraction (float) – Fraction of dataset to use.
transforms (Optional[List[Callable]]) – List of transforms to apply to dataset.
overwrite (bool) – Whether to overwrite existing data. Defaults to False.
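A minimal usage sketch following the standard Lightning datamodule convention; the storage path and hyperparameters are illustrative:

```python
from proteinworkshop.datasets.cath import CATHDataModule

datamodule = CATHDataModule(
    path="data/cath",  # example storage path
    batch_size=32,
    format="mmtf",
    num_workers=4,
)
datamodule.setup()
train_loader = datamodule.train_dataloader()
```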
- parse_dataset() Dict[str, List[str]] [source]#
Parses the dataset index file.
Returns a dictionary with keys "train", "validation", and "test", whose values are lists of PDB IDs.
- test_dataloader() ProteinDataLoader [source]#
Returns the test dataloader.
- Returns:
Test dataloader
- Return type:
ProteinDataLoader
- test_dataset() ProteinDataset [source]#
Returns the test dataset.
- Returns:
Test dataset
- Return type:
ProteinDataset
- train_dataloader() ProteinDataLoader [source]#
Returns the training dataloader.
- Returns:
Training dataloader
- Return type:
ProteinDataLoader
- train_dataset() ProteinDataset [source]#
Returns the training dataset.
- Returns:
Training dataset
- Return type:
ProteinDataset
- val_dataloader() ProteinDataLoader [source]#
Implement the construction of the validation dataloader.
- Returns:
The validation dataloader.
- Return type:
ProteinDataLoader
- val_dataset() ProteinDataset [source]#
Returns the validation dataset.
- Returns:
Validation dataset
- Return type:
ProteinDataset
- class proteinworkshop.datasets.astral.AstralDataModule(path: str, batch_size: int, pin_memory: bool, num_workers: int, release: str = '1.75', identity: Literal['40', '95'] = '95', dataset_fraction: float = 1.0, transforms: Iterable[Callable] | None = None, in_memory: bool = False, train_val_test: List[float] = [0.8, 0.1, 0.1], overwrite: bool = False)[source]#
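A construction sketch based on the signature above; the storage path and settings are illustrative:

```python
from proteinworkshop.datasets.astral import AstralDataModule

datamodule = AstralDataModule(
    path="data/astral",  # example storage path
    batch_size=32,
    pin_memory=True,
    num_workers=4,
    identity="95",       # 95% sequence-identity clustering
)
datamodule.setup()
```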
- parse_dataset(split: Literal['train', 'val', 'test']) List[str] [source]#
Parses the ASTRAL dataset index and returns the list of IDs for the requested split.
- Parameters:
split (Literal["train", "val", "test"]) – Split to parse.
- Returns:
List of IDs for split.
- Return type:
List[str]
- setup(stage: str | None = None)[source]#
Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.
- Parameters:
stage – either 'fit', 'validate', 'test', or 'predict'
Example:

```python
class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()
        # don't do this (state set here is not shared across processes)
        self.something = ...

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)
```
- test_dataloader() ProteinDataLoader [source]#
Returns the test dataloader.
- Returns:
Test dataloader.
- Return type:
ProteinDataLoader
- test_dataset() ProteinDataset [source]#
Implement the construction of the test dataset.
- Returns:
The test dataset.
- Return type:
Dataset
- train_dataloader() ProteinDataLoader [source]#
Returns the training dataloader.
- Returns:
Training dataloader.
- Return type:
ProteinDataLoader
- train_dataset() ProteinDataset [source]#
Returns the training dataset.
- Returns:
Training dataset.
- Return type:
ProteinDataset
- val_dataloader() ProteinDataLoader [source]#
Returns the validation dataloader.
- Returns:
Validation dataloader.
- Return type:
ProteinDataLoader
- val_dataset() ProteinDataset [source]#
Returns the validation dataset.
- Returns:
Validation dataset.
- Return type:
ProteinDataset
Node-level Datasets#
Graph-level Datasets#
- class proteinworkshop.datasets.go.GOLabeller(label_df: DataFrame)[source]#
This labeller applies the graph labels to each example as a transform.
This is required because the same chain can be used across tasks (e.g. CC, BP, or MF) with different labels.
- class proteinworkshop.datasets.go.GeneOntologyDataset(path: str, batch_size: int, split: str = 'BP', obsolete='drop', pdb_dir: str | None = None, format: Literal['mmtf', 'pdb'] = 'mmtf', in_memory: bool = False, dataset_fraction: float = 1.0, shuffle_labels: bool = False, pin_memory: bool = True, num_workers: int = 16, transforms: Iterable[Callable] | None = None, overwrite: bool = False)[source]#
- Statistics (test_cutoff=0.95):
#Train: 27,496
#Valid: 3,053
#Test: 2,991
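A construction sketch for the BP task, based on the signature above; the storage path is illustrative:

```python
from proteinworkshop.datasets.go import GeneOntologyDataset

datamodule = GeneOntologyDataset(
    path="data/go",  # example storage path
    batch_size=32,
    split="BP",      # biological process task
)
datamodule.setup()
```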
- download()[source]#
Implement downloading of raw data.
Typically this will be an index file of structure identifiers (for datasets derived from the PDB) but may contain structures too.
- parse_dataset(split: Literal['training', 'validation', 'testing']) DataFrame [source]#
Parses the raw dataset files to Pandas DataFrames. Maps classes to numerical values.
- test_dataloader() ProteinDataLoader [source]#
Implement the construction of the test dataloader.
- Returns:
The test dataloader.
- Return type:
ProteinDataLoader
- test_dataset() ProteinDataset [source]#
Implement the construction of the test dataset.
- Returns:
The test dataset.
- Return type:
Dataset
- train_dataloader() ProteinDataLoader [source]#
Implement the construction of the training dataloader.
- Returns:
The training dataloader.
- Return type:
ProteinDataLoader
- train_dataset() ProteinDataset [source]#
Implement the construction of the training dataset.
- Returns:
The training dataset.
- Return type:
Dataset
- val_dataloader() ProteinDataLoader [source]#
Implement the construction of the validation dataloader.
- Returns:
The validation dataloader.
- Return type:
ProteinDataLoader
- val_dataset() ProteinDataset [source]#
Implement the construction of the validation dataset.
- Returns:
The validation dataset.
- Return type:
Dataset
- class proteinworkshop.datasets.fold_classification.FoldClassificationDataModule(path: str, split: str, batch_size: int, pin_memory: bool, num_workers: int, dataset_fraction: float = 1.0, shuffle_labels: bool = False, transforms: Iterable[Callable] | None = None, in_memory: bool = False, overwrite: bool = False)[source]#
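A construction sketch based on the signature above; the storage path is illustrative and the split name is an assumption based on the standard fold/superfamily/family benchmark splits:

```python
from proteinworkshop.datasets.fold_classification import FoldClassificationDataModule

datamodule = FoldClassificationDataModule(
    path="data/fold",  # example storage path
    split="fold",      # assumed split name
    batch_size=32,
    pin_memory=True,
    num_workers=4,
)
datamodule.setup()
```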
- download()[source]#
Implement downloading of raw data.
Typically this will be an index file of structure identifiers (for datasets derived from the PDB) but may contain structures too.
- parse_dataset(split: str) DataFrame [source]#
Parses the raw dataset files to Pandas DataFrames. Maps classes to numerical values.
- parse_labels()[source]#
Optional method to parse labels from the dataset.
Labels may or may not be present in the dataframe returned by parse_dataset.
- Returns:
The parsed labels in any format. We'd recommend Dict[id, Tensor].
- Return type:
Any
- setup(stage: str | None = None)[source]#
Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.
- Parameters:
stage – either 'fit', 'validate', 'test', or 'predict'
Example:

```python
class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()
        # don't do this (state set here is not shared across processes)
        self.something = ...

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)
```
- test_dataloader() ProteinDataLoader [source]#
Implement the construction of the test dataloader.
- Returns:
The test dataloader.
- Return type:
ProteinDataLoader
- test_dataset() ProteinDataset [source]#
Implement the construction of the test dataset.
- Returns:
The test dataset.
- Return type:
Dataset
- train_dataloader() ProteinDataLoader [source]#
Implement the construction of the training dataloader.
- Returns:
The training dataloader.
- Return type:
ProteinDataLoader
- train_dataset() ProteinDataset [source]#
Implement the construction of the training dataset.
- Returns:
The training dataset.
- Return type:
Dataset
- val_dataloader() ProteinDataLoader [source]#
Implement the construction of the validation dataloader.
- Returns:
The validation dataloader.
- Return type:
ProteinDataLoader
- val_dataset() ProteinDataset [source]#
Implement the construction of the validation dataset.
- Returns:
The validation dataset.
- Return type:
Dataset
FLIP#
- class proteinworkshop.datasets.flip_datamodule.FLIPDatamodule(root: str, dataset_name: str, split: str)[source]#
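A construction sketch based on the signature above; the root directory is illustrative, and the dataset and split names are assumptions drawn from the FLIP benchmark (they are not confirmed by this page):

```python
from proteinworkshop.datasets.flip_datamodule import FLIPDatamodule

datamodule = FLIPDatamodule(
    root="data/flip",     # example root directory
    dataset_name="aav",   # assumed FLIP dataset name
    split="one_vs_many",  # assumed FLIP split name
)
loader = datamodule.train_dataloader()
```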
- download(overwrite: bool = False)[source]#
Implement downloading of raw data.
Typically this will be an index file of structure identifiers (for datasets derived from the PDB) but may contain structures too.
- parse_dataset(split: str) DataFrame [source]#
Implement the parsing of the raw dataset to a dataframe.
Override this method to implement custom parsing of raw data.
- Parameters:
split (str) – The split to parse (e.g. train/val/test)
- Returns:
The parsed dataset as a dataframe.
- Return type:
pd.DataFrame
- parse_labels(split: str)[source]#
Optional method to parse labels from the dataset.
Labels may or may not be present in the dataframe returned by parse_dataset.
- Returns:
The parsed labels in any format. We'd recommend Dict[id, Tensor].
- Return type:
Any
- test_dataloader() DataLoader [source]#
Implement the construction of the test dataloader.
- Returns:
The test dataloader.
- Return type:
ProteinDataLoader
- test_dataset()[source]#
Implement the construction of the test dataset.
- Returns:
The test dataset.
- Return type:
Dataset
- train_dataloader() DataLoader [source]#
Implement the construction of the training dataloader.
- Returns:
The training dataloader.
- Return type:
ProteinDataLoader
- train_dataset()[source]#
Implement the construction of the training dataset.
- Returns:
The training dataset.
- Return type:
Dataset
- val_dataloader() DataLoader [source]#
Implement the construction of the validation dataloader.
- Returns:
The validation dataloader.
- Return type:
ProteinDataLoader
Utils#
- proteinworkshop.datasets.utils.create_example_batch(n: int = 4) ProteinBatch [source]#
Returns a batch of random proteins.
- Parameters:
n (int, optional) – Number of proteins to include in batch.
- Returns:
Batch of random proteins.
- Return type:
ProteinBatch
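This is handy for smoke-testing a model's forward pass; a one-line usage example:

```python
from proteinworkshop.datasets.utils import create_example_batch

# A quick synthetic input: a batch of 4 random proteins.
batch = create_example_batch(n=4)
print(batch)  # ProteinBatch containing 4 random proteins
```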
- proteinworkshop.datasets.utils.download_pdb_mmtf(mmtf_dir: Path, ids: List[str] | None = None, create_tar: bool = False)[source]#
Download PDB files in MMTF format from the RCSB PDB and optionally create an archive. MMTF files are downloaded into a new directory under mmtf_dir and the .tar archive is created there. If ids is not provided, all PDB IDs are obtained using a query that includes all entries.
- Parameters:
mmtf_dir (pathlib.Path) – Path to directory to store MMTF files.
ids (Optional[List[str]]) – List of PDB IDs to download.
create_tar (bool) – Whether to create a .tar archive from the downloaded files.
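A short usage sketch; the directory and PDB IDs are illustrative:

```python
from pathlib import Path
from proteinworkshop.datasets.utils import download_pdb_mmtf

# Download two example structures and bundle them into a .tar archive.
download_pdb_mmtf(Path("data/mmtf"), ids=["4hhb", "1ubq"], create_tar=True)
```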
- proteinworkshop.datasets.utils.flatten_dir(dir: PathLike)[source]#
Flattens a nested directory structure into a single level.
- Parameters:
dir (os.PathLike) – Path to directory
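A one-line usage sketch; the path is illustrative:

```python
from pathlib import Path
from proteinworkshop.datasets.utils import flatten_dir

# Move files from nested subdirectories up into the given directory.
flatten_dir(Path("data/raw"))  # example path
```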