protein_workshop.tasks
Denoising Transforms
Implements sequence denoising task.
- class proteinworkshop.tasks.sequence_denoising.SequenceNoiseTransform(corruption_rate: float, corruption_strategy: Literal['mutate', 'mask'])
Bases:
BaseTransform
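The signature above names two corruption strategies but does not describe them. The following is a minimal sketch of what they might look like on a plain string, assuming that corruption_rate is the fraction of residues to corrupt, that 'mask' replaces a residue with a placeholder token, and that 'mutate' replaces it with a randomly drawn amino acid. The function corrupt_sequence and the MASK_TOKEN symbol are hypothetical and are not part of the library.

```python
import random
from typing import List, Literal, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK_TOKEN = "X"  # hypothetical mask symbol, not taken from the library


def corrupt_sequence(
    sequence: str,
    corruption_rate: float,
    corruption_strategy: Literal["mutate", "mask"],
    rng: random.Random,
) -> Tuple[str, List[int]]:
    """Corrupt a fraction of residues; return the new sequence and the chosen positions."""
    # Assumption: corruption_rate is the fraction of positions to corrupt.
    num_corrupt = int(len(sequence) * corruption_rate)
    positions = sorted(rng.sample(range(len(sequence)), num_corrupt))
    residues = list(sequence)
    for pos in positions:
        if corruption_strategy == "mask":
            residues[pos] = MASK_TOKEN
        elif corruption_strategy == "mutate":
            # A random draw may return the original residue by chance.
            residues[pos] = rng.choice(AMINO_ACIDS)
        else:
            raise ValueError(f"Unknown corruption strategy: {corruption_strategy}")
    return "".join(residues), positions


original = "MKTAYIAKQRQISFVKSHFS"
masked, masked_positions = corrupt_sequence(original, 0.25, "mask", random.Random(0))
mutated, mutated_positions = corrupt_sequence(original, 0.25, "mutate", random.Random(0))
```

Returning the corrupted positions alongside the sequence makes it easy to restrict a reconstruction loss to the residues that were actually changed.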
Implements a transform for corrupting the Cartesian coordinates of a protein structure.
- class proteinworkshop.tasks.structural_denoising.StructuralNoiseTransform(corruption_rate: float, corruption_strategy: Literal['uniform', 'gaussian'])
Bases:
BaseTransform
Adds noise to the coordinates of a protein structure.
Sets the following attributes on the protein data object:
coords_uncorrupted: The original coordinates of the protein.
noise: The noise added to the coordinates.
coords: The original coordinates + noise.
- Parameters:
corruption_rate (float) – Magnitude of corruption to apply to the coordinates.
corruption_strategy (Literal["uniform", "gaussian"]) – Noise strategy to use for corruption.
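The relationship between the three attributes can be sketched in plain Python. This is an illustration only: it assumes that corruption_rate scales unit noise (standard normal for 'gaussian', the interval [-1, 1] for 'uniform'), which the documentation above does not spell out. The helper add_coordinate_noise is hypothetical and works on nested lists rather than tensors.

```python
import random
from typing import Dict, List, Literal

Coords = List[List[float]]  # one [x, y, z] triple per atom


def add_coordinate_noise(
    coords: Coords,
    corruption_rate: float,
    corruption_strategy: Literal["uniform", "gaussian"],
    rng: random.Random,
) -> Dict[str, Coords]:
    """Return coords_uncorrupted, noise and coords as a plain dict."""

    def draw() -> float:
        if corruption_strategy == "gaussian":
            return rng.gauss(0.0, 1.0)
        if corruption_strategy == "uniform":
            return rng.uniform(-1.0, 1.0)
        raise ValueError(f"Unknown corruption strategy: {corruption_strategy}")

    # Assumption: corruption_rate is a multiplicative scale on the unit noise.
    noise = [[corruption_rate * draw() for _ in atom] for atom in coords]
    noisy = [
        [c + n for c, n in zip(atom, atom_noise)]
        for atom, atom_noise in zip(coords, noise)
    ]
    return {
        "coords_uncorrupted": [list(atom) for atom in coords],
        "noise": noise,
        "coords": noisy,
    }


example_coords = [[0.0, 0.0, 0.0], [1.5, -2.0, 3.0]]
example = add_coordinate_noise(example_coords, 0.1, "uniform", random.Random(0))
```

Keeping both the clean coordinates and the noise lets a model be trained to predict either target.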
Implementation of the Torsional Noise Transform.
- class proteinworkshop.tasks.torsional_denoising.TorsionalNoiseTransform(corruption_strategy: str = 'gaussian', corruption_rate: float = 0.1)
Bases:
BaseTransform
Adds noise to the torsional angles of a protein.
Cartesian coordinates are re-computed from the noisy dihedral angles using the pNeRF algorithm.
The true dihedral angles are stored as an attribute on the protein object: batch.true_dihedrals.
Warning
This will subset the data to only include the backbone atoms (N, Ca, C). The backbone oxygen can be placed with: graphein.protein.tensor.reconstruction.place_fourth_coord.
This will break, for example, sidechain torsion angle computation for the first few chi angles that are partially defined by backbone atoms.
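The angle-perturbation step on its own might look like the sketch below. It assumes that corruption_rate acts as the standard deviation of Gaussian noise in radians and that noisy angles are wrapped back into [-pi, pi); neither detail is stated above. The pNeRF reconstruction of Cartesian coordinates is not shown, and both helper functions are hypothetical.

```python
import math
import random
from typing import List


def wrap_angle(angle: float) -> float:
    """Map an angle in radians back into the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def add_torsional_noise(
    dihedrals: List[float], corruption_rate: float, rng: random.Random
) -> List[float]:
    """Add Gaussian noise to each dihedral angle and wrap the result."""
    # Assumption: corruption_rate is the noise standard deviation in radians.
    return [wrap_angle(a + rng.gauss(0.0, corruption_rate)) for a in dihedrals]


true_dihedrals = [-1.0, 2.5, 3.1, -3.1]
noisy_dihedrals = add_torsional_noise(true_dihedrals, 0.1, random.Random(0))
```

Wrapping matters because angles near pi and -pi describe almost the same conformation, so unwrapped noise could otherwise produce values outside the usual range.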
Masked Attribute Prediction Transforms
- class proteinworkshop.tasks.edge_distance_prediction.EdgeDistancePredictionTransform(num_samples: int)
Bases:
BaseTransform
Self-supervision task to predict the pairwise distance between two nodes.
We first sample num_samples edges randomly from the input batch. We then construct a mask to remove the sampled edges from the batch. We store the masked node indices and their pairwise distances as batch.node_mask and batch.edge_distance_labels, respectively. Finally, we mask the edges (and their attributes) using the constructed mask and return the modified batch.
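The steps described above can be sketched on plain Python lists. This is an illustration under stated assumptions, not the library code: edges are given as (source, target) index pairs, the distance is Euclidean, and the result is returned as a dict whose keys mirror the batch attributes. The function mask_edges_for_distance_prediction and the edge_index key are hypothetical names.

```python
import math
import random
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def mask_edges_for_distance_prediction(
    coords: List[List[float]],
    edges: List[Edge],
    num_samples: int,
    rng: random.Random,
) -> Dict[str, list]:
    """Sample edges, record their node pairs and distances, then drop them."""
    # 1. Sample num_samples edges at random (without replacement).
    sampled = sorted(rng.sample(range(len(edges)), num_samples))
    sampled_set = set(sampled)
    # 2. Store the node pairs of the sampled edges and their pairwise distances.
    node_mask = [edges[i] for i in sampled]
    labels = [math.dist(coords[u], coords[v]) for u, v in node_mask]
    # 3. Remove the sampled edges from the graph.
    kept = [edge for i, edge in enumerate(edges) if i not in sampled_set]
    return {"node_mask": node_mask, "edge_distance_labels": labels, "edge_index": kept}


coords = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 5.0], [1.0, 0.0, 0.0]]
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
result = mask_edges_for_distance_prediction(coords, edges, num_samples=2, rng=random.Random(0))
```

Removing the sampled edges before the forward pass prevents the model from reading the answer off the edge attributes.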
- class proteinworkshop.tasks.backbone_dihedral_angle_prediction.BackboneDihedralPredictionTransform
Bases:
BaseTransform
Transform to store backbone dihedral angles as attributes on proteins.
This is used for setting the labels in an SSL context, not for featurisation.
Sets dihedrals as an attribute of the Batch object (i.e. batch.dihedrals). This is retrieved in proteinworkshop.models.base.BaseModel.get_labels() for supervision.
- property required_attributes: Set[str]
Required batch attributes for this transform.
coords are required for computing dihedrals. This is a tensor of shape \((N, 37, 3)\), where \(N\) is the number of residues, 37 is the number of unique atom types in PDBs, and 3 is the x, y, z position of each atom.
- Returns:
Set of required attributes
- Return type:
Set[str]
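For reference, a dihedral angle is defined by four consecutive points. The sketch below uses a standard atan2-based formula in pure Python. It is not taken from the library, which presumably operates on batched tensors, and the function name dihedral is hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec = Tuple[float, float, float]


def _sub(a: Sequence[float], b: Sequence[float]) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Sequence[float], b: Sequence[float]) -> Vec:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0: Vec, p1: Vec, p2: Vec, p3: Vec) -> float:
    """Dihedral angle in radians about the p1-p2 axis."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    axis = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    s0, s2 = _dot(b0, axis), _dot(b2, axis)
    v = _sub(b0, (s0 * axis[0], s0 * axis[1], s0 * axis[2]))
    w = _sub(b2, (s2 * axis[0], s2 * axis[1], s2 * axis[2]))
    # atan2 recovers the signed angle between the two projections.
    return math.atan2(_dot(_cross(axis, v), w), _dot(v, w))


cis = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0))
trans = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (-1.0, 0.0, 1.0))
```

For a backbone, phi of residue i would be computed from C(i-1), N(i), CA(i), C(i) and psi from N(i), CA(i), C(i), N(i+1).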
Structural Annotation Prediction
- class proteinworkshop.tasks.ppi_site_prediction.BindingSiteTransform(radius: float = 3.5, ca_only: bool = True)
Bases:
BaseTransform
- class proteinworkshop.tasks.binding_site_prediction.BindingSiteTransform(hetatms: List[str], threshold: float, ca_only: bool = False, multilabel: bool = True)
Bases:
BaseTransform
Extracts binding site labels for a given set of HETATMs.
This transform builds a KDTree from the protein coordinates. Atoms belonging to HETATMs (specified by the hetatms arg at initialization) are then queried against the KDTree to obtain indices of residues within threshold distance of the HETATM.
These indices are used to assign node labels to the protein graph. If multilabel is set to True, then each binding HETATM will be assigned a separate label (i.e. whether residue \(i\) is proximal to HETATM \(j\) is given by \(\hat{y}_{ij}\), with \(\hat{y} \in \mathbb{R}^{|V| \times |H|}\)). Otherwise, the labels will be assigned as a single label (i.e. whether residue \(i\) is proximal to any HETATM, with \(\hat{y} \in \mathbb{R}^{|V|}\)).
If ca_only is set to True, then only the alpha carbon atoms will be used to determine proximity. If ca_only is set to False, then all atoms will be used to determine proximity, i.e. if any atom in a residue is within threshold distance of a HETATM, then the residue will be labeled accordingly.
Warning
This transform requires the data.coords and data.hetatms fields to be set on the input Data/Batch. See: required_attributes()
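The labelling rule can be illustrated without a KDTree. The sketch below swaps in a brute-force distance check, which gives the same labels on small inputs but scales worse, and it treats "within threshold" as less than or equal to the threshold, which is an assumption. The function binding_site_labels and its input layout (a list of atom coordinates per residue, and a dict of atom coordinates per HETATM name) are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Union

Point = Sequence[float]


def binding_site_labels(
    residue_atoms: List[List[Point]],
    hetatm_coords: Dict[str, List[Point]],
    threshold: float,
    multilabel: bool = True,
) -> Union[List[List[int]], List[int]]:
    """Label residues that have any atom within threshold of a HETATM atom."""
    names = list(hetatm_coords)
    # Row i, column j: is residue i proximal to HETATM j? Shape |V| x |H|.
    per_hetatm = [
        [
            int(any(
                math.dist(atom, het) <= threshold
                for atom in atoms
                for het in hetatm_coords[name]
            ))
            for name in names
        ]
        for atoms in residue_atoms
    ]
    if multilabel:
        return per_hetatm
    # Single label per residue: proximal to any HETATM. Length |V|.
    return [int(any(row)) for row in per_hetatm]


# Passing only the alpha carbon of each residue emulates ca_only=True.
residues = [[(0.0, 0.0, 0.0)], [(10.0, 0.0, 0.0)], [(0.0, 50.0, 0.0)]]
ligands = {"ZN": [(1.0, 0.0, 0.0)], "HEM": [(10.0, 1.0, 0.0), (0.0, 6.0, 0.0)]}
multi = binding_site_labels(residues, ligands, threshold=2.0, multilabel=True)
single = binding_site_labels(residues, ligands, threshold=2.0, multilabel=False)
```

A spatial index such as a KDTree avoids the nested loop over all atom pairs, which is why the transform described above builds one.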
Misc
Implementation of a transform to remove residues with missing CA atoms.
- class proteinworkshop.tasks.remove_missing_ca.RemoveMissingCa(fill_value: float = 1e-05, ca_idx: int = 1)
Bases:
BaseTransform
Removes residues with missing CA atoms from a protein structure.
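A possible reading of the two parameters is sketched below. It assumes that missing atoms are encoded by setting all three of their coordinates to fill_value, and that ca_idx is the position of the CA atom within each residue's atom array. Both points are assumptions, since the documentation above does not state them, and the function residues_with_ca is hypothetical.

```python
from typing import List

Residue = List[List[float]]  # one [x, y, z] triple per atom slot


def residues_with_ca(
    coords: List[Residue], fill_value: float = 1e-5, ca_idx: int = 1
) -> List[int]:
    """Return the indices of residues whose CA atom is present."""
    # Assumption: a missing atom has every coordinate equal to fill_value.
    return [
        i
        for i, residue in enumerate(coords)
        if not all(c == fill_value for c in residue[ca_idx])
    ]


structure = [
    [[0.0, 0.0, 0.0], [1.4, 0.0, 0.0], [2.0, 1.2, 0.0]],     # CA present
    [[3.0, 1.5, 0.0], [1e-5, 1e-5, 1e-5], [4.5, 2.0, 0.5]],  # CA missing
    [[5.0, 3.0, 0.0], [6.2, 3.5, 0.0], [7.0, 4.5, 0.0]],     # CA present
]
keep = residues_with_ca(structure)
filtered = [structure[i] for i in keep]
```

Any per-residue attributes would need to be subset with the same indices so that they stay aligned with the filtered coordinates.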