protein_workshop.tasks#
Denoising Transforms#
Implements the sequence denoising task.
- class proteinworkshop.tasks.sequence_denoising.SequenceNoiseTransform(corruption_rate: float, corruption_strategy: Literal['mutate', 'mask'])[source]#
Bases:
BaseTransform
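The signature above only names the two strategies, so the following is a minimal sketch of what sequence corruption could look like, not the library's implementation. It assumes that `corruption_rate` is the fraction of residues to corrupt, that `'mask'` swaps a residue for a placeholder token, and that `'mutate'` swaps it for a randomly drawn amino acid. The function `corrupt_sequence`, the `MASK_TOKEN` value and the string representation are all hypothetical.

```python
import random
from typing import List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK_TOKEN = "X"  # hypothetical placeholder for a masked residue


def corrupt_sequence(
    sequence: str,
    corruption_rate: float,
    corruption_strategy: str,
    seed: int = 0,
) -> Tuple[str, List[int]]:
    """Corrupt a fraction of residues; return the new sequence and the positions."""
    if corruption_strategy not in ("mutate", "mask"):
        raise ValueError(f"Unknown strategy: {corruption_strategy}")
    rng = random.Random(seed)
    # Assumption: corruption_rate is the fraction of positions to corrupt.
    n_corrupt = int(len(sequence) * corruption_rate)
    positions = sorted(rng.sample(range(len(sequence)), n_corrupt))
    residues = list(sequence)
    for i in positions:
        if corruption_strategy == "mask":
            residues[i] = MASK_TOKEN
        else:
            # A mutation may draw the original residue again by chance.
            residues[i] = rng.choice(AMINO_ACIDS)
    return "".join(residues), positions


corrupted, positions = corrupt_sequence("MKTAYIAKQR", 0.5, "mask")
```

Returning the corrupted positions alongside the sequence lets a denoising loss be restricted to the residues that were actually changed.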
Implements a transform for corrupting the Cartesian coordinates of a protein structure.
- class proteinworkshop.tasks.structural_denoising.StructuralNoiseTransform(corruption_rate: float, corruption_strategy: Literal['uniform', 'gaussian'])[source]#
Bases:
BaseTransform
Adds noise to the coordinates of a protein structure.
Sets the following attributes on the protein data object:
coords_uncorrupted: The original coordinates of the protein.
noise: The noise added to the coordinates.
coords: The original coordinates + noise.
- Parameters:
corruption_rate (float) – Magnitude of corruption to apply to the coordinates.
corruption_strategy (Literal["uniform", "gaussian"]) – Noise strategy to use for corruption.
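As an illustration of the three attributes listed above, here is a plain-Python sketch of coordinate corruption. It is not the library's code: the real transform operates on tensors, and how `corruption_rate` scales each noise distribution is an assumption here (standard deviation for `"gaussian"`, half-width of the interval for `"uniform"`). The function name `add_coordinate_noise` is hypothetical.

```python
import random
from typing import Dict, List

Coord = List[float]  # one (x, y, z) position


def add_coordinate_noise(
    coords: List[Coord],
    corruption_rate: float,
    corruption_strategy: str = "gaussian",
    seed: int = 0,
) -> Dict[str, List[Coord]]:
    """Return coords_uncorrupted, noise and coords as a plain dict."""
    rng = random.Random(seed)

    def draw() -> float:
        if corruption_strategy == "gaussian":
            # Assumption: corruption_rate acts as the standard deviation.
            return rng.gauss(0.0, corruption_rate)
        if corruption_strategy == "uniform":
            # Assumption: noise lies in [-corruption_rate, corruption_rate].
            return rng.uniform(-corruption_rate, corruption_rate)
        raise ValueError(f"Unknown strategy: {corruption_strategy}")

    noise = [[draw() for _ in atom] for atom in coords]
    noisy = [
        [c + n for c, n in zip(atom, eps)] for atom, eps in zip(coords, noise)
    ]
    return {
        "coords_uncorrupted": [list(atom) for atom in coords],
        "noise": noise,
        "coords": noisy,
    }


out = add_coordinate_noise([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]], 0.1, "uniform")
```

Keeping both `coords_uncorrupted` and `noise` means a model can be trained to predict either the clean coordinates or the noise itself.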
Implementation of the Torsional Noise Transform.
- class proteinworkshop.tasks.torsional_denoising.TorsionalNoiseTransform(corruption_strategy: str = 'gaussian', corruption_rate: float = 0.1)[source]#
Bases:
BaseTransform
Adds noise to the torsional angles of a protein.
Cartesian coordinates are re-computed from the noisy dihedral angles using the pNeRF algorithm.
The true dihedral angles are stored as an attribute on the protein object: batch.true_dihedrals.
Warning
This will subset the data to only include the backbone atoms (N, Ca, C). The backbone oxygen can be placed with graphein.protein.tensor.reconstruction.place_fourth_coord.
This will break, for example, sidechain torsion angle computation for the first few chi angles that are partially defined by backbone atoms.
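The angle-perturbation step can be sketched in isolation as follows. This covers only the noising of the dihedrals; rebuilding Cartesian coordinates with pNeRF is not shown. It assumes Gaussian noise with standard deviation `corruption_rate` and assumes the noisy angles are wrapped back to [-pi, pi), neither of which is confirmed by the text above. The names `wrap_angle` and `noise_dihedrals` are hypothetical.

```python
import math
import random
from typing import Dict, List


def wrap_angle(theta: float) -> float:
    """Map an angle in radians back onto the interval [-pi, pi)."""
    return (theta + math.pi) % (2.0 * math.pi) - math.pi


def noise_dihedrals(
    dihedrals: List[List[float]],
    corruption_rate: float = 0.1,
    seed: int = 0,
) -> Dict[str, List[List[float]]]:
    """Add Gaussian noise to per-residue dihedral angles (radians)."""
    rng = random.Random(seed)
    noisy = [
        # Assumption: corruption_rate is the standard deviation of the noise.
        [wrap_angle(a + rng.gauss(0.0, corruption_rate)) for a in residue]
        for residue in dihedrals
    ]
    # The uncorrupted angles are kept so they can serve as the target.
    return {"true_dihedrals": dihedrals, "dihedrals": noisy}


result = noise_dihedrals([[-1.05, 2.36, 3.14], [-2.10, 2.50, -3.13]])
```

Wrapping matters for angles near ±pi, such as a trans omega, where a small perturbation would otherwise leave the valid range.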
Masked Attribute Prediction Transforms#
- class proteinworkshop.tasks.edge_distance_prediction.EdgeDistancePredictionTransform(num_samples: int)[source]#
Bases:
BaseTransform
Self-supervision task to predict the pairwise distance between two nodes.
We first sample num_samples edges randomly from the input batch. We then construct a mask to remove the sampled edges from the batch. We store the masked node indices and their pairwise distances as batch.node_mask and batch.edge_distance_labels, respectively. Finally, we mask the edges (and their attributes) using the constructed mask and return the modified batch.
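The sample, mask, label and remove steps can be sketched on plain Python lists. This is an illustration under stated assumptions rather than the transform itself: edges are represented as `(source, target)` tuples instead of an edge-index tensor, the labels are Euclidean distances between the node coordinates, and the function name `sample_edge_distance_task` is hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[int, int]


def sample_edge_distance_task(
    coords: Sequence[Sequence[float]],
    edge_index: List[Edge],
    num_samples: int,
    seed: int = 0,
) -> Dict[str, list]:
    """Remove randomly sampled edges and keep their lengths as labels."""
    rng = random.Random(seed)
    num_samples = min(num_samples, len(edge_index))
    sampled = set(rng.sample(range(len(edge_index)), num_samples))
    # True for edges that stay in the graph, False for the sampled edges.
    keep_mask = [i not in sampled for i in range(len(edge_index))]
    node_mask = [e for e, keep in zip(edge_index, keep_mask) if not keep]
    # Assumption: the label is the Euclidean distance between the endpoints.
    labels = [math.dist(coords[u], coords[v]) for u, v in node_mask]
    remaining = [e for e, keep in zip(edge_index, keep_mask) if keep]
    return {
        "edge_index": remaining,
        "node_mask": node_mask,
        "edge_distance_labels": labels,
    }


coords = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 2.0]]
edges = [(0, 1), (1, 2), (0, 2)]
task = sample_edge_distance_task(coords, edges, num_samples=2)
```

Removing the sampled edges before message passing prevents the model from reading the answer off the edge attributes it is asked to predict.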
- class proteinworkshop.tasks.backbone_dihedral_angle_prediction.BackboneDihedralPredictionTransform[source]#
Bases:
BaseTransform
Transform to store backbone dihedral angles as attributes on proteins.
This is used for setting the labels in a SSL context, not for featurisation.
Sets dihedrals as an attribute of the Batch object (i.e. batch.dihedrals). This is retrieved in proteinworkshop.models.base.BaseModel.get_labels() for supervision.
- property required_attributes: Set[str]#
Required batch attributes for this transform.
coords are required for computing dihedrals. This is a tensor of shape \((N, 37, 3)\) where \(N\) is the number of residues, 37 is the number of unique atoms in PDBs, and 3 is the x, y, z position of each atom.
- Returns:
Set of required attributes
- Return type:
Set[str]
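To make the dihedral labels concrete, the sketch below computes phi, psi and omega from per-residue atom coordinates in plain Python. It is not the library's routine. It assumes that N, CA and C sit at atom indices 0, 1 and 2 of the 37-atom axis, and it sets angles that are undefined at the chain termini to 0.0, which is an arbitrary choice made for this example. The helper names are hypothetical.

```python
import math
from typing import List, Sequence

Vec = Sequence[float]


def _sub(a: Vec, b: Vec) -> List[float]:
    return [x - y for x, y in zip(a, b)]


def _dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def _cross(a: Vec, b: Vec) -> List[float]:
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def dihedral(p0: Vec, p1: Vec, p2: Vec, p3: Vec) -> float:
    """Signed dihedral angle in radians defined by four points."""
    b0, b1, b2 = _sub(p0, p1), _sub(p2, p1), _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = [x / norm for x in b1]
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, [_dot(b0, b1) * x for x in b1])
    w = _sub(b2, [_dot(b2, b1) * x for x in b1])
    return math.atan2(_dot(_cross(b1, v), w), _dot(v, w))


def backbone_dihedrals(
    coords: Sequence[Sequence[Vec]],
    n_idx: int = 0,   # assumed index of N
    ca_idx: int = 1,  # assumed index of CA
    c_idx: int = 2,   # assumed index of C
) -> List[List[float]]:
    """Return [phi, psi, omega] per residue; undefined terminal angles are 0.0."""
    out = []
    last = len(coords) - 1
    for i, res in enumerate(coords):
        n, ca, c = res[n_idx], res[ca_idx], res[c_idx]
        phi = dihedral(coords[i - 1][c_idx], n, ca, c) if i > 0 else 0.0
        if i < last:
            nxt = coords[i + 1]
            psi = dihedral(n, ca, c, nxt[n_idx])
            omega = dihedral(ca, c, nxt[n_idx], nxt[ca_idx])
        else:
            psi, omega = 0.0, 0.0
        out.append([phi, psi, omega])
    return out
```

Only the N, CA and C entries are read, so the same function works on full \((N, 37, 3)\) input and on backbone-only input.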
Structural Annotation Prediction#
- class proteinworkshop.tasks.ppi_site_prediction.BindingSiteTransform(radius: float = 3.5, ca_only: bool = True)[source]#
Bases:
BaseTransform
- class proteinworkshop.tasks.binding_site_prediction.BindingSiteTransform(hetatms: List[str], threshold: float, ca_only: bool = False, multilabel: bool = True)[source]#
Bases:
BaseTransform
Extracts binding site labels for a given set of HETATMs.
This transform builds a KDTree from the protein coordinates. Atoms belonging to HETATMs (specified by the hetatms arg at initialization) are then queried against the KDTree to obtain indices of residues within threshold distance of the HETATM.
These indices are used to assign node labels to the protein graph. If multilabel is set to True, then each binding HETATM will be assigned a separate label (i.e. whether residue \(i\) is proximal to HETATM \(j\) is given by: \(\hat{y}_{ij} \in \mathbb{R}^{|V| \times |H|}\)). Otherwise, the labels will be assigned as a single label (i.e. whether residue \(i\) is proximal to any HETATM: \(\hat{y} \in \mathbb{R}^{|V|}\)).
If ca_only is set to True, then only the alpha carbon atoms will be used to determine proximity. If ca_only is set to False, then all atoms will be used to determine proximity, i.e. if any atom in a residue is within threshold distance of a HETATM, then the residue will be labeled accordingly.
Warning
This transform requires the data.coords and data.hetatms fields to be set on the input Data/Batch. See: required_attributes()
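The labelling rule described above can be sketched without any dependencies. Note that this example swaps the KDTree for a brute-force distance check, which gives the same labels on small inputs but scales poorly. The input layout (a list of residues, each a list of atom coordinates, plus a dict from HETATM name to atom coordinates), the function name `binding_site_labels` and the `ca_idx` default are assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence, Union

Vec = Sequence[float]


def binding_site_labels(
    residue_atoms: Sequence[Sequence[Vec]],
    hetatm_coords: Dict[str, Sequence[Vec]],
    threshold: float,
    ca_only: bool = False,
    multilabel: bool = True,
    ca_idx: int = 1,  # assumed position of the alpha carbon in each residue
) -> List[Union[int, List[int]]]:
    """Label residues that lie within `threshold` of each HETATM."""
    names = list(hetatm_coords)  # column order of the multilabel output
    labels: List[Union[int, List[int]]] = []
    for atoms in residue_atoms:
        # With ca_only, only the alpha carbon decides proximity.
        candidates = [atoms[ca_idx]] if ca_only else atoms
        row = [
            int(
                any(
                    math.dist(atom, het) <= threshold
                    for atom in candidates
                    for het in hetatm_coords[name]
                )
            )
            for name in names
        ]
        # Single-label mode collapses the row to "proximal to any HETATM".
        labels.append(row if multilabel else int(any(row)))
    return labels


residues = [
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    [[10.0, 0.0, 0.0], [11.0, 0.0, 0.0], [12.0, 0.0, 0.0]],
]
hetatms = {"ZN": [[1.0, 1.0, 0.0]], "HEM": [[12.5, 0.0, 1.0]]}
per_ligand = binding_site_labels(residues, hetatms, threshold=1.5)
```

In this toy input the second residue is near HEM only through a non-CA atom, so switching to `ca_only=True` removes its label while the first residue keeps its ZN label.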
Misc#
Implementation of a transform to remove residues with missing CA atoms.
- class proteinworkshop.tasks.remove_missing_ca.RemoveMissingCa(fill_value: float = 1e-05, ca_idx: int = 1)[source]#
Bases:
BaseTransform
Removes residues with missing CA atoms from a protein structure.
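One plausible reading of the two parameters is that missing atoms are stored as `fill_value` in every coordinate and that `ca_idx` is the position of the alpha carbon along the atom axis. Under that assumption, which the text above does not confirm, the filtering step could look like the sketch below. The function name `remove_missing_ca` and the nested-list representation are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec = Sequence[float]


def remove_missing_ca(
    coords: Sequence[Sequence[Vec]],
    fill_value: float = 1e-5,
    ca_idx: int = 1,
) -> Tuple[List[Sequence[Vec]], List[bool]]:
    """Drop residues whose CA position consists entirely of fill_value."""
    # Assumption: a missing atom has every coordinate equal to fill_value.
    keep = [
        not all(x == fill_value for x in residue[ca_idx]) for residue in coords
    ]
    kept = [residue for residue, k in zip(coords, keep) if k]
    return kept, keep


FILL = 1e-5
structure = [
    [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, 1.0, 0.0]],     # CA present
    [[3.0, 1.0, 0.0], [FILL, FILL, FILL], [5.0, 2.0, 0.0]],  # CA missing
]
kept, mask = remove_missing_ca(structure, fill_value=FILL)
```

Returning the boolean mask as well makes it possible to apply the same filter to any other per-residue attributes so they stay aligned with the coordinates.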