protein_workshop.tasks#

Denoising Transforms#

Implements the sequence denoising task.

class proteinworkshop.tasks.sequence_denoising.SequenceNoiseTransform(corruption_rate: float, corruption_strategy: Literal['mutate', 'mask'])[source]#

Bases: BaseTransform

property required_attributes: Set[str]#
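
The signature above lists two corruption strategies but does not describe them. As a rough illustration of what "mutate" and "mask" commonly mean in sequence denoising, here is a minimal standard-library sketch. The function `corrupt_sequence`, the vocabulary, the mask token and the exact sampling scheme are assumptions made for this example and are not taken from the library.

```python
import random
from typing import List, Literal, Tuple

# Hypothetical vocabulary: the 20 standard amino acids plus one mask token.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK_TOKEN = "X"


def corrupt_sequence(
    sequence: str,
    corruption_rate: float,
    corruption_strategy: Literal["mutate", "mask"],
    seed: int = 0,
) -> Tuple[str, List[bool]]:
    """Corrupt a fraction of residues; return the new sequence and a mask."""
    rng = random.Random(seed)
    # Assumption: corruption_rate is the fraction of positions to corrupt.
    num_corrupt = int(round(corruption_rate * len(sequence)))
    positions = set(rng.sample(range(len(sequence)), num_corrupt))

    corrupted = []
    for i, residue in enumerate(sequence):
        if i not in positions:
            corrupted.append(residue)
        elif corruption_strategy == "mask":
            corrupted.append(MASK_TOKEN)
        elif corruption_strategy == "mutate":
            # Replacement is drawn uniformly and may equal the original residue.
            corrupted.append(rng.choice(AMINO_ACIDS))
        else:
            raise ValueError(f"Unknown strategy: {corruption_strategy}")

    mask = [i in positions for i in range(len(sequence))]
    return "".join(corrupted), mask


original = "MKTAYIAKQRQISFVKSHFS"
masked, mask = corrupt_sequence(original, 0.25, "mask")
mutated, _ = corrupt_sequence(original, 0.25, "mutate")
```

The returned mask records which positions were corrupted, which is what a denoising loss would typically be restricted to.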

Implements a transform for corrupting the Cartesian coordinates of a protein structure.

class proteinworkshop.tasks.structural_denoising.StructuralNoiseTransform(corruption_rate: float, corruption_strategy: Literal['uniform', 'gaussian'])[source]#

Bases: BaseTransform

Adds noise to the coordinates of a protein structure.

Sets the following attributes on the protein data object:

  • coords_uncorrupted: The original coordinates of the protein.

  • noise: The noise added to the coordinates.

  • coords: The original coordinates + noise.

Parameters:
  • corruption_rate (float) – Magnitude of corruption to apply to the coordinates.

  • corruption_strategy (Literal["uniform", "gaussian"]) – Noise strategy to use for corruption.
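
The three attributes listed above can be illustrated with a small NumPy sketch. NumPy stands in for PyTorch tensors here, and the function name `add_coordinate_noise`, the dictionary return value, and the exact noise distributions (unit Gaussian or uniform on [-1, 1], both scaled by `corruption_rate`) are assumptions for illustration, not the library's implementation.

```python
import numpy as np


def add_coordinate_noise(coords, corruption_rate, corruption_strategy, seed=0):
    """Return a dict holding the original coordinates, the noise, and their sum."""
    rng = np.random.default_rng(seed)
    if corruption_strategy == "gaussian":
        # Assumption: corruption_rate acts as the standard deviation.
        noise = rng.normal(0.0, 1.0, size=coords.shape) * corruption_rate
    elif corruption_strategy == "uniform":
        # Assumption: noise is drawn from [-corruption_rate, corruption_rate].
        noise = rng.uniform(-1.0, 1.0, size=coords.shape) * corruption_rate
    else:
        raise ValueError(f"Unknown strategy: {corruption_strategy}")
    return {
        "coords_uncorrupted": coords.copy(),
        "noise": noise,
        "coords": coords + noise,
    }


coords = np.arange(12, dtype=float).reshape(4, 3)
out = add_coordinate_noise(coords, corruption_rate=0.1, corruption_strategy="uniform")
```

Keeping both `coords_uncorrupted` and `noise` lets a model be trained either to recover the clean coordinates or to predict the noise itself.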

property required_attributes: Set[str]#

Implementation of the Torsional Noise Transform.

class proteinworkshop.tasks.torsional_denoising.TorsionalNoiseTransform(corruption_strategy: str = 'gaussian', corruption_rate: float = 0.1)[source]#

Bases: BaseTransform

Adds noise to the torsional angles of a protein.

Cartesian coordinates are re-computed from the noisy dihedral angles using the pNeRF algorithm.

The true dihedral angles are stored as an attribute on the protein object: batch.true_dihedrals.
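
The noising step alone can be sketched with the standard library, as below. The sketch assumes angles in radians, treats `corruption_rate` as the standard deviation of the Gaussian noise, and wraps results back to [-pi, pi). These choices and the helper names `wrap_angle` and `noise_dihedrals` are illustrative assumptions. The pNeRF reconstruction of Cartesian coordinates is not shown.

```python
import math
import random
from typing import List


def wrap_angle(angle: float) -> float:
    """Map an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def noise_dihedrals(
    dihedrals: List[float], corruption_rate: float = 0.1, seed: int = 0
) -> List[float]:
    """Add zero-mean Gaussian noise to each angle and wrap the result."""
    rng = random.Random(seed)
    return [wrap_angle(a + rng.gauss(0.0, corruption_rate)) for a in dihedrals]


# The unperturbed angles are kept so they can serve as regression targets.
true_dihedrals = [-1.05, 2.36, 3.10, -3.12]
noisy_dihedrals = noise_dihedrals(true_dihedrals)
```

Wrapping matters near the branch cut: an angle of 3.10 rad plus a small positive perturbation should land just above -pi rather than outside the valid range.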

Warning

This will subset the data to only include the backbone atoms (N, Ca, C). The backbone oxygen can be placed with: graphein.protein.tensor.reconstruction.place_fourth_coord.

This will break, for example, sidechain torsion angle computation for the first few chi angles that are partially defined by backbone atoms.

Masked Attribute Prediction Transforms#

class proteinworkshop.tasks.edge_distance_prediction.EdgeDistancePredictionTransform(num_samples: int)[source]#

Bases: BaseTransform

Self-supervision task to predict the pairwise distance between two nodes.

The transform first samples num_samples edges at random from the input batch and constructs a mask that removes the sampled edges from the batch. The masked node indices and their pairwise distances are stored as batch.node_mask and batch.edge_distance_labels, respectively. Finally, the transform masks the edges (and their attributes) using the constructed mask and returns the modified batch.
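
The steps described above can be sketched in NumPy as follows. The function name `sample_edge_distances`, the `pos` and `edge_index` argument layout (a 2 x E array of node index pairs), and sampling without replacement are assumptions made for this example. The real transform operates on a batch object and also masks edge attributes.

```python
import numpy as np


def sample_edge_distances(pos, edge_index, num_samples, seed=0):
    """Sample edges, compute their endpoint distances, and drop them."""
    rng = np.random.default_rng(seed)
    num_edges = edge_index.shape[1]
    # Assumption: edges are sampled uniformly without replacement.
    sampled = rng.choice(num_edges, size=num_samples, replace=False)

    # Endpoint node indices of the sampled edges, shape (2, num_samples).
    node_mask = edge_index[:, sampled]
    src, dst = node_mask
    edge_distance_labels = np.linalg.norm(pos[src] - pos[dst], axis=-1)

    # Boolean mask that keeps every edge except the sampled ones.
    keep = np.ones(num_edges, dtype=bool)
    keep[sampled] = False
    return node_mask, edge_distance_labels, edge_index[:, keep]


pos = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 4.0, 0.0]])
edge_index = np.array([[0, 1, 2, 3, 0], [1, 2, 3, 0, 2]])
node_mask, edge_distance_labels, remaining_edges = sample_edge_distances(
    pos, edge_index, num_samples=2
)
```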

property required_batch_attributes: Set[str]#

Returns the set of attributes that this transform requires to be present on the batch object for correct operation.

Returns:

Set of required attributes

Return type:

Set[str]

class proteinworkshop.tasks.backbone_dihedral_angle_prediction.BackboneDihedralPredictionTransform[source]#

Bases: BaseTransform

Transform to store backbone dihedral angles as attributes on proteins.

This is used for setting the labels in a self-supervised learning (SSL) context, not for featurisation.

Sets dihedrals as an attribute of the Batch object (i.e. batch.dihedrals). This is retrieved in proteinworkshop.models.base.BaseModel.get_labels() for supervision.

property required_attributes: Set[str]#

Required batch attributes for this transform.

  • coords are required for computing dihedrals. This is a tensor of shape \((N, 37, 3)\) where \(N\) is the number of residues, 37 is the number of unique atoms in PDBs, and 3 is the x, y, z position of each atom.

Returns:

Set of required attributes

Return type:

Set[str]
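
To make the role of the \((N, 37, 3)\) coordinate tensor concrete, the sketch below computes a single backbone dihedral (phi) from such an array with NumPy. The atom indices `N_IDX`, `CA_IDX`, `C_IDX` are assumed positions of the backbone atoms within the 37-atom axis, and `dihedral` and `phi_angle` are hypothetical helpers. The library's own computation may use a different routine and sign convention.

```python
import numpy as np

# Assumed positions of the backbone atoms along the 37-atom axis.
N_IDX, CA_IDX, C_IDX = 0, 1, 2


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in radians defined by four points."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.arctan2(y, x))


def phi_angle(coords, i):
    """Phi of residue i, defined by C(i-1), N(i), CA(i), C(i)."""
    return dihedral(
        coords[i - 1, C_IDX], coords[i, N_IDX], coords[i, CA_IDX], coords[i, C_IDX]
    )


# Toy two-residue structure with only the atoms needed for phi filled in.
coords = np.zeros((2, 37, 3))
coords[0, C_IDX] = [1.0, 0.0, 0.0]
coords[1, N_IDX] = [0.0, 0.0, 0.0]
coords[1, CA_IDX] = [0.0, 0.0, 1.0]
coords[1, C_IDX] = [0.0, 1.0, 1.0]
phi = phi_angle(coords, 1)
```

In this toy geometry the two flanking bonds are perpendicular when viewed along the N-CA bond, so the magnitude of `phi` is a quarter turn.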

Structural Annotation Prediction#

class proteinworkshop.tasks.ppi_site_prediction.BindingSiteTransform(radius: float = 3.5, ca_only: bool = True)[source]#

Bases: BaseTransform

class proteinworkshop.tasks.binding_site_prediction.BindingSiteTransform(hetatms: List[str], threshold: float, ca_only: bool = False, multilabel: bool = True)[source]#

Bases: BaseTransform

Extracts binding site labels for a given set of HETATMs.

This transform builds a KDTree from the protein coordinates. Atoms belonging to HETATMs (specified by the hetatms arg at initialization) are then queried against the KDTree to obtain indices of residues within threshold distance of the HETATM.

These indices are used to assign node labels to the protein graph. If multilabel is set to True, then each binding HETATM will be assigned a separate label (i.e. whether residue \(i\) is proximal to HETATM \(j\) is given by \(\hat{y}_{ij}\), with \(\hat{y} \in \mathbb{R}^{|V| \times |H|}\)). Otherwise, a single label is assigned per residue (i.e. whether residue \(i\) is proximal to any HETATM, with \(\hat{y} \in \mathbb{R}^{|V|}\)).

If ca_only is set to True, then only the alpha carbon atoms will be used to determine proximity. If ca_only is set to False, then all atoms will be used: if any atom in a residue is within threshold distance of a HETATM, then the residue will be labeled accordingly.
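
The labeling logic described above can be sketched with NumPy. To keep the example dependency-free, it uses brute-force pairwise distances rather than a KDTree. The function `binding_site_labels`, its argument layout, and the default `ca_idx=1` are assumptions for illustration, and handling of missing atoms is omitted.

```python
import numpy as np


def binding_site_labels(
    coords, hetatm_coords, threshold, ca_only=False, multilabel=True, ca_idx=1
):
    """Label residues lying within `threshold` of each HETATM group.

    coords: array of shape (num_residues, num_atoms, 3).
    hetatm_coords: list of arrays, one (num_het_atoms, 3) array per HETATM type.
    """
    if ca_only:
        # Keep the atom axis so the broadcasting below is unchanged.
        coords = coords[:, ca_idx : ca_idx + 1, :]

    labels = np.zeros((coords.shape[0], len(hetatm_coords)), dtype=np.int64)
    for j, het in enumerate(hetatm_coords):
        # Distances between every protein atom and every atom of HETATM j.
        diff = coords[:, :, None, :] - het[None, None, :, :]
        dist = np.linalg.norm(diff, axis=-1)
        # A residue is proximal if any of its atoms is within the threshold.
        labels[:, j] = (dist <= threshold).any(axis=(1, 2))

    if multilabel:
        return labels  # shape (|V|, |H|)
    return labels.any(axis=1).astype(np.int64)  # shape (|V|,)


# Three residues with two atoms each; atom index 1 plays the role of CA.
coords = np.array(
    [
        [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
        [[10.0, 0.0, 0.0], [11.0, 0.0, 0.0]],
        [[20.0, 0.0, 0.0], [25.0, 0.0, 0.0]],
    ]
)
hetatm_coords = [np.array([[0.0, 0.0, 2.0]]), np.array([[19.0, 0.0, 0.0]])]

multi = binding_site_labels(coords, hetatm_coords, threshold=3.0)
single = binding_site_labels(coords, hetatm_coords, threshold=3.0, multilabel=False)
ca = binding_site_labels(coords, hetatm_coords, threshold=3.0, ca_only=True)
```

In this toy case the third residue is labeled in all-atom mode but not in `ca_only` mode, because only its non-CA atom lies within the threshold of the second HETATM.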

Warning

This transform requires the data.coords and data.hetatms fields to be set on the input Data/Batch. See: required_attributes()

property required_attributes: Set[str]#

Returns the batch attributes that this transform requires.

I.e. data.coords and data.hetatms must be set.

Returns:

Set of required attributes

Return type:

Set[str]

Misc#

Implementation of a transform to remove residues with missing CA atoms.

class proteinworkshop.tasks.remove_missing_ca.RemoveMissingCa(fill_value: float = 1e-05, ca_idx: int = 1)[source]#

Bases: BaseTransform

Removes residues with missing CA atoms from a protein structure.
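
The constructor defaults suggest that missing atoms are marked by a fill value in the coordinate tensor and that the CA atom sits at index 1 of the atom axis. Under that assumption, which the text above does not confirm, the filtering step could look roughly like this NumPy sketch. The function `remove_missing_ca` and its return value are illustrative only.

```python
import numpy as np


def remove_missing_ca(coords, fill_value=1e-5, ca_idx=1):
    """Drop residues whose CA coordinates all equal the fill value.

    coords: array of shape (num_residues, num_atoms, 3).
    Returns the filtered coordinates and the boolean keep-mask.
    """
    ca = coords[:, ca_idx, :]
    # Assumption: a missing atom has x, y and z all set to fill_value.
    missing = np.all(np.isclose(ca, fill_value), axis=-1)
    keep = ~missing
    return coords[keep], keep


coords = np.ones((3, 37, 3))
coords[1, 1, :] = 1e-5  # mark the CA atom of the second residue as missing
filtered, keep = remove_missing_ca(coords)
```

The keep-mask would also need to be applied to any other per-residue attributes so that they stay aligned with the filtered coordinates.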