protein_workshop.features#

Featuriser#

class proteinworkshop.features.factory.ProteinFeaturiser(representation: Literal['ca', 'ca_bb', 'full_atom'], scalar_node_features: List[Literal['amino_acid_one_hot', 'alpha', 'kappa', 'dihedrals', 'sidechain_torsions', 'sequence_positional_encoding']], vector_node_features: List[Literal['orientation', 'virtual_cb_vector']], edge_types: List[str], scalar_edge_features: List[Literal['edge_distance', 'sequence_distance']], vector_edge_features: List[Literal['edge_vectors', 'pos_emb']])[source]#

Initialise a protein featuriser.

Parameters:
  • representation (StructureRepresentation) – Representation to use for the protein. One of "ca", "ca_bb", "full_atom".

  • scalar_node_features (List[ScalarNodeFeature]) – List of scalar-valued node features to compute. Options: "amino_acid_one_hot", "sequence_positional_encoding", "alpha", "kappa", "dihedrals", "sidechain_torsions".

  • vector_node_features (List[VectorNodeFeature]) – List of vector-valued node features to compute. Options: "orientation", "virtual_cb_vector".

  • edge_types (List[str]) – List of edge types to compute. Options: "knn_{x}", "eps_{x}" (where {x} is replaced by a numerical value), "seq_forward", "seq_backward".

  • scalar_edge_features (List[ScalarEdgeFeature]) – List of scalar-valued edge features to compute. Options: "edge_distance", "sequence_distance".

  • vector_edge_features (List[VectorEdgeFeature]) – List of vector-valued edge features to compute. Options: "edge_vectors", "pos_emb".
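As a rough illustration of how the arguments fit together, the sketch below assembles a set of keyword arguments and checks them against the option lists in the class signature. The helper check_featuriser_kwargs and the example values (such as "knn_16") are hypothetical and not part of the library; the code deliberately avoids importing proteinworkshop so that it runs on its own.

```python
from typing import Dict, List

# Option lists copied from the ProteinFeaturiser signature above.
REPRESENTATIONS: List[str] = ["ca", "ca_bb", "full_atom"]
ALLOWED: Dict[str, List[str]] = {
    "scalar_node_features": [
        "amino_acid_one_hot", "alpha", "kappa", "dihedrals",
        "sidechain_torsions", "sequence_positional_encoding",
    ],
    "vector_node_features": ["orientation", "virtual_cb_vector"],
    "scalar_edge_features": ["edge_distance", "sequence_distance"],
    "vector_edge_features": ["edge_vectors", "pos_emb"],
}


def check_featuriser_kwargs(kwargs: dict) -> None:
    """Hypothetical helper: reject any option the signature does not list."""
    if kwargs["representation"] not in REPRESENTATIONS:
        raise ValueError(f"Unknown representation: {kwargs['representation']!r}")
    for key, allowed in ALLOWED.items():
        unknown = [name for name in kwargs[key] if name not in allowed]
        if unknown:
            raise ValueError(f"Unknown {key}: {unknown}")


# Example values only; edge_types is a free-form list of strings.
featuriser_kwargs = {
    "representation": "ca",
    "scalar_node_features": ["amino_acid_one_hot", "dihedrals"],
    "vector_node_features": ["orientation"],
    "edge_types": ["knn_16", "seq_forward"],
    "scalar_edge_features": ["edge_distance"],
    "vector_edge_features": ["edge_vectors"],
}
check_featuriser_kwargs(featuriser_kwargs)
```

With the library installed, the same dictionary would presumably be passed as ProteinFeaturiser(**featuriser_kwargs).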

forward(batch: Batch | ProteinBatch) → Batch | ProteinBatch[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Edge Construction#

Edge construction and featurisation utils.

proteinworkshop.features.edges.compute_edges(x: Data | Batch | Protein | ProteinBatch, edge_types: ListConfig | List[str]) → Tuple[Tensor, Tensor][source]#

Orchestrates the computation of edges for a given data object.

This function returns a tuple of two tensors: the first holds the type of each edge and has shape (|E|), and the second holds the edge indices and has shape (2 x |E|).

The edge type tensor can be used to mask out edges of a particular type downstream.

Warning

For spatial edges (e.g. knn_, eps_), the input data/batch object must have a pos attribute of shape (N x 3).

Parameters:
  • x (Union[Data, Batch, Protein, ProteinBatch]) – The input data object to compute edges for

  • edge_types (Union[ListConfig, List[str]]) – List of edge types to compute. Each entry must be one of knn_{x}, eps_{x} (where {x} is replaced by a numerical value), seq_forward or seq_backward.

Raises:
Returns:

Tuple of two tensors: the edge types, of shape (|E|), and the edge indices, of shape (2 x |E|).

Return type:

Tuple[torch.Tensor, torch.Tensor]
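The naming scheme for edge types and the use of the edge type tensor as a mask can be sketched in plain Python. Both helpers below (parse_edge_type, select_edges) are hypothetical, lists stand in for tensors, and the assumption that each edge type value is the position of that type in edge_types is ours rather than something this page states.

```python
from typing import List, Optional, Tuple


def parse_edge_type(name: str) -> Tuple[str, Optional[float]]:
    """Split an edge-type string into (kind, numeric parameter or None)."""
    if name in ("seq_forward", "seq_backward"):
        return name, None
    kind, _, value = name.partition("_")
    if kind in ("knn", "eps") and value:
        # float() raises ValueError if {x} is not numeric.
        return kind, float(value)
    raise ValueError(f"Unrecognised edge type: {name!r}")


def select_edges(
    edge_type: List[int], edge_index: List[List[int]], keep: int
) -> List[List[int]]:
    """Keep only the columns of a (2 x |E|) edge index whose type equals keep."""
    cols = [e for e, t in enumerate(edge_type) if t == keep]
    return [[row[e] for e in cols] for row in edge_index]


edge_types = ["knn_16", "eps_10.0", "seq_forward"]
parsed = [parse_edge_type(t) for t in edge_types]

# Toy output in the documented layout: edge_type has |E| entries and
# edge_index has 2 rows of |E| entries each.
# Assumption: a type value is the index of that type in edge_types.
edge_type = [0, 0, 2]
edge_index = [[0, 1, 0], [2, 2, 1]]
seq_only = select_edges(edge_type, edge_index, keep=2)
```

Here seq_only retains just the single seq_forward edge, which is the kind of downstream masking the edge type tensor is meant to support.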

proteinworkshop.features.edges.sequence_edges(b: Data | Batch | Protein | ProteinBatch, chains: Tensor | None = None, direction: Literal['forward', 'backward'] = 'forward')[source]#

Computes edges between adjacent residues in a sequence.

Parameters:
  • b (Union[Data, Batch, Protein, ProteinBatch]) – Input data object to compute edges for

  • chains (Optional[torch.Tensor], optional) – Tensor of shape (N) indicating the chain ID of each node. This is required for correct boundary handling. Defaults to None.

  • direction (Literal["forward", "backward"], optional) – Direction of edges to compute. Must be "forward" or "backward". Defaults to "forward".

Raises:

ValueError – Raised if direction is not forward or backward

Returns:

Tensor of shape (2 x |E|) indicating the edge indices
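A minimal sketch of the idea, using plain Python lists in place of tensors: consecutive nodes are linked unless they belong to different chains. The function name sequence_edges_sketch is hypothetical, and the convention that "forward" means an edge from node i to node i + 1 (stored as [sources, targets]) is an assumption, not something confirmed by this page.

```python
from typing import List, Optional


def sequence_edges_sketch(
    n_nodes: int,
    chains: Optional[List[int]] = None,
    direction: str = "forward",
) -> List[List[int]]:
    """Return a (2 x |E|) edge index linking sequence-adjacent nodes."""
    if direction not in ("forward", "backward"):
        raise ValueError(f"direction must be 'forward' or 'backward', got {direction!r}")
    src: List[int] = []
    dst: List[int] = []
    for i in range(n_nodes - 1):
        # Skip pairs that straddle a chain boundary when chain IDs are given.
        if chains is not None and chains[i] != chains[i + 1]:
            continue
        # Assumption: forward edges run i -> i + 1, backward edges i + 1 -> i.
        a, b = (i, i + 1) if direction == "forward" else (i + 1, i)
        src.append(a)
        dst.append(b)
    return [src, dst]


# Five residues: three in chain 0, two in chain 1.
chains = [0, 0, 0, 1, 1]
forward = sequence_edges_sketch(5, chains=chains, direction="forward")
backward = sequence_edges_sketch(5, chains=chains, direction="backward")
no_chain_info = sequence_edges_sketch(5)
```

Without chain IDs, the sketch also links residues 2 and 3 across the chain break, which illustrates why chains matters for boundary handling.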

Node Features#

Node feature computation functions.

proteinworkshop.features.node_features.compute_scalar_node_features(x: Batch | Data | Protein | ProteinBatch, node_features: ListConfig | List[Literal['amino_acid_one_hot', 'alpha', 'kappa', 'dihedrals', 'sidechain_torsions', 'sequence_positional_encoding']]) → Tensor[source]#

Factory function for node features.

See also

proteinworkshop.types.ScalarNodeFeature for a list of node features that can be computed.

This function operates on a torch_geometric.data.Data or torch_geometric.data.Batch object and computes the requested node features.

Parameters:
  • x (Union[Batch, Data, Protein, ProteinBatch]) – Protein data or batch object.

  • node_features (Union[List[str], ListConfig]) – List of node features to compute.

Returns:

Tensor of node features of shape (N x F), where N is the number of nodes and F is the number of features.

Return type:

torch.Tensor
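One way to picture the (N x F) output is as several per-feature blocks, each with one row per node, joined along the feature axis. The sketch below uses nested lists instead of tensors; concat_node_features is a hypothetical helper, and the assumption that the library concatenates its feature blocks in this way is ours.

```python
from typing import List

Matrix = List[List[float]]


def concat_node_features(blocks: List[Matrix]) -> Matrix:
    """Join (N x F_i) blocks along the feature axis into (N x sum(F_i))."""
    n_nodes = len(blocks[0])
    if any(len(block) != n_nodes for block in blocks):
        raise ValueError("All feature blocks must have one row per node")
    return [
        [value for block in blocks for value in block[i]]
        for i in range(n_nodes)
    ]


# Two nodes: a 2-dimensional block and a 1-dimensional block (toy values).
block_a = [[1.0, 0.0], [0.0, 1.0]]
block_b = [[0.5], [0.25]]
features = concat_node_features([block_a, block_b])
n_nodes, n_features = len(features), len(features[0])
```

In this toy case N is 2 and F is 3, the sum of the widths of the two blocks.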

proteinworkshop.features.node_features.compute_surface_feat(coords: CoordTensor | AtomTensor, k: int, sigma: List[float])[source]#

Parameters:
  • coords (Union[CoordTensor, AtomTensor]) – Coordinates of shape (N, 3).

  • k (int) – Number of neighbors to consider in the KNN graph.

proteinworkshop.features.node_features.compute_vector_node_features(x: Batch | Data | Protein | ProteinBatch, vector_features: ListConfig | List[str]) → Batch | Data | Protein | ProteinBatch[source]#

Factory function for vector features.

Currently implemented vector features are:

  • orientation: Orientation of each node in the protein backbone

  • virtual_cb_vector: Virtual CB vector for each node in the protein backbone

Sequence features for protein data objects.

proteinworkshop.features.sequence_features.amino_acid_one_hot(x: Batch | Data, num_classes: int = 23) → Tensor[source]#

Returns one-hot encoding of amino acid sequence.

Parameters:
  • x (Union[Batch, Data]) – Protein data object containing a residue_type attribute.

  • num_classes (int, optional) – Number of classes to encode, defaults to 23

Returns:

One-hot encoding of amino acid sequence

Return type:

torch.Tensor
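For readers unfamiliar with the encoding, here is a minimal sketch in plain Python: each integer residue type becomes a row of length num_classes with a single 1. The function one_hot_sketch is hypothetical, the residue indices are toy values, and a list of integers stands in for the residue_type attribute.

```python
from typing import List


def one_hot_sketch(residue_type: List[int], num_classes: int = 23) -> List[List[int]]:
    """Return one row per residue with a 1 at that residue's class index."""
    rows: List[List[int]] = []
    for r in residue_type:
        if not 0 <= r < num_classes:
            raise ValueError(f"Residue index {r} outside [0, {num_classes})")
        row = [0] * num_classes
        row[r] = 1
        rows.append(row)
    return rows


# Toy residue indices; the mapping from amino acids to integers is not shown here.
encoded = one_hot_sketch([0, 5, 22])
shape = (len(encoded), len(encoded[0]))
```

The result has one row per residue and num_classes columns, here 3 x 23.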

Edge Features#

Utilities for computing edge features.

proteinworkshop.features.edge_features.EDGE_FEATURES: List[str] = ['edge_distance', 'node_features', 'edge_type', 'sequence_distance']#

List of edge features that can be computed.

proteinworkshop.features.edge_features.compute_edge_distance(pos: CoordTensor, edge_index: EdgeTensor) → Tensor[source]#

Compute the Euclidean distance between each pair of nodes connected by an edge.

Parameters:
  • pos (CoordTensor) – Tensor of shape \((|V|, 3)\) containing the node coordinates.

  • edge_index (EdgeTensor) – Tensor of shape \((2, |E|)\) containing the indices of the nodes forming the edges.

Returns:

Tensor of shape \((|E|, 1)\) containing the Euclidean distance between each pair of nodes connected by an edge.

Return type:

torch.Tensor
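The computation can be sketched with nested lists standing in for tensors: for every column of the edge index, look up the two endpoint coordinates and take their Euclidean distance. The function edge_distance_sketch is a hypothetical stand-in, not the library implementation.

```python
import math
from typing import List


def edge_distance_sketch(
    pos: List[List[float]], edge_index: List[List[int]]
) -> List[List[float]]:
    """Return a (|E| x 1) list of distances between the endpoints of each edge."""
    src, dst = edge_index
    # Each distance is wrapped in a list to mirror the documented (|E|, 1) shape.
    return [[math.dist(pos[i], pos[j])] for i, j in zip(src, dst)]


# Three nodes and two edges: 0 -> 1 and 0 -> 2.
pos = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 2.0]]
edge_index = [[0, 0], [1, 2]]
distances = edge_distance_sketch(pos, edge_index)
```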

proteinworkshop.features.edge_features.compute_scalar_edge_features(x: Data | Batch, features: List[str] | ListConfig) → Tensor[source]#

Computes scalar edge features from a Data or Batch object.

Parameters:
  • x (Union[Data, Batch]) – Data or Batch protein object.

  • features (Union[List[str], ListConfig]) – List of edge features to compute.

Representation#

proteinworkshop.features.representation.ca_to_bb_repr(batch: Batch) → Batch[source]#

Converts a batch of CA representations to backbone representations, i.e. 1 node per residue -> 4 nodes per residue (N, CA, C, O).

This function tiles any existing node features on the CA atoms over the additional nodes in the backbone representation.

proteinworkshop.features.representation.ca_to_bb_sc_repr(batch: Batch) → Batch[source]#

Converts a batch of CA representations to backbone + sidechain representations.

proteinworkshop.features.representation.ca_to_ca_sc_repr(batch: Batch) → Batch[source]#

Converts a batch of CA representations to CA + sidechain representations.

proteinworkshop.features.representation.ca_to_fa_repr(batch: Batch) → Batch[source]#

Converts a batch of CA representations to full atom representations.

proteinworkshop.features.representation.coarsen_sidechain(x: Data, aggr: str = 'mean') → CoordTensor[source]#

Returns a tensor of sidechain centroids of shape (L x 3).

proteinworkshop.features.representation.get_full_atom_coords(atom_tensor: AtomTensor, fill_value: float = 1e-05) → Tuple[CoordTensor, Tensor, Tensor][source]#

Converts an AtomTensor to a full atom representation (e.g. dense to sparse).

Parameters:
  • atom_tensor (AtomTensor) – AtomTensor of shape (N_residues x 37 x 3)

  • fill_value (float, optional) – Value indicating missing atoms, defaults to 1e-5

Returns:

Tuple of coords (N_atoms x 3), residue_index (N_atoms), atom_type (N_atoms ([0-36]))

Return type:

Tuple[CoordTensor, torch.Tensor, torch.Tensor]
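The dense-to-sparse conversion can be sketched as a scan over the 37 atom slots of every residue, keeping only the slots that hold a real atom. The version below uses nested lists rather than tensors and is a hypothetical stand-in; in particular, treating a slot as missing when all three coordinates equal fill_value is an assumption about how missing atoms are detected.

```python
from typing import List, Tuple

Coord = List[float]


def full_atom_coords_sketch(
    atom_tensor: List[List[Coord]], fill_value: float = 1e-5
) -> Tuple[List[Coord], List[int], List[int]]:
    """Flatten an (N_residues x 37 x 3) structure into per-atom lists."""
    coords: List[Coord] = []
    residue_index: List[int] = []
    atom_type: List[int] = []
    for res_idx, residue in enumerate(atom_tensor):
        for slot, xyz in enumerate(residue):
            # Assumption: a slot is empty when every coordinate is fill_value.
            if all(c == fill_value for c in xyz):
                continue
            coords.append(xyz)
            residue_index.append(res_idx)
            atom_type.append(slot)  # slot index in [0, 36]
    return coords, residue_index, atom_type


def empty_residue(fill_value: float = 1e-5) -> List[Coord]:
    """Build a residue with all 37 atom slots marked as missing."""
    return [[fill_value] * 3 for _ in range(37)]


# Toy input: residue 0 has atoms in slots 0 and 1, residue 1 only in slot 1.
res0 = empty_residue()
res0[0] = [1.0, 0.0, 0.0]
res0[1] = [2.0, 0.0, 0.0]
res1 = empty_residue()
res1[1] = [5.0, 1.0, 0.0]
coords, residue_index, atom_type = full_atom_coords_sketch([res0, res1])
```

The three returned lists have equal length (N_atoms) and together record where each atom is, which residue it came from, and which slot it occupied.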

proteinworkshop.features.representation.transform_representation(x: Batch, representation_type: Literal['CA', 'BB', 'FA', 'BB_SC', 'CA_SC']) → Batch[source]#

Factory method to transform a batch into a specified representation.

The AtomTensor (i.e. batch.coords, with shape \(|V| \times 37 \times 3\)) is manipulated to produce the corresponding number of nodes according to the desired node representation.

  • CA simply selects the \(C_\alpha\) atoms as nodes (i.e. batch.coords[:, 1, :]).

  • BB selects and unravels the four backbone atoms (\(N, C_\alpha, C, O\)) as nodes. Existing node features are tiled over the backbone atom nodes on a per-residue basis.

  • FA unravels the whole AtomTensor to produce a full-atom graph, i.e. each atom in the structure becomes a node in the graph. Existing node features are tiled over the atom nodes on a per-residue basis.

Parameters:
  • x (Batch) – A minibatch of data

  • representation_type (Literal["CA", "BB", "FA", "BB_SC", "CA_SC"]) – The node representation to transform the batch into.

Raises:

ExperimentConfigurationError – Raised if representation_type is not a supported representation.

Returns:

The batch transformed into the requested representation.

Return type:

Batch
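The CA and BB cases can be sketched with nested lists standing in for the AtomTensor. Both helpers are hypothetical. The sketch assumes that the backbone atoms N, CA, C and O occupy slots 0 to 3 of the 37 atom slots, which is consistent with CA being at index 1 as stated above but is not confirmed by this page, and that "tiling" means repeating each residue's features once for each of its nodes.

```python
from typing import Any, List, Tuple

Coord = List[float]


def to_ca_sketch(coords: List[List[Coord]]) -> List[Coord]:
    """Select slot 1 (the C-alpha atom) of every residue: one node per residue."""
    return [residue[1] for residue in coords]


def to_bb_sketch(
    coords: List[List[Coord]], node_feats: List[Any]
) -> Tuple[List[Coord], List[Any]]:
    """Unravel four backbone atoms per residue and tile the residue features."""
    # Assumption: slots 0-3 hold N, CA, C, O in that order.
    pos = [residue[slot] for residue in coords for slot in range(4)]
    # Each residue's feature is repeated for its four backbone nodes.
    feats = [feat for feat in node_feats for _ in range(4)]
    return pos, feats


# Toy AtomTensor: 2 residues x 37 slots x 3, with coordinate [residue, slot, 0].
coords = [[[float(r), float(a), 0.0] for a in range(37)] for r in range(2)]
node_feats = ["A", "G"]  # one toy feature per residue

ca_pos = to_ca_sketch(coords)
bb_pos, bb_feats = to_bb_sketch(coords, node_feats)
```

The CA view keeps 2 nodes, while the BB view has 8 nodes, with the features of each residue repeated over its four backbone atoms.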