protein_workshop.features#
Featuriser#
- class proteinworkshop.features.factory.ProteinFeaturiser(representation: Literal['ca', 'ca_bb', 'full_atom'], scalar_node_features: List[Literal['amino_acid_one_hot', 'alpha', 'kappa', 'dihedrals', 'sidechain_torsions', 'sequence_positional_encoding']], vector_node_features: List[Literal['orientation', 'virtual_cb_vector']], edge_types: List[str], scalar_edge_features: List[Literal['edge_distance', 'sequence_distance']], vector_edge_features: List[Literal['edge_vectors', 'pos_emb']])[source]#
Initialise a protein featuriser.
- Parameters:
representation (StructureRepresentation) – Representation to use for the protein. One of "ca", "ca_bb", "full_atom".
scalar_node_features (List[ScalarNodeFeature]) – List of scalar-valued node features to compute. Options: "amino_acid_one_hot", "sequence_positional_encoding", "alpha", "kappa", "dihedrals", "sidechain_torsions".
vector_node_features (List[VectorNodeFeature]) – List of vector-valued node features to compute. Options: "orientation", "virtual_cb_vector".
edge_types (List[str]) – List of edge types to compute. Options, as documented for compute_edges: knn_{x}, eps_{x} (where {x} is a numerical value), seq_forward, seq_backward.
scalar_edge_features (List[ScalarEdgeFeature]) – List of scalar-valued edge features to compute. Options: "edge_distance", "sequence_distance".
vector_edge_features (List[VectorEdgeFeature]) – List of vector-valued edge features to compute. Options: "edge_vectors", "pos_emb".
- forward(batch: Batch | ProteinBatch) -> Batch | ProteinBatch [source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
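The constructor options listed above can be collected into a plain dictionary and checked before building a featuriser. The following is a minimal sketch, not part of the library: `check_featuriser_config` and `ALLOWED_OPTIONS` are hypothetical names, and the option lists are copied from the signature documented above.

```python
from typing import Dict, List, Union

# Option names copied from the ProteinFeaturiser signature documented above.
ALLOWED_OPTIONS: Dict[str, List[str]] = {
    "representation": ["ca", "ca_bb", "full_atom"],
    "scalar_node_features": [
        "amino_acid_one_hot", "alpha", "kappa", "dihedrals",
        "sidechain_torsions", "sequence_positional_encoding",
    ],
    "vector_node_features": ["orientation", "virtual_cb_vector"],
    "scalar_edge_features": ["edge_distance", "sequence_distance"],
    "vector_edge_features": ["edge_vectors", "pos_emb"],
}


def check_featuriser_config(config: Dict[str, Union[str, List[str]]]) -> List[str]:
    """Hypothetical helper: return a list of problems found in ``config``."""
    problems = []
    for key, allowed in ALLOWED_OPTIONS.items():
        if key not in config:
            problems.append(f"missing argument: {key}")
            continue
        value = config[key]
        # ``representation`` is a single string; the other options are lists.
        values = [value] if isinstance(value, str) else list(value)
        for v in values:
            if v not in allowed:
                problems.append(f"{key}: unknown option {v!r}")
    # edge_types is a free-form list of strings, so only its presence is checked.
    if "edge_types" not in config:
        problems.append("missing argument: edge_types")
    return problems


config = {
    "representation": "ca",
    "scalar_node_features": ["amino_acid_one_hot", "dihedrals"],
    "vector_node_features": ["orientation"],
    "edge_types": ["knn_16", "seq_forward"],
    "scalar_edge_features": ["edge_distance"],
    "vector_edge_features": ["edge_vectors"],
}
print(check_featuriser_config(config))  # []
```

If the check passes and the package is installed, the same dictionary could presumably be passed as keyword arguments, i.e. `ProteinFeaturiser(**config)`.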
Edge Construction#
Edge construction and featurisation utils.
- proteinworkshop.features.edges.compute_edges(x: Data | Batch | Protein | ProteinBatch, edge_types: ListConfig | List[str]) -> Tuple[Tensor, Tensor] [source]#
Orchestrates the computation of edges for a given data object.
This function returns a tuple of tensors: the first indicates the edge type and has shape (|E|); the second holds the edge indices and has shape (2 x |E|). The edge type tensor can be used to mask out edges of a particular type downstream.
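As an illustration of the masking mentioned above, here is a minimal pure-Python sketch using nested lists in place of tensors. `select_edges` is a hypothetical helper, and the integer type ids are made up for the example.

```python
from typing import List


def select_edges(
    edge_type: List[int], edge_index: List[List[int]], wanted: int
) -> List[List[int]]:
    """Keep only the columns of ``edge_index`` whose edge type equals ``wanted``."""
    keep = [i for i, t in enumerate(edge_type) if t == wanted]
    return [[row[i] for i in keep] for row in edge_index]


# Four edges: two of type 0 and two of type 1 (type ids are illustrative).
edge_type = [0, 0, 1, 1]               # shape (|E|)
edge_index = [[0, 1, 0, 2],            # source nodes
              [1, 2, 2, 0]]            # target nodes, shape (2 x |E|)
print(select_edges(edge_type, edge_index, wanted=1))  # [[0, 2], [2, 0]]
```

With PyTorch tensors the same selection is usually written as a boolean mask over the second dimension, e.g. `edge_index[:, edge_type == 1]`.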
Warning
For spatial edges (e.g. knn_, eps_), the input data/batch object must have a pos attribute of shape (N x 3).
- Parameters:
x (Union[Data, Batch, Protein, ProteinBatch]) – The input data object to compute edges for
edge_types (Union[ListConfig, List[str]]) – List of edge types to compute. Must be a sequence of knn_{x}, eps_{x} (where {x} should be replaced by a numerical value), seq_forward, seq_backward.
- Raises:
ValueError – Raised if x is not a torch_geometric Data or Batch object
NotImplementedError – Raised if an edge type is not implemented
- Returns:
Tuple of tensors, where the first tensor indicates the edge type, with shape (|E|), and the second holds the edge indices, with shape (2 x |E|).
- Return type:
Tuple[torch.Tensor, torch.Tensor]
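The edge type strings described above follow a simple naming pattern. The sketch below shows one way such strings could be parsed; `parse_edge_type` is a hypothetical helper and is not claimed to be how compute_edges handles them internally.

```python
from typing import Optional, Tuple


def parse_edge_type(spec: str) -> Tuple[str, Optional[float]]:
    """Split an edge type string into its kind and optional numerical value."""
    if spec in ("seq_forward", "seq_backward"):
        return spec, None
    kind, sep, value = spec.partition("_")
    if kind in ("knn", "eps") and sep and value:
        try:
            return kind, float(value)
        except ValueError:
            pass  # fall through to the error below
    raise NotImplementedError(f"Edge type {spec!r} is not implemented")


print([parse_edge_type(s) for s in ["knn_16", "eps_10.5", "seq_forward"]])
# [('knn', 16.0), ('eps', 10.5), ('seq_forward', None)]
```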
- proteinworkshop.features.edges.sequence_edges(b: Data | Batch | Protein | ProteinBatch, chains: Tensor | None = None, direction: Literal['forward', 'backward'] = 'forward')[source]#
Computes edges between adjacent residues in a sequence.
- Parameters:
b (Union[Data, Batch, Protein, ProteinBatch]) – Input data object to compute edges for
chains (Optional[torch.Tensor], optional) – Tensor of shape (N) indicating the chain ID of each node. This is required for correct boundary handling. Defaults to None
direction (Literal["forward", "backward"], optional) – Direction of edges to compute. Must be forward or backward. Defaults to forward
- Raises:
ValueError – Raised if direction is not forward or backward
- Returns:
Tensor of shape (2 x |E|) indicating the edge indices
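To make the role of the chains argument concrete, here is a minimal pure-Python sketch of sequence edge construction. It assumes that "forward" means an edge from residue i to residue i + 1 and that nodes are ordered along the sequence; `sequence_edges_sketch` is a hypothetical name and the library's implementation may differ.

```python
from typing import List, Optional


def sequence_edges_sketch(
    num_nodes: int,
    chains: Optional[List[int]] = None,
    direction: str = "forward",
) -> List[List[int]]:
    """Connect adjacent residues, skipping pairs that straddle a chain boundary."""
    if direction not in ("forward", "backward"):
        raise ValueError(f"direction must be forward or backward, got {direction!r}")
    src, dst = [], []
    for i in range(num_nodes - 1):
        # Without chain IDs, every adjacent pair is connected.
        if chains is not None and chains[i] != chains[i + 1]:
            continue
        if direction == "forward":   # assumption: i -> i + 1
            src.append(i)
            dst.append(i + 1)
        else:                        # assumption: i + 1 -> i
            src.append(i + 1)
            dst.append(i)
    return [src, dst]                # shape (2 x |E|)


chains = [0, 0, 0, 1, 1]             # two chains: nodes 0-2 and nodes 3-4
print(sequence_edges_sketch(5, chains))              # [[0, 1, 3], [1, 2, 4]]
print(sequence_edges_sketch(5, chains, "backward"))  # [[1, 2, 4], [0, 1, 3]]
```

Note that no edge joins nodes 2 and 3, because they belong to different chains.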
Node Features#
Node feature computation functions.
- proteinworkshop.features.node_features.compute_scalar_node_features(x: Batch | Data | Protein | ProteinBatch, node_features: ListConfig | List[Literal['amino_acid_one_hot', 'alpha', 'kappa', 'dihedrals', 'sidechain_torsions', 'sequence_positional_encoding']]) -> Tensor [source]#
Factory function for node features.
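See also
proteinworkshop.types.ScalarNodeFeature for a list of node features that can be computed.
This function operates on a torch_geometric.data.Data or torch_geometric.data.Batch object and computes the requested node features.
- Parameters:
x (Union[Batch, Data, Protein, ProteinBatch]) – Data or batch object to compute node features for
node_features (Union[ListConfig, List[ScalarNodeFeature]]) – List of scalar node features to compute
- Returns:
Tensor of node features of shape (N x F), where N is the number of nodes and F is the number of features.
- Return type:
torch.Tensor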
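The factory pattern described above, in which each requested feature is computed and the results are joined into one (N x F) matrix, can be sketched as follows. The two toy features and the names `FEATURES` and `compute_features_sketch` are hypothetical stand-ins, not the library's feature functions.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]

# Toy per-node features standing in for the real ones (names are made up).
FEATURES: Dict[str, Callable[[int], Matrix]] = {
    # One column: the node's position in the sequence.
    "toy_position": lambda n: [[float(i)] for i in range(n)],
    # Two columns: indicator for even and for odd positions.
    "toy_parity": lambda n: [[float(i % 2 == 0), float(i % 2 == 1)] for i in range(n)],
}


def compute_features_sketch(num_nodes: int, names: List[str]) -> Matrix:
    """Compute each requested feature and concatenate along the feature axis."""
    blocks = []
    for name in names:
        if name not in FEATURES:
            raise ValueError(f"Unknown node feature: {name!r}")
        blocks.append(FEATURES[name](num_nodes))
    # Row i is the concatenation of row i from every feature block.
    return [sum((block[i] for block in blocks), []) for i in range(num_nodes)]


out = compute_features_sketch(3, ["toy_position", "toy_parity"])
print(out)  # [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
```

Here N = 3 and F = 3 (one column from the first feature, two from the second).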
- proteinworkshop.features.node_features.compute_surface_feat(coords: CoordTensor | AtomTensor, k: int, sigma: List[float])[source]#
coords: coordinates of shape (N, 3). k: number of neighbors to consider in the KNN graph.
- proteinworkshop.features.node_features.compute_vector_node_features(x: Batch | Data | Protein | ProteinBatch, vector_features: ListConfig | List[str]) -> Batch | Data | Protein | ProteinBatch [source]#
Factory function for vector features.
Currently implemented vector features are:
orientation: Orientation of each node in the protein backbone
virtual_cb_vector: Virtual CB vector for each node in the protein backbone
Sequence features for protein data objects.
- proteinworkshop.features.sequence_features.amino_acid_one_hot(x: Batch | Data, num_classes: int = 23) -> Tensor [source]#
Returns one-hot encoding of amino acid sequence.
- Parameters:
x (Union[Batch, Data]) – Protein data object containing a residue_type attribute.
num_classes (int, optional) – Number of classes to encode, defaults to 23
- Returns:
One-hot encoding of amino acid sequence
- Return type:
torch.Tensor
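One-hot encoding itself is easy to illustrate without tensors. The sketch below assumes that residue_type holds integer class indices in the range [0, num_classes); `one_hot_sketch` is a hypothetical helper.

```python
from typing import List


def one_hot_sketch(residue_type: List[int], num_classes: int = 23) -> List[List[int]]:
    """Return one row per residue with a single 1 at the residue's class index."""
    rows = []
    for r in residue_type:
        if not 0 <= r < num_classes:
            raise ValueError(f"residue type {r} outside [0, {num_classes})")
        row = [0] * num_classes
        row[r] = 1
        rows.append(row)
    return rows


encoded = one_hot_sketch([0, 5, 22])
print(len(encoded), len(encoded[0]))  # 3 23
print(encoded[1].index(1))            # 5
```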
Edge Features#
Utilities for computing edge features.
- proteinworkshop.features.edge_features.EDGE_FEATURES: List[str] = ['edge_distance', 'node_features', 'edge_type', 'sequence_distance']#
List of edge features that can be computed.
- proteinworkshop.features.edge_features.compute_edge_distance(pos: CoordTensor, edge_index: EdgeTensor) -> Tensor [source]#
Compute the Euclidean distance between each pair of nodes connected by an edge.
- Parameters:
pos (CoordTensor) – Tensor of shape \((|V|, 3)\) containing the node coordinates.
edge_index (EdgeTensor) – Tensor of shape \((2, |E|)\) containing the indices of the nodes forming the edges.
- Returns:
Tensor of shape \((|E|, 1)\) containing the Euclidean distance between each pair of nodes connected by an edge.
- Return type:
torch.Tensor
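The computation described above can be sketched in pure Python, with nested lists in place of tensors. `edge_distance_sketch` is a hypothetical name; the output keeps the documented (|E|, 1) layout by wrapping each distance in its own row.

```python
import math
from typing import List, Sequence


def edge_distance_sketch(
    pos: Sequence[Sequence[float]], edge_index: List[List[int]]
) -> List[List[float]]:
    """Euclidean distance between the two endpoints of every edge."""
    src, dst = edge_index
    return [[math.dist(pos[i], pos[j])] for i, j in zip(src, dst)]


pos = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]  # shape (|V|, 3)
edge_index = [[0, 1, 0], [1, 2, 2]]                          # shape (2, |E|)
distances = edge_distance_sketch(pos, edge_index)
print(distances)  # one row per edge; the distances are 5, 12 and 13
```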
Representation#
- proteinworkshop.features.representation.ca_to_bb_repr(batch: Batch) -> Batch [source]#
Converts a batch of CA representations to backbone representations. I.e. 1 node per residue -> 4 nodes per residue (N, CA, C, O)
This function tiles any existing node features on the CA atoms over the additional nodes in the backbone representation.
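The tiling of per-residue features over the four backbone nodes can be sketched as below. It assumes the expanded nodes are ordered residue by residue (all four atoms of residue 0, then residue 1, and so on), which the source does not state; `tile_over_backbone` is a hypothetical helper.

```python
from typing import List


def tile_over_backbone(
    node_features: List[List[float]], atoms_per_residue: int = 4
) -> List[List[float]]:
    """Repeat each residue's feature row once per backbone atom."""
    return [list(f) for f in node_features for _ in range(atoms_per_residue)]


feats = [[1.0, 0.0], [0.0, 1.0]]   # two residues, two features each
tiled = tile_over_backbone(feats)
print(len(tiled))                  # 8
print(tiled[3], tiled[4])          # [1.0, 0.0] [0.0, 1.0]
```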
- proteinworkshop.features.representation.ca_to_bb_sc_repr(batch: Batch) -> Batch [source]#
Converts a batch of CA representations to backbone + sidechain representations.
- proteinworkshop.features.representation.ca_to_ca_sc_repr(batch: Batch) -> Batch [source]#
Converts a batch of CA representations to CA + sidechain representations.
- proteinworkshop.features.representation.ca_to_fa_repr(batch: Batch) -> Batch [source]#
Converts a batch of CA representations to full atom representations.
- proteinworkshop.features.representation.coarsen_sidechain(x: Data, aggr: str = 'mean') -> CoordTensor [source]#
Returns tensor of sidechain centroids: L x 3
- proteinworkshop.features.representation.get_full_atom_coords(atom_tensor: AtomTensor, fill_value: float = 1e-05) -> Tuple[CoordTensor, Tensor, Tensor] [source]#
Converts an AtomTensor to a full atom representation (e.g. dense to sparse).
- Parameters:
atom_tensor (AtomTensor) – AtomTensor of shape (N_residues x 37 x 3)
fill_value (float, optional) – Value indicating missing atoms, defaults to 1e-5
- Returns:
Tuple of coords (N_atoms x 3), residue_index (N_atoms), atom_type (N_atoms, with values in [0-36])
- Return type:
Tuple[CoordTensor, torch.Tensor, torch.Tensor]
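The dense-to-sparse conversion described above can be sketched with nested lists. This sketch assumes an atom counts as missing when all three of its coordinates equal fill_value; the library may use a different test. `full_atom_coords_sketch` is a hypothetical name.

```python
from typing import List, Tuple

NUM_ATOM_TYPES = 37  # per-residue atom slots, as in the documented shape


def full_atom_coords_sketch(
    atom_tensor: List[List[List[float]]], fill_value: float = 1e-5
) -> Tuple[List[List[float]], List[int], List[int]]:
    """Flatten (N_residues x 37 x 3) into per-atom lists, dropping missing atoms."""
    coords, residue_index, atom_type = [], [], []
    for res_idx, residue in enumerate(atom_tensor):
        for atom_idx, xyz in enumerate(residue):
            # Assumption: a slot filled entirely with fill_value is a missing atom.
            if all(c == fill_value for c in xyz):
                continue
            coords.append(list(xyz))
            residue_index.append(res_idx)
            atom_type.append(atom_idx)
    return coords, residue_index, atom_type


def empty_residue() -> List[List[float]]:
    return [[1e-5, 1e-5, 1e-5] for _ in range(NUM_ATOM_TYPES)]


res0, res1 = empty_residue(), empty_residue()
res0[0] = [1.0, 2.0, 3.0]
res0[1] = [2.0, 3.0, 4.0]
res1[1] = [5.0, 6.0, 7.0]
coords, residue_index, atom_type = full_atom_coords_sketch([res0, res1])
print(residue_index)  # [0, 0, 1]
print(atom_type)      # [0, 1, 1]
```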
- proteinworkshop.features.representation.transform_representation(x: Batch, representation_type: Literal['CA', 'BB', 'FA', 'BB_SC', 'CA_SC']) -> Batch [source]#
Factory method to transform a batch into a specified representation.
The AtomTensor (i.e. batch.coords with shape \(|V| \times 37 \times 3\)) is manipulated to produce the corresponding number of nodes according to the desired node representation.
CA simply selects the \(C_\alpha\) atoms as nodes (i.e. batch.coords[:, 1, :]).
BB selects and unravels the four backbone atoms (\(N, C_\alpha, C, O\)) as nodes. Existing node features are tiled over the backbone atom nodes on a per-residue basis.
FA unravels the entire AtomTensor to result in a full-atom graph, i.e. each atom in the structure becomes a node in the graph. Existing node features are tiled over the atom nodes on a per-residue basis.
- Parameters:
x (Batch) – A minibatch of data
representation_type (Literal["CA", "BB", "FA", "BB_SC", "CA_SC"]) – Representation to transform the batch into
- Raises:
ExperimentConfigurationError
- Returns:
The batch transformed into the specified representation
- Return type:
Batch
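The CA and BB selections described above can be sketched on nested lists. Only the CA index (1) comes from the text; the backbone indices used in the example, (0, 1, 2, 4), are an assumption about the atom ordering and should be checked against the library. `select_atoms` and `transform_sketch` are hypothetical names, and a plain ValueError stands in for the library's own exception.

```python
from typing import List, Optional, Sequence

Coords = List[List[List[float]]]  # (|V| x 37 x 3)


def select_atoms(coords: Coords, atom_indices: Sequence[int]) -> List[List[float]]:
    """Unravel the chosen atoms of every residue into one flat list of nodes."""
    return [residue[a] for residue in coords for a in atom_indices]


def transform_sketch(
    coords: Coords,
    representation_type: str,
    backbone_indices: Optional[Sequence[int]] = None,
) -> List[List[float]]:
    """Sketch of the CA and BB cases only; other types are not covered here."""
    if representation_type == "CA":
        return select_atoms(coords, [1])  # index 1, as in batch.coords[:, 1, :]
    if representation_type == "BB":
        if backbone_indices is None:
            raise ValueError("backbone_indices must be supplied for BB")
        return select_atoms(coords, backbone_indices)
    raise ValueError(f"Representation {representation_type!r} not sketched here")


# Two residues; atom a of residue r gets the dummy coordinate [r, a, 0].
coords = [[[float(r), float(a), 0.0] for a in range(37)] for r in range(2)]
print(transform_sketch(coords, "CA"))  # [[0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
bb = transform_sketch(coords, "BB", backbone_indices=(0, 1, 2, 4))  # assumed indices
print(len(bb))                         # 8
```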