ProteinMPNN
Generates protein sequences conditioned on an input protein backbone structure using the ProteinMPNN family of models. Supports vanilla (full-backbone), CA-only, and soluble-protein variants, with configurable sampling temperature, amino-acid omissions, chain targeting, and backbone noise. The node runs in one of three modes: MOCK (precomputed outputs), PROD (service-backed), or TEST (fast, minimal settings).
Usage
Use this node to design sequences for one or more input PDB backbones. Choose the appropriate ProteinMPNN variant (vanilla, ca_only, or soluble) and set sampling temperatures to control diversity. Provide optional constraints such as omitting specific amino acids or restricting design to certain chains. In larger workflows, feed a PDB dictionary into this node and route the resulting FASTA sequences to downstream evaluation or structure prediction.
Inputs
| Field | Required | Type | Description | Example |
|---|---|---|---|---|
| pdb | True | PDB | Dictionary of PDB structures to design against. Keys are structure identifiers (e.g., filenames or IDs), values are PDB text content. | {"targetA": "ATOM ...\nEND\n", "targetB": "ATOM ...\nEND\n"} |
| model_name | True | STRING | ProteinMPNN model variant to use. Vanilla models use the full backbone (N, CA, C, O atoms); ca_only models expect CA-only structures; soluble models are trained on soluble proteins. | vanilla_v_48_020 |
| backbone_noise | True | FLOAT | Standard deviation of Gaussian noise added to backbone atoms to encourage robustness and diversity. | 0.1 |
| num_seq_per_target | True | INT | Number of sequences to generate per input structure. | 10 |
| max_length | True | INT | Maximum sequence length allowed for generated designs. | 200000 |
| sampling_temp | True | STRING | Comma-separated temperatures controlling sampling diversity for amino acids. Higher values yield more diverse sequences. | 0.1,0.15,0.2 |
| omit_amino_acids | True | STRING | Comma-separated list of one-letter amino acid codes to exclude from generation. | C,M,W |
| chain_list | True | STRING | Comma-separated list of chain IDs to design. If empty, all chains are designed. | A,B |
| seed | True | INT | Base random seed for reproducibility across sampling. | 42 |
| mode | True | STRING | Execution mode. MOCK: returns predefined mock outputs. PROD: runs the live service. TEST: runs quickly with minimal settings. | PROD |
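As a rough sketch of how these fields fit together, the snippet below builds the pdb dictionary from files on disk and assembles a complete set of inputs. The helper names `build_pdb_dict` and `make_inputs` are hypothetical and not part of the node's API; the field names and default values are taken from the Example column above, and using the file stem as the structure identifier is an assumption.

```python
from pathlib import Path
from typing import Dict, Iterable


def build_pdb_dict(paths: Iterable[str]) -> Dict[str, str]:
    """Map each file's stem (an assumed choice of identifier) to its PDB text."""
    return {Path(p).stem: Path(p).read_text() for p in paths}


def make_inputs(pdb: Dict[str, str], **overrides) -> dict:
    """Assemble the input fields; defaults mirror the examples in the table."""
    inputs = {
        "pdb": pdb,
        "model_name": "vanilla_v_48_020",
        "backbone_noise": 0.1,
        "num_seq_per_target": 10,
        "max_length": 200000,
        "sampling_temp": "0.1,0.15,0.2",
        "omit_amino_acids": "C,M,W",
        "chain_list": "A,B",
        "seed": 42,
        "mode": "PROD",
    }
    # Reject misspelled field names instead of silently ignoring them.
    unknown = set(overrides) - set(inputs)
    if unknown:
        raise KeyError(f"unknown input field(s): {sorted(unknown)}")
    inputs.update(overrides)
    return inputs
```

For example, `make_inputs({"targetA": "ATOM ...\nEND\n"}, mode="TEST", chain_list="")` would describe a quick TEST run that designs all chains of a single structure.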
Outputs
| Field | Type | Description | Example |
|---|---|---|---|
| seqs.fasta | FASTA | Generated protein sequences in FASTA format for all input structures and samples. | >targetA_design_0\nMKT...\n>targetA_design_1\nGAS...\n>targetB_design_0\nVLK... |
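When routing seqs.fasta to downstream steps, it can help to split the combined FASTA back into per-target groups. The sketch below assumes headers follow the `<target>_design_<n>` pattern shown in the example above; actual headers may differ, in which case the whole header is used as the group key. The function names are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs, joining multi-line sequences."""
    records: List[Tuple[str, str]] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def group_by_target(records: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group sequences by target, assuming '<target>_design_<n>' headers."""
    grouped = defaultdict(list)
    for header, seq in records:
        target, sep, _ = header.rpartition("_design_")
        # Fall back to the full header when the assumed pattern is absent.
        grouped[target if sep else header].append(seq)
    return dict(grouped)
```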
Important Notes
- Model selection: The model_name must be one of the supported variants (e.g., vanilla_v_48_002/010/020/030, ca_only_v_48_002/010/020, soluble_v_48_002/010/020/030). CA-only models require CA-only structures.
- Sampling temperatures: sampling_temp must not be empty. Provide a comma-separated list (e.g., 0.1,0.15,0.2).
- Chain targeting: Leave chain_list empty to design all chains; otherwise, specify chains as a comma-separated list (e.g., A,B).
- Omitted residues: omit_amino_acids uses one-letter codes; listed residues will not be sampled.
- TEST mode behavior: In TEST mode, num_seq_per_target is forced to 1 to speed up runs.
- Timeout scaling: Service timeout scales with the number of input structures; large pdb sets will take longer.
- Backbone noise: Increasing backbone_noise can improve robustness, but the resulting sequences may fit the original backbone less closely.
- Sequence length: Designs longer than max_length are not allowed.
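Several of the notes above can be checked before a run is submitted. The following is a minimal sketch of such a pre-flight check, not the node's own validation logic. It assumes the variant names listed above are the complete set (the notes introduce them with "e.g."), that omitted residues are limited to the 20 standard one-letter codes, and that inputs arrive as a plain dictionary keyed by field name. `check_inputs` and `split_csv` are hypothetical helpers.

```python
from typing import List

# Variant names as listed in the notes; assumed here to be exhaustive.
_FAMILIES = {
    "vanilla": ("002", "010", "020", "030"),
    "ca_only": ("002", "010", "020"),
    "soluble": ("002", "010", "020", "030"),
}
SUPPORTED_MODELS = {
    f"{family}_v_48_{level}" for family, levels in _FAMILIES.items() for level in levels
}
# Assumption: only the 20 standard one-letter codes are accepted.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def split_csv(value: str) -> List[str]:
    """Split a comma-separated string, dropping blanks and surrounding spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


def check_inputs(inputs: dict) -> dict:
    """Return a checked copy of the inputs; raise ValueError on the first problem."""
    checked = dict(inputs)
    if checked["model_name"] not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model_name: {checked['model_name']!r}")
    temps = split_csv(checked["sampling_temp"])
    if not temps:
        raise ValueError("sampling_temp must not be empty")
    for temp in temps:
        float(temp)  # raises ValueError if a temperature is not numeric
    bad = [aa for aa in split_csv(checked["omit_amino_acids"]) if aa not in AMINO_ACIDS]
    if bad:
        raise ValueError(f"unknown one-letter codes in omit_amino_acids: {bad}")
    if checked["mode"] == "TEST":
        # Mirror the documented TEST behavior: one sequence per target.
        checked["num_seq_per_target"] = 1
    return checked
```

Returning a copy rather than mutating the caller's dictionary keeps the original settings available when switching from TEST back to PROD.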
Troubleshooting
- Invalid model name: If you see an error about model_name, ensure it matches one of the provided options and formatting (e.g., vanilla_v_48_020).
- Empty sampling_temp: Provide at least one temperature (e.g., 0.1). Separate multiple values with commas.
- Incorrect chain IDs: If designs target the wrong chains or the run fails, verify that chain_list matches the chains present in the PDBs. Leave chain_list empty to design all chains.
- CA-only mismatch: Using a ca_only model with an all-atom PDB (or vice versa) can cause failures. Pick the model that matches your structure type.
- Service timeout or no output: For large pdb sets or a high num_seq_per_target, allow more time for the run or reduce the number of inputs. Try TEST mode to validate the configuration quickly.
- Unexpected amino acids: If disallowed residues appear, confirm omit_amino_acids uses correct one-letter codes separated by commas.
- Fewer sequences than expected: In TEST mode, output is intentionally limited to 1 per target. In other modes, check max_length and constraints that may limit sampling.
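Two of the issues above, incorrect chain IDs and a CA-only mismatch, can often be diagnosed by inspecting the PDB text directly. The sketch below relies on the fixed-column PDB format, where the atom name occupies columns 13-16 and the chain ID is in column 22 of each ATOM record. The function names are hypothetical, and the check only looks at ATOM records, so it is a heuristic rather than a full structure validation.

```python
from typing import List, Set


def atom_lines(pdb_text: str) -> List[str]:
    """Return the ATOM records of a PDB file's text."""
    return [line for line in pdb_text.splitlines() if line.startswith("ATOM")]


def chains_in_pdb(pdb_text: str) -> Set[str]:
    """Collect non-blank chain IDs (column 22) from ATOM records."""
    return {
        line[21]
        for line in atom_lines(pdb_text)
        if len(line) > 21 and line[21].strip()
    }


def is_ca_only(pdb_text: str) -> bool:
    """True when every ATOM record is an alpha carbon (atom name 'CA')."""
    names = {line[12:16].strip() for line in atom_lines(pdb_text) if len(line) >= 16}
    return names == {"CA"}


def missing_chains(pdb_text: str, chain_list: str) -> List[str]:
    """List requested chain IDs that do not occur in the structure."""
    wanted = [c.strip() for c in chain_list.split(",") if c.strip()]
    present = chains_in_pdb(pdb_text)
    return [c for c in wanted if c not in present]
```

If `missing_chains` returns a non-empty list, chain_list names chains that the structure does not contain. If `is_ca_only` disagrees with the chosen model family, switching to the matching variant is the likely fix.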