Boltz Protein Sequence

Creates a protein sequence object formatted for Boltz YAML workflows. The node parses the provided protein FASTA, extracts the sequence, and uses the header (if present) as the sequence name. It then attaches the optional MSA, residue modifications, cyclic flag, and any additional chain IDs. The output is a ready-to-combine sequence payload for downstream Boltz YAML assembly and prediction.

Usage

Use this node when you need to define a protein sequence for Boltz-based structure prediction or affinity workflows. Typically, you will: 1) provide a protein FASTA (with or without a header), 2) optionally attach an MSA and modifications, 3) configure chain IDs (and duplicates if needed), and 4) feed the resulting output into Boltz List Combiner -> Boltz YAML Combiner -> Boltz Predict.

Inputs

Field	Required	Type	Description	Example
chain_id	True	STRING	Primary chain identifier for this protein (single letter or string).	A
sequence	True	FASTA	Protein sequence in FASTA format. If a header is present (e.g., '>my_protein'), it will be used as the internal sequence name; only the sequence lines are kept.	>my_protein_A MKTLLILAVAAALAAGASA... (AA sequence lines)
msa	False	MSA	Optional multiple sequence alignment content in A3M or CSV format. Leave empty to indicate an 'empty' MSA.	>seq1 MKTLLI---AVAAALA >seq2 MKTLFIVAAAVAAALA
modifications	False	STRING	Optional residue-level modifications, one per line, using 'position:ccd_code' format. Positions are 1-based indices.	5:CSO 12:MSE
cyclic	False	BOOLEAN	Set true if the protein is cyclic.	false
multiple_chains	False	STRING	Comma-separated list of additional chain IDs to create identical copies of this protein (e.g., for homomers).	B,C
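
The two free-text inputs, modifications and multiple_chains, can be pictured with the sketch below. It is an illustration only: the helper names `parse_modifications` and `parse_chain_ids` are hypothetical, and the node's real parsing code may differ in details such as which lines it rejects.

```python
import warnings


def parse_modifications(text):
    """Parse 'position:ccd_code' lines into dicts (hypothetical helper)."""
    mods = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        position, sep, ccd = line.partition(":")
        try:
            pos = int(position.strip())
        except ValueError:
            pos = None
        # Assumption: positions are 1-based, so anything below 1 is rejected.
        if not sep or not ccd.strip() or pos is None or pos < 1:
            warnings.warn(f"Ignoring invalid modification line: {line!r}")
            continue
        mods.append({"position": pos, "ccd": ccd.strip()})
    return mods


def parse_chain_ids(chain_id, multiple_chains=""):
    """Merge the primary chain ID with comma-separated extras (hypothetical helper)."""
    extras = [c.strip() for c in multiple_chains.split(",") if c.strip()]
    ids = [chain_id.strip()] + extras
    # A single chain stays a plain string; several chains become a list.
    return ids if len(ids) > 1 else ids[0]


mods = parse_modifications("5:CSO\n12:MSE")
chains = parse_chain_ids("A", "B,C")
```

With the table's example values, `mods` holds two position/ccd entries and `chains` is the list of chain IDs A, B, and C.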

Outputs

Field	Type	Description	Example
protein_sequence	*	A protein sequence object (as a list with one item) containing fields like id (single or list of chain IDs), sequence, optional _sequence_name, optional msa and _msa_format, optional modifications, and cyclic.	[{'protein': {'id': ['A', 'B'], 'sequence': 'MKTLLILAVAAALAAGASA...', '_sequence_name': 'my_protein_A', 'msa': 'empty', '_msa_format': 'a3m', 'modifications': [{'position': 5, 'ccd': 'CSO'}], 'cyclic': False}}]
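
To make the output shape concrete, here is a small sketch of how downstream code might read it. The helpers `chain_ids` and `modified_residues` are hypothetical, and the sequence is only the visible prefix of the truncated example above, used as a stand-in.

```python
def chain_ids(payload):
    """Return every chain ID in a protein_sequence payload as a flat list."""
    ids = []
    for entry in payload:
        raw = entry["protein"]["id"]
        # 'id' may be a single string or a list of strings.
        ids.extend(raw if isinstance(raw, list) else [raw])
    return ids


def modified_residues(payload):
    """Return (position, residue, ccd) tuples, checking 1-based positions."""
    rows = []
    for entry in payload:
        protein = entry["protein"]
        seq = protein["sequence"]
        for mod in protein.get("modifications", []):
            pos = mod["position"]
            if not 1 <= pos <= len(seq):
                raise ValueError(f"Position {pos} is outside the sequence")
            rows.append((pos, seq[pos - 1], mod["ccd"]))
    return rows


# Illustrative payload modelled on the example in the table above.
example = [{
    "protein": {
        "id": ["A", "B"],
        "sequence": "MKTLLILAVAAALAAGASA",  # stand-in prefix, not a full protein
        "_sequence_name": "my_protein_A",
        "msa": "empty",
        "modifications": [{"position": 5, "ccd": "CSO"}],
        "cyclic": False,
    }
}]

ids = chain_ids(example)
mods = modified_residues(example)
```

Normalizing `id` to a list first keeps single-chain and multi-chain payloads on the same code path.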

Important Notes

Sequence header handling: If the FASTA header is present, it will be used as the internal sequence name; otherwise a default name is assigned.
MSA handling: The node determines and stores the MSA format (e.g., 'a3m' or 'csv'); leave MSA empty to indicate an 'empty' MSA.
Multiple chains: Provide comma-separated chain IDs to create identical copies of this protein (e.g., homomers).
Modifications format: Use 'position:ccd_code' per line (e.g., '5:CSO'); invalid lines are ignored with warnings.
Cyclic flag: Set 'cyclic' to true if the protein is a cyclic polymer.
Validation: The protein sequence is required and cannot be empty; FASTA content is parsed to extract only sequence lines.
Combining later: When assembling the final YAML, ensure all chain IDs across all sequences are unique to avoid validation errors in downstream nodes.
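
The header and validation rules above can be sketched as a short parser. This is a minimal illustration, not the node's source: `parse_fasta` is a hypothetical name, and the default name `"protein"` is an assumption, since the actual default is not documented here.

```python
def parse_fasta(text, default_name="protein"):
    """Split FASTA text into (name, sequence); hypothetical helper."""
    name = None
    seq_lines = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Only the first header, seen before any sequence line, names the entry.
            if name is None and not seq_lines:
                name = line[1:].strip() or None
            continue
        seq_lines.append(line)
    # Only sequence lines are kept; internal spaces are dropped.
    sequence = "".join(seq_lines).replace(" ", "")
    if not sequence:
        raise ValueError("Protein sequence is required")
    return (name or default_name), sequence


named = parse_fasta(">my_protein_A\nMKTLL\nILAV\n")
unnamed = parse_fasta("MKTLLILAV")
```

Both calls yield the same sequence; only the name differs, coming from the header in the first case and from the assumed default in the second.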

Troubleshooting

Protein sequence is required: Ensure 'sequence' is not empty and contains valid FASTA. The header line is optional, so plain sequence text can be supplied as headerless FASTA.
Empty or whitespace sequence: Remove extra whitespace and verify the FASTA has sequence lines after any header.
Invalid modifications: Confirm each line follows 'position:ccd_code' and positions are integers (1-based).
Unexpected MSA errors: Provide the MSA as A3M or CSV content. If no alignment is available, leave 'msa' empty so that an 'empty' MSA is used.
Duplicate chain IDs in later steps: If YAML combination fails due to duplicate IDs, adjust 'chain_id' or 'multiple_chains' here so all chains are unique in the final workflow.
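
A quick way to spot clashing chain IDs before combining is sketched below. It assumes each payload follows the list-of-one-entry shape shown under Outputs and that every entry maps an entity type to a dict with an 'id' field; `find_duplicate_chain_ids` is a hypothetical helper, not part of the node.

```python
from collections import Counter


def find_duplicate_chain_ids(payloads):
    """Return chain IDs that occur more than once across payloads."""
    counts = Counter()
    for payload in payloads:
        for entry in payload:
            # Each entry is keyed by entity type, e.g. {'protein': {...}}.
            for molecule in entry.values():
                raw = molecule["id"]
                counts.update(raw if isinstance(raw, list) else [raw])
    return sorted(cid for cid, n in counts.items() if n > 1)


first = [{"protein": {"id": ["A", "B"], "sequence": "MKTLLILAV"}}]
second = [{"protein": {"id": "B", "sequence": "GSSGSSG"}}]
clashes = find_duplicate_chain_ids([first, second])
```

Here `clashes` reports B, so either 'chain_id' on the second node or 'multiple_chains' on the first would need to change.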