PDB To Fasta¶

Converts a PDB structure (passed as text) into a FASTA-formatted amino acid sequence. You can extract a single chain or all chains, and optionally control the FASTA header. Selenocysteine (SEC) and pyrrolysine (PYL) map to 'U' and 'O'; any other unrecognized residue is written as 'X'.

Usage¶

Use this node when you have a protein structure in PDB format and need the corresponding amino acid sequence in FASTA format for downstream analysis or modeling. Typical workflow: load or generate a PDB, pass its text content here, optionally specify a chain (or leave empty to extract all), and feed the resulting FASTA to alignment, design, or prediction nodes.

Inputs¶

Field	Required	Type	Description	Example
pdb_string	True	STRING	Full PDB file content as a string. The node parses ATOM records to infer the sequence.	ATOM 1 N MET A 1 ...
chain_id	True	STRING	Chain identifier to extract (e.g., 'A'). Leave empty to extract all chains.	A
include_header	True	BOOLEAN	If true, prepend a FASTA header line for each chain.	true
header_prefix	True	STRING	Prefix used when constructing automatic headers if no custom header is provided.	Chain_
custom_header	False	STRING	Optional custom FASTA header text. If set, it replaces the automatic header for each chain.	My_Protein

Outputs¶

Field	Type	Description	Example
fasta	STRING	FASTA-formatted sequence(s) extracted from the PDB. Multiple chains are output sequentially, each optionally with its own header, and sequences are wrapped at 60 characters per line.	>Chain_A_Protein MSTNPKP...
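
The header and line-wrapping behavior of the output can be sketched in Python. This is only an illustration under stated assumptions, not the node's actual code: `format_fasta` and its `sequences` argument are hypothetical names, and the sketch assumes the node adds the leading '>' to a custom header itself.

```python
def format_fasta(sequences, include_header=True, header_prefix="Chain_",
                 custom_header="", width=60):
    """Format {chain_id: sequence} as FASTA text.

    Hypothetical helper mirroring the documented inputs; not the node's code.
    """
    lines = []
    for chain, seq in sequences.items():
        if include_header:
            # Assumption: a non-empty custom header replaces the automatic
            # header for every chain, and '>' is prepended in both cases.
            header = custom_header if custom_header else f"{header_prefix}{chain}_Protein"
            lines.append(f">{header}")
        # Wrap the sequence at `width` characters per line (60 by default).
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines)
```

With the default prefix, a 70-residue chain 'A' would produce the header '>Chain_A_Protein' followed by one 60-character line and one 10-character line.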

Important Notes¶

The node parses residues from PDB ATOM records and maps 3-letter amino acid codes to 1-letter codes.
Supported mappings include standard residues (e.g., ALA->A, GLY->G) plus SEC->U, PYL->O, and unknowns as X.
Each residue spans several ATOM records, so the node tracks residue numbers and emits exactly one letter per residue.
If chain_id is empty, sequences for all chains found are included, each as a separate FASTA entry.
If no valid protein residues are found, the node returns the message 'No valid protein sequences found.'
Headers: when include_header is true, a non-empty custom_header overrides the automatic header; otherwise headers use the pattern '>[header_prefix][chain]_Protein' (e.g., '>Chain_A_Protein' with the prefix 'Chain_').
Line wrapping is applied at 60 characters per sequence line.
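
The parsing rules above can be sketched roughly as follows. This is a minimal illustration, not the node's implementation: `extract_sequences` is a hypothetical name, the sketch assumes the standard fixed-column PDB layout (residue name in columns 18-20, chain in column 22, residue number in columns 23-26), and it ignores insertion codes.

```python
# 3-letter to 1-letter mapping: the 20 standard residues plus SEC and PYL.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
    "SEC": "U", "PYL": "O",
}


def extract_sequences(pdb_string, chain_id=""):
    """Return {chain: sequence} from ATOM records (hypothetical sketch)."""
    sequences = {}  # chain -> list of one-letter codes, in file order
    seen = set()    # (chain, residue number) pairs already emitted
    for line in pdb_string.splitlines():
        # Only ATOM records long enough to hold the residue number are used.
        if not line.startswith("ATOM") or len(line) < 26:
            continue
        res_name = line[17:20].strip()
        chain = line[21]
        res_num = line[22:26].strip()
        # An empty chain_id means "extract all chains".
        if chain_id and chain != chain_id:
            continue
        # Emit one letter per residue, not one per atom.
        if (chain, res_num) in seen:
            continue
        seen.add((chain, res_num))
        # Unrecognized residue names fall back to 'X'.
        sequences.setdefault(chain, []).append(THREE_TO_ONE.get(res_name, "X"))
    return {chain: "".join(letters) for chain, letters in sequences.items()}
```

An empty result from a function like this corresponds to the case where the node reports that no valid sequences were found.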

Troubleshooting¶

If the output is 'No valid protein sequences found.': ensure pdb_string contains valid ATOM records with standard residue names, and that the selected chain exists.
If the wrong chain is extracted: verify the chain_id matches the PDB's chain column (e.g., 'A', 'B').
If headers are missing: set include_header to true. If the header does not follow the expected automatic format, clear custom_header, which overrides it.
If multiple chains are needed but only one appears: leave chain_id empty to include all chains.
If unexpected 'X' characters appear: the PDB likely contains unknown or nonstandard residues at those positions.
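
When diagnosing a wrong chain or unexpected 'X' characters, it can help to inspect the PDB text before passing it to the node. The following is a small diagnostic sketch, not part of the node: `summarize_pdb` is a hypothetical helper, and it assumes the standard fixed-column PDB layout for ATOM records.

```python
# Residue names the node is documented to map (20 standard plus SEC and PYL).
KNOWN = set(
    "ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET "
    "PHE PRO SER THR TRP TYR VAL SEC PYL".split()
)


def summarize_pdb(pdb_string):
    """Return (residue count per chain, residue names that would become 'X')."""
    counts = {}       # chain -> number of distinct residues seen
    unmapped = set()  # residue names outside KNOWN
    seen = set()      # (chain, residue number) pairs already counted
    for line in pdb_string.splitlines():
        if not line.startswith("ATOM") or len(line) < 26:
            continue
        res_name = line[17:20].strip()
        chain = line[21]
        res_num = line[22:26].strip()
        if (chain, res_num) in seen:
            continue
        seen.add((chain, res_num))
        counts[chain] = counts.get(chain, 0) + 1
        if res_name not in KNOWN:
            unmapped.add(res_name)
    return counts, sorted(unmapped)
```

The keys of the first result show which chain_id values actually occur in the file, and the second result lists the residue names likely responsible for any 'X' in the output.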