
Pdb To Fasta Node

Converts a protein structure provided as PDB text into FASTA sequence(s). It parses ATOM records, maps 3-letter amino acid codes to 1-letter codes, supports selecting a specific chain or extracting all chains, and optionally adds configurable FASTA headers.

Usage

Use this node when you have PDB content and need the corresponding amino acid sequence(s) in FASTA format for downstream tasks like alignment, design, or structure prediction. Provide the PDB string, optionally target a specific chain, and configure header preferences; the node returns a FASTA-formatted string with line wrapping.

Inputs

| Field | Required | Type | Description | Example |
| --- | --- | --- | --- | --- |
| pdb_string | True | STRING | Full PDB file content as a string. The node extracts sequences from ATOM records. | ATOM 1 N MET A 1 ... (full PDB text) |
| chain_id | True | STRING | Chain identifier to extract (e.g., A). Leave empty to extract sequences for all chains found. | A |
| include_header | True | BOOLEAN | Whether to include a FASTA header line for each chain. | true |
| header_prefix | True | STRING | Prefix used to build the header when include_header is true and no custom header is provided. Format becomes >{prefix}{chain}_Protein. | Chain_ |
| custom_header | False | STRING | Optional custom header text. If provided and include_header is true, the header will be >{custom_header} instead of using the chain-based default. | MyProtein |

Outputs

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| fasta | STRING | FASTA-formatted sequence(s) extracted from the PDB. For multiple chains, sequences are concatenated one after another, each optionally with its own header. Lines are wrapped at 60 characters. | >Chain_A_Protein MSEQNNTEMTFQIQRIYTKDIS... (wrapped at 60 chars) |

Important Notes

  • Chain selection: If chain_id is empty, the node extracts sequences for all chains present; otherwise only the specified chain is returned.
  • Residue mapping: Standard 3-letter amino acid codes are mapped to 1-letter codes. Special cases include SEC -> U and PYL -> O; UNK and any other unmapped 3-letter code become X.
  • Parsing behavior: Only ATOM records are considered. Each residue is added once per residue number to avoid duplication across atoms.
  • Headers: If include_header is true and custom_header is provided, the header is >{custom_header}. If include_header is true and custom_header is empty, the header is composed as >{header_prefix}{chain}_Protein. If include_header is false, no header line is written.
  • Line wrapping: Output sequences are wrapped at 60 characters per line.
  • Empty results: If no valid residues are found for the specified chain(s), the output will indicate that no valid protein sequences were found.
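
The behavior described in these notes can be approximated with a short Python sketch. This is not the node's actual source: the function name pdb_to_fasta, the width parameter, the exact wording of the empty-result message, and the use of fixed PDB columns are all assumptions made for illustration.

```python
# Illustrative sketch only; the node's real implementation may differ.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
    "SEC": "U", "PYL": "O", "UNK": "X",
}


def pdb_to_fasta(pdb_string, chain_id="", include_header=True,
                 header_prefix="Chain_", custom_header="", width=60):
    """Hypothetical helper mirroring the node's documented inputs."""
    sequences = {}  # chain -> list of 1-letter codes, in order of appearance
    seen = set()    # (chain, residue number) pairs already added
    for line in pdb_string.splitlines():
        # Only ATOM records are considered; HETATM and others are skipped.
        if not line.startswith("ATOM") or len(line) < 26:
            continue
        res_name = line[17:20].strip()  # PDB columns 18-20
        chain = line[21]                # PDB column 22
        res_num = line[22:26]           # PDB columns 23-26
        if chain_id and chain != chain_id:
            continue
        if (chain, res_num) in seen:
            continue  # another atom of a residue we already recorded
        seen.add((chain, res_num))
        # Unmapped residue names fall back to X.
        sequences.setdefault(chain, []).append(THREE_TO_ONE.get(res_name, "X"))

    if not sequences:
        # Assumed wording; the node only promises to report this condition.
        return "No valid protein sequences found"

    out = []
    for chain, residues in sequences.items():
        if include_header:
            if custom_header:
                out.append(">" + custom_header)
            else:
                out.append(">" + header_prefix + chain + "_Protein")
        seq = "".join(residues)
        # Wrap the sequence at `width` characters per line.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out)
```

Calling pdb_to_fasta(pdb_text, chain_id="A") on the sketch returns only chain A, while an empty chain_id returns every chain in the order it first appears in the file.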

Troubleshooting

  • No valid protein sequences found: Confirm the PDB content includes ATOM lines for amino acid residues and that the chain_id exists. If targeting a chain, try leaving chain_id empty to include all chains.
  • Unexpected 'X' characters: These correspond to residues labeled UNK or unmapped 3-letter codes. Ensure your PDB uses standard residue names or adjust upstream processing.
  • Wrong chain extracted: Verify the exact chain identifier from the PDB (column 22). Chain IDs are case-sensitive.
  • Missing headers in output: Ensure include_header is true. If you expected a custom header, confirm custom_header is set and non-empty.
  • Formatting looks off: Line breaks are inserted every 60 characters by design. If you need a single line, post-process the output to remove line wraps.
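
Two of the checks above are easy to script outside the node. The following Python sketch uses hypothetical helper names (list_chain_ids and unwrap_fasta) and assumes fixed-column PDB text: the first lists the chain identifiers found in column 22 of ATOM records, and the second joins wrapped sequence lines back into a single line per record. Note that unwrap_fasta relies on header lines to separate records, so multi-chain output generated with include_header set to false would be merged into one sequence.

```python
def list_chain_ids(pdb_string):
    """Return chain IDs from ATOM records, in order of first appearance."""
    chains = []
    for line in pdb_string.splitlines():
        if line.startswith("ATOM") and len(line) > 21:
            chain = line[21]  # PDB column 22; case-sensitive
            if chain not in chains:
                chains.append(chain)
    return chains


def unwrap_fasta(fasta):
    """Join wrapped sequence lines so each record has one sequence line."""
    records = []
    parts = []  # sequence fragments of the record being read
    for line in fasta.splitlines():
        if line.startswith(">"):
            if parts:
                records.append("".join(parts))
                parts = []
            records.append(line)
        elif line.strip():
            parts.append(line.strip())
    if parts:
        records.append("".join(parts))
    return "\n".join(records)
```

Running list_chain_ids on the PDB text before setting chain_id shows exactly which identifiers are available, including their case.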
