Evaluate Diversity

Evaluates structural diversity across a set of protein structures and assigns cluster indices based on a TM-score threshold. It can run with configurable parallelism and supports modes for production, testing, and mock data. The result is a CSV summarizing which cluster each input structure belongs to.

Usage

Use this node after generating or loading multiple candidate protein structures to identify distinct structural groups. Typical workflows include screening diverse designs, deduplicating similar structures, or selecting representative candidates by cluster.

Inputs

| Field | Required | Type | Description | Example |
| --- | --- | --- | --- | --- |
| pdb | True | PDB | A collection of protein structures to evaluate. Provide a mapping of names to PDB content for one or more structures. | {"design_A": "ATOM ...\nEND\n", "design_B": "ATOM ...\nEND\n"} |
| max_ctm_threshold | True | FLOAT | Maximum TM-score threshold used to separate clusters. Lower values yield more clusters (stricter diversity), higher values yield fewer clusters. | 0.6 |
| num_processes | True | INT | Number of processes to use for parallel evaluation. Increase to speed up processing on larger inputs. | 8 |
| mode | True | CHOICE | Execution mode. MOCK returns predefined data for demonstration, PROD performs the actual evaluation, TEST uses minimal parameters for faster checks. | PROD |
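
The pdb input is described above as a mapping of names to PDB content. A minimal sketch of building such a mapping from a folder of .pdb files follows. The helper name `load_pdb_mapping` and the choice of file stems as structure names are assumptions for illustration, not part of the node.

```python
from pathlib import Path
from typing import Dict


def load_pdb_mapping(directory: str) -> Dict[str, str]:
    """Map each file stem (e.g. 'design_A') to the text of its .pdb file.

    Hypothetical helper: the node only needs the resulting name -> content
    mapping; how it is assembled is up to the caller.
    """
    mapping: Dict[str, str] = {}
    # Sort for a stable, reproducible ordering of structures.
    for path in sorted(Path(directory).glob("*.pdb")):
        mapping[path.stem] = path.read_text()
    return mapping
```

Files without a .pdb extension are skipped, and an empty folder yields an empty mapping.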

Outputs

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| score.csv | CSV | CSV table with clustering results per input structure (e.g., structure name and assigned cluster index). | name,cluster_index\ndesign_A,0\ndesign_B,1 |
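
To select representative candidates by cluster, the CSV can be grouped by cluster index. The sketch below assumes the column names `name` and `cluster_index` shown in the example above; the actual header may differ. The function names are hypothetical, and taking the first structure listed in each cluster is an arbitrary choice, not a quality ranking.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List


def group_by_cluster(csv_text: str) -> Dict[int, List[str]]:
    """Group structure names by cluster index (column names are assumed)."""
    clusters: Dict[int, List[str]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        clusters[int(row["cluster_index"])].append(row["name"])
    return dict(clusters)


def pick_representatives(clusters: Dict[int, List[str]]) -> Dict[int, str]:
    """Pick the first listed structure of each cluster as its representative."""
    return {index: names[0] for index, names in clusters.items()}
```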

Important Notes

  • Modes: MOCK returns canned results; TEST limits to a single structure and sets num_processes to 1 for quick validation; PROD performs the full evaluation.
  • Threshold behavior: max_ctm_threshold must be within 0.0–1.0 and directly influences how strictly structures are grouped.
  • Parallelism: num_processes controls concurrency; higher values can reduce runtime but increase resource usage.
  • Input size: Provide multiple structures to obtain meaningful clustering; a single structure will trivially form one cluster.
  • Output semantics: The CSV's cluster indices are labels for grouping; the specific numeric values are not ordered by quality.
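
The constraints in these notes can be checked before a run is submitted. The following is a minimal sketch: both function names are hypothetical, the lower bound of 1 on num_processes is an assumption, and `apply_test_mode` mirrors only the TEST adjustment described above rather than the node's internal code.

```python
from typing import List, Sequence, Tuple

VALID_MODES = ("MOCK", "PROD", "TEST")


def check_settings(max_ctm_threshold: float, num_processes: int, mode: str) -> None:
    """Raise ValueError if a setting violates the documented constraints."""
    if not 0.0 <= max_ctm_threshold <= 1.0:
        raise ValueError("max_ctm_threshold must be within 0.0-1.0")
    if num_processes < 1:  # assumption: at least one worker process
        raise ValueError("num_processes must be at least 1")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}")


def apply_test_mode(
    names: Sequence[str], num_processes: int, mode: str
) -> Tuple[List[str], int]:
    """Return the structures and process count a run would effectively use.

    Only the TEST behaviour is modelled: a single structure, one process.
    """
    if mode == "TEST":
        return list(names[:1]), 1
    return list(names), num_processes
```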

Troubleshooting

  • Empty or minimal CSV output: Ensure the pdb input contains multiple structures; single-structure inputs yield limited clustering information.
  • Unexpectedly few or many clusters: Adjust max_ctm_threshold. Increase it (closer to 1.0) to merge more structures into fewer clusters; decrease it to split more aggressively.
  • Long runtimes on large batches: Increase num_processes if resources allow, or use TEST mode to validate configuration before full runs.
  • Results look repeated or identical in MOCK mode: Switch to PROD mode to run the actual evaluation; MOCK is intended only for demos.
  • Only one structure processed in TEST mode: This is expected. TEST intentionally processes only the first structure and sets num_processes to 1 for speed.
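
When diagnosing an empty output or an unexpected number of clusters, a quick summary of the CSV can help. This sketch assumes a `cluster_index` column as in the example output; the function name `cluster_summary` is hypothetical.

```python
import csv
import io
from collections import Counter
from typing import Dict


def cluster_summary(csv_text: str) -> Dict[str, int]:
    """Count structures, clusters, largest cluster size, and singleton clusters."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    sizes = Counter(row["cluster_index"] for row in rows)
    return {
        "structures": len(rows),
        "clusters": len(sizes),
        "largest_cluster": max(sizes.values(), default=0),
        "singletons": sum(1 for size in sizes.values() if size == 1),
    }
```

A single cluster holding every structure, or as many clusters as structures, suggests adjusting max_ctm_threshold as described above.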
