Evaluate Diversity

Evaluates structural diversity across a set of protein structures by assigning each structure a cluster index based on a TM-score threshold. The number of worker processes is configurable, and three execution modes are available: PROD for the full evaluation, TEST for quick checks, and MOCK for predefined demonstration data. The output is a CSV that records which cluster each input structure belongs to.

Usage

Use this node after generating or loading multiple candidate protein structures to identify distinct structural groups. Typical workflows include screening diverse designs, deduplicating similar structures, or selecting representative candidates by cluster.

Inputs

Field	Required	Type	Description	Example
pdb	True	PDB	A collection of protein structures to evaluate. Provide a mapping of names to PDB content for one or more structures.	{"design_A": "ATOM ...\nEND\n", "design_B": "ATOM ...\nEND\n"}
max_ctm_threshold	True	FLOAT	Maximum TM-score threshold used to separate clusters. Must be between 0.0 and 1.0. Lower values split the structures into more clusters; higher values merge them into fewer clusters.	0.6
num_processes	True	INT	Number of processes to use for parallel evaluation. Increase to speed up processing on larger inputs.	8
mode	True	CHOICE	Execution mode. MOCK returns predefined data for demonstration, PROD performs the actual evaluation, TEST uses minimal parameters for faster checks.	PROD
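
The fields in the table above can be assembled programmatically. The following is a minimal sketch, not the node's own API: `load_pdb_mapping` and `build_inputs` are hypothetical helpers, and the default values simply repeat the examples from the table. It shows one way to build the name-to-content mapping expected by `pdb` from a directory of `.pdb` files.

```python
from pathlib import Path

# Mode names come from the Inputs table; the helpers below are hypothetical.
VALID_MODES = {"MOCK", "PROD", "TEST"}


def load_pdb_mapping(directory):
    """Map each .pdb file's stem (file name without extension) to its text."""
    return {p.stem: p.read_text() for p in sorted(Path(directory).glob("*.pdb"))}


def build_inputs(pdb, max_ctm_threshold=0.6, num_processes=8, mode="PROD"):
    """Assemble the four documented input fields into one dictionary."""
    if not pdb:
        raise ValueError("pdb must contain at least one structure")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    return {
        "pdb": dict(pdb),
        "max_ctm_threshold": float(max_ctm_threshold),
        "num_processes": int(num_processes),
        "mode": mode,
    }


# In-memory example using the placeholder content from the table.
inputs = build_inputs({"design_A": "ATOM ...\nEND\n", "design_B": "ATOM ...\nEND\n"})
print(sorted(inputs["pdb"]))
```

Using the file stem as the structure name keeps the names in the output CSV traceable back to the source files.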

Outputs

Field	Type	Description	Example
score.csv	CSV	CSV table with clustering results per input structure (e.g., structure name and assigned cluster index).	name,cluster_index\ndesign_A,0\ndesign_B,1
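
To select representative candidates by cluster, the CSV can be grouped by its cluster column. The sketch below assumes the column names `name` and `cluster_index` shown in the example above; the actual header may differ, so check the file before relying on it. Choosing the first listed member as the representative is an arbitrary illustration, not a behavior of the node.

```python
import csv
import io
from collections import defaultdict


def group_by_cluster(csv_text):
    """Return {cluster_index: [structure names]} parsed from CSV text."""
    clusters = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Column names are assumed from the documented example.
        clusters[int(row["cluster_index"])].append(row["name"])
    return dict(clusters)


def pick_representatives(clusters):
    """Pick the first listed member of each cluster (an arbitrary choice)."""
    return {index: members[0] for index, members in sorted(clusters.items())}


example = "name,cluster_index\ndesign_A,0\ndesign_B,1\ndesign_C,0\n"
clusters = group_by_cluster(example)
print(pick_representatives(clusters))
```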

Important Notes

Modes: MOCK returns canned results; TEST limits processing to the first structure and sets num_processes to 1 for quick validation; PROD performs the full evaluation.
Threshold behavior: max_ctm_threshold must be between 0.0 and 1.0. Lower values split the structures into more clusters; higher values merge them into fewer clusters.
Parallelism: num_processes controls concurrency; higher values can reduce runtime but increase resource usage.
Input size: Provide multiple structures to obtain meaningful clustering; a single structure will trivially form one cluster.
Output semantics: The cluster indices in the CSV are grouping labels only; their numeric values do not rank clusters by quality.
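
The mode and threshold rules above can be expressed as a small pre-flight check. This is a sketch of the documented behavior only; `resolve_run_config` is a hypothetical helper, and the node's internal implementation is not shown in this document.

```python
def resolve_run_config(pdb, max_ctm_threshold, num_processes, mode):
    """Validate parameters and apply the documented per-mode adjustments."""
    if not pdb:
        raise ValueError("pdb must contain at least one structure")
    if not 0.0 <= max_ctm_threshold <= 1.0:
        raise ValueError("max_ctm_threshold must be between 0.0 and 1.0")
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    if mode not in ("MOCK", "PROD", "TEST"):
        raise ValueError("mode must be MOCK, PROD, or TEST")

    if mode == "TEST":
        # TEST keeps only the first structure and runs a single process.
        first_name = next(iter(pdb))
        pdb = {first_name: pdb[first_name]}
        num_processes = 1

    return {
        "pdb": dict(pdb),
        "max_ctm_threshold": max_ctm_threshold,
        "num_processes": num_processes,
        "mode": mode,
        # MOCK returns canned results, so no evaluation is actually run.
        "runs_evaluation": mode != "MOCK",
    }


config = resolve_run_config({"design_A": "...", "design_B": "..."}, 0.6, 8, "TEST")
print(list(config["pdb"]), config["num_processes"])
```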

Troubleshooting

Empty or minimal CSV output: Ensure the pdb input contains multiple structures; single-structure inputs yield limited clustering information.
Unexpectedly few or many clusters: Adjust max_ctm_threshold. Increase it (closer to 1.0) to merge more structures into fewer clusters; decrease it to split more aggressively.
Long runtimes on large batches: Increase num_processes if resources allow, or use TEST mode to validate configuration before full runs.
Results look repeated or identical in MOCK mode: Switch to PROD mode to run the actual evaluation; MOCK is intended only for demos.
Only one structure processed in TEST mode: This is expected. TEST intentionally limits processing to the first structure and sets num_processes to 1 for speed.
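
When diagnosing the cases above, a quick summary of the output CSV can show whether a result is empty, trivially single-cluster, or split into many singletons. The sketch below assumes the `cluster_index` column name from the Outputs example; `summarize_clusters` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def summarize_clusters(csv_text):
    """Count structures, clusters, and singleton clusters in CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # The column name is assumed from the documented example.
    sizes = Counter(row["cluster_index"] for row in rows)
    return {
        "num_structures": len(rows),
        "num_clusters": len(sizes),
        "largest_cluster": max(sizes.values(), default=0),
        "singletons": sum(1 for size in sizes.values() if size == 1),
    }


example = "name,cluster_index\ndesign_A,0\ndesign_B,1\ndesign_C,0\n"
print(summarize_clusters(example))
```

A result with one structure and one cluster points to a single-structure input or TEST mode. A result where almost every cluster is a singleton, or where everything falls into one cluster, suggests revisiting max_ctm_threshold.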