Leaderboard

Performance rankings of scientific visualization agents across different task suites and evaluation metrics. Explore how different agent architectures and LLM backbones perform on our benchmark tasks.

About This Leaderboard

Evaluation Setup: For most task suites, agent outputs are scored by an LLM judge, either Claude Opus 4.6 or GPT-5.2. Topology Visualization uses rule-based evaluation and does not require an LLM judge. Scores and completion rates are reported as mean ± standard deviation across three repeated trials.
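As a minimal sketch of the "mean ± standard deviation" reporting described above, the snippet below summarizes one metric over three trials. The per-trial values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption; the leaderboard does not state which variant it uses.

```python
from statistics import mean, stdev


def summarize_trials(values):
    """Format repeated-trial values as 'mean ± std'.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    return f"{mean(values):.2f} ± {stdev(values):.2f}"


# Hypothetical completion rates (%) from three repeated trials of one agent.
trial_completion_rates = [81.82, 81.82, 72.73]

summary = summarize_trials(trial_completion_rates)
print(summary)  # 78.79 ± 5.25
```

With only three trials the standard deviation is a rough indicator of run-to-run variability, so small differences between adjacent ranks should be read with that spread in mind.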

Metrics: Overall Score (quality of outputs), Completion Rate (% of tasks completed), pass@k (success in at least one of the first k trials), pass^k (success in all k trials).
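The pass@k and pass^k definitions above can be sketched as follows. This is an illustrative reading of the stated definitions, not the leaderboard's own code: the function names, the boolean per-trial outcomes, and the example data are all hypothetical. What counts as a successful trial, and whether the leaderboard averages pass@1 over trials, is not specified here.

```python
def pass_at_k(results, k):
    """Fraction of tasks with at least one success among the first k trials."""
    return sum(any(trials[:k]) for trials in results) / len(results)


def pass_hat_k(results, k):
    """Fraction of tasks that succeed in every one of the first k trials (pass^k)."""
    return sum(all(trials[:k]) for trials in results) / len(results)


# Hypothetical outcomes: one row per task, one boolean per trial (three trials each).
results = [
    [True, True, True],     # succeeds in every trial
    [False, True, False],   # succeeds only in the second trial
    [False, False, False],  # never succeeds
    [True, False, True],    # succeeds in the first and third trials
]

print(pass_at_k(results, 1))   # 0.5
print(pass_at_k(results, 3))   # 0.75
print(pass_hat_k(results, 3))  # 0.25
```

By construction pass^k can never exceed pass@k for the same k, so the gap between pass@3 and pass^3 gives a quick sense of how consistent an agent is across repeated trials.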

Legend: ↑ higher is better, ↓ lower is better.

ParaView Visualization

| Rank | Agent + Model | Overall Score ↑ | Completion Rate (%) ↑ | pass@1 ↑ | pass@3 ↑ | pass^3 ↑ |
|---|---|---|---|---|---|---|
| 1 | Claude Code + Claude-Sonnet-4.5 | 62.57 ± 0.51 | 97.92 ± 2.08 | 0.736 | 0.917 | 0.500 |
| 2 | Codex + GPT-5.2 | 60.17 ± 1.43 | 95.14 ± 2.41 | 0.771 | 0.938 | 0.521 |
| 3 | ChatVis + Claude-Sonnet-4.5 | 37.37 ± 3.02 | 54.17 ± 3.61 | 0.458 | 0.562 | 0.312 |
| 4 | ChatVis + GPT-5.2 | 30.77 ± 1.10 | 44.44 ± 1.20 | 0.382 | 0.604 | 0.167 |
| 5 | ParaView-MCP + Claude-Sonnet-4.5 | 26.43 ± 8.80 | 53.47 ± 15.07 | 0.257 | 0.417 | 0.146 |
| 6 | ParaView-MCP + GPT-5.2 | 24.63 ± 3.72 | 46.53 ± 6.36 | 0.236 | 0.292 | 0.188 |

Molecular Visualization

| Rank | Agent + Model | Overall Score ↑ | Completion Rate (%) ↑ | pass@1 ↑ | pass@3 ↑ | pass^3 ↑ |
|---|---|---|---|---|---|---|
| 1 | Codex + GPT-5.2 | 62.30 ± 6.32 | 94.87 ± 8.88 | 0.872 | 1.000 | 0.769 |
| 2 | Claude Code + Claude-Sonnet-4.5 | 61.47 ± 6.78 | 94.87 ± 4.44 | 0.897 | 0.923 | 0.846 |
| 3 | GMX-VMD-MCP + Claude-Sonnet-4.5 | 60.23 ± 4.39 | 97.44 ± 4.44 | 0.846 | 0.846 | 0.846 |
| 4 | GMX-VMD-MCP + GPT-5.2 | 45.67 ± 7.48 | 66.67 ± 11.75 | 0.564 | 0.769 | 0.385 |

Bioimage Visualization

| Rank | Agent + Model | Overall Score ↑ | Completion Rate (%) ↑ | pass@1 ↑ | pass@3 ↑ | pass^3 ↑ |
|---|---|---|---|---|---|---|
| 1 | BioImage-Agent + GPT-5.2 | 66.67 ± 4.28 | 81.82 ± 0.00 | 0.788 | 0.818 | 0.727 |
| 2 | BioImage-Agent + Claude-Sonnet-4.5 | 57.67 ± 6.39 | 78.79 ± 5.25 | 0.636 | 0.727 | 0.545 |
| 3 | Claude Code + Claude-Sonnet-4.5 | 52.83 ± 9.80 | 90.91 ± 9.09 | 0.697 | 0.818 | 0.636 |
| 4 | Codex + GPT-5.2 | 41.90 ± 4.69 | 75.76 ± 10.50 | 0.576 | 0.727 | 0.364 |

Object Identification

| Rank | Agent + Model | Overall Score ↑ | Completion Rate (%) ↑ | pass@1 ↑ | pass@3 ↑ | pass^3 ↑ |
|---|---|---|---|---|---|---|
| 1 | Codex + GPT-5.2 | 43.33 ± 5.28 | 92.59 ± 9.80 | 0.395 | 0.556 | 0.185 |
| 2 | ParaView-MCP + Claude-Sonnet-4.5 | 42.17 ± 0.50 | 92.59 ± 0.00 | 0.185 | 0.333 | 0.037 |
| 3 | Claude Code + Claude-Sonnet-4.5 | 41.50 ± 3.55 | 83.95 ± 9.32 | 0.395 | 0.704 | 0.111 |
| 4 | ParaView-MCP + GPT-5.2 | 26.73 ± 7.36 | 49.38 ± 11.32 | 0.358 | 0.741 | 0.037 |

Rule-Based Evaluation

The following task suite uses rule-based evaluation metrics and does not require an LLM judge. Results are consistent across different evaluation methods.

Topology Visualization (Rule-Based)

| Rank | Agent + Model | Overall Score ↑ | Completion Rate (%) ↑ | pass@1 ↑ | pass@3 ↑ | pass^3 ↑ |
|---|---|---|---|---|---|---|
| 1 | Codex + GPT-5.2 | 76.43 ± 10.06 | 85.19 ± 6.42 | 0.778 | 0.889 | 0.556 |
| 2 | Claude Code + Claude-Sonnet-4.5 | 45.23 ± 8.81 | 59.26 ± 6.42 | 0.444 | 0.556 | 0.333 |
| 3 | TopoPilot2 + Claude-Sonnet-4.5 | 32.07 ± 1.01 | 55.56 ± 0.00 | 0.185 | 0.333 | 0.000 |
| 4 | TopoPilot2 + GPT-5.2 | 31.13 ± 2.71 | 55.56 ± 0.00 | 0.148 | 0.222 | 0.111 |

Image-Based Evaluation Metrics (ParaView Visualization)

| Rank | Agent + Model | PSNR (scaled) | SSIM (scaled) | LPIPS (scaled) |
|---|---|---|---|---|
| 1 | Codex + GPT-5.2 | 21.27 ± 1.02 | 0.92 ± 0.02 | 0.10 ± 0.03 |
| 2 | Claude Code + Claude-Sonnet-4.5 | 20.99 ± 0.68 | 0.92 ± 0.02 | 0.10 ± 0.02 |
| 3 | ParaView-MCP + Claude-Sonnet-4.5 | 12.00 ± 2.96 | 0.54 ± 0.14 | 0.49 ± 0.14 |
| 4 | ChatVis + Claude-Sonnet-4.5 | 10.50 ± 0.97 | 0.50 ± 0.04 | 0.50 ± 0.04 |
| 5 | ChatVis + GPT-5.2 | 9.63 ± 0.67 | 0.44 ± 0.01 | 0.57 ± 0.01 |
| 6 | ParaView-MCP + GPT-5.2 | 9.36 ± 1.27 | 0.46 ± 0.06 | 0.57 ± 0.05 |

Token Usage Statistics

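| Task Suite | Agent + Model | # Input Tokens | # Output Tokens |
|---|---|---|---|
| ParaView Visualization | ChatVis + GPT-5.2 | 156.64K ± 8.02K | 180.56K ± 8.84K |
| ParaView Visualization | ChatVis + Claude-Sonnet-4.5 | 116.83K ± 7.59K | 152.70K ± 6.88K |
| ParaView Visualization | ParaView-MCP + GPT-5.2 | 5.71M ± 0.61M | 33.30K ± 1.99K |
| ParaView Visualization | ParaView-MCP + Claude-Sonnet-4.5 | 28.51M ± 3.30M | 380.30K ± 42.53K |
| ParaView Visualization | Claude Code + Claude-Sonnet-4.5 | 39.49M ± 6.62M | 425.32K ± 55.52K |
| ParaView Visualization | Codex + GPT-5.2 | 45.57M ± 9.47M | 396.60K ± 23.27K |
| Molecular Visualization | GMX-VMD-MCP + GPT-5.2 | 1.56M ± 0.17M | 28.27K ± 6.57K |
| Molecular Visualization | GMX-VMD-MCP + Claude-Sonnet-4.5 | 5.89M ± 1.00M | 82.90K ± 12.53K |
| Molecular Visualization | Claude Code + Claude-Sonnet-4.5 | 5.07M ± 0.12M | 81.73K ± 3.45K |
| Molecular Visualization | Codex + GPT-5.2 | 8.63M ± 1.81M | 112.28K ± 17.22K |
| Bioimage Visualization | BioImage-Agent + GPT-5.2 | 931.66K ± 186.31K | 6.26K ± 1.47K |
| Bioimage Visualization | BioImage-Agent + Claude-Sonnet-4.5 | 1.58M ± 0.02M | 18.47K ± 1.28K |
| Bioimage Visualization | Claude Code + Claude-Sonnet-4.5 | 8.60M ± 0.17M | 125.66K ± 2.43K |
| Bioimage Visualization | Codex + GPT-5.2 | 10.36M ± 3.76M | 104.83K ± 25.78K |
| Topology Visualization | TopoPilot2 + GPT-5.2 | 394.00K ± 6.06K | 2.89K ± 0.17K |
| Topology Visualization | TopoPilot2 + Claude-Sonnet-4.5 | 3.74M ± 3.67M | 57.93K ± 32.35K |
| Topology Visualization | Claude Code + Claude-Sonnet-4.5 | 17.26M ± 1.87M | 172.37K ± 18.51K |
| Topology Visualization | Codex + GPT-5.2 | 46.04M ± 4.48M | 193.58K ± 27.62K |
| Object Identification | ParaView-MCP + GPT-5.2 | 4.12M ± 0.39M | 24.58K ± 1.20K |
| Object Identification | ParaView-MCP + Claude-Sonnet-4.5 | 11.42M ± 0.49M | 119.24K ± 4.65K |
| Object Identification | Claude Code + Claude-Sonnet-4.5 | 16.83M ± 0.28M | 273.90K ± 22.71K |
| Object Identification | Codex + GPT-5.2 | 38.43M ± 3.55M | 344.27K ± 14.95K |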