Leaderboard
Performance rankings of scientific visualization agents across different task suites and evaluation metrics. Explore how different agent architectures and LLM backbones perform on our benchmark tasks.
About This Leaderboard
Evaluation Setup: For most task suites, agent outputs are scored by an LLM judge (Claude Opus 4.6 or GPT-5.2). Topology Visualization uses rule-based evaluation and does not require an LLM judge. Scores and completion rates are reported as mean ± standard deviation over three repeated trials.
Metrics: Overall Score (quality of the produced visualizations), Completion Rate (percentage of tasks completed), pass@k (fraction of tasks solved in at least one of the first k trials), pass^k (fraction of tasks solved in all k trials).
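As a concrete illustration, pass@k and pass^k can be computed directly from per-task trial outcomes. This is a minimal sketch; the function names are illustrative and not taken from the benchmark's code:

```python
def pass_at_k(trial_results, k):
    """Fraction of tasks where at least one of the first k trials succeeded.

    trial_results: list of per-task lists of booleans, one entry per trial.
    """
    return sum(any(trials[:k]) for trials in trial_results) / len(trial_results)

def pass_pow_k(trial_results, k):
    """Fraction of tasks where all of the first k trials succeeded."""
    return sum(all(trials[:k]) for trials in trial_results) / len(trial_results)

# Example: 4 tasks, 3 trials each (True = task passed that trial).
results = [
    [True, True, True],    # counts toward pass@1, pass@3, and pass^3
    [False, True, False],  # counts toward pass@3 only
    [True, False, True],   # counts toward pass@1 and pass@3, not pass^3
    [False, False, False], # counts toward nothing
]
print(pass_at_k(results, 1))   # 0.5
print(pass_at_k(results, 3))   # 0.75
print(pass_pow_k(results, 3))  # 0.25
```

Note that pass^k is always at most pass@1, which is why the pass^3 columns below are uniformly the lowest of the three.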
ParaView Visualization
| Rank | Agent + Model | Overall Score ↑ | Completion Rate ↑ | pass@1 ↑ | pass@3 ↑ | pass^3 ↑ |
|---|---|---|---|---|---|---|
| 1 | Claude Code (Claude-Sonnet-4.5) | 62.57 ± 0.51 | 97.92 ± 2.08 | 0.736 | 0.917 | 0.500 |
| 2 | Codex (GPT-5.2) | 60.17 ± 1.43 | 95.14 ± 2.41 | 0.771 | 0.938 | 0.521 |
| 3 | ChatVis (Claude-Sonnet-4.5) | 37.37 ± 3.02 | 54.17 ± 3.61 | 0.458 | 0.562 | 0.312 |
| 4 | ChatVis (GPT-5.2) | 30.77 ± 1.10 | 44.44 ± 1.20 | 0.382 | 0.604 | 0.167 |
| 5 | ParaView-MCP (Claude-Sonnet-4.5) | 26.43 ± 8.80 | 53.47 ± 15.07 | 0.257 | 0.417 | 0.146 |
| 6 | ParaView-MCP (GPT-5.2) | 24.63 ± 3.72 | 46.53 ± 6.36 | 0.236 | 0.292 | 0.188 |
Molecular Visualization
| Rank | Agent + Model | Overall Score ↑ | Completion Rate ↑ | pass@1 ↑ | pass@3 ↑ | pass^3 ↑ |
|---|---|---|---|---|---|---|
| 1 | Codex (GPT-5.2) | 62.30 ± 6.32 | 94.87 ± 8.88 | 0.872 | 1.000 | 0.769 |
| 2 | Claude Code (Claude-Sonnet-4.5) | 61.47 ± 6.78 | 94.87 ± 4.44 | 0.897 | 0.923 | 0.846 |
| 3 | GMX-VMD-MCP (Claude-Sonnet-4.5) | 60.23 ± 4.39 | 97.44 ± 4.44 | 0.846 | 0.846 | 0.846 |
| 4 | GMX-VMD-MCP (GPT-5.2) | 45.67 ± 7.48 | 66.67 ± 11.75 | 0.564 | 0.769 | 0.385 |
Bioimage Visualization
| Rank | Agent + Model | Overall Score ↑ | Completion Rate ↑ | pass@1 ↑ | pass@3 ↑ | pass^3 ↑ |
|---|---|---|---|---|---|---|
| 1 | BioImage-Agent (GPT-5.2) | 66.67 ± 4.28 | 81.82 ± 0.00 | 0.788 | 0.818 | 0.727 |
| 2 | BioImage-Agent (Claude-Sonnet-4.5) | 57.67 ± 6.39 | 78.79 ± 5.25 | 0.636 | 0.727 | 0.545 |
| 3 | Claude Code (Claude-Sonnet-4.5) | 52.83 ± 9.80 | 90.91 ± 9.09 | 0.697 | 0.818 | 0.636 |
| 4 | Codex (GPT-5.2) | 41.90 ± 4.69 | 75.76 ± 10.50 | 0.576 | 0.727 | 0.364 |
Object Identification
| Rank | Agent + Model | Overall Score ↑ | Completion Rate ↑ | pass@1 ↑ | pass@3 ↑ | pass^3 ↑ |
|---|---|---|---|---|---|---|
| 1 | Codex (GPT-5.2) | 43.33 ± 5.28 | 92.59 ± 9.80 | 0.395 | 0.556 | 0.185 |
| 2 | ParaView-MCP (Claude-Sonnet-4.5) | 42.17 ± 0.50 | 92.59 ± 0.00 | 0.185 | 0.333 | 0.037 |
| 3 | Claude Code (Claude-Sonnet-4.5) | 41.50 ± 3.55 | 83.95 ± 9.32 | 0.395 | 0.704 | 0.111 |
| 4 | ParaView-MCP (GPT-5.2) | 26.73 ± 7.36 | 49.38 ± 11.32 | 0.358 | 0.741 | 0.037 |
Rule-Based Evaluation
The following task suite uses rule-based evaluation metrics and does not require an LLM judge. Results are consistent across different evaluation methods.
Topology Visualization (Rule-Based)
| Rank | Agent + Model | Overall Score ↑ | Completion Rate ↑ | pass@1 ↑ | pass@3 ↑ | pass^3 ↑ |
|---|---|---|---|---|---|---|
| 1 | Codex (GPT-5.2) | 76.43 ± 10.06 | 85.19 ± 6.42 | 0.778 | 0.889 | 0.556 |
| 2 | Claude Code (Claude-Sonnet-4.5) | 45.23 ± 8.81 | 59.26 ± 6.42 | 0.444 | 0.556 | 0.333 |
| 3 | TopoPilot2 (Claude-Sonnet-4.5) | 32.07 ± 1.01 | 55.56 ± 0.00 | 0.185 | 0.333 | 0.000 |
| 4 | TopoPilot2 (GPT-5.2) | 31.13 ± 2.71 | 55.56 ± 0.00 | 0.148 | 0.222 | 0.111 |
Image-Based Evaluation Metrics (ParaView Visualization)
| Rank | Agent + Model | PSNR (scaled) ↑ | SSIM (scaled) ↑ | LPIPS (scaled) ↓ |
|---|---|---|---|---|
| 1 | Codex (GPT-5.2) | 21.27 ± 1.02 | 0.92 ± 0.02 | 0.10 ± 0.03 |
| 2 | Claude Code (Claude-Sonnet-4.5) | 20.99 ± 0.68 | 0.92 ± 0.02 | 0.10 ± 0.02 |
| 3 | ParaView-MCP (Claude-Sonnet-4.5) | 12.00 ± 2.96 | 0.54 ± 0.14 | 0.49 ± 0.14 |
| 4 | ChatVis (Claude-Sonnet-4.5) | 10.50 ± 0.97 | 0.50 ± 0.04 | 0.50 ± 0.04 |
| 5 | ChatVis (GPT-5.2) | 9.63 ± 0.67 | 0.44 ± 0.01 | 0.57 ± 0.01 |
| 6 | ParaView-MCP (GPT-5.2) | 9.36 ± 1.27 | 0.46 ± 0.06 | 0.57 ± 0.05 |
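For reference, PSNR compares a rendered image against a ground-truth image pixel by pixel. The sketch below shows the standard (unscaled) computation with NumPy; how this benchmark scales PSNR, SSIM, and LPIPS into the reported ranges is not specified on this page, so no scaling is applied:

```python
import numpy as np

def psnr(reference, rendered, max_val=255.0):
    """Peak signal-to-noise ratio (dB) between two images of equal shape."""
    ref = reference.astype(np.float64)
    out = rendered.astype(np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a uniform-gray "render" that is off by 10 gray levels everywhere.
ref = np.full((64, 64), 128, dtype=np.uint8)
out = np.full((64, 64), 118, dtype=np.uint8)
print(round(psnr(ref, out), 2))  # MSE = 100, so 10*log10(255^2/100) ≈ 28.13
```

Higher PSNR and SSIM mean closer agreement with the reference render; LPIPS is a learned perceptual distance, so lower is better.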
Token Usage Statistics
| Task Suite | Agent + Model | # Input Tokens | # Output Tokens |
|---|---|---|---|
| ParaView Visualization | ChatVis (GPT-5.2) | 156.64K ± 8.02K | 180.56K ± 8.84K |
| | ChatVis (Claude-Sonnet-4.5) | 116.83K ± 7.59K | 152.70K ± 6.88K |
| | ParaView-MCP (GPT-5.2) | 5.71M ± 0.61M | 33.30K ± 1.99K |
| | ParaView-MCP (Claude-Sonnet-4.5) | 28.51M ± 3.30M | 380.30K ± 42.53K |
| | Claude Code (Claude-Sonnet-4.5) | 39.49M ± 6.62M | 425.32K ± 55.52K |
| | Codex (GPT-5.2) | 45.57M ± 9.47M | 396.60K ± 23.27K |
| Molecular Visualization | GMX-VMD-MCP (GPT-5.2) | 1.56M ± 0.17M | 28.27K ± 6.57K |
| | GMX-VMD-MCP (Claude-Sonnet-4.5) | 5.89M ± 1.00M | 82.90K ± 12.53K |
| | Claude Code (Claude-Sonnet-4.5) | 5.07M ± 0.12M | 81.73K ± 3.45K |
| | Codex (GPT-5.2) | 8.63M ± 1.81M | 112.28K ± 17.22K |
| Bioimage Visualization | BioImage-Agent (GPT-5.2) | 931.66K ± 186.31K | 6.26K ± 1.47K |
| | BioImage-Agent (Claude-Sonnet-4.5) | 1.58M ± 0.02M | 18.47K ± 1.28K |
| | Claude Code (Claude-Sonnet-4.5) | 8.60M ± 0.17M | 125.66K ± 2.43K |
| | Codex (GPT-5.2) | 10.36M ± 3.76M | 104.83K ± 25.78K |
| Topology Visualization | TopoPilot2 (GPT-5.2) | 394.00K ± 6.06K | 2.89K ± 0.17K |
| | TopoPilot2 (Claude-Sonnet-4.5) | 3.74M ± 3.67M | 57.93K ± 32.35K |
| | Claude Code (Claude-Sonnet-4.5) | 17.26M ± 1.87M | 172.37K ± 18.51K |
| | Codex (GPT-5.2) | 46.04M ± 4.48M | 193.58K ± 27.62K |
| Object Identification | ParaView-MCP (GPT-5.2) | 4.12M ± 0.39M | 24.58K ± 1.20K |
| | ParaView-MCP (Claude-Sonnet-4.5) | 11.42M ± 0.49M | 119.24K ± 4.65K |
| | Claude Code (Claude-Sonnet-4.5) | 16.83M ± 0.28M | 273.90K ± 22.71K |
| | Codex (GPT-5.2) | 38.43M ± 3.55M | 344.27K ± 14.95K |