Evaluation Examples

Explore real-world evaluation reports from our benchmark suite. These examples demonstrate how scientific visualization agents perform across different domains and tasks.

Evaluation Setup

  • Backbone LLM: Claude Sonnet 4.5 - the primary agent that executes the visualization tasks
  • Judge LLM: GPT-5.2 - used to evaluate the quality and correctness of agent outputs
  • Evaluation Approach: LLM-as-a-Judge combined with quantitative metrics such as image similarity and task completion rate (a sketch of this combination follows below)
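To make the approach concrete, here is a minimal sketch of how a judge score and a quantitative image-similarity signal might be combined into a single task score. The prompt, the parsing, the weights, and the helper names are illustrative assumptions for this sketch, not the benchmark's actual harness.

    import base64
    import re
    from openai import OpenAI
    from skimage.io import imread
    from skimage.metrics import structural_similarity as ssim
    from skimage.transform import resize

    def image_similarity(candidate_path: str, reference_path: str) -> float:
        """Quantitative signal: SSIM between the agent's render and a reference image."""
        cand = imread(candidate_path, as_gray=True)
        ref = imread(reference_path, as_gray=True)
        if cand.shape != ref.shape:
            cand = resize(cand, ref.shape, anti_aliasing=True)
        return float(ssim(cand, ref, data_range=ref.max() - ref.min()))

    def judge_score(image_path: str, task: str, model: str = "gpt-5.2") -> float:
        """LLM-as-a-Judge signal: ask the judge model to grade the render 0-10.
        The prompt and response parsing here are assumptions, not the real harness."""
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        resp = OpenAI().chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"Grade this visualization for the task: {task}. "
                             "Reply with a single integer from 0 to 10."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        match = re.search(r"\d+", resp.choices[0].message.content)
        return int(match.group()) / 10.0 if match else 0.0

    def task_score(judge: float, similarity: float, completed: bool) -> float:
        """Blend the signals; the 0.5/0.3/0.2 weighting is an assumption."""
        return 0.5 * judge + 0.3 * similarity + 0.2 * float(completed)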

Available Evaluation Reports

ParaView Visualization Cases

Core benchmark evaluation for ParaView-based scientific visualization

ParaView Agentic Workflows

Molecular Visualization Cases

Molecular dynamics and protein structure visualization with GROMACS (gmx) and VMD

VMD GROMACS MCP

Bioimage Visualization Cases

Biological image analysis and visualization with Napari

Napari Biology MCP

Topology Visualization Cases

Topological data analysis and visualization with TTK

Topology TTK MCP

Object Identification Cases

Loading anonymized volume datasets, adjusting transfer functions, and identifying unknown objects (see the sketch below)

ParaView Vision Capability
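For intuition, this is roughly what the object-identification workflow looks like through ParaView's Python API (paraview.simple): load a volume, switch to volume rendering, and reshape the opacity transfer function until internal structure becomes visible. The file name, array name, and transfer-function points are placeholder assumptions, not values from the actual cases.

    from paraview.simple import (ColorBy, GetActiveViewOrCreate,
                                 GetColorTransferFunction,
                                 GetOpacityTransferFunction, OpenDataFile,
                                 Render, SaveScreenshot, Show)

    # Hypothetical anonymized volume; the real case files and arrays differ.
    volume = OpenDataFile("anon_volume.vti")
    view = GetActiveViewOrCreate("RenderView")

    display = Show(volume, view)
    display.SetRepresentationType("Volume")
    ColorBy(display, ("POINTS", "Scalars_"))

    # Each opacity point is (data value, opacity, midpoint, sharpness).
    # Suppressing low values while ramping up high ones often reveals the object.
    opacity = GetOpacityTransferFunction("Scalars_")
    opacity.Points = [0.0, 0.0, 0.5, 0.0,
                      80.0, 0.0, 0.5, 0.0,
                      255.0, 0.6, 0.5, 0.0]
    GetColorTransferFunction("Scalars_").RescaleTransferFunction(0.0, 255.0)

    Render(view)
    SaveScreenshot("candidate.png", view)  # the image the judge then inspects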

About These Reports

Each report contains detailed evaluation metrics, including task completion rates, visualization quality scores, error analysis, and performance benchmarks. The reports demonstrate the capabilities and limitations of current visualization agents across different scientific domains.
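As a rough illustration, the headline numbers in a report could be derived from per-task records like the following. The field names and rollup are assumptions made for this sketch, not the reports' actual schema.

    from dataclasses import dataclass
    from statistics import mean

    @dataclass
    class TaskResult:
        task_id: str
        completed: bool
        judge_score: float          # 0-1, from the LLM judge
        similarity: float           # 0-1, image similarity vs. reference
        error: str | None = None    # populated when the agent failed

    def summarize(results: list[TaskResult]) -> dict:
        """Roll per-task records up into report-level metrics."""
        done = [r for r in results if r.completed]
        return {
            "task_completion_rate": len(done) / len(results),
            "mean_quality_score": mean(r.judge_score for r in done) if done else 0.0,
            "mean_image_similarity": mean(r.similarity for r in done) if done else 0.0,
            "errors": [r.error for r in results if r.error is not None],
        }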

These evaluations help us understand where agents excel and where improvements are needed, driving the development of more reliable scientific visualization tools.