Evaluation Examples

Explore real-world evaluation reports from our benchmark suite. These examples demonstrate how scientific visualization agents perform across different domains and tasks.

Evaluation Setup

  • Backbone LLM: Claude Sonnet 4.5 - the primary agent that executes the visualization tasks
  • Judge LLM: GPT-5.2 - used to evaluate the quality and correctness of agent outputs
  • Evaluation Approach: LLM-as-a-Judge combined with quantitative metrics such as image similarity and task completion rate (a sketch of this combination follows below)
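To make the approach concrete, here is a minimal sketch of how a judge score and a quantitative image-similarity signal might be combined into a single task score. The prompt, the parsing, the weights, and the helper names are illustrative assumptions for this sketch, not the benchmark's actual harness.

    import base64
    import re
    from openai import OpenAI
    from skimage.io import imread
    from skimage.metrics import structural_similarity as ssim
    from skimage.transform import resize

    def image_similarity(candidate_path: str, reference_path: str) -> float:
        """Quantitative signal: SSIM between the agent's render and a reference image."""
        cand = imread(candidate_path, as_gray=True)
        ref = imread(reference_path, as_gray=True)
        if cand.shape != ref.shape:
            cand = resize(cand, ref.shape, anti_aliasing=True)
        return float(ssim(cand, ref, data_range=ref.max() - ref.min()))

    def judge_score(image_path: str, task: str, model: str = "gpt-5.2") -> float:
        """LLM-as-a-Judge signal: ask the judge model to grade the render 0-10.
        The prompt and response parsing here are assumptions, not the real harness."""
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        resp = OpenAI().chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"Grade this visualization for the task: {task}. "
                             "Reply with a single integer from 0 to 10."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        match = re.search(r"\d+", resp.choices[0].message.content)
        return int(match.group()) / 10.0 if match else 0.0

    def task_score(judge: float, similarity: float, completed: bool) -> float:
        """Blend the signals; the 0.5/0.3/0.2 weighting is an assumption."""
        return 0.5 * judge + 0.3 * similarity + 0.2 * float(completed)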

Available Evaluation Reports

ParaView Visualization Cases

Core benchmark evaluation for ParaView-based scientific visualization

ParaView Agentic Workflows

Molecular Visualization Cases

Molecular dynamics and protein structure visualization with GROMACS (gmx) and VMD

VMD GROMACS MCP

Bioimage Visualization Cases

Biological image analysis and visualization with Napari

Napari Biology MCP

Topology Visualization Cases

Topological data analysis and visualization with TTK

Topology TTK MCP

Object Identification Cases

Loading anonymized volume datasets, adjusting transfer functions, and identifying unknown objects (see the sketch below)

ParaView Vision Capability
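For intuition, this is roughly what the object-identification workflow looks like through ParaView's Python API (paraview.simple): load a volume, switch to volume rendering, and reshape the opacity transfer function until internal structure becomes visible. The file name, array name, and transfer-function points are placeholder assumptions, not values from the actual cases.

    from paraview.simple import (ColorBy, GetActiveViewOrCreate,
                                 GetColorTransferFunction,
                                 GetOpacityTransferFunction, OpenDataFile,
                                 Render, SaveScreenshot, Show)

    # Hypothetical anonymized volume; the real case files and arrays differ.
    volume = OpenDataFile("anon_volume.vti")
    view = GetActiveViewOrCreate("RenderView")

    display = Show(volume, view)
    display.SetRepresentationType("Volume")
    ColorBy(display, ("POINTS", "Scalars_"))

    # Each opacity point is (data value, opacity, midpoint, sharpness).
    # Suppressing low values while ramping up high ones often reveals the object.
    opacity = GetOpacityTransferFunction("Scalars_")
    opacity.Points = [0.0, 0.0, 0.5, 0.0,
                      80.0, 0.0, 0.5, 0.0,
                      255.0, 0.6, 0.5, 0.0]
    GetColorTransferFunction("Scalars_").RescaleTransferFunction(0.0, 255.0)

    Render(view)
    SaveScreenshot("candidate.png", view)  # the image the judge then inspects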

About These Reports

Each report contains detailed evaluation metrics, including task completion rates, visualization quality scores, error analysis, and performance benchmarks. The reports demonstrate the capabilities and limitations of current visualization agents across different scientific domains.
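As a rough illustration, the headline numbers in a report could be derived from per-task records like the following. The field names and rollup are assumptions made for this sketch, not the reports' actual schema.

    from dataclasses import dataclass
    from statistics import mean

    @dataclass
    class TaskResult:
        task_id: str
        completed: bool
        judge_score: float          # 0-1, from the LLM judge
        similarity: float           # 0-1, image similarity vs. reference
        error: str | None = None    # populated when the agent failed

    def summarize(results: list[TaskResult]) -> dict:
        """Roll per-task records up into report-level metrics."""
        done = [r for r in results if r.completed]
        return {
            "task_completion_rate": len(done) / len(results),
            "mean_quality_score": mean(r.judge_score for r in done) if done else 0.0,
            "mean_image_similarity": mean(r.similarity for r in done) if done else 0.0,
            "errors": [r.error for r in results if r.error is not None],
        }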

These evaluations help us understand where agents excel and where improvements are needed, driving the development of more reliable scientific visualization tools.