Evaluation Examples
Explore real-world evaluation reports from our benchmark testing. These examples demonstrate how scientific visualization agents perform across different domains and tasks.
Evaluation Setup
- Backbone LLM: Claude Sonnet 4.5, the primary agent used to execute the visualization tasks
- Judge LLM: GPT-5.2, used to evaluate the quality and correctness of agent outputs
- Evaluation Approach: LLM-as-a-Judge combined with quantitative metrics such as image similarity and task completion (see the sketch below)
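As a minimal sketch of how such a combined score might be computed: SSIM from scikit-image stands in for the image-similarity metric, `judge_llm` is a placeholder for a call to the judge model, and the equal weighting is illustrative rather than the benchmark's actual formula.

```python
# Minimal sketch of scoring one benchmark case: an LLM judge grade combined
# with a quantitative image-similarity metric (SSIM via scikit-image).
from skimage.io import imread
from skimage.metrics import structural_similarity


def judge_llm(prompt: str) -> float:
    """Placeholder: send `prompt` (plus the rendered image) to the judge
    model and parse a 0-1 quality score from its reply."""
    raise NotImplementedError


def score_case(rendered_png: str, reference_png: str, task: str) -> float:
    # Quantitative part: structural similarity between the agent's render
    # and the reference image.
    ssim = structural_similarity(
        imread(rendered_png), imread(reference_png), channel_axis=-1
    )

    # Qualitative part: the judge LLM rates correctness and completeness.
    judge = judge_llm(
        f"Task: {task}\nRate the attached visualization from 0 to 1 for "
        "correctness and completeness. Reply with a number only."
    )

    # Equal weighting is illustrative; the benchmark may combine differently.
    return 0.5 * ssim + 0.5 * judge
```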
Available Evaluation Reports
- ParaView Visualization Cases: core benchmark evaluation for ParaView-based scientific visualization (tags: ParaView, Agentic Workflows)
- Molecular Visualization Cases: molecular dynamics and protein structure visualization with GROMACS (gmx) and VMD (tags: VMD, GROMACS, MCP)
- Bioimage Visualization Cases: biological image analysis and visualization with Napari (tags: Napari, Biology, MCP)
- Topology Visualization Cases: topological data analysis and visualization with TTK (tags: Topology, TTK, MCP)
- Object Identification Cases: loading anonymized volume datasets, adjusting transfer functions, and identifying unknown objects, as in the sketch after this list (tags: ParaView, Vision Capability)
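To make the object-identification workflow concrete, here is a minimal paraview.simple sketch of the steps these cases ask for. The file name, the scalar array name "ImageFile", and the opacity breakpoints are illustrative assumptions, not taken from the benchmark's own scripts.

```python
# Sketch of the kind of script an agent emits for an object-identification
# case: load an anonymized volume, switch to volume rendering, and reshape
# the opacity transfer function so internal structure becomes visible.
from paraview.simple import (
    ColorBy, GetOpacityTransferFunction,
    OpenDataFile, Render, SaveScreenshot, Show,
)

reader = OpenDataFile("anonymized_volume.vti")  # hypothetical dataset
display = Show(reader)
display.Representation = "Volume"
ColorBy(display, ("POINTS", "ImageFile"))  # "ImageFile" is a placeholder array name

# Transfer functions are looked up by scalar array name. Opacity points are
# flat (value, alpha, midpoint, sharpness) tuples; here low scalar values
# are made transparent so the unknown object stands out.
opacity = GetOpacityTransferFunction("ImageFile")
opacity.Points = [
      0.0, 0.0, 0.5, 0.0,
     80.0, 0.0, 0.5, 0.0,
    255.0, 0.8, 0.5, 0.0,
]

Render()
SaveScreenshot("object_identification.png")
```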
About These Reports
Each report contains detailed evaluation metrics, including task completion rates, visualization quality scores, error analysis, and performance benchmarks. The reports demonstrate the capabilities and limitations of current visualization agents across different scientific domains.
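As a rough illustration of what a single record in such a report might contain, and how a completion rate aggregates over cases, consider the sketch below; the field names are hypothetical, not the reports' actual schema.

```python
# Hypothetical shape of a single evaluation record; field names are
# illustrative, not the reports' actual schema.
from dataclasses import dataclass, field


@dataclass
class CaseReport:
    case_id: str                 # e.g. "paraview/volume-render-03"
    domain: str                  # ParaView, VMD/GROMACS, Napari, TTK, ...
    completed: bool              # did the agent finish the task?
    quality_score: float         # judge LLM score in [0, 1]
    image_similarity: float      # e.g. SSIM against a reference render
    errors: list[str] = field(default_factory=list)  # error-analysis notes


def completion_rate(reports: list[CaseReport]) -> float:
    """Aggregate task completion rate across a set of cases."""
    return sum(r.completed for r in reports) / len(reports)
```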
These evaluations help us understand where agents excel and where improvements are needed, driving the development of more reliable scientific visualization tools.