Current Benchmark Coverage

108 Test Cases
8 Application Domains
5 Data Types
14 Visualization Operations

Application Domain Distribution

  • 🧬 Biology: 33 cases (30.6%)
  • ⚡ Physics: 24 cases (22.2%)
  • 🔬 Others: 25 cases (23.1%)
  • 🏥 Medical Science: 12 cases (11.1%)
  • 📐 Mathematics: 9 cases (8.3%)
  • 🌍 Earth System Science: 5 cases (4.6%)
  • 🔭 Astronomy: 4 cases (3.7%)
  • 🧪 Chemistry: 3 cases (2.8%)

Complexity Level Distribution

  • Task: 74 cases (68.5%)
  • Workflow: 34 cases (31.5%)
  • Total operations: 296

Note: Operation count represents the sum of all visualization operations across all test cases.

Data Type Distribution

  • Scalar Field: 78
  • Vector Field: 17
  • Multivariate: 8
  • Time-varying: 7
  • Tensor Field: 2

Note: A case can have multiple data type tags.

Visualization Operation Distribution

  • Color & Opacity Mapping (77): Assign colors, opacity, or textures to data elements
  • View & Camera Control (62): Adjust camera position, orientation, zoom, or lighting
  • Volume Rendering (40): Render volumetric data directly using ray casting or splatting
  • Glyph & Marker Placement (28): Place oriented, scaled, or typed glyphs at data points
  • Field Computation (26): Derive new scalar, vector, or tensor fields from existing data
  • Scientific Insight Derivation (17): Interpret results to answer domain-specific questions
  • Feature Identification & Segmentation (10): Detect, extract, or label discrete structures or regions
  • Surface & Contour Extraction (9): Generate isosurfaces, contour lines, ribbons, or tubes
  • Temporal Processing (7): Perform computations involving the time dimension of data
  • Data Subsetting & Extraction (6): Isolate spatial regions or value-based subsets from a dataset
  • Data Smoothing & Filtering (5): Reduce noise, enhance features, or apply statistical filters
  • Plot & Chart Generation (4): Produce 2D statistical plots, histograms, or line charts
  • Dataset Restructuring (3): Combine, partition, or reorganize multiple datasets
  • Geometric & Topological Transformation (2): Modify the geometry or connectivity structure of a dataset

Note: A case can have multiple operation tags.
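
Because a case can carry multiple tags, the per-tag totals above exceed the 108-case count. As a concrete illustration of the tagging scheme, here is a minimal sketch of one case's metadata record in Python; the field names and values are invented for this example and are not the benchmark's actual schema:

# Hypothetical metadata record for a single benchmark case.
# All field names and values below are illustrative only.
case = {
    "name": "example_supernova_volume",              # made-up case name
    "application_domain": "Astronomy",
    "complexity_level": "Workflow",                  # "Task" or "Workflow"
    "data_types": ["Scalar Field", "Time-varying"],  # multiple tags allowed
    "operations": [                                  # multiple tags allowed
        "Volume Rendering",
        "Color & Opacity Mapping",
        "Temporal Processing",
    ],
}

# The distributions above count every tag on every case, which is why
# the data-type and operation totals exceed the number of cases.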

Browse Test Cases

Filter and explore all 108 test cases in the benchmark

Cases can be filtered by application domain, complexity level, data type, and visualization operation.


Contribute Test Case

Help build a comprehensive benchmark for scientific visualization agents. Contribute a test case by submitting the dataset along with task descriptions and evaluation criteria.

๐Ÿ“ File Upload

Files are uploaded to Firebase Cloud Storage. All submissions are stored securely and will be used for the SciVisAgentBench benchmark.

  • Maximum data size: 5GB per dataset
  • Ground truth images: PNG, JPG, TIFF, etc. (1024x1024 pixels or larger recommended)
  • Supported source data formats: VTK, NIfTI, RAW, NRRD, HDF5, etc.
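
A contributor could sanity-check these limits locally before uploading. The sketch below assumes the Pillow library for reading image dimensions; the file names are placeholders:

import os
from PIL import Image  # Pillow, used only to read image dimensions

MAX_DATASET_BYTES = 5 * 1024**3  # 5GB per dataset
MIN_IMAGE_SIDE = 1024            # recommended minimum resolution

def dataset_within_limit(paths):
    """True if the combined size of the data files is under 5GB."""
    return sum(os.path.getsize(p) for p in paths) < MAX_DATASET_BYTES

def image_meets_recommendation(path):
    """True if a ground truth image is at least 1024x1024 pixels."""
    with Image.open(path) as img:
        width, height = img.size
    return width >= MIN_IMAGE_SIDE and height >= MIN_IMAGE_SIDE

# Placeholder file names:
# dataset_within_limit(["volume.raw", "volume.nrrd"])
# image_meets_recommendation("expected_view_front.png")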

Contributor Information

Dataset Information

  • Application Domain
  • Data Type * (what information does the data represent?)
  • Task Description for LLM Agent *
  • File Upload *
      • Source data: any format accepted (VTK, NIfTI, RAW, NRRD, HDF5, etc.). Multiple files allowed (max size: 5GB recommended per file).
      • Optional: state files in any format (e.g., ParaView state file, or state files of other visualization engines). Multiple files allowed.
      • Optional: additional files in any format (JSON, YAML, TXT, etc.). Multiple files allowed.
  • Outcome-Based Evaluation Metrics *
      • Ground truth images: any format accepted (PNG, JPG, TIFF, etc.). Upload multiple views of the expected visualization.
      • Optional: reference visualization code in any format (e.g., Python, ParaView, Jupyter Notebook, MATLAB, R). Multiple files allowed.
      • Optional: correct answers to any questions in the task description.

Additional Information

About SciVisAgentBench

What is SciVisAgentBench?

SciVisAgentBench is a comprehensive evaluation framework for scientific data analysis and visualization agents. We aim to transform SciVis agents from experimental tools into reliable scientific instruments through systematic evaluation.

Taxonomy of SciVis agent evaluation

Taxonomy of SciVis agent evaluation, organized into two perspectives: outcome-based evaluation assessing the relationship between input specifications and final outputs while treating agents as black boxes, and process-based evaluation analyzing the agent's action path, decision rationale, and intermediate behaviors.
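
To make the two perspectives concrete: an outcome-based judge needs only the task inputs and the final artifacts, while a process-based judge also consumes the agent's intermediate steps. The record layout below is invented for this sketch and is not the benchmark's actual format:

from dataclasses import dataclass, field
from typing import List

@dataclass
class AgentStep:
    """One intermediate action taken by the agent (hypothetical layout)."""
    action: str     # e.g., "load_dataset", "set_transfer_function"
    rationale: str  # the agent's stated reason for taking the action

@dataclass
class CaseRun:
    task_description: str     # input seen by both evaluation perspectives
    output_images: List[str]  # final outputs; all an outcome-based judge needs
    trace: List[AgentStep] = field(default_factory=list)  # process-based only

# An outcome-based evaluator scores task_description against output_images,
# treating the agent as a black box; a process-based evaluator additionally
# inspects the trace for action paths and decision rationale.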

Why Contribute?

  • Help establish standardized evaluation metrics for visualization agents
  • Drive innovation in autonomous scientific visualization
  • Contribute to open science and reproducible research
  • Be recognized as a contributor to this community effort

Evaluation Taxonomy

Our benchmark evaluates agents across multiple dimensions, including outcome quality, process efficiency, and task complexity. We combine LLM-as-a-judge evaluation with quantitative metrics for robust assessment.
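
One plausible way to wire such a combination, sketched in Python under loud assumptions: the quantitative metric here is a simple per-pixel error (not necessarily what the benchmark uses), the LLM judge is a stub to be connected to an actual model, and the equal weighting is arbitrary:

import numpy as np

def pixel_error_score(rendered: np.ndarray, ground_truth: np.ndarray) -> float:
    """Quantitative metric: 1.0 for identical images, falling toward 0 as error grows."""
    mse = float(np.mean((rendered.astype(float) - ground_truth.astype(float)) ** 2))
    return 1.0 / (1.0 + mse)

def llm_judge_score(image_path: str, criteria: str) -> float:
    """Stub for an LLM-as-a-judge call returning a rating in [0, 1].
    A real implementation would send the image and criteria to an LLM."""
    raise NotImplementedError("connect an LLM provider here")

def combined_score(rendered, ground_truth, image_path, criteria,
                   w_quant=0.5, w_llm=0.5):
    """Weighted blend of the quantitative metric and the LLM judgment."""
    return (w_quant * pixel_error_score(rendered, ground_truth)
            + w_llm * llm_judge_score(image_path, criteria))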

See our GitHub repository for evaluation examples and deployment guides.

Team

The core team comprises researchers from the University of Notre Dame and Lawrence Livermore National Laboratory. Main contributors include Kuangshi Ai (kai@nd.edu), Shusen Liu (liu42@llnl.gov), Kaiyuan Tang (ktang2@nd.edu), and Haichao Miao (miao1@llnl.gov).

Contributors

We are grateful to all contributors who have helped build this benchmark.

Contributor | Institution | # of Questions | Subjects
No contributions yet. Be the first to contribute!