🎯 SciVisAgentBench Evaluation Report

claude_code
Generated: 2026-03-12T14:42:06.938602

📊 Overall Performance

Overall Score

53.8%
113.0/210 Points

Test Cases

5/9
Completed Successfully

Avg Vision Score

0.0%
Visualization Quality
0/0

PSNR (Scaled)

N/A
Peak SNR (0/5 valid)

SSIM (Scaled)

N/A
Structural Similarity

LPIPS (Scaled)

N/A
Perceptual Distance

Completion Rate

55.6%
Tasks Completed
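The report does not say which five cases count as completed or how the 113.0 points are tallied. One reading that reproduces the headline numbers is sketched below. It assumes a case counts as completed when its rule-based score is above zero, and that only completed cases contribute points to the overall score. Both rules are assumptions inferred from the per-case figures in this report, not a documented definition.

```python
# Per-case figures copied from this report:
# name -> (rule_based_score, total_points, max_points)
CASES = {
    "brain": (7.0, 27.0, 30),
    "cylinder": (9.0, 29.0, 30),
    "dragon": (0.0, 0.0, 10),
    "isabel": (8.0, 18.0, 40),
    "molecule": (0.0, 0.0, 10),
    "moons": (0.0, 0.0, 10),
    "noisyTerrain": (9.0, 9.0, 10),
    "ocean": (0.0, 10.0, 40),
    "QMCPACK": (10.0, 30.0, 30),
}


def summarize(cases):
    # Assumption: "completed" means a nonzero rule-based score.
    completed = [name for name, (rule, _, _) in cases.items() if rule > 0]
    # Assumption: only completed cases contribute points.
    points = sum(cases[name][1] for name in completed)
    max_points = sum(max_pts for _, _, max_pts in cases.values())
    return {
        "completed": len(completed),
        "total_cases": len(cases),
        "points": points,
        "max_points": max_points,
        "overall_pct": round(100 * points / max_points, 1),
        "completion_pct": round(100 * len(completed) / len(cases), 1),
    }


summary = summarize(CASES)
```

Under these assumptions the 10 text Q&A points earned on ocean are left out of the total, which is why the overall figure is lower than a plain sum of all nine case scores.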

ℹ️ About Scaled Metrics

Scaled metrics account for completion rate to enable fair comparison across different evaluation modes. Formulas:

PSNR_scaled = (completed_cases / total_cases) × avg(PSNR)
SSIM_scaled = (completed_cases / total_cases) × avg(SSIM)
LPIPS_scaled = 1.0 - (completed_cases / total_cases) × (1.0 - avg(LPIPS))

Cases with infinite PSNR (perfect match) are excluded from the PSNR calculation.
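The scaled-metric formulas can be sketched in a few lines of Python. The function name `scaled_metrics` and the choice to return `None` when no valid values exist (shown as N/A in this report) are assumptions made for illustration, not the benchmark's actual code.

```python
import math
from statistics import mean


def scaled_metrics(psnr, ssim, lpips, completed_cases, total_cases):
    """Apply the completion-rate scaling to lists of per-case metric values."""
    ratio = completed_cases / total_cases
    # Infinite PSNR (perfect match) is excluded from the PSNR average.
    finite_psnr = [value for value in psnr if math.isfinite(value)]
    return {
        "psnr": ratio * mean(finite_psnr) if finite_psnr else None,
        "ssim": ratio * mean(ssim) if ssim else None,
        # LPIPS is a distance (lower is better), so the scaling is applied
        # to (1 - LPIPS) and then flipped back.
        "lpips": 1.0 - ratio * (1.0 - mean(lpips)) if lpips else None,
    }


# This run has no valid image metrics, so every scaled value is None (N/A).
this_run = scaled_metrics([], [], [], completed_cases=5, total_cases=9)
```

Because LPIPS is flipped before scaling, a run with a lower completion rate is pushed toward the worst distance of 1.0 rather than toward 0.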

🔧 Configuration

Provider: anthropic
Model: claude-sonnet-4-5
Base URL: https://livai-api.llnl.gov
Input price: $3.00 per 1M tokens
Output price: $15.00 per 1M tokens
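The two dollar amounts line up with the per-case "Total Cost" figures if they are read as prices per one million input tokens and per one million output tokens. That reading is an assumption, and the sketch below checks it against token counts reported later in this document. The name `case_cost` is hypothetical.

```python
# Assumption: prices are USD per 1,000,000 tokens.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def case_cost(input_tokens, output_tokens):
    """Estimate the cost of one test case in USD from its token counts."""
    total = input_tokens * INPUT_PRICE_PER_MTOK + output_tokens * OUTPUT_PRICE_PER_MTOK
    return total / 1_000_000


# Token counts copied from the brain and moons cases in this report.
brain_cost = round(case_cost(1_175_187, 22_854), 4)
moons_cost = round(case_cost(5_065_982, 33_341), 4)
```

Input tokens dominate the cost in every case here, since each run consumes hundreds of thousands to millions of input tokens against only tens of thousands of output tokens.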

📝 brain

27.0/30 (90.0%)

📋 Task Description

1. Load the file "brain/data/brain.vti". It is a symmetric tensor field, where the (1,1), (1,2) and (2,2) components of the tensor are respectively given by the arrays A, B, and D.
2. Compute degenerate points of the tensor field.
3. Save the degenerate points as "brain/results/{agent_mode}/brain.vtk" in legacy VTK format. Label the type of degenerate point for each point in an array called DegeneracyType. Use a value of 0 for trisectors and 1 for wedges.
4. Analyze the visualization and answer the following questions:
   Q1: Are there more trisectors than wedges? (yes/no)
   Q2: Out of all degenerate points, the sum of one point's coordinates is the highest. What is this highest sum, rounded to the nearest integer? (A) 124 (B) 136 (C) 148 (D) 160
   Save the answers to the analysis questions in plain text as "brain/results/{agent_mode}/answers.txt".
Do not save any files other than the specified result files.

Score

7.0/10 (70.0%)
evaluateDegeneratePoints 7.0/10 (70.0%)

Score

20/20 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: yes
2. Q2 correct answer: (B)

Agent's Answers

Q1: yes
Q2: (B) 136

Judge's Evaluation

Evaluation:

Criterion 1 (Q1): The provided answer is "yes", which exactly matches the correct answer "yes". Score: 10/10.
Criterion 2 (Q2): The provided answer is "(B) 136". The criterion specifies the correct answer as "(B)"; the selected option matches exactly. The additional numeric value "136" does not contradict the choice and does not change the selected option. Score: 10/10.
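The judge's reasoning compares the selected option letter (or the yes/no word) and ignores any extra text after it, awarding 10 points per matching question. A rule-based approximation of that comparison is sketched below. It is not the benchmark's judge, whose implementation is not shown in this report, and the names `normalize` and `score_answers` are hypothetical.

```python
import re

POINTS_PER_QUESTION = 10  # each question is scored out of 10 in this report


def normalize(answer):
    """Reduce an answer to its option letter, or to the lowercased text."""
    text = answer.strip().lower()
    # Accept "(B)", "B", or "(B) 136"; anything else is compared as text.
    match = re.match(r"\(?([a-d])\)?(?:\s|$)", text)
    return match.group(1) if match else text


def score_answers(correct, given):
    """Sum 10 points for every question whose normalized answers agree."""
    return sum(
        POINTS_PER_QUESTION
        for expected, actual in zip(correct, given)
        if normalize(expected) == normalize(actual)
    )


brain_score = score_answers(["yes", "(B)"], ["yes", "(B) 136"])
```

Applied to the answer pairs listed in this report, the sketch gives the same totals as the judge for all five cases that include questions.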

📊 Detailed Metrics

Rule-Based Score
7.0/10
70.0%
Text Q&A Score
20/20
100.0%
Input Tokens
1,175,187
Output Tokens
22,854
Total Tokens
1,198,041
Total Cost
$3.8684

📝 cylinder

29.0/30 (96.7%)

📋 Task Description

1. Please load the file "cylinder/data/cylinder.vti".
2. Apply persistence simplification of 0.01 to the Speed field.
3. Compute the Morse-Smale segmentation of the simplified Speed field.
4. Save the Morse-Smale segmentation as "cylinder/results/{agent_mode}/cylinder.vti". It should have a point array called Partition. For each point x, the array "Partition" should store the id number of the region in the segmentation that x belongs to.
5. Analyze the visualization and answer the following questions:
   Q1: How many unique partition regions are there? (A) 152 (B) 163 (C) 174 (D) 185
   Q2: How many points are in the largest partition region? (A) 6879 (B) 7968 (C) 8796 (D) 9687
   Save the answers to the analysis questions in plain text as "cylinder/results/{agent_mode}/answers.txt".
Do not save any files other than the specified result files.

Score

9.0/10 (90.0%)
evaluateMSSEgmentation 9.0/10 (90.0%)

Score

20/20 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: (A)
2. Q2 correct answer: (D)

Agent's Answers

Q1: A
Q2: D

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: A): The provided answer for Q1 is 'A', which exactly matches the specified correct answer. Score: 10/10.
Criterion 2 (Q2 correct answer: D): The provided answer for Q2 is 'D', which exactly matches the specified correct answer. Score: 10/10.
Total: 20/20. Both answers fully meet the evaluation criteria with exact matches.

📊 Detailed Metrics

Rule-Based Score
9.0/10
90.0%
Text Q&A Score
20/20
100.0%
Input Tokens
2,387,572
Output Tokens
25,122
Total Tokens
2,412,694
Total Cost
$7.5395

📝 dragon

❌ FAILED
0/10 (0.0%)

📋 Task Description

1. Load the dataset from "dragon/data/dragon.vtu".
2. Compute the Morse-Smale complex on the scalar field named "density". Make sure 1-Separatrices are computed.
3. Compute the critical points on the previous elevation scalar field.
4. Save the critical points as "dragon/results/{agent_mode}/dragon.vtk" in legacy VTK format.
   - The output should contain the critical points as point dataset
   - Include an array called "CriticalType" that labels each point according to what type of critical type it is. Use the following convention:
     * 0 for minima
     * 1 for 1-saddles
     * 2 for 2-saddles
     * 3 for maxima
   - The point coordinates should be in world coordinates
Do not save any files other than the specified result files.

Score

0/10 (0.0%)
evaluateDragonCriticalPoints 0/10 (0.0%)
2.0

📊 Detailed Metrics

Rule-Based Score
0/10
0.0%
Input Tokens
2,256,039
Output Tokens
18,202
Total Tokens
2,274,241
Total Cost
$7.0411

📝 isabel

⚠️ LOW SCORE
18.0/40 (45.0%)

📋 Task Description

1. Load the file "isabel/data/isabel.vti".
2. Apply persistent simplification to the field "sf" with a persistence threshold of 0.04.
3. Compute the merge tree of the simplified field.
4. Save the nodes of the merge tree as "isabel/results/{agent_mode}/isabel_nodes.vtk" in legacy VTK format. This file should have two point arrays. One should be called "CriticalType" and should store the type of critical point for each node. It should follow the following convention: 0: minima. 1: 1-saddles. 2: 2-saddles. 3: maxima. 4: degenerate critical points. The other point array should be called "Scalar" and should contain the scalar field value at each point in the merge tree.
5. Save the edges of the merge tree as "isabel/results/{agent_mode}/isabel_edges.vtk" in legacy VTK format. The file should store each edge as a separate cell with type vtkLine.
6. Analyze the visualization and answer the following questions:
   Q1: The parent node of the leaf (377, 265, 0) has coordinates (x,y,z). What is x+y+z? (A) 627 (B) 854 (C) 992 (D) 1039
   Q2: How many edges are there in the merge tree? (A) 154 (B) 195 (C) 204 (D) 254
   Q3: What is the highest scalar field value of a minimum, rounded to the nearest whole number? (A) 12 (B) 26 (C) 31 (D) 58
   Save the answers to the analysis questions in plain text as "isabel/results/{agent_mode}/answers.txt".
Do not save any files other than the specified result files.

Score

8.0/10 (80.0%)
evaluateMergetree 8.0/10 (80.0%)

Score

10/30 (33.3%)

Questions & Correct Answers

1. Q1 correct answer: (A)
2. Q2 correct answer: (B)
3. Q3 correct answer: (C)

Agent's Answers

Q1: (B)
Q2: (B)
Q3: (B)

Judge's Evaluation

Evaluation:

Evaluation against the provided correct answers:
1) Q1 (Correct: A; Given: B). Score: 0/10. The provided answer does not match the required correct option (A). Therefore it fails to meet the criterion.
2) Q2 (Correct: B; Given: B). Score: 10/10. The provided answer exactly matches the required correct option (B). Fully meets the criterion.
3) Q3 (Correct: C; Given: B). Score: 0/10. The provided answer does not match the required correct option (C). Therefore it fails to meet the criterion.
Total: 0 + 10 + 0 = 10/30.

📊 Detailed Metrics

Rule-Based Score
8.0/10
80.0%
Text Q&A Score
10/30
33.3%
Input Tokens
2,801,574
Output Tokens
29,307
Total Tokens
2,830,881
Total Cost
$8.8443

📝 molecule

❌ FAILED
0/10 (0.0%)

📋 Task Description

1. Load the data file "molecule/data/molecule.vti".
2. Compute the Morse-Smale segmentation on the scalar field named "log(s)".
3. Save the Morse-Smale segmentation as "molecule/results/{agent_mode}/molecule.vti". It should have a point array called "Segmentation". For each point x, the array "Segmentation" should store the id number of the region in the segmentation that x belongs to.
Do not save any files other than the specified result files.

Score

0/10 (0.0%)
evaluateMoleculeSegmentation 0.0/10 (0.0%)

📊 Detailed Metrics

Rule-Based Score
0/10
0.0%
Input Tokens
870,712
Output Tokens
12,046
Total Tokens
882,758
Total Cost
$2.7928

📝 moons

❌ FAILED
0/10 (0.0%)

📋 Task Description

1. Load the data file "moons/data/moons.vti".
2. Apply topological simplification to the field "SplatterValues" with a persistence threshold of 10.
3. Compute the Morse-Smale segmentation on the simplified scalar field.
4. Save only the Ascending Manifold as "moons/results/{agent_mode}/moons.vti". It should have a point array called "AscendingManifold". For each point x, the array "AscendingManifold" should store the id number of the region that x belongs to.
Do not save any files other than the specified result files.

Score

0/10 (0.0%)
evaluateMoonAscendingManifold 0/10 (0.0%)
Missing files: RS: /home/liu42/gitRepo/2026_AgentBench/SciVisAgentBench/SciVisAgentBench-tasks/topology/moons/results/claude_code_claude-sonnet-4-5_exp1/moons.vti

📊 Detailed Metrics

Rule-Based Score
0/10
0.0%
Input Tokens
5,065,982
Output Tokens
33,341
Total Tokens
5,099,323
Total Cost
$15.6981

📝 noisyTerrain

9.0/10 (90.0%)

📋 Task Description

1. Load the dataset from "noisyTerrain/data/noisyTerrain.vtu".
2. Compute the persistence diagram on the scalar field named "Blend".
3. Apply a threshold to filter out pairs with persistence value less than 1.
4. Save the persistence diagram as "noisyTerrain/results/{agent_mode}/noisyTerrain.vtk" in legacy VTK format.
   - The output should contain the points in the persistence diagram as point data, and each persistence pair is represented as a cell.
   - Include the following three scalar arrays with the given names and purposes:
     * "Birth" array: store the birth value of each pair.
     * "Persistence" array: store the persistence value of each pair.
     * "IsFinite" array: use 1 to mark finite persistence and 0 to mark infinite persistence.
Do not save any files other than the specified result files.
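The thresholding and array layout asked for in this task can be illustrated without any topology library. The sketch below assumes the persistence pairs are already available as (birth, death) tuples, takes persistence as death minus birth, and represents an infinite pair by an infinite death value. The function name `filter_pairs` and the sample values are hypothetical, and the computation of the diagram itself is out of scope.

```python
import math


def filter_pairs(pairs, threshold=1.0):
    """Drop pairs with persistence below threshold and build the three arrays."""
    rows = []
    for birth, death in pairs:
        persistence = death - birth
        if persistence < threshold:
            continue  # step 3: filter out low-persistence pairs
        rows.append(
            {
                "Birth": birth,
                "Persistence": persistence,
                # 1 marks finite persistence, 0 marks infinite persistence.
                "IsFinite": 0 if math.isinf(persistence) else 1,
            }
        )
    return rows


# Hypothetical pairs: one below the threshold, one finite, one infinite.
sample = filter_pairs([(0.0, 0.5), (1.0, 3.5), (0.2, float("inf"))])
```

A pair whose persistence is exactly 1 is kept here, since the task only asks to remove pairs with persistence less than 1.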

Score

9.0/10 (90.0%)
evaluateNoisyTerrainPersistenceDiagram 9.0/10 (90.0%)

📊 Detailed Metrics

Rule-Based Score
9.0/10
90.0%
Input Tokens
871,373
Output Tokens
7,033
Total Tokens
878,406
Total Cost
$2.7196

📝 ocean

10/40 (25.0%)

📋 Task Description

1. Please load the asymmetric tensor field from "ocean/data/ocean.vti". The (1,1), (1,2), (2,1) and (2,2) entries are respectively given by the arrays A, B, C, and D.
2. Compute the eigenvector partition of the dataset.
3. Save the degenerate points as "ocean/results/{agent_mode}/ocean_points.vtk" in legacy VTK format. Include a point array called DegeneracyType which classifies each degenerate point. It should have a value of 0 for trisectors and 1 for wedges.
4. Save the partition information from the eigenvector partition as "ocean/results/{agent_mode}/ocean_eigenvector.vti" as VTK image data. It should have a point array called Partition that stores the region identifiers as follows: 0: W_{c,s}. 1: W_{r,s}. 2: W_{r,n}. 3: W_{c,n}
5. Compute the eigenvalue partition of the dataset.
6. Save the partition information from the eigenvalue partition as "ocean/results/{agent_mode}/ocean_eigenvalue.vti" as VTK image data. It should have a point array called Partition that stores the region identifiers as follows: 0: positive scaling. 1: counterclockwise rotation. 2: negative scaling. 3: clockwise rotation. 4: anisotropic stretching.
7. Analyze the visualization and answer the following questions:
   Q1: Are there more trisectors than wedges? (yes/no)
   Q2: How many points have the most common classification in the eigenvector partition? (A) 752342 (B) 802842 (C) 826348 (D) 994682
   Q3: Which is the least common classification in the eigenvalue partition? (A) Positive scaling (B) counterclockwise rotation (C) negative scaling (D) clockwise rotation
   Save the answers to the analysis questions in plain text as "ocean/results/{agent_mode}/answers.txt".
Do not save any files other than the specified result files.

Score

0/10 (0.0%)
evaluate2DAsymmetricTFTopology 0/10 (0.0%)
np.float64(1.0)

Score

10/30 (33.3%)

Questions & Correct Answers

1. Q1 correct answer: no
2. Q2 correct answer: (C)
3. Q3 correct answer: (C)

Agent's Answers

Q1: yes
Q2: (A) 5091
Q3: (C) negative scaling

Judge's Evaluation

Evaluation:

Q1 (score 0/10): The provided answer is "yes", but the evaluation criterion states the correct answer is "no". This is fully incorrect.
Q2 (score 0/10): The provided answer is "(A) 5091", but the evaluation criterion states the correct answer is "(C)". The selected option does not match the correct option, so it is incorrect.
Q3 (score 10/10): The provided answer is "(C) negative scaling", and the evaluation criterion states the correct answer is "(C)". The option matches and the description is consistent, so it is correct.

📊 Detailed Metrics

Rule-Based Score
0/10
0.0%
Text Q&A Score
10/30
33.3%
Input Tokens
857,892
Output Tokens
20,307
Total Tokens
878,199
Total Cost
$2.8783

📝 QMCPACK

30.0/30 (100.0%)

📋 Task Description

1. Please load the dataset from "QMCPACK/data/QMCPACK.vti".
2. Compute the critical points of the scalar field.
3. Save the critical points as "QMCPACK/results/{agent_mode}/QMCPACK.vtk" in legacy VTK format.
   - The output should contain the critical points as point data
   - Include an array called "CriticalType" that labels each point according to what type of critical type it is. Use the following convention:
     * 0 for minima
     * 1 for 1-saddles
     * 2 for 2-saddles
     * 3 for maxima
     * 4 for degenerate critical points
   - The point coordinates should be in index space (grid coordinates), not world coordinates
4. Analyze the visualization and answer the following questions:
   Q1: How many index 1 saddles are there: (A) 248 (B) 274 (C) 299 (D) 344
   Q2: What is the type of critical point closest to coordinates (4,58,12): (A) minimum (B) 1-saddle (C) 2-saddle (D) maximum
   Save the answers to the analysis questions in plain text as "QMCPACK/results/{agent_mode}/answers.txt".
Do not save any files other than the specified result files.
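For orientation, the kind of file this task asks for can be sketched as an ASCII legacy VTK polydata file with one vertex cell per critical point and an integer point array named CriticalType. The writer below is a minimal illustration in plain Python. The function name `legacy_vtk_points` and the sample points are hypothetical, and whether the benchmark's evaluator expects exactly this layout (vertex cells, ASCII encoding) is an assumption.

```python
def legacy_vtk_points(points, critical_types, title="critical points"):
    """Return the text of an ASCII legacy VTK file holding labeled points."""
    if len(points) != len(critical_types):
        raise ValueError("need exactly one CriticalType value per point")
    n = len(points)
    lines = [
        "# vtk DataFile Version 3.0",
        title,
        "ASCII",
        "DATASET POLYDATA",
        f"POINTS {n} float",
    ]
    lines += [f"{x} {y} {z}" for x, y, z in points]
    # One vertex cell per point; each cell stores a count (1) and a point id.
    lines.append(f"VERTICES {n} {2 * n}")
    lines += [f"1 {i}" for i in range(n)]
    # Point array following the 0-4 CriticalType convention of the task.
    lines.append(f"POINT_DATA {n}")
    lines.append("SCALARS CriticalType int 1")
    lines.append("LOOKUP_TABLE default")
    lines += [str(int(value)) for value in critical_types]
    return "\n".join(lines) + "\n"


# Hypothetical grid-index coordinates: one minimum (0) and one maximum (3).
text = legacy_vtk_points([(0, 0, 0), (4, 58, 12)], [0, 3])
```

The returned string can be written to the required path with an ordinary file write. In practice a VTK or ParaView writer would normally produce this file; the sketch only makes the expected structure visible.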

Score

10.0/10 (100.0%)
evaluateQmcpackCriticalPoints 10.0/10 (100.0%)

Score

20/20 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: (C)
2. Q2 correct answer: (D)

Agent's Answers

Q1: (C) 299
Q2: (D) maximum

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: (C)): The provided answer for Q1 is “(C) 299”, which matches the required correct option (C). Score: 10/10.
Criterion 2 (Q2 correct answer: (D)): The provided answer for Q2 is “(D) maximum”, which matches the required correct option (D). Score: 10/10.
Total: 20/20.

📊 Detailed Metrics

Rule-Based Score
10.0/10
100.0%
Text Q&A Score
20/20
100.0%
Input Tokens
636,134
Output Tokens
9,328
Total Tokens
645,462
Total Cost
$2.0483