🎯 SciVisAgentBench Evaluation Report

codex_cli Generated: 2026-03-16T18:24:12.279842

📊 Overall Performance

Overall Score

66.2%
139.0/210 Points

Test Cases

7/9
Completed Successfully

Avg Vision Score

0.0%
Visualization Quality
0/0

PSNR (Scaled)

N/A
Peak SNR (0/7 valid)

SSIM (Scaled)

N/A
Structural Similarity

LPIPS (Scaled)

N/A
Perceptual Distance

Completion Rate

77.8%
Tasks Completed
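
The headline figures above can be reproduced from the per-case totals listed later in this report. The following is a minimal sketch, assuming that the two cases marked FAILED are the ones excluded from the completion count; the names `CASES`, `FAILED`, and `summarize` are illustrative and not part of the benchmark.

```python
# (points earned, points possible) per case, copied from the case sections of this report.
CASES = {
    "brain": (9.0, 30),
    "cylinder": (29.0, 30),
    "dragon": (9.0, 10),
    "isabel": (39.0, 40),
    "molecule": (0.0, 10),
    "moons": (6.0, 10),
    "noisyTerrain": (0.0, 10),
    "ocean": (17.0, 40),
    "QMCPACK": (30.0, 30),
}

# Assumption: the cases flagged FAILED are the ones not counted as completed.
FAILED = {"molecule", "noisyTerrain"}


def summarize(cases, failed):
    """Return earned points, possible points, score %, completed count, completion %."""
    earned = sum(points for points, _ in cases.values())
    possible = sum(maximum for _, maximum in cases.values())
    completed = sum(1 for name in cases if name not in failed)
    score_pct = 100.0 * earned / possible
    completion_pct = 100.0 * completed / len(cases)
    return earned, possible, score_pct, completed, completion_pct


earned, possible, score_pct, completed, completion_pct = summarize(CASES, FAILED)
print(f"{earned}/{possible} points ({score_pct:.1f}%), "
      f"{completed}/{len(CASES)} completed ({completion_pct:.1f}%)")
```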

â„šī¸ About Scaled Metrics

Scaled metrics account for completion rate to enable fair comparison across different evaluation modes. Formula: PSNRscaled = (completed_cases / total_cases) × avg(PSNR), SSIMscaled = (completed_cases / total_cases) × avg(SSIM), LPIPSscaled = 1.0 - (completed_cases / total_cases) × (1.0 - avg(LPIPS)). Cases with infinite PSNR (perfect match) are excluded from the PSNR calculation.
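
The scaled-metric formulas can be sketched in a few lines of Python. This is an illustration rather than the benchmark's own code: the function name `scaled_metrics`, its argument layout, and the choice to return `None` when no valid values exist (shown as N/A in this report) are assumptions.

```python
import math


def _mean(values):
    """Average of a list, or None when the list is empty."""
    return sum(values) / len(values) if values else None


def scaled_metrics(psnr, ssim, lpips, completed_cases, total_cases):
    """Scale per-case image metrics by the completion rate.

    psnr, ssim, lpips: lists of per-case values for cases that produced an image.
    Infinite PSNR values (perfect matches) are dropped before averaging.
    """
    rate = completed_cases / total_cases
    avg_psnr = _mean([v for v in psnr if math.isfinite(v)])
    avg_ssim = _mean(ssim)
    avg_lpips = _mean(lpips)
    return {
        "psnr": None if avg_psnr is None else rate * avg_psnr,
        "ssim": None if avg_ssim is None else rate * avg_ssim,
        # LPIPS is a distance (lower is better), so the scaling is applied to 1 - LPIPS.
        "lpips": None if avg_lpips is None else 1.0 - rate * (1.0 - avg_lpips),
    }
```

With no valid image metrics, as in this run, every entry comes back as `None`.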

🔧 Configuration

Provider: openai
Model: gpt-5.2
N/A
Input price: $5.00 (per 1M tokens)
Output price: $15.00 (per 1M tokens)
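
The two dollar figures in the configuration appear to be prices per one million input and output tokens. That reading is an assumption, but it reproduces the Total Cost reported for each case from its token counts. A minimal sketch, with `case_cost` as a hypothetical helper name:

```python
# Assumed meaning of the configuration values: USD per 1,000,000 tokens.
INPUT_PRICE_PER_M = 5.00
OUTPUT_PRICE_PER_M = 15.00


def case_cost(input_tokens, output_tokens):
    """Estimated cost in USD, rounded to four decimals as in the report."""
    cost = (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
    return round(cost, 4)


# Token counts taken from the brain and cylinder sections of this report.
print(case_cost(1_642_320, 22_195))   # brain
print(case_cost(15_883_391, 45_241))  # cylinder
```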

📝 brain

âš ī¸ LOW SCORE
9.0/30 (30.0%)

📋 Task Description

1. Load the file "brain/data/brain.vti". It is a symmetric tensor field, where the (1,1), (1,2) and (2,2) components of the tensor are respectively given by the arrays A, B, and D.
2. Compute degenerate points of the tensor field.
3. Save the degenerate points as "brain/results/{agent_mode}/brain.vtk" in legacy VTK format. Label the type of degenerate point for each point in an array called DegeneracyType. Use a value of 0 for trisectors and 1 for wedges.
4. Analyze the visualization and answer the following questions:
   Q1: Are there more trisectors than wedges? (yes/no)
   Q2: Out of all degenerate points, the sum of one point's coordinates is the highest. What is this highest sum, rounded to the nearest integer? (A) 124 (B) 136 (C) 148 (D) 160
   Save the answers to the analysis questions in plain text as "brain/results/{agent_mode}/answers.txt".
Do not save any files other than the specified result files.

Score

9.0/10 (90.0%)
evaluateDegeneratePoints 9.0/10 (90.0%)

Score

0/20 (0.0%)

Questions & Correct Answers

1. Q1 correct answer: yes 2. Q2 correct answer: (B)

Agent's Answers

Q1: no Q2: 136

Judge's Evaluation

Evaluation:

Criterion 1 (Q1): The provided answer is "no", but the correct answer is "yes". This does not meet the criterion at all. Score: 0/10. Criterion 2 (Q2): The provided answer is "136", but the correct answer is "(B)". The response does not match the required option format and does not provide the correct choice. Score: 0/10. Total score: 0/20.
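
For Q2 the agent reported the numeric value 136, which is the value of option (B), but the judge compared against the option letter and awarded no credit. One way to avoid this kind of format mismatch is to map a computed value to its option letter before writing the answer file. The helper below is a hypothetical sketch, not part of the benchmark or of the agent under test.

```python
def to_option_letter(value, options):
    """Return the letter whose option equals `value`, or None if nothing matches."""
    for letter, candidate in options.items():
        if candidate == value:
            return letter
    return None


# Options for Q2 as listed in the task description.
OPTIONS_Q2 = {"A": 124, "B": 136, "C": 148, "D": 160}

letter = to_option_letter(136, OPTIONS_Q2)
answer_line = f"Q2: ({letter})" if letter else "Q2: no matching option"
print(answer_line)
```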

📊 Detailed Metrics

Rule-Based Score
9.0/10
90.0%
Text Q&A Score
0/20
0.0%
Input Tokens
1,642,320
Output Tokens
22,195
Total Tokens
1,664,515
Total Cost
$8.5445

📝 cylinder

29.0/30 (96.7%)

📋 Task Description

1. Please load the file "cylinder/data/cylinder.vti"
2. Apply persistence simplification of 0.01 to the Speed field.
3. Compute the Morse-Smale segmentation of the simplified Speed field.
4. Save the Morse-Smale segmentation as "cylinder/results/{agent_mode}/cylinder.vti". It should have a point array called Partition. For each point x, the array "Partition" should store the id number of the region in the segmentation that x belongs to.
5. Analyze the visualization and answer the following questions:
   Q1: How many unique partition regions are there? (A) 152 (B) 163 (C) 174 (D) 185
   Q2: How many points are in the largest partition region? (A) 6879 (B) 7968 (C) 8796 (D) 9687
   Save the answers to the analysis questions in plain text as "cylinder/results/{agent_mode}/answers.txt".
Do not save any files other than the specified result files.

Score

9.0/10 (90.0%)
evaluateMSSEgmentation 9.0/10 (90.0%)

Score

20/20 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: (A) 2. Q2 correct answer: (D)

Agent's Answers

Q1: A Q2: D

Judge's Evaluation

Evaluation:

Criterion 1 (Q1): The provided answer is A, which matches the correct answer (A). Score: 10/10. Criterion 2 (Q2): The provided answer is D, which matches the correct answer (D). Score: 10/10. Overall, both answers exactly match the expected correct options, so full credit is awarded.

📊 Detailed Metrics

Rule-Based Score
9.0/10
90.0%
Text Q&A Score
20/20
100.0%
Input Tokens
15,883,391
Output Tokens
45,241
Total Tokens
15,928,632
Total Cost
$80.0956

📝 dragon

9.0/10 (90.0%)

📋 Task Description

1. Load the dataset from "dragon/data/dragon.vtu".
2. Compute the Morse-Smale complex on the scalar field named "density". Make sure 1-Separatrices are computed.
3. Compute the critical points on the previous elevation scalar field.
4. Save the critical points as "dragon/results/{agent_mode}/dragon.vtk" in legacy VTK format.
   - The output should contain the critical points as point dataset
   - Include an array called "CriticalType" that labels each point according to what type of critical type it is. Use the following convention:
     * 0 for minima
     * 1 for 1-saddles
     * 2 for 2-saddles
     * 3 for maxima
   - The point coordinates should be in world coordinates
Do not save any files other than the specified result files.

Score

9.0/10 (90.0%)
evaluateDragonCriticalPoints 9.0/10 (90.0%)

📊 Detailed Metrics

Rule-Based Score
9.0/10
90.0%
Input Tokens
4,970,864
Output Tokens
21,463
Total Tokens
4,992,327
Total Cost
$25.1763

📝 isabel

39.0/40 (97.5%)

📋 Task Description

1. Load the file "isabel/data/isabel.vti".
2. Apply persistent simplification to the field "sf" with a persistence threshold of 0.04
3. Compute the merge tree of the simplified field.
4. Save the nodes of the merge tree as "isabel/results/{agent_mode}/isabel_nodes.vtk" in legacy VTK format. This file should have two point arrays. One should be called "CriticalType" and should store the type of critical point for each node. It should follow the following convention: 0: minima. 1: 1-saddles. 2: 2-saddles. 3: maxima. 4: degenerate critical points. The other point array should be called "Scalar" and should contain the scalar field value at each point in the merge tree.
5. Save the edges of the merge tree as "isabel/results/{agent_mode}/isabel_edges.vtk" in legacy VTK format. The file should store each edge as a separate cell with type vtkLine.
6. Analyze the visualization and answer the following questions:
   Q1: The parent node of the leaf (377, 265, 0) has coordinates (x,y,z). What is x+y+z? (A) 627 (B) 854 (C) 992 (D) 1039
   Q2: How many edges are there in the merge tree? (A) 154 (B) 195 (C) 204 (D) 254
   Q3: What is the highest scalar field value of a minimum, rounded to the nearest whole number? (A) 12 (B) 26 (C) 31 (D) 58
   Save the answers to the analysis questions in plain text as "isabel/results/{agent_mode}/answers.txt".
Do not save any files other than the specified result files.

Score

9.0/10 (90.0%)
evaluateMergetree 9.0/10 (90.0%)

Score

30/30 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: (A) 2. Q2 correct answer: (B) 3. Q3 correct answer: (C)

Agent's Answers

Q1: A Q2: B Q3: C

Judge's Evaluation

Evaluation:

Criterion 1 (Q1): The provided answer is A, which matches the correct answer (A). Score: 10/10. Criterion 2 (Q2): The provided answer is B, which matches the correct answer (B). Score: 10/10. Criterion 3 (Q3): The provided answer is C, which matches the correct answer (C). Score: 10/10. All answers exactly match the specified correct answers, so full credit is awarded for each criterion.

📊 Detailed Metrics

Rule-Based Score
9.0/10
90.0%
Text Q&A Score
30/30
100.0%
Input Tokens
12,823,080
Output Tokens
28,616
Total Tokens
12,851,696
Total Cost
$64.5446

📝 molecule

❌ FAILED
0/10 (0.0%)

📋 Task Description

1. Load the data file "molecule/data/molecule.vti".
2. Compute the Morse-Smale segmentation on the scalar field named "log(s)".
3. Save the Morse-Smale segmentation as "molecule/results/{agent_mode}/molecule.vti". It should have a point array called "Segmentation". For each point x, the array "Segmentation" should store the id number of the region in the segmentation that x belongs to.
Do not save any files other than the specified result files.

Score

0/10 (0.0%)
evaluateMoleculeSegmentation 0.0/10 (0.0%)

📊 Detailed Metrics

Rule-Based Score
0/10
0.0%
Input Tokens
1,901,411
Output Tokens
17,365
Total Tokens
1,918,776
Total Cost
$9.7675

📝 moons

6.0/10 (60.0%)

📋 Task Description

1. Load the data file "moons/data/moons.vti".
2. Apply topological simplification to the field "SplatterValues" with a persistence threshold of 10.
3. Compute the Morse-Smale segmentation on the simplified scalar field.
4. Save only the Ascending Manifold as "moons/results/{agent_mode}/moons.vti". It should have a point array called "AscendingManifold". For each point x, the array "AscendingManifold" should store the id number of the region that x belongs to.
Do not save any files other than the specified result files.

Score

6.0/10 (60.0%)
evaluateMoonAscendingManifold 6.0/10 (60.0%)

📊 Detailed Metrics

Rule-Based Score
6.0/10
60.0%
Input Tokens
617,574
Output Tokens
11,707
Total Tokens
629,281
Total Cost
$3.2635

📝 noisyTerrain

❌ FAILED
0/10 (0.0%)

📋 Task Description

1. Load the dataset from "noisyTerrain/data/noisyTerrain.vtu".
2. Compute the persistence diagram on the scalar field named "Blend".
3. Apply a threshold to filter out pairs with persistence value less than 1.
4. Save the persistence diagram as "noisyTerrain/results/{agent_mode}/noisyTerrain.vtk" in legacy VTK format.
   - The output should contain the points in the persistence diagram as point data, and each persistence pair is represented as a cell.
   - Include the following three scalar arrays with the given names and purposes:
     * "Birth" array: store the birth value of each pair.
     * "Persistence" array: store the persistence value of each pair.
     * "IsFinite" array: use 1 to mark finite persistence and 0 to mark infinite persistence.
Do not save any files other than the specified result files.
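
The thresholding step and the three requested arrays can be illustrated independently of any topology library. The sketch below assumes the persistence pairs are already available as (birth, death) tuples, and that a pair that never dies is represented by an infinite death value. Both are assumptions made for illustration; a given toolkit may encode such pairs differently. `filter_pairs` is a hypothetical helper.

```python
import math


def filter_pairs(pairs, min_persistence=1.0):
    """Drop pairs with persistence below the threshold and build the array values.

    pairs: iterable of (birth, death) tuples; death may be math.inf (assumption).
    Returns one dict per kept pair with the Birth, Persistence and IsFinite values.
    """
    rows = []
    for birth, death in pairs:
        persistence = death - birth
        if persistence < min_persistence:
            continue  # filtered out by the threshold
        rows.append({
            "Birth": birth,
            "Persistence": persistence,
            "IsFinite": 1 if math.isfinite(death) else 0,
        })
    return rows


# Made-up example pairs, not values from the noisyTerrain dataset.
example = [(0.0, 0.5), (1.0, 3.0), (0.0, math.inf)]
for row in filter_pairs(example):
    print(row)
```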

Score

0/10 (0.0%)
evaluateNoisyTerrainPersistenceDiagram 0.0/10 (0.0%)

📊 Detailed Metrics

Rule-Based Score
0/10
0.0%
Input Tokens
896,068
Output Tokens
12,536
Total Tokens
908,604
Total Cost
$4.6684

📝 ocean

âš ī¸ LOW SCORE
17.0/40 (42.5%)

📋 Task Description

1. Please load the asymmetric tensor field from "ocean/data/ocean.vti". The (1,1), (1,2), (2,1) and (2,2) entries are respectively given by the arrays A, B, C, and D
2. Compute the eigenvector partition of the dataset.
3. Save the degenerate points as "ocean/results/{agent_mode}/ocean_points.vtk" in legacy VTK format. Include a point array called DegeneracyType which classifies each degenerate point. It should have a value of 0 for trisectors and 1 for wedges.
4. Save the partition information from the eigenvector partition as "ocean/results/{agent_mode}/ocean_eigenvector.vti" as VTK image data. It should have a point array called Partition that stores the region identifiers as follows: 0: W_{c,s}. 1: W_{r,s}. 2: W_{r,n}. 3: W_{c,n}
5. Compute the eigenvalue partition of the dataset.
6. Save the partition information from the eigenvalue partition as "ocean/results/{agent_mode}/ocean_eigenvalue.vti" as VTK image data. It should have a point array called Partition that stores the region identifiers as follows: 0: positive scaling. 1: counterclockwise rotation. 2: negative scaling. 3: clockwise rotation. 4: anisotropic stretching.
7. Analyze the visualization and answer the following questions:
   Q1: Are there more trisectors than wedges? (yes/no)
   Q2: How many points have the most common classification in the eigenvalue partition? (A) 752342 (B) 802842 (C) 826348 (D) 994682
   Q3: Which is the least common classification in the eigenvalue partition? (A) Positive scaling (B) counterclockwise rotation (C) negative scaling (D) clockwise rotation
   Save the answers to the analysis questions in plain text as "ocean/results/{agent_mode}/answers.txt".
Do not save any files other than the specified result files.
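
The eigenvalue partition in steps 5 and 6 can be illustrated per point with one common decomposition of a 2D asymmetric tensor into isotropic scaling, rotation, and anisotropic stretching, labelling each point by the dominant component. The sketch below is one possible reading, not the benchmark's reference implementation. The function name and the tie-breaking order are assumptions, and the labels follow the identifiers listed in the task.

```python
import math


def eigenvalue_partition(a, b, c, d):
    """Classify the tensor [[a, b], [c, d]] by its dominant component.

    Returns 0: positive scaling, 1: counterclockwise rotation, 2: negative scaling,
    3: clockwise rotation, 4: anisotropic stretching.
    """
    gamma_d = 0.5 * (a + d)                   # isotropic scaling strength (signed)
    gamma_r = 0.5 * (c - b)                   # rotation strength; positive is counterclockwise
    gamma_s = 0.5 * math.hypot(a - d, b + c)  # anisotropic stretching strength (>= 0)
    candidates = [
        (abs(gamma_d), 0 if gamma_d > 0 else 2),
        (abs(gamma_r), 1 if gamma_r > 0 else 3),
        (gamma_s, 4),
    ]
    # max() keeps the first maximal entry, so ties favour scaling, then rotation
    # (an arbitrary choice made for this sketch).
    return max(candidates, key=lambda item: item[0])[1]


print(eigenvalue_partition(1.0, 0.0, 0.0, 1.0))   # pure positive scaling
print(eigenvalue_partition(0.0, -1.0, 1.0, 0.0))  # pure counterclockwise rotation
```

Applied to the A, B, C, D arrays point by point, the resulting labels could then be counted to answer a question such as Q3.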

Score

7.0/10 (70.0%)
evaluate2DAsymmetricTFTopology 7.0/10 (70.0%)

Score

10/30 (33.3%)

Questions & Correct Answers

1. Q1 correct answer: no 2. Q2 correct answer: (C) 3. Q3 correct answer: (C)

Agent's Answers

Q1: yes Q2: A Q3: C

Judge's Evaluation

Evaluation:

Criterion 1 (Q1): Provided answer is "yes" but the correct answer is "no". This does not meet the criterion. Score: 0/10. Criterion 2 (Q2): Provided answer is "A" but the correct answer is "C". This does not meet the criterion. Score: 0/10. Criterion 3 (Q3): Provided answer is "C" and the correct answer is "C". This fully meets the criterion. Score: 10/10. Total: 0 + 0 + 10 = 10/30.

📊 Detailed Metrics

Rule-Based Score
7.0/10
70.0%
Text Q&A Score
10/30
33.3%
Input Tokens
2,689,873
Output Tokens
29,910
Total Tokens
2,719,783
Total Cost
$13.8980

📝 QMCPACK

30.0/30 (100.0%)

📋 Task Description

1. Please load the dataset from "QMCPACK/data/QMCPACK.vti".
2. Compute the critical points of the scalar field.
3. Save the critical points as "QMCPACK/results/{agent_mode}/QMCPACK.vtk" in legacy VTK format.
   - The output should contain the critical points as point data
   - Include an array called "CriticalType" that labels each point according to what type of critical type it is. Use the following convention:
     * 0 for minima
     * 1 for 1-saddles
     * 2 for 2-saddles
     * 3 for maxima
     * 4 for degenerate critical points
   - The point coordinates should be in index space (grid coordinates), not world coordinates
4. Analyze the visualization and answer the following questions:
   Q1: How many index 1 saddles are there: (A) 248 (B) 274 (C) 299 (D) 344
   Q2: What is the type of critical point closest to coordinates (4,58,12): (A) minimum (B) 1-saddle (C) 2-saddle (D) maximum
   Save the answers to the analysis questions in plain text as "QMCPACK/results/{agent_mode}/answers.txt".
Do not save any files other than the specified result files.
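
The requirement that coordinates be in index space rather than world space amounts to a small conversion when a tool reports world coordinates. The sketch below assumes an axis-aligned image grid with no rotation, for which index = (world - origin) / spacing. The origin and spacing in the example are made-up values, not those of the QMCPACK dataset, and `world_to_index` is a hypothetical helper.

```python
def world_to_index(point, origin, spacing):
    """Convert a world-space point to grid (index) coordinates.

    Assumes an axis-aligned grid: index = (world - origin) / spacing per axis.
    """
    return tuple((p - o) / s for p, o, s in zip(point, origin, spacing))


# Hypothetical grid parameters, for illustration only.
origin = (0.0, 0.0, 0.0)
spacing = (0.5, 2.0, 3.0)
print(world_to_index((2.0, 4.0, 6.0), origin, spacing))
```

When the grid has unit spacing and a zero origin, the two coordinate systems coincide and no conversion is needed.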

Score

10.0/10 (100.0%)
evaluateQmcpackCriticalPoints 10.0/10 (100.0%)

Score

20/20 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: (C) 2. Q2 correct answer: (D)

Agent's Answers

Q1: C (count=299) Q2: D (maximum)

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: C): The provided answer for Q1 is "C (count=299)", which matches the required correct answer (C). Score: 10/10. Criterion 2 (Q2 correct answer: D): The provided answer for Q2 is "D (maximum)", which matches the required correct answer (D). Score: 10/10. Total: 20/20. Both answers exactly match the specified correct options; the added parenthetical notes do not conflict with correctness.

📊 Detailed Metrics

Rule-Based Score
10.0/10
100.0%
Text Q&A Score
20/20
100.0%
Input Tokens
2,666,357
Output Tokens
14,716
Total Tokens
2,681,073
Total Cost
$13.5525