🎯 SciVisAgentBench Evaluation Report

topopilot_mcp
Generated: 2026-03-10T16:09:11.780214

📊 Overall Performance

Overall Score

32.6%
58.75/180 Points

Test Cases

5/9
Completed Successfully

Avg Vision Score

0.0%
Visualization Quality
0/0

PSNR (Scaled)

N/A
Peak SNR (0/5 valid)

SSIM (Scaled)

N/A
Structural Similarity

LPIPS (Scaled)

N/A
Perceptual Distance

Completion Rate

55.6%
Tasks Completed
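The headline figures can be re-derived from the per-case scores listed later in this report. The sketch below is only an arithmetic check. The rule that a case counts as completed when it earned more than zero points is an assumption (the report does not define completion), although it reproduces the 5/9 figure here.

```python
# Per-case (points earned, points possible), copied from the case headers in this report.
case_scores = {
    "brain": (19.0, 30),
    "cylinder": (10.0, 30),
    "dragon": (0.0, 10),
    "isabel": (10.0, 40),
    "molecule": (0.0, 10),
    "moons": (0.0, 10),
    "noisyTerrain": (0.0, 10),
    "ocean": (9.75, 10),
    "QMCPACK": (10.0, 30),
}

earned = sum(e for e, _ in case_scores.values())
possible = sum(p for _, p in case_scores.values())
overall_pct = 100.0 * earned / possible

# Assumption: a case is "completed" when it earned more than zero points.
completed = sum(1 for e, _ in case_scores.values() if e > 0)
completion_pct = 100.0 * completed / len(case_scores)

print(f"{earned}/{possible} points ({overall_pct:.1f}%)")
print(f"{completed}/{len(case_scores)} completed ({completion_pct:.1f}%)")
```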

â„šī¸ About Scaled Metrics

Scaled metrics account for completion rate to enable fair comparison across different evaluation modes. Formulas:
PSNR_scaled = (completed_cases / total_cases) × avg(PSNR)
SSIM_scaled = (completed_cases / total_cases) × avg(SSIM)
LPIPS_scaled = 1.0 - (completed_cases / total_cases) × (1.0 - avg(LPIPS))
Cases with infinite PSNR (perfect match) are excluded from the PSNR calculation.
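A minimal Python sketch of these scaling formulas follows. The function name, its argument layout, and the choice to return None when there is nothing to average (mirroring the N/A entries above) are assumptions for illustration, not the benchmark's actual implementation.

```python
import math


def scaled_metrics(psnr_values, ssim_values, lpips_values, completed_cases, total_cases):
    """Scale averaged image metrics by the completion rate.

    Each *_values list holds one measurement per case that produced an image.
    Returns a dict; an entry is None when no values are available to average.
    """
    rate = completed_cases / total_cases

    def mean(values):
        return sum(values) / len(values) if values else None

    # Infinite PSNR (perfect match) is excluded from the PSNR average.
    avg_psnr = mean([v for v in psnr_values if math.isfinite(v)])
    avg_ssim = mean(ssim_values)
    avg_lpips = mean(lpips_values)

    return {
        "psnr": None if avg_psnr is None else rate * avg_psnr,
        "ssim": None if avg_ssim is None else rate * avg_ssim,
        # LPIPS is a distance (lower is better), so the scaling is applied to 1 - LPIPS.
        "lpips": None if avg_lpips is None else 1.0 - rate * (1.0 - avg_lpips),
    }
```

With no valid image metrics, as in this run, every entry would come back as None.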

🔧 Configuration

Provider: anthropic
Model: claude-sonnet-4-5
Base URL: https://api.anthropic.com
Input price: $3.00 per 1M tokens
Output price: $15.00 per 1M tokens
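The per-case Total Cost figures later in this report appear to follow from the token counts and the $3.00 and $15.00 entries above. The sketch below assumes those entries are the input and output prices per one million tokens; that reading is an inference, but it reproduces the reported costs. The constant and function names are illustrative.

```python
INPUT_PRICE_PER_MTOK = 3.00    # USD; assumed to be the price per 1M input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # USD; assumed to be the price per 1M output tokens


def run_cost(input_tokens, output_tokens):
    """Return the USD cost of one run under the per-million-token assumption."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


# Token counts reported for the brain case.
print(f"${run_cost(140_681, 3_586):.4f}")  # prints $0.4758
```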

📝 brain

19.0/30 (63.3%)

📋 Task Description

1. Load the file "brain/data/brain.vti". It is a symmetric tensor field, where the (1,1), (1,2) and (2,2) components of the tensor are respectively given by the arrays A, B, and D.
2. Compute degenerate points of the tensor field.
3. Save the degenerate points as "brain/results/{agent_mode}/brain.vtk" in legacy VTK format. Label the type of degenerate point for each point in an array called DegeneracyType. Use a value of 0 for trisectors and 1 for wedges.
4. Analyze the visualization and answer the following questions:
   Q1: Are there more trisectors than wedges? (yes/no)
   Q2: Out of all degenerate points, the sum of one point's coordinates is the highest. What is this highest sum, rounded to the nearest integer? (A) 124 (B) 136 (C) 148 (D) 160
   Save the answers to the analysis questions in plain text as "brain/results/{agent_mode}/answers.txt".
Do not save any files other than the specified result files.

Score

9.0/10 (90.0%)
evaluateDegeneratePoints 9.0/10 (90.0%)

Score

10/20 (50.0%)

Questions & Correct Answers

1. Q1 correct answer: yes
2. Q2 correct answer: (B)

Agent's Answers

Q1: yes
Q2: C

Judge's Evaluation

Evaluation:

Criterion 1 (Q1): The provided answer is "yes", which exactly matches the correct answer ("yes"). Score: 10/10.
Criterion 2 (Q2): The provided answer is "C", but the correct answer is "B". This does not meet the criterion. Score: 0/10.
Total: 10/20.

📊 Detailed Metrics

Rule-Based Score
9.0/10
90.0%
Text Q&A Score
10/20
50.0%
Input Tokens
140,681
Output Tokens
3,586
Total Tokens
144,267
Total Cost
$0.4758

📝 cylinder

âš ī¸ LOW SCORE
10.0/30 (33.3%)

📋 Task Description

1. Please load the file "cylinder/data/cylinder.vti"
2. Apply persistence simplification of 0.01 to the Speed field.
3. Compute the Morse-Smale segmentation of the simplified Speed field.
4. Save the Morse-Smale segmentation as "cylinder/results/{agent_mode}/cylinder.vti". It should have a point array called Partition. For each point x, the array "Partition" should store the id number of the region in the segmentation that x belongs to.
5. Analyze the visualization and answer the following questions:
   Q1: How many unique partition regions are there? (A) 152 (B) 163 (C) 174 (D) 185
   Q2: How many points are in the largest partition region? (A) 6879 (B) 7968 (C) 8796 (D) 9687
   Save the answers to the analysis questions in plain text as "cylinder/results/{agent_mode}/answers.txt".
Do not save any files other than the specified result files.

Score

10.0/10 (100.0%)
evaluateMSSEgmentation 10.0/10 (100.0%)

Score

0/20 (0.0%)

Questions & Correct Answers

1. Q1 correct answer: (A)
2. Q2 correct answer: (D)

Agent's Answers

Q1: C
Q2: B

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: A): The provided answer for Q1 is C, which does not match the required correct answer A. Score: 0/10.
Criterion 2 (Q2 correct answer: D): The provided answer for Q2 is B, which does not match the required correct answer D. Score: 0/10.
Both answers fail to meet their respective correctness criteria, so the total score is 0/20.
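Both cylinder questions reduce to counting over the Partition array of the saved segmentation. The sketch below assumes the array has already been read into a flat Python sequence of region ids; reading the .vti file itself is left out, and the function name is hypothetical.

```python
from collections import Counter


def partition_stats(partition_ids):
    """Return (number of unique regions, point count of the largest region)."""
    counts = Counter(partition_ids)
    # most_common(1) yields the single (region_id, size) pair with the largest size.
    _, largest_size = counts.most_common(1)[0]
    return len(counts), largest_size


# Toy example with three regions; region 2 is the largest with three points.
print(partition_stats([0, 0, 1, 2, 2, 2]))  # prints (3, 3)
```

Applied to the real Partition array, the first value would answer Q1 and the second Q2.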

📊 Detailed Metrics

Rule-Based Score
10.0/10
100.0%
Text Q&A Score
0/20
0.0%
Input Tokens
171,520
Output Tokens
3,440
Total Tokens
174,960
Total Cost
$0.5662

📝 dragon

❌ FAILED
0/10 (0.0%)

📋 Task Description

1. Load the dataset from "dragon/data/dragon.vtu".
2. Compute the Morse-Smale complex on the scalar field named "density". Make sure 1-Separatrices are computed.
3. Compute the critical points on the previous elevation scalar field.
4. Save the critical points as "dragon/results/{agent_mode}/dragon.vtk" in legacy VTK format.
   - The output should contain the critical points as point dataset
   - Include an array called "CriticalType" that labels each point according to what type of critical type it is. Use the following convention:
     * 0 for minima
     * 1 for 1-saddles
     * 2 for 2-saddles
     * 3 for maxima
   - The point coordinates should be in world coordinates
Do not save any files other than the specified result files.

Score

0/10 (0.0%)
evaluateDragonCriticalPoints 0/10 (0.0%)
Missing files: RS: /home/kuangshiai/Atlas/Code/SciVisAgentBench/SciVisAgentBench-tasks/topology/dragon/results/topopilot_mcp_claude-sonnet-4-5_exp2/dragon.vtk

📊 Detailed Metrics

Rule-Based Score
0/10
0.0%
Input Tokens
119,467
Output Tokens
1,138
Total Tokens
120,605
Total Cost
$0.3755

📝 isabel

âš ī¸ LOW SCORE
10.0/40 (25.0%)

📋 Task Description

1. Load the file "isabel/data/isabel.vti".
2. Apply persistent simplification to the field "sf" with a persistence threshold of 0.04
3. Compute the merge tree of the simplified field.
4. Save the nodes of the merge tree as "isabel/results/{agent_mode}/isabel_nodes.vtk" in legacy VTK format. This file should have two point arrays. One should be called "CriticalType" and should store the type of critical point for each node. It should follow the following convention: 0: minima. 1: 1-saddles. 2: 2-saddles. 3: maxima. 4: degenerate critical points. The other point array should be called "Scalar" and should contain the scalar field value at each point in the merge tree.
5. Save the edges of the merge tree as "isabel/results/{agent_mode}/isabel_edges.vtk" in legacy VTK format. The file should store each edge as a separate cell with type vtkLine.
6. Analyze the visualization and answer the following questions:
   Q1: The parent node of the leaf (377, 265, 0) has coordinates (x,y,z). What is x+y+z? (A) 627 (B) 854 (C) 992 (D) 1039
   Q2: How many edges are there in the merge tree? (A) 154 (B) 195 (C) 204 (D) 254
   Q3: What is the highest scalar field value of a minimum, rounded to the nearest whole number? (A) 12 (B) 26 (C) 31 (D) 58
   Save the answers to the analysis questions in plain text as "isabel/results/{agent_mode}/answers.txt".
Do not save any files other than the specified result files.

Score

10.0/10 (100.0%)
evaluateMergetree 10.0/10 (100.0%)

Score

0/30 (0.0%)

Questions & Correct Answers

1. Q1 correct answer: (A)
2. Q2 correct answer: (B)
3. Q3 correct answer: (C)

Agent's Answers

Analysis of Isabel Dataset Merge Tree

Successfully Completed Tasks:
1. Loaded isabel.vti with scalar field "sf"
2. Applied persistence simplification with threshold 0.04
3. Computed the merge tree of the simplified field
4. Exported merge tree nodes to isabel_nodes.vtk
5. Exported merge tree edges to isabel_edges.vtk

Analysis Questions:
Q1: The parent node of the leaf (377, 265, 0) has coordinates (x,y,z). What is x+y+z?
Answer: Unable to determine without VTK file parsing capabilities.
Note: This requires reading the nodes file to find the node at (377, 265, 0), then reading the edges file to find its parent, then calculating the sum of coordinates.
Q2: How many edges are there in the merge tree?
Answer: Unable to determine without VTK file parsing capabilities.
Note: This requires counting the number of vtkLine cells in the edges file.
Q3: What is the highest scalar field value of a minimum, rounded to the nearest whole number?
Answer: Unable to determine without VTK file parsing capabilities.
Note: This requires filtering nodes where CriticalType=0 (minima) and finding the maximum Scalar value among them.

Limitation: The current toolset allows computation and export of topological structures but does not include capabilities to read and query the exported VTK files. To answer these questions accurately, additional file parsing tools or direct query capabilities for the computed structures would be needed.

Judge's Evaluation

Evaluation:

Criterion 1 (Q1): 0/10. The required output is the numeric value (A) for x+y+z of the parent node of the specified leaf. The provided answer does not supply any value and instead states it is unable to determine. This does not meet the criterion of correctness.
Criterion 2 (Q2): 0/10. The required output is the number of edges (B). The provided answer again does not provide a number and states it is unable to determine. No partial credit is warranted because the response contains no computed or estimated edge count.
Criterion 3 (Q3): 0/10. The required output is the highest scalar field value among minima, rounded (C). The provided answer does not provide a value and states it is unable to determine. It therefore fails the correctness criterion.
Overall: While the notes correctly describe the general procedure that would be used to obtain each answer from the exported VTK files, the evaluation criteria are explicitly about producing the correct answers (A, B, C). Since none of these are provided, all criteria score 0/10.
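The procedure described in the agent's notes can be sketched as follows, assuming the node coordinates, CriticalType values, Scalar values, and vtkLine endpoints have already been read into plain Python lists. Parsing the two legacy VTK files is left out, and the function and argument names are hypothetical. The sketch also assumes a leaf has exactly one incident edge, so its only neighbor is taken as its parent.

```python
def merge_tree_answers(points, critical_types, scalars, edges, leaf):
    """Answer the three isabel questions from in-memory merge tree data.

    points: list of (x, y, z) node coordinates
    critical_types, scalars: per-node values, aligned with points
    edges: list of (i, j) node-index pairs, one per vtkLine cell
    leaf: (x, y, z) coordinates of the leaf whose parent is wanted
    """
    leaf_index = points.index(leaf)

    # A leaf touches a single edge; the node at the other end is its parent.
    neighbors = [j if i == leaf_index else i
                 for i, j in edges if leaf_index in (i, j)]
    if len(neighbors) != 1:
        raise ValueError("node is not a leaf: expected exactly one incident edge")

    q1 = sum(points[neighbors[0]])  # x + y + z of the parent
    q2 = len(edges)                 # number of edges in the tree
    # Highest scalar among minima (CriticalType 0), rounded with Python's round().
    q3 = round(max(s for t, s in zip(critical_types, scalars) if t == 0))
    return q1, q2, q3
```

Note that Python's round() rounds exact halves to the nearest even integer, which may differ from the rounding the benchmark intends.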

📊 Detailed Metrics

Rule-Based Score
10.0/10
100.0%
Text Q&A Score
0/30
0.0%
Input Tokens
140,550
Output Tokens
3,058
Total Tokens
143,608
Total Cost
$0.4675

📝 molecule

❌ FAILED
0/10 (0.0%)

📋 Task Description

1. Load the data file "molecule/data/molecule.vti".
2. Compute the Morse-Smale segmentation on the scalar field named "log(s)".
3. Save the Morse-Smale segmentation as "molecule/results/{agent_mode}/molecule.vti". It should have a point array called "Segmentation". For each point x, the array "Segmentation" should store the id number of the region in the segmentation that x belongs to.
Do not save any files other than the specified result files.

Score

0/10 (0.0%)
evaluateMoleculeSegmentation 0.0/10 (0.0%)

📊 Detailed Metrics

Rule-Based Score
0/10
0.0%
Input Tokens
58,433
Output Tokens
749
Total Tokens
59,182
Total Cost
$0.1865

📝 moons

❌ FAILED
0/10 (0.0%)

📋 Task Description

1. Load the data file "moons/data/moons.vti".
2. Apply topological simplification to the field "SplatterValues" with a persistence threshold of 10.
3. Compute the Morse-Smale segmentation on the simplified scalar field.
4. Save only the Ascending Manifold as "moons/results/{agent_mode}/moons.vti". It should have a point array called "AscendingManifold". For each point x, the array "AscendingManifold" should store the id number of the region that x belongs to.
Do not save any files other than the specified result files.

Score

0/10 (0.0%)
evaluateMoonAscendingManifold 0.0/10 (0.0%)

📊 Detailed Metrics

Rule-Based Score
0/10
0.0%
Input Tokens
73,654
Output Tokens
976
Total Tokens
74,630
Total Cost
$0.2356

📝 noisyTerrain

❌ FAILED
0/10 (0.0%)

📋 Task Description

1. Load the dataset from "noisyTerrain/data/noisyTerrain.vtu".
2. Compute the persistence diagram on the scalar field named "Blend".
3. Apply a threshold to filter out pairs with persistence value less than 1.
4. Save the persistence diagram as "noisyTerrain/results/{agent_mode}/noisyTerrain.vtk" in legacy VTK format.
   - The output should contain the points in the persistence diagram as point data, and each persistence pair is represented as a cell.
   - Include the following three scalar arrays with the given names and purposes:
     * "Birth" array: store the birth value of each pair.
     * "Persistence" array: store the persistence value of each pair.
     * "IsFinite" array: use 1 to mark finite persistence and 0 to mark infinite persistence.
Do not save any files other than the specified result files.

Score

0/10 (0.0%)
evaluateNoisyTerrainPersistenceDiagram 0/10 (0.0%)
Missing files: RS: /home/kuangshiai/Atlas/Code/SciVisAgentBench/SciVisAgentBench-tasks/topology/noisyTerrain/results/topopilot_mcp_claude-sonnet-4-5_exp2/noisyTerrain.vtk

📊 Detailed Metrics

Rule-Based Score
0/10
0.0%
Input Tokens
104,640
Output Tokens
1,129
Total Tokens
105,769
Total Cost
$0.3309

📝 ocean

9.75/10 (97.5%)

📋 Task Description

1. Please load the asymmetric tensor field from "ocean/data/ocean.vti". The (1,1), (1,2), (2,1) and (2,2) entries are respectively given by the arrays A, B, C, and D
2. Compute the eigenvector partition of the dataset.
3. Save the degenerate points as "ocean/results/{agent_mode}/ocean_points.vtk" in legacy VTK format. Include a point array called DegeneracyType which classifies each degenerate point. It should have a value of 0 for trisectors and 1 for wedges.
4. Save the partition information from the eigenvector partition as "ocean/results/{agent_mode}/ocean_eigenvector.vti" as VTK image data. It should have a point array called Partition that stores the region identifiers as follows: 0: W_{c,s}. 1: W_{r,s}. 2: W_{r,n}. 3: W_{c,n}
5. Compute the eigenvalue partition of the dataset.
6. Save the partition information from the eigenvalue partition as "ocean/results/{agent_mode}/ocean_eigenvalue.vti" as VTK image data. It should have a point array called Partition that stores the region identifiers as follows: 0: positive scaling. 1: counterclockwise rotation. 2: negative scaling. 3: clockwise rotation. 4: anisotropic stretching.
7. Analyze the visualization and answer the following questions:
   Q1: Are there more trisectors than wedges? (yes/no)
   Q2: How many points have the most common classification in the eigenvector partition? (A) 752342 (B) 802842 (C) 826348 (D) 994682
   Q3: Which is the least common classification in the eigenvalue partition? (A) Positive scaling (B) counterclockwise rotation (C) negative scaling (D) clockwise rotation
   Save the answers to the analysis questions in plain text as "ocean/results/{agent_mode}/answers.txt".
Do not save any files other than the specified result files.

Score

9.75/10 (97.5%)
evaluate2DAsymmetricTFTopology 9.75/10 (97.5%)

Score

0/0 (0.0%)

Questions & Correct Answers

Agent's Answers

Judge's Evaluation

Evaluation:

📊 Detailed Metrics

Rule-Based Score
9.75/10
97.5%
Text Q&A Score
0/0
0.0%
Input Tokens
123,919
Output Tokens
3,533
Total Tokens
127,452
Total Cost
$0.4248

📝 QMCPACK

âš ī¸ LOW SCORE
10.0/30 (33.3%)

📋 Task Description

1. Please load the dataset from "QMCPACK/data/QMCPACK.vti".
2. Compute the critical points of the scalar field.
3. Save the critical points as "QMCPACK/results/{agent_mode}/QMCPACK.vtk" in legacy VTK format.
   - The output should contain the critical points as point data
   - Include an array called "CriticalType" that labels each point according to what type of critical type it is. Use the following convention:
     * 0 for minima
     * 1 for 1-saddles
     * 2 for 2-saddles
     * 3 for maxima
     * 4 for degenerate critical points
   - The point coordinates should be in index space (grid coordinates), not world coordinates
4. Analyze the visualization and answer the following questions:
   Q1: How many index 1 saddles are there: (A) 248 (B) 274 (C) 299 (D) 344
   Q2: What is the type of critical point closest to coordinates (4,58,12): (A) minimum (B) 1-saddle (C) 2-saddle (D) maximum
   Save the answers to the analysis questions in plain text as "QMCPACK/results/{agent_mode}/answers.txt".
Do not save any files other than the specified result files.

Score

10.0/10 (100.0%)
evaluateQmcpackCriticalPoints 10.0/10 (100.0%)

Score

0/20 (0.0%)

Questions & Correct Answers

1. Q1 correct answer: (C)
2. Q2 correct answer: (D)

Agent's Answers

Q1: How many index 1 saddles are there?
Answer: Unable to determine without file reading capabilities (Options were: (A) 248 (B) 274 (C) 299 (D) 344)
Q2: What is the type of critical point closest to coordinates (4,58,12)?
Answer: Unable to determine without file reading capabilities (Options were: (A) minimum (B) 1-saddle (C) 2-saddle (D) maximum)

NOTE:
- Critical points have been successfully computed and exported to QMCPACK.vtk
- The file contains all necessary data with CriticalType array (0=minima, 1=1-saddles, 2=2-saddles, 3=maxima, 4=degenerate)
- An analysis script (analyze.py) has been created that can extract these answers when executed
- The available tools in this environment do not include file reading or Python execution capabilities needed for the analysis

Judge's Evaluation

Evaluation:

Criterion 1 (Q1): The correct answer is (C) 299 index-1 saddles. The provided answer does not select any option and instead states it is unable to determine. This does not meet the criterion of providing the correct choice/value, so it earns 0/10.
Criterion 2 (Q2): The correct answer is (D) maximum. The provided answer again does not select an option and states it is unable to determine. This fails to meet the criterion of identifying the correct critical point type, so it earns 0/10.
While the stated limitation (no file reading/Python execution) may explain why the respondent did not compute the results, the evaluation criteria are correctness-based, and the answers do not provide the required correct options.
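For illustration, the two QMCPACK questions amount to a count and a nearest-neighbor lookup over the exported critical points. The sketch below is not the agent's analyze.py. It assumes the point coordinates and the CriticalType array have already been loaded into Python lists (file parsing is omitted), and the function names are hypothetical.

```python
import math


def count_type(critical_types, wanted):
    """Count the critical points whose CriticalType equals `wanted`."""
    return sum(1 for t in critical_types if t == wanted)


def closest_type(points, critical_types, query):
    """Return the CriticalType of the point nearest to `query` (Euclidean distance)."""
    nearest = min(range(len(points)), key=lambda i: math.dist(points[i], query))
    return critical_types[nearest]


# Toy data: two 1-saddles and one maximum, with the maximum nearest the query.
toy_points = [(0, 0, 0), (4, 58, 11), (10, 10, 10)]
toy_types = [1, 3, 1]
print(count_type(toy_types, 1))                          # prints 2
print(closest_type(toy_points, toy_types, (4, 58, 12)))  # prints 3
```

On the real data, count_type(types, 1) would address Q1, and closest_type(points, types, (4, 58, 12)) would address Q2 under the 0 to 4 CriticalType convention given in the task.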

📊 Detailed Metrics

Rule-Based Score
10.0/10
100.0%
Text Q&A Score
0/20
0.0%
Input Tokens
179,337
Output Tokens
5,694
Total Tokens
185,031
Total Cost
$0.6234