🎯 SciVisAgentBench Evaluation Report

Agent: codex_cli · Generated: 2026-03-18T09:54:45.646132

📊 Overall Performance

Overall Score

36.6%
97/265 Points

Test Cases

7/11
Completed Successfully

Avg Vision Score

26.0%
Visualization Quality
26/130

PSNR (Scaled)

N/A
Peak SNR (0/7 valid)

SSIM (Scaled)

N/A
Structural Similarity

LPIPS (Scaled)

N/A
Perceptual Distance

Completion Rate

63.6%
Tasks Completed

â„šī¸ About Scaled Metrics

Scaled metrics account for completion rate to enable fair comparison across different evaluation modes:

PSNR_scaled = (completed_cases / total_cases) × avg(PSNR)
SSIM_scaled = (completed_cases / total_cases) × avg(SSIM)
LPIPS_scaled = 1.0 - (completed_cases / total_cases) × (1.0 - avg(LPIPS))

Cases with infinite PSNR (perfect match) are excluded from the PSNR average.
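
For concreteness, the three formulas can be sketched in Python. `scaled_metrics` is a hypothetical helper name (not part of the benchmark code); the infinite-PSNR exclusion mirrors the note above, and a metric with no valid samples comes back as `None`, which the report renders as N/A.

```python
import math

def scaled_metrics(psnr, ssim, lpips, completed, total):
    """Apply the completion-rate scaling described above.

    Returns (psnr_scaled, ssim_scaled, lpips_scaled); an entry is None
    when that metric has no valid samples.
    """
    rate = completed / total
    finite = [p for p in psnr if math.isfinite(p)]  # drop infinite PSNR (perfect matches)
    psnr_s = rate * sum(finite) / len(finite) if finite else None
    ssim_s = rate * sum(ssim) / len(ssim) if ssim else None
    lpips_s = 1.0 - rate * (1.0 - sum(lpips) / len(lpips)) if lpips else None
    return psnr_s, ssim_s, lpips_s
```

With 0/7 valid PSNR samples, as in this run, the PSNR entry is `None`, matching the N/A shown above.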

🔧 Configuration

Provider: openai
Model: gpt-5.2
N/A
Input price: $1.75
Output price: $14.00

📝 case_1

âš ī¸ LOW SCORE
21/45 (46.7%)

📋 Task Description

1. Load the "data/dataset_002/dataset_002_ch0.tif" dataset into napari as channel 0 and "data/dataset_002/dataset_002_ch1.tif" as channel 1.
2. Set the colormap for channel 0 to red and channel 1 to green.
3. Switch to the 3D view.
4. Use additive blending for all channels to create an overlay visualization.
5. Go to timestep 14. Q1: Does the cell show protrusions? (Yes/No)
6. Take a screenshot of the result and save it to "eval_visualization_tasks/case_1/results/{agent_mode}/screenshot_1.png".
7. Answer Q1 in a plain text file "eval_visualization_tasks/case_1/results/{agent_mode}/multi_channel_answer.txt".
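
The steps above map onto napari's Python API roughly as follows. This is an untested sketch to be run in a GUI session; the `tifffile` reader and the filled-in `codex_cli` value for the `{agent_mode}` placeholder are assumptions.

```python
import napari
import tifffile  # assumed reader for the .tif stacks

agent_mode = "codex_cli"  # fills the {agent_mode} placeholder from the task
viewer = napari.Viewer()
for path, cmap in [("data/dataset_002/dataset_002_ch0.tif", "red"),
                   ("data/dataset_002/dataset_002_ch1.tif", "green")]:
    viewer.add_image(tifffile.imread(path), colormap=cmap, blending="additive")
viewer.dims.ndisplay = 3                 # step 3: switch to the 3D view
viewer.dims.set_current_step(0, 14)      # step 5: timestep 14 on the first axis
viewer.screenshot(
    f"eval_visualization_tasks/case_1/results/{agent_mode}/screenshot_1.png")
```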

đŸ–ŧī¸ Visualization Comparison

Ground Truth (image)

Agent Result (image)

Score Summary

Total Score
3/20
Goals
2
Points/Goal
10
Goal 1
2/10
Criterion: Does the visualization show a green cell with red blobs on the inside?
Judge's Assessment: Ground truth shows roughly round green cells with multiple distinct red puncta/blobs inside. The result image shows a single elongated/diagonal structure rendered mostly green/yellow with some red signal, but it does not clearly depict a green cell containing discrete red blobs; the red appears sparse and not clearly internal puncta within a cell-like boundary.
Goal 2
1/10
Criterion: Does the result rendering look similar to ground truth?
Judge's Assessment: The result rendering is very dissimilar to the ground truth: different morphology (elongated slab vs multiple round cells), different composition (UI screenshot with controls visible), and different visual style/contrast. It does not resemble the expected microscopy-like view of green cells with internal red puncta.

Overall Assessment

The result does not convincingly show a green cell with internal red blobs and is visually far from the ground truth in both content and presentation (including an application UI overlay and mismatched morphology).

Text Q&A Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

Yes

Judge's Evaluation

Criterion 1 (Q1 correct answer: Yes): The provided task answer is exactly "Yes", which matches the expected correct answer. Therefore, it fully satisfies the evaluation criterion.

📊 Detailed Metrics

Visualization Quality
3/20
Output Generation
5/5
Efficiency
3/10
Text Q&A Score
10/10
100.0%
Input Tokens
503,849
Output Tokens
6,164
Total Tokens
510,013
Total Cost
$2.6117

📝 case_2

10/10 (100.0%)

📋 Task Description

1. Load the "data/dataset_002/Points.csv" dataset into napari.
2. Check if the points layer has been created. Q1: Was the points layer created successfully? (Yes/No)
3. Answer Q1 in a plain text file "eval_visualization_tasks/case_2/results/{agent_mode}/points_answer.txt".
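
One plausible implementation of steps 1-2, assuming the CSV holds one point per row of numeric coordinate columns after a header line (napari's built-in reader can also open CSVs it previously saved itself):

```python
import numpy as np
import napari

viewer = napari.Viewer()
coords = np.loadtxt("data/dataset_002/Points.csv", delimiter=",", skiprows=1)
viewer.add_points(coords, name="Points", size=10)
answer = "Yes" if "Points" in viewer.layers else "No"  # Q1
```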

Text Q&A Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

Yes

Judge's Evaluation

Criterion 1 (Q1 correct answer: Yes): The provided task answer is exactly "Yes", which matches the expected correct answer. Therefore, it fully satisfies the evaluation criterion.

📊 Detailed Metrics

Text Q&A Score
10/10
100.0%
Input Tokens
348,575
Output Tokens
4,327
Total Tokens
352,902
Total Cost
$1.8078

📝 case_3

10/10 (100.0%)

📋 Task Description

1. Load the "data/dataset_002/Shapes.csv" dataset into napari.
2. Check if the shapes layer has been created. Q1: Was the shapes layer created successfully? (Yes/No)
3. Answer Q1 in a plain text file "eval_visualization_tasks/case_3/results/{agent_mode}/shapes_answer.txt".
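
A sketch of the same check for a shapes layer; it is an assumption here that `viewer.open` dispatches to napari's built-in CSV shapes reader for this file.

```python
import napari

viewer = napari.Viewer()
viewer.open("data/dataset_002/Shapes.csv")          # built-in CSV reader (assumption)
answer = "Yes" if len(viewer.layers) > 0 else "No"  # Q1: was a shapes layer created?
```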

Text Q&A Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

Yes

Judge's Evaluation

Criterion 1 (Q1 correct answer: Yes): The provided task answer is exactly "Yes", which matches the expected correct answer. Therefore it fully satisfies the criterion.

📊 Detailed Metrics

Text Q&A Score
10/10
100.0%
Input Tokens
215,355
Output Tokens
3,215
Total Tokens
218,570
Total Cost
$1.1250

📝 case_4

❌ FAILED
0/0 (0.0%)

📋 Task Description

1. Load the "data/dataset_002/Labels.tif" dataset into napari.
2. Check if a new layer called "Labels" has been created. Q1: Was the layer created successfully? (Yes/No)
3. Answer Q1 in a plain text file "eval_visualization_tasks/case_4/results/{agent_mode}/labels_answer.txt".
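
A minimal sketch of this task, assuming the `tifffile` reader and that the TIFF holds an integer-typed segmentation (which `add_labels` requires):

```python
import napari
import tifffile  # assumed reader

viewer = napari.Viewer()
# add_labels expects an integer-typed array of segmentation labels
viewer.add_labels(tifffile.imread("data/dataset_002/Labels.tif"), name="Labels")
answer = "Yes" if "Labels" in viewer.layers else "No"  # Q1
```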

Text Q&A Score

0/0 (0.0%)

Questions & Correct Answers

Agent's Answers

Judge's Evaluation

📊 Detailed Metrics

Text Q&A Score
0/0
0.0%
Input Tokens
1,615
Output Tokens
3,000
Total Tokens
4,615
Total Cost
$0.0531

📝 case_5

❌ FAILED
0/45 (0.0%)

📋 Task Description

1. Load the dataset into napari: data/dataset_001/dataset_001.tiff
2. Read the target figure data/dataset_001/dataset_001.png, but don't load it into napari.
3. Read the dataset description: data/dataset_001/dataset_001.yaml.
4. Set the same colormaps and blending modes as the target figure.
5. Adjust contrast and gamma as needed to match the target figure.
6. Take a screenshot of your recreation.
7. If the recreation does not match the target figure, adjust the visualization settings and take a screenshot again.
8. Stop when the recreation matches the target figure or you have tried five different visualization settings.
9. Save the final screenshot to "eval_visualization_tasks/case_5/results/{agent_mode}/screenshot.png".
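
The adjustment loop in steps 4-8 comes down to a few layer properties. In this untested sketch, the channel axis, colormaps, contrast limits, and gamma are placeholders that would really come from dataset_001.yaml and visual comparison with dataset_001.png; the output path fills `{agent_mode}` with this report's codex_cli.

```python
import napari
import tifffile

viewer = napari.Viewer()
stack = tifffile.imread("data/dataset_001/dataset_001.tiff")
layers = viewer.add_image(stack, channel_axis=0,          # assumed channel axis
                          colormap=["green", "red", "blue"],
                          blending="additive")
for layer in layers:
    layer.contrast_limits = (0, int(layer.data.max()))    # then tighten per channel
    layer.gamma = 0.9                                     # placeholder value
viewer.screenshot("eval_visualization_tasks/case_5/results/codex_cli/screenshot.png")
```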

đŸ–ŧī¸ Visualization Comparison

Ground Truth (image)

Agent Result (image)

Score Summary

Total Score
1/30
Goals
3
Points/Goal
10
Goal 1
1/10
Criterion: Does the result screenshot look similar to the ground truth image?
Judge's Assessment: The ground truth shows a multichannel fluorescence microscopy image with visible green neurite-like structures, red nuclei, and faint blue puncta on a dark background. The result image is essentially completely black with no discernible structures, so it does not resemble the ground truth composition or content.
Goal 2
0/10
Criterion: Are the same colormaps and blending modes used as in the target figure?
Judge's Assessment: The ground truth clearly uses multiple color channels (green, red, and some blue/cyan) blended additively over a dark background. The result shows no visible color information at all, so the colormaps and blending modes are not matched/represented.
Goal 3
0/10
Criterion: Is the contrast and gamma adjusted to match the target figure?
Judge's Assessment: The ground truth has adjusted contrast/gamma to reveal dim structures while keeping the background dark. The result appears underexposed or fully clipped to black, indicating contrast/gamma are not set to reveal the data and do not match the target.

Overall Assessment

Overall, the result fails to reproduce the ground truth visualization: it is nearly entirely black, with no visible multichannel signal, incorrect (absent) colormapping/blending, and severely mismatched contrast/gamma.

📊 Detailed Metrics

Visualization Quality
1/30
Output Generation
5/5
Efficiency
3/10
Input Tokens
521,172
Output Tokens
8,894
Total Tokens
530,066
Total Cost
$2.7393

📝 case_6

âš ī¸ LOW SCORE
12/35 (34.3%)

📋 Task Description

1. Read the file "data/dataset_003/eval_iso_surface_determination_target_1.txt" to get the target iso-surface values for different tooth structures.
2. Load data/dataset_003/dataset_003.tif into napari.
3. Switch to 3D view mode and set the rendering to iso.
4. Find the iso-surface value that shows the target clearly.
5. Rotate the camera to several angles and take a screenshot each time to check whether the target structure is clearly visible from different angles.
6. If the target structure is not clearly visible, adjust the iso-surface value and take a screenshot again.
7. Stop when the target structure is clearly visible or you have tried five different iso-surface values.
8. Save the final screenshot to "eval_visualization_tasks/case_6/results/{agent_mode}/screenshot.png".
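
Steps 3-6 correspond to the image layer's 3D rendering properties. In this untested sketch the threshold, camera angles, and intermediate filenames are illustrative only; note that the units of `iso_threshold` (normalized vs. data values) depend on the napari version.

```python
import napari
import tifffile

viewer = napari.Viewer()
layer = viewer.add_image(tifffile.imread("data/dataset_003/dataset_003.tif"))
viewer.dims.ndisplay = 3
layer.rendering = "iso"          # iso-surface rendering
layer.iso_threshold = 0.5        # starting guess; tune against the target values
for i, angles in enumerate([(0, 0, 90), (0, 45, 90), (0, 90, 90)]):
    viewer.camera.angles = angles              # rotate to check visibility
    viewer.screenshot(f"check_angle_{i}.png")  # hypothetical intermediate files
```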

đŸ–ŧī¸ Visualization Comparison

Ground Truth (image)

Agent Result (image)

Score Summary

Total Score
7/20
Goals
2
Points/Goal
10
Goal 1
2/10
Criterion: Does the result rendering look similar to ground truth?
Judge's Assessment: The ground truth shows a smooth, glossy, light-gray tooth-like surface on a black background, viewed upright and centered. The result shows a low-level triangular mesh rendering (orange with black wireframe) on a white background, with a different camera angle/orientation and a much more faceted appearance. Overall visual style, shading, background, and viewpoint do not match.
Goal 2
5/10
Criterion: Does the visualization show the target structure clearly?
Judge's Assessment: The target structure (tooth-like geometry with roots) is present in the result, but it is not shown as clearly as in the ground truth due to heavy mesh/wireframe overlay, faceting, and a less informative viewing angle. The overall shape is still recognizable, but surface details and silhouette clarity are reduced.

Overall Assessment

The result captures the general tooth geometry but differs strongly from the ground truth in rendering style (mesh/wireframe vs smooth shaded), background color, and camera orientation, which reduces similarity and clarity.

📊 Detailed Metrics

Visualization Quality
7/20
Output Generation
5/5
Efficiency
0/10
Input Tokens
1,248,285
Output Tokens
16,129
Total Tokens
1,264,414
Total Cost
$6.4834

📝 case_7

❌ FAILED
0/10 (0.0%)

📋 Task Description

1. Load the image "data/dataset_002/dataset_002_ch0.tif" and set channel 0 to a magenta colormap.
2. Switch to a 3D MIP view.
3. Take a screenshot and analyze it to count how many complete cells are visible (not cut off by the image edges). Q1: Answer with the number of complete cells you counted, for example "5" if you see 5 complete cells.
4. Save the answer to Q1 in plain text as "eval_visualization_tasks/case_7/results/{agent_mode}/Q1_answer.txt".
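
The rendering setup in steps 1-3 can be sketched as follows (untested; `tifffile` is an assumed reader, and the cell counting itself happens offline on the returned screenshot array):

```python
import napari
import tifffile

viewer = napari.Viewer()
layer = viewer.add_image(tifffile.imread("data/dataset_002/dataset_002_ch0.tif"),
                         colormap="magenta")
viewer.dims.ndisplay = 3
layer.rendering = "mip"        # maximum-intensity projection
shot = viewer.screenshot()     # RGBA array; inspect it to count complete cells
```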

Text Q&A Score

0/10 (0.0%)

Questions & Correct Answers

1. Q1 correct answer: 2

Agent's Answers

3

Judge's Evaluation

Criterion 1 (Q1 correct answer: 2): The provided task answer is "3", but the expected correct answer is "2". This does not meet the criterion, so it receives 0/10. No partial credit is warranted because the criterion specifies an exact correct value.

📊 Detailed Metrics

Text Q&A Score
0/10
0.0%
Input Tokens
1,396,100
Output Tokens
13,328
Total Tokens
1,409,428
Total Cost
$7.1804

📝 case_8

10/10 (100.0%)

📋 Task Description

1. Load the image "data/dataset_001/dataset_001.tiff".
2. Get basic statistics (min, max, mean, std) for the loaded layer.
3. Extract the raw layer data and examine its properties.
4. Save the current layer to a file for further analysis. Q1: Was the statistical analysis and data export successful? (Yes/No)
5. Save the answer to Q1 in plain text as "eval_visualization_tasks/case_8/results/{agent_mode}/layer_statistics_answer.txt".
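
Steps 2-3 reduce to NumPy reductions over the layer's raw array (`layer.data` in napari). A minimal napari-free sketch, with a hypothetical `layer_statistics` helper:

```python
import numpy as np

def layer_statistics(data):
    """Basic statistics and properties of a layer's raw array."""
    return {
        "min": float(data.min()),
        "max": float(data.max()),
        "mean": float(data.mean()),
        "std": float(data.std()),
        "shape": data.shape,
        "dtype": str(data.dtype),
    }

# For step 4, the array can be exported with np.save("layer_data.npy", data);
# napari layers also offer layer.save(path) for the same purpose.
```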

Text Q&A Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

Yes

Judge's Evaluation

Criterion 1 (Q1 correct answer: Yes): The provided task answer is exactly "Yes", which matches the expected correct answer. Therefore it fully satisfies the criterion.

📊 Detailed Metrics

Text Q&A Score
10/10
100.0%
Input Tokens
268,418
Output Tokens
3,931
Total Tokens
272,349
Total Cost
$1.4011

📝 case_9

10/10 (100.0%)

📋 Task Description

1. Load the image "data/dataset_001/dataset_001.tiff".
2. Add point annotations at random locations on the image.
3. Add shape annotations (rectangles or circles) at random locations on the image. Q1: Check whether the layers have been generated. (Yes/No)
4. Save the answer to Q1 in plain text as "eval_visualization_tasks/case_9/results/{agent_mode}/annotation_answer.txt".
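
An untested sketch of the annotation steps; the rectangle corners, point count, and random seed are arbitrary illustrations, and `tifffile` is an assumed reader:

```python
import napari
import numpy as np
import tifffile

viewer = napari.Viewer()
image = viewer.add_image(tifffile.imread("data/dataset_001/dataset_001.tiff"))
h, w = image.data.shape[-2:]
rng = np.random.default_rng(0)
viewer.add_points(rng.uniform((0, 0), (h, w), size=(5, 2)), name="points", size=15)
rect = np.array([[10, 10], [10, 60], [60, 60], [60, 10]])  # hypothetical corners
viewer.add_shapes([rect], shape_type="rectangle", name="shapes")
answer = "Yes" if len(viewer.layers) >= 3 else "No"  # Q1: image + two annotation layers
```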

Text Q&A Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

Yes

Judge's Evaluation

Criterion 1 (Q1 correct answer: Yes): The provided task answer is exactly "Yes", which matches the expected correct answer. Therefore it fully satisfies the evaluation criterion.

📊 Detailed Metrics

Text Q&A Score
10/10
100.0%
Input Tokens
275,474
Output Tokens
3,605
Total Tokens
279,079
Total Cost
$1.4314

📝 case_10

24/35 (68.6%)

📋 Task Description

1. Load the image "data/dataset_002/dataset_002_ch0.tif" into napari.
2. Trace the cell surface on the current slice by adding a polygon shape in a new shape layer.
3. Use a screenshot to validate whether the polygon correctly traces the cell surface.
4. If the trace is not accurate, adjust the polygon and take a new screenshot to validate.
5. Stop when the trace is accurate or you have tried five different attempts.
6. Save the results and the final screenshot to "eval_visualization_tasks/case_10/results/{agent_mode}/cell_surface_trace.png".
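
The core of the tracing loop can be sketched as below (untested; the polygon vertices are hypothetical and would be refined against screenshots in steps 3-4, and the output path fills `{agent_mode}` with this report's codex_cli):

```python
import napari
import numpy as np
import tifffile

viewer = napari.Viewer()
viewer.add_image(tifffile.imread("data/dataset_002/dataset_002_ch0.tif"))
# Hypothetical vertices; in practice they are adjusted across attempts
polygon = np.array([[40, 60], [55, 95], [90, 100], [110, 70], [85, 45]])
viewer.add_shapes([polygon], shape_type="polygon", name="trace",
                  edge_color="magenta", face_color="transparent")
viewer.screenshot(
    "eval_visualization_tasks/case_10/results/codex_cli/cell_surface_trace.png")
```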

đŸ–ŧī¸ Visualization Comparison

Ground Truth (image)

Agent Result (image)

Score Summary

Total Score
16/20
Goals
2
Points/Goal
10
Goal 1
7/10
Criterion: Does the final screenshot show a polygon shape that accurately traces the outline of the cell surface?
Judge's Assessment: A magenta polygon is present and generally follows the visible bright cell-like region. However, the outline looks jagged and somewhat irregular, with several vertices that appear to cut across the boundary or extend slightly outside/inside the apparent edge, suggesting the trace is approximate rather than a clean, accurate contour.
Goal 2
9/10
Criterion: Is the polygon layer correctly overlaid on the image?
Judge's Assessment: The polygon layer is clearly visible and properly overlaid on top of the image data. The alignment appears consistent with the underlying structure, with appropriate contrast (magenta on dark background) and no obvious offset between the polygon and the image.

Overall Assessment

Without ground truth, the result appears to be a reasonable cell-surface outline with good overlay/registration. The main limitation is contour quality/accuracy: the polygon is quite jagged and does not consistently hug the apparent boundary, indicating noticeable tracing imperfections.

📊 Detailed Metrics

Visualization Quality
16/20
Output Generation
5/5
Efficiency
3/10
Input Tokens
569,110
Output Tokens
9,828
Total Tokens
578,938
Total Cost
$2.9930

📝 case_11

❌ FAILED
0/55 (0.0%)

📋 Task Description

1. Load the "data/dataset_002/dataset_002_ch0.tif" dataset into napari as channel 0 and "data/dataset_002/dataset_002_ch1.tif" as channel 1.
2. Depending on the number of channels, set the colormap for channel 0 to red and channel 1 to green.
3. Switch to the 3D view.
4. Zoom in to the cell in the middle.
5. Rotate the camera to a side view.
6. Take a screenshot of the zoomed-in view and save it to "eval_visualization_tasks/case_11/results/{agent_mode}/zoom_screenshot.png".
7. Take a screenshot of the side view and save it to "eval_visualization_tasks/case_11/results/{agent_mode}/rotate_screenshot.png".
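
An untested sketch of the camera manipulation this task requires; the zoom factor and side-view angle convention are assumptions, and the paths fill `{agent_mode}` with this report's codex_cli.

```python
import napari
import tifffile

viewer = napari.Viewer()
for path, cmap in [("data/dataset_002/dataset_002_ch0.tif", "red"),
                   ("data/dataset_002/dataset_002_ch1.tif", "green")]:
    viewer.add_image(tifffile.imread(path), colormap=cmap)
viewer.dims.ndisplay = 3
viewer.camera.zoom *= 3                 # zoom toward the current center (assumed cell)
viewer.screenshot(
    "eval_visualization_tasks/case_11/results/codex_cli/zoom_screenshot.png")
viewer.camera.angles = (0, 90, 0)       # side view; axis convention is an assumption
viewer.screenshot(
    "eval_visualization_tasks/case_11/results/codex_cli/rotate_screenshot.png")
```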

đŸ–ŧī¸ Visualization Comparison - Set 1

Ground Truth 1 (image)

Agent Result 1 (image)

Score Summary

Total Score
0/20
Goals
2
Points/Goal
10
Goal 1
0/10
Criterion: Does the visualization show a zoomed-in view of the cell in the middle?
Judge's Assessment: Ground truth shows a zoomed-in view of a central cell (bright green structure with red puncta). The result image is essentially completely black with no visible cell or zoomed region, so the zoomed-in middle cell view is not present.
Goal 2
0/10
Criterion: Does the result rendering look similar to ground truth?
Judge's Assessment: The ground truth contains detailed fluorescence-like texture (green cell body and protrusions, red spots, dark background). The result rendering is blank/black and does not resemble the ground truth in content, color, or structure.

Overall Assessment

The result appears to be an empty/failed render (black frame), so it neither shows the required zoomed-in central cell nor matches the ground truth appearance.

đŸ–ŧī¸ Visualization Comparison - Set 2

Ground Truth 2 (image)

Agent Result 2 (image)

Score Summary

Total Score
0/20
Goals
2
Points/Goal
10
Goal 1
0/10
Criterion: Does the visualization show a side view of the cell?
Judge's Assessment: The ground truth shows a clear side-view fluorescence rendering of a cell (elongated green structure with internal red puncta). The result image is essentially completely black with no visible cell structure, so a side view is not shown.
Goal 2
0/10
Criterion: Does the result rendering look similar to ground truth?
Judge's Assessment: The result does not resemble the ground truth at all: it lacks the green cell body, red internal features, overall shape, and any comparable intensity distribution. It appears to be an empty/failed render.

Overall Assessment

The generated visualization appears blank and fails to display the cell or match the ground-truth side-view rendering in any way.

📊 Detailed Metrics

Visualization Quality
0/40
Output Generation
5/5
Efficiency
3/10
Input Tokens
672,141
Output Tokens
6,445
Total Tokens
678,586
Total Cost
$3.4574