🎯 SciVisAgentBench Evaluation Report

claude_code Generated: 2026-03-13T14:18:17.696268

📊 Overall Performance

Overall Score: 60.7% (167/275 points)
Test Cases: 11/11 completed successfully
Avg Vision Score: 51.0% (visualization quality, 60/130)
PSNR (Scaled): N/A (peak SNR, 0/11 valid)
SSIM (Scaled): N/A (structural similarity)
LPIPS (Scaled): N/A (perceptual distance)
Completion Rate: 100.0% (tasks completed)
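The headline numbers can be reproduced from the per-case scores listed later in this report. The sketch below is an inference from the figures, not a documented formula: the overall score matches pooled points (167/275), while the 51.0% vision score matches the mean of the per-case visualization percentages rather than the pooled 60/130. The function names are illustrative.

```python
# (earned, possible) points per case, copied from the per-case headers in this report.
case_scores = {
    "case_1": (40, 45), "case_2": (10, 10), "case_3": (10, 10),
    "case_4": (10, 10), "case_5": (14, 45), "case_6": (12, 35),
    "case_7": (10, 10), "case_8": (10, 10), "case_9": (10, 10),
    "case_10": (24, 35), "case_11": (17, 55),
}

# Visualization Quality sub-scores for the five cases that report one.
vision_scores = {
    "case_1": (16, 20), "case_5": (9, 30), "case_6": (7, 20),
    "case_10": (16, 20), "case_11": (12, 40),
}


def pooled_percent(scores):
    """Total earned points divided by total possible points."""
    earned = sum(e for e, _ in scores.values())
    possible = sum(p for _, p in scores.values())
    return 100.0 * earned / possible


def mean_percent(scores):
    """Unweighted mean of the per-case percentages."""
    return sum(100.0 * e / p for e, p in scores.values()) / len(scores)


print(f"Overall (pooled): {pooled_percent(case_scores):.1f}%")
print(f"Vision (pooled):  {pooled_percent(vision_scores):.1f}%")
print(f"Vision (mean):    {mean_percent(vision_scores):.1f}%")
```

The gap between the pooled and mean vision figures comes from case_11, whose 40-point visualization budget weighs more heavily in the pooled ratio.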

ℹ️ About Scaled Metrics

Scaled metrics account for completion rate to enable fair comparison across different evaluation modes. Formulas:

PSNR_scaled = (completed_cases / total_cases) × avg(PSNR)
SSIM_scaled = (completed_cases / total_cases) × avg(SSIM)
LPIPS_scaled = 1.0 - (completed_cases / total_cases) × (1.0 - avg(LPIPS))

Cases with infinite PSNR (perfect match) are excluded from the PSNR calculation.
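A minimal sketch of the scaling described above, assuming per-case metric values are available as plain lists. The function name and the choice to return `None` when no valid values exist (mirroring the "N/A" entries in this report) are assumptions, not the benchmark's actual code.

```python
import math


def scaled_metrics(psnr_values, ssim_values, lpips_values,
                   completed_cases, total_cases):
    """Scale average image metrics by the completion rate."""
    rate = completed_cases / total_cases

    def avg(values):
        # None stands in for "N/A" when there is nothing to average.
        return sum(values) / len(values) if values else None

    # Infinite PSNR means a perfect match; such cases are excluded.
    psnr = avg([v for v in psnr_values if math.isfinite(v)])
    ssim = avg(ssim_values)
    lpips = avg(lpips_values)

    return {
        "psnr_scaled": None if psnr is None else rate * psnr,
        "ssim_scaled": None if ssim is None else rate * ssim,
        # LPIPS is a distance (lower is better), so the scaling is
        # applied to its complement and then flipped back.
        "lpips_scaled": None if lpips is None else 1.0 - rate * (1.0 - lpips),
    }


# Illustrative values only; this report has no valid image metrics.
example = scaled_metrics([30.0, math.inf, 20.0], [0.9, 0.7], [0.2, 0.4],
                         completed_cases=8, total_cases=10)
empty = scaled_metrics([], [], [], completed_cases=11, total_cases=11)
print(example)
print(empty)
```

With this form, an incomplete run is pulled toward the worst value of each metric: 0 for PSNR and SSIM, 1.0 for LPIPS.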

🔧 Configuration

Provider: anthropic
Model: claude-sonnet-4-5
Base URL: https://livai-api.llnl.gov
Input price: $3.00 (per 1M tokens)
Output price: $15.00 (per 1M tokens)

case_1

40/45 (88.9%)

📋 Task Description

1. Load the "data/dataset_002/dataset_002_ch0.tif" dataset into napari as channel 0 and "data/dataset_002/dataset_002_ch1.tif" as channel 1.
2. Set the colormap for channel 0 to red and channel 1 to green.
3. Switch to the 3D view.
4. Use additive blending for all channels to create an overlay visualization.
5. Go to timestep 14.
   Q1: Does the cell show protrusions? (Yes/No)
6. Take a screenshot of the result, save it to "eval_visualization_tasks/case_1/results/{agent_mode}/screenshot_1.png".
7. Answer Q1 in a plain text file "eval_visualization_tasks/case_1/results/{agent_mode}/multi_channel_answer.txt".
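The visualization steps in this task could be scripted roughly as follows. This is a sketch under assumptions, not the agent's actual procedure: it assumes `napari` and `tifffile` are installed, that time is axis 0 of each TIFF stack, and that off-screen screenshots work in the environment (some setups need a visible window). `layer_kwargs` and `render_overlay` are hypothetical helper names.

```python
from pathlib import Path

# Colormap per channel index, as requested in the task description.
CHANNEL_COLORMAPS = {0: "red", 1: "green"}


def layer_kwargs(channel):
    """Display settings for one channel: colormap plus additive blending."""
    return {
        "name": f"ch{channel}",
        "colormap": CHANNEL_COLORMAPS[channel],
        "blending": "additive",
    }


def render_overlay(ch0_path, ch1_path, out_path, timestep=14, time_axis=0):
    """Load both channels, switch to 3D, jump to a timestep, save a screenshot."""
    # Imported lazily so layer_kwargs() stays usable without napari installed.
    import napari
    import tifffile

    viewer = napari.Viewer(show=False)
    for channel, path in enumerate((ch0_path, ch1_path)):
        viewer.add_image(tifffile.imread(str(path)), **layer_kwargs(channel))

    viewer.dims.ndisplay = 3  # 3D view
    # Assumption: the time dimension is axis `time_axis` of the loaded data.
    viewer.dims.set_current_step(time_axis, timestep)

    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    viewer.screenshot(path=str(out_path), canvas_only=True)
    viewer.close()
```

A call such as `render_overlay("data/dataset_002/dataset_002_ch0.tif", "data/dataset_002/dataset_002_ch1.tif", "screenshot_1.png")` would produce the overlay; the camera is left at napari's default, so the framing may still differ from the ground truth.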

🖼️ Visualization Comparison

[Images: Ground Truth, Agent Result]

Score Summary

Total Score: 16/20
Goals: 2
Points/Goal: 10
Goal 1: 9/10
Criterion: Does the visualization show a green cell with red blobs on the inside?
Judge's Assessment: Ground truth shows a bright green cell with multiple distinct red punctate blobs inside. The result image also clearly shows a green cell with several red blobs within the cell body. The red signal is present and localized similarly, though the overall scene contains fewer surrounding structures due to cropping.
Goal 2: 7/10
Criterion: Does the result rendering look similar to ground truth?
Judge's Assessment: The result is visually similar in terms of the main cell (same green morphology with internal red puncta), but it is not a full match to the ground truth framing: it appears cropped/zoomed to focus on the central cell and omits the large right-side cell and other background context present in the ground truth. Brightness/contrast and background texture are broadly comparable, but the composition differs noticeably.

Overall Assessment

The core biological content (green cell with internal red blobs) is well reproduced, but the result does not match the ground truth view/composition because it is more tightly cropped and lacks other cells/regions visible in the reference.

Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

Yes

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: Yes): The provided task answer is exactly "Yes", which matches the expected correct answer. Therefore, it fully satisfies the evaluation criterion with no discrepancies.

📊 Detailed Metrics

Visualization Quality: 16/20
Output Generation: 5/5
Efficiency: 9/10
Text Q&A Score: 10/10 (100.0%)
Input Tokens: 172,591
Output Tokens: 3,457
Total Tokens: 176,048
Total Cost: $0.5696
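The reported cost is consistent with the two dollar amounts in the Configuration section being prices per one million input and output tokens. That reading is an assumption inferred from the numbers; the sketch below checks it against this case's token counts.

```python
# Assumption: the configuration values are USD per 1M tokens.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def estimate_cost(input_tokens, output_tokens):
    """Token counts times per-million-token prices, in USD."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


cost = estimate_cost(172_591, 3_457)
print(f"${cost:.4f}")  # $0.5696
```

Input tokens dominate: roughly $0.52 of the $0.57 comes from the 172,591 input tokens.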

case_2

10/10 (100.0%)

📋 Task Description

1. Load the "data/dataset_002/Points.csv" dataset into napari.
2. Check if the points layer has been created.
   Q1: Was the points layer created successfully? (Yes/No)
3. Answer Q1 in a plain text file "eval_visualization_tasks/case_2/results/{agent_mode}/points_answer.txt".

Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

Yes

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: Yes): The provided task answer is exactly "Yes", which matches the expected correct answer. Therefore, it fully satisfies the evaluation criterion with no discrepancies.

📊 Detailed Metrics

Text Q&A Score: 10/10 (100.0%)
Input Tokens: 1,095,362
Output Tokens: 11,374
Total Tokens: 1,106,736
Total Cost: $3.4567

case_3

10/10 (100.0%)

📋 Task Description

1. Load the "data/dataset_002/Shapes.csv" dataset into napari.
2. Check if the shapes layer has been created.
   Q1: Was the shapes layer created successfully? (Yes/No)
3. Answer Q1 in a plain text file "eval_visualization_tasks/case_3/results/{agent_mode}/shapes_answer.txt".

Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

Yes

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: Yes): The provided task answer is exactly "Yes", which matches the expected correct answer. Therefore it fully satisfies the criterion with no discrepancies.

📊 Detailed Metrics

Text Q&A Score: 10/10 (100.0%)
Input Tokens: 696,202
Output Tokens: 9,043
Total Tokens: 705,245
Total Cost: $2.2243

case_4

10/10 (100.0%)

📋 Task Description

1. Load the "data/dataset_002/Labels.tif" dataset into napari.
2. Check if a new layer called "Labels" has been created.
   Q1: Was the layer created successfully? (Yes/No)
3. Answer Q1 in a plain text file "eval_visualization_tasks/case_4/results/{agent_mode}/labels_answer.txt".

Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

Yes

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: Yes): The provided task answer is exactly "Yes", which matches the expected correct answer. Therefore, it fully satisfies the evaluation criterion with no discrepancies.

📊 Detailed Metrics

Text Q&A Score: 10/10 (100.0%)
Input Tokens: 221,692
Output Tokens: 3,208
Total Tokens: 224,900
Total Cost: $0.7132

case_5

⚠️ LOW SCORE
14/45 (31.1%)

📋 Task Description

1. Load the dataset into napari: data/dataset_001/dataset_001.tiff
2. Read the target figure: data/dataset_001/dataset_001.png but don't load it into napari.
3. Read the dataset description: data/dataset_001/dataset_001.yaml.
4. Set the same colormaps and blending modes as the target figure.
5. Adjust contrast and gamma as needed to match the target figure.
6. Take a screenshot of your recreation.
7. If the recreation does not match the target figure, adjust the visualization settings and take a screenshot again.
8. Stop when the recreation matches the target figure or you have tried five different visualization settings.
9. Save the final screenshot to "eval_visualization_tasks/case_5/results/{agent_mode}/screenshot.png".

🖼️ Visualization Comparison

[Images: Ground Truth, Agent Result]

Score Summary

Total Score: 9/30
Goals: 3
Points/Goal: 10
Goal 1: 3/10
Criterion: Does the result screenshot look similar to the ground truth image?
Judge's Assessment: The result image does not look similar to the ground truth. The ground truth shows many red/orange nuclei-like blobs with green neurite structures and some faint blue puncta, distributed across the field. The result instead shows a different scene: fewer, larger green cell bodies with extensive orange punctate/filamentous signal and a noticeably different spatial distribution and morphology. Overall composition and channel content do not match.
Goal 2: 2/10
Criterion: Are the same colormaps and blending modes used as in the target figure?
Judge's Assessment: Colormaps/channel mapping differ. Ground truth appears to use green for processes, red/orange for nuclei, and a faint blue/cyan punctate channel. The result uses strong green plus a dominant orange channel with little/no visible blue and no clear red nuclei channel, indicating different channel assignments and/or blending.
Goal 3: 4/10
Criterion: Is the contrast and gamma adjusted to match the target figure?
Judge's Assessment: Contrast/gamma are not matched. The ground truth has relatively subdued green with clearly visible red nuclei and faint blue puncta on a dark background. The result has much higher-intensity green and orange with more saturated highlights and less separation of structures consistent with the ground truth’s intensity balance.

Overall Assessment

Across appearance, channel coloring/blending, and intensity tuning, the result diverges substantially from the ground truth. The morphology and channel emphasis suggest a different visualization setup rather than a close reproduction.

📊 Detailed Metrics

Visualization Quality: 9/30
Output Generation: 5/5
Efficiency: 0/10
Input Tokens: 1,592,808
Output Tokens: 17,577
Total Tokens: 1,610,385
Total Cost: $5.0421
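The case total in the header appears to be the sum of the component scores listed under Detailed Metrics. This is inferred from the figures in the report rather than from a stated rule; the helper name is illustrative.

```python
def case_total(components):
    """Sum (earned, possible) pairs and return the total with its percentage."""
    earned = sum(e for e, _ in components.values())
    possible = sum(p for _, p in components.values())
    return earned, possible, 100.0 * earned / possible


# Component scores for case_5, copied from the metrics above.
case_5 = {
    "Visualization Quality": (9, 30),
    "Output Generation": (5, 5),
    "Efficiency": (0, 10),
}

earned, possible, percent = case_total(case_5)
print(f"{earned}/{possible} ({percent:.1f}%)")  # 14/45 (31.1%)
```

The same sum holds for case_1 when its 10-point Text Q&A score is included: 16 + 5 + 9 + 10 = 40 out of 45.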

case_6

⚠️ LOW SCORE
12/35 (34.3%)

📋 Task Description

1. Read the file "data/dataset_003/eval_iso_surface_determination_target_1.txt" to get the target iso-surface values for different tooth structures.
2. Load data/dataset_003/dataset_003.tif into napari.
3. Switch to 3D view mode and set the rendering to iso.
4. Find the iso surface value that shows the target clearly.
5. Rotate the camera to several angles and take a screenshot of the result each time to check if the target structure is clearly visible from different angles.
6. If the target structure is not clearly visible, adjust the iso surface value and take a screenshot again.
7. Stop when the target structure is clearly visible or you have tried five different iso surface values.
8. Save the final screenshot to "eval_visualization_tasks/case_6/results/{agent_mode}/screenshot.png".

🖼️ Visualization Comparison

[Images: Ground Truth, Agent Result]

Score Summary

Total Score: 7/20
Goals: 2
Points/Goal: 10
Goal 1: 3/10
Criterion: Does the result rendering look similar to ground truth?
Judge's Assessment: The ground truth shows a centered, high-contrast tooth iso-surface on a black background with strong specular highlights and no axes. The result image instead shows a very faint, semi-transparent tooth-like blob inside a 3D matplotlib-style axes box with gridlines, labels, and a different camera angle. Overall composition, lighting/shading, background, and rendering style differ substantially, so visual similarity is low.
Goal 2: 4/10
Criterion: Does the visualization show the target structure clearly?
Judge's Assessment: In the ground truth, crown/neck/root geometry is clearly visible with crisp surface detail and clear separation of roots. In the result, the structure is present but washed out/low opacity and partially obscured by the grid/axes; surface detail and root separation are not clearly defined, making the target structure only partially clear.

Overall Assessment

The result captures a rough tooth-shaped iso-surface but does not match the ground truth’s clean, high-contrast rendering and does not display the tooth anatomy with comparable clarity due to low opacity and distracting axes/grid.

📊 Detailed Metrics

Visualization Quality: 7/20
Output Generation: 5/5
Efficiency: 0/10
Input Tokens: 772,322
Output Tokens: 16,663
Total Tokens: 788,985
Total Cost: $2.5669

case_7

10/10 (100.0%)

📋 Task Description

1. Load the image "data/dataset_002/dataset_002_ch0.tif" and set channel 0 to a magenta colormap.
2. Switch to a 3D MIP view.
3. Take a screenshot and analyze it to count how many complete cells are visible (not cut off by edges).
   Q1: answer with the number of complete cells you counted, for example "5" if you see 5 complete cells.
4. Save the answer of Q1 to the questions in plain text as "eval_visualization_tasks/case_7/results/{agent_mode}/Q1_answer.txt".

Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: 2

Agent's Answers

2

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: 2): The provided task answer is "2", which exactly matches the expected correct answer (2). Therefore, it fully satisfies the criterion and earns full credit.

📊 Detailed Metrics

Text Q&A Score: 10/10 (100.0%)
Input Tokens: 773,275
Output Tokens: 8,519
Total Tokens: 781,794
Total Cost: $2.4476

case_8

10/10 (100.0%)

📋 Task Description

1. Load the image "data/dataset_001/dataset_001.tiff".
2. Get basic statistics (min, max, mean, std) for the loaded layer.
3. Extract the raw layer data and examine its properties.
4. Save the current layer to a file for further analysis.
   Q1: Was the statistical analysis and data export successful? (Yes/No)
6. Save the answer of Q1 in plain text as "eval_visualization_tasks/case_8/results/{agent_mode}/layer_statistics_answer.txt".

Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

Yes

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: Yes): The provided task answer is exactly "Yes", which matches the expected correct answer. Therefore it fully satisfies the evaluation criterion.

📊 Detailed Metrics

Text Q&A Score: 10/10 (100.0%)
Input Tokens: 784,876
Output Tokens: 9,207
Total Tokens: 794,083
Total Cost: $2.4927

case_9

10/10 (100.0%)

📋 Task Description

1. Load the image "data/dataset_001/dataset_001.tiff".
2. Add point annotations at random locations on the image.
3. Add shape annotations (rectangles or circles) at random locations on the image.
   Q1: Check if layers have been generated. (Yes/No)
4. Save the answer of Q1 in plain text as "eval_visualization_tasks/case_9/results/{agent_mode}/annotation_answer.txt".

Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

Yes

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: Yes): The provided task answer is exactly "Yes", which matches the expected correct answer. Therefore it fully satisfies the criterion with no discrepancies.

📊 Detailed Metrics

Text Q&A Score: 10/10 (100.0%)
Input Tokens: 948,443
Output Tokens: 16,854
Total Tokens: 965,297
Total Cost: $3.0981

case_10

24/35 (68.6%)

📋 Task Description

1. Load the image "data/dataset_002/dataset_002_ch0.tif" into napari.
2. Trace the cell surface on the current slice by adding a polygon shape in a new shape layer.
3. Use a screenshot to validate whether the polygon correctly traces the cell surface.
4. If the trace is not accurate, adjust the polygon and take a new screenshot to validate.
5. Stop when the trace is accurate or you have tried five different attempts.
6. Save the results and the final screenshot to "eval_visualization_tasks/case_10/results/{agent_mode}/cell_surface_trace.png".

🖼️ Visualization Comparison

[Images: Ground Truth, Agent Result]

Score Summary

Total Score: 16/20
Goals: 2
Points/Goal: 10
Goal 1: 7/10
Criterion: Does the final screenshot show a polygon shape that accurately traces the outline of the cell surface?
Judge's Assessment: A closed red polygon is drawn around the bright, roughly circular cell-like structure and generally follows the outer boundary of the signal. However, the trace is quite jagged with many sharp vertices, and in several places it appears to deviate from the smooth cell edge (some segments cut inward or extend outward relative to the brightest rim), suggesting an imprecise outline rather than a clean surface contour.
Goal 2: 9/10
Criterion: Is the polygon layer correctly overlaid on the image?
Judge's Assessment: The polygon layer is clearly visible and properly overlaid on top of the grayscale image, centered on the cell structure. The red line contrasts well with the background and aligns spatially with the cell region without obvious offset or scaling issues.

Overall Assessment

Without ground truth, the overlay appears correctly positioned and readable, and it broadly encircles the cell surface. The main limitation is the polygon’s rough, jagged geometry and local mismatches to the apparent cell boundary, which reduce confidence in the accuracy of the surface trace.

📊 Detailed Metrics

Visualization Quality: 16/20
Output Generation: 5/5
Efficiency: 3/10
Input Tokens: 530,893
Output Tokens: 12,455
Total Tokens: 543,348
Total Cost: $1.7795

case_11

⚠️ LOW SCORE
17/55 (30.9%)

📋 Task Description

1. Load the "data/dataset_002/dataset_002_ch0.tif" dataset into napari as channel 0 and "data/dataset_002/dataset_002_ch1.tif" as channel 1.
2. Depending on the number of channels, set the colormap for the first channel 0 to red and channel 1 to green.
3. Switch to the 3D view.
4. Zoom in to the cell in the middle.
5. Rotate the camera to a side view.
6. Take a screenshot of the zoomed-in view and save it to "eval_visualization_tasks/case_11/results/{agent_mode}/zoom_screenshot.png".
7. Take a screenshot of the side view and save it to "eval_visualization_tasks/case_11/results/{agent_mode}/rotate_screenshot.png".

🖼️ Visualization Comparison - Set 1

[Images: Ground Truth 1, Agent Result 1]

Score Summary

Total Score: 7/20
Goals: 2
Points/Goal: 10
Goal 1: 3/10
Criterion: Does the visualization show a zoomed-in view of the cell in the middle?
Judge's Assessment: Ground truth shows a zoomed-in view with the main cell centered and filling most of the frame. The result image shows only a partial cell near the upper-right, with much of the frame being background; the central cell is not in the middle and the view is not comparably zoomed-in on the cell body.
Goal 2: 4/10
Criterion: Does the result rendering look similar to ground truth?
Judge's Assessment: Both images use similar fluorescence coloring (green structure with red puncta), but the composition differs strongly: the ground truth has a large, centered, well-framed cell with visible protrusions, while the result is cropped/shifted, shows less of the cell, and has different apparent scale and framing. Overall similarity is limited to color/texture rather than matching layout.

Overall Assessment

The result partially matches the visual style (green/red fluorescence) but fails to reproduce the key requirement of a centered, zoomed-in middle cell and does not closely match the ground-truth framing and scale.

🖼️ Visualization Comparison - Set 2

[Images: Ground Truth 2, Agent Result 2]

Score Summary

Total Score: 5/20
Goals: 2
Points/Goal: 10
Goal 1: 2/10
Criterion: Does the visualization show a side view of the cell?
Judge's Assessment: Ground truth shows a clear side view with the cell elongated vertically (upright profile). The result image shows the cell elongated horizontally (more like a top/plan view or rotated 90°), so the intended side-view orientation is largely not met.
Goal 2: 3/10
Criterion: Does the result rendering look similar to ground truth?
Judge's Assessment: The result differs strongly from the ground truth in overall orientation and composition: the ground truth has a single upright cell with a dark background, while the result has a horizontally oriented cell and also includes an additional bright structure below (suggesting a second cell/plane or reflection). Color channels (green with red puncta) are present in both, but the spatial arrangement and appearance are not similar.

Overall Assessment

The result preserves the general fluorescence coloring (green structure with red spots) but fails to match the ground-truth side-view orientation and overall visual layout, making it a poor match to the expected rendering.

📊 Detailed Metrics

Visualization Quality: 12/40
Output Generation: 5/5
Efficiency: 0/10
Input Tokens: 1,207,987
Output Tokens: 19,578
Total Tokens: 1,227,565
Total Cost: $3.9176