🎯 SciVisAgentBench Evaluation Report

napari_mcp Generated: 2026-03-07T16:19:50.103061

📊 Overall Performance

Overall Score: 61.8% (170/275 Points)
Test Cases: 9/11 Completed Successfully
Avg Vision Score: 44.7% (Visualization Quality, 53/130)
PSNR (Scaled): N/A (Peak SNR, 0/9 valid)
SSIM (Scaled): N/A (Structural Similarity)
LPIPS (Scaled): N/A (Perceptual Distance)
Completion Rate: 81.8% (Tasks Completed)
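
The headline numbers can be cross-checked against the per-case results in this report. The sketch below is a reader-side reconstruction, not the benchmark's own code. It assumes that the overall score is total points earned over total points possible, that the average vision score is the unweighted mean of the per-case Visualization Quality percentages, and that the completion rate counts every case not marked FAILED. All variable names are illustrative.

```python
from statistics import mean

# (earned, possible) points per case, copied from the case headers in this report.
CASE_TOTALS = {
    "case_1": (42, 45), "case_2": (10, 10), "case_3": (10, 10),
    "case_4": (0, 10), "case_5": (17, 45), "case_6": (23, 35),
    "case_7": (0, 10), "case_8": (10, 10), "case_9": (10, 10),
    "case_10": (20, 35), "case_11": (28, 55),
}

# Visualization Quality (earned, possible) for the cases that report one.
VISION = {
    "case_1": (18, 20), "case_5": (4, 30), "case_6": (9, 20),
    "case_10": (8, 20), "case_11": (14, 40),
}

# Cases flagged FAILED in this report.
FAILED = {"case_4", "case_7"}


def percent(value):
    """Convert a fraction to a percentage rounded to one decimal place."""
    return round(100 * value, 1)


earned = sum(e for e, _ in CASE_TOTALS.values())
possible = sum(p for _, p in CASE_TOTALS.values())

# Assumption: overall score pools points across all cases.
overall_score = percent(earned / possible)
# Assumption: vision score averages per-case percentages (not pooled points).
avg_vision_score = percent(mean(e / p for e, p in VISION.values()))
# Assumption: a case counts as completed unless it is marked FAILED.
completion_rate = percent((len(CASE_TOTALS) - len(FAILED)) / len(CASE_TOTALS))

print(overall_score, avg_vision_score, completion_rate)
```

Under these assumptions the three values agree with the 61.8%, 44.7%, and 81.8% shown in the summary. Pooling the vision points instead (53/130) would give about 40.8%, so the reported 44.7% is consistent with per-case averaging.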

ℹ️ About Scaled Metrics

Scaled metrics account for completion rate to enable fair comparison across different evaluation modes. Formulas:

PSNR_scaled = (completed_cases / total_cases) × avg(PSNR)
SSIM_scaled = (completed_cases / total_cases) × avg(SSIM)
LPIPS_scaled = 1.0 - (completed_cases / total_cases) × (1.0 - avg(LPIPS))

Cases with infinite PSNR (perfect match) are excluded from the PSNR calculation.
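
The scaling formulas above can be sketched in a few lines of Python. This is an illustrative reading of the stated formulas, not the benchmark's implementation: the function name `scaled_metrics`, the use of `None` to stand in for "N/A" when no valid values exist, and the sample inputs are all assumptions.

```python
import math
from statistics import mean


def scaled_metrics(psnr, ssim, lpips, completed_cases, total_cases):
    """Scale averaged image metrics by the completion rate.

    psnr, ssim, lpips: per-case metric values for completed cases.
    Returns None for a metric with no usable values (assumed to map to "N/A").
    """
    rate = completed_cases / total_cases
    # Infinite PSNR (perfect match) is excluded from the PSNR average.
    finite_psnr = [v for v in psnr if math.isfinite(v)]
    return {
        "psnr_scaled": rate * mean(finite_psnr) if finite_psnr else None,
        "ssim_scaled": rate * mean(ssim) if ssim else None,
        # LPIPS is a distance (lower is better), so it is scaled via (1 - LPIPS).
        "lpips_scaled": 1.0 - rate * (1.0 - mean(lpips)) if lpips else None,
    }


# Hypothetical values for three completed cases out of four.
example = scaled_metrics(
    psnr=[30.0, math.inf, 20.0],
    ssim=[0.9, 1.0, 0.8],
    lpips=[0.1, 0.0, 0.2],
    completed_cases=3,
    total_cases=4,
)
print(example)
```

Because LPIPS is inverted before scaling, an incomplete run is pushed toward 1.0 (worse), while PSNR and SSIM are pushed toward 0 (also worse), so a lower completion rate penalizes all three metrics in the same direction.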

🔧 Configuration

Provider: anthropic
Model: claude-sonnet-4-5
Base URL: https://api.anthropic.com
Input price: $3.00 per 1M tokens
Output price: $15.00 per 1M tokens
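
The two dollar amounts can be combined with the per-case token counts to get a rough cost estimate. The sketch below assumes that $3.00 and $15.00 are the input and output prices per one million tokens; the report does not label them. It also ignores any prompt-caching or batch discounts, so treat the result as an upper-level approximation rather than a billed amount. The function name is hypothetical.

```python
# Assumption: prices are USD per 1M tokens (input, output); not labeled in the report.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate the API cost of one case from its token counts."""
    total = (input_tokens * INPUT_PRICE_PER_MTOK
             + output_tokens * OUTPUT_PRICE_PER_MTOK)
    return total / 1_000_000


# Token counts for case_1, taken from its Detailed Metrics.
case_1_cost = estimate_cost_usd(127_185, 1_426)
print(f"case_1 estimated cost: ${case_1_cost:.4f}")
```

Input tokens dominate every case in this report (roughly 70k to 210k input versus 1k to 3k output), so under this pricing assumption the input side accounts for most of the estimated cost.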

📝 case_1

42/45 (93.3%)

📋 Task Description

1. Load the "data/dataset_002/dataset_002_ch0.tif" dataset into napari as channel 0 and "data/dataset_002/dataset_002_ch1.tif" as channel 1.
2. Set the colormap for channel 0 to red and channel 1 to green.
3. Switch to the 3D view.
4. Use additive blending for all channels to create an overlay visualization.
5. Go to timestep 14.
   Q1: Does the cell show protrusions? (Yes/No)
6. Take a screenshot of the result, save it to "eval_visualization_workflows/screenshot_1.png"
7. Answer Q1 in a plain text file "eval_visualization_workflows/multi_channel_answer.txt".

🖼️ Visualization Comparison

[Images: Ground Truth | Agent Result]

Score Summary

Total Score: 18/20
Goals: 2
Points/Goal: 10
Goal 1: 10/10
Criterion: Does the visualization show a green cell with red blobs on the inside?
Judge's Assessment: The result image clearly shows green cells with distinct red fluorescent blobs/spots inside the central green cell, matching the described appearance (green cell body with internal red puncta).
Goal 2: 8/10
Criterion: Does the result rendering look similar to ground truth?
Judge's Assessment: The result rendering is very similar in content and spatial arrangement to the ground truth (same cells, same green/red signal patterns). The main difference is presentation: the result is shown with substantial black padding/border around the image and appears slightly reduced in scale/contrast compared with the ground truth, but the underlying visualization matches well.

Overall Assessment

The core biological visual features (green cells with internal red blobs) match the ground truth very well. Similarity is high; only minor framing/scale differences (added black borders and slightly different intensity) reduce the match slightly.

Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

Yes

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: Yes): The provided task answer is exactly "Yes", which matches the expected correct answer. Therefore, it fully satisfies the evaluation criterion with no discrepancies.

📊 Detailed Metrics

Visualization Quality: 18/20
Output Generation: 5/5
Efficiency: 9/10
Text Q&A Score: 10/10 (100.0%)
Input Tokens: 127,185
Output Tokens: 1,426
Total Tokens: 128,611

📝 case_2

10/10 (100.0%)

📋 Task Description

1. Load the "data/dataset_002/Points.csv" dataset into napari.
2. Check if the points layer has been created.
   Q1: Was the points layer created successfully? (Yes/No)
3. Answer Q1 in a plain text file "eval_visualization_workflows/points_answer.txt".

Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

Q1: Was the points layer created successfully? Yes

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: Yes): The provided answer to Q1 is "Yes," which exactly matches the expected correct answer. Therefore, it fully satisfies the evaluation criterion.

📊 Detailed Metrics

Text Q&A Score: 10/10 (100.0%)
Input Tokens: 98,858
Output Tokens: 1,147
Total Tokens: 100,005

📝 case_3

10/10 (100.0%)

📋 Task Description

1. Load the "data/dataset_002/Shapes.csv" dataset into napari.
2. Check if the shapes layer has been created.
   Q1: Was the shapes layer created successfully? (Yes/No)
3. Answer Q1 in a plain text file "eval_visualization_workflows/shapes_answer.txt".

Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

Q1: Was the shapes layer created successfully? (Yes/No) Yes

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: Yes): The provided answer to Q1 is "Yes," which exactly matches the expected correct answer. Therefore, it fully satisfies the evaluation criterion.

📊 Detailed Metrics

Text Q&A Score: 10/10 (100.0%)
Input Tokens: 102,278
Output Tokens: 1,164
Total Tokens: 103,442

📝 case_4

❌ FAILED
0/10 (0.0%)

📋 Task Description

1. Load the "data/dataset_002/Labels.tif" dataset into napari.
2. Check if a new layer called "Labels" has been created.
   Q1: Was the layer created successfully? (Yes/No)
3. Answer Q1 in a plain text file "eval_visualization_workflows/labels_answer.txt".

Score

0/10 (0.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

No

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: Yes): The provided task answer is "No", which directly contradicts the required correct answer "Yes". Therefore it does not meet the criterion at all. Score for this criterion: 0/10. Total score: 0/10.

📊 Detailed Metrics

Text Q&A Score: 0/10 (0.0%)
Input Tokens: 72,442
Output Tokens: 863
Total Tokens: 73,305

📝 case_5

⚠️ LOW SCORE
17/45 (37.8%)

📋 Task Description

1. Load the dataset into napari: data/dataset_001/dataset_001.tiff
2. Read the target figure: data/dataset_001/dataset_001.png but don't load it into napari.
3. Read the dataset description: data/dataset_001/dataset_001.yaml.
4. Set the same colormaps and blending modes as the target figure.
5. Adjust contrast and gamma as needed to match the target figure.
6. Take a screenshot of your recreation.
7. If the recreation does not match the target figure, adjust the visualization settings and take a screenshot again.
8. Stop when the recreation matches the target figure or you have tried five different visualization settings.
9. Save the final screenshot to "eval_figure_recreation/screenshot.png".

🖼️ Visualization Comparison

[Images: Ground Truth | Agent Result]

Score Summary

Total Score: 4/30
Goals: 3
Points/Goal: 10
Goal 1: 2/10
Criterion: Does the result screenshot look similar to the ground truth image?
Judge's Assessment: The result image does not resemble the ground truth in overall appearance. The ground truth has a dark/black background with sparse teal/blue puncta, green neurite-like structures, and red nuclei-like blobs. The result has a brown/yellow noisy background with strong magenta/pink filament structures and bright cyan nuclei, with a very different spatial distribution and overall look. This is a major mismatch in the visual outcome.
Goal 2: 1/10
Criterion: Are the same colormaps and blending modes used as in the target figure?
Judge's Assessment: Colormaps/channel coloring and blending do not match. Ground truth uses a green+red+dim blue/teal composite on a black background (typical additive RGB fluorescence). The result uses prominent magenta/pink and cyan with a yellow/brown cast, suggesting different channel assignments and/or a different blend/compositing (and possibly a background layer). Overall, the color scheme and blending are fundamentally different.
Goal 3: 1/10
Criterion: Is the contrast and gamma adjusted to match the target figure?
Judge's Assessment: Contrast/gamma are not matched. The ground truth is high-contrast on a very dark background with relatively subdued signal intensity. The result is much brighter overall, with elevated background and visible grain/texture across the field, indicating very different intensity scaling and gamma/levels.

Overall Assessment

Across appearance, color/blending, and intensity scaling, the result diverges strongly from the ground truth. The result looks like a different staining/channel mapping and different level adjustments, with a bright noisy background rather than the dark high-contrast fluorescence look of the target.

📊 Detailed Metrics

Visualization Quality: 4/30
Output Generation: 5/5
Efficiency: 8/10
Input Tokens: 144,091
Output Tokens: 2,870
Total Tokens: 146,961

📝 case_6

23/35 (65.7%)

📋 Task Description

1. Read the file "data/dataset_003/eval_iso_surface_determination_target_1.txt" to get the target iso-surface values for different tooth structures.
2. Load data/dataset_003/dataset_003.tif into napari.
3. Switch to 3D view mode and set the rendering to iso.
4. Find the iso surface value that shows the target clearly.
5. Rotate the camera to several angles and take a screenshot of the result each time to check if the target structure is clearly visible from different angles.
6. If the target structure is not clearly visible, adjust the iso surface value and take a screenshot again.
7. Stop when the target structure is clearly visible or you have tried five different iso surface values.
8. Save the final screenshot to "eval_iso_surface_determination/screenshot.png".

🖼️ Visualization Comparison

[Images: Ground Truth | Agent Result]

Score Summary

Total Score: 9/20
Goals: 2
Points/Goal: 10
Goal 1: 3/10
Criterion: Does the result rendering look similar to ground truth?
Judge's Assessment: The ground truth shows a vertically oriented tooth centered in the frame with a black background and a glossy gray/white material. The result image also uses a black background and similar shading, but the camera/view is completely different: the tooth is rotated to a horizontal/side-on view and is heavily zoomed/cropped so the overall silhouette and composition do not match the ground truth. Specular highlights and overall tone are somewhat similar, but the viewpoint mismatch dominates.
Goal 2: 6/10
Criterion: Does the visualization show the target structure clearly?
Judge's Assessment: The target structure (a tooth) is visible and rendered with smooth shading, and some surface detail is apparent. However, because the tooth is shown from a different angle and is very close/cropped, the full tooth anatomy (crown + roots) is not as clearly and completely presented as in the ground truth, reducing structural clarity.

Overall Assessment

The result captures a similar rendering style (black background, glossy grayscale surface) and does depict the tooth, but it does not match the ground-truth viewpoint/composition and is overly zoomed/cropped. The structure is identifiable but less clearly presented than the reference due to orientation and framing differences.

📊 Detailed Metrics

Visualization Quality: 9/20
Output Generation: 5/5
Efficiency: 9/10
Input Tokens: 181,411
Output Tokens: 1,700
Total Tokens: 183,111

📝 case_7

❌ FAILED
0/10 (0.0%)

📋 Task Description

1. Load the image "data/dataset_002/dataset_002_ch0.tif" and set channel 0 to a magenta colormap.
2. Switch to a 3D MIP view.
3. Take a screenshot and analyze it to count how many complete cells are visible (not cut off by edges).
   Q1: answer with the number of complete cells you counted, for example "5" if you see 5 complete cells.
4. Save the answer of Q1 to the questions in plain text as "eval_analysis_workflows/Q1_answer.txt".

Score

0/10 (0.0%)

Questions & Correct Answers

1. Q1 correct answer: 2

Agent's Answers

7

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: 2): The provided task answer is "7", but the expected correct answer is "2". This does not meet the criterion, as it is numerically incorrect and provides no additional context or justification that could partially satisfy the requirement. Therefore, the criterion receives 0/10, yielding a total score of 0/10.

📊 Detailed Metrics

Text Q&A Score: 0/10 (0.0%)
Input Tokens: 177,596
Output Tokens: 1,485
Total Tokens: 179,081

📝 case_8

10/10 (100.0%)

📋 Task Description

1. Load the image "data/dataset_001/dataset_001.tiff".
2. Get basic statistics (min, max, mean, std) for the loaded layer.
3. Extract the raw layer data and examine its properties.
4. Save the current layer to a file for further analysis.
   Q1: Was the statistical analysis and data export successful? (Yes/No)
6. Save the answer of Q1 in plain text as "eval_analysis_workflows/layer_statistics_answer.txt".

Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

Yes

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: Yes): The provided task answer is exactly "Yes", which matches the expected correct answer. Therefore, it fully satisfies the evaluation criterion with no discrepancies.

📊 Detailed Metrics

Text Q&A Score: 10/10 (100.0%)
Input Tokens: 99,222
Output Tokens: 1,025
Total Tokens: 100,247

📝 case_9

10/10 (100.0%)

📋 Task Description

1. Load the image "data/dataset_001/dataset_001.tiff".
2. Add point annotations at random locations on the image.
3. Add shape annotations (rectangles or circles) at random locations on the image.
   Q1: Check if layers have been generated. (Yes/No)
4. Save the answer of Q1 in plain text as "eval_analysis_workflows/annotation_answer.txt".

Score

10/10 (100.0%)

Questions & Correct Answers

1. Q1 correct answer: Yes

Agent's Answers

Yes

Judge's Evaluation

Evaluation:

Criterion 1 (Q1 correct answer: Yes): The provided task answer is exactly "Yes", which matches the expected correct answer. Therefore it fully satisfies the evaluation criterion with no discrepancies.

📊 Detailed Metrics

Text Q&A Score: 10/10 (100.0%)
Input Tokens: 99,989
Output Tokens: 1,096
Total Tokens: 101,085

📝 case_10

20/35 (57.1%)

📋 Task Description

1. Load the image "data/dataset_002/dataset_002_ch0.tif" into napari.
2. Trace the cell surface on the current slice by adding a polygon shape in a new shape layer.
3. Use a screenshot to validate whether the polygon correctly traces the cell surface.
4. If the trace is not accurate, adjust the polygon and take a new screenshot to validate.
5. Stop when the trace is accurate or you have tried five different attempts.
6. Save the results and the final screenshot to "eval_annotation_workflows/cell_surface_trace.png".

🖼️ Visualization Comparison

[Images: Ground Truth | Agent Result]

Score Summary

Total Score: 8/20
Goals: 2
Points/Goal: 10
Goal 1: 2/10
Criterion: Does the final screenshot show a polygon shape that accurately traces the outline of the cell surface?
Judge's Assessment: The displayed polygon is a large, smooth, multi-vertex oval-like shape placed in the lower-central region. It does not appear to follow any visible cell boundary in the image: the main bright/structured region is above it, and a circular dim structure is on the right, neither of which is traced by the polygon. The polygon looks generic and disconnected from the apparent cell surface outline.
Goal 2: 6/10
Criterion: Is the polygon layer correctly overlaid on the image?
Judge's Assessment: A polygon layer is clearly rendered and visibly overlaid on top of the grayscale image (solid light-gray fill). Registration/visibility is fine in the sense that it sits on the image and is not shifted outside the frame. However, the polygon is opaque and obscures underlying content, and its placement does not align with a cell, so the overlay is technically present but not useful for the intended segmentation/outline visualization.

Overall Assessment

Without ground truth, the polygon does not appear to trace any cell surface outline in the scene, suggesting poor segmentation/annotation accuracy. The overlay itself is clearly applied on the image, but the fill opacity and misalignment relative to visible structures reduce interpretability and correctness.

📊 Detailed Metrics

Visualization Quality: 8/20
Output Generation: 5/5
Efficiency: 7/10
Input Tokens: 210,264
Output Tokens: 2,027
Total Tokens: 212,291

📝 case_11

28/55 (50.9%)

📋 Task Description

1. Load the "data/dataset_002/dataset_002_ch0.tif" dataset into napari as channel 0 and "data/dataset_002/dataset_002_ch1.tif" as channel 1.
2. Depending on the number of channels, set the colormap for the first channel 0 to red and channel 1 to green.
3. Switch to the 3D view.
4. Zoom in to the cell in the middle.
5. Rotate the camera to a side view.
6. Take a screenshot of the zoomed-in view and save it to "eval_camera_operations/zoom_screenshot.png".
7. Take a screenshot of the side view and save it to "eval_camera_operations/rotate_screenshot.png".

🖼️ Visualization Comparison - Set 1

[Images: Ground Truth 1 | Agent Result 1]

Score Summary

Total Score: 7/20
Goals: 2
Points/Goal: 10
Goal 1: 4/10
Criterion: Does the visualization show a zoomed-in view of the cell in the middle?
Judge's Assessment: Ground truth shows a zoomed-in, centered view of a roughly spherical cell, filling much of the frame, with clear membrane protrusions and internal red puncta. The result image does show a single cell-like object, but it is not similarly centered/zoomed: it sits off to the right with substantial empty black space on the left, and the cell occupies less of the frame than in the ground truth. Thus the intent of a centered zoom-in is only partially met.
Goal 2: 3/10
Criterion: Does the result rendering look similar to ground truth?
Judge's Assessment: The result rendering differs notably from the ground truth in content and appearance. The ground truth has strong green structure with many bright internal details and distinct red spots; the result is mostly a green outline/rim with a dim interior and no visible red channel features. Overall morphology/texture, contrast, and channel composition do not match well.

Overall Assessment

The result partially captures the idea of focusing on a single cell, but it is not framed as a centered zoom-in like the ground truth, and the visual appearance (especially missing red puncta and different green texture/contrast) is substantially different from the expected rendering.

🖼️ Visualization Comparison - Set 2

[Images: Ground Truth 2 | Agent Result 2]

Score Summary

Total Score: 7/20
Goals: 2
Points/Goal: 10
Goal 1: 4/10
Criterion: Does the visualization show a side view of the cell?
Judge's Assessment: The ground truth shows an elongated, vertically oriented cell in a side-view-like presentation (thick, volumetric appearance with internal structures and apparent depth). The result image shows a more rounded/oval cell with a bright peripheral membrane outline and looks more like a top-down or surface view rather than a side view. Some 3D impression exists, but the side-view aspect is not convincingly matched.
Goal 2: 3/10
Criterion: Does the result rendering look similar to ground truth?
Judge's Assessment: The result rendering differs strongly from the ground truth in overall morphology and signal distribution: ground truth is tall/elongated with substantial internal green texture and several red puncta; the result is a single rounded object dominated by a thin green rim and lacks the same internal volumetric texture and red channel features. Background and orientation also do not match closely.

Overall Assessment

The result only weakly satisfies the side-view requirement and does not resemble the ground truth rendering. The biggest mismatches are cell shape (elongated vs rounded), view/orientation, and fluorescence distribution (volumetric interior + red puncta vs mostly membrane outline with little/no red).

📊 Detailed Metrics

Visualization Quality: 14/40
Output Generation: 5/5
Efficiency: 9/10
Input Tokens: 172,354
Output Tokens: 1,882
Total Tokens: 174,236