FigmaBench is a benchmark for evaluating whether large language models can produce structured UI designs through tool use. Unlike screenshot-to-code benchmarks that test visual reproduction, FigmaBench measures whether models can operate a real design tool API — creating complete, well-structured, visually coherent Figma designs from natural language briefs.
Each bar shows a model's composite score, decomposed into the 6 reward layers. All models evaluated in one-shot mode on 100 tasks. L6 (VLM Judge) currently available for Opus only.
Each task consists of a natural language brief, a set of required elements, design tokens, and a golden reference — a hand-crafted Figma design representing the ideal output.
100 tasks span 10 UI categories, each with 10 tasks: Authentication, Onboarding, Dashboard, Lists, Detail Views, Forms, Settings, Navigation, Modals, and Design System.
36 easy (3–6 elements) · 37 medium (7–13 elements) · 27 hard (14–26 elements). Average 10.2 required elements per task, 38.9 golden reference nodes.
Commands flow right into Figma; results flow back for scoring.
The reward stack scores designs across 6 independent dimensions. L1–L4 can be scored from JSON schema alone (no Figma needed), enabling offline RL training. L5–L6 require a Figma instance for screenshots.
Each layer captures a distinct aspect of design quality; together they produce a weighted composite score.
| Layer | What It Measures | Weight | Method | Needs Figma? |
|---|---|---|---|---|
| L1 Element Matching | Are all required UI elements present? Uses a 7-level matching cascade (exact → fuzzy → semantic) to handle naming variations. | 1.0 | Bag-of-words matching | No |
| L2 Layout & Constraints | Is auto-layout used correctly? Checks layout mode, padding, spacing, alignment, and constraint consistency. | 0.5 | Property validation | No |
| L3 Color & Tokens | Do colors and fonts match the specified tokens? Uses perceptual color distance (CIE76 ΔE) — not exact hex matching. | 0.3 | Color: 60%, Font: 40% | No |
| L4 Schema Similarity | How close is the generated node tree to the golden reference? Structural comparison of the full Figma JSON tree. | 0.3 | Tree diff | No |
| L5 Screenshot Compare | How similar is the screenshot to the golden reference? Pixel-level comparison using structural similarity (SSIM). | 0.2 | SSIM on screenshots | Yes |
| L6 VLM Judge | Multimodal quality assessment — a vision-language model rates the design side-by-side with the golden reference. | 0.5 | VLM API call | Yes |
Composite = (Σᵢ wᵢ · sᵢ) / (Σᵢ wᵢ), where wᵢ is the layer weight and sᵢ the layer score ∈ [0, 1]. Key insight: L1–L4 enable offline RL training from JSON schema alone — no Figma instance needed. L5–L6 are reserved for evaluation.
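A minimal sketch of the weighted composite, assuming the standard weighted-average form; the function name and the renormalization over whichever layers are present (for the offline L1–L4 case) are illustrative:

```python
# Layer weights from the reward-stack table.
WEIGHTS = {"L1": 1.0, "L2": 0.5, "L3": 0.3, "L4": 0.3, "L5": 0.2, "L6": 0.5}

def composite(scores: dict[str, float]) -> float:
    """scores: layer -> s_i in [0, 1]. Missing layers are skipped and the
    remaining weights renormalized (an assumption for offline scoring)."""
    layers = [k for k in WEIGHTS if k in scores]
    total_w = sum(WEIGHTS[k] for k in layers)
    return sum(WEIGHTS[k] * scores[k] for k in layers) / total_w
```

With only L1–L4 available during offline training, the same function simply averages over the four schema-derived layers.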
L1: Element Matching (weight: 1.0) checks whether all required UI elements are present in the generated design. The score is found / total. But "finding" an element is harder than it sounds — a model might name its sign-in button SignInBtn instead of sign_in_button. The matcher needs to be robust to naming variations without producing false positives.
| Level | Strategy | Example |
|---|---|---|
| 1 | Exact bag-of-words in a single node | sign_in_button → {"sign","in","button"} ⊆ node words |
| 2 | Long words (≥3 chars) in a single node | Filters noise from short tokens like "1", "a" |
| 3 | Global word pool (4-char prefix fuzzy) | All words found somewhere in tree (not same node) |
| 4 | Content words only (ignore UI structure words) | "sign" in pool, "button" ignored as generic UI term |
| 5 | Partial match (≥50%) with multi-word anchor | Catches partial name overlaps |
| 6 | Exact substring | "email" found in "Email Input Field" |
| 7 | Alphanumeric substring (≥4 chars) | "signin" found in "SignInButton" |
Both node.name and node.characters (text content) are searched. The cascade ensures that reasonable naming variations don't penalize structurally correct designs.
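The first two cascade levels can be sketched as follows; the tokenizer and function names are illustrative, not the benchmark's actual implementation:

```python
import re

def words(s: str) -> set[str]:
    """Split an identifier like 'SignInBtn' or 'sign_in_button' into lowercase words."""
    s = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", s)  # break camelCase boundaries
    return set(re.split(r"[^a-zA-Z0-9]+", s.lower())) - {""}

def level1(required: str, node_text: str) -> bool:
    """Level 1: exact bag-of-words containment in a single node."""
    return words(required) <= words(node_text)

def level2(required: str, node_text: str) -> bool:
    """Level 2: same, but only words of length >= 3 (drops noise tokens like '1', 'a')."""
    req = {w for w in words(required) if len(w) >= 3}
    return bool(req) and req <= words(node_text)
```

Level 1 accepts `"Sign In Button"` for `sign_in_button` but rejects `SignInBtn` (no "button" token), which is exactly what the later fuzzy levels exist to recover.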
L2: Layout Quality (weight: 0.5) measures whether auto-layout is used correctly — proper nesting, spacing, and alignment of UI elements. This layer rewards designs that use Figma's layout system idiomatically rather than relying on absolute positioning.
The score starts at 1.0 and each violation subtracts a penalty.
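A sketch of the penalty scheme; the violation names and penalty sizes here are illustrative placeholders, not the benchmark's actual table:

```python
# Hypothetical per-violation penalties (illustrative values).
PENALTIES = {
    "no_auto_layout": 0.3,
    "missing_padding": 0.1,
    "inconsistent_spacing": 0.1,
    "bad_alignment": 0.1,
}

def layout_score(violations: list[str]) -> float:
    """Start at 1.0, subtract a penalty per violation, floor at 0."""
    score = 1.0
    for v in violations:
        score -= PENALTIES.get(v, 0.05)  # small default for unlisted violations
    return max(0.0, score)
```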
L3: Design System Compliance (weight: 0.3) measures how well the generated design matches the task's design tokens. The score combines two sub-metrics: L3 = 0.6 × color_compliance + 0.4 × font_compliance.
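Color compliance is measured with CIE76 ΔE rather than exact hex matching. A self-contained sketch of that distance (standard sRGB → CIELAB conversion under D65; how the benchmark maps ΔE values to a compliance score is not specified here):

```python
import math

def srgb_to_lab(rgb):
    """Convert 8-bit sRGB to CIELAB (D65 white point), standard formulas."""
    def lin(c):
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (lin(c) for c in rgb)
    # linear sRGB -> XYZ (D65)
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b
    def f(t):
        return t ** (1 / 3) if t > 0.008856 else 7.787 * t + 16 / 116
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)

def delta_e76(rgb1, rgb2):
    """CIE76 color difference: Euclidean distance in Lab space."""
    return math.dist(srgb_to_lab(rgb1), srgb_to_lab(rgb2))
```

Black vs. white yields ΔE ≈ 100, while visually indistinguishable colors land under ΔE ≈ 2–3, which is why a tolerance-based check beats exact hex comparison.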
Font compliance checks each text node's fontFamily against the font specified in the design tokens (default: Inter).

L4: Schema Similarity (weight: 0.3) goes beyond checking whether elements exist — it measures how closely the generated tree's shape matches the golden reference. Two designs might have all the right elements but wildly different structures.
Both trees are flattened into multisets of (type, name) tuples. VECTOR nodes are excluded — golden references contain icon paths that models can't produce via the API. Names are normalized to lowercase first-word. The score combines three components:
Score = 0.5 × Jaccard + 0.25 × depth_sim + 0.25 × count_sim
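The three components can be sketched with `collections.Counter` as the multiset. The min/max ratio form for depth and count similarity is an assumption, and the flattening follows the rules above (VECTOR nodes excluded, names lowercased to their first word):

```python
from collections import Counter

def flatten(node, out=None):
    """Flatten a Figma-style node tree into a multiset of (type, name) tuples."""
    if out is None:
        out = Counter()
    if node["type"] != "VECTOR":  # golden refs contain icon paths models can't emit
        name = node.get("name", "").lower().split()
        out[(node["type"], name[0] if name else "")] += 1
    for child in node.get("children", []):
        flatten(child, out)
    return out

def l4_score(gen_root, gold_root):
    gen, gold = flatten(gen_root), flatten(gold_root)
    inter = sum((gen & gold).values())  # multiset intersection (min counts)
    union = sum((gen | gold).values())  # multiset union (max counts)
    jaccard = inter / union if union else 1.0
    def ratio(a, b):  # assumed min/max similarity form
        return min(a, b) / max(a, b) if max(a, b) else 1.0
    def depth(n):
        return 1 + max((depth(c) for c in n.get("children", [])), default=0)
    depth_sim = ratio(depth(gen_root), depth(gold_root))
    count_sim = ratio(sum(gen.values()), sum(gold.values()))
    return 0.5 * jaccard + 0.25 * depth_sim + 0.25 * count_sim
```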
L5: Visual Fidelity (weight: 0.2) — requires Figma. Combines SSIM structural similarity with DCT-based perceptual hashing (pHash) for a two-component visual comparison between the generated and golden reference screenshots.
Score = 0.6 × SSIM + 0.4 × max(0, 1 − hamming / 32), where hamming is the Hamming distance between the generated and golden perceptual hashes.
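Given an SSIM value and two 64-bit pHashes, combining them per the formula above is a few lines (function name illustrative; SSIM and pHash computation themselves are assumed to come from an imaging pipeline):

```python
def l5_score(ssim: float, phash_gen: int, phash_gold: int) -> float:
    """0.6 * SSIM + 0.4 * max(0, 1 - hamming/32) over 64-bit perceptual hashes."""
    hamming = bin(phash_gen ^ phash_gold).count("1")  # differing bits
    return 0.6 * ssim + 0.4 * max(0.0, 1.0 - hamming / 32)
```

Capping the hash term at a distance of 32 means any pair of screenshots that disagrees on half the hash bits or more gets zero credit from the pHash component.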
L6: VLM Judge (weight: 0.5) — requires Figma. A vision-language model (Claude) compares the generated screenshot against the golden reference, scoring on 6 criteria each rated 0–10. The VLM also assigns an independent “overall” score.
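The overall rating is then clamped into [0, 1] for the composite; as a one-line sketch (function name illustrative):

```python
def l6_score(overall: float) -> float:
    """Clamp the VLM judge's 0-10 'overall' rating into [0, 1]."""
    return min(1.0, max(0.0, overall / 10))
```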
The final L6 score is min(1.0, max(0.0, overall / 10)).

Limitations: requires Figma Desktop for L5/L6 scoring; focuses on mobile UI (375px) with Inter font; 100 tasks is modest, with scaling to 500+ planned.
@article{solerno2026figmabench,
title = {FigmaBench: Benchmarking LLM Design Capability
via Figma Plugin API Tool Use},
author = {Özdemirden, A. Şemsettin},
year = {2026},
organization = {solerno-ai}
}