Can LLMs Design in Figma?

FigmaBench

Benchmarking LLM Design Capability via Figma Plugin API Tool Use
Overview

What is FigmaBench?

FigmaBench is a benchmark for evaluating whether large language models can produce structured UI designs through tool use. Unlike screenshot-to-code benchmarks that test visual reproduction, FigmaBench measures whether models can operate a real design tool API — creating complete, well-structured, visually coherent Figma designs from natural language briefs.

Compare Models

Leaderboard

Each bar shows a model's composite score, decomposed into the 6 reward layers. All models evaluated in one-shot mode on 100 tasks. L6 (VLM Judge) currently available for Opus only.

Explore the Dataset

Tasks

Task Examples

Each task consists of a natural language brief, a set of required elements, design tokens, and a golden reference — a hand-crafted Figma design representing the ideal output.

Task Categories

100 tasks span 10 UI categories, each with 10 tasks: Authentication, Onboarding, Dashboard, Lists, Detail Views, Forms, Settings, Navigation, Modals, and Design System.

Task Difficulty

36 easy (3–6 elements) · 37 medium (7–13 elements) · 27 hard (14–26 elements). Average 10.2 required elements per task, 38.9 golden reference nodes.

How It Works

Architecture

Click any component to see its description. The pipeline starts with a task brief (top-left) and ends with a composite score (bottom-left): commands flow right → into Figma; results flow left ← back for scoring.
Scoring System

Evaluation

6-Layer Reward Stack

The reward stack scores designs across 6 independent dimensions. L1–L4 can be scored from JSON schema alone (no Figma needed), enabling offline RL training. L5–L6 require a Figma instance for screenshots.

6-Layer Reward Stack Overview

Each layer captures a distinct aspect of design quality. Together they produce a weighted composite score. L1–L4 are scored from JSON schema alone (offline-friendly). L5–L6 require Figma screenshots.

| Layer | What It Measures | Weight | Method | Needs Figma? |
|---|---|---|---|---|
| L1 Element Matching | Are all required UI elements present? Uses a 7-level matching cascade (exact → fuzzy → semantic) to handle naming variations. | 1.0 | Bag-of-words matching | No |
| L2 Layout & Constraints | Is auto-layout used correctly? Checks layout mode, padding, spacing, alignment, and constraint consistency. | 0.5 | Property validation | No |
| L3 Color & Tokens | Do colors and fonts match the specified tokens? Uses perceptual color distance (CIE76 ΔE), not exact hex matching. | 0.3 | Color: 60%, Font: 40% | No |
| L4 Schema Similarity | How close is the generated node tree to the golden reference? Structural comparison of the full Figma JSON tree. | 0.3 | Tree diff | No |
| L5 Screenshot Compare | How similar is the screenshot to the golden reference? Pixel-level comparison using structural similarity (SSIM). | 0.2 | SSIM on screenshots | Yes |
| L6 VLM Judge | Multimodal quality assessment: a vision-language model rates the design side-by-side with the golden reference. | 0.5 | VLM API call | Yes |
C = Σ(wᵢ × sᵢ) / Σ(wᵢ)
where wᵢ = layer weight, sᵢ = layer score ∈ [0, 1]
Without VLM (L1–L5): Σ(wᵢ) = 1.0 + 0.5 + 0.3 + 0.3 + 0.2 = 2.3
With VLM (L1–L6): Σ(wᵢ) = 2.3 + 0.5 = 2.8
Dividing by Σ(wᵢ) normalizes the composite score C to [0, 1] regardless of which layers are included.
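The normalization above can be sketched in a few lines of Python (a minimal sketch; the `WEIGHTS` keys and `composite` helper are illustrative, not the benchmark's actual code):

```python
# Composite score C = sum(w_i * s_i) / sum(w_i), normalized to [0, 1].
# Weights taken from the reward stack table above.
WEIGHTS = {"L1": 1.0, "L2": 0.5, "L3": 0.3, "L4": 0.3, "L5": 0.2, "L6": 0.5}

def composite(scores: dict[str, float]) -> float:
    """Weighted mean over whichever layers were actually scored."""
    total_w = sum(WEIGHTS[k] for k in scores)
    return sum(WEIGHTS[k] * s for k, s in scores.items()) / total_w

# Without L6 the denominator is 2.3; with L6 it grows to 2.8, so the
# composite stays in [0, 1] either way.
```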

Key insight: L1–L4 enable offline RL training from JSON schema alone — no Figma instance needed. L5–L6 are reserved for evaluation.

Element Matching

L1: Element Matching (weight: 1.0) checks whether all required UI elements are present in the generated design. The score is found / total. But "finding" an element is harder than it sounds — a model might name its sign-in button SignInBtn instead of sign_in_button. The matcher needs to be robust to naming variations without producing false positives.

The 7-Level Matching Cascade

| Level | Strategy | Example |
|---|---|---|
| 1 | Exact bag-of-words in a single node | sign_in_button → {"sign","in","button"} ⊆ node words |
| 2 | Long words (≥3 chars) in a single node | Filters noise from short tokens like "1", "a" |
| 3 | Global word pool (4-char prefix fuzzy) | All words found somewhere in tree (not same node) |
| 4 | Content words only (ignore UI structure words) | "sign" in pool, "button" ignored as generic UI term |
| 5 | Partial match (≥50%) with multi-word anchor | Catches partial name overlaps |
| 6 | Exact substring | "email" found in "Email Input Field" |
| 7 | Alphanumeric substring (≥4 chars) | "signin" found in "SignInButton" |

Both node.name and node.characters (text content) are searched. The cascade ensures that reasonable naming variations don't penalize structurally correct designs.
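A few representative cascade levels can be sketched as follows (levels 1, 3, and 6 only; the word-splitting regex and `GENERIC` handling are simplified assumptions, not the benchmark's implementation):

```python
import re

def words(s):
    # Split snake_case / CamelCase / spaces into a lowercase word bag.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", s)
    return {w.lower() for w in re.findall(r"[A-Za-z]+|\d+", spaced)}

def match_element(required, node_names):
    """Return the first cascade level (subset shown) that matches, else None."""
    req = words(required)
    # Level 1: exact bag-of-words containment in a single node.
    if any(req <= words(n) for n in node_names):
        return 1
    # Level 3: global word pool across the whole tree (exact words here,
    # no 4-char prefix fuzzing).
    pool = set().union(*(words(n) for n in node_names)) if node_names else set()
    if req <= pool:
        return 3
    # Level 6: exact substring of a normalized name.
    needle = required.replace("_", "").replace(" ", "").lower()
    if any(needle in n.replace("_", "").replace(" ", "").lower() for n in node_names):
        return 6
    return None
```

For example, `sign_in_button` matches a node named `SignInButton` at level 1, while the squashed name `signin` only matches via the level-6 substring check.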


Layout & Constraints

L2: Layout Quality (weight: 0.5) measures whether auto-layout is used correctly — proper nesting, spacing, and alignment of UI elements. This layer rewards designs that use Figma's layout system idiomatically rather than relying on absolute positioning.

Penalty-Based Scoring

The score starts at 1.0 and each violation subtracts a penalty. Select a task to see the real analysis against its golden tree. Use the "what-if" toggles to simulate failures and see how the score degrades.
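The penalty mechanics can be sketched like this (the penalty names and values below are hypothetical; the benchmark's actual violation table is not specified here):

```python
# Hypothetical penalty table for the L2 layout checks; illustrative only.
PENALTIES = {
    "no_auto_layout": 0.3,
    "zero_padding": 0.1,
    "inconsistent_spacing": 0.1,
    "absolute_positioning": 0.2,
}

def layout_score(violations):
    """Start at 1.0 and subtract one penalty per violation, floored at 0."""
    score = 1.0
    for v in violations:
        score -= PENALTIES.get(v, 0.0)
    return max(0.0, score)
```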


Color Space & Tokens

L3: Design System Compliance (weight: 0.3) measures how well the generated design matches the task's design tokens. The score combines two sub-metrics: L3 = 0.6 × color_compliance + 0.4 × font_compliance.

How It Works

  1. Color compliance (60%): Extract all fill and stroke colors from the design tree. For each color: if its chroma ≤ 15 (neutral — white/black/gray), it auto-passes. Otherwise, compute CIE76 ΔE to every design token color; pass if min ΔE ≤ threshold (default 30).
  2. Font compliance (40%): Check every TEXT node's fontFamily against the specified font in design tokens (default: Inter).
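The color-compliance half can be sketched as below. To keep the sketch short it assumes colors are already converted to CIELAB tuples `(L*, a*, b*)`; the RGB→Lab conversion step is omitted:

```python
import math

def delta_e_cie76(lab1, lab2):
    # CIE76 delta-E is plain Euclidean distance in CIELAB space.
    return math.dist(lab1, lab2)

def color_compliance(design_labs, token_labs, threshold=30.0):
    """Fraction of design colors that pass. Neutral colors
    (chroma = sqrt(a*^2 + b*^2) <= 15) auto-pass as white/black/gray."""
    if not design_labs:
        return 1.0
    passed = 0
    for L, a, b in design_labs:
        if math.hypot(a, b) <= 15:  # neutral: auto-pass
            passed += 1
        elif min(delta_e_cie76((L, a, b), t) for t in token_labs) <= threshold:
            passed += 1
    return passed / len(design_labs)
```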

Schema Similarity

L4: Schema Similarity (weight: 0.3) goes beyond checking if elements exist — it measures how closely the generated tree's shape matches the golden reference. Two designs might have all the right elements but wildly different structures.

Three-Component Scoring

Both trees are flattened into multisets of (type, name) tuples. VECTOR nodes are excluded — golden references contain icon paths that models can't produce via the API. Names are normalized to lowercase first-word. The score combines three components:

Score = 0.5 × Jaccard + 0.25 × depth_sim + 0.25 × count_sim
  • Jaccard = Σmin / Σmax over all (type, name) counters
  • depth_sim = min(golden_depth, gen_depth) / max(...)
  • count_sim = min(golden_nodes, gen_nodes) / max(...)
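The three components above can be combined directly with `collections.Counter` (a minimal sketch; tree flattening and depth computation are assumed to have happened upstream):

```python
from collections import Counter

def schema_similarity(golden, generated, golden_depth, gen_depth):
    """golden/generated: multisets of (type, name) tuples, VECTOR excluded."""
    g, h = Counter(golden), Counter(generated)
    keys = set(g) | set(h)
    # Multiset Jaccard: sum of per-tuple mins over sum of per-tuple maxes.
    jaccard = (sum(min(g[k], h[k]) for k in keys) /
               sum(max(g[k], h[k]) for k in keys)) if keys else 1.0
    depth_sim = min(golden_depth, gen_depth) / max(golden_depth, gen_depth)
    count_sim = (min(len(golden), len(generated)) /
                 max(len(golden), len(generated))) if generated else 0.0
    return 0.5 * jaccard + 0.25 * depth_sim + 0.25 * count_sim
```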

Try removing or adding nodes in the generated tree to see how each component responds. Hover any node to highlight matches across both trees and the multiset table.


Screenshot Comparison

L5: Visual Fidelity (weight: 0.2) — requires Figma. Combines SSIM structural similarity with DCT-based perceptual hashing (pHash) for a two-component visual comparison between the generated and golden reference screenshots.

How It Works

  1. SSIM (Structural Similarity Index): Decomposes image similarity into luminance, contrast, and structure components. Computed block-wise (8×8) per RGB channel and averaged.
  2. pHash (Perceptual Hash): Resizes to 32×32, applies 2D DCT, thresholds the 8×8 low-frequency block by median to produce a 64-bit hash. Hamming distance measures perceptual difference.
  3. Final score: 0.6 × SSIM + 0.4 × max(0, 1 − hamming / 32)
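The final combination (step 3) can be sketched as follows; SSIM and the DCT hash computation themselves are omitted, and the `/ 32` normalization mirrors the formula above, so any Hamming distance of 32 bits or more contributes zero:

```python
def hamming(hash_a: int, hash_b: int) -> int:
    # Number of differing bits between two 64-bit perceptual hashes.
    return bin(hash_a ^ hash_b).count("1")

def visual_score(ssim: float, hash_a: int, hash_b: int) -> float:
    """0.6 * SSIM + 0.4 * hash similarity, per the L5 formula."""
    hash_sim = max(0.0, 1.0 - hamming(hash_a, hash_b) / 32)
    return 0.6 * ssim + 0.4 * hash_sim
```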

VLM Quality Judge

L6: VLM Judge (weight: 0.5) — requires Figma. A vision-language model (Claude) compares the generated screenshot against the golden reference, scoring on 6 criteria each rated 0–10. The VLM also assigns an independent “overall” score. Use the degradation slider to simulate a bad generation, then adjust criteria scores to play the role of the VLM judge.

How It Works

  1. Two images are sent as base64 to the VLM: the golden reference screenshot and the generated screenshot, along with the task brief for context.
  2. 6 criteria (0–10 each): Layout & Spacing, Typography, Color, Completeness, Polish, Match.
  3. Overall score: The VLM assigns an independent “overall” (0–10), not necessarily the mean of the 6 criteria.
  4. Final score: min(1.0, max(0.0, overall / 10))
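The final step reduces to parsing the judge's reply and clamping. This sketch assumes the judge returns a JSON object with an `overall` key; the benchmark's actual prompt and response format may differ:

```python
# Assumed response shape: {"layout": 8, ..., "overall": 7.5}
def vlm_final_score(judgment: dict) -> float:
    """Clamp the judge's independent 'overall' (0-10) into [0, 1]."""
    overall = float(judgment.get("overall", 0.0))
    return min(1.0, max(0.0, overall / 10))
```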
The 6 criteria and what each rewards:

  • Layout & Spacing: proper alignment, consistent spacing, visual hierarchy
  • Typography: readable fonts, appropriate sizes, proper weight usage
  • Color: harmonious palette, proper contrast, consistent usage
  • Completeness: all required UI elements present and functional-looking
  • Polish: overall refinement, attention to detail, professional quality
  • Match: how closely the generated design matches the reference
What's Next

Conclusion

What Makes FigmaBench Unique

  • Structured API tool use, not GUI clicks. The model produces Figma Plugin API calls — testing design tool competence at the schema level.
  • Multi-layer continuous rewards. 6 independent scoring dimensions enable fine-grained evaluation and reinforcement learning.
  • A real production tool. Designs are created in live Figma — real rendering, font loading, auto-layout computation.

Limitations

Requires Figma Desktop for L5/L6 scoring. Focuses on mobile UI (375px) with Inter font. 100 tasks is modest — scaling to 500+ planned.

Citation

@misc{solerno2026figmabench,
  title        = {FigmaBench: Benchmarking LLM Design Capability
                  via Figma Plugin API Tool Use},
  author       = {Özdemirden, A. Şemsettin},
  year         = {2026},
  organization = {solerno-ai}
}