Can LLMs Design in Figma?

FigmaBench

Benchmarking LLM Design Capability via Figma Plugin API Tool Use
Overview

What is FigmaBench?

FigmaBench is a benchmark for evaluating whether large language models can produce structured UI designs through tool use. Unlike screenshot-to-code benchmarks that test visual reproduction, FigmaBench measures whether models can operate a real design tool API — creating complete, well-structured, visually coherent Figma designs from natural language briefs.

Compare Models

Leaderboard

Each bar shows a model's composite score, decomposed into the 6 reward layers. All models evaluated in one-shot mode on 100 tasks. L6 (VLM Judge) currently available for Opus only.

Explore the Dataset

Tasks

Task Examples

Each task consists of a natural language brief, a set of required elements, design tokens, and a golden reference — a hand-crafted Figma design representing the ideal output.

Task Categories

100 tasks span 10 UI categories, each with 10 tasks: Authentication, Onboarding, Dashboard, Lists, Detail Views, Forms, Settings, Navigation, Modals, and Design System.

Task Difficulty

36 easy (3–6 elements) · 37 medium (7–13 elements) · 27 hard (14–26 elements). Average 10.2 required elements per task, 38.9 golden reference nodes.

How It Works

Architecture

Click any component to see its description. The pipeline starts with a task brief (top-left) and ends with a composite score (bottom-left): commands flow right → into Figma; results flow left ← back for scoring.
Scoring System

Evaluation

6-Layer Reward Stack

The reward stack scores designs across 6 independent dimensions. L1–L4 can be scored from JSON schema alone (no Figma needed), enabling offline RL training. L5–L6 require a Figma instance for screenshots.

6-Layer Reward Stack Overview

Each layer captures a distinct aspect of design quality. Together they produce a weighted composite score. L1–L4 are scored from JSON schema alone (offline-friendly). L5–L6 require Figma screenshots.

| Layer | What It Measures | Weight | Method | Needs Figma? |
|---|---|---|---|---|
| L1 Element Matching | Are all required UI elements present? Uses a 7-level matching cascade (exact → fuzzy → semantic) to handle naming variations. | 1.0 | Bag-of-words matching | No |
| L2 Layout & Constraints | Is auto-layout used correctly? Checks layout mode, padding, spacing, alignment, and constraint consistency. | 0.5 | Property validation | No |
| L3 Color & Tokens | Do colors and fonts match the specified tokens? Uses perceptual color distance (CIE76 ΔE), not exact hex matching. | 0.3 | Color: 60%, Font: 40% | No |
| L4 Schema Similarity | How close is the generated node tree to the golden reference? Structural comparison of the full Figma JSON tree. | 0.3 | Tree diff | No |
| L5 Screenshot Compare | How similar is the screenshot to the golden reference? Pixel-level comparison using structural similarity (SSIM). | 0.2 | SSIM on screenshots | Yes |
| L6 VLM Judge | Multimodal quality assessment: a vision-language model rates the design side-by-side with the golden reference. | 0.5 | VLM API call | Yes |
C = Σ(wᵢ × sᵢ) / Σ(wᵢ)
where wᵢ = layer weight, sᵢ = layer score ∈ [0, 1]
Without VLM (L1–L5): Σ(wᵢ) = 1.0 + 0.5 + 0.3 + 0.3 + 0.2 = 2.3
With VLM (L1–L6): Σ(wᵢ) = 2.3 + 0.5 = 2.8
Dividing by Σ(wᵢ) normalizes the composite score C to [0, 1] regardless of which layers are included.
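The normalization above can be sketched in a few lines of Python (a minimal sketch; the `WEIGHTS` keys and `composite` helper are illustrative, not the benchmark's actual code):

```python
# Composite score C = sum(w_i * s_i) / sum(w_i), normalized to [0, 1].
# Weights taken from the reward stack table above.
WEIGHTS = {"L1": 1.0, "L2": 0.5, "L3": 0.3, "L4": 0.3, "L5": 0.2, "L6": 0.5}

def composite(scores: dict[str, float]) -> float:
    """Weighted mean over whichever layers were actually scored."""
    total_w = sum(WEIGHTS[k] for k in scores)
    return sum(WEIGHTS[k] * s for k, s in scores.items()) / total_w

# Without L6 the denominator is 2.3; with L6 it grows to 2.8, so the
# composite stays in [0, 1] either way.
```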

Key insight: L1–L4 enable offline RL training from JSON schema alone — no Figma instance needed. L5–L6 are reserved for evaluation.

Element Matching

L1: Element Matching (weight: 1.0) checks whether all required UI elements are present in the generated design. The score is found / total. But "finding" an element is harder than it sounds — a model might name its sign-in button SignInBtn instead of sign_in_button. The matcher needs to be robust to naming variations without producing false positives.

The 7-Level Matching Cascade

| Level | Strategy | Example |
|---|---|---|
| 1 | Exact bag-of-words in a single node | sign_in_button → {"sign","in","button"} ⊆ node words |
| 2 | Long words (≥3 chars) in a single node | Filters noise from short tokens like "1", "a" |
| 3 | Global word pool (4-char prefix fuzzy) | All words found somewhere in tree (not same node) |
| 4 | Content words only (ignore UI structure words) | "sign" in pool, "button" ignored as generic UI term |
| 5 | Partial match (≥50%) with multi-word anchor | Catches partial name overlaps |
| 6 | Exact substring | "email" found in "Email Input Field" |
| 7 | Alphanumeric substring (≥4 chars) | "signin" found in "SignInButton" |

Both node.name and node.characters (text content) are searched. The cascade ensures that reasonable naming variations don't penalize structurally correct designs.
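A few representative cascade levels can be sketched as follows (levels 1, 3, and 6 only; the word-splitting regex and `GENERIC` handling are simplified assumptions, not the benchmark's implementation):

```python
import re

def words(s):
    # Split snake_case / CamelCase / spaces into a lowercase word bag.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", s)
    return {w.lower() for w in re.findall(r"[A-Za-z]+|\d+", spaced)}

def match_element(required, node_names):
    """Return the first cascade level (subset shown) that matches, else None."""
    req = words(required)
    # Level 1: exact bag-of-words containment in a single node.
    if any(req <= words(n) for n in node_names):
        return 1
    # Level 3: global word pool across the whole tree (exact words here,
    # no 4-char prefix fuzzing).
    pool = set().union(*(words(n) for n in node_names)) if node_names else set()
    if req <= pool:
        return 3
    # Level 6: exact substring of a normalized name.
    needle = required.replace("_", "").replace(" ", "").lower()
    if any(needle in n.replace("_", "").replace(" ", "").lower() for n in node_names):
        return 6
    return None
```

For example, `sign_in_button` matches a node named `SignInButton` at level 1, while the squashed name `signin` only matches via the level-6 substring check.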


Layout & Constraints

L2: Layout Quality (weight: 0.5) measures whether auto-layout is used correctly — proper nesting, spacing, and alignment of UI elements. This layer rewards designs that use Figma's layout system idiomatically rather than relying on absolute positioning.

Penalty-Based Scoring

The score starts at 1.0 and each violation subtracts a penalty. Select a task to see the real analysis against its golden tree. Use the "what-if" toggles to simulate failures and see how the score degrades.
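The penalty mechanics can be sketched like this (the penalty names and values below are hypothetical; the benchmark's actual violation table is not specified here):

```python
# Hypothetical penalty table for the L2 layout checks; illustrative only.
PENALTIES = {
    "no_auto_layout": 0.3,
    "zero_padding": 0.1,
    "inconsistent_spacing": 0.1,
    "absolute_positioning": 0.2,
}

def layout_score(violations):
    """Start at 1.0 and subtract one penalty per violation, floored at 0."""
    score = 1.0
    for v in violations:
        score -= PENALTIES.get(v, 0.0)
    return max(0.0, score)
```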


Color Space & Tokens

L3: Design System Compliance (weight: 0.3) measures how well the generated design matches the task's design tokens. The score combines two sub-metrics: L3 = 0.6 × color_compliance + 0.4 × font_compliance.

How It Works

  1. Color compliance (60%): Extract all fill and stroke colors from the design tree. For each color: if its chroma ≤ 15 (neutral — white/black/gray), it auto-passes. Otherwise, compute CIE76 ΔE to every design token color; pass if min ΔE ≤ threshold (default 30).
  2. Font compliance (40%): Check every TEXT node's fontFamily against the specified font in design tokens (default: Inter).
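The color-compliance half can be sketched as below. To keep the sketch short it assumes colors are already converted to CIELAB tuples `(L*, a*, b*)`; the RGB→Lab conversion step is omitted:

```python
import math

def delta_e_cie76(lab1, lab2):
    # CIE76 delta-E is plain Euclidean distance in CIELAB space.
    return math.dist(lab1, lab2)

def color_compliance(design_labs, token_labs, threshold=30.0):
    """Fraction of design colors that pass. Neutral colors
    (chroma = sqrt(a*^2 + b*^2) <= 15) auto-pass as white/black/gray."""
    if not design_labs:
        return 1.0
    passed = 0
    for L, a, b in design_labs:
        if math.hypot(a, b) <= 15:  # neutral: auto-pass
            passed += 1
        elif min(delta_e_cie76((L, a, b), t) for t in token_labs) <= threshold:
            passed += 1
    return passed / len(design_labs)
```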

Schema Similarity

L4: Schema Similarity (weight: 0.3) goes beyond checking if elements exist — it measures how closely the generated tree's shape matches the golden reference. Two designs might have all the right elements but wildly different structures.

Three-Component Scoring

Both trees are flattened into multisets of (type, name) tuples. VECTOR nodes are excluded — golden references contain icon paths that models can't produce via the API. Names are normalized to lowercase first-word. The score combines three components:

Score = 0.5 × Jaccard + 0.25 × depth_sim + 0.25 × count_sim
  • Jaccard = Σmin / Σmax over all (type, name) counters
  • depth_sim = min(golden_depth, gen_depth) / max(...)
  • count_sim = min(golden_nodes, gen_nodes) / max(...)
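The three components above can be combined directly with `collections.Counter` (a minimal sketch; tree flattening and depth computation are assumed to have happened upstream):

```python
from collections import Counter

def schema_similarity(golden, generated, golden_depth, gen_depth):
    """golden/generated: multisets of (type, name) tuples, VECTOR excluded."""
    g, h = Counter(golden), Counter(generated)
    keys = set(g) | set(h)
    # Multiset Jaccard: sum of per-tuple mins over sum of per-tuple maxes.
    jaccard = (sum(min(g[k], h[k]) for k in keys) /
               sum(max(g[k], h[k]) for k in keys)) if keys else 1.0
    depth_sim = min(golden_depth, gen_depth) / max(golden_depth, gen_depth)
    count_sim = (min(len(golden), len(generated)) /
                 max(len(golden), len(generated))) if generated else 0.0
    return 0.5 * jaccard + 0.25 * depth_sim + 0.25 * count_sim
```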

Try removing or adding nodes in the generated tree to see how each component responds. Hover any node to highlight matches across both trees and the multiset table.


Screenshot Comparison

L5: Visual Fidelity (weight: 0.2) — requires Figma. Combines SSIM structural similarity with DCT-based perceptual hashing (pHash) for a two-component visual comparison between the generated and golden reference screenshots.

How It Works

  1. SSIM (Structural Similarity Index): Decomposes image similarity into luminance, contrast, and structure components. Computed block-wise (8×8) per RGB channel and averaged.
  2. pHash (Perceptual Hash): Resizes to 32×32, applies 2D DCT, thresholds the 8×8 low-frequency block by median to produce a 64-bit hash. Hamming distance measures perceptual difference.
  3. Final score: 0.6 × SSIM + 0.4 × max(0, 1 − hamming / 32)
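The final combination (step 3) can be sketched as follows; SSIM and the DCT hash computation themselves are omitted, and the `/ 32` normalization mirrors the formula above, so any Hamming distance of 32 bits or more contributes zero:

```python
def hamming(hash_a: int, hash_b: int) -> int:
    # Number of differing bits between two 64-bit perceptual hashes.
    return bin(hash_a ^ hash_b).count("1")

def visual_score(ssim: float, hash_a: int, hash_b: int) -> float:
    """0.6 * SSIM + 0.4 * hash similarity, per the L5 formula."""
    hash_sim = max(0.0, 1.0 - hamming(hash_a, hash_b) / 32)
    return 0.6 * ssim + 0.4 * hash_sim
```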

VLM Quality Judge

L6: VLM Judge (weight: 0.5) — requires Figma. A vision-language model (Claude) compares the generated screenshot against the golden reference, scoring on 6 criteria each rated 0–10. The VLM also assigns an independent “overall” score. Use the degradation slider to simulate a bad generation, then adjust criteria scores to play the role of the VLM judge.

How It Works

  1. Two images are sent as base64 to the VLM: the golden reference screenshot and the generated screenshot, along with the task brief for context.
  2. 6 criteria (0–10 each): Layout & Spacing, Typography, Color, Completeness, Polish, Match.
  3. Overall score: The VLM assigns an independent “overall” (0–10), not necessarily the mean of the 6 criteria.
  4. Final score: min(1.0, max(0.0, overall / 10))
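The final step reduces to parsing the judge's reply and clamping. This sketch assumes the judge returns a JSON object with an `overall` key; the benchmark's actual prompt and response format may differ:

```python
# Assumed response shape: {"layout": 8, ..., "overall": 7.5}
def vlm_final_score(judgment: dict) -> float:
    """Clamp the judge's independent 'overall' (0-10) into [0, 1]."""
    overall = float(judgment.get("overall", 0.0))
    return min(1.0, max(0.0, overall / 10))
```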
The 6 criteria and what each rewards:

  • Layout & Spacing: proper alignment, consistent spacing, visual hierarchy
  • Typography: readable fonts, appropriate sizes, proper weight usage
  • Color: harmonious palette, proper contrast, consistent usage
  • Completeness: all required UI elements present and functional-looking
  • Polish: overall refinement, attention to detail, professional quality
  • Match: how closely the generated design matches the reference
What's Next

Conclusion

What Makes FigmaBench Unique

  • Structured API tool use, not GUI clicks. The model produces Figma Plugin API calls — testing design tool competence at the schema level.
  • Multi-layer continuous rewards. 6 independent scoring dimensions enable fine-grained evaluation and reinforcement learning.
  • A real production tool. Designs are created in live Figma — real rendering, font loading, auto-layout computation.

Limitations

Requires Figma Desktop for L5/L6 scoring. Focuses on mobile UI (375px) with Inter font. 100 tasks is modest — scaling to 500+ planned.

Citation

@misc{solerno2026figmabench,
  title        = {FigmaBench: Benchmarking LLM Design Capability
                  via Figma Plugin API Tool Use},
  author       = {Özdemirden, A. Şemsettin},
  year         = {2026},
  organization = {solerno-ai}
}