about

about BenchCAD

Industrial CAD code generation requires more than recognizing the outer shape of a part: it requires understanding the part's 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would actually be designed and manufactured. Models that pass the eye too often fail the caliper — two programs may render to similar envelopes while differing substantially in editability, operation choice, and engineering detail.

BenchCAD is the first public CAD benchmark that combines four properties simultaneously:

  • Execution-verified at scale — 17,900 sandbox-executed CadQuery parts across 106 industrial families.
  • Standard-anchored — 49% of families (52/106) bound to real ISO / DIN / EN / ASME / IEC specification tables.
  • Operation-rich — 46 distinct CadQuery ops including helix, twistExtrude, polarArray, advanced sweeps and lofts.
  • Capability-decomposed — four matched tasks (Vision2Code · Vision QA · Code QA · Code Edit) that isolate visual recognition, parametric abstraction, and code synthesis.

data

three released datasets

Every family is hand-crafted by domain experts from industrial standards. Croissant 1.0 metadata · code MIT · data CC-BY-4.0.

BenchCAD 17,900

Verified CadQuery parts — code · STEP · 4 canonical views · parameters · operation traces.
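To make the idea of an operation trace concrete, here is a minimal sketch that lists the method calls in a CadQuery program by parsing its source with Python's standard `ast` module. It is an illustration only: `extract_ops` and `SAMPLE` are hypothetical names, the sample part is made up, and the released traces may well be produced differently (for example, by instrumenting execution). The source is parsed but never executed, so CadQuery does not need to be installed.

```python
import ast

# Hypothetical CadQuery program; it is only parsed here, never run.
SAMPLE = '''
import cadquery as cq
result = (
    cq.Workplane("XY")
    .rect(10, 10)
    .twistExtrude(20, 90)
    .faces(">Z")
    .workplane()
    .polarArray(3, 0, 360, 6)
    .hole(1)
)
'''

def extract_ops(source: str) -> list[str]:
    """Return the names of all method calls in `source`, in source order."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            func = node.func
            # The end position of the attribute name grows along a call chain,
            # so sorting on it recovers left-to-right order.
            found.append((func.end_lineno, func.end_col_offset, func.attr))
    return [name for _, _, name in sorted(found)]

ops = extract_ops(SAMPLE)
print(ops)
print(len(set(ops)), "distinct operations")
```

A static pass like this counts every attribute call, including selectors such as `faces`, so it over-approximates what a CAD engineer would call a modeling operation.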

BenchCAD-QA 2,400

Paired image / code numeric QA items along a four-level capability hierarchy.

BenchCAD-Edit 748

Verified before / after edit pairs across five edit types T1–T5.

BenchCAD family distribution — 106 part families across industrial sectors

106 industrial part families — fasteners, transmission, structural, fluid, panels, hardware, enclosures. 49% (52/106) anchored to real specification tables across 47 ISO / DIN / EN / ASME / IEC codes.

evaluation

capability hierarchy

The same questions are evaluated under Vision QA (renders) and Code QA (source); the matched pair isolates whether a failure stems from visual recognition or from reasoning over the queried attribute.
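The matched-pair logic can be sketched as a small decision rule. This is a reading of the paragraph above, not the benchmark's own analysis code: the function name `diagnose` and the category labels are assumptions, and the case where the render is answered correctly but the source is not is outside what the text describes, so it is labeled separately.

```python
def diagnose(vision_correct: bool, code_correct: bool) -> str:
    """Attribute a failure on one question asked under Vision QA and Code QA."""
    if vision_correct and code_correct:
        return "no failure"
    if code_correct:
        # Answerable from source but not from renders: perception is the gap.
        return "visual recognition"
    if vision_correct:
        # Not covered by the text above; kept as its own bucket (assumption).
        return "code reading"
    # Wrong under both inputs: the attribute itself is not being reasoned about.
    return "attribute reasoning"

for pair in [(True, True), (False, True), (True, False), (False, False)]:
    print(pair, "->", diagnose(*pair))
```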

BenchCAD-QA capability hierarchy: L1 Holistic Visual Recognition, L2 CAD Operation Understanding, L3 Industrial Parametric Abstraction, L4 Compositional Spatial / Code Reasoning, with paired Vision QA and Code QA examples per level.

L1 Holistic Visual Recognition · L2 CAD Operation Understanding · L3 Industrial Parametric Abstraction · L4 Compositional Spatial / Code Reasoning. Scoring is execution-grounded — voxel IoU for geometry, symmetric ratio accuracy for QA. No LLM judge.
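As a rough illustration of the two metrics named above, the sketch below computes voxel IoU over sets of occupied grid cells and one plausible form of a symmetric ratio score. Several things here are assumptions: the page does not define symmetric ratio accuracy, so `min(pred/truth, truth/pred)` is a guess at its shape (the real metric may apply a threshold on top), and the helper `box_voxels` plus the grid resolution are hypothetical.

```python
def box_voxels(nx: int, ny: int, nz: int) -> set[tuple[int, int, int]]:
    """Occupied cells of an axis-aligned box anchored at the origin (toy voxelizer)."""
    return {(i, j, k) for i in range(nx) for j in range(ny) for k in range(nz)}

def voxel_iou(a: set, b: set) -> float:
    """Intersection over union of two occupied-voxel sets."""
    union = a | b
    if not union:
        return 1.0  # two empty shapes are treated as identical
    return len(a & b) / len(union)

def symmetric_ratio(pred: float, truth: float) -> float:
    """Assumed form: 1.0 for an exact match, falling toward 0 as values diverge."""
    if pred <= 0 or truth <= 0:
        return 0.0
    return min(pred / truth, truth / pred)

predicted = box_voxels(2, 2, 2)   # 8 cells
reference = box_voxels(2, 2, 4)   # 16 cells, fully containing the prediction
print(voxel_iou(predicted, reference))   # 8 shared cells out of 16 in the union
print(symmetric_ratio(8.0, 10.0))        # same score as symmetric_ratio(10.0, 8.0)
```

Both scores are symmetric in their arguments, so over- and under-estimating by the same factor is penalized equally.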

people

team

The researchers behind BenchCAD.

Haozhe Zhang
Kaichen Liu
Miaomiao Chen
Lei Li
Shaojie Yang
Cheng Peng
Hanjie Chen
· · Anthropic

cite

BibTeX

@misc{benchcad2026,
  title        = {BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD},
  author       = {Zhang, Haozhe and Liu, Kaichen and Chen, Miaomiao and Li, Lei
                  and Yang, Shaojie and Peng, Cheng and Chen, Hanjie},
  year         = {2026},
  eprint       = {2605.10865},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV},
  url          = {https://arxiv.org/abs/2605.10865}
}