§ 1 The Problem
Claim construction is often the single most outcome-determinative step in patent litigation. Markman hearings decide what the patent actually covers, and Cybor made that determination reviewable de novo on appeal (Teva v. Sandoz later required clear-error deference to subsidiary fact findings). AI tools are now generating claim-construction arguments, yet there is no independent measure of whether their constructions match the ones a court would reach.
The risk is asymmetric: a tool that confidently produces a plausible-but-wrong construction can be more dangerous than no tool at all, because it crowds out the structured intrinsic-evidence analysis that Phillips v. AWH requires. AI-generated claim-construction memos are already being filed; there is no neutral way to know which tools are safer to rely on.
ClaimConstructionBench scores AI claim-construction tools against (a) the actual final constructions in the public record — PTAB Final Written Decisions and district-court Markman orders — and (b) a panel of claim-construction experts.
§ 2 Domains
Four claim-construction domains, mapped to the Phillips hierarchy of intrinsic-then-extrinsic evidence.
| Domain | What it measures | Weight |
|---|---|---|
| Term identification | Which claim terms actually need construction (the parties’ agreement / dispute matrix). | 15% |
| Intrinsic evidence weighting | Construction from the claims, specification (including embodiments and definitions), and prosecution history. The Phillips core. | 45% |
| Extrinsic evidence integration | Where intrinsic evidence is insufficient, integration of dictionaries, treatises, expert testimony, and ordinary meaning to a person of ordinary skill. | 25% |
| Doctrinal compliance | Phillips hierarchy respected; Nautilus definiteness considered; terms needing no construction identified without leaving a genuine scope dispute unresolved (O2 Micro); lexicographer / disavowal doctrines applied. | 15% |
§ 3 Difficulty Tiers
Five tiers calibrated to the claim-construction complexity courts encounter.
| Tier | Context | What it tests | Human accuracy |
|---|---|---|---|
| Tier 1 | Plain-meaning | Term has an unambiguous ordinary meaning and the parties have no genuine scope dispute, so no construction is required (O2 Micro). | ~95% |
| Tier 2 | Spec-defined | Term is explicitly defined in the specification (lexicographer); construction follows the definition. | ~88% |
| Tier 3 | Prosecution-narrowed | Term scope was narrowed during prosecution; construction must reflect prosecution disclaimer. | ~75% |
| Tier 4 | Extrinsic-required | Intrinsic evidence is ambiguous; construction depends on extrinsic evidence and POSITA framing. | ~62% |
| Tier 5 | Means-plus / functional | § 112(f) means-plus-function or other functional claiming with structural-equivalent analysis required. | ~50% |
Human-accuracy figures are calibration estimates; final figures publish with v1.0 once the Methodology Council seats academic reviewers.
§ 4 Four-Layer Scoring
ClaimConstructionBench combines four scoring layers. Composite score equals the weighted sum, except where a hard floor applies (§ 7).
Term identification
Did the tool identify the actually-disputed terms? Scored against the parties’ joint claim-construction statement (where public) or the court’s order recitation. Misses (terms the court construed but the tool ignored) and false positives (terms the tool flagged but the court did not construe) both penalized.
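Penalizing both misses and false positives is what a precision / recall / F1 comparison does, so one way to sketch this layer is shown below. The use of F1, the `normalize` helper, and the function name are assumptions for illustration; the methodology does not specify the exact metric or the term-matching rule.

```python
def normalize(term):
    """Lowercase and collapse whitespace so formatting differences are not scored as misses."""
    return " ".join(term.lower().split())


def term_identification_score(tool_terms, court_terms):
    """Compare the tool's flagged terms with the terms the court construed."""
    tool = {normalize(t) for t in tool_terms}
    court = {normalize(t) for t in court_terms}
    hits = tool & court
    if not tool and not court:
        # Nothing needed construction and nothing was flagged: treat as a perfect match.
        precision = recall = f1 = 1.0
    else:
        precision = len(hits) / len(tool) if tool else 0.0  # penalizes false positives
        recall = len(hits) / len(court) if court else 0.0   # penalizes misses
        f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "misses": sorted(court - tool),
        "false_positives": sorted(tool - court),
    }


# Hypothetical example: the tool finds two of three construed terms and flags one extra.
result = term_identification_score(
    ["Controller", "memory  module", "substantially flat"],
    ["controller", "memory module", "coupled to"],
)
print(result["f1"], result["misses"], result["false_positives"])
```

Exact string matching after normalization is deliberately strict; a production harness would likely need a rule for terms the court construed as part of a longer phrase.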
Track A — Final court construction
Ground-truth signal. The AI’s proposed construction is compared against the PTAB’s Final Written Decision construction or the district court’s Markman order. Scoring uses semantic equivalence (paraphrase-tolerant) plus a structural test: does the AI’s construction read on the same accused products / prior art that the court’s construction reads on?
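The structural test can be pictured as an agreement rate over a fixed set of accused products and prior-art references. The sketch below assumes someone has already decided, for each item, whether the court's construction and the tool's construction read on it; the input format and the function name `reading_on_agreement` are hypothetical, and the semantic-equivalence half of Track A is not modeled.

```python
from typing import Mapping


def reading_on_agreement(court_reads_on: Mapping[str, bool],
                         tool_reads_on: Mapping[str, bool]) -> float:
    """Fraction of items on which the tool's construction reaches the court's result."""
    if not court_reads_on:
        raise ValueError("at least one accused product or prior-art item is required")
    missing = [item for item in court_reads_on if item not in tool_reads_on]
    if missing:
        raise ValueError(f"tool construction not evaluated for: {missing}")
    agreements = sum(
        court_reads_on[item] == tool_reads_on[item] for item in court_reads_on
    )
    return agreements / len(court_reads_on)


# Hypothetical items: the tool's broader construction sweeps in one prior-art reference.
court = {"accused_A": True, "prior_art_B": False, "accused_C": True, "prior_art_D": False}
tool = {"accused_A": True, "prior_art_B": True, "accused_C": True, "prior_art_D": False}
print(reading_on_agreement(court, tool))
```

A construction can paraphrase the court's wording closely and still disagree here, which is the reason for pairing the two tests.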
Track B — Expert panel
Expert signal. A panel combining academics (Stanford, Berkeley, NYU candidates) and practicing patent litigators scores the AI’s reasoning against the rubrics in § 5. Inter-rater reliability publishes per release; outlier scores reviewed.
Citation integrity (Therasense floor)
Cited intrinsic evidence (specification passages, prosecution-history entries, parent-application disclosures) and cited extrinsic evidence (dictionaries, treatises, prior cases) must verify against the authoritative source. Verification failure triggers a hard floor: composite equals 0.0. See § 7.
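The interaction between the weighted sum and the hard floor can be sketched as follows. The text does not state which weights the composite uses, so this sketch assumes the § 2 domain weights and scores on a 0.0 to 1.0 scale; the dictionary keys and the function name `composite_score` are illustrative, not part of the published harness.

```python
# Assumption: the composite reuses the domain weights from the § 2 table.
DOMAIN_WEIGHTS = {
    "term_identification": 0.15,
    "intrinsic_evidence": 0.45,
    "extrinsic_evidence": 0.25,
    "doctrinal_compliance": 0.15,
}


def composite_score(domain_scores, citations_verified):
    """Weighted sum of domain scores, overridden by the citation-integrity floor."""
    if not citations_verified:
        # Hard floor: a failed citation check is not a deduction, it zeroes the run.
        return 0.0
    if set(domain_scores) != set(DOMAIN_WEIGHTS):
        raise ValueError("domain_scores must cover exactly the four domains")
    for name, score in domain_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score must be between 0.0 and 1.0")
    return sum(DOMAIN_WEIGHTS[name] * domain_scores[name] for name in DOMAIN_WEIGHTS)


scores = {
    "term_identification": 0.9,
    "intrinsic_evidence": 0.8,
    "extrinsic_evidence": 0.6,
    "doctrinal_compliance": 1.0,
}
print(composite_score(scores, citations_verified=True))   # weighted sum
print(composite_score(scores, citations_verified=False))  # floored to 0.0
```

Checking the floor before anything else keeps a strong domain profile from masking a fabricated citation.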
§ 5 Evaluation Rubrics
Doctrinal Rubric
- Phillips hierarchy — intrinsic before extrinsic, claims before specification, specification before prosecution history. (2.0× weight)
- Lexicographer / disavowal — explicit definitions and disclaimers honored when present. (1.5×)
- O2 Micro — "plain meaning" not used to dodge a real dispute. (1.5×)
- Nautilus definiteness — awareness when a term is indefinite under reasonable certainty. (1.0×)
- 112(f) recognition — means-plus-function triggers structural-equivalent analysis. (1.0×)
Reasoning Quality Rubric
- Intrinsic evidence integration — spec embodiments, prosecution disclaimers, parent applications. (2.0×)
- POSITA framing — person of ordinary skill in the relevant art identified and applied. (1.5×)
- Counter-construction — the opposing party’s likely construction addressed. (1.5×)
- Reading-on test — construction’s consequences for accused products / asserted prior art articulated. (1.0×)
- Professional quality — clarity, citation form, persuasiveness. (0.5×)
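The multipliers in both rubrics suggest a weighted average over criterion ratings. A minimal sketch follows, assuming each panelist rates each criterion on a 0.0 to 1.0 scale and that the result is normalized by the total weight; the rating scale, the normalization, and the criterion keys are assumptions, since the full rubric specifications are not reproduced here.

```python
DOCTRINAL_WEIGHTS = {
    "phillips_hierarchy": 2.0,
    "lexicographer_disavowal": 1.5,
    "o2_micro": 1.5,
    "nautilus_definiteness": 1.0,
    "section_112f": 1.0,
}

REASONING_WEIGHTS = {
    "intrinsic_evidence_integration": 2.0,
    "posita_framing": 1.5,
    "counter_construction": 1.5,
    "reading_on_test": 1.0,
    "professional_quality": 0.5,
}


def rubric_score(ratings, weights):
    """Weighted mean of criterion ratings, normalized so a perfect memo scores 1.0."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[c] * ratings[c] for c in weights) / total_weight


# Hypothetical panelist ratings for one memo on the doctrinal rubric.
ratings = {
    "phillips_hierarchy": 1.0,
    "lexicographer_disavowal": 0.5,
    "o2_micro": 1.0,
    "nautilus_definiteness": 0.0,
    "section_112f": 1.0,
}
print(rubric_score(ratings, DOCTRINAL_WEIGHTS))
```

Under this normalization, missing the Phillips hierarchy costs twice as much as missing a definiteness issue, which matches the stated multipliers.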
§ 6 Reproducibility (Glass Box)
Per Charter § 8.3, every published ClaimConstructionBench run is reproducible by third parties within a 5% scoring delta. Same five pillars as the other accredited benchmarks:
- Test set publication. Cases are public-record PTAB Final Written Decisions and district-court Markman orders; the curated set is open. The held-out set publishes one release cycle after initial scoring.
- Rubric transparency. Both rubrics are public, versioned, annotated with calibration examples drawn from canonical Federal Circuit decisions (Phillips, Nautilus, Cybor, O2 Micro).
- Output availability. Vendor outputs (with consent) and panel scores publish under Apache-2.0. Both successful and failed constructions published.
- Failure-mode analysis. Each scored run includes a failure-mode taxonomy (fabricated case citation, missed prosecution disclaimer, wrong POSITA framing, intrinsic-evidence hallucination, etc.).
- Continuous reporting. Public leaderboard updates monthly. Methodology revisions go through Methodology Council review per Charter § 7.2 with academic-reviewer concurrence.
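A third party checking the 5% reproducibility bound might compare published and re-run composites along these lines. The sketch assumes the delta is an absolute difference of 0.05 on a 0.0 to 1.0 composite scale; the Charter may define it differently (for example as a relative difference), and the function name and input shape are hypothetical.

```python
def reproducibility_report(published, reproduced, max_delta=0.05):
    """Map each tool to True/False (within the delta or not), or None if it was not re-run."""
    report = {}
    for tool, published_score in published.items():
        if tool not in reproduced:
            report[tool] = None
            continue
        delta = abs(published_score - reproduced[tool])
        # Small epsilon so a delta of exactly 0.05 is not rejected by float rounding.
        report[tool] = delta <= max_delta + 1e-9
    return report


# Hypothetical published and independently reproduced composite scores.
published = {"tool_a": 0.80, "tool_b": 0.62, "tool_c": 0.50}
reproduced = {"tool_a": 0.75, "tool_b": 0.70}
print(reproducibility_report(published, reproduced))
```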
§ 7 Hallucination Floor (Therasense Standard)
The floor is named for Therasense, Inc. v. Becton, Dickinson & Co., 649 F.3d 1276 (Fed. Cir. 2011) (en banc), which set the but-for materiality standard for inequitable conduct. For claim-construction contexts, the Therasense floor extends beyond cited prior art to cited intrinsic and extrinsic evidence: a construction supported by a fabricated specification passage, a fabricated prosecution-history entry, or a fabricated case citation is materially indistinguishable from a fabricated prior-art citation in a prosecution context.
Any cited intrinsic evidence (specification passage, prosecution-history entry, parent-application disclosure) must verify byte-for-byte against the authoritative source. Any cited case must verify against an authoritative source (CourtListener, Westlaw, Lexis, or official PTAB / court PDFs). Any cited dictionary or treatise definition must verify against the named edition. Verification failure triggers a hard floor: composite equals 0.0.
This is not a deduction. A construction memo built on one fabricated specification passage does not earn 90%; it earns 0%. The integrity-floor mechanic is doctrinally analogous to Therasense’s “but-for materiality” threshold: catastrophic misrepresentation is not averaged into a passing grade.
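A byte-for-byte check on quoted intrinsic evidence could look like the sketch below. It assumes each citation is a pair of source identifier and quoted text, and that the authoritative text of each source is already available as a string; retrieving that text, and verifying case citations or dictionary editions, is outside this sketch. The function names are hypothetical.

```python
def verify_quote(quoted, source_text):
    """True only if the quoted passage appears byte-for-byte in the source text."""
    if not quoted:
        return False  # an empty quotation verifies nothing
    return quoted.encode("utf-8") in source_text.encode("utf-8")


def all_citations_verify(citations, sources):
    """citations: list of (source_id, quoted_text); sources: dict of source_id -> full text."""
    for source_id, quoted in citations:
        source_text = sources.get(source_id)
        if source_text is None or not verify_quote(quoted, source_text):
            return False  # one failure is enough to trigger the floor
    return True


# Hypothetical specification text and citations.
sources = {"spec": 'The term "controller" means a programmable logic device.'}
good = [("spec", "means a programmable logic device")]
bad = [("spec", "means a programmable  logic device")]  # extra space: not byte-identical
print(all_citations_verify(good, sources), all_citations_verify(bad, sources))
```

No normalization is applied, so a straight quote substituted for a curly quote or an extra space fails the check; that strictness follows from the byte-for-byte wording above, and a real harness would have to decide how to treat line breaks and hyphenation in the source PDFs.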
§ 8 Data Repository
The ClaimConstructionBench repository will publish at github.com/openipcouncil/claimconstructionbench when a maintainer is seated. Planned contents:
- METHODOLOGY.md: this document, versioned.
- cases/ptab/: PTAB Final Written Decisions with their construction tables extracted.
- cases/district/: district-court Markman orders with their construction tables extracted.
- rubrics/doctrinal.json, rubrics/reasoning_quality.json: full rubric specifications.
- panel_scores/: per-release panel ratings (anonymized panelist ID, rubric breakdown, free-text rationale).
- vendor_outputs/: vendor-submitted outputs with their consent. Apache-2.0.
- harness/: the evaluation harness (semantic-equivalence scorer, reading-on test runner, citation-verification client, panel-scoring intake).
- leaderboard.md: auto-generated rolling results.
§ 9 Governance
ClaimConstructionBench will be owned by the academic-partnership maintainer named at accreditation. Per Charter § 2 principle 2, the Open IP Council does not own this benchmark; it accredits it. Until a maintainer is seated, the OIPC Methodology Council holds the methodology in trust as a draft.
Maintainer seating is governed by Open Question 4 (third Founding Member is a candidate path) and by the Methodology Council’s general accreditation review per Charter § 6.
No single AI vendor, law firm, or platform may hold majority influence over ClaimConstructionBench’s test set, rubrics, or panel composition. The maintainer publishes its conflict-of-interest disclosures per Charter § 8.2.
§ 10 Get Involved
- Academic institutions: if your institution wants to maintain ClaimConstructionBench, write to contact@openipcouncil.org with a proposed maintainership team and a v1.0 milestone plan. Stanford CodeX, Berkeley CTSP, NYU IP, and the Engelberg Center are first-look candidates.
- Patent litigators: apply to the Track B panel. 5+ years of Markman practice required; PTAB experience preferred.
- Vendors: submit your tool for evaluation when the maintainer-seated harness goes live (target 2027).
- Public comment: methodology objections via the OIPC public-comment process. Subject line should include CCB-METHODOLOGY-v0.1.