Pattern Language Specification v1.0
Status: Design Document
Date: 2026-06-25
Scope: Regex-based deterministic matching for legal citation references in CountryProfile YAML files
1. Purpose
This specification defines the patterns field for CountryProfile YAML files. Each level in a CountryProfile's levels array MAY include a patterns list that defines regex patterns for deterministic (non-LLM) extraction of citation references in legal text.
The pattern language serves two purposes:
- Deterministic extraction — fast, reproducible identification of citation references without LLM calls
- Canonical normalization — mapping diverse textual forms to a single canonical identifier
Patterns complement (not replace) the few_shots field, which provides LLM guidance. Patterns are for machines; few-shots are for models.
⚠️ Patterns are for extraction, not validation.
Patterns are used to find clause structures in legal text. Validation is handled by the LLM Judge (based on validation_rules). Attempting validation with patterns fails on 32% of few-shots (text-only references, number-only references, format mismatches). See docs/pattern-unmatchable-cases.md for details.
2. Field Placement
The patterns field is an optional list that appears inside each level definition in the levels array:
levels:
  - key: article
    label: "Article"
    numbering: "Arabic numeral + Article"
    description: "Basic unit of legislation"
    # ... existing fields unchanged ...
    patterns:  # ← NEW FIELD
      - regex: '제(?P<number>\d+)조'
        canonical: "article:{number}"
        captures:
          number: "Arabic numeral identifying the article"
        source: "Korean Law Information Center (law.go.kr)"
        description: "Standard article reference"
The patterns key MUST NOT conflict with existing level fields (key, label, numbering, description, is_unnumbered, is_unnumbered_first, is_unnumbered_when_single, inserted_pattern, above_article, citing_format).
3. Pattern Entry Schema
Each entry in the patterns list MUST have exactly five keys:
| Key | Type | Required | Description |
|---|---|---|---|
| regex | string | YES | Python-compatible regex (RE2-style). Uses named groups (?P<name>...) for captures. |
| canonical | string | YES | Normalized form template. References capture groups as {name}. |
| captures | map[string→string] | YES | Each named group → human description of what it captures. |
| source | string | YES | Authoritative origin (official DB, legal guide, citation manual). |
| description | string | YES | Human-readable explanation of when this pattern matches. |
3.1 regex
- SHOULD be a valid Python re pattern. Prefer RE2-compatible constructs (no lookbehind, no lookahead). Backreferences are permitted when needed (e.g., \2 for same-letter pair enforcement) but MUST have a companion RE2-compatible fallback pattern using explicit enumeration.
- MUST use named groups (?P<name>...) for every captured element.
- SHOULD use non-capturing groups (?:...) for structural alternation.
- SHOULD anchor to natural boundaries (whitespace, punctuation, line start/end) to avoid false positives.
- MAY use character classes for Unicode script ranges where needed.
Example:
regex: '제(?P<base>\d+)조의(?P<inserted>\d+)'
3.2 canonical
- Uses {name} placeholders that reference named groups from regex.
- Follows the canonical form grammar (Section 4).
- MAY contain literal separators (:, -, /, .).
- Captures that are descriptive metadata (titles, names, years) MUST NOT appear in canonical form.
Example:
canonical: "article:{base}-{inserted}"
# For regex match of 제34조의2 → article:34-2
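As a minimal sketch of how an engine might fill a canonical template from a regex match, the following assumes Python's re module and str.format. The function name render_canonical is hypothetical and not part of this specification.

```python
import re

def render_canonical(regex: str, canonical: str, text: str) -> list[str]:
    """Return one canonical identifier per match of `regex` in `text`."""
    compiled = re.compile(regex)
    # groupdict() maps each named group to its captured text, which lines up
    # with the {name} placeholders used in the canonical template.
    return [canonical.format(**m.groupdict()) for m in compiled.finditer(text)]

ids = render_canonical(
    r"제(?P<base>\d+)조의(?P<inserted>\d+)",
    "article:{base}-{inserted}",
    "제34조의2 및 제6조의2에 따라",
)
print(ids)  # ['article:34-2', 'article:6-2']
```

Because the template only sees named groups, any descriptive capture that must stay out of the canonical form is simply left out of the template.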
3.3 captures
A flat map where each key matches a named group in regex, and each value is a human-readable description.
captures:
  base: "Parent article number before insertion"
  inserted: "Inserted sub-article number"
3.4 source
MUST cite the specific authoritative source. Format:
"[Source Name] ([URL or identifier]) — [what it covers]"
Grading follows CountryProfile source grades: official_1st, official_2nd, academic_institutional, unofficial_secondary.
3.5 description
Human-readable. SHOULD explain:
- What textual form this pattern matches
- When it appears (in legislative text, in citations, in court decisions)
- Any disambiguation notes
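The schema rules in this section can be checked mechanically. The sketch below is one possible validator, assuming a pattern entry has already been loaded into a Python dict; validate_entry and REQUIRED_KEYS are hypothetical names, and the checks cover only the rules stated above (five keys, compilable regex, named groups only, captures and placeholders aligned with the groups).

```python
import re
from string import Formatter

REQUIRED_KEYS = {"regex", "canonical", "captures", "source", "description"}

def validate_entry(entry: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the entry passes."""
    if set(entry) != REQUIRED_KEYS:
        missing = sorted(REQUIRED_KEYS - set(entry))
        extra = sorted(set(entry) - REQUIRED_KEYS)
        return [f"keys mismatch: missing={missing}, extra={extra}"]
    try:
        compiled = re.compile(entry["regex"])
    except re.error as exc:
        return [f"regex does not compile: {exc}"]

    problems = []
    groups = set(compiled.groupindex)  # names of the named groups
    if compiled.groups != len(groups):
        problems.append("regex contains unnamed capturing groups")
    if groups != set(entry["captures"]):
        problems.append("captures keys do not match the named groups")
    # Formatter().parse yields (literal, field_name, spec, conversion) tuples.
    placeholders = {name for _, name, _, _ in Formatter().parse(entry["canonical"]) if name}
    if not placeholders <= groups:
        problems.append(f"canonical references unknown groups: {sorted(placeholders - groups)}")
    return problems
```

A check like this could run when a CountryProfile is loaded, so that malformed entries are rejected before any text is processed.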
4. Canonical Form Grammar
The canonical form is a hierarchical path using / as separator:
level:identifier
level:identifier/level:identifier
level:identifier/level:identifier/level:identifier
4.1 Identifier Formats
| Format | Use Case | Example |
|---|---|---|
| N | Simple numeric | article:5, section:405 |
| B-I | Inserted article (base-inserted) | article:34-2, paragraph:123a |
| L | Single letter | subsection:a, subparagraph:A |
| LL | Double letter (same-letter pair) | item:aa, subitem:AA |
| R | Roman numeral | clause:ii, subclause:III |
| C | CJK/Hangul character | subitem:가 |
| N.N | Part.section (CFR) | section:303.1 |
| single | Sole unnumbered paragraph (no other paragraphs in the article) | paragraph:single |
4.2 Reserved Characters
| Char | Meaning |
|---|---|
| : | Level-identifier separator |
| / | Level-level separator (hierarchy) |
| - | Inserted article separator (base-inserted) |
| . | Part.section separator (CFR) or decimal |
4.3 Examples
| Citation Text | Canonical Form |
|---|---|
| 제5조 | article:5 |
| 제34조의2 | article:34-2 |
| ② | paragraph:2 |
| § 405(c)(2) | section:405/subsection:c/paragraph:2 |
| 第15条の2第1項第3号 | article:15-2/paragraph:1/item:3 |
| Art. L. 1234-1 | article:L1234-1 |
| § 823 Abs. 1 S. 1 | paragraph:823/absatz:1/satz:1 |
| المادة 5 فقرة (2) بند (أ) | article:5/paragraph:2/item:أ |
| มาตรา 420 วรรคสอง (1) | section:420/paragraph:2/subsection:1 |
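The grammar above is simple enough to parse with plain string operations. The following sketch splits a canonical form into (level, identifier) pairs using the reserved / and : characters; parse_canonical is a hypothetical helper name.

```python
def parse_canonical(canonical: str) -> list[tuple[str, str]]:
    """Split 'level:id/level:id' into [(level, id), ...]."""
    pairs = []
    for segment in canonical.split("/"):
        # partition splits on the first ':' only, so identifiers such as
        # '34-2' or '303.1' pass through untouched.
        level, sep, identifier = segment.partition(":")
        if not sep or not level or not identifier:
            raise ValueError(f"malformed segment: {segment!r}")
        pairs.append((level, identifier))
    return pairs

print(parse_canonical("article:15-2/paragraph:1/item:3"))
# [('article', '15-2'), ('paragraph', '1'), ('item', '3')]
```

A segment without a level prefix raises ValueError, which is why / should appear in a canonical form only as the hierarchy separator.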
5. Multi-Script Support
The pattern language MUST support citation references in the following scripts:
5.1 Script-Specific Considerations
| Script | Countries | Numeral System | Key Pattern Features |
|---|---|---|---|
| Latin | US, GB, DE, FR, ES, IT, BR, PT, NL, PL, SE | Arabic (0-9) | § symbol, Roman numerals, letter subdivisions |
| CJK (Chinese) | CN | Chinese (一二三) | 第+N+条/款/项/目, fullwidth parentheses （） |
| CJK (Japanese) | JP | Kanji (一二三) + Arabic | 第+N+条, のN inserted, 第+漢字+号 |
| CJK (Korean) | KR | Arabic + circled ① + Hangul | 제+N+조, ①②③, 가나다 |
| Cyrillic | RU, UA | Arabic + Cyrillic letters | статья/часть/пункт, а/б/в subdivisions |
| Arabic | SA, EG, MA | Eastern Arabic (٠-٩) + Western Arabic | مادة/فقرة/بند, ordinal words |
| Thai | TH | Thai (๐-๙) + Arabic | มาตรา/วรรค/อนุมาตรา, Thai ordinal words |
| Devanagari | IN | Arabic + Devanagari | धारा/अनुच्छेद/खंड |
5.2 Unicode in Regex
Patterns MAY use Unicode character classes for script-specific matching:
# Thai numerals
regex: 'มาตรา\s*(?P<number>[๐-๙]+|[0-9]+)'
# Chinese numerals (formal)
regex: '第(?P<number>[一二三四五六七八九十百千零]+)条'
# Eastern Arabic numerals
regex: 'مادة\s*(?P<number>[٠-٩]+|[0-9]+)'
# Circled numerals (Korean/Japanese)
regex: '(?P<number>[①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳])'
5.3 Numeral Normalization
The canonical form ALWAYS uses Western Arabic numerals (0-9). When a pattern matches any other numeral system, the captured value MUST be converted to Western Arabic numerals before it is substituted into the canonical template:
# For Chinese numerals, a conversion function is needed
regex: '第(?P<number>[一二三四五六七八九十百千零]+)条'
canonical: "article:{number}" # {number} must be converted from Chinese numerals to Arabic, e.g. 十五 → 15
The conversion is the responsibility of the pattern matching engine, not the regex itself. The regex captures the raw text; the engine normalizes before applying the canonical template.
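A minimal sketch of the engine-side conversion follows. It assumes Python's built-in int() (which accepts any Unicode decimal digits, including Thai and Eastern Arabic) and unicodedata.numeric for circled numerals; the Chinese numeral routine handles only the characters listed in the regex above. The names normalize_number and cjk_to_int are hypothetical.

```python
import unicodedata

CJK_DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
              "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
CJK_UNITS = {"十": 10, "百": 100, "千": 1000}

def cjk_to_int(text: str) -> int:
    """Convert numerals such as 十五 or 一百二十三 (values below 10000)."""
    total, current = 0, 0
    for ch in text:
        if ch in CJK_DIGITS:
            current = CJK_DIGITS[ch]
        elif ch in CJK_UNITS:
            # A bare unit (十 at the start of 十五) counts as one of that unit.
            total += (current or 1) * CJK_UNITS[ch]
            current = 0
        else:
            raise ValueError(f"unsupported numeral character: {ch!r}")
    return total + current

def normalize_number(raw: str) -> str:
    """Return the captured number as Western Arabic digits."""
    if raw.isdecimal():
        # Covers 0-9, Thai ๐-๙, Eastern Arabic ٠-٩ and other decimal digits.
        return str(int(raw))
    if len(raw) == 1 and unicodedata.category(raw) == "No":
        # Circled numerals ①..⑳ carry a numeric value in the Unicode database.
        return str(int(unicodedata.numeric(raw)))
    return str(cjk_to_int(raw))

print(normalize_number("๔๒๐"), normalize_number("⑳"), normalize_number("十五"))
# 420 20 15
```

Running the captured group through a function like this before template substitution keeps the regexes free of conversion logic.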
6. Inserted Articles
Inserted articles (가지번호, 枝番号, -bis, /N, .N, Na) are articles inserted between existing ones during legislative amendments. They have a base number and an insertion suffix.
6.1 Inserted Article Patterns by Country
| Country | Pattern | Example | Canonical |
|---|---|---|---|
| KR | 제N조의M | 제34조의2 | article:34-2 |
| JP | 第N条のM | 第15条の2 | article:15-2 |
| JP (deep) | 第N条のMのM | 第15条の2の2 | article:15-2-2 |
| DE | § Na, § Nb | § 123a | paragraph:123a |
| FR | Article N bis/ter/quater | Article 9 bis | article:9bis |
| ES | Artículo N bis | Artículo 5 bis | article:5bis |
| TH | มาตรา N/M | มาตรา 193/1 | section:193-1 |
| RU | статья N.N | статья 123.1 | article:123.1 |
| IT | Art. N-bis | Art. 5-bis | article:5bis |
6.2 Regex Pattern
Inserted article patterns MUST capture both base and suffix:
# Korean
regex: '제(?P<base>\d+)조의(?P<inserted>\d+)'
canonical: "article:{base}-{inserted}"
# Japanese (deep insertion)
regex: '第(?P<base>\d+)条の(?P<first>\d+)の(?P<second>\d+)'
canonical: "article:{base}-{first}-{second}"
# German
regex: '§\s*(?P<number>\d+)(?P<letter>[a-z])'
canonical: "paragraph:{number}{letter}"
# French
regex: 'Article\s+(?P<number>\d+)\s+(?P<suffix>bis|ter|quater|quinquies|sexies|septies|octies|nonies|decies)'
canonical: "article:{number}{suffix}"
# Thai
regex: 'มาตรา\s*(?P<base>\d+)/(?P<inserted>\d+)'
canonical: "section:{base}-{inserted}"
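The patterns in this section can be exercised against the examples from the table in 6.1. The sketch below is a table-driven check for four of the countries, assuming Python's re module; INSERTED_PATTERNS and canonicalize_inserted are hypothetical names, and the French suffix list is shortened for brevity.

```python
import re
from typing import Optional

# (country, compiled regex, canonical template), copied from the entries above.
INSERTED_PATTERNS = [
    ("KR", re.compile(r"제(?P<base>\d+)조의(?P<inserted>\d+)"),
     "article:{base}-{inserted}"),
    ("JP", re.compile(r"第(?P<base>\d+)条の(?P<first>\d+)の(?P<second>\d+)"),
     "article:{base}-{first}-{second}"),
    ("DE", re.compile(r"§\s*(?P<number>\d+)(?P<letter>[a-z])"),
     "paragraph:{number}{letter}"),
    ("FR", re.compile(r"Article\s+(?P<number>\d+)\s+(?P<suffix>bis|ter|quater)"),
     "article:{number}{suffix}"),
]

def canonicalize_inserted(country: str, text: str) -> Optional[str]:
    """Return the canonical form of the first inserted-article match, if any."""
    for code, compiled, template in INSERTED_PATTERNS:
        if code != country:
            continue
        match = compiled.search(text)
        if match:
            return template.format(**match.groupdict())
    return None

print(canonicalize_inserted("JP", "第15条の2の2"))  # article:15-2-2
print(canonicalize_inserted("DE", "§ 123a"))        # paragraph:123a
```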
7. Unnumbered First Paragraphs
Many legal systems omit the number for the first paragraph (항, Absatz, alinéa, 款, فقرة, วรรค). The pattern language handles this through positional detection, a fixed canonical representation, and per-country flags:
7.1 Detection via Context
The first paragraph of an article is identified by position — it appears immediately after the article header, before any numbered paragraph marker. No regex pattern matches it directly; instead, the matching engine infers its existence.
7.2 Canonical Representation
| Scenario | Canonical Form |
|---|---|
| First paragraph (unnumbered, with subsequent numbered paragraphs) | paragraph:1 |
| Single paragraph (unnumbered, no other paragraphs) | paragraph:single |
| Numbered paragraph | paragraph:N |
7.3 Country Flags
The CountryProfile level definition uses existing flags:
- key: paragraph
  is_unnumbered_first: true        # First paragraph has no number
  is_unnumbered_when_single: true  # Single paragraph has no number
Patterns do NOT need to match unnumbered paragraphs directly. The matching engine uses these flags to assign paragraph:1 or paragraph:single when no numbered paragraph marker is found.
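One way the engine could apply these flags is sketched below. It assumes the engine has already collected the numbered paragraph markers found inside one article and that the article has body text; paragraph_ids is a hypothetical function name, and the behavior when a flag is false is an assumption.

```python
def paragraph_ids(numbered: list[int], *, is_unnumbered_first: bool,
                  is_unnumbered_when_single: bool) -> list[str]:
    """Canonical paragraph ids for one article, given its numbered markers."""
    if not numbered:
        # No marker at all: the body is a single unnumbered paragraph.
        return ["paragraph:single"] if is_unnumbered_when_single else []
    ids = [f"paragraph:{n}" for n in numbered]
    if is_unnumbered_first and 1 not in numbered:
        # Text before the first marker is the implicit first paragraph.
        ids.insert(0, "paragraph:1")
    return ids

flags = dict(is_unnumbered_first=True, is_unnumbered_when_single=True)
print(paragraph_ids([2, 3], **flags))  # ['paragraph:1', 'paragraph:2', 'paragraph:3']
print(paragraph_ids([], **flags))      # ['paragraph:single']
```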
8. Order-Independent Matching
Legal text may present citations in different orders:
8.1 Article-First (Standard)
Most legal systems present the article first, then subdivisions:
제5조 제2항 제3호 (KR — article/paragraph/item)
§ 405(c)(2)(C)(ii) (US — section/subsection/paragraph/subparagraph/clause)
Article 1240, alinéa 2 (FR — article/alinea)
8.2 Paragraph-First (Inline)
Some systems allow subdivisions to appear without the article when the article is already established:
② 3. 가. (KR — paragraph/item/subitem, article implied)
(2)(C)(ii) (US — paragraph/subparagraph/clause, section implied)
8.3 Matching Strategy
The pattern matching engine MUST:
- Try combined patterns first (multi-level in one regex)
- Fall back to individual level patterns
- Build the canonical form by composing matched levels
- Handle implicit article from context (paragraph-first text assumes the current article)
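The composition and implicit-article steps can be sketched as follows. The function assumes that individual level matches arrive as (level, identifier) pairs in document order and that the caller supplies the current article from structural context; compose_canonical, context, and top_level are hypothetical names.

```python
from typing import Optional

def compose_canonical(matches: list[tuple[str, str]],
                      context: Optional[str] = None,
                      top_level: str = "article") -> str:
    """Join matched levels into one canonical path, prefixing the implied article."""
    parts = [f"{level}:{identifier}" for level, identifier in matches]
    # Paragraph-first text omits the article, so borrow it from the context.
    if matches and matches[0][0] != top_level and context:
        parts.insert(0, context)
    return "/".join(parts)

# Article-first: 제5조 제2항 제3호
print(compose_canonical([("article", "5"), ("paragraph", "2"), ("item", "3")]))
# article:5/paragraph:2/item:3

# Paragraph-first: ② 3. inside Article 34
print(compose_canonical([("paragraph", "2"), ("item", "3")], context="article:34"))
# article:34/paragraph:2/item:3
```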
9. Pattern Matching Engine Requirements
9.1 Processing Order
For a given text chunk:
- Apply combined patterns: multi-level regex that captures several levels at once
- Apply level patterns: individual patterns for each level, from the broadest level (e.g., article) down to the narrowest
- Compose canonical: assemble multi-level canonical from individual matches
- Normalize: convert non-Arabic numerals to Arabic in canonical form
- Deduplicate: if a combined pattern and individual patterns match the same text span, prefer the combined pattern
9.2 Overlap Resolution
When multiple patterns match overlapping text:
- Longest match wins
- Combined patterns take priority over individual patterns
- If two patterns at the same level match, the one with more specific context anchors wins
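The first two rules can be sketched as a greedy selection over candidate matches. The Hit record and resolve_overlaps function are hypothetical; the third rule (more specific context anchors) is not modeled here because the specification does not define how specificity is measured.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    start: int           # match start offset in the chunk
    end: int             # match end offset (exclusive)
    canonical: str
    combined: bool = False

def resolve_overlaps(hits: list[Hit]) -> list[Hit]:
    """Keep non-overlapping hits, preferring longer spans, then combined patterns."""
    ranked = sorted(hits, key=lambda h: (-(h.end - h.start), not h.combined, h.start))
    kept: list[Hit] = []
    for hit in ranked:
        # Accept a hit only if it is disjoint from everything already kept.
        if all(hit.end <= k.start or hit.start >= k.end for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)

hits = [
    Hit(0, 4, "article:34"),      # 제34조
    Hit(0, 6, "article:34-2"),    # 제34조의2 (longer, so it wins)
    Hit(10, 13, "article:5"),     # disjoint, so it is kept
]
print([h.canonical for h in resolve_overlaps(hits)])
# ['article:34-2', 'article:5']
```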
9.3 Context Window
Patterns are applied to a text chunk (typically a paragraph or article). The engine MAY use the chunk's structural context (e.g., "this chunk is within Article 34") to resolve implicit references.
9.4 Performance
- Patterns MUST be compiled once and reused
- Pattern lists SHOULD be ordered by specificity (most specific first)
- Combined patterns SHOULD come before individual patterns in the list
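A minimal sketch of the compile-once requirement: patterns are compiled when the profile is loaded and the compiled objects are reused for every chunk. PatternSet and first_match are hypothetical names; list order is preserved, which is why the most specific pattern should come first.

```python
import re
from typing import Optional

class PatternSet:
    """Compiles each pattern once at load time and reuses the compiled objects."""

    def __init__(self, entries: list[dict]):
        # Order is taken from the YAML list: most specific first.
        self.compiled = [(re.compile(e["regex"]), e["canonical"]) for e in entries]

    def first_match(self, text: str) -> Optional[str]:
        """Return the canonical form from the first listed pattern that matches."""
        for compiled, template in self.compiled:
            match = compiled.search(text)
            if match:
                return template.format(**match.groupdict())
        return None

patterns = PatternSet([
    {"regex": r"제(?P<base>\d+)조의(?P<inserted>\d+)", "canonical": "article:{base}-{inserted}"},
    {"regex": r"제(?P<number>\d+)조", "canonical": "article:{number}"},
])
print(patterns.first_match("제34조의2"))  # article:34-2
print(patterns.first_match("제5조"))      # article:5
```

If the two entries were listed in the opposite order, the general pattern would claim 제34조의2 first and yield article:34.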
10. Integration with CountryProfile Schema
10.1 Backward Compatibility
The patterns field is OPTIONAL. Existing CountryProfile files without patterns continue to work. The LLM Judge (few-shots + validation_rules) remains the primary validation mechanism.
10.2 Relationship to Existing Fields
| Existing Field | Relationship to patterns |
|---|---|
| numbering | Human-readable description; patterns provide machine-readable regex |
| inserted_pattern | Human-readable; patterns provide regex for inserted articles |
| is_unnumbered_first | Flag; patterns handle detection via absence of numbered match |
| few_shots | LLM guidance; patterns provide deterministic extraction |
| canonical (in patterns) | Replaces what was previously implicit in reference_id |
10.3 Validation Rules Integration
The patterns field works alongside validation_rules (separate field, separate spec):
patterns → deterministic match: "is this a citation?"
validation_rules → LLM judgment: "is this citation correct?"
11. Complete Example: Korea (KR)
levels:
  - key: article
    label: "Article"
    numbering: "Arabic numeral + Article"
    description: "Basic unit of legislation"
    is_unnumbered: false
    inserted_pattern: "제N조의M"
    patterns:
      - regex: '제(?P<number>\d+)조'
        canonical: "article:{number}"
        captures:
          number: "Arabic numeral identifying the article"
        source: "Korean Law Information Center (law.go.kr)"
        description: "Standard article: 제1조, 제420조"
      - regex: '제\s*(?P<number>\d+)\s*조'
        canonical: "article:{number}"
        captures:
          number: "Arabic numeral identifying the article"
        source: "Supreme Court Comprehensive Legal Information (scourt.go.kr)"
        description: "Spaced variant: 제 5 조, 제5 조"
      - regex: '제(?P<base>\d+)조의(?P<inserted>\d+)'
        canonical: "article:{base}-{inserted}"
        captures:
          base: "Parent article number"
          inserted: "Inserted sub-article number"
        source: "Korean Law Information Center (law.go.kr)"
        description: "Inserted article: 제34조의2, 제6조의2"
  - key: paragraph
    label: "Paragraph"
    numbering: "Circled Arabic numerals: ①, ②, ③"
    description: "Subdivision of articles"
    is_unnumbered_first: true
    is_unnumbered_when_single: true
    patterns:
      - regex: '(?P<number>[①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳])'
        canonical: "paragraph:{number}"  # engine converts ①→1, ②→2, …, ⑳→20
        captures:
          number: "Circled numeral; the engine converts it to Arabic (①→1, ②→2, …, ⑳→20)"
        source: "Korean Law Information Center (law.go.kr) — paragraph notation standard"
        description: "Circled paragraph marker: ①, ②, ③"
12. Complete Example: United States (US)
levels:
  - key: section
    label: section
    numbering: "§ N"
    description: "Primary unit of the United States Code"
    patterns:
      - regex: '§\s*(?P<number>\d+[\w-]*)'
        canonical: "section:{number}"
        captures:
          number: "Section number (may include letters or hyphens)"
        source: "Office of the Law Revision Counsel (uscode.house.gov)"
        description: "Section with § symbol: § 78j, § 405"
  - key: subsection
    label: subsection
    numbering: "(a), (b), (c)"
    description: "First subdivision in USC"
    patterns:
      - regex: '\((?P<letter>[a-z])\)'
        canonical: "subsection:{letter}"
        captures:
          letter: "Lowercase letter"
        source: "Office of the Law Revision Counsel (uscode.house.gov)"
        description: "Lowercase letter in parentheses: (a), (b), (c)"
  - key: paragraph
    label: paragraph
    numbering: "(1), (2), (3)"
    description: "Second subdivision in USC"
    patterns:
      - regex: '\((?P<number>\d+)\)'
        canonical: "paragraph:{number}"
        captures:
          number: "Arabic numeral"
        source: "Office of the Law Revision Counsel (uscode.house.gov)"
        description: "Arabic numeral in parentheses: (1), (2), (3)"
  - key: clause
    label: clause
    numbering: "(i), (ii), (iii)"
    description: "Fourth subdivision in USC"
    patterns:
      # Single-line regex: a folded scalar would insert literal spaces into the pattern.
      # Every alternative is non-empty, so "()" does not match.
      - regex: '\((?P<numeral>x{1,2}(?:ix|iv|v?i{0,3})|ix|iv|v?i{1,3}|v)\)'
        canonical: "clause:{numeral}"
        captures:
          numeral: "Lowercase Roman numeral (i through xxix)"
        source: "Office of the Law Revision Counsel (uscode.house.gov)"
        description: "Lowercase Roman numeral: (i), (ii), ..., (xix), (xx), (xxix)"
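The level patterns above can also be chained into a combined pattern of the kind described in Section 9.1. The sketch below is one hypothetical combination covering section, subsection, and paragraph only; US_COMBINED and us_canonical are illustrative names and not part of the profile.

```python
import re

# Section, then an optional (letter) subsection, then an optional (number) paragraph.
US_COMBINED = re.compile(
    r"§\s*(?P<section>\d+[\w-]*)"
    r"(?:\((?P<subsection>[a-z])\))?"
    r"(?:\((?P<paragraph>\d+)\))?"
)
LEVELS = ("section", "subsection", "paragraph")

def us_canonical(text: str) -> list[str]:
    """Return one hierarchical canonical form per § citation in `text`."""
    results = []
    for match in US_COMBINED.finditer(text):
        # Skip levels whose optional group did not participate in the match.
        parts = [f"{level}:{match.group(level)}"
                 for level in LEVELS if match.group(level) is not None]
        results.append("/".join(parts))
    return results

print(us_canonical("See § 405(c)(2) and § 78j(b)."))
# ['section:405/subsection:c/paragraph:2', 'section:78j/subsection:b']
```

Note that single-letter Roman numerals such as (i), (v), and (x) also satisfy the subsection pattern, so an engine working from individual level patterns needs positional context to tell a clause from a subsection.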
13. Design Principles
Regex first, LLM second. Use patterns for what regex can reliably detect (article numbers, section markers). Use LLM Judge for semantic validation (is the citation contextually correct?).
Canonical over raw. The canonical form is the system of record. Multiple regex patterns may map to the same canonical: 제5조, 제 5 조, and 제5 조 all normalize to article:5.
Source everything. Every pattern MUST cite its authoritative source. This is a legal citation system, so provenance matters.
Script-aware, not script-specific. The pattern language uses Unicode-aware regex. Country-specific patterns handle local scripts; the canonical grammar is script-agnostic.
Backward compatible. Adding patterns to an existing CountryProfile is purely additive. No existing fields change meaning.
Human-readable. YAML, not compiled regex. The description and captures fields ensure patterns are understandable without running them.
14. Future Extensions
| Extension | Description | Status |
|---|---|---|
| validation_rules | LLM Judge rules for contextual validation | Design in progress |
| normalization_map | Explicit numeral conversion tables (Thai→Arabic, etc.) | Planned |
| anti_patterns | Negative patterns to exclude false positives | Planned |
| compound_patterns | Patterns for multi-article references (§§ 405-407) | Partially covered |
| case_law_patterns | Patterns for court decision citations | Country-specific |
| confidence | Per-pattern confidence score (0.0-1.0) | Planned |
15. Appendix: Regex Quick Reference
| Pattern | Meaning |
|---|---|
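| (?P<name>...) | Named capturing group |
| (?:...) | Non-capturing group |
| \d | Digit: [0-9] in RE2 or with re.ASCII; any Unicode decimal digit in Python str patterns by default |
| \w | Word character: [a-zA-Z0-9_] in RE2 or with re.ASCII; Unicode letters and digits plus _ in Python str patterns by default |
| \s | Whitespace |
| [abc] | Character class |
| [a-z] | Range in character class |
| + | One or more |
| * | Zero or more |
| ? | Zero or one |
| {n,m} | Between n and m repetitions |
| \| | Alternation |
| ^ | Start of string (or of each line with re.MULTILINE) |
| $ | End of string (or of each line with re.MULTILINE) |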