Pattern Language Specification v1.0

Status: Design Document
Date: 2026-06-25
Scope: Regex-based deterministic matching for legal citation references in CountryProfile YAML files

1. Purpose

This specification defines the patterns field for CountryProfile YAML files. Each level in a CountryProfile's levels array MAY include a patterns list that defines regex patterns for deterministic (non-LLM) extraction of citation references in legal text.

The pattern language serves two purposes:

Deterministic extraction — fast, reproducible identification of citation references without LLM calls
Canonical normalization — mapping diverse textual forms to a single canonical identifier

Patterns complement (not replace) the few_shots field, which provides LLM guidance. Patterns are for machines; few-shots are for models.

⚠️ Patterns are for extraction, not validation.

Patterns are used to find clause structures in legal text. Validation is handled by the LLM Judge (based on validation_rules). Using patterns for validation fails on 32% of few-shots, which involve text-only references, number-only references, or format mismatches. See docs/pattern-unmatchable-cases.md for details.

2. Field Placement

The patterns field is an optional list that appears inside each level definition in the levels array:

levels:
  - key: article
    label: "Article"
    numbering: "Arabic numeral + Article"
    description: "Basic unit of legislation"
    # ... existing fields unchanged ...

    patterns:                        # ← NEW FIELD
      - regex: '제(?P<number>\d+)조'
        canonical: "article:{number}"
        captures:
          number: "Arabic numeral identifying the article"
        source: "Korean Law Information Center (law.go.kr)"
        description: "Standard article reference"

The patterns key MUST NOT conflict with existing level fields (key, label, numbering, description, is_unnumbered, is_unnumbered_first, is_unnumbered_when_single, inserted_pattern, above_article, citing_format).

3. Pattern Entry Schema

Each entry in the patterns list MUST have exactly five keys:

Key	Type	Required	Description
`regex`	string	YES	Python-compatible regex (RE2-style). Uses named groups `(?P<name>...)` for captures.
`canonical`	string	YES	Normalized form template. References capture groups as `{name}`.
`captures`	map[string→string]	YES	Each named group → human description of what it captures.
`source`	string	YES	Authoritative origin (official DB, legal guide, citation manual).
`description`	string	YES	Human-readable explanation of when this pattern matches.

3.1 `regex`

SHOULD be a valid Python re pattern. Prefer RE2-compatible constructs (no lookbehind, no lookahead). Backreferences are permitted when needed (e.g., same-letter pair enforcement \2) but MUST have a companion RE2-compatible fallback pattern using explicit enumeration.
MUST use named groups (?P<name>...) for every captured element.
SHOULD use non-capturing groups (?:...) for structural alternation.
SHOULD anchor to natural boundaries (whitespace, punctuation, line start/end) to avoid false positives.
MAY use character classes for Unicode script ranges where needed.

Example:

regex: '제(?P<base>\d+)조의(?P<inserted>\d+)'
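To illustrate the backreference rule above, the following sketch pairs a backreference pattern for same-letter pairs with an enumerated fallback that needs no backreference. The group names `pair` and `first` and the helper `enumerated_pair_pattern` are hypothetical and only serve the illustration.

```python
import re
import string

# Backreference form: the second letter must repeat the first, e.g. (aa), (bb).
BACKREF = re.compile(r'\((?P<pair>(?P<first>[a-z])(?P=first))\)')


def enumerated_pair_pattern() -> str:
    """Build an equivalent pattern by listing aa|bb|...|zz explicitly."""
    pairs = "|".join(ch * 2 for ch in string.ascii_lowercase)
    return r'\((?P<pair>' + pairs + r')\)'


# Fallback form: no backreference, so it stays within RE2-style constructs.
FALLBACK = re.compile(enumerated_pair_pattern())

for text in ["(aa)", "(zz)", "(ab)"]:
    a = BACKREF.search(text)
    b = FALLBACK.search(text)
    # Both forms are expected to agree on every input.
    print(text, a is not None, b is not None)
```

The enumeration is longer but mechanical to generate, which is why it works as a companion pattern rather than a hand-written one.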

3.2 `canonical`

Uses {name} placeholders that reference named groups from regex.
Follows the canonical form grammar (Section 4).
MAY contain literal separators (:, -, /, .).
Captures that are descriptive metadata (titles, names, years) MUST NOT appear in canonical form.

Example:

canonical: "article:{base}-{inserted}"
# For regex match of 제34조의2 → article:34-2
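As a minimal sketch of how a `canonical` template could be filled from a match, the snippet below substitutes the named groups into the `{name}` placeholders with `str.format`. The function name `render_canonical` is an assumption for illustration; the specification does not prescribe an API.

```python
import re
from typing import Optional


def render_canonical(regex: str, canonical: str, text: str) -> Optional[str]:
    """Return the canonical form for the first match in text, or None."""
    match = re.search(regex, text)
    if match is None:
        return None
    # Named groups fill the {name} placeholders of the template.
    return canonical.format(**match.groupdict())


print(render_canonical(r'제(?P<base>\d+)조의(?P<inserted>\d+)',
                       "article:{base}-{inserted}",
                       "제34조의2"))  # article:34-2
```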

3.3 `captures`

A flat map where each key matches a named group in regex, and each value is a human-readable description.

captures:
  base: "Parent article number before insertion"
  inserted: "Inserted sub-article number"

3.4 `source`

MUST cite the specific authoritative source. Format:

"[Source Name] ([URL or identifier]) — [what it covers]"

Grading follows CountryProfile source grades: official_1st, official_2nd, academic_institutional, unofficial_secondary.

3.5 `description`

Human-readable. SHOULD explain:

What textual form this pattern matches
When it appears (in legislative text, in citations, in court decisions)
Any disambiguation notes
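The entry schema in this section lends itself to a mechanical check. The sketch below is one possible validator, assuming entries have already been loaded from YAML into plain dictionaries; `validate_entry` and its messages are hypothetical names, not part of the specification.

```python
import re
import string
from typing import Dict, List

REQUIRED_KEYS = {"regex", "canonical", "captures", "source", "description"}


def validate_entry(entry: Dict) -> List[str]:
    """Return a list of schema problems; an empty list means the entry passes."""
    if set(entry) != REQUIRED_KEYS:
        return ["entry must have exactly the keys: " + ", ".join(sorted(REQUIRED_KEYS))]
    try:
        compiled = re.compile(entry["regex"])
    except re.error as exc:
        return [f"regex does not compile: {exc}"]
    problems = []
    named = set(compiled.groupindex)
    # Section 3.1: every capturing group must be a named group.
    if compiled.groups != len(named):
        problems.append("every capturing group must be named")
    # Section 3.3: captures describes exactly the named groups.
    if set(entry["captures"]) != named:
        problems.append("captures keys must equal the named groups")
    # Section 3.2: placeholders must reference groups defined by the regex.
    placeholders = {field for _, field, _, _
                    in string.Formatter().parse(entry["canonical"]) if field}
    if not placeholders <= named:
        problems.append("canonical references a group the regex does not define")
    return problems


entry = {
    "regex": r"제(?P<base>\d+)조의(?P<inserted>\d+)",
    "canonical": "article:{base}-{inserted}",
    "captures": {"base": "Parent article number",
                 "inserted": "Inserted sub-article number"},
    "source": "Korean Law Information Center (law.go.kr)",
    "description": "Inserted article: 제34조의2",
}
print(validate_entry(entry))  # []
```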

4. Canonical Form Grammar

The canonical form is a hierarchical path using / as separator:

level:identifier
level:identifier/level:identifier
level:identifier/level:identifier/level:identifier

4.1 Identifier Formats

Format	Use Case	Example
`N`	Simple numeric	`article:5`, `section:405`
`B-I`	Inserted article (base-inserted)	`article:34-2`, `article:15-2-2`
`Ns`	Number with attached suffix (letter or Latin ordinal)	`paragraph:123a`, `article:9bis`
`L`	Single letter	`subsection:a`, `subparagraph:A`
`LL`	Double letter (same-letter pair)	`item:aa`, `subitem:AA`
`R`	Roman numeral	`clause:ii`, `subclause:III`
`C`	CJK/Hangul character	`subitem:가`
`N.N`	Part.section (CFR)	`section:303.1`
`single`	Unnumbered first paragraph	`paragraph:single`

4.2 Reserved Characters

Char	Meaning
`:`	Level-identifier separator
`/`	Level-level separator (hierarchy)
`-`	Inserted article separator (base-inserted)
`.`	Part.section separator (CFR) or decimal

4.3 Examples

Citation Text	Canonical Form
제5조	`article:5`
제34조의2	`article:34-2`
②	`paragraph:2`
§ 405(c)(2)	`section:405/subsection:c/paragraph:2`
第15条の2第1項第3号	`article:15-2/paragraph:1/item:3`
Art. L. 1234-1	`article:L1234-1`
§ 823 Abs. 1 S. 1	`paragraph:823/absatz:1/satz:1`
المادة 5 فقرة (2) بند (أ)	`article:5/paragraph:2/item:أ`
มาตรา 420 วรรคสอง (1)	`section:420/paragraph:2/subsection:1`
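The grammar above can be read back into structured data. The following is a minimal sketch of such a parser, assuming that `/` occurs only as the level separator and that each segment contains a single `:` between level and identifier; `parse_canonical` is a hypothetical helper name.

```python
from typing import List, Tuple


def parse_canonical(canonical: str) -> List[Tuple[str, str]]:
    """Split a canonical form into (level, identifier) pairs, outermost first."""
    parts = []
    for segment in canonical.split("/"):
        # partition keeps everything after the first ':' as the identifier.
        level, sep, identifier = segment.partition(":")
        if not sep or not level or not identifier:
            raise ValueError(f"malformed segment: {segment!r}")
        parts.append((level, identifier))
    return parts


print(parse_canonical("section:405/subsection:c/paragraph:2"))
# [('section', '405'), ('subsection', 'c'), ('paragraph', '2')]
```

Identifiers that themselves contain `/` would need an escaping rule or a different separator before a parser like this could handle them.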

5. Multi-Script Support

The pattern language MUST support citation references in the following scripts:

5.1 Script-Specific Considerations

Script	Countries	Numeral System	Key Pattern Features
Latin	US, GB, DE, FR, ES, IT, BR, PT, NL, PL, SE	Arabic (0-9)	§ symbol, Roman numerals, letter subdivisions
CJK (Chinese)	CN	Chinese (一二三)	第+N+条/款/项/目, fullwidth parentheses（）
CJK (Japanese)	JP	Kanji (一二三) + Arabic	第+N+条, のN inserted, 第+漢字+号
CJK (Korean)	KR	Arabic + circled ① + Hangul	제+N+조, ①②③, 가나다
Cyrillic	RU, UA	Arabic + Cyrillic letters	статья/часть/пункт, а/б/в subdivisions
Arabic	SA, EG, MA	Eastern Arabic (٠-٩) + Western Arabic	مادة/فقرة/بند, ordinal words
Thai	TH	Thai (๐-๙) + Arabic	มาตรา/วรรค/อนุมาตรา, Thai ordinal words
Devanagari	IN	Arabic + Devanagari	धारा/अनुच्छेद/खंड

5.2 Unicode in Regex

Patterns MAY use Unicode character classes for script-specific matching:

# Thai numerals
regex: 'มาตรา\s*(?P<number>[๐-๙]+|[0-9]+)'

# Chinese numerals
regex: '第(?P<number>[一二三四五六七八九十百千零]+)条'

# Eastern Arabic numerals
regex: 'مادة\s*(?P<number>[٠-٩]+|[0-9]+)'

# Circled numerals (Korean/Japanese)
regex: '(?P<number>[①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳])'

5.3 Numeral Normalization

The canonical form ALWAYS uses Western Arabic numerals (0-9). When a pattern captures non-Western numerals, the captured value MUST be converted before it is substituted into the canonical template:

# For Chinese numerals, a conversion function is needed
regex: '第(?P<number>[一二三四五六七八九十百千零]+)条'
canonical: "article:{number}"  # {number} must be converted from Chinese numerals to Western Arabic

The conversion is the responsibility of the pattern matching engine, not the regex itself. The regex captures the raw text; the engine normalizes before applying the canonical template.
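The engine-side normalization might look like the sketch below. It is an illustration under stated assumptions: the function names are hypothetical, Chinese numerals are assumed to be below 10000 and written with unit characters (十, 百, 千) rather than positionally, and circled numerals are limited to ① through ⑳.

```python
CJK_DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
              "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
CJK_UNITS = {"十": 10, "百": 100, "千": 1000}


def cjk_to_int(text: str) -> int:
    """Convert numerals such as 十五 or 一百零五 (below 10000) to an int."""
    total, pending = 0, 0
    for ch in text:
        if ch in CJK_DIGITS:
            pending = CJK_DIGITS[ch]
        else:
            # A bare unit counts as one of that unit: 十五 is 10 + 5.
            total += (pending or 1) * CJK_UNITS[ch]
            pending = 0
    return total + pending


def normalize_numeral(raw: str) -> str:
    """Return the Western Arabic form of a captured numeral."""
    if len(raw) == 1 and "①" <= raw <= "⑳":
        # Circled numerals ① to ⑳ are contiguous code points U+2460 to U+2473.
        return str(ord(raw) - ord("①") + 1)
    if raw and all(ch in CJK_DIGITS or ch in CJK_UNITS for ch in raw):
        return str(cjk_to_int(raw))
    # int() accepts Unicode decimal digits, which covers Thai and Eastern Arabic.
    return str(int(raw))


for raw in ["②", "十五", "一百零五", "๔๒๐", "٥", "34"]:
    print(raw, normalize_numeral(raw))
```

Thai ordinal words and other spelled-out numbers are not handled here; they would need a lookup table of their own.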

6. Inserted Articles

Inserted articles (가지번호, 枝番号, -bis, /N, .N, Na) are articles inserted between existing ones during legislative amendments. They have a base number and an insertion suffix.

6.1 Inserted Article Patterns by Country

Country	Pattern	Example	Canonical
KR	제N조의M	제34조의2	`article:34-2`
JP	第N条のM	第15条の2	`article:15-2`
JP (deep)	第N条のMのK	第15条の2の2	`article:15-2-2`
DE	§ Na, § Nb	§ 123a	`paragraph:123a`
FR	Article N bis/ter/quater	Article 9 bis	`article:9bis`
ES	Artículo N bis	Artículo 5 bis	`article:5bis`
TH	มาตรา N/M	มาตรา 193/1	`section:193/1`
RU	статья N.N	статья 123.1	`article:123.1`
IT	Art. N-bis	Art. 5-bis	`article:5bis`

6.2 Regex Pattern

Inserted article patterns MUST capture both base and suffix:

# Korean
regex: '제(?P<base>\d+)조의(?P<inserted>\d+)'
canonical: "article:{base}-{inserted}"

# Japanese (deep insertion)
regex: '第(?P<base>\d+)条の(?P<first>\d+)の(?P<second>\d+)'
canonical: "article:{base}-{first}-{second}"

# German
regex: '§\s*(?P<number>\d+)(?P<letter>[a-z])'
canonical: "paragraph:{number}{letter}"

# French
regex: 'Article\s+(?P<number>\d+)\s+(?P<suffix>bis|ter|quater|quinquies|sexies|septies|octies|nonies|decies)'
canonical: "article:{number}{suffix}"

# Thai
regex: 'มาตรา\s*(?P<base>\d+)/(?P<inserted>\d+)'
canonical: "section:{base}/{inserted}"

7. Unnumbered First Paragraphs

Many legal systems omit the number for the first paragraph (항, Absatz, alinéa, 款, فقرة, วรรค). The pattern language handles this through positional detection, a fixed canonical representation, and per-country flags:

7.1 Detection via Context

The first paragraph of an article is identified by position — it appears immediately after the article header, before any numbered paragraph marker. No regex pattern matches it directly; instead, the matching engine infers its existence.

7.2 Canonical Representation

Scenario	Canonical Form
First paragraph (unnumbered, with subsequent numbered paragraphs)	`paragraph:1`
Single paragraph (unnumbered, no other paragraphs)	`paragraph:single`
Numbered paragraph	`paragraph:N`

7.3 Country Flags

The CountryProfile level definition uses existing flags:

- key: paragraph
  is_unnumbered_first: true    # First paragraph has no number
  is_unnumbered_when_single: true  # Single paragraph has no number

Patterns do NOT need to match unnumbered paragraphs directly. The matching engine uses these flags to assign paragraph:1 to text that precedes the first numbered paragraph marker, or paragraph:single when the article contains no numbered marker at all.
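A rough sketch of how an engine could apply these flags is shown below. It assumes circled-numeral paragraph markers and receives the body of a single article; the function `label_paragraphs` and its parameters are hypothetical.

```python
import re
from typing import List

# Numbered paragraph markers ① to ⑳ (assumed marker style for this sketch).
MARKER = re.compile(r'[①-⑳]')


def label_paragraphs(body: str,
                     is_unnumbered_first: bool = True,
                     is_unnumbered_when_single: bool = True) -> List[str]:
    """Return canonical paragraph labels for one article body."""
    markers = list(MARKER.finditer(body))
    if not markers:
        # No numbered marker at all: the whole body is a single paragraph.
        if is_unnumbered_when_single and body.strip():
            return ["paragraph:single"]
        return []
    labels = []
    # Text before the first marker is the unnumbered first paragraph.
    if is_unnumbered_first and body[:markers[0].start()].strip():
        labels.append("paragraph:1")
    for m in markers:
        labels.append(f"paragraph:{ord(m.group()) - ord('①') + 1}")
    return labels


print(label_paragraphs("본문 ② 둘째 ③ 셋째"))
# ['paragraph:1', 'paragraph:2', 'paragraph:3']
print(label_paragraphs("단일 문단"))
# ['paragraph:single']
```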

8. Order-Independent Matching

Legal text may present citations in different orders:

8.1 Article-First (Standard)

Most legal systems present the article first, then subdivisions:

제5조 제2항 제3호        (KR — article/paragraph/item)
§ 405(c)(2)(C)(ii)       (US — section/subsection/paragraph/subparagraph/clause)
Article 1240, alinéa 2    (FR — article/alinea)

8.2 Paragraph-First (Inline)

Some systems allow subdivisions to appear without the article when the article is already established:

② 3. 가.                  (KR — paragraph/item/subitem, article implied)
(2)(C)(ii)                (US — paragraph/subparagraph/clause, section implied)

8.3 Matching Strategy

The pattern matching engine MUST:

Try combined patterns first (multi-level in one regex)
Fall back to individual level patterns
Build the canonical form by composing matched levels
Handle implicit article from context (paragraph-first text assumes the current article)
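The strategy above can be sketched for two levels as follows. The three patterns and the `resolve` function are hypothetical examples rather than entries from an actual CountryProfile, and the `current_article` argument stands in for whatever structural context the engine tracks.

```python
import re
from typing import Optional

# Combined pattern: article and paragraph in one regex.
COMBINED = re.compile(r'제(?P<article>\d+)조\s*제(?P<paragraph>\d+)항')
# Individual level patterns used as the fallback.
ARTICLE = re.compile(r'제(?P<article>\d+)조')
PARAGRAPH = re.compile(r'제(?P<paragraph>\d+)항')


def resolve(text: str, current_article: Optional[str] = None) -> Optional[str]:
    """Build a canonical form, falling back to the implied article if needed."""
    combined = COMBINED.search(text)
    if combined:
        return (f"article:{combined.group('article')}"
                f"/paragraph:{combined.group('paragraph')}")
    art = ARTICLE.search(text)
    par = PARAGRAPH.search(text)
    # Paragraph-first text assumes the article established by context.
    article = art.group("article") if art else current_article
    if article is None:
        return None
    canonical = f"article:{article}"
    if par:
        canonical += f"/paragraph:{par.group('paragraph')}"
    return canonical


print(resolve("제5조 제2항"))                  # article:5/paragraph:2
print(resolve("제2항", current_article="34"))  # article:34/paragraph:2
```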

9. Pattern Matching Engine Requirements

9.1 Processing Order

For a given text chunk:

Apply combined patterns — multi-level regex that captures several levels at once
Apply level patterns — individual patterns for each level, broadest first
Compose canonical — assemble multi-level canonical from individual matches
Normalize — convert non-Arabic numerals to Arabic in canonical form
Deduplicate — if a combined pattern and individual patterns match the same text span, prefer the combined pattern

9.2 Overlap Resolution

When multiple patterns match overlapping text:

Longest match wins
Combined patterns take priority over individual patterns
If two patterns at the same level match, the one with more specific context anchors wins
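One way to read the first two rules is as a sort key followed by a greedy sweep, as in the sketch below. The `Hit` record and `resolve_overlaps` are assumptions for illustration. The third rule (more specific context anchors) is not modeled, and treating "combined" as a tie-breaker after length is one interpretation of the ordering.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Hit:
    start: int            # span start in the chunk
    end: int              # span end (exclusive)
    canonical: str
    combined: bool = False


def resolve_overlaps(hits: Iterable[Hit]) -> List[Hit]:
    """Keep non-overlapping hits, preferring longer spans, then combined ones."""
    ranked = sorted(hits, key=lambda h: (-(h.end - h.start), not h.combined, h.start))
    kept: List[Hit] = []
    for hit in ranked:
        # Accept a hit only if it does not overlap anything already kept.
        if all(hit.end <= k.start or hit.start >= k.end for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


hits = [
    Hit(0, 4, "article:34"),      # 제34조 inside 제34조의2
    Hit(0, 6, "article:34-2"),    # 제34조의2
    Hit(10, 13, "article:5"),
]
print([h.canonical for h in resolve_overlaps(hits)])
# ['article:34-2', 'article:5']
```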

9.3 Context Window

Patterns are applied to a text chunk (typically a paragraph or article). The engine MAY use the chunk's structural context (e.g., "this chunk is within Article 34") to resolve implicit references.

9.4 Performance

Patterns MUST be compiled once and reused
Pattern lists SHOULD be ordered by specificity (most specific first)
Combined patterns SHOULD come before individual patterns in the list
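The sketch below shows one way to meet these points: entries are compiled once when a level is loaded and then tried in list order, so a list ordered most specific first yields the most specific match. The class `CompiledLevel` and its method are hypothetical names.

```python
import re
from typing import Dict, Iterable, Optional


class CompiledLevel:
    """Compile a level's pattern entries once, preserving profile order."""

    def __init__(self, entries: Iterable[Dict]):
        # Compilation happens here, not on every call to first_match.
        self._compiled = [(re.compile(e["regex"]), e["canonical"]) for e in entries]

    def first_match(self, text: str) -> Optional[str]:
        """Return the canonical form from the first entry that matches."""
        for pattern, template in self._compiled:
            match = pattern.search(text)
            if match:
                return template.format(**match.groupdict())
        return None


# Most specific entry (inserted article) listed before the plain article.
article_level = CompiledLevel([
    {"regex": r"제(?P<base>\d+)조의(?P<inserted>\d+)",
     "canonical": "article:{base}-{inserted}"},
    {"regex": r"제(?P<number>\d+)조",
     "canonical": "article:{number}"},
])
print(article_level.first_match("제34조의2에 따라"))  # article:34-2
print(article_level.first_match("제5조"))            # article:5
```

With the order reversed, the plain article pattern would match the prefix 제34조 first and the inserted suffix would be lost, which is the case the ordering guidance is meant to avoid.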

10. Integration with CountryProfile Schema

10.1 Backward Compatibility

The patterns field is OPTIONAL. Existing CountryProfile files without patterns continue to work. The LLM Judge (few-shots + validation_rules) remains the primary validation mechanism.

10.2 Relationship to Existing Fields

Existing Field	Relationship to `patterns`
`numbering`	Human-readable description; patterns provide machine-readable regex
`inserted_pattern`	Human-readable; patterns provide regex for inserted articles
`is_unnumbered_first`	Flag; the matching engine infers the unnumbered first paragraph when no numbered pattern matches
`few_shots`	LLM guidance; patterns provide deterministic extraction
`canonical` (in patterns)	Replaces what was previously implicit in `reference_id`

10.3 Validation Rules Integration

The patterns field works alongside validation_rules (separate field, separate spec):

patterns           →  deterministic match: "is this a citation?"
validation_rules   →  LLM judgment: "is this citation correct?"

11. Complete Example: Korea (KR)

levels:
  - key: article
    label: "Article"
    numbering: "Arabic numeral + Article"
    description: "Basic unit of legislation"
    is_unnumbered: false
    inserted_pattern: "제N조의M"
    patterns:
      - regex: '제(?P<base>\d+)조의(?P<inserted>\d+)'
        canonical: "article:{base}-{inserted}"
        captures:
          base: "Parent article number"
          inserted: "Inserted sub-article number"
        source: "Korean Law Information Center (law.go.kr)"
        description: "Inserted article: 제34조의2, 제6조의2"

      - regex: '제(?P<number>\d+)조'
        canonical: "article:{number}"
        captures:
          number: "Arabic numeral identifying the article"
        source: "Korean Law Information Center (law.go.kr)"
        description: "Standard article: 제1조, 제420조"

      - regex: '제\s*(?P<number>\d+)\s*조'
        canonical: "article:{number}"
        captures:
          number: "Arabic numeral identifying the article"
        source: "Supreme Court Comprehensive Legal Information (scourt.go.kr)"
        description: "Spaced variant: 제 5 조, 제5 조"

  - key: paragraph
    label: "Paragraph"
    numbering: "Circled Arabic numerals: ①, ②, ③"
    description: "Subdivision of articles"
    is_unnumbered_first: true
    is_unnumbered_when_single: true
    patterns:
      - regex: '(?P<number>[①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳])'
        canonical: "paragraph:{number}"  # engine converts ①→1, ②→2, …, ⑳→20
        captures:
          number: "Circled numeral — engine converts to Arabic (①→1, ②→2, …, ⑳→20)"
        source: "Korean Law Information Center (law.go.kr) — paragraph notation standard"
        description: "Circled-numeral paragraph marker: ①, ②, ③"

12. Complete Example: United States (US)

levels:
  - key: section
    label: section
    numbering: "§ N"
    description: "Primary unit of the United States Code"
    patterns:
      - regex: '§\s*(?P<number>\d+[\w-]*)'
        canonical: "section:{number}"
        captures:
          number: "Section number (may include letters or hyphens)"
        source: "Office of the Law Revision Counsel (uscode.house.gov)"
        description: "Section with § symbol: § 78j, § 405"

  - key: subsection
    label: subsection
    numbering: "(a), (b), (c)"
    description: "First subdivision in USC"
    patterns:
      - regex: '\((?P<letter>[a-z])\)'
        canonical: "subsection:{letter}"
        captures:
          letter: "Lowercase letter"
        source: "Office of the Law Revision Counsel (uscode.house.gov)"
        description: "Lowercase letter in parentheses: (a), (b), (c)"

  - key: paragraph
    label: paragraph
    numbering: "(1), (2), (3)"
    description: "Second subdivision in USC"
    patterns:
      - regex: '\((?P<number>\d+)\)'
        canonical: "paragraph:{number}"
        captures:
          number: "Arabic numeral"
        source: "Office of the Law Revision Counsel (uscode.house.gov)"
        description: "Arabic numeral in parentheses: (1), (2), (3)"

  - key: clause
    label: clause
    numbering: "(i), (ii), (iii)"
    description: "Fourth subdivision in USC"
    patterns:
      - regex: '\((?P<numeral>x{0,2}(?:i[xv]|vi{0,3}|i{1,3})|x{1,2})\)'
        canonical: "clause:{numeral}"
        captures:
          numeral: "Lowercase Roman numeral (i through xxix)"
        source: "Office of the Law Revision Counsel (uscode.house.gov)"
        description: "Lowercase Roman numeral: (i), (ii), ..., (xix), (xx), (xxiii)"

13. Design Principles

Regex first, LLM second. Use patterns for what regex can reliably detect (article numbers, section markers). Use LLM Judge for semantic validation (is the citation contextually correct?).
Canonical over raw. The canonical form is the system of record. Multiple regex patterns may map to the same canonical. 제5조, 제 5 조, and 제5 조 all normalize to article:5.
Source everything. Every pattern MUST cite its authoritative source. This is a legal citation system — provenance matters.
Script-aware, not script-specific. The pattern language uses Unicode-aware regex. Country-specific patterns handle local scripts; the canonical grammar is script-agnostic.
Backward compatible. Adding patterns to an existing CountryProfile is purely additive. No existing fields change meaning.
Human-readable. YAML, not compiled regex. The description and captures fields ensure patterns are understandable without running them.

14. Future Extensions

Extension	Description	Status
`validation_rules`	LLM Judge rules for contextual validation	Design in progress
`normalization_map`	Explicit numeral conversion tables (Thai→Arabic, etc.)	Planned
`anti_patterns`	Negative patterns to exclude false positives	Planned
`compound_patterns`	Patterns for multi-article references (§§ 405-407)	Partially covered
`case_law_patterns`	Patterns for court decision citations	Country-specific
`confidence`	Per-pattern confidence score (0.0-1.0)	Planned

15. Appendix: Regex Quick Reference

Pattern	Meaning
`(?P<name>...)`	Named capturing group
`(?:...)`	Non-capturing group
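`\d`	Digit; [0-9] with re.ASCII, otherwise any Unicode decimal digit (Python 3 str patterns)
`\w`	Word character; [a-zA-Z0-9_] with re.ASCII, otherwise any Unicode alphanumeric character or underscore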
`\s`	Whitespace
`[abc]`	Character class
`[a-z]`	Range in character class
`+`	One or more
`*`	Zero or more
`?`	Zero or one
`{n,m}`	Between n and m repetitions
`|`	Alternation
`^`	Start of string/line
`$`	End of string/line
