Pattern Language

Pattern Language Specification v1.0

Status: Design Document
Date: 2026-06-25
Scope: Regex-based deterministic matching for legal citation references in CountryProfile YAML files


1. Purpose

This specification defines the patterns field for CountryProfile YAML files. Each level in a CountryProfile's levels array MAY include a patterns list that defines regex patterns for deterministic (non-LLM) extraction of citation references in legal text.

The pattern language serves two purposes:

  1. Deterministic extraction — fast, reproducible identification of citation references without LLM calls
  2. Canonical normalization — mapping diverse textual forms to a single canonical identifier

Patterns complement, but do not replace, the few_shots field, which provides LLM guidance. Patterns are for machines; few-shots are for models.

⚠️ Patterns are for extraction, not validation.

Patterns are used to find clause structures in legal text. Validation is handled by the LLM Judge (based on validation_rules). Attempting to validate with patterns fails on 32% of few-shots (text-only references, number-only references, format mismatches). See docs/pattern-unmatchable-cases.md for details.


2. Field Placement

The patterns field is an optional list that appears inside each level definition in the levels array:

levels:
  - key: article
    label: "Article"
    numbering: "Arabic numeral + Article"
    description: "Basic unit of legislation"
    # ... existing fields unchanged ...

    patterns:                        # ← NEW FIELD
      - regex: '제(?P<number>\d+)조'
        canonical: "article:{number}"
        captures:
          number: "Arabic numeral identifying the article"
        source: "Korean Law Information Center (law.go.kr)"
        description: "Standard article reference"

The patterns key MUST NOT conflict with existing level fields (key, label, numbering, description, is_unnumbered, is_unnumbered_first, is_unnumbered_when_single, inserted_pattern, above_article, citing_format).


3. Pattern Entry Schema

Each entry in the patterns list MUST have exactly five keys:

| Key | Type | Required | Description |
|---|---|---|---|
| regex | string | YES | Python-compatible regex (RE2-style). Uses named groups (?P<name>...) for captures. |
| canonical | string | YES | Normalized form template. References capture groups as {name}. |
| captures | map[string→string] | YES | Each named group → human description of what it captures. |
| source | string | YES | Authoritative origin (official DB, legal guide, citation manual). |
| description | string | YES | Human-readable explanation of when this pattern matches. |

3.1 regex

  • SHOULD be a valid Python re pattern. Prefer RE2-compatible constructs (no lookbehind, no lookahead). Backreferences are permitted when needed (e.g., same-letter pair enforcement \2) but MUST have a companion RE2-compatible fallback pattern using explicit enumeration.
  • MUST use named groups (?P<name>...) for every captured element.
  • SHOULD use non-capturing groups (?:...) for structural alternation.
  • SHOULD anchor to natural boundaries (whitespace, punctuation, line start/end) to avoid false positives.
  • MAY use character classes for Unicode script ranges where needed.

Example:

regex: '제(?P<base>\d+)조의(?P<inserted>\d+)'

3.2 canonical

  • Uses {name} placeholders that reference named groups from regex.
  • Follows the canonical form grammar (Section 4).
  • MAY contain literal separators (:, -, /, .).
  • Captures that are descriptive metadata (titles, names, years) MUST NOT appear in canonical form.

Example:

canonical: "article:{base}-{inserted}"
# For regex match of 제34조의2 → article:34-2

3.3 captures

A flat map where each key matches a named group in regex, and each value is a human-readable description.

captures:
  base: "Parent article number before insertion"
  inserted: "Inserted sub-article number"

3.4 source

MUST cite the specific authoritative source. Format:

"[Source Name] ([URL or identifier]) — [what it covers]"

Grading follows CountryProfile source grades: official_1st, official_2nd, academic_institutional, unofficial_secondary.

3.5 description

Human-readable. SHOULD explain:

  • What textual form this pattern matches
  • When it appears (in legislative text, in citations, in court decisions)
  • Any disambiguation notes
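
The schema rules in this section lend themselves to a mechanical check. The following is a minimal sketch, assuming entries have already been loaded from YAML into plain dicts; the function name check_pattern_entry and the wording of the problem messages are illustrative and not part of this specification.

```python
import re
import string

REQUIRED_KEYS = {"regex", "canonical", "captures", "source", "description"}


def check_pattern_entry(entry: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the entry passes."""
    if set(entry) != REQUIRED_KEYS:
        missing = sorted(REQUIRED_KEYS - set(entry))
        extra = sorted(set(entry) - REQUIRED_KEYS)
        # The remaining checks need all five keys, so stop here.
        return [f"keys: missing={missing} extra={extra}"]
    try:
        compiled = re.compile(entry["regex"])
    except re.error as exc:
        return [f"invalid regex: {exc}"]

    problems = []
    groups = set(compiled.groupindex)
    # Section 3.1: every capturing group must be named.
    if compiled.groups != len(groups):
        problems.append("regex contains unnamed capturing groups")
    # Section 3.3: captures describes exactly the named groups.
    if set(entry["captures"]) != groups:
        problems.append("captures keys do not match named groups")
    # Section 3.2: canonical placeholders must reference named groups.
    placeholders = {
        name
        for _, name, _, _ in string.Formatter().parse(entry["canonical"])
        if name
    }
    unknown = sorted(placeholders - groups)
    if unknown:
        problems.append(f"canonical references unknown groups: {unknown}")
    return problems
```

Running such a check when a CountryProfile is loaded would surface a mismatch between regex, captures, and canonical before any text is processed.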

4. Canonical Form Grammar

The canonical form is a hierarchical path using / as separator:

level:identifier
level:identifier/level:identifier
level:identifier/level:identifier/level:identifier

4.1 Identifier Formats

| Format | Use Case | Example |
|---|---|---|
| N | Simple numeric | article:5, section:405 |
| B-I | Inserted article (base-inserted) | article:34-2, paragraph:123a |
| L | Single letter | subsection:a, subparagraph:A |
| LL | Double letter (same-letter pair) | item:aa, subitem:AA |
| R | Roman numeral | clause:ii, subclause:III |
| C | CJK/Hangul character | subitem:가 |
| N.N | Part.section (CFR) | section:303.1 |
| single | Unnumbered first paragraph | paragraph:single |

4.2 Reserved Characters

| Char | Meaning |
|---|---|
| : | Level-identifier separator |
| / | Level-level separator (hierarchy) |
| - | Inserted article separator (base-inserted) |
| . | Part.section separator (CFR) or decimal |

4.3 Examples

| Citation Text | Canonical Form |
|---|---|
| 제5조 | article:5 |
| 제34조의2 | article:34-2 |
|  | paragraph:2 |
| § 405(c)(2) | section:405/subsection:c/paragraph:2 |
| 第15条の2第1項第3号 | article:15-2/paragraph:1/item:3 |
| Art. L. 1234-1 | article:L1234-1 |
| § 823 Abs. 1 S. 1 | paragraph:823/absatz:1/satz:1 |
| المادة 5 فقرة (2) بند (أ) | article:5/paragraph:2/item:أ |
| มาตรา 420 วรรคสอง (1) | section:420/paragraph:two/subsection:1 |
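
A canonical path can be split back into its levels with a few lines of Python. This is a sketch under two assumptions that the grammar does not state explicitly: level keys consist of lowercase ASCII letters and underscores, and a "/" only starts a new segment when a level key and ":" follow it. The second assumption keeps identifiers that themselves contain "/" in one piece. The name parse_canonical is illustrative.

```python
import re

# A segment starts at the beginning of the path, or after a "/" that is
# followed by a level key and ":" (an assumption of this sketch).
_SEGMENT_START = re.compile(r"(?:^|/)(?=[a-z_]+:)")


def parse_canonical(path: str) -> list[tuple[str, str]]:
    """Split a canonical path into (level, identifier) pairs."""
    pairs = []
    for segment in _SEGMENT_START.split(path):
        if not segment:
            continue  # the empty piece produced by the match at position 0
        level, _, identifier = segment.partition(":")
        pairs.append((level, identifier))
    return pairs


print(parse_canonical("section:405/subsection:c/paragraph:2"))
print(parse_canonical("article:34-2"))
```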

5. Multi-Script Support

The pattern language MUST support citation references in the following scripts:

5.1 Script-Specific Considerations

| Script | Countries | Numeral System | Key Pattern Features |
|---|---|---|---|
| Latin | US, GB, DE, FR, ES, IT, BR, PT, NL, PL, SE | Arabic (0-9) | § symbol, Roman numerals, letter subdivisions |
| CJK (Chinese) | CN | Chinese (一二三) | 第+N+条/款/项/目, fullwidth parentheses （） |
| CJK (Japanese) | JP | Kanji (一二三) + Arabic | 第+N+条, のN inserted, 第+漢字+号 |
| CJK (Korean) | KR | Arabic + circled ① + Hangul | 제+N+조, ①②③, 가나다 |
| Cyrillic | RU, UA | Arabic + Cyrillic letters | статья/часть/пункт, а/б/в subdivisions |
| Arabic | SA, EG, MA | Eastern Arabic (٠-٩) + Western Arabic | مادة/فقرة/بند, ordinal words |
| Thai | TH | Thai (๐-๙) + Arabic | มาตรา/วรรค/อนุมาตรา, Thai ordinal words |
| Devanagari | IN | Arabic + Devanagari | धारा/अनुच्छेद/खंड |

5.2 Unicode in Regex

Patterns MAY use Unicode character classes for script-specific matching:

# Thai numerals
regex: 'มาตรา\s*(?P<number>[๐-๙]+|[0-9]+)'

# Chinese numerals (formal)
regex: '第(?P<number>[一二三四五六七八九十百千零]+)条'

# Eastern Arabic numerals
regex: 'مادة\s*(?P<number>[٠-٩]+|[0-9]+)'

# Circled numerals (Korean/Japanese)
regex: '(?P<number>[①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳])'
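
One detail worth checking when writing such patterns for Python's re module: in Python 3 str patterns, \d is Unicode-aware by default and matches digits in other scripts, while RE2 and Python's re.ASCII flag restrict it to [0-9]. The short sketch below illustrates the difference on a Thai reference; the variable names are illustrative only.

```python
import re

text = "มาตรา ๔๒๐"

# Default str pattern: \d matches any Unicode decimal digit, including Thai.
unicode_aware = re.compile(r"มาตรา\s*(?P<number>\d+)")
# re.ASCII limits \d to [0-9], which is also how RE2 treats \d.
ascii_only = re.compile(r"มาตรา\s*(?P<number>\d+)", re.ASCII)
# An explicit character class behaves the same way in both modes.
explicit = re.compile(r"มาตรา\s*(?P<number>[๐-๙]+|[0-9]+)")

for name, pattern in [("unicode", unicode_aware), ("ascii", ascii_only), ("explicit", explicit)]:
    match = pattern.search(text)
    print(name, match.group("number") if match else None)
```

Spelling out the script range, as the patterns above do, makes the intent visible and keeps behavior identical across engines.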

5.3 Numeral Normalization

The canonical form ALWAYS uses Western Arabic numerals (0-9). When a pattern captures non-Arabic numerals, the captured value MUST be converted to Western Arabic digits before it is substituted into the canonical template:

# For Chinese numerals, a conversion function is needed
regex: '第(?P<number>[一二三四五六七八九十百千零]+)条'
canonical: "article:{number}"  # {number} must be converted from Chinese numerals to Arabic

The conversion is the responsibility of the pattern matching engine, not the regex itself. The regex captures the raw text; the engine normalizes before applying the canonical template.
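
A minimal sketch of such an engine-side normalizer follows. It is an assumption about how the step could be implemented, not the engine's actual code; the names normalize_numeral and cjk_to_int are hypothetical. It relies on the fact that Python's int() accepts decimal digits from any script and that unicodedata.numeric() knows the value of circled numerals. Positional CJK forms such as 二〇二四 are not handled.

```python
import unicodedata

_CJK_DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
               "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
_CJK_UNITS = {"十": 10, "百": 100, "千": 1000}


def cjk_to_int(text: str) -> int:
    """Convert a digit-and-unit numeral such as 一百零三 (valid below 10000)."""
    total, current = 0, 0
    for ch in text:
        if ch in _CJK_DIGITS:
            current = _CJK_DIGITS[ch]
        else:
            unit = _CJK_UNITS[ch]  # KeyError for characters outside the class
            total += (current or 1) * unit  # a bare 十 counts as 1 x 10
            current = 0
    return total + current


def normalize_numeral(raw: str) -> str:
    """Return the Western Arabic form of a captured numeral."""
    if raw.isdecimal():
        # Covers 0-9 as well as Thai, Eastern Arabic, and Devanagari digits.
        return str(int(raw))
    if all(ch in _CJK_DIGITS or ch in _CJK_UNITS for ch in raw):
        return str(cjk_to_int(raw))
    if len(raw) == 1:
        # Circled numerals carry a numeric value in the Unicode database.
        return str(int(unicodedata.numeric(raw)))
    raise ValueError(f"unsupported numeral: {raw!r}")
```

With this in place, the engine would call normalize_numeral on each captured group that holds a number before formatting the canonical template.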


6. Inserted Articles

Inserted articles (가지번호, 枝番号, -bis, /N, .N, Na) are articles inserted between existing ones during legislative amendments. They have a base number and an insertion suffix.

6.1 Inserted Article Patterns by Country

| Country | Pattern | Example | Canonical |
|---|---|---|---|
| KR | 제N조의M | 제34조의2 | article:34-2 |
| JP | 第N条のM | 第15条の2 | article:15-2 |
| JP (deep) | 第N条のMのM | 第15条の2の2 | article:15-2-2 |
| DE | § Na, § Nb | § 123a | paragraph:123a |
| FR | Article N bis/ter/quater | Article 9 bis | article:9bis |
| ES | Artículo N bis | Artículo 5 bis | article:5bis |
| TH | มาตรา N/M | มาตรา 193/1 | section:193/1 |
| RU | статья N.N | статья 123.1 | article:123.1 |
| IT | Art. N-bis | Art. 5-bis | article:5bis |

6.2 Regex Pattern

Inserted article patterns MUST capture both base and suffix:

# Korean
regex: '제(?P<base>\d+)조의(?P<inserted>\d+)'
canonical: "article:{base}-{inserted}"

# Japanese (deep insertion)
regex: '第(?P<base>\d+)条の(?P<first>\d+)の(?P<second>\d+)'
canonical: "article:{base}-{first}-{second}"

# German
regex: '§\s*(?P<number>\d+)(?P<letter>[a-z])'
canonical: "paragraph:{number}{letter}"

# French
regex: 'Article\s+(?P<number>\d+)\s+(?P<suffix>bis|ter|quater|quinquies|sexies|septies|octies|nonies|decies)'
canonical: "article:{number}{suffix}"

# Thai
regex: 'มาตรา\s*(?P<base>\d+)/(?P<inserted>\d+)'
canonical: "section:{base}/{inserted}"
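
The patterns above can be exercised directly with Python's re module. The sketch below pairs each regex and canonical template from this section with its example text from Section 6.1; the helper to_canonical is illustrative and simply fills the template from the named groups.

```python
import re

# (regex, canonical template, sample text) taken from Sections 6.1 and 6.2.
INSERTED = [
    (r"제(?P<base>\d+)조의(?P<inserted>\d+)",
     "article:{base}-{inserted}", "제34조의2"),
    (r"第(?P<base>\d+)条の(?P<first>\d+)の(?P<second>\d+)",
     "article:{base}-{first}-{second}", "第15条の2の2"),
    (r"§\s*(?P<number>\d+)(?P<letter>[a-z])",
     "paragraph:{number}{letter}", "§ 123a"),
    (r"Article\s+(?P<number>\d+)\s+(?P<suffix>bis|ter|quater|quinquies|sexies|septies|octies|nonies|decies)",
     "article:{number}{suffix}", "Article 9 bis"),
    (r"มาตรา\s*(?P<base>\d+)/(?P<inserted>\d+)",
     "section:{base}/{inserted}", "มาตรา 193/1"),
]


def to_canonical(regex: str, template: str, text: str):
    """Fill the canonical template from the first match, or return None."""
    match = re.search(regex, text)
    return template.format_map(match.groupdict()) if match else None


results = [to_canonical(*row) for row in INSERTED]
for (_, _, sample), canonical in zip(INSERTED, results):
    print(sample, "->", canonical)
```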

7. Unnumbered First Paragraphs

Many legal systems omit the number for the first paragraph (항, Absatz, alinéa, 款, فقرة, วรรค). The pattern language handles this through two mechanisms:

7.1 Detection via Context

The first paragraph of an article is identified by position — it appears immediately after the article header, before any numbered paragraph marker. No regex pattern matches it directly; instead, the matching engine infers its existence.

7.2 Canonical Representation

| Scenario | Canonical Form |
|---|---|
| First paragraph (unnumbered, with subsequent numbered paragraphs) | paragraph:1 |
| Single paragraph (unnumbered, no other paragraphs) | paragraph:single |
| Numbered paragraph | paragraph:N |

7.3 Country Flags

The CountryProfile level definition uses existing flags:

- key: paragraph
  is_unnumbered_first: true    # First paragraph has no number
  is_unnumbered_when_single: true  # Single paragraph has no number

Patterns do NOT need to match unnumbered paragraphs directly. The matching engine uses these flags to assign paragraph:1 or paragraph:single when no numbered paragraph marker is found.
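
The inference can be sketched as a small function. This is one possible reading of the rule, assuming the engine receives the body of a single article and the compiled paragraph-marker pattern for the country; the function name leading_paragraph and the "(N)" marker used in the usage lines are hypothetical.

```python
import re
from typing import Optional


def leading_paragraph(body: str, marker: re.Pattern, *,
                      is_unnumbered_first: bool,
                      is_unnumbered_when_single: bool) -> Optional[str]:
    """Canonical form for unnumbered text at the start of an article body."""
    first = marker.search(body)
    if first is None:
        # No numbered marker anywhere: the article consists of one paragraph.
        if is_unnumbered_when_single and body.strip():
            return "paragraph:single"
        return None
    if is_unnumbered_first and body[:first.start()].strip():
        # Text precedes the first numbered marker: implicit first paragraph.
        return "paragraph:1"
    return None


# Hypothetical marker style "(2)", "(3)" used only for this illustration.
marker = re.compile(r"\((?P<number>\d+)\)")
flags = dict(is_unnumbered_first=True, is_unnumbered_when_single=True)
print(leading_paragraph("Opening rule. (2) Exception.", marker, **flags))
print(leading_paragraph("Only one rule.", marker, **flags))
```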


8. Order-Independent Matching

Legal text may present citations in different orders:

8.1 Article-First (Standard)

Most legal systems present the article first, then subdivisions:

제5조 제2항 제3호        (KR — article/paragraph/item)
§ 405(c)(2)(C)(ii)       (US — section/subsection/paragraph/subparagraph/clause)
Article 1240, alinéa 2    (FR — article/alinea)

8.2 Paragraph-First (Inline)

Some systems allow subdivisions to appear without the article when the article is already established:

② 3. 가.                  (KR — paragraph/item/subitem, article implied)
(2)(C)(ii)                (US — paragraph/subparagraph/clause, section implied)

8.3 Matching Strategy

The pattern matching engine MUST:

  1. Try combined patterns first (multi-level in one regex)
  2. Fall back to individual level patterns
  3. Build the canonical form by composing matched levels
  4. Handle implicit article from context (paragraph-first text assumes the current article)
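
The four steps can be sketched as follows. The Korean paragraph regex (제N항) and the combined regex are hypothetical patterns written for this illustration; only the article regex appears in this specification. The function name match_reference and the current_article parameter are likewise assumptions.

```python
import re

# Hypothetical combined pattern: article and paragraph in one regex.
COMBINED = re.compile(r"제(?P<article>\d+)조\s*제(?P<paragraph>\d+)항")
# Individual level patterns, ordered from the top of the hierarchy down.
LEVELS = [
    ("article", re.compile(r"제(?P<number>\d+)조")),
    ("paragraph", re.compile(r"제(?P<number>\d+)항")),  # hypothetical
]


def match_reference(text: str, current_article=None):
    """Return a canonical path for the first reference in text, or None."""
    # Step 1: try the combined pattern first.
    combined = COMBINED.search(text)
    if combined:
        return f"article:{combined['article']}/paragraph:{combined['paragraph']}"
    # Step 2: fall back to the individual level patterns.
    found = []
    for level, pattern in LEVELS:
        match = pattern.search(text)
        if match:
            found.append((level, match["number"]))
    if not found:
        return None
    # Step 4: paragraph-first text inherits the article from context.
    if found[0][0] != "article" and current_article is not None:
        found.insert(0, ("article", current_article))
    # Step 3: compose the canonical form from the matched levels.
    return "/".join(f"{level}:{ident}" for level, ident in found)


print(match_reference("제5조 제2항"))
print(match_reference("제2항", current_article="34"))
```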

9. Pattern Matching Engine Requirements

9.1 Processing Order

For a given text chunk:

  1. Apply combined patterns: multi-level regexes that capture several levels at once
  2. Apply level patterns: individual patterns for each level, starting with the broadest level in the hierarchy (e.g., article before paragraph)
  3. Compose canonical: assemble the multi-level canonical form from individual matches
  4. Normalize: convert non-Arabic numerals to Arabic in the canonical form
  5. Deduplicate: if a combined pattern and individual patterns match the same text span, prefer the combined pattern

9.2 Overlap Resolution

When multiple patterns match overlapping text:

  • Longest match wins
  • Combined patterns take priority over individual patterns
  • If two patterns at the same level match, the one with more specific context anchors wins
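
The first two rules can be sketched as a greedy selection over candidate matches. The ranking below treats span length as the primary criterion and the combined flag as a tie-breaker, which is one interpretation of the list above; the third rule is omitted because the specification does not define how anchor specificity is measured. The Hit type and resolve_overlaps are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    start: int          # inclusive offset of the match
    end: int            # exclusive offset of the match
    canonical: str
    combined: bool = False


def resolve_overlaps(hits: list[Hit]) -> list[Hit]:
    """Keep a non-overlapping subset: longest span first, combined on ties."""
    ranked = sorted(hits, key=lambda h: (-(h.end - h.start), not h.combined, h.start))
    kept: list[Hit] = []
    for hit in ranked:
        # Accept the hit only if it does not overlap anything already kept.
        if all(hit.end <= k.start or hit.start >= k.end for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


# In "제34조의2", the plain article pattern matches 제34조 (0-4) and the
# inserted-article pattern matches the whole string (0-6).
hits = [Hit(0, 4, "article:34"), Hit(0, 6, "article:34-2"), Hit(10, 11, "paragraph:2")]
print(resolve_overlaps(hits))
```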

9.3 Context Window

Patterns are applied to a text chunk (typically a paragraph or article). The engine MAY use the chunk's structural context (e.g., "this chunk is within Article 34") to resolve implicit references.

9.4 Performance

  • Patterns MUST be compiled once and reused
  • Pattern lists SHOULD be ordered by specificity (most specific first)
  • Combined patterns SHOULD come before individual patterns in the list

10. Integration with CountryProfile Schema

10.1 Backward Compatibility

The patterns field is OPTIONAL. Existing CountryProfile files without patterns continue to work. The LLM Judge (few-shots + validation_rules) remains the primary validation mechanism.

10.2 Relationship to Existing Fields

| Existing Field | Relationship to patterns |
|---|---|
| numbering | Human-readable description; patterns provide machine-readable regex |
| inserted_pattern | Human-readable; patterns provide regex for inserted articles |
| is_unnumbered_first | Flag; patterns handle detection via absence of numbered match |
| few_shots | LLM guidance; patterns provide deterministic extraction |
| canonical (in patterns) | Replaces what was previously implicit in reference_id |

10.3 Validation Rules Integration

The patterns field works alongside validation_rules (separate field, separate spec):

patterns           →  deterministic match: "is this a citation?"
validation_rules   →  LLM judgment: "is this citation correct?"

11. Complete Example: Korea (KR)

levels:
  - key: article
    label: "Article"
    numbering: "Arabic numeral + Article"
    description: "Basic unit of legislation"
    is_unnumbered: false
    inserted_pattern: "제N조의M"
    patterns:
      - regex: '제(?P<number>\d+)조'
        canonical: "article:{number}"
        captures:
          number: "Arabic numeral identifying the article"
        source: "Korean Law Information Center (law.go.kr)"
        description: "Standard article: 제1조, 제420조"

      - regex: '제\s*(?P<number>\d+)\s*조'
        canonical: "article:{number}"
        captures:
          number: "Arabic numeral identifying the article"
        source: "Supreme Court Comprehensive Legal Information (scourt.go.kr)"
        description: "Spaced variant: 제 5 조, 제5 조"

      - regex: '제(?P<base>\d+)조의(?P<inserted>\d+)'
        canonical: "article:{base}-{inserted}"
        captures:
          base: "Parent article number"
          inserted: "Inserted sub-article number"
        source: "Korean Law Information Center (law.go.kr)"
        description: "Inserted article: 제34조의2, 제6조의2"

  - key: paragraph
    label: "Paragraph"
    numbering: "Circled Arabic numerals: ①, ②, ③"
    description: "Subdivision of articles"
    is_unnumbered_first: true
    is_unnumbered_when_single: true
    patterns:
      - regex: '(?P<number>[①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳])'
        canonical: "paragraph:{number}"  # engine converts ①→1, ②→2, …, ⑳→20
        captures:
          number: "Circled numeral — engine converts to Arabic (①→1, ②→2, …, ⑳→20)"
        source: "Korean Law Information Center (law.go.kr) — paragraph notation standard"
        description: "Circled paragraph numeral: ①, ②, ③"

12. Complete Example: United States (US)

levels:
  - key: section
    label: section
    numbering: "§ N"
    description: "Primary unit of the United States Code"
    patterns:
      - regex: '§\s*(?P<number>\d+[\w-]*)'
        canonical: "section:{number}"
        captures:
          number: "Section number (may include letters or hyphens)"
        source: "Office of the Law Revision Counsel (uscode.house.gov)"
        description: "Section with § symbol: § 78j, § 405"

  - key: subsection
    label: subsection
    numbering: "(a), (b), (c)"
    description: "First subdivision in USC"
    patterns:
      - regex: '\((?P<letter>[a-z])\)'
        canonical: "subsection:{letter}"
        captures:
          letter: "Lowercase letter"
        source: "Office of the Law Revision Counsel (uscode.house.gov)"
        description: "Lowercase letter in parentheses: (a), (b), (c)"

  - key: paragraph
    label: paragraph
    numbering: "(1), (2), (3)"
    description: "Second subdivision in USC"
    patterns:
      - regex: '\((?P<number>\d+)\)'
        canonical: "paragraph:{number}"
        captures:
          number: "Arabic numeral"
        source: "Office of the Law Revision Counsel (uscode.house.gov)"
        description: "Arabic numeral in parentheses: (1), (2), (3)"

  - key: clause
    label: clause
    numbering: "(i), (ii), (iii)"
    description: "Fourth subdivision in USC"
    patterns:
      - regex: '\((?P<numeral>x{1,2}(?:ix|iv|v?i{0,3})|ix|iv|vi{0,3}|i{1,3})\)'
        canonical: "clause:{numeral}"
        captures:
          numeral: "Lowercase Roman numeral (i through xxix)"
        source: "Office of the Law Revision Counsel (uscode.house.gov)"
        description: "Lowercase Roman numeral: (i), (ii), ..., (xix), (xx), (xxiii)"
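
Because "(i)", "(v)", and "(x)" satisfy both the subsection and the clause pattern, the level patterns alone cannot tell the two apart. One way to resolve this, sketched below, is to assign parenthesized groups by position after the section number, following the subdivision order given for the US in Section 8.1. The subparagraph class, the simplified clause class, and the function name us_canonical are assumptions of this sketch.

```python
import re

SECTION = re.compile(r"§\s*(?P<number>\d+[\w-]*)")
# Subdivisions in hierarchy order; each must follow the previous one directly.
SUBLEVELS = [
    ("subsection", re.compile(r"\((?P<value>[a-z])\)")),
    ("paragraph", re.compile(r"\((?P<value>\d+)\)")),
    ("subparagraph", re.compile(r"\((?P<value>[A-Z])\)")),  # assumed pattern
    ("clause", re.compile(r"\((?P<value>[ivx]+)\)")),       # simplified
]


def us_canonical(text: str):
    """Compose a canonical path for the first § reference in text, or None."""
    section = SECTION.search(text)
    if section is None:
        return None
    parts = [f"section:{section['number']}"]
    pos = section.end()
    for level, pattern in SUBLEVELS:
        group = pattern.match(text, pos)  # anchored at the current position
        if group is None:
            break  # the chain ends at the first level that does not fit
        parts.append(f"{level}:{group['value']}")
        pos = group.end()
    return "/".join(parts)


print(us_canonical("§ 405(c)(2)"))
print(us_canonical("§ 405(c)(2)(C)(ii)"))
```

Positional assignment only works when the reference starts at the section; inline forms such as (2)(C)(ii) still need the structural context described in Section 9.3.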

13. Design Principles

  1. Regex first, LLM second. Use patterns for what regex can reliably detect (article numbers, section markers). Use LLM Judge for semantic validation (is the citation contextually correct?).

  2. Canonical over raw. The canonical form is the system of record. Multiple regex patterns may map to the same canonical form. 제5조, 제 5 조, and 제5 조 all normalize to article:5.

  3. Source everything. Every pattern MUST cite its authoritative source. This is a legal citation system — provenance matters.

  4. Script-aware, not script-specific. The pattern language uses Unicode-aware regex. Country-specific patterns handle local scripts; the canonical grammar is script-agnostic.

  5. Backward compatible. Adding patterns to an existing CountryProfile is purely additive. No existing fields change meaning.

  6. Human-readable. YAML, not compiled regex. The description and captures fields ensure patterns are understandable without running them.


14. Future Extensions

| Extension | Description | Status |
|---|---|---|
| validation_rules | LLM Judge rules for contextual validation | Design in progress |
| normalization_map | Explicit numeral conversion tables (Thai→Arabic, etc.) | Planned |
| anti_patterns | Negative patterns to exclude false positives | Planned |
| compound_patterns | Patterns for multi-article references (§§ 405-407) | Partially covered |
| case_law_patterns | Patterns for court decision citations | Country-specific |
| confidence | Per-pattern confidence score (0.0-1.0) | Planned |

15. Appendix: Regex Quick Reference

Pattern Meaning
(?P<name>...) Named capturing group
(?:...) Non-capturing group
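\d Digit ([0-9] in RE2 or with re.ASCII; any Unicode decimal digit in Python 3 str patterns by default)
\w Word character ([a-zA-Z0-9_] in RE2 or with re.ASCII; Unicode word characters in Python 3 str patterns by default)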
\s Whitespace
[abc] Character class
[a-z] Range in character class
+ One or more
* Zero or more
? Zero or one
{n,m} Between n and m repetitions
| Alternation
^ Start of string/line
$ End of string/line