Validation Rules Specification

1. Purpose

The validation_rules field defines contextual rules for LLM-based validation of citation extractions from legal text. It is the sole validation mechanism for CountryProfile YAML files.

The fields form a layered pipeline:

Layer	Field	Mechanism	Role
1	`patterns`	Regex	Extraction only — finding citation tokens in text
2	`validation_rules`	LLM Judge	Validation — judging if extraction is correct
3	`few_shots`	Prompt examples	Guidance for the extraction LLM

⚠️ Patterns are for extraction, not validation.

Pattern matching is not used for validation. 32% of few-shots cannot be matched by patterns (text-only references, number-only references, format mismatches). See docs/pattern-unmatchable-cases.md for details.

Validation rules catch contextual errors (correct format but wrong meaning, ambiguous references, missing context). The LLM Judge applies these rules as the sole validation mechanism.

2. Schema

2.1 Top-Level Placement

country_code: XX
language: xx
name: Country Name
legal_system: ...
citation_mode: ...
numeral_system: ...

levels: [...]
document_types: {...}
cross_references: {...}
sources: [...]

validation_rules:    # ← NEW FIELD, sibling of levels/sources/few_shots
  - id: rule_name
    description: ...
    depends_on: [...]
    pattern_match_rule: ...
    context_rule: ...
    examples: [...]
    source: ...

few_shots: [...]
metadata: {...}

2.2 Rule Object Schema

Each entry in the validation_rules array is a rule object with the following fields:

Field	Type	Required	Description
`id`	string	yes	Machine-readable identifier. Snake_case. Unique within the profile.
`description`	string	yes	Human-readable explanation of what the rule validates and why.
`depends_on`	list[string]	yes	CountryProfile field paths this rule relies on (e.g., `levels[key=paragraph].is_unnumbered_when_single`). Enables traceability and change-impact analysis.
`pattern_match_rule`	string	yes	Describes what deterministic (regex-based) checking is possible and sufficient for this rule. May say "no regex applies" for purely semantic rules.
`context_rule`	string	yes	Describes when LLM judgment is required — the cases where pattern matching is insufficient or ambiguous.
`examples`	list[object]	yes	Pass/fail examples demonstrating the rule in action. See §2.3.
`source`	string	yes	Cites the authority for the rule (official database, legal guide, style manual). Must reference a specific `sources[]` entry or external authority.
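A minimal sketch of how the required fields in the table above might be checked when a profile is loaded. The function names `check_rule` and `duplicate_ids`, and the exact regex used for "snake_case", are illustrative assumptions and not part of this specification.

```python
import re

# Required fields and the Python type each is expected to load as.
REQUIRED_FIELDS = {
    "id": str,
    "description": str,
    "depends_on": list,
    "pattern_match_rule": str,
    "context_rule": str,
    "examples": list,
    "source": str,
}

# Assumed reading of "snake_case": lowercase words joined by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_rule(rule: dict) -> list[str]:
    """Return schema problems for one rule object; an empty list means well-formed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in rule:
            problems.append(f"missing field: {field}")
        elif not isinstance(rule[field], expected_type):
            problems.append(f"wrong type for {field}")
    if isinstance(rule.get("id"), str) and not SNAKE_CASE.match(rule["id"]):
        problems.append("id is not snake_case")
    return problems


def duplicate_ids(rules: list[dict]) -> set[str]:
    """Return rule ids that occur more than once within a single profile."""
    seen, dupes = set(), set()
    for rule in rules:
        rule_id = rule.get("id")
        if rule_id in seen:
            dupes.add(rule_id)
        seen.add(rule_id)
    return dupes
```

Returning a list of problems instead of raising on the first one lets a reviewer see every schema issue in a profile in one pass.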

2.3 Example Object Schema

Each entry in the examples array has one of two shapes:

Pass example:

- pass:
    chunk: "raw legislative text..."
    reference_id: "extracted reference"
    reason: "why this is correct"

Fail example:

- fail:
    chunk: "raw legislative text..."
    reference_id: "extracted reference"
    reason: "why this is incorrect"

Both shapes may include optional fields:

query: the user query that produced the extraction
relation: the expected relation (direct, indirect, unrelated)
section_name: the expected section name
expected: list of expected extractions (for empty-result cases)
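The two example shapes and their optional fields can be checked with a sketch like the following. The helper name `check_example` is hypothetical, and rejecting unknown fields is an assumption that this section does not state.

```python
REQUIRED_KEYS = {"chunk", "reference_id", "reason"}
OPTIONAL_KEYS = {"query", "relation", "section_name", "expected"}
RELATIONS = {"direct", "indirect", "unrelated"}


def check_example(example: dict) -> list[str]:
    """Return shape problems for one entry of a rule's `examples` array."""
    # An entry must be a single-key mapping whose key is "pass" or "fail".
    if len(example) != 1 or next(iter(example)) not in ("pass", "fail"):
        return ["example must have exactly one key: 'pass' or 'fail'"]
    body = next(iter(example.values()))
    if not isinstance(body, dict):
        return ["example body must be a mapping"]
    keys = set(body)
    problems = [f"missing field: {k}" for k in sorted(REQUIRED_KEYS - keys)]
    # Assumption: fields outside the documented set are treated as errors.
    problems += [f"unknown field: {k}" for k in sorted(keys - REQUIRED_KEYS - OPTIONAL_KEYS)]
    if "relation" in body and body["relation"] not in RELATIONS:
        problems.append("relation must be direct, indirect, or unrelated")
    return problems
```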

2.4 `depends_on` Field Path Conventions

The depends_on field uses dot-separated paths into the CountryProfile YAML structure, with an optional bracket selector on a segment to pick one list entry or map key. Supported path patterns:

Pattern	Meaning
`levels`	All level definitions
`levels[key=paragraph]`	The level with `key: paragraph`
`levels[key=paragraph].is_unnumbered_when_single`	Specific field on a specific level
`cross_references.abbreviations`	The abbreviations map
`cross_references.connectors`	The connectors list
`document_types[USC].notes`	Specific field on a specific document type
`few_shots`	The few-shots array
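As an illustration of the path patterns in the table above, here is a small resolver sketch operating on a profile already loaded into a Python dict. The function name `resolve` is hypothetical. The sketch assumes selector values contain no dots, and it does not handle wildcard selectors such as `key=*`.

```python
import re

# One path segment: a field name, optionally followed by a [selector].
SEGMENT = re.compile(r"^(?P<name>[A-Za-z_]+)(?:\[(?P<sel>[^\]]+)\])?$")


def resolve(profile: dict, path: str):
    """Resolve a depends_on path against a profile; raise KeyError if it is absent."""
    node = profile
    for segment in path.split("."):
        match = SEGMENT.match(segment)
        if match is None:
            raise KeyError(f"malformed segment: {segment}")
        node = node[match["name"]]
        selector = match["sel"]
        if selector is None:
            continue
        if "=" in selector:
            # levels[key=paragraph]: pick the list entry whose field equals the value.
            field, value = selector.split("=", 1)
            entries = [item for item in node if item.get(field) == value]
            if not entries:
                raise KeyError(segment)
            node = entries[0]
        else:
            # document_types[USC]: plain mapping lookup.
            node = node[selector]
    return node
```

A resolver of this kind also gives a cheap consistency check: a rule whose depends_on path raises KeyError refers to a field the profile does not define.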

3. Rule Categories

Rules are organized into the following categories. Every profile SHOULD include at least one rule from each applicable category.

3.1 Reference ID Verification (Universal)

Applies to: All profiles. Category IDs: ref_id_verbatim_substring, ref_id_component_verification

The foundational rule: the reference_id in an extraction must be traceable to verbatim text in the chunk. However, what "verbatim" means depends on the country's citation structure.

Flat Citation Systems (most countries)

For countries where citations are contiguous strings that appear literally in the text (KR 제53조, JP 第九十条, DE § 242 BGB, FR Article 1240), the full reference_id IS a literal substring of the chunk. Verification is a simple, case-sensitive substring check: ref_id in chunk.

Pattern match rule: ref_id in chunk.

Context rule: Near-matches where normalisation differs (spacing, trailing punctuation).
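A minimal sketch of the flat check. The three-way result and the specific normalisation (dropping whitespace and trailing punctuation) are assumptions made for illustration; the names `normalise` and `check_flat` are hypothetical.

```python
import re


def normalise(text: str) -> str:
    """Drop all whitespace and any trailing punctuation (an assumed normalisation)."""
    return re.sub(r"\s+", "", text).rstrip(".,;:")


def check_flat(reference_id: str, chunk: str) -> str:
    """Return 'pass', 'needs_judge', or 'fail' for a flat citation system."""
    if reference_id in chunk:
        return "pass"  # literal, case-sensitive substring
    if normalise(reference_id) in re.sub(r"\s+", "", chunk):
        return "needs_judge"  # near-match: left to the context rule
    return "fail"
```

For example, `check_flat("제53조", "제 53 조 (목적)")` is not a literal match but matches once spacing is ignored, so it is routed to the judge instead of being rejected outright.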

Component-Level Citation Systems (US, and similar assembled-citation countries)

For countries where citations are assembled from components that appear at different positions in the raw text, the full reference_id is never a literal substring. Even a "simple" citation like § 78j(b) is assembled: the chunk has § 78j. on one line and (b) on another. The string § 78j(b) does not appear contiguously.

This applies to ALL assembled citations in these systems, not just deeply nested ones. The verification must decompose the reference_id into its component tokens and verify each one appears in the chunk in the correct hierarchical order.

Pattern match rule: Decompose the reference_id using level-specific regexes (section number, subdivision designators). Verify each component is a substring of the chunk. Components must appear in hierarchical order. The regex must match Roman numerals (I), (II) before single uppercase letters (A), (B) to avoid misclassifying subclause-level designators as subparagraph-level.

Context rule: The LLM judge handles edge cases:

Spacing normalisation (e.g., § 405. with period vs § 405 without)
Component from wrong article (e.g., (d) belongs to § 406, not § 405)
Ambiguous designators at deep levels (e.g., (i) could be depth 3 or 6 in CFR)
Missing section header in chunk

CountryProfile dependency: levels (all numbering formats), document_types (structure depth).
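The decomposition described in this subsection can be sketched as follows. The regexes are simplified stand-ins for the level-specific patterns a real profile would define, and the names `classify_upper` and `verify_components` are hypothetical. The sketch only checks presence and order; it cannot tell whether a designator belongs to a different section in the same chunk, which stays with the LLM judge.

```python
import re

# Simplified stand-ins for level-specific regexes (assumptions, not profile values).
SECTION = re.compile(r"§\s*(\d+[a-z]*)")
DESIGNATOR = re.compile(r"\(([A-Za-z0-9]+)\)")


def classify_upper(token: str) -> str:
    """Classify an uppercase designator, testing Roman numerals before letters."""
    if re.fullmatch(r"[IVX]+", token):
        return "upper_roman"  # checked first, so (I) and (II) are not read as letters
    if re.fullmatch(r"[A-Z]", token):
        return "upper_letter"
    return "other"


def verify_components(reference_id: str, chunk: str) -> bool:
    """Check that the section and each designator occur in the chunk, in order."""
    section = SECTION.search(reference_id)
    if section is None:
        return False
    # The section header may carry a trailing period in the chunk ("§ 78j.").
    header = re.search(r"§\s*" + re.escape(section.group(1)) + r"(?![\w-])", chunk)
    if header is None:
        return False
    cursor = header.end()
    for designator in DESIGNATOR.finditer(reference_id, section.end()):
        found = chunk.find(designator.group(0), cursor)
        if found == -1:
            return False
        cursor = found + len(designator.group(0))
    return True
```

Advancing a cursor through the chunk is what enforces hierarchical order: each designator must be found after the previous one.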

3.2 Citation Abbreviation Rejection (Universal)

Applies to: All profiles. Category IDs: no_citation_abbreviations, accepted_abbreviations

Raw legislative text usually uses full forms, so citation abbreviations (art., §, 조) may not match it. The rule validates that the reference_id uses the form found in the chunk, not the citation-abbreviation form.

Pattern match rule: Compare reference_id law-name tokens against cross_references.abbreviations map.

Context rule: Accept abbreviation only if the chunk itself uses it.

CountryProfile dependency: levels (numbering formats), cross_references.abbreviations.
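A rough sketch of the abbreviation comparison. It assumes cross_references.abbreviations loads as a mapping from abbreviation to full form, which this section does not specify, and it uses plain substring matching where a real implementation would tokenise law names. The function name `unsupported_abbreviations` is hypothetical.

```python
def unsupported_abbreviations(
    reference_id: str, chunk: str, abbreviations: dict[str, str]
) -> list[str]:
    """Return abbreviations used in reference_id that the chunk itself never uses."""
    return [
        abbr
        for abbr in abbreviations
        if abbr in reference_id and abbr not in chunk
    ]
```

An empty result means every abbreviation in the reference_id is also present in the chunk, so the extraction is acceptable under this rule.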

3.3 Unnumbered First Paragraph (Civil Law)

Applies to: Profiles where is_unnumbered_first: true or is_unnumbered_when_single: true on any level. This covers KR, JP, DE, FR, EG, TH, NL, CN, SA, MX, GR, ID, AR, SE, ES, PL, TR, UA, IL (19 of 36 countries). It does NOT apply to the common law systems US, AU, NZ, CA, GB.

Category IDs: unnumbered_first_paragraph, unnumbered_single_paragraph

When a level has is_unnumbered_first: true, the first element at that level has no number in the raw text. The LLM must not extract a non-existent number. When is_unnumbered_when_single: true, the number is omitted only when the parent element contains a single child at that level.

Pattern match rule: Detect absence of the level's numbering pattern after the parent element header.

Context rule: Distinguish true unnumbered status from chunk truncation.

CountryProfile dependency: levels[key=*].is_unnumbered_first, levels[key=*].is_unnumbered_when_single, levels[key=*].unnumbered_note.

3.4 Inserted Article Pattern (Civil Law)

Applies to: Profiles with inserted_pattern on the article level: KR (의M), JP (のM), DE (Na/Nb), FR (bis/ter/quater), IT (-bis), ES (-bis), RU/UA (.M), PT (.º).

Category ID: inserted_article_pattern

Inserted articles use special suffixes to denote placement between existing articles. The suffix is part of the official numbering.

Pattern match rule: Country-specific regex for the inserted pattern.

Context rule: Spacing variants around the inserted suffix.

CountryProfile dependency: levels[key=article].inserted_pattern.

3.5 Relative References (Universal)

Applies to: All profiles. Category ID: relative_references

Legal text frequently uses relative references to refer to the same law (같은 법, this section, ledit article), adjacent articles (前条, 前項, the preceding section), or previously mentioned statutes (위 법률, the said Act). US statutes use 'such subsection', 'the preceding section', 'this paragraph'.

Pattern match rule: Detect relative-reference phrases from cross_references.subsequent_ref, cross_references.internal_ref, and country-specific relative terms.

Context rule: Determine whether a relative reference is direct (names a specific article) or indirect (anaphoric back-reference).

CountryProfile dependency: cross_references.subsequent_ref, cross_references.internal_ref.

3.6 Cross-Document References (Universal)

Applies to: All profiles. Category ID: cross_document_references

When a chunk references another law by name, the reference_id must include both the law name and the article. The law name must match the form used in the chunk (full name vs. abbreviation, with or without official quote marks).

Pattern match rule: Detect law-name patterns (e.g., 「...」, 42 U.S.C., Code civil) followed by article numbering.

Context rule: Determine relation type: is the cross-document reference the primary subject or a passing citation?

CountryProfile dependency: cross_references.quote_marks, cross_references.full_citation, document_types.

3.7 Text-Only References (Universal)

Applies to: All profiles. Category ID: text_only_references

Some chunks contain substantive legal text without any numbered provisions — preambles, recitals, narrative descriptions, statutory notes. When no numbering patterns match, the extraction should return empty.

Pattern match rule: Absence check. If none of the level-numbering regexes match, flag as text-only.

Context rule: Distinguish genuinely unnumbered text from truncated chunks or unusual formats.

CountryProfile dependency: levels (all numbering patterns).
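The absence check reduces to "no level-numbering regex matches". In the sketch below, the function name `is_text_only` is hypothetical, and the patterns in the usage note are placeholders and not taken from any actual profile.

```python
import re


def is_text_only(chunk: str, level_patterns: list[str]) -> bool:
    """Return True when none of the level-numbering regexes match the chunk."""
    return not any(re.search(pattern, chunk) for pattern in level_patterns)
```

With placeholder patterns such as `[r"제\d+조", r"§\s*\d+", r"Article\s+\d+"]`, a preamble sentence with no numbering is flagged as text-only, while a chunk beginning with 제53조 is not.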

3.8 Relation Classification (Universal)

Applies to: All profiles. Category ID: relation_classification

The relation field (direct/indirect/unrelated) requires semantic understanding of the query-chunk relationship. No regex can determine this.

Pattern match rule: None (purely semantic).

Context rule: The LLM judge must assess whether the chunk directly answers the query, provides supporting information, or is irrelevant.

CountryProfile dependency: few_shots (demonstrates expected relation patterns).

4. Country-Specific Rule Categories

Beyond the universal categories above, some countries need specialised rules:

4.1 KR-Specific: Item/Sub-Item Numbering Format

Korean law uses specific numbering for 호 (Arabic + period: 1.) and 목 (Korean alphabet + period: 가.). The period is mandatory; 1 without a period is ambiguous.

CountryProfile dependency: levels[key=item].numbering, levels[key=sub-item].numbering.

4.2 US-Specific: USC vs CFR Naming Divergence

USC and CFR use different names for the same positional levels. The LLM must not mix them.

CountryProfile dependency: document_types[USC].notes, document_types[CFR].notes, levels.

4.3 US-Specific: CFR Cycling Pattern

CFR subdivisions cycle after 4 levels: (a)(1)(i)(A)(1)(i). The (1) and (i) designators reappear at depths 5 and 6, so the same designator can denote two different depths, and depth tracking is required.

CountryProfile dependency: document_types[CFR].notes.
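Depth tracking can be done by position: each parenthesised designator in a citation takes the next depth, and its shape is checked against the kind expected there. The sketch below follows the sequence (a)(1)(i)(A)(1)(i) quoted above; the names `KIND_BY_DEPTH` and `assign_depths` are illustrative, not profile fields.

```python
import re

# Designator kind expected at depths 1..6, following (a)(1)(i)(A)(1)(i).
KIND_BY_DEPTH = [
    "lower_letter", "number", "lower_roman",
    "upper_letter", "number", "lower_roman",
]

SHAPES = {
    "lower_letter": re.compile(r"[a-z]+"),
    "number": re.compile(r"\d+"),
    "lower_roman": re.compile(r"[ivxlc]+"),
    "upper_letter": re.compile(r"[A-Z]+"),
}


def assign_depths(reference_id: str) -> list[tuple[str, int]]:
    """Pair each parenthesised designator with its 1-based depth, validating its shape."""
    tokens = re.findall(r"\(([A-Za-z0-9]+)\)", reference_id)
    if len(tokens) > len(KIND_BY_DEPTH):
        raise ValueError("more designators than known CFR depths")
    result = []
    for depth, token in enumerate(tokens, start=1):
        kind = KIND_BY_DEPTH[depth - 1]
        if not SHAPES[kind].fullmatch(token):
            raise ValueError(f"({token}) is not a valid depth-{depth} designator")
        result.append((token, depth))
    return result
```

Because depth comes from position and not from the designator's appearance, the two occurrences of (1) and of (i) in a six-level citation are assigned different depths.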

4.4 US-Specific: No Unnumbered First Subdivision

Unlike civil law systems, US statutes always explicitly number every subdivision. The LLM should never assume an unnumbered first element.

CountryProfile dependency: levels (negative check — no is_unnumbered_first).

4.5 US-Specific: Hierarchical Component Assembly

USC citations are assembled from components that appear at different positions in the chunk. Even a citation with only one subdivision (e.g., § 78j(b)) is not a literal substring — the section header and the subdivision designator appear on separate lines. ALL US citations require component-level verification, not just deeply nested ones. The component regex must match uppercase Roman numerals (I), (II) before single uppercase letters (A), (B) to avoid misclassifying subclause-level designators.

CountryProfile dependency: levels (all 8 level numbering patterns), document_types[USC].structure.

4.6 JP/FR-Specific: Kanji/Roman Numeral vs Arabic

JP uses kanji numerals (一, 二, 三) for items in formal text but Arabic (1, 2, 3) in casual citations. FR uses Roman numerals (I., II., III.) for paragraphs. The reference_id must match the numeral system used in the chunk.

CountryProfile dependency: numeral_system, levels[key=*].numbering.

5. Integration with Existing Fields

5.1 Relationship to `patterns`

Aspect	`patterns`	`validation_rules`
Mechanism	Regex	LLM + regex hybrid
Scope	Structural token matching	Contextual/semantic validation
Output	Match/no-match with captures	Pass/fail with explanation
Determinism	Fully deterministic	Partially deterministic

patterns handles the "does this token look like a valid citation component?" question. validation_rules handles the "is this citation component correctly applied in this context?" question.

A pattern_match_rule within a validation rule can reference the same regexes defined in the patterns field. When it does, it should cite the pattern by ID.

5.2 Relationship to `few_shots`

few_shots are prompt guidance for the extraction LLM — they teach the model what good extractions look like. validation_rules are evaluation criteria for the validation LLM — they teach the judge what to accept/reject.

The examples in validation rules serve a different purpose than few_shots: they demonstrate error cases (fail examples) alongside correct cases, whereas few_shots only show correct extractions.

5.3 Relationship to `cross_references`

The cross_references field defines the citation CONVENTIONS of a country (what forms exist). The validation_rules field defines how to VALIDATE those conventions when they appear in extraction output.

For example:

cross_references.subsequent_ref says "같은 법 / 위 법률" exists as a form
validation_rules[id=relative_references] says "when you see '같은 법 제5조', it's verifiable; when you see '위 법률' alone, the LLM judge must determine what it refers to"

6. Implementation Guide

6.1 LLM Judge Prompt Template

The validation rules are consumed by an LLM judge. The prompt should be structured as:

You are validating legal citation extractions for {country_name} ({country_code}).

VALIDATION RULES:
{for each rule: id, description, context_rule}

TEST CASES:
{for each extraction to validate: query, chunk, reference_id, relation, section_name}

For each test case, evaluate against each applicable rule.
Return a JSON array with pass/fail per rule and a final verdict.
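One possible way to fill the template, shown as a sketch. The function name `build_judge_prompt`, the bullet layout of the rules section, and the choice to serialise each test case as one JSON line are assumptions; only the fixed wording comes from the template above.

```python
import json

# Fields listed for each test case in the template.
CASE_FIELDS = ("query", "chunk", "reference_id", "relation", "section_name")


def build_judge_prompt(profile: dict, rules: list[dict], cases: list[dict]) -> str:
    """Render the judge prompt from a profile, its rules, and the extractions to check."""
    lines = [
        "You are validating legal citation extractions for "
        f"{profile['name']} ({profile['country_code']}).",
        "",
        "VALIDATION RULES:",
    ]
    for rule in rules:
        lines.append(f"- id: {rule['id']}")
        lines.append(f"  description: {rule['description']}")
        lines.append(f"  context_rule: {rule['context_rule']}")
    lines += ["", "TEST CASES:"]
    for case in cases:
        # Missing optional fields are rendered as null so every case has the same keys.
        subset = {field: case.get(field) for field in CASE_FIELDS}
        lines.append(json.dumps(subset, ensure_ascii=False))
    lines += [
        "",
        "For each test case, evaluate against each applicable rule.",
        "Return a JSON array with pass/fail per rule and a final verdict.",
    ]
    return "\n".join(lines)
```

`ensure_ascii=False` keeps non-Latin citations such as 제53조 readable in the prompt instead of escaping them.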

6.2 Two-Phase Validation

Phase 1 (Pattern): Apply the pattern_match_rule deterministically. If the regex check fails, the extraction fails without LLM invocation. This is cheap and fast.
Phase 2 (Context): Only for cases that pass Phase 1 but need contextual judgment. Invoke the LLM judge with the context_rule and relevant examples. This is expensive but necessary for ambiguous cases.
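The two phases can be sketched as a small dispatcher. All three callables are placeholders supplied by the caller: `pattern_check` stands for the deterministic check, `needs_judgment` for whatever test decides that a case is ambiguous, and `judge` for the LLM call. None of these names come from this specification.

```python
from typing import Callable


def two_phase_validate(
    extraction: dict,
    rule: dict,
    pattern_check: Callable[[dict], bool],
    needs_judgment: Callable[[dict], bool],
    judge: Callable[[dict, dict], bool],
) -> dict:
    """Run the cheap deterministic check first; call the judge only when needed."""
    # Phase 1: a failed pattern check fails the extraction with no LLM invocation.
    if not pattern_check(extraction):
        return {"rule": rule["id"], "verdict": "fail", "phase": 1}
    if not needs_judgment(extraction):
        return {"rule": rule["id"], "verdict": "pass", "phase": 1}
    # Phase 2: contextual judgment for cases that passed Phase 1 but remain ambiguous.
    verdict = "pass" if judge(extraction, rule) else "fail"
    return {"rule": rule["id"], "verdict": verdict, "phase": 2}
```

Recording the phase in the result makes it easy to measure how often the expensive path is taken.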

6.3 Rule Selection Per Profile

Not all rules apply to every country. The judge should select applicable rules based on:

Universal rules (categories 3.1, 3.2, and 3.5–3.8): apply to all profiles.
Civil-law rules (categories 3.3 and 3.4) and country-specific rules (section 4): apply only when the profile has the relevant depends_on fields.
Negative rules (e.g., US Rule 9: no unnumbered first): apply as guardrails for systems that should NOT exhibit certain patterns.

6.4 `depends_on` Change Impact

When a CountryProfile field referenced by depends_on changes, all rules that depend on it MUST be reviewed. The depends_on field enables automated change-impact analysis:

If levels[key=paragraph].is_unnumbered_when_single changes:
  → Review: unnumbered_single_paragraph (KR Rule 3)
  → Review: no_unnumbered_first (US Rule 9)
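A sketch of how such a review list might be computed for the rules of a single profile. The helper names are hypothetical. Treating a dependency on a parent path (for example `levels`) as affected by a change to any child path, and vice versa, is an assumption about how impact should propagate.

```python
def _related(dependency: str, changed: str) -> bool:
    """True when one path equals the other or is an ancestor of it."""
    if dependency == changed:
        return True
    for parent, child in ((dependency, changed), (changed, dependency)):
        if child.startswith(parent + ".") or child.startswith(parent + "["):
            return True
    return False


def rules_to_review(rules: list[dict], changed_path: str) -> list[str]:
    """Return the sorted ids of rules whose depends_on touches the changed path."""
    return sorted(
        rule["id"]
        for rule in rules
        if any(_related(dep, changed_path) for dep in rule["depends_on"])
    )
```

Matching on path prefixes means a rule that depends on all of `levels` is also flagged when a single level field changes.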

7. Design Principles

Traceability. Every rule cites its source authority and lists the CountryProfile fields it depends on. A reviewer can always answer "why does this rule exist?" and "what happens if the profile changes?"
Pattern-first. The pattern_match_rule should be as deterministic as possible. The context_rule should be reserved for cases that genuinely require semantic understanding. If a regex can do the job, don't invoke the LLM.
Fail-visible. Every rule includes fail examples. A rule without fail examples is incomplete — it tells the judge what to accept but not what to reject.
Country-grounded. Rules reference specific CountryProfile fields (e.g., levels[key=paragraph].is_unnumbered_when_single), not abstract principles. The rule is a property of the country's citation system, not a universal axiom.
Non-overlapping. Each rule covers a distinct validation concern. If two rules overlap, they should be merged or one should reference the other.

8. Appendix: Cross-Country Rule Coverage

The table uses actual rule id values from each country's YAML. A checkmark (✓) means the rule exists in that country's validation_rules array; a dash (—) means it does not apply or is not defined.

Rule Category	KR rule id	US rule id	JP	DE	FR	Notes
Reference ID verification	`ref_id_verbatim_substring`	`ref_id_component_verification`	✓	✓	✓	KR/JP/DE/FR: flat substring. US: component-level for ALL citations.
Citation abbreviation rejection	`no_citation_abbreviations`	—	✓	✓	✓	Universal concept; US omitted (§ is unambiguous in ASCII).
Unnumbered first paragraph	`unnumbered_single_paragraph`	`no_unnumbered_first`	✓	✓	✓	KR: single paragraph; JP: implicit first paragraph; DE: Absatz 1; FR: alinéa. US rule is negative guard.
Inserted article pattern	`inserted_article_pattern`	—	✓	✓	✓	KR: 의M; JP: のM; DE: Na/Nb; FR: bis/ter/quater.
Relative references	`relative_references`	`relative_references`	✓	✓	✓	Universal. KR: 같은 법/이 법/위 법률. US: such subsection/preceding section.
Cross-document references	`cross_document_references`	`cross_document_references`	✓	✓	✓	KR: 「」 quotation marks. US: Act names, title citations, 'this Act'.
Text-only references	`text_only_references`	`text_only_references`	✓	✓	✓	Universal. Chunks with no numbered provisions → empty expected.
Relation classification	`relation_classification`	`relation_classification`	✓	✓	✓	Universal. Semantic judgment, no regex.
Item numbering format	`item_numbering_format`	—	—	—	—	KR-specific (호: N. with mandatory period).
Sub-item numbering format	`sub_item_numbering_format`	—	—	—	—	KR-specific (목: 가. with mandatory period).
USC/CFR naming divergence	—	`usc_cfr_naming_divergence`	—	—	—	US-specific.
Nested subdivision assembly	—	`nested_subdivision_assembly`	—	—	—	US-specific (8 levels deep).
CFR cycling pattern	—	`cfr_cycling_pattern`	—	—	—	US-specific (a)(1)(i)(A)(1)(i).
Public Law / Executive Order	—	`public_law_executive_order_format`	—	—	—	US-specific.
Section number must be present	—	`section_number_present`	—	—	—	US-specific (component verification guard).
