Validation Rules Specification
1. Purpose
The validation_rules field defines contextual rules for LLM-based validation of citation extractions from legal text. It is the only validation mechanism defined in CountryProfile YAML files.
The fields form a layered pipeline:
| Layer | Field | Mechanism | Role |
|---|---|---|---|
| 1 | patterns | Regex | Extraction only: finding citation tokens in text |
| 2 | validation_rules | LLM Judge | Validation: judging if extraction is correct |
| 3 | few_shots | Prompt examples | Guidance for the extraction LLM |
⚠️ Patterns are for extraction, not validation.
The patterns field is not used for validation: 32% of few-shots cannot be matched by patterns (text-only references, number-only references, format mismatches). See docs/pattern-unmatchable-cases.md for details.
Validation rules catch contextual errors (correct format but wrong meaning, ambiguous references, missing context). The LLM Judge applies these rules as the sole validation mechanism.
2. Schema
2.1 Top-Level Placement
    country_code: XX
    language: xx
    name: Country Name
    legal_system: ...
    citation_mode: ...
    numeral_system: ...
    levels: [...]
    document_types: {...}
    cross_references: {...}
    sources: [...]
    validation_rules:  # ← NEW FIELD, sibling of levels/sources/few_shots
      - id: rule_name
        description: ...
        depends_on: [...]
        pattern_match_rule: ...
        context_rule: ...
        examples: [...]
        source: ...
    few_shots: [...]
    metadata: {...}
2.2 Rule Object Schema
Each entry in the validation_rules array is a rule object with the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | yes | Machine-readable identifier. Snake_case. Unique within the profile. |
| description | string | yes | Human-readable explanation of what the rule validates and why. |
| depends_on | list[string] | yes | CountryProfile field paths this rule relies on (e.g., levels[key=paragraph].is_unnumbered_when_single). Enables traceability and change-impact analysis. |
| pattern_match_rule | string | yes | Describes what deterministic (regex-based) checking is possible and sufficient for this rule. May say "no regex applies" for purely semantic rules. |
| context_rule | string | yes | Describes when LLM judgment is required: the cases where pattern matching is insufficient or ambiguous. |
| examples | list[object] | yes | Pass/fail examples demonstrating the rule in action. See §2.3. |
| source | string | yes | Cites the authority for the rule (official database, legal guide, style manual). Must reference a specific sources[] entry or external authority. |
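The required-field table can be turned into a small structural check. The following is a minimal sketch, not a definitive loader: the function name `check_rule_object` and the sample rule are hypothetical, while the field names, types, and the snake_case requirement come from the table above.

```python
import re

# Required fields and their expected Python types, taken from the table above.
REQUIRED_FIELDS = {
    "id": str,
    "description": str,
    "depends_on": list,
    "pattern_match_rule": str,
    "context_rule": str,
    "examples": list,
    "source": str,
}

# Assumed reading of "Snake_case": lowercase words joined by underscores.
SNAKE_CASE_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_rule_object(rule: dict) -> list[str]:
    """Return a list of problems found in one rule object (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in rule:
            problems.append(f"missing required field: {field}")
        elif not isinstance(rule[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    rule_id = rule.get("id")
    if isinstance(rule_id, str) and not SNAKE_CASE_RE.match(rule_id):
        problems.append(f"id is not snake_case: {rule_id}")
    return problems


# Hypothetical rule used only for illustration.
sample_rule = {
    "id": "relative_references",
    "description": "Relative references must be resolvable from the chunk.",
    "depends_on": ["cross_references.subsequent_ref"],
    "pattern_match_rule": "Detect relative-reference phrases.",
    "context_rule": "Decide whether the reference is direct or indirect.",
    "examples": [],
    "source": "sources[0]",
}

print(check_rule_object(sample_rule))
print(check_rule_object({"id": "BadId"}))
```

Uniqueness of `id` within a profile is a cross-rule property and would need a separate pass over the whole validation_rules array.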
2.3 Example Object Schema
Each entry in the examples array has one of two shapes:
Pass example:
    - pass:
        chunk: "raw legislative text..."
        reference_id: "extracted reference"
        reason: "why this is correct"
Fail example:
    - fail:
        chunk: "raw legislative text..."
        reference_id: "extracted reference"
        reason: "why this is incorrect"
Both shapes may include optional fields:
- query: the user query that produced the extraction
- relation: the expected relation (direct, indirect, unrelated)
- section_name: the expected section name
- expected: list of expected extractions (for empty-result cases)
2.4 depends_on Field Path Conventions
The depends_on field uses dot-separated paths into the CountryProfile YAML structure. Supported path patterns:
| Pattern | Meaning |
|---|---|
| levels | All level definitions |
| levels[key=paragraph] | The level with key: paragraph |
| levels[key=paragraph].is_unnumbered_when_single | Specific field on a specific level |
| cross_references.abbreviations | The abbreviations map |
| cross_references.connectors | The connectors list |
| document_types[USC].notes | Specific field on a specific document type |
| few_shots | The few-shots array |
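The path patterns in the table can be resolved against a loaded profile with a short walker. This is a sketch under assumptions: `resolve_path` is a hypothetical helper, the profile is assumed to be already parsed into nested dicts and lists, and only the patterns listed in the table are handled (wildcards such as `key=*` are not).

```python
import re

# One path segment: a field name plus an optional [selector].
SEGMENT_RE = re.compile(r"^(?P<name>\w+)(?:\[(?P<selector>[^\]]+)\])?$")


def resolve_path(profile: dict, path: str):
    """Resolve a depends_on path against a profile of nested dicts and lists."""
    node = profile
    for segment in path.split("."):
        match = SEGMENT_RE.match(segment)
        if match is None:
            raise ValueError(f"unsupported path segment: {segment!r}")
        node = node[match["name"]]
        selector = match["selector"]
        if selector is None:
            continue
        if "=" in selector:
            # levels[key=paragraph]: pick the list entry whose field equals the value.
            field, value = selector.split("=", 1)
            candidates = [item for item in node if item.get(field) == value]
            if not candidates:
                raise KeyError(segment)
            node = candidates[0]
        else:
            # document_types[USC]: plain mapping lookup.
            node = node[selector]
    return node


# Hypothetical profile fragment with placeholder values.
profile = {
    "levels": [
        {"key": "article"},
        {"key": "paragraph", "is_unnumbered_when_single": True},
    ],
    "cross_references": {"abbreviations": {}, "connectors": []},
    "document_types": {"USC": {"notes": "placeholder note"}},
}

print(resolve_path(profile, "levels[key=paragraph].is_unnumbered_when_single"))
print(resolve_path(profile, "document_types[USC].notes"))
```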
3. Rule Categories
Rules are organized into the following categories. Every profile SHOULD include at least one rule from each applicable category.
3.1 Reference ID Verification (Universal)
Applies to: All profiles.
Category ID: ref_id_verbatim_substring, ref_id_component_verification
The foundational rule: the reference_id in an extraction must be traceable to verbatim text in the chunk. However, what "verbatim" means depends on the country's citation structure.
Flat Citation Systems (most countries)
For countries where citations are contiguous strings that appear literally in the text (KR 제53조, JP 第九十条, DE § 242 BGB, FR Article 1240), the full reference_id IS a literal substring of the chunk. Verification is a simple case-sensitive ref_id in chunk check.
Pattern match rule: ref_id in chunk.
Context rule: Near-matches where normalisation differs (spacing, trailing punctuation).
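As a rough illustration of the flat check, the sketch below returns three outcomes: a literal match passes, a match that only appears after ignoring whitespace and trailing punctuation is handed to the context rule, and anything else fails. The function names, the outcome labels, and the sample chunks are assumptions made for this example.

```python
import re


def strip_whitespace(text: str) -> str:
    """Remove all whitespace so spacing variants compare equal."""
    return re.sub(r"\s+", "", text)


def check_flat_reference(reference_id: str, chunk: str) -> str:
    """Classify a flat-citation reference_id as 'pass', 'needs_context', or 'fail'."""
    if reference_id in chunk:
        return "pass"  # literal, case-sensitive substring
    # Near-match: identical once spacing and trailing punctuation are ignored.
    trimmed = reference_id.rstrip(".,;:")
    if strip_whitespace(trimmed) in strip_whitespace(chunk):
        return "needs_context"
    return "fail"


# Made-up chunks, for illustration only.
print(check_flat_reference("§ 242 BGB", "Der Schuldner ist nach § 242 BGB verpflichtet."))
print(check_flat_reference("제53조", "제 53 조에 따른 신고"))
print(check_flat_reference("Article 1240", "Article 1241 du code civil"))
```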
Component-Level Citation Systems (US, and similar assembled-citation countries)
For countries where citations are assembled from components that appear at different positions in the raw text, the full reference_id is never a literal substring. Even a "simple" citation like § 78j(b) is assembled: the chunk has § 78j. on one line and (b) on another. The string § 78j(b) does not appear contiguously.
This applies to ALL assembled citations in these systems, not just deeply nested ones. The verification must decompose the reference_id into its component tokens and verify each one appears in the chunk in the correct hierarchical order.
Pattern match rule: Decompose the reference_id using level-specific regexes (section number, subdivision designators). Verify each component is a substring of the chunk. Components must appear in hierarchical order. The regex must match Roman numerals (I), (II) before single uppercase letters (A), (B) to avoid misclassifying subclause-level designators as subparagraph-level.
Context rule: The LLM judge handles edge cases:
- Spacing normalisation (e.g., § 405. with period vs § 405 without)
- Component from wrong article (e.g., (d) belongs to § 406, not § 405)
- Ambiguous designators at deep levels (e.g., (i) could be depth 3 or 6 in CFR)
- Missing section header in chunk
CountryProfile dependency: levels (all numbering formats), document_types (structure depth).
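The decomposition and ordered check described above might look like the following sketch. The regexes are simplified assumptions (a real profile would supply its own level-specific patterns), the level names follow the subclause/subparagraph wording used in this section, and the chunk is invented for illustration.

```python
import re

# Hypothetical component regexes, simplified for illustration.
SECTION_RE = re.compile(r"§\s*\d+[a-z]*")
DESIGNATOR_RE = re.compile(r"\([0-9A-Za-z]+\)")
ROMAN_UPPER_RE = re.compile(r"\((?:I{1,3}|IV|VI{0,3}|IX|X)\)")
SINGLE_UPPER_RE = re.compile(r"\([A-Z]\)")


def classify_uppercase(token: str):
    """Test Roman numerals before single uppercase letters, as the rule requires."""
    if ROMAN_UPPER_RE.fullmatch(token):
        return "subclause"
    if SINGLE_UPPER_RE.fullmatch(token):
        return "subparagraph"
    return None


def decompose(reference_id: str) -> list[str]:
    """Split '§ 78j(b)' into ['§ 78j', '(b)']; return [] if no section number."""
    text = reference_id.strip()
    section = SECTION_RE.match(text)
    if section is None:
        return []
    return [section.group()] + DESIGNATOR_RE.findall(text[section.end():])


def verify_components(reference_id: str, chunk: str) -> bool:
    """Each component must appear in the chunk, in hierarchical order."""
    components = decompose(reference_id)
    position = 0
    for component in components:
        found = chunk.find(component, position)
        if found == -1:
            return False
        position = found + len(component)
    return bool(components)


# Invented chunk: the header and the subdivision sit on separate lines.
chunk = "§ 78j. Example header\n(a) first subdivision text\n(b) second subdivision text"

print("§ 78j(b)" in chunk)                     # the assembled string is not contiguous
print(verify_components("§ 78j(b)", chunk))
print(verify_components("§ 78j(c)", chunk))
```

A positional search like this cannot tell whether a designator belongs to a different section inside the same chunk; that case stays with the context rule.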
3.2 Citation Abbreviation Rejection (Universal)
Applies to: All profiles.
Category ID: no_citation_abbreviations, accepted_abbreviations
Raw legislative text typically uses full forms, so citation abbreviations (art., §, 조) may or may not match the raw text. The rule validates that the reference_id uses the form found in the chunk, not the citation-abbreviation form.
Pattern match rule: Compare reference_id law-name tokens against cross_references.abbreviations map.
Context rule: Accept abbreviation only if the chunk itself uses it.
CountryProfile dependency: levels (numbering formats), cross_references.abbreviations.
3.3 Unnumbered First Paragraph (Civil Law)
Applies to: Profiles where is_unnumbered_first: true or is_unnumbered_when_single: true on any level.
Applies to KR, JP, DE, FR, EG, TH, NL, CN, SA, MX, GR, ID, AR, SE, ES, PL, TR, UA, IL — 19 of 36 countries.
Does NOT apply to US, AU, NZ, CA, GB — common law systems.
Category ID: unnumbered_first_paragraph, unnumbered_single_paragraph
When a level has is_unnumbered_first: true, the first element at that level has no number in the raw text. The LLM must not extract a non-existent number. When is_unnumbered_when_single: true, the number is omitted only when the parent element contains a single child at that level.
Pattern match rule: Detect absence of the level's numbering pattern after the parent element header.
Context rule: Distinguish true unnumbered status from chunk truncation.
CountryProfile dependency: levels[key=*].is_unnumbered_first, levels[key=*].is_unnumbered_when_single, levels[key=*].unnumbered_note.
3.4 Inserted Article Pattern (Civil Law)
Applies to: Profiles with inserted_pattern on the article level.
Applies to KR (의M), JP (のM), DE (Na/Nb), FR (bis/ter/quater), IT (-bis), ES (-bis), RU/UA (.M), PT (.º).
Category ID: inserted_article_pattern
Inserted articles use special suffixes to denote placement between existing articles. The suffix is part of the official numbering.
Pattern match rule: Country-specific regex for the inserted pattern.
Context rule: Spacing variants around the inserted suffix.
CountryProfile dependency: levels[key=article].inserted_pattern.
3.5 Relative References (Universal)
Applies to: All profiles.
Category ID: relative_references
Legal text frequently uses relative references to refer to the same law (같은 법, this section, ledit article), adjacent articles (前条, 前項, the preceding section), or previously mentioned statutes (위 법률, the said Act). US statutes use 'such subsection', 'the preceding section', 'this paragraph'.
Pattern match rule: Detect relative-reference phrases from cross_references.subsequent_ref, cross_references.internal_ref, and country-specific relative terms.
Context rule: Determine whether a relative reference is direct (names a specific article) or indirect (anaphoric back-reference).
CountryProfile dependency: cross_references.subsequent_ref, cross_references.internal_ref.
3.6 Cross-Document References (Universal)
Applies to: All profiles.
Category ID: cross_document_references
When a chunk references another law by name, the reference_id must include both the law name and the article. The law name must match the form used in the chunk (full name vs. abbreviation, with or without official quote marks).
Pattern match rule: Detect law-name patterns (e.g., 「...」, 42 U.S.C., Code civil) followed by article numbering.
Context rule: Determine relation type. Is the cross-document reference the primary subject or a passing citation?
CountryProfile dependency: cross_references.quote_marks, cross_references.full_citation, document_types.
3.7 Text-Only References (Universal)
Applies to: All profiles.
Category ID: text_only_references
Some chunks contain substantive legal text without any numbered provisions — preambles, recitals, narrative descriptions, statutory notes. When no numbering patterns match, the extraction should return empty.
Pattern match rule: Absence check. If none of the level-numbering regexes match, flag as text-only.
Context rule: Distinguish genuinely unnumbered text from truncated chunks or unusual formats.
CountryProfile dependency: levels (all numbering patterns).
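The absence check is small enough to sketch directly. The two regexes below are hypothetical stand-ins for the level-numbering patterns a real profile would provide, and the sample chunks are invented.

```python
import re

# Hypothetical stand-ins for a profile's level-numbering regexes.
LEVEL_REGEXES = [
    re.compile(r"제\s*\d+\s*조"),  # KR-style article number
    re.compile(r"§\s*\d+"),        # section sign followed by a number
]


def is_text_only(chunk: str, level_regexes=LEVEL_REGEXES) -> bool:
    """True when no level-numbering regex matches anywhere in the chunk."""
    return not any(regex.search(chunk) for regex in level_regexes)


print(is_text_only("Narrative preamble text with no numbered provisions."))
print(is_text_only("제53조에 따른 신고"))
```

A chunk flagged this way would still go to the context rule, since a truncated chunk also produces no matches.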
3.8 Relation Classification (Universal)
Applies to: All profiles.
Category ID: relation_classification
The relation field (direct/indirect/unrelated) requires semantic understanding of the query-chunk relationship. No regex can determine this.
Pattern match rule: None (purely semantic).
Context rule: The LLM judge must assess whether the chunk directly answers the query, provides supporting information, or is irrelevant.
CountryProfile dependency: few_shots (demonstrates expected relation patterns).
4. Country-Specific Rule Categories
Beyond the universal categories above, some countries need specialised rules:
4.1 KR-Specific: Item/Sub-Item Numbering Format
Korean law uses specific numbering for 호 (Arabic + period: 1.) and 목 (Korean alphabet + period: 가.). The period is mandatory; 1 without a period is ambiguous.
CountryProfile dependency: levels[key=item].numbering, levels[key=sub-item].numbering.
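The mandatory-period requirement can be illustrated with two regexes. These patterns are assumptions written for this example (the authoritative formats live in levels[key=item].numbering and levels[key=sub-item].numbering), and `classify_kr_designator` is a hypothetical helper.

```python
import re

# Assumed shapes: 호 is Arabic digits plus a period, 목 is one Korean letter plus a period.
ITEM_RE = re.compile(r"\d+\.")
SUB_ITEM_RE = re.compile(r"[가나다라마바사아자차카타파하]\.")


def classify_kr_designator(token: str) -> str:
    """Return 'item', 'sub-item', or 'ambiguous' for a single designator token."""
    if ITEM_RE.fullmatch(token):
        return "item"
    if SUB_ITEM_RE.fullmatch(token):
        return "sub-item"
    return "ambiguous"  # e.g. a bare "1" with no period


print(classify_kr_designator("1."))
print(classify_kr_designator("가."))
print(classify_kr_designator("1"))
```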
4.2 US-Specific: USC vs CFR Naming Divergence
USC and CFR use different names for the same positional levels. The LLM must not mix them.
CountryProfile dependency: document_types[USC].notes, document_types[CFR].notes, levels.
4.3 US-Specific: CFR Cycling Pattern
CFR subdivisions cycle after 3 levels: (a)(1)(i)(A)(1)(i). The same designator pattern repeats, requiring depth tracking.
CountryProfile dependency: document_types[CFR].notes.
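Depth tracking for the cycle can be sketched by matching each designator against the shape expected at its position in the chain. The shape regexes and the helper name `assign_depths` are assumptions for illustration; the six-position sequence is the one given above.

```python
import re

# Designator shape expected at each depth, following (a)(1)(i)(A)(1)(i).
CFR_DEPTH_SHAPES = [
    re.compile(r"\([a-z]+\)"),    # depth 1: (a)
    re.compile(r"\(\d+\)"),       # depth 2: (1)
    re.compile(r"\([ivxlc]+\)"),  # depth 3: (i)
    re.compile(r"\([A-Z]+\)"),    # depth 4: (A)
    re.compile(r"\(\d+\)"),       # depth 5: (1) again
    re.compile(r"\([ivxlc]+\)"),  # depth 6: (i) again
]


def assign_depths(designators: list[str]):
    """Pair each designator with its depth, or return None if the chain does not fit."""
    if len(designators) > len(CFR_DEPTH_SHAPES):
        return None
    depths = []
    for depth, (token, shape) in enumerate(zip(designators, CFR_DEPTH_SHAPES), start=1):
        if shape.fullmatch(token) is None:
            return None
        depths.append((depth, token))
    return depths


chain = ["(a)", "(1)", "(i)", "(A)", "(1)", "(i)"]
print(assign_depths(chain))
print(assign_depths(["(a)", "(A)"]))  # (A) cannot follow (a) directly
```

The same token, such as (1), resolves to depth 2 or depth 5 only because of its position in the chain, which is why the designator alone is not enough.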
4.4 US-Specific: No Unnumbered First Subdivision
Unlike civil law systems, US statutes always explicitly number every subdivision. The LLM should never assume an unnumbered first element.
CountryProfile dependency: levels (negative check — no is_unnumbered_first).
4.5 US-Specific: Hierarchical Component Assembly
USC citations are assembled from components that appear at different positions in the chunk. Even a citation with only one subdivision (e.g., § 78j(b)) is not a literal substring — the section header and the subdivision designator appear on separate lines. ALL US citations require component-level verification, not just deeply nested ones. The component regex must match uppercase Roman numerals (I), (II) before single uppercase letters (A), (B) to avoid misclassifying subclause-level designators.
CountryProfile dependency: levels (all 8 level numbering patterns), document_types[USC].structure.
4.6 JP/FR-Specific: Kanji/Roman Numeral vs Arabic
JP uses kanji numerals (一, 二, 三) for items in formal text but Arabic (1, 2, 3) in casual citations. FR uses Roman numerals (I., II., III.) for paragraphs. The reference_id must match the numeral system used in the chunk.
CountryProfile dependency: numeral_system, levels[key=*].numbering.
5. Integration with Existing Fields
5.1 Relationship to patterns
| Aspect | patterns | validation_rules |
|---|---|---|
| Mechanism | Regex | LLM + regex hybrid |
| Scope | Structural token matching | Contextual/semantic validation |
| Output | Match/no-match with captures | Pass/fail with explanation |
| Determinism | Fully deterministic | Partially deterministic |
patterns handles the "does this token look like a valid citation component?" question.
validation_rules handles the "is this citation component correctly applied in this context?" question.
A pattern_match_rule within a validation rule can reference the same regexes defined in the patterns field. When it does, it should cite the pattern by ID.
5.2 Relationship to few_shots
few_shots are prompt guidance for the extraction LLM — they teach the model what good extractions look like.
validation_rules are evaluation criteria for the validation LLM — they teach the judge what to accept/reject.
The examples in validation rules serve a different purpose than few_shots: they demonstrate error cases (fail examples) alongside correct cases, whereas few_shots only show correct extractions.
5.3 Relationship to cross_references
The cross_references field defines the citation CONVENTIONS of a country (what forms exist). The validation_rules field defines how to VALIDATE those conventions when they appear in extraction output.
For example:
- cross_references.subsequent_ref says "같은 법 / 위 법률" exists as a form
- validation_rules[id=relative_references] says "when you see '같은 법 제5조', it's verifiable; when you see '위 법률' alone, the LLM judge must determine what it refers to"
6. Implementation Guide
6.1 LLM Judge Prompt Template
The validation rules are consumed by an LLM judge. The prompt should be structured as:
    You are validating legal citation extractions for {country_name} ({country_code}).
    VALIDATION RULES:
    {for each rule: id, description, context_rule}
    TEST CASES:
    {for each extraction to validate: query, chunk, reference_id, relation, section_name}
    For each test case, evaluate against each applicable rule.
    Return a JSON array with pass/fail per rule and a final verdict.
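One way to fill the template is a small builder function. This is a sketch only: `build_judge_prompt` is a hypothetical name, the per-rule and per-case line layout is an assumption (the template does not prescribe one), and the sample rule and test case are invented.

```python
import json


def build_judge_prompt(country_name, country_code, rules, test_cases):
    """Assemble the judge prompt from rule objects and extractions to validate."""
    rule_lines = [
        f"- {rule['id']}: {rule['description']} Context rule: {rule['context_rule']}"
        for rule in rules
    ]
    # ensure_ascii=False keeps non-Latin citation text readable in the prompt.
    case_lines = [json.dumps(case, ensure_ascii=False) for case in test_cases]
    return "\n".join([
        f"You are validating legal citation extractions for {country_name} ({country_code}).",
        "VALIDATION RULES:",
        *rule_lines,
        "TEST CASES:",
        *case_lines,
        "For each test case, evaluate against each applicable rule.",
        "Return a JSON array with pass/fail per rule and a final verdict.",
    ])


# Invented rule and test case, for illustration only.
rules = [{
    "id": "relative_references",
    "description": "Relative references must be resolvable.",
    "context_rule": "Decide whether the reference is direct or indirect.",
}]
test_cases = [{
    "query": "example query",
    "chunk": "같은 법 제5조에 따른다",
    "reference_id": "같은 법 제5조",
    "relation": "direct",
    "section_name": "",
}]

prompt = build_judge_prompt("South Korea", "KR", rules, test_cases)
print(prompt)
```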
6.2 Two-Phase Validation
- Phase 1 (Pattern): Apply the pattern_match_rule deterministically. If the regex check fails, the extraction fails without LLM invocation. This is cheap and fast.
- Phase 2 (Context): Only for cases that pass Phase 1 but need contextual judgment. Invoke the LLM judge with the context_rule and relevant examples. This is expensive but necessary for ambiguous cases.
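The two phases can be sketched as a single function that takes the checks as callables, so the LLM call is only reached when needed. Everything here is illustrative: the function name, the result shape, and the three stubs (a substring pattern check, a relative-reference trigger, and a judge that always passes) are assumptions, not the actual judge.

```python
def two_phase_validate(extraction, pattern_check, needs_context, llm_judge):
    """Run the deterministic check first; call the judge only when required."""
    if not pattern_check(extraction):
        return {"verdict": "fail", "phase": 1}  # no LLM invocation
    if not needs_context(extraction):
        return {"verdict": "pass", "phase": 1}
    return {"verdict": llm_judge(extraction), "phase": 2}


judge_calls = []  # records which extractions reached the (stub) judge


def stub_pattern_check(extraction):
    return extraction["reference_id"] in extraction["chunk"]


def stub_needs_context(extraction):
    # Assumed trigger: relative references need contextual judgment.
    return extraction["reference_id"].startswith("같은 법")


def stub_llm_judge(extraction):
    judge_calls.append(extraction["reference_id"])
    return "pass"


def run(reference_id, chunk):
    extraction = {"reference_id": reference_id, "chunk": chunk}
    return two_phase_validate(extraction, stub_pattern_check, stub_needs_context, stub_llm_judge)


missing = run("제54조", "제53조에 따른 신고")
literal = run("제53조", "제53조에 따른 신고")
relative = run("같은 법 제5조", "같은 법 제5조에 따른다")
print(missing, literal, relative, judge_calls)
```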
6.3 Rule Selection Per Profile
Not all rules apply to every country. The judge should select applicable rules based on:
- Universal rules (categories 3.1, 3.2, and 3.5–3.8): apply to all profiles.
- Civil-law and country-specific rules (categories 3.3, 3.4, and section 4): apply only when the profile has the relevant depends_on fields.
- Negative rules (e.g., US Rule 9: no unnumbered first): apply as guardrails for systems that should NOT exhibit certain patterns.
6.4 depends_on Change Impact
When a CountryProfile field referenced by depends_on changes, all rules that depend on it MUST be reviewed. The depends_on field enables automated change-impact analysis:
    If levels[key=paragraph].is_unnumbered_when_single changes:
      → Review: unnumbered_single_paragraph (KR Rule 3)
      → Review: no_unnumbered_first (US Rule 9)
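The change-impact lookup amounts to a reverse index from field path to dependent rules. The sketch below assumes rules are already loaded per country; `build_impact_index` is a hypothetical helper, and the depends_on values in the sample data are chosen to mirror the example above rather than copied from real profiles.

```python
from collections import defaultdict


def build_impact_index(rules_by_country: dict) -> dict:
    """Map each depends_on path to the (country, rule id) pairs that rely on it."""
    index = defaultdict(list)
    for country, rules in rules_by_country.items():
        for rule in rules:
            for path in rule["depends_on"]:
                index[path].append((country, rule["id"]))
    return dict(index)


# Illustrative data mirroring the example above.
PATH = "levels[key=paragraph].is_unnumbered_when_single"
rules_by_country = {
    "KR": [{"id": "unnumbered_single_paragraph", "depends_on": [PATH]}],
    "US": [
        {"id": "no_unnumbered_first", "depends_on": [PATH]},
        {"id": "cfr_cycling_pattern", "depends_on": ["document_types[CFR].notes"]},
    ],
}

impact_index = build_impact_index(rules_by_country)
for country, rule_id in impact_index[PATH]:
    print(f"Review: {rule_id} ({country})")
```

Exact string matching on paths is the simplest choice; a fuller version might also treat a change to `levels` as affecting every rule that depends on a path beneath it.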
7. Design Principles
- Traceability. Every rule cites its source authority and lists the CountryProfile fields it depends on. A reviewer can always answer "why does this rule exist?" and "what happens if the profile changes?"
- Pattern-first. The pattern_match_rule should be as deterministic as possible. The context_rule should be reserved for cases that genuinely require semantic understanding. If a regex can do the job, don't invoke the LLM.
- Fail-visible. Every rule includes fail examples. A rule without fail examples is incomplete: it tells the judge what to accept but not what to reject.
- Country-grounded. Rules reference specific CountryProfile fields (e.g., levels[key=paragraph].is_unnumbered_when_single), not abstract principles. The rule is a property of the country's citation system, not a universal axiom.
- Non-overlapping. Each rule covers a distinct validation concern. If two rules overlap, they should be merged or one should reference the other.
8. Appendix: Cross-Country Rule Coverage
The table uses actual rule id values from each country's YAML. A checkmark (✓) means the rule exists in that country's validation_rules array; a dash (—) means it does not apply or is not defined.
| Rule Category | KR rule id | US rule id | JP | DE | FR | Notes |
|---|---|---|---|---|---|---|
| Reference ID verification | ref_id_verbatim_substring | ref_id_component_verification | ✓ | ✓ | ✓ | KR/JP/DE/FR: flat substring. US: component-level for ALL citations. |
| Citation abbreviation rejection | no_citation_abbreviations | — | ✓ | ✓ | ✓ | Universal concept; US omitted (§ is unambiguous in ASCII). |
| Unnumbered first paragraph | unnumbered_single_paragraph | no_unnumbered_first | ✓ | ✓ | ✓ | KR: single paragraph; JP: implicit first paragraph; DE: Absatz 1; FR: alinéa. US rule is negative guard. |
| Inserted article pattern | inserted_article_pattern | — | ✓ | ✓ | ✓ | KR: 의M; JP: のM; DE: Na/Nb; FR: bis/ter/quater. |
| Relative references | relative_references | relative_references | ✓ | ✓ | ✓ | Universal. KR: 같은 법/이 법/위 법률. US: such subsection/preceding section. |
| Cross-document references | cross_document_references | cross_document_references | ✓ | ✓ | ✓ | KR: 「 」 quotation marks. US: Act names, title citations, 'this Act'. |
| Text-only references | text_only_references | text_only_references | ✓ | ✓ | ✓ | Universal. Chunks with no numbered provisions → empty expected. |
| Relation classification | relation_classification | relation_classification | ✓ | ✓ | ✓ | Universal. Semantic judgment, no regex. |
| Item numbering format | item_numbering_format | — | — | — | — | KR-specific (호: N. with mandatory period). |
| Sub-item numbering format | sub_item_numbering_format | — | — | — | — | KR-specific (목: 가. with mandatory period). |
| USC/CFR naming divergence | — | usc_cfr_naming_divergence | — | — | — | US-specific. |
| Nested subdivision assembly | — | nested_subdivision_assembly | — | — | — | US-specific (8 levels deep). |
| CFR cycling pattern | — | cfr_cycling_pattern | — | — | — | US-specific (a)(1)(i)(A)(1)(i). |
| Public Law / Executive Order | — | public_law_executive_order_format | — | — | — | US-specific. |
| Section number must be present | — | section_number_present | — | — | — | US-specific (component verification guard). |