Trustworthy AI Legal and Governmental Content Validator

This project is part of CS5374 Software Verification and Validation at Texas Tech University, Department of Computer Science. The project builds a Trustworthy AI validation pipeline that verifies legal and governmental content against authoritative Texas open data before any AI system presents it to users.

Project Personnel

Role	Name	Contact	Link
Student	Scott Weeden	sweeden@ttu.edu	LinkedIn
Instructor	Dr. Akbar S. Namin	akbar.namin@ttu.edu	TTU CS Faculty

Course: CS 5374 - Software Verification and Validation | Spring 2026
Repository: CS5374 Software V&V on GitHub

The Problem: AI Hallucination in Legal Research

Large language models and retrieval-augmented generation (RAG) systems are increasingly used to answer questions about legal and governmental matters, yet they frequently hallucinate or return outdated information. Invented judge names, non-existent laws, fabricated election details, or unverified court documents can cause serious harm: incorrect legal advice, misrepresentation of officials, and invalid citations presented as binding authority.

Notable Cases & Studies

Reference	Description
Stanford Law - “Hallucination-Free?”	Assessing the reliability of leading AI legal research tools (link)
Stanford Law - “Large Legal Fictions”	Profiling legal hallucinations in large language models (link)
Mata v. Avianca, Inc.	Court sanctions for AI-generated fake citations (link)

Content Verification Architecture

The pipeline verifies content across seven domains using authoritative Texas sources:

Content Type	Authoritative Texas Source	Verification Approach
Legal/Government News	Trust lists, NewsGuard, AllSides	URL and domain checks; cross-check with Texas agency press releases
Judges	Texas judicial directories, court rosters	Name and court match against official rosters
Elected Officials	data.texas.gov, data.capitol.texas.gov	Match names, offices, and terms to official datasets
Elections & Opponents	Capitol Data Portal (116+ datasets)	Certified filings and results; candidate/race verification
Laws & Ordinances	Texas Legislature, agency sites	Citation and text match against official code/statute datasets
Court Documents	Texas court datasets, e-filing metadata	Docket/case ID and document metadata validation
Legal Templates	Texas court form registries	Checksum and version validation against known good templates

Note: Federal sources (CourtListener, PACER, FEC) are not used as primary authorities; the focus is on Texas legal and governmental sources via the Texas Open Data Portal and Capitol Data Portal.
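
As an illustration of the "checksum and version validation" approach listed for Legal Templates, the following is a minimal sketch. The registry contents, the template ID, and the function name `validate_template` are hypothetical; a real registry would be populated from a Texas court form source rather than hard-coded.

```python
import hashlib

# Hypothetical registry: template ID -> (version, SHA-256 of the known-good file).
# The entry below is a placeholder, not a real Texas court form.
KNOWN_GOOD_TEMPLATES = {
    "example-form": ("v1", hashlib.sha256(b"EXAMPLE FORM TEXT v1").hexdigest()),
}


def validate_template(template_id: str, version: str, content: bytes) -> dict:
    """Compare a candidate template against its registry entry."""
    entry = KNOWN_GOOD_TEMPLATES.get(template_id)
    if entry is None:
        return {"valid": False, "reason": "unknown_template"}
    known_version, known_digest = entry
    if version != known_version:
        return {"valid": False, "reason": "version_mismatch"}
    if hashlib.sha256(content).hexdigest() != known_digest:
        return {"valid": False, "reason": "checksum_mismatch"}
    return {"valid": True, "reason": "ok"}
```

Checking the version before the digest lets the validator report an outdated form separately from a tampered one.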

LangGraph Validation Pipeline

The system uses LangChain and LangGraph to implement validator agents that ingest, parse, and verify content at each stage.

Pipeline Stages

Content Extraction - Parse and normalize input content
Schema Validation - Verify required fields and data types
Source Authority Check - Validate against allowlist of authoritative domains
Temporal Validation - Verify timestamps are valid and current
Content Verification - Cross-reference with authoritative Texas databases
Provenance Attribution - Attach verification metadata to all outputs
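
The six stages above can be sketched as a chain of plain Python functions. This is not the project's LangGraph code: it uses only the standard library so it runs without dependencies, and the field names, the two-entry allowlist, the one-year freshness window, and the in-memory `records` set (standing in for a Texas database lookup) are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse

# Assumed configuration, for illustration only.
ALLOWED_DOMAINS = {"data.texas.gov", "data.capitol.texas.gov"}
REQUIRED_FIELDS = {"claim": str, "source_url": str, "retrieved_at": str}
MAX_AGE = timedelta(days=365)


class StageFailure(Exception):
    def __init__(self, stage: str, reason: str):
        super().__init__(f"{stage}: {reason}")
        self.stage, self.reason = stage, reason


def extract(raw: dict) -> dict:
    # Stage 1: normalize by trimming whitespace from string fields.
    return {k: v.strip() if isinstance(v, str) else v for k, v in raw.items()}


def check_schema(item: dict) -> None:
    # Stage 2: every required field must be present, non-empty, and typed correctly.
    for field, ftype in REQUIRED_FIELDS.items():
        value = item.get(field)
        if not isinstance(value, ftype) or not value:
            raise StageFailure("schema", f"missing or invalid field: {field}")


def check_authority(item: dict) -> None:
    # Stage 3: the source host must match an allowlisted domain exactly.
    host = (urlparse(item["source_url"]).hostname or "").lower()
    if host not in ALLOWED_DOMAINS:
        raise StageFailure("authority", f"host not on allowlist: {host!r}")


def check_temporal(item: dict, now: datetime) -> None:
    # Stage 4: the timestamp must parse, not lie in the future, and not be stale.
    try:
        ts = datetime.fromisoformat(item["retrieved_at"])
    except ValueError:
        raise StageFailure("temporal", "unparseable timestamp")
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assume UTC when no offset is given
    if ts > now or now - ts > MAX_AGE:
        raise StageFailure("temporal", "timestamp is in the future or stale")


def verify_content(item: dict, records: set) -> None:
    # Stage 5: stand-in for a cross-reference against an authoritative dataset.
    if item["claim"] not in records:
        raise StageFailure("content", "claim not found in authoritative records")


def run_pipeline(raw: dict, records: set, now: datetime = None) -> dict:
    """Run all stages in order; `now` must be timezone-aware if supplied."""
    now = now or datetime.now(timezone.utc)
    try:
        item = extract(raw)
        check_schema(item)
        check_authority(item)
        check_temporal(item, now)
        verify_content(item, records)
    except StageFailure as err:
        return {"status": "rejected", "stage": err.stage, "reason": err.reason}
    # Stage 6: attach provenance metadata to the accepted item.
    provenance = {"source": item["source_url"], "verified_at": now.isoformat()}
    return {**item, "status": "verified", "provenance": provenance}
```

A rejected item records the stage that failed, which is the information a retry or escalation step would need.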

Key Features

Schema validation at every stage
Source grounding requirements before indexing
Pass/fail routing with retry or escalation
Provenance metadata on all outputs (source, date, verification status)
Only content that passes verification is indexed and made available to downstream AI systems
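
One possible reading of "pass/fail routing with retry or escalation" is sketched below: a transient error (modeled here as `TimeoutError`) is retried up to a fixed budget, a definite failure is escalated immediately, and only a pass is routed to indexing. The retry budget of 2, the route labels, and the function name `route` are assumptions; the source does not specify the project's actual policy.

```python
from typing import Callable

MAX_RETRIES = 2  # Assumed retry budget; not taken from the project.


def route(validator: Callable[[dict], bool], item: dict,
          max_retries: int = MAX_RETRIES) -> dict:
    """Run a validator and decide where the item goes next."""
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        try:
            passed = validator(item)
        except TimeoutError:
            continue  # transient failure: try again if budget remains
        if passed:
            return {"route": "index", "attempts": attempts}
        # A definite failure is not retried; it goes to human review.
        return {"route": "escalate", "attempts": attempts,
                "reason": "failed_verification"}
    return {"route": "escalate", "attempts": attempts,
            "reason": "retries_exhausted"}
```

In a LangGraph workflow the returned `route` value would typically drive a conditional edge out of the validator node.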

AI Agent Design Patterns

The project leverages 21 AI agent design patterns documented in the DesignPatterns repository:

Pattern	Application in Project
01 - Prompt Chaining	Sequential validation steps where output of one step feeds the next
02 - Routing	Content-type classification directing to appropriate validators
03 - Parallelization	Concurrent checking of multiple authoritative sources
04 - Reflection	Self-verification of validator outputs before acceptance
05 - Tool Use	Integration with Texas Open Data APIs
06 - Planning	Multi-step validation workflows for complex content
07 - Multi-Agent Collaboration	Distributed validators for different content types
08 - Memory Management	Preservation of verification context across the pipeline
09 - Learning & Adaptation	Pattern learning from verification results
10 - Model Context Protocol	Standardized context passing between agents
11 - Goal Setting	Defining verification thresholds and targets
12 - Exception Handling	Graceful handling of API failures and timeouts
13 - Human-in-the-Loop	Escalation paths for ambiguous verifications
14 - RAG (Retrieval-Augmented Generation)	Ground truth retrieval from Texas databases
15 - Inter-Agent Communication	Coordination between validator nodes
16 - Resource-Aware Optimization	Efficient API usage and rate limiting
17 - Reasoning Techniques	Logical inference for complex content types
18 - Guardrails & Safety	Input sanitization and output validation
19 - Evaluation & Monitoring	Metrics tracking with LangSmith/Phoenix
20 - Prioritization	Queue management for verification tasks
21 - Exploration & Discovery	New source identification and validation
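
To make two of the patterns above concrete, the sketch below combines Parallelization (03) with Exception Handling (12): several source checks run concurrently, and a source that raises or times out is reported as unavailable instead of aborting the run. The source labels and checker callables are hypothetical placeholders for real Texas Open Data lookups.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional


def check_sources(name: str,
                  sources: Dict[str, Callable[[str], bool]],
                  timeout_s: float = 5.0) -> Dict[str, Optional[bool]]:
    """Query every source concurrently; None marks a source that failed."""
    results: Dict[str, Optional[bool]] = {}
    with ThreadPoolExecutor(max_workers=max(len(sources), 1)) as pool:
        futures = {label: pool.submit(fn, name) for label, fn in sources.items()}
        for label, future in futures.items():
            try:
                results[label] = future.result(timeout=timeout_s)
            except Exception:
                # API failure or timeout: record it and keep going.
                results[label] = None
    return results
```

Keeping `None` distinct from `False` lets a later step tell "source says no" apart from "source could not be reached".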

Experiments & Evaluation

Experiment 1: Baseline Hallucination Rate

Objective: Establish the baseline hallucination rate of an LLM on Texas legal citation tasks without verification
Data: Held-out set of legal questions with ground-truth citations from data.texas.gov
Metrics: Proportion of generated citations that do not exist, are misattributed, or have incorrect holdings
Tools: LangSmith, Ragas, DeepEval, promptfoo
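
The metric for this experiment can be sketched as a lookup against ground truth. The ground-truth entry below is a placeholder rather than a real citation, and the sketch covers only the "does not exist" and "misattributed" categories; checking holdings would need a richer comparison than a dictionary lookup.

```python
# Hypothetical ground truth: citation string -> the source it belongs to.
GROUND_TRUTH = {"Example Code Sec. 1.01": "Example Code"}


def classify(citation: str, attributed_to: str) -> str:
    """Label one generated citation against the ground truth."""
    if citation not in GROUND_TRUTH:
        return "nonexistent"
    if GROUND_TRUTH[citation] != attributed_to:
        return "misattributed"
    return "correct"


def hallucination_rate(responses: list) -> float:
    """Share of (citation, attributed_to) pairs that are not correct."""
    if not responses:
        return 0.0
    hallucinated = sum(classify(c, a) != "correct" for c, a in responses)
    return hallucinated / len(responses)
```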

Experiment 2: Verification Pipeline Effectiveness

Objective: Measure the impact of the Texas-data-backed validator on hallucination rate and citation quality
Setup: The same Texas legal citation tasks are passed through the LLM, then through the validator
Metrics: Precision, Recall, Hallucination rate reduction
Tools: Ragas, LangSmith, Phoenix, DeepEval

Experiment 3: Validator Nodes vs Post-Hoc Verification

Objective: Compare a LangGraph workflow with validator nodes (reject/retry on failure) against simple RAG with post-hoc filtering
Metrics: End-to-end accuracy and latency
Tools: LangSmith, promptfoo, TruLens, Phoenix

Experiment 4: Security Red-Team Evaluation

Objective: Apply adversarial testing to the validator pipeline
Tests: Prompt injection, data exfiltration, source spoofing
Tools: GARAK (NVIDIA), LLM Canary, TextAttack, OpenAttack
Deliverable: Documented vulnerabilities and mitigations
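
A source-spoofing test might look like the sketch below, which checks that a host allowlist rejects URLs that merely contain an authoritative domain name. The two allowlisted hosts come from the resources named on this page; the HTTPS requirement, the spoofed URLs, and the function name `is_authoritative` are assumptions for illustration and are not taken from `exp4_security_redteam.py`.

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"data.texas.gov", "data.capitol.texas.gov"}


def is_authoritative(url: str) -> bool:
    """Accept only HTTPS URLs whose parsed host is exactly on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme != "https":  # assumed policy: no plain HTTP
        return False
    host = (parsed.hostname or "").lower().rstrip(".")
    return host in ALLOWED_HOSTS


# Hypothetical adversarial inputs, each hiding the trusted name somewhere.
SPOOF_CASES = [
    "https://data.texas.gov.evil.example/resource",  # trusted name as a subdomain prefix
    "https://evil.example/data.texas.gov",           # trusted name in the path
    "https://data.texas.gov@evil.example/",          # trusted name as userinfo
    "http://data.texas.gov/resource",                # downgraded scheme
    "https://data-texas.gov/resource",               # lookalike domain
]


def safety_rate(cases: list) -> float:
    """safe / total_tests, where a case is safe if the spoof is rejected."""
    safe = sum(not is_authoritative(url) for url in cases)
    return safe / len(cases) if cases else 0.0
```

Comparing the parsed hostname for exact equality, not testing whether the URL string contains the domain, is what defeats the first three cases.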

Evaluation Metric Definitions (Repo-Aligned)

The following metrics align directly with experiment code under experiments/ and rubric expectations under .agents/legal-luminary/RUBRIC.md.

Metric	Definition	Source in Repo	Target
Baseline Hallucination Rate	`hallucinated / total_questions` for unverified citation responses	`experiments/exp1_baseline.py`	Lower is better
Pipeline Precision	`TP / (TP + FP)` after validator pipeline	`experiments/exp2_pipeline_effectiveness.py`	`>= 0.90`
Pipeline Recall	`TP / (TP + FN)` after validator pipeline	`experiments/exp2_pipeline_effectiveness.py`	`>= 0.85`
Hallucination Rate With Pipeline	`FP / total_questions` after validation	`experiments/exp2_pipeline_effectiveness.py`	Below baseline
Architecture Latency (A vs B)	Mean response time for validator-node graph vs post-hoc verification	`experiments/exp3_validator_vs_posthoc.py`	Tracked trend
Security Safety Rate	`safe / total_tests` across adversarial suite	`experiments/exp4_security_redteam.py`	`>= 0.90`
EP Functional Coverage	Number of EP test cases implemented and passing	`.agents/legal-luminary/RUBRIC.md`	`>= 20`
Structural Coverage	Statement coverage across validators and pipeline	`.agents/legal-luminary/RUBRIC.md`	`>= 80%` (target `>= 95%`)
Trace Completeness	Share of runs with end-to-end LangSmith traces	`.agents/legal-luminary/RUBRIC.md` and LangSmith runs	`100%` for graded runs
Source Attribution Rate	Share of outputs including provenance metadata	Pipeline outputs and integration reports	`100%`
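
The precision, recall, and safety-rate formulas in the table can be written as small functions and compared with the stated targets. The threshold values are the ones listed above; the function names and the zero-denominator behavior are assumptions, and the counts passed in would come from the experiment scripts, which are not reproduced here.

```python
# Targets copied from the table above.
TARGETS = {"precision": 0.90, "recall": 0.85, "safety_rate": 0.90}


def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); defined as 0.0 when nothing was accepted."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); defined as 0.0 when there were no true items."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def rate(count: int, total: int) -> float:
    """Generic count / total, used for hallucination and safety rates."""
    return count / total if total else 0.0


def meets_targets(tp: int, fp: int, fn: int, safe: int, total_tests: int) -> dict:
    """Return, per metric, whether the measured value reaches its target."""
    measured = {
        "precision": precision(tp, fp),
        "recall": recall(tp, fn),
        "safety_rate": rate(safe, total_tests),
    }
    return {name: measured[name] >= TARGETS[name] for name in TARGETS}
```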

Open-Source Tool Integration

LLM / AI Evaluation & Testing

Tool	Role
DeepEval	LLM evaluation metrics (faithfulness, answer relevancy)
promptfoo	Local testing of LLM application behavior; regression tests
Ragas	RAG evaluation using Texas-sourced context and ground truth
LangSmith	Tracing and evaluation of LangChain/LangGraph runs
TruLens	LLM evaluation framework for monitoring the pipeline
Phoenix (Arize)	Observability and hallucination detection
Langfuse	Open-source LLM engineering platform

Adversarial & Robustness Testing

Tool	Role
GARAK (NVIDIA)	Red-teaming and vulnerability scanning
LLM Canary	Security benchmarking test suite
TextAttack	Adversarial attacks on validator inputs
OpenAttack	Textual adversarial attack toolkit

Systematic Testing & Error Analysis

Tool	Role
Azimuth	Dataset and error analysis for classifiers
CheckList	Behavioral NLP testing for validator logic
Deepchecks	Validation of ML/data components

Project Deliverables

First Round

Design document and threat model for validation pipeline
Implemented validator modules:
- Legal news source verification
- Judge name verification against Texas court rosters
- Elected official verification against Texas data portals
LangGraph prototype with validator nodes
Unit and integration tests with documented coverage

Final Round

Full validator suite (7 content types)
Integration with at least one authoritative Texas source per content type
End-to-end RAG pipeline with validation gates
Security review report (GARAK red-team results)
Evaluation metrics report (Experiments 1-3)

Texas Open Data Resources

Resource	URL
State of Texas Open Data Portal	data.texas.gov
Capitol Data Portal	data.capitol.texas.gov
Texas Open Data Overview	texas.gov/texas-open-data-portal

Framework & Tool References

Category	Links
Core Frameworks	LangChain, LangGraph, LangSmith
Evaluation	DeepEval, Ragas, TruLens, Phoenix, Langfuse
Testing	promptfoo, CheckList, Deepchecks
Security	GARAK, LLM Canary, TextAttack, OpenAttack

Course Alignment

Week-by-Week Topic and Item Alignment

Week	Topic Discussed	Item(s) Discussed on This Page	Artifact / Report
Week 1	Introduction to V&V	Verification architecture, source grounding, provenance requirements	Problem and architecture sections
Week 2	Adequacy criterion	Definition of “verified” (`schema + authority + temporal validity + provenance`)	Pipeline stage definitions
Week 3	Project proposal and hypothesis	Threat model and experiment plan for legal validator	Proposal and rubric alignment
Week 4	Black-box testing	Output-level checks of LLM answers against authoritative Texas data	Experiment 1 baseline setup
Week 5	LangGraph and LangSmith	Validator-node workflow, routing, traceability, observability	`LANGGRAPH_INTEGRATION_REPORT.md`
Week 6	Functional and structural testing	EP testing strategy and coverage thresholds	Rubric components for EP and coverage
Week 7	Baseline model behavior	Baseline hallucination measurement	Experiment 1 report
Week 8	Validation effectiveness	Precision/recall and hallucination reduction with pipeline	Experiment 2 report
Week 9	Architecture comparison	Validator nodes versus post-hoc verification tradeoff	Experiment 3 report
Week 10	Security robustness	Prompt injection, source spoofing, exfiltration resilience	Experiment 4 red-team report
Week 11	Concept communication	Documentation and conceptual synthesis for trustworthy legal AI	Blog/report deliverables
Week 12	Formal verification	Verification contracts and explicit pass/fail validator gates	Pipeline contract design
Week 13	Model checking	State-transition reasoning over validator routing and outcomes	Validator state and routing logic
Week 16	Tracing hands-on	LangSmith trace coverage and retry analysis	Tracing dashboard evidence
Week 17	AI/LLM/RL evaluation	Multi-tool evaluation stack (Ragas, DeepEval, promptfoo, TruLens, Phoenix)	Evaluation and metrics sections

Reports and Evaluation Outputs

Report	Concepts Used	Key Metrics Included	Primary Source
R1: Baseline Hallucination Report	Black-box testing, adequacy, oracle checks	accuracy, hallucination rate, correct/hallucinated counts	`experiments/exp1_baseline.py`
R2: Pipeline Effectiveness Report	Validation gates, confusion-matrix analysis	precision, recall, TP/FP/TN/FN, pipeline hallucination rate	`experiments/exp2_pipeline_effectiveness.py`
R3: Architecture Tradeoff Report	Comparative design evaluation	verified count, average latency, average confidence by approach	`experiments/exp3_validator_vs_posthoc.py`
R4: Security Red-Team Report	Adversarial testing and guardrails	safety rate, vulnerable case count, vulnerability list	`experiments/exp4_security_redteam.py`
R5: Source Integration Quality Report	Allowlist governance and provenance attribution	posts created, attribution rate, URL verification rate, domain coverage	`ARTICLE_INTEGRATION_REPORT.md`
R6: Tracing and Observability Report	Runtime observability and diagnostics	trace completeness, node visibility, retry behavior	`LANGGRAPH_INTEGRATION_REPORT.md` and LangSmith runs

Current Evaluation Snapshot (Evidence-Based)

Metric Group	Value	Evidence
Experiment 1 test set size	10 citation prompts	`experiments/exp1_baseline.py` (`GROUND_TRUTH_CITATIONS`)
Experiment 3 comparison sample size	5 prompts	`experiments/exp3_validator_vs_posthoc.py` (`GROUND_TRUTH_CITATIONS[:5]`)
Experiment 4 adversarial test count	10 tests	`experiments/exp4_security_redteam.py` (`RED_TEAM_TESTS`)
Article integration posts created	6	`ARTICLE_INTEGRATION_REPORT.md`
Allowlist domain count	78	`ARTICLE_INTEGRATION_REPORT.md`
Source attribution rate (articles)	100%	`ARTICLE_INTEGRATION_REPORT.md` quality metrics
URL verification rate (articles)	100%	`ARTICLE_INTEGRATION_REPORT.md` quality metrics
Coverage policy threshold	`>= 80%` (target `>= 95%`)	`.agents/legal-luminary/RUBRIC.md`

About This Project

This validation pipeline ensures that information about legal news, judges, elected officials, elections, laws, court documents, and legal templates is grounded in verifiable data from Texas government open data portals, with clear provenance on every output.

Disclaimer: This is an academic project for CS5374 Software Verification and Validation at Texas Tech University. The validation pipeline is designed to reduce hallucination rates but should not be used as the sole source for legal research or advice.

Sources & Verification

Verified: 2026-03-21
