Trustworthy AI Legal and Governmental Content Validator
This project is part of CS5374 Software Verification and Validation at Texas Tech University, Department of Computer Science. The project builds a Trustworthy AI validation pipeline that verifies legal and governmental content against authoritative Texas open data before any AI system presents it to users.
Project Personnel
| Role | Name | Contact | Link |
| --- | --- | --- | --- |
| Student | Scott Weeden | sweeden@ttu.edu | LinkedIn |
| Instructor | Dr. Akbar S. Namin | akbar.namin@ttu.edu | TTU CS Faculty |
Course: CS 5374 - Software Verification and Validation | Spring 2026
Repository: CS5374 Software V&V on GitHub
The Problem: AI Hallucination in Legal Research
Large language models and retrieval-augmented generation (RAG) systems are increasingly used to answer questions about legal and governmental matters, yet they frequently hallucinate or return outdated information. Invented judge names, non-existent laws, fabricated election details, or unverified court documents can cause serious harm: incorrect legal advice, misrepresentation of officials, and invalid citations presented as binding authority.
Notable Cases & Studies
| Reference | Description |
| --- | --- |
| Stanford Law - “Hallucination-Free?” | Assessing the reliability of leading AI legal research tools |
| Stanford Law - “Large Legal Fictions” | Profiling legal hallucinations in large language models |
| Mata v. Avianca, Inc. | Court sanctions for AI-generated fake citations |
Content Verification Architecture
The pipeline verifies content across seven domains using authoritative Texas sources:
| Content Type | Authoritative Texas Source | Verification Approach |
| --- | --- | --- |
| Legal/Government News | Trust lists, NewsGuard, AllSides | URL and domain checks; cross-check with Texas agency press releases |
| Judges | Texas judicial directories, court rosters | Name and court match against official rosters |
| Elected Officials | data.texas.gov, data.capitol.texas.gov | Match names, offices, and terms to official datasets |
| Elections & Opponents | Capitol Data Portal (116+ datasets) | Certified filings and results; candidate/race verification |
| Laws & Ordinances | Texas Legislature, agency sites | Citation and text match against official code/statute datasets |
| Court Documents | Texas court datasets, e-filing metadata | Docket/case ID and document metadata validation |
| Legal Templates | Texas court form registries | Checksum and version validation against known good templates |
Note: Federal sources (CourtListener, PACER, FEC) are not used as primary authorities; the focus is on Texas legal and governmental sources via the Texas Open Data Portal and Capitol Data Portal.
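The checksum and version validation listed for legal templates could look roughly like the sketch below. It is an illustration only: the registry contents, the form ID `example-form`, the version string, and the function name `verify_template` are all hypothetical, and SHA-256 is an assumed choice of hash rather than something the project specifies.

```python
import hashlib

# Hypothetical registry: form ID -> (version, SHA-256 digest of the known-good file).
# A real registry would be built from the Texas court form sources, not hard-coded.
KNOWN_GOOD_TEMPLATES = {
    "example-form": (
        "2024-01",
        hashlib.sha256(b"official template body").hexdigest(),
    ),
}


def verify_template(form_id: str, version: str, content: bytes) -> bool:
    """Return True only if the form ID is known and both version and checksum match."""
    entry = KNOWN_GOOD_TEMPLATES.get(form_id)
    if entry is None:
        return False  # unknown form: nothing to validate against
    known_version, known_digest = entry
    digest = hashlib.sha256(content).hexdigest()
    return version == known_version and digest == known_digest
```

A template that has been altered by even one byte produces a different digest, so it fails the check even when its form ID and version look correct.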
LangGraph Validation Pipeline
The system uses LangChain and LangGraph to implement validator agents that ingest, parse, and verify content at each stage.
Pipeline Stages
- Content Extraction - Parse and normalize input content
- Schema Validation - Verify required fields and data types
- Source Authority Check - Validate against allowlist of authoritative domains
- Temporal Validation - Verify timestamps are valid and current
- Content Verification - Cross-reference with authoritative Texas databases
- Provenance Attribution - Attach verification metadata to all outputs
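The six stages above can be sketched in plain Python as a sequence of checks that stops at the first failure. This is a simplified illustration, not the project's LangGraph implementation: the field names, the two-domain allowlist, the one-year freshness window, and the caller-supplied `content_check` function are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable
from urllib.parse import urlparse

# Assumed values for illustration; the project's real allowlist and schema differ.
ALLOWED_DOMAINS = {"data.texas.gov", "data.capitol.texas.gov"}
REQUIRED_FIELDS = {"title": str, "source_url": str, "published": str}
MAX_AGE = timedelta(days=365)


def extract(raw: dict) -> dict:
    """Content Extraction: normalize by stripping whitespace from string fields."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in raw.items()}


def check_schema(item: dict) -> bool:
    """Schema Validation: every required field is present with the expected type."""
    return all(isinstance(item.get(f), t) for f, t in REQUIRED_FIELDS.items())


def check_authority(item: dict) -> bool:
    """Source Authority Check: the URL's host must be on the allowlist."""
    host = (urlparse(item["source_url"]).hostname or "").lower()
    return host in ALLOWED_DOMAINS


def check_temporal(item: dict, now: datetime) -> bool:
    """Temporal Validation: timestamp parses, is not in the future, and is recent."""
    try:
        published = datetime.fromisoformat(item["published"])
    except ValueError:
        return False
    if published.tzinfo is None:
        published = published.replace(tzinfo=timezone.utc)  # assume UTC if naive
    return published <= now and now - published <= MAX_AGE


def validate(raw: dict, now: datetime, content_check: Callable[[dict], bool]) -> dict:
    """Run all stages in order; attach provenance only if every stage passes."""
    item = extract(raw)
    stages = [
        ("schema", check_schema),
        ("authority", check_authority),
        ("temporal", lambda i: check_temporal(i, now)),
        ("content", content_check),  # cross-reference step, supplied by the caller
    ]
    for name, check in stages:
        if not check(item):
            return {"status": "failed", "failed_stage": name, "item": item}
    item["provenance"] = {
        "source": item["source_url"],
        "date": now.isoformat(),
        "verification_status": "verified",
    }
    return {"status": "verified", "failed_stage": None, "item": item}
```

Ordering the cheap structural checks before the content cross-reference means an item from a non-allowlisted domain is rejected without any dataset lookup.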
Key Features
- Schema validation at every stage
- Source grounding requirements before indexing
- Pass/fail routing with retry or escalation
- Provenance metadata on all outputs (source, date, verification status)
- Only content that passes verification is indexed and made available to downstream AI systems
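One way to read "pass/fail routing with retry or escalation" is sketched below. The policy shown is an assumption for illustration: a definitive failure is rejected, a transient error (modeled here as `ConnectionError`) is retried, and exhausted retries are escalated for human review. The function name `route` and the decision labels are hypothetical.

```python
from typing import Callable


def route(check: Callable[[], bool], max_retries: int = 2) -> dict:
    """Run a check and decide what happens to the content.

    - check returns True   -> "index" (content may be made available downstream)
    - check returns False  -> "reject"
    - check keeps raising  -> "escalate" once the retries are used up
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        try:
            passed = check()
        except ConnectionError:
            continue  # transient failure: try again if attempts remain
        return {"decision": "index" if passed else "reject", "attempts": attempts}
    return {"decision": "escalate", "attempts": attempts}
```

Keeping "reject" and "escalate" separate avoids treating an unreachable data source as evidence that the content is false.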
AI Agent Design Patterns
The project leverages 21 AI agent design patterns documented in the DesignPatterns repository:
| Pattern | Application in Project |
| --- | --- |
| 01 - Prompt Chaining | Sequential validation steps where output of one step feeds the next |
| 02 - Routing | Content-type classification directing to appropriate validators |
| 03 - Parallelization | Concurrent checking of multiple authoritative sources |
| 04 - Reflection | Self-verification of validator outputs before acceptance |
| 05 - Tool Use | Integration with Texas Open Data APIs |
| 06 - Planning | Multi-step validation workflows for complex content |
| 07 - Multi-Agent Collaboration | Distributed validators for different content types |
| 08 - Memory Management | Preservation of verification context across pipeline |
| 09 - Learning & Adaptation | Pattern learning from verification results |
| 10 - Model Context Protocol | Standardized context passing between agents |
| 11 - Goal Setting | Defining verification thresholds and targets |
| 12 - Exception Handling | Graceful handling of API failures and timeouts |
| 13 - Human-in-the-Loop | Escalation paths for ambiguous verifications |
| 14 - RAG (Retrieval-Augmented Generation) | Ground truth retrieval from Texas databases |
| 15 - Inter-Agent Communication | Coordination between validator nodes |
| 16 - Resource-Aware Optimization | Efficient API usage and rate limiting |
| 17 - Reasoning Techniques | Logical inference for complex content types |
| 18 - Guardrails & Safety | Input sanitization and output validation |
| 19 - Evaluation & Monitoring | Metrics tracking with LangSmith/Phoenix |
| 20 - Prioritization | Queue management for verification tasks |
| 21 - Exploration & Discovery | New source identification and validation |
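As an illustration of patterns 03 (Parallelization) and 12 (Exception Handling) together, the sketch below checks one name against several sources concurrently and records a failing source as unknown instead of aborting. The source labels, the lookup callables, and the placeholder name are hypothetical; real lookups would call the Texas Open Data APIs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional


def check_sources(
    name: str, sources: Dict[str, Callable[[str], bool]]
) -> Dict[str, Optional[bool]]:
    """Query every source concurrently.

    Each result is True (found), False (not found), or None (source errored).
    """

    def safe_lookup(lookup: Callable[[str], bool]) -> Optional[bool]:
        try:
            return bool(lookup(name))
        except Exception:
            return None  # source unavailable: unknown, not a negative result

    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {label: pool.submit(safe_lookup, fn) for label, fn in sources.items()}
        return {label: future.result() for label, future in futures.items()}
```

Returning `None` for a failed source lets downstream logic distinguish "not on the roster" from "roster could not be reached", which matters when deciding between rejection and escalation.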
Experiments & Evaluation
Experiment 1: Baseline Hallucination Rate
- Objective: Establish the baseline hallucination rate of an LLM on Texas legal citation tasks without verification
- Data: Held-out set of legal questions with ground-truth citations from data.texas.gov
- Metrics: Proportion of generated citations that do not exist, are misattributed, or have incorrect holdings
- Tools: LangSmith, Ragas, DeepEval, promptfoo
Experiment 2: Verification Pipeline Effectiveness
- Objective: Measure the impact of the Texas-data-backed validator on hallucination and citation quality
- Setup: Same Texas legal citation tasks passed through the LLM, then through the validator
- Metrics: Precision, Recall, Hallucination rate reduction
- Tools: Ragas, LangSmith, Phoenix, DeepEval
Experiment 3: Validator Nodes vs Post-Hoc Verification
- Objective: Compare LangGraph with validator nodes (reject/retry on failure) vs simple RAG with post-hoc filtering
- Metrics: End-to-end accuracy and latency
- Tools: LangSmith, promptfoo, TruLens, Phoenix
Experiment 4: Security Red-Team Evaluation
- Objective: Apply adversarial testing to the validator pipeline
- Tests: Prompt injection, data exfiltration, source spoofing
- Tools: GARAK (NVIDIA), LLM Canary, TextAttack, OpenAttack
- Deliverable: Documented vulnerabilities and mitigations
Evaluation Metric Definitions (Repo-Aligned)
The following metrics align directly with experiment code under experiments/ and rubric expectations under .agents/legal-luminary/RUBRIC.md.
| Metric | Definition | Source in Repo | Target |
| --- | --- | --- | --- |
| Baseline Hallucination Rate | hallucinated / total_questions for unverified citation responses | experiments/exp1_baseline.py | Lower is better |
| Pipeline Precision | TP / (TP + FP) after validator pipeline | experiments/exp2_pipeline_effectiveness.py | >= 0.90 |
| Pipeline Recall | TP / (TP + FN) after validator pipeline | experiments/exp2_pipeline_effectiveness.py | >= 0.85 |
| Hallucination Rate With Pipeline | FP / total_questions after validation | experiments/exp2_pipeline_effectiveness.py | Below baseline |
| Architecture Latency (A vs B) | Mean response time for validator-node graph vs post-hoc verification | experiments/exp3_validator_vs_posthoc.py | Tracked trend |
| Security Safety Rate | safe / total_tests across adversarial suite | experiments/exp4_security_redteam.py | >= 0.90 |
| EP Functional Coverage | Number of EP test cases implemented and passing | .agents/legal-luminary/RUBRIC.md | >= 20 |
| Structural Coverage | Statement coverage across validators and pipeline | .agents/legal-luminary/RUBRIC.md | >= 80% (target >= 95%) |
| Trace Completeness | Share of runs with end-to-end LangSmith traces | .agents/legal-luminary/RUBRIC.md and LangSmith runs | 100% for graded runs |
| Source Attribution Rate | Share of outputs including provenance metadata | Pipeline outputs and integration reports | 100% |
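The precision, recall, and pipeline hallucination-rate definitions in the table can be computed as in the sketch below. The function names are hypothetical and are not taken from the experiment scripts, and the counts in the usage example are made-up numbers chosen only to show the arithmetic, not measured results.

```python
def pipeline_metrics(tp: int, fp: int, fn: int, total_questions: int) -> dict:
    """Compute the table's pipeline metrics from confusion-matrix counts.

    precision          = TP / (TP + FP)
    recall             = TP / (TP + FN)
    hallucination_rate = FP / total_questions
    A zero denominator yields 0.0 instead of raising.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    hallucination_rate = fp / total_questions if total_questions else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "hallucination_rate": hallucination_rate,
    }


def meets_targets(metrics: dict, baseline_rate: float) -> bool:
    """Check the table's targets: precision >= 0.90, recall >= 0.85, rate below baseline."""
    return (
        metrics["precision"] >= 0.90
        and metrics["recall"] >= 0.85
        and metrics["hallucination_rate"] < baseline_rate
    )


# Illustrative counts only (not experiment output): 18 TP, 2 FP, 3 FN over 25 questions.
example = pipeline_metrics(tp=18, fp=2, fn=3, total_questions=25)
```

With these illustrative counts, precision is 18/20 = 0.90, recall is 18/21 (about 0.857), and the pipeline hallucination rate is 2/25 = 0.08.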
LLM / AI Evaluation & Testing
| Tool | Role |
| --- | --- |
| DeepEval | LLM evaluation metrics (faithfulness, answer relevancy) |
| promptfoo | Local testing of LLM application behavior; regression tests |
| Ragas | RAG evaluation using Texas-sourced context and ground truth |
| LangSmith | Tracing and evaluation of LangChain/LangGraph runs |
| TruLens | LLM evaluation framework for monitoring pipeline |
| Phoenix (Arize) | Observability and hallucination detection |
| Langfuse | Open-source LLM engineering platform |
Adversarial & Robustness Testing
| Tool | Role |
| --- | --- |
| GARAK (NVIDIA) | Red-teaming and vulnerability scanning |
| LLM Canary | Security benchmarking test suite |
| TextAttack | Adversarial attacks on validator inputs |
| OpenAttack | Textual adversarial attack toolkit |
Systematic Testing & Error Analysis
| Tool | Role |
| --- | --- |
| Azimuth | Dataset and error analysis for classifiers |
| CheckList | Behavioral NLP testing for validator logic |
| Deepchecks | Validation of ML/data components |
Project Deliverables
First Round
- Design document and threat model for validation pipeline
- Implemented validator modules:
  - Legal news source verification
  - Judge name verification against Texas court rosters
  - Elected official verification against Texas data portals
- LangGraph prototype with validator nodes
- Unit and integration tests with documented coverage
Final Round
- Full validator suite (7 content types)
- Integration with at least one authoritative Texas source per content type
- End-to-end RAG pipeline with validation gates
- Security review report (GARAK red-team results)
- Evaluation metrics report (Experiments 1-3)
Texas Open Data Resources
| Category | Links |
| --- | --- |
| Core Frameworks | LangChain, LangGraph, LangSmith |
| Evaluation | DeepEval, Ragas, TruLens, Phoenix, Langfuse |
| Testing | promptfoo, CheckList, Deepchecks |
| Security | GARAK, LLM Canary, TextAttack, OpenAttack |
Course Alignment
Week-by-Week Topic and Item Alignment
| Week | Topic Discussed | Item(s) Discussed on This Page | Artifact / Report |
| --- | --- | --- | --- |
| Week 1 | Introduction to V&V | Verification architecture, source grounding, provenance requirements | Problem and architecture sections |
| Week 2 | Adequacy criterion | Definition of “verified” (schema + authority + temporal validity + provenance) | Pipeline stage definitions |
| Week 3 | Project proposal and hypothesis | Threat model and experiment plan for legal validator | Proposal and rubric alignment |
| Week 4 | Black-box testing | Output-level checks of LLM answers against authoritative Texas data | Experiment 1 baseline setup |
| Week 5 | LangGraph and LangSmith | Validator-node workflow, routing, traceability, observability | LANGGRAPH_INTEGRATION_REPORT.md |
| Week 6 | Functional and structural testing | EP testing strategy and coverage thresholds | Rubric components for EP and coverage |
| Week 7 | Baseline model behavior | Baseline hallucination measurement | Experiment 1 report |
| Week 8 | Validation effectiveness | Precision/recall and hallucination reduction with pipeline | Experiment 2 report |
| Week 9 | Architecture comparison | Validator nodes versus post-hoc verification tradeoff | Experiment 3 report |
| Week 10 | Security robustness | Prompt injection, source spoofing, exfiltration resilience | Experiment 4 red-team report |
| Week 11 | Concept communication | Documentation and conceptual synthesis for trustworthy legal AI | Blog/report deliverables |
| Week 12 | Formal verification | Verification contracts and explicit pass/fail validator gates | Pipeline contract design |
| Week 13 | Model checking | State-transition reasoning over validator routing and outcomes | Validator state and routing logic |
| Week 16 | Tracing hands-on | LangSmith trace coverage and retry analysis | Tracing dashboard evidence |
| Week 17 | AI/LLM/RL evaluation | Multi-tool evaluation stack (Ragas, DeepEval, promptfoo, TruLens, Phoenix) | Evaluation and metrics sections |
Reports and Evaluation Outputs
| Report | Concepts Used | Key Metrics Included | Primary Source |
| --- | --- | --- | --- |
| R1: Baseline Hallucination Report | Black-box testing, adequacy, oracle checks | accuracy, hallucination rate, correct/hallucinated counts | experiments/exp1_baseline.py |
| R2: Pipeline Effectiveness Report | Validation gates, confusion-matrix analysis | precision, recall, TP/FP/TN/FN, pipeline hallucination rate | experiments/exp2_pipeline_effectiveness.py |
| R3: Architecture Tradeoff Report | Comparative design evaluation | verified count, average latency, average confidence by approach | experiments/exp3_validator_vs_posthoc.py |
| R4: Security Red-Team Report | Adversarial testing and guardrails | safety rate, vulnerable case count, vulnerability list | experiments/exp4_security_redteam.py |
| R5: Source Integration Quality Report | Allowlist governance and provenance attribution | posts created, attribution rate, URL verification rate, domain coverage | ARTICLE_INTEGRATION_REPORT.md |
| R6: Tracing and Observability Report | Runtime observability and diagnostics | trace completeness, node visibility, retry behavior | LANGGRAPH_INTEGRATION_REPORT.md and LangSmith runs |
Current Evaluation Snapshot (Evidence-Based)
| Metric Group | Value | Evidence |
| --- | --- | --- |
| Experiment 1 test set size | 10 citation prompts | experiments/exp1_baseline.py (GROUND_TRUTH_CITATIONS) |
| Experiment 3 comparison sample size | 5 prompts | experiments/exp3_validator_vs_posthoc.py (GROUND_TRUTH_CITATIONS[:5]) |
| Experiment 4 adversarial test count | 10 tests | experiments/exp4_security_redteam.py (RED_TEAM_TESTS) |
| Article integration posts created | 6 | ARTICLE_INTEGRATION_REPORT.md |
| Allowlist domain count | 78 | ARTICLE_INTEGRATION_REPORT.md |
| Source attribution rate (articles) | 100% | ARTICLE_INTEGRATION_REPORT.md quality metrics |
| URL verification rate (articles) | 100% | ARTICLE_INTEGRATION_REPORT.md quality metrics |
| Coverage policy threshold | >= 80% (target >= 95%) | .agents/legal-luminary/RUBRIC.md |
About This Project
This validation pipeline ensures that information about legal news, judges, elected officials, elections, laws, court documents, and legal templates is grounded in verifiable data from Texas government open data portals, with clear provenance on every output.
Disclaimer: This is an academic project for CS5374 Software Verification and Validation at Texas Tech University. The validation pipeline is designed to reduce hallucination rates but should not be used as the sole source for legal research or advice.
Sources & Verification
Verified: 2026-03-21