Schema Translator Requirements Document
Project Overview
Build an intelligent contract schema translation system that enables querying across multiple enterprise customers with heterogeneous database schemas using LLM-powered semantic understanding.
Core Problem
- Multiple enterprise customers maintain different database schemas for contract data
- Same business concepts represented differently across customers (structure, data types, semantics)
- Need unified query interface that works across all customer schemas
- Must handle semantic ambiguity (e.g., "contract_value" meaning lifetime total vs. annual revenue)
Success Criteria
- User can ask natural language questions about contracts across all customers
- System correctly interprets semantic differences between customer schemas
- Results are normalized and harmonized for cross-customer comparison
- New customers can be onboarded with minimal manual configuration
- System learns and adapts from query patterns and errors
Architecture Components
1. Mock Data Layer
Purpose: Generate realistic contract databases for 6 different customer schemas
Requirements:
- Create SQLite databases for 6 customers (A, B, C, D, E, F)
- Each customer has unique schema characteristics:
- Customer A: Single table, DATE expiry, LIFETIME contract_value
- Customer B: Normalized (3 tables: headers, status_history, renewal_schedule), LIFETIME value
- Customer C: Single table, DATE expiry, LIFETIME value, different column names
- Customer D: Single table, INTEGER days_remaining instead of date, LIFETIME value
- Customer E: Single table, DATE expiry, LIFETIME contract_value with explicit duration
- Customer F: Single table, DATE expiry, ANNUAL contract_value (ARR)
Data Requirements:
- 50 contracts per customer
- Fields: contract ID/name, status, expiration (various formats), value, industry/sector, customer name, start date
- Realistic value ranges: $100K-$5M lifetime, $100K-$2M annual
- Mix of expiration dates: some already expired (up to 60 days in the past), some upcoming (0-365 days)
- Industry variety: Technology, Healthcare, Finance, Manufacturing, Retail (with variations in naming)
Deliverable:
- schema_translator/mock_data.py - MockDataGenerator class
- databases/ directory with 6 SQLite .db files
- Script to regenerate all databases
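A minimal sketch of the generator for one schema (Customer A: single table, DATE expiry, LIFETIME value), assuming illustrative column names; the real MockDataGenerator would cover all six customers:
# Sketch only: column names for Customer A are illustrative, not final.
import random
import sqlite3
from datetime import date, timedelta
from pathlib import Path

INDUSTRIES = ["Technology", "Healthcare", "Finance", "Manufacturing", "Retail"]

def generate_customer_a(db_path: Path, n_contracts: int = 50) -> None:
    """Create customer_a.db with a single-table schema (DATE expiry, LIFETIME value)."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS contracts (
            contract_id INTEGER PRIMARY KEY,
            contract_name TEXT,
            status TEXT,
            start_date TEXT,
            expiry_date TEXT,
            contract_value REAL,
            industry TEXT,
            customer_name TEXT
        )
    """)
    rows = []
    for i in range(1, n_contracts + 1):
        # Mix of expirations: roughly 60 days in the past to 365 days out
        expiry = date.today() + timedelta(days=random.randint(-60, 365))
        start = expiry - timedelta(days=random.randint(365, 3 * 365))
        rows.append((
            i,
            f"Contract-{i:03d}",
            "active" if expiry >= date.today() else "expired",
            start.isoformat(),
            expiry.isoformat(),
            random.randint(100_000, 5_000_000),   # lifetime total
            random.choice(INDUSTRIES),
            f"Client {i:03d}",
        ))
    conn.executemany("INSERT OR REPLACE INTO contracts VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    Path("databases").mkdir(exist_ok=True)
    generate_customer_a(Path("databases/customer_a.db"))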
2. Data Models
Purpose: Define typed data structures for the entire system
Requirements:
- SchemaColumn: name, data_type, semantic_meaning, transformations, sample_values
- SchemaTable: name, columns, relationships
- CustomerSchema: customer_id, customer_name, tables, semantic_notes, last_analyzed
- SemanticConcept: concept_id, concept_name, description, customer_mappings
- QueryFilter: concept, operator, value, semantic_note
- SemanticQueryPlan: intent, filters, aggregations, projections
- QueryResult: customer_id, data, sql_executed, execution_time_ms
- HarmonizedResult: results, total_count, customers_queried
Deliverable:
- schema_translator/models.py using Pydantic BaseModel
- Full type hints and validation
- JSON serialization support
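A minimal sketch of a few of these models, assuming Pydantic v2; the real models.py would add the remaining classes, validators, and docstrings:
# Sketch only: a subset of the models listed above, using Pydantic v2.
from typing import Any, Optional
from pydantic import BaseModel

class QueryFilter(BaseModel):
    concept: str
    operator: str
    value: Any
    semantic_note: Optional[str] = None

class SemanticQueryPlan(BaseModel):
    intent: str
    filters: list[QueryFilter] = []
    aggregations: Optional[list[str]] = None
    projections: list[str] = []

class QueryResult(BaseModel):
    customer_id: str
    data: list[dict[str, Any]]
    sql_executed: str
    execution_time_ms: float

class HarmonizedResult(BaseModel):
    results: list[dict[str, Any]]
    total_count: int
    customers_queried: list[str]

# JSON round-tripping comes for free with Pydantic:
plan = SemanticQueryPlan(
    intent="find_contracts",
    filters=[QueryFilter(concept="contract_value", operator="greater_than", value=500000)],
)
print(plan.model_dump_json(indent=2))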
3. Configuration Management
Purpose: Centralized configuration for API keys, paths, LLM settings
Requirements:
- Load from environment variables and .env file
- ANTHROPIC_API_KEY (required)
- Model name: claude-sonnet-4-20250514
- Database directory path
- Knowledge graph storage path
- LLM parameters: max_tokens, temperature
- Validation on startup
Deliverable:
- schema_translator/config.py - Config class
- .env.example file
- Configuration validation method
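One possible shape for config.py, assuming python-dotenv for .env loading; defaults other than the required ANTHROPIC_API_KEY are illustrative:
# Sketch only: default values and attribute names are assumptions.
import os
from dataclasses import dataclass
from pathlib import Path
from dotenv import load_dotenv

@dataclass
class Config:
    anthropic_api_key: str
    model_name: str = "claude-sonnet-4-20250514"
    database_dir: Path = Path("databases")
    knowledge_graph_path: Path = Path("knowledge_graph.json")
    max_tokens: int = 2048
    temperature: float = 0.0

    @classmethod
    def load(cls) -> "Config":
        load_dotenv()  # read .env if present
        cfg = cls(anthropic_api_key=os.getenv("ANTHROPIC_API_KEY", ""))
        cfg.validate()
        return cfg

    def validate(self) -> None:
        if not self.anthropic_api_key:
            raise ValueError("ANTHROPIC_API_KEY is required (set it in .env)")
        if not self.database_dir.exists():
            raise ValueError(f"Database directory not found: {self.database_dir}")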
4. Knowledge Graph
Purpose: Store semantic relationships between schemas and concepts
Requirements:
Core Concepts to Model:
- contract_expiration - When contract ends (DATE or INTEGER days)
- contract_value - Monetary value (LIFETIME vs ANNUAL semantics)
- contract_status - Current state (single table vs history table)
- industry_sector - Business vertical (various naming conventions)
- contract_identifier - Primary key (ID vs name)
- customer_name - Client organization
- contract_start - Contract begin date
Customer Mappings:
For each customer, map each concept to:
- Table name
- Column name
- Data type
- Semantic type (e.g., LIFETIME_TOTAL, ANNUAL_RECURRING_REVENUE)
- Required transformation (if any)
- Join requirements (for multi-table schemas)
Transformation Rules:
- days_remaining → DATE: DATE_ADD(NOW(), INTERVAL days_remaining DAY)
- DATE → days_remaining: DATEDIFF(expiry_date, NOW())
- ANNUAL → LIFETIME: contract_value * term_years
- LIFETIME → ANNUAL: contract_value / term_years
Deliverable:
- schema_translator/knowledge_graph.py - SchemaKnowledgeGraph class
- Methods: add_concept(), add_customer_mapping(), add_transformation(), get_mapping(), save(), load()
- JSON persistence
- NetworkX graph structure for relationship traversal
- Pre-populated with all 6 customer mappings
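A sketch of the knowledge graph API, assuming customer mappings are stored as edge attributes on a NetworkX DiGraph; attribute names are illustrative:
# Sketch only: storage layout (nodes, edge attributes) is an assumption.
import json
from pathlib import Path
import networkx as nx

class SchemaKnowledgeGraph:
    def __init__(self) -> None:
        self.graph = nx.DiGraph()

    def add_concept(self, concept_id: str, description: str) -> None:
        self.graph.add_node(concept_id, kind="concept", description=description)

    def add_customer_mapping(self, customer_id: str, concept_id: str, mapping: dict) -> None:
        # mapping keys: table, column, data_type, semantic_type, transformation, joins
        self.graph.add_node(customer_id, kind="customer")
        self.graph.add_edge(customer_id, concept_id, **mapping)

    def get_mapping(self, customer_id: str, concept_id: str) -> dict:
        return dict(self.graph.edges[customer_id, concept_id])

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(nx.node_link_data(self.graph), indent=2))

    def load(self, path: Path) -> None:
        self.graph = nx.node_link_graph(json.loads(path.read_text()))

# Example: Customer D stores expiry as an INTEGER days_remaining column.
kg = SchemaKnowledgeGraph()
kg.add_concept("contract_expiration", "When the contract ends")
kg.add_customer_mapping("customer_d", "contract_expiration", {
    "table": "contracts",
    "column": "days_remaining",
    "data_type": "INTEGER",
    "semantic_type": "DAYS_REMAINING",
    "transformation": "DATE(CURRENT_DATE, '+' || days_remaining || ' days')",
})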
5. LLM Schema Analyzer Agent
Purpose: Automatically analyze new customer schemas and propose mappings
Requirements:
Core Functionality:
Analyze Schema:
- Input: Customer database connection, customer_id
- Extract table structures, column names, data types
- Query sample data from each column
- Identify primary keys and foreign keys
Semantic Analysis:
- Use LLM to interpret column names and sample values
- Determine semantic meaning (is "contract_value" lifetime or annual?)
- Identify calculation needs (days_remaining vs date conversion)
- Detect normalization patterns (single vs multi-table)
Mapping Proposal:
- Generate proposed mappings to canonical concepts
- Suggest transformation logic
- Flag ambiguities requiring human confirmation
- Provide confidence scores
Validation:
- Test proposed mappings with sample queries
- Verify JOIN logic for multi-table schemas
- Confirm value ranges are reasonable
LLM Prompts Required:
- Schema interpretation prompt
- Semantic disambiguation prompt (for ambiguous columns)
- Transformation generation prompt
- Validation prompt
Deliverable:
- schema_translator/agents/schema_analyzer.py - SchemaAnalyzerAgent class
- Methods: analyze_schema(), propose_mappings(), validate_mappings()
- Integration with Anthropic API
- Structured output parsing
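A hedged sketch of the analyzer's LLM call, assuming the Anthropic Python SDK and an illustrative prompt; the real agent adds sample-value extraction, few-shot examples, and robust structured-output parsing:
# Sketch only: the prompt and JSON contract are illustrative assumptions.
import json
import sqlite3
import anthropic

def describe_schema(db_path: str) -> str:
    """Dump table/column structure from a SQLite database for the prompt."""
    conn = sqlite3.connect(db_path)
    tables = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
    lines = []
    for table in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        lines.append(f"{table}: " + ", ".join(f"{c[1]} {c[2]}" for c in cols))
    conn.close()
    return "\n".join(lines)

def propose_mappings(db_path: str, customer_id: str, model: str) -> dict:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    prompt = (
        f"Customer {customer_id} schema:\n{describe_schema(db_path)}\n\n"
        "Map each column to one of these canonical concepts: contract_expiration, "
        "contract_value, contract_status, industry_sector, contract_identifier, "
        "customer_name, contract_start. For contract_value, decide LIFETIME_TOTAL "
        "vs ANNUAL_RECURRING_REVENUE and give a confidence score. Respond as JSON."
    )
    response = client.messages.create(
        model=model, max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)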
6. Query Understanding Agent
Purpose: Convert natural language queries to Semantic Query Plans
Requirements:
Input Processing:
- Accept natural language questions about contracts
- Parse intent (find, count, aggregate, compare)
- Extract filters (time ranges, value thresholds, industry, status)
- Identify aggregations (sum, count, average)
- Determine projections (which fields to return)
Semantic Query Plan Generation:
Convert to schema-independent intermediate representation:
{
"intent": "find_contracts",
"filters": [
{
"concept": "contract_expiration",
"operator": "within_next_days",
"value": 30,
"semantic_note": "expiration may be date or days_remaining"
},
{
"concept": "contract_value",
"operator": "greater_than",
"value": 500000,
"semantic_note": "need to normalize to lifetime total"
},
{
"concept": "industry_sector",
"operator": "equals",
"value": "Technology"
}
],
"projections": ["contract_identifier", "contract_value", "contract_expiration", "customer_name"],
"aggregations": null
}
Operator Mapping:
- Natural language → standardized operators
- "expiring soon", "due in next X days" → within_next_days
- "over $X", "worth more than" → greater_than
- "in Q1", "between dates" → date_range
- "technology sector", "tech companies" → equals (with normalization)
LLM Prompts Required:
- Query understanding prompt
- Ambiguity resolution prompt
- Context-aware concept extraction
Deliverable:
- schema_translator/agents/query_understanding.py - QueryUnderstandingAgent class
- Methods: parse_query(), create_semantic_plan()
- Natural language processing with LLM
- Structured SemanticQueryPlan output
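A sketch of the query-understanding step, assuming the SemanticQueryPlan model from models.py and the Anthropic SDK; the prompt text is illustrative, not the final engineered prompt:
# Sketch only: prompt wording and retry/clarification handling are omitted.
import anthropic
from schema_translator.models import SemanticQueryPlan

QUERY_PROMPT = """Convert the user's question about contracts into a JSON semantic
query plan with keys: intent, filters (concept, operator, value, semantic_note),
projections, aggregations. Use only these concepts: contract_expiration,
contract_value, contract_status, industry_sector, contract_identifier,
customer_name, contract_start. Question: {question}"""

class QueryUnderstandingAgent:
    def __init__(self, model: str) -> None:
        self.client = anthropic.Anthropic()
        self.model = model

    def parse_query(self, question: str) -> SemanticQueryPlan:
        response = self.client.messages.create(
            model=self.model, max_tokens=1024,
            messages=[{"role": "user", "content": QUERY_PROMPT.format(question=question)}],
        )
        # model_validate_json raises if the LLM output does not match the schema,
        # which is the signal to retry or ask the user for clarification.
        return SemanticQueryPlan.model_validate_json(response.content[0].text)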
7. Query Compiler
Purpose: Compile Semantic Query Plans to customer-specific SQL
Requirements:
Core Compilation:
Concept Resolution:
- For each concept in plan, lookup customer mapping in knowledge graph
- Retrieve table, column, type, semantic type
- Identify required transformations
SQL Generation:
- Build SELECT clause with projections
- Build FROM clause (single table or JOINs)
- Build WHERE clause with filters
- Apply transformations in WHERE/SELECT as needed
- Build GROUP BY, HAVING for aggregations
Schema-Specific Handling:
- Single table (A, C, D, E, F): Direct SELECT
- Multi-table (B): Auto-generate JOINs
- Date fields (A, C, E, F): Direct date comparisons
- Days remaining (D): Convert to date calculations
- Lifetime value (A, C, D, E): Direct value comparison
- Annual value (F): Multiply by term_years for lifetime equivalent
Industry Normalization:
- Map "Technology" β customer-specific terms (Tech, Information Technology)
- Handle semantic equivalence
Example Compilations:
Query: "Contracts expiring in next 30 days worth over $500K"
Customer A SQL:
SELECT contract_id, contract_name, contract_value, expiry_date, customer_name
FROM contracts
WHERE expiry_date BETWEEN CURRENT_DATE AND DATE(CURRENT_DATE, '+30 days')
AND contract_value > 500000
Customer B SQL:
SELECT h.id, h.contract_name, h.contract_value, r.renewal_date, h.client_name
FROM contract_headers h
JOIN renewal_schedule r ON h.id = r.contract_id
JOIN contract_status_history s ON h.id = s.contract_id
WHERE r.renewal_date BETWEEN CURRENT_DATE AND DATE(CURRENT_DATE, '+30 days')
AND h.contract_value > 500000
AND s.status_date = (
SELECT MAX(status_date)
FROM contract_status_history
WHERE contract_id = h.id
)
Customer D SQL:
SELECT contract_id, contract_title, contract_value,
DATE(CURRENT_DATE, '+' || days_remaining || ' days') as expiry_date,
customer_org
FROM contracts
WHERE days_remaining BETWEEN 0 AND 30
AND contract_value > 500000
Customer F SQL:
SELECT contract_id, name,
(contract_value * term_years) as lifetime_value,
expiration_date, account
FROM contracts
WHERE expiration_date BETWEEN CURRENT_DATE AND DATE(CURRENT_DATE, '+30 days')
AND (contract_value * term_years) > 500000
Deliverable:
- schema_translator/query_compiler.py - QueryCompiler class
- Methods: compile_for_customer(), generate_select(), generate_where(), generate_joins()
- SQL injection prevention
- Query validation
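A minimal compilation sketch for two operators, assuming the mapping shape stored in the knowledge graph and the term_years column shown in the Customer F example above; the full QueryCompiler also handles joins, projections, aggregations, and query validation:
# Sketch only: mapping keys and semantic_type values are assumptions.
def compile_filter(mapping: dict, operator: str, value) -> tuple[str, list]:
    """Return a WHERE fragment and its bound parameters for one concept filter."""
    column = mapping["column"]
    if operator == "within_next_days":
        if mapping.get("semantic_type") == "DAYS_REMAINING":
            # Customer D style: integer days, no date math needed
            return f"{column} BETWEEN 0 AND ?", [value]
        return (f"{column} BETWEEN CURRENT_DATE AND "
                "DATE(CURRENT_DATE, '+' || ? || ' days')"), [value]
    if operator == "greater_than":
        if mapping.get("semantic_type") == "ANNUAL_RECURRING_REVENUE":
            # Customer F style: normalize annual value to lifetime before comparing
            return f"({column} * term_years) > ?", [value]
        return f"{column} > ?", [value]
    raise ValueError(f"Unsupported operator: {operator}")

# The same semantic filter compiles to different SQL per customer:
d_mapping = {"column": "days_remaining", "semantic_type": "DAYS_REMAINING"}
a_mapping = {"column": "expiry_date", "semantic_type": "DATE"}
print(compile_filter(d_mapping, "within_next_days", 30))
print(compile_filter(a_mapping, "within_next_days", 30))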
8. Database Executor
Purpose: Execute customer-specific SQL queries and collect results
Requirements:
Execution:
- Connect to customer SQLite databases
- Execute compiled SQL queries
- Measure execution time
- Handle database errors gracefully
- Return structured QueryResult objects
Multi-Customer Execution:
- Execute queries across multiple customers in parallel (optional: can be sequential for MVP)
- Collect results from all customers
- Tag results with customer_id
- Include executed SQL for debugging
Error Handling:
- Catch SQL errors (invalid column, syntax, etc.)
- Return partial results if some customers succeed
- Log errors with context
- Provide actionable error messages
Deliverable:
- schema_translator/database_executor.py - DatabaseExecutor class
- Methods: execute_query(), execute_for_customer(), execute_all_customers()
- Connection pooling
- Result aggregation
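An executor sketch using sqlite3, assuming the QueryResult model from models.py; parallel execution and connection pooling are omitted from this minimal version:
# Sketch only: sequential execution, no pooling, print-based error reporting.
import sqlite3
import time
from pathlib import Path
from schema_translator.models import QueryResult

class DatabaseExecutor:
    def __init__(self, database_dir: Path) -> None:
        self.database_dir = database_dir

    def execute_for_customer(self, customer_id: str, sql: str,
                             params: list | None = None) -> QueryResult:
        db_path = self.database_dir / f"{customer_id}.db"
        conn = sqlite3.connect(db_path)
        conn.row_factory = sqlite3.Row  # rows behave like dicts
        start = time.perf_counter()
        try:
            rows = [dict(r) for r in conn.execute(sql, params or [])]
        finally:
            conn.close()
        return QueryResult(
            customer_id=customer_id,
            data=rows,
            sql_executed=sql,
            execution_time_ms=(time.perf_counter() - start) * 1000,
        )

    def execute_all_customers(self, compiled: dict[str, tuple[str, list]]) -> list[QueryResult]:
        """compiled maps customer_id -> (sql, params); failures yield partial results."""
        results = []
        for customer_id, (sql, params) in compiled.items():
            try:
                results.append(self.execute_for_customer(customer_id, sql, params))
            except sqlite3.Error as exc:
                print(f"[{customer_id}] query failed: {exc}")  # real class would log with context
        return results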
9. Result Harmonizer
Purpose: Normalize and harmonize results from different customer schemas
Requirements:
Normalization:
Value Normalization:
- Convert all contract values to same semantic unit (lifetime total)
- Customer F: multiply annual by term_years
- Add metadata indicating original vs calculated
Date Normalization:
- Convert all expiration representations to ISO date format
- Customer D: calculate actual date from days_remaining
- Ensure consistent timezone handling
Field Name Normalization:
- Map customer-specific field names to canonical names
- contract_id/id → contract_identifier
- customer_name/client_name/account → customer_name
- expiry_date/expiration_date/renewal_date → expiration_date
Industry Name Normalization:
- Map variations to canonical terms
- "Tech", "Technology", "Information Technology" β "Technology"
- Maintain consistent taxonomy
Result Formatting:
- Unified schema for display
- Include source customer
- Include both original and normalized values where applicable
- Sort by specified criteria (e.g., expiration date, value)
- Support pagination
Deliverable:
- schema_translator/result_harmonizer.py - ResultHarmonizer class
- Methods: harmonize_results(), normalize_value(), normalize_date(), normalize_industry()
- Configurable normalization rules
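A normalization sketch, assuming semantic_type and term_years metadata are available from the knowledge graph; the alias tables are illustrative and would be configurable in the real class:
# Sketch only: alias dictionaries are examples, not the full normalization rules.
from datetime import date, timedelta

FIELD_ALIASES = {
    "contract_id": "contract_identifier", "id": "contract_identifier",
    "client_name": "customer_name", "account": "customer_name",
    "expiry_date": "expiration_date", "renewal_date": "expiration_date",
}
INDUSTRY_ALIASES = {"Tech": "Technology", "Information Technology": "Technology"}

def normalize_row(row: dict, semantic_type: str, term_years: float | None = None) -> dict:
    out = {FIELD_ALIASES.get(k, k): v for k, v in row.items()}
    # Value normalization: annual (ARR) -> lifetime total, keeping the original
    if semantic_type == "ANNUAL_RECURRING_REVENUE" and term_years:
        out["original_contract_value"] = out.get("contract_value")
        out["contract_value"] = out["contract_value"] * term_years
    # Date normalization: integer days_remaining -> ISO date
    if "days_remaining" in out:
        out["expiration_date"] = (date.today()
                                  + timedelta(days=out.pop("days_remaining"))).isoformat()
    if "industry" in out:
        out["industry"] = INDUSTRY_ALIASES.get(out["industry"], out["industry"])
    return out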
10. Chat Orchestrator
Purpose: Coordinate the entire query flow from natural language to harmonized results
Requirements:
Workflow:
- Receive natural language query from user
- Optionally filter by specific customer(s) or query all
- Call QueryUnderstandingAgent → SemanticQueryPlan
- For each customer:
- Call QueryCompiler → customer-specific SQL
- Call DatabaseExecutor → QueryResult
- Call ResultHarmonizer → HarmonizedResult
- Format for display
- Return to user interface
Additional Features:
- Query validation before execution
- Caching for repeated queries
- Query history tracking
- Performance metrics (end-to-end latency)
- Error aggregation and reporting
Conversation Context:
- Maintain conversation state for follow-up questions
- Reference previous queries ("show me more like this")
- Clarification handling ("Did you mean X or Y?")
Deliverable:
- schema_translator/orchestrator.py - ChatOrchestrator class
- Methods: process_query(), validate_query(), execute_pipeline()
- Integration with all agents and components
- Comprehensive logging
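A pipeline sketch showing how the orchestrator wires the components together; caching, query history, and conversation context are omitted here:
# Sketch only: constructor arguments mirror the components defined above.
class ChatOrchestrator:
    def __init__(self, understanding, compiler, executor, harmonizer, customer_ids):
        self.understanding = understanding
        self.compiler = compiler
        self.executor = executor
        self.harmonizer = harmonizer
        self.customer_ids = customer_ids

    def process_query(self, question: str, customers: list[str] | None = None):
        plan = self.understanding.parse_query(question)           # NL -> semantic plan
        targets = customers or self.customer_ids
        compiled = {cid: self.compiler.compile_for_customer(plan, cid)
                    for cid in targets}                            # plan -> per-customer SQL
        results = self.executor.execute_all_customers(compiled)    # SQL -> QueryResult list
        return self.harmonizer.harmonize_results(results)          # -> HarmonizedResult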
11. Chat Interface (Chainlit)
Purpose: Web-based chat UI for interacting with the Schema Translator
Requirements:
UI Features:
Chat Interface:
- Natural language input box
- Streaming responses
- Display harmonized results in tables
- Show per-customer breakdowns
- Display executed SQL (collapsible/expandable)
- Error messages with retry options
Customer Selection:
- Dropdown or checkboxes to select specific customers
- "All customers" option (default)
- Show customer schema summaries on hover
Result Display:
- Table view with sortable columns
- Highlight normalized values
- Show original vs calculated values
- Export to CSV option
- Pagination for large result sets
Query Examples:
- Pre-populated example queries
- "Show me contracts expiring in next 30 days"
- "Find technology contracts over $1M"
- "What's the total value of contracts expiring in Q1?"
- "Compare contract values across customers"
Debug Mode:
- Toggle to show semantic query plan
- Show customer-specific SQL for each customer
- Show knowledge graph lookups
- Execution time per customer
Chainlit-Specific:
- Use @cl.on_message for query handling
- Use cl.Message for responses
- Use cl.DataTable for results display (or custom HTML)
- Use cl.Action for interactive buttons
- Session management for conversation context
Deliverable:
- app.py - Main Chainlit application
- Integration with ChatOrchestrator
- Responsive UI with tables and formatting
- Error handling and user feedback
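A minimal app.py sketch, assuming a hypothetical build_orchestrator() factory that wires up the pipeline; results are rendered as a markdown table rather than a dedicated table element:
# Sketch only: build_orchestrator() is a hypothetical helper, not part of the spec.
import chainlit as cl
from schema_translator.orchestrator import ChatOrchestrator

@cl.on_chat_start
async def start():
    cl.user_session.set("orchestrator", build_orchestrator())  # hypothetical factory
    await cl.Message(content="Ask me about contracts across all customers.").send()

@cl.on_message
async def on_message(message: cl.Message):
    orchestrator: ChatOrchestrator = cl.user_session.get("orchestrator")
    try:
        result = orchestrator.process_query(message.content)
    except Exception as exc:
        await cl.Message(content=f"Query failed: {exc}").send()
        return
    rows = result.results[:20]
    if not rows:
        await cl.Message(content="No matching contracts found.").send()
        return
    # Render harmonized rows as a markdown table
    header = " | ".join(rows[0].keys())
    divider = " | ".join("---" for _ in rows[0])
    body = "\n".join(" | ".join(str(v) for v in r.values()) for r in rows)
    await cl.Message(content=f"{header}\n{divider}\n{body}").send()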
12. Learning and Feedback Loop
Purpose: Improve system over time based on query results and user feedback
Requirements:
Feedback Collection:
- Capture user corrections ("This result is wrong")
- Track query failures and errors
- Log semantic ambiguities that required clarification
- Record query patterns and frequency
Learning Mechanisms:
Schema Drift Detection:
- Periodically check customer schemas for changes
- Detect new columns, modified types
- Flag when mappings may be stale
Transformation Refinement:
- When queries fail, analyze error
- Use LLM to propose fixes
- Update knowledge graph with successful fixes
Semantic Learning:
- Track which industry name mappings are used most
- Learn customer-specific terminology preferences
- Improve concept disambiguation over time
Query Pattern Analysis:
- Identify common query types
- Pre-generate optimized mappings for frequent patterns
- Suggest query completions
Deliverable:
- schema_translator/learning/feedback_loop.py - FeedbackLoop class
- schema_translator/learning/schema_drift_detector.py - SchemaDriftDetector class
- Methods: record_feedback(), detect_drift(), refine_transformations()
- Periodic background tasks
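A sketch of schema-drift detection, assuming the knowledge graph records the column set seen at the last analysis:
# Sketch only: the known_columns structure is an assumption about stored state.
import sqlite3

def detect_drift(db_path: str, known_columns: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return columns present in the live schema but missing from the stored mappings."""
    conn = sqlite3.connect(db_path)
    drift: dict[str, set[str]] = {}
    tables = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
    for table in tables:
        current = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        new_columns = current - known_columns.get(table, set())
        if new_columns:
            drift[table] = new_columns  # flag mappings that may be stale
    conn.close()
    return drift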
Implementation Phases
Phase 1: Foundation (Week 1)
Goal: Core data structures and mock data
Tasks:
- Set up project structure
- Implement configuration management
- Define all Pydantic models
- Generate mock data for all 6 customers
- Test database creation and data access
Validation:
- All 6 SQLite databases created with 50 contracts each
- Can query each database directly
- Models serialize to/from JSON correctly
Phase 2: Knowledge Graph (Week 1-2)
Goal: Build and populate the knowledge graph
Tasks:
- Implement SchemaKnowledgeGraph class
- Define 7 core concepts with aliases
- Create mappings for all 6 customers
- Implement transformation rules
- Add JSON persistence
- Write unit tests
Validation:
- Knowledge graph loads and saves correctly
- Can query concept mappings for each customer
- Transformations are correctly defined
- No missing mappings for any customer
Phase 3: Query Compiler & Executor (Week 2)
Goal: Generate and execute customer-specific SQL
Tasks:
- Implement QueryCompiler
- Handle single-table schemas (A, C, D, E, F)
- Handle multi-table schema (B)
- Implement date/days_remaining conversions
- Implement lifetime/annual value conversions
- Implement DatabaseExecutor
- Write extensive unit tests with real queries
Validation:
- Same semantic query compiles to different SQL for each customer
- All compiled queries execute successfully
- Results are returned in QueryResult format
- Customer D date calculations are correct
- Customer F value normalization is correct
- Customer B joins work correctly
Phase 4: LLM Agents (Week 2-3)
Goal: Implement intelligent query understanding and schema analysis
Tasks:
- Set up Anthropic API integration
- Implement QueryUnderstandingAgent
- Test natural language → SemanticQueryPlan conversion
- Implement SchemaAnalyzerAgent
- Test auto-analysis on a mock "Customer G"
- Implement prompt engineering and few-shot examples
Validation:
- Natural language queries correctly parsed
- Semantic query plans are accurate
- Schema analyzer correctly interprets new schemas
- LLM proposes reasonable mappings
- Ambiguity detection works (flags when uncertain)
Phase 5: Result Harmonization (Week 3)
Goal: Normalize and present unified results
Tasks:
- Implement ResultHarmonizer
- Implement value normalization (annual → lifetime)
- Implement date normalization (days → date)
- Implement field name mapping
- Implement industry name normalization
- Test with multi-customer result sets
Validation:
- Customer F annual values correctly converted
- Customer D dates correctly calculated
- Field names unified across customers
- Industry names normalized
- Results sortable and filterable
Phase 6: Orchestration (Week 3)
Goal: Connect all components into cohesive pipeline
Tasks:
- Implement ChatOrchestrator
- Integrate all agents and components
- Implement error handling and logging
- Implement query validation
- Add performance monitoring
- Write integration tests
Validation:
- End-to-end query flow works
- Errors handled gracefully
- Can query all customers simultaneously
- Can filter by specific customer(s)
- Query history maintained
- Performance acceptable (<5s for simple queries)
Phase 7: UI Implementation (Week 4)
Goal: Build user-facing chat interface
Tasks:
- Set up Chainlit application
- Implement chat message handling
- Implement result table display
- Add customer selection controls
- Add example queries
- Implement debug mode
- Add error handling and user feedback
- Style and polish UI
Validation:
- Chat interface responsive and intuitive
- Results display in readable tables
- Customer filtering works
- Example queries load and execute
- Debug mode shows SQL and query plans
- Errors displayed clearly with retry options
Phase 8: Learning & Polish (Week 4)
Goal: Add learning capabilities and final refinements
Tasks:
- Implement FeedbackLoop
- Implement SchemaDriftDetector
- Add query failure analysis
- Add transformation refinement
- Comprehensive testing across all components
- Documentation and deployment guide
- Performance optimization
Validation:
- System learns from errors
- Schema drift detected
- Query failures analyzed and fixed
- End-to-end system robust and performant
- Documentation complete
File Structure
schema_translator/
├── README.md
├── requirements.txt
├── .env.example
├── .gitignore
│
├── databases/                         # SQLite databases
│   ├── customer_a.db
│   ├── customer_b.db
│   ├── customer_c.db
│   ├── customer_d.db
│   ├── customer_e.db
│   └── customer_f.db
│
├── schema_translator/
│   ├── __init__.py
│   ├── config.py                      # Configuration management
│   ├── models.py                      # Pydantic data models
│   ├── mock_data.py                   # Mock data generation
│   ├── knowledge_graph.py             # Schema knowledge graph
│   ├── query_compiler.py              # SQL generation
│   ├── database_executor.py           # Query execution
│   ├── result_harmonizer.py           # Result normalization
│   ├── orchestrator.py                # Main pipeline orchestrator
│   │
│   ├── agents/
│   │   ├── __init__.py
│   │   ├── schema_analyzer.py         # LLM schema analysis
│   │   └── query_understanding.py     # NLP → semantic query plan
│   │
│   └── learning/
│       ├── __init__.py
│       ├── feedback_loop.py           # User feedback processing
│       └── schema_drift_detector.py   # Schema change detection
│
├── tests/
│   ├── __init__.py
│   ├── test_models.py
│   ├── test_knowledge_graph.py
│   ├── test_query_compiler.py
│   ├── test_database_executor.py
│   ├── test_agents.py
│   └── test_integration.py
│
├── app.py                             # Chainlit application
├── knowledge_graph.json               # Persisted knowledge graph
└── query_history.json                 # Query logs
Testing Requirements
Unit Tests
- All data models validate correctly
- Knowledge graph operations (add, query, save, load)
- Query compiler for each customer schema type
- Database executor with mock databases
- Result harmonizer with various data types
- Each agent independently
Integration Tests
- End-to-end query flow (NL → results)
- Multi-customer queries
- Error scenarios (invalid SQL, missing mappings)
- Schema analysis on mock new customer
- Learning loop with feedback
Performance Tests
- Query latency (target: <5s for simple queries across 6 customers)
- Knowledge graph lookup speed
- LLM API call optimization (minimize calls)
- Database query optimization
Non-Functional Requirements
Security
- SQL injection prevention (parameterized queries)
- API key protection (environment variables)
- Input validation on all user queries
Scalability
- Knowledge graph should support 50+ customers
- Query execution should parallelize where possible
- Results pagination for large datasets
Maintainability
- Comprehensive docstrings
- Type hints throughout
- Clear separation of concerns
- Modular architecture for easy updates
Observability
- Logging at all pipeline stages
- Query performance metrics
- Error tracking with context
- Debug mode for development
Example Queries to Support
- "Show me contracts expiring in the next 30 days"
- "Find all technology contracts worth over $1 million"
- "What's the total value of contracts expiring in Q1 2025?"
- "Show me inactive or expired contracts"
- "Which customers have contracts expiring soon?" (group by customer)
- "Compare contract values between healthcare and technology sectors"
- "Find contracts for Customer A that expire before February"
- "What's the average contract value across all customers?"
- "Show me the top 10 highest value contracts"
- "Find contracts that started in 2023 and are still active"
Deliverables Summary
- ✅ Working system with all components integrated
- ✅ 6 customer mock databases with realistic data
- ✅ Knowledge graph with all mappings
- ✅ LLM-powered query understanding
- ✅ Customer-specific SQL generation
- ✅ Result harmonization and normalization
- ✅ Chainlit chat interface
- ✅ Comprehensive test suite
- ✅ Documentation and README
- ✅ Deployment instructions
Success Metrics
- Accuracy: >95% of test queries return correct results
- Coverage: All 10 example queries work across all 6 customers
- Performance: <5 seconds for simple queries, <15 seconds for complex
- Usability: Non-technical users can successfully query data
- Adaptability: New customer onboarding takes <1 hour
Tech Stack
- Language: Python 3.10+
- LLM: Anthropic Claude (claude-sonnet-4-20250514)
- Database: SQLite
- UI Framework: Chainlit
- Data Validation: Pydantic
- Graph: NetworkX
- Testing: pytest
- Environment: python-dotenv
Getting Started Instructions
# 1. Clone repository
git clone <repo-url>
cd schema_translator
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Set up environment
cp .env.example .env
# Edit .env and add ANTHROPIC_API_KEY
# 5. Generate mock data
python -m schema_translator.mock_data
# 6. Run tests
pytest tests/
# 7. Start application
chainlit run app.py
Notes for Claude Code
- Build incrementally, validating each component before moving to next
- Start with data layer (models, mock_data, config) before logic
- Test each customer schema type independently
- Use real LLM calls early to validate prompts
- Focus on Customer A, D, and F first (they demonstrate key differences)
- Customer B can be added later (multi-table complexity)
- Prioritize correctness over performance initially
- Add comprehensive error messages for debugging
- Use print/logging statements liberally during development