Schema Translator Requirements Document
Project Overview
Build an intelligent contract schema translation system that enables querying across multiple enterprise customers with heterogeneous database schemas using LLM-powered semantic understanding.
Core Problem
- Multiple enterprise customers maintain different database schemas for contract data
- Same business concepts represented differently across customers (structure, data types, semantics)
- Need unified query interface that works across all customer schemas
- Must handle semantic ambiguity (e.g., "contract_value" meaning lifetime total vs. annual revenue)
Success Criteria
- User can ask natural language questions about contracts across all customers
- System correctly interprets semantic differences between customer schemas
- Results are normalized and harmonized for cross-customer comparison
- New customers can be onboarded with minimal manual configuration
- System learns and adapts from query patterns and errors
Architecture Components
1. Mock Data Layer
Purpose: Generate realistic contract databases for 6 different customer schemas
Requirements:
- Create SQLite databases for 6 customers (A, B, C, D, E, F)
- Each customer has unique schema characteristics:
- Customer A: Single table, DATE expiry, LIFETIME contract_value
- Customer B: Normalized (3 tables: headers, status_history, renewal_schedule), LIFETIME value
- Customer C: Single table, DATE expiry, LIFETIME value, different column names
- Customer D: Single table, INTEGER days_remaining instead of date, LIFETIME value
- Customer E: Single table, DATE expiry, LIFETIME contract_value with explicit duration
- Customer F: Single table, DATE expiry, ANNUAL contract_value (ARR)
Data Requirements:
- 50 contracts per customer
- Fields: contract ID/name, status, expiration (various formats), value, industry/sector, customer name, start date
- Realistic value ranges: $100K-$5M lifetime, $100K-$2M annual
- Mix of expiration dates: some already expired (up to 60 days in the past), some upcoming (0-365 days)
- Industry variety: Technology, Healthcare, Finance, Manufacturing, Retail (with variations in naming)
Deliverable:
- schema_translator/mock_data.py - MockDataGenerator class
- databases/ directory with 6 SQLite .db files
- Script to regenerate all databases
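A minimal sketch of the generator for one schema (Customer A: single table, DATE expiry, LIFETIME value), assuming illustrative column names; the real MockDataGenerator would cover all six customers:
# Sketch only: column names for Customer A are illustrative, not final.
import random
import sqlite3
from datetime import date, timedelta
from pathlib import Path

INDUSTRIES = ["Technology", "Healthcare", "Finance", "Manufacturing", "Retail"]

def generate_customer_a(db_path: Path, n_contracts: int = 50) -> None:
    """Create customer_a.db with a single-table schema (DATE expiry, LIFETIME value)."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS contracts (
            contract_id INTEGER PRIMARY KEY,
            contract_name TEXT,
            status TEXT,
            start_date TEXT,
            expiry_date TEXT,
            contract_value REAL,
            industry TEXT,
            customer_name TEXT
        )
    """)
    rows = []
    for i in range(1, n_contracts + 1):
        # Mix of expirations: roughly 60 days in the past to 365 days out
        expiry = date.today() + timedelta(days=random.randint(-60, 365))
        start = expiry - timedelta(days=random.randint(365, 3 * 365))
        rows.append((
            i,
            f"Contract-{i:03d}",
            "active" if expiry >= date.today() else "expired",
            start.isoformat(),
            expiry.isoformat(),
            random.randint(100_000, 5_000_000),   # lifetime total
            random.choice(INDUSTRIES),
            f"Client {i:03d}",
        ))
    conn.executemany("INSERT OR REPLACE INTO contracts VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    Path("databases").mkdir(exist_ok=True)
    generate_customer_a(Path("databases/customer_a.db"))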
2. Data Models
Purpose: Define typed data structures for the entire system
Requirements:
- SchemaColumn: name, data_type, semantic_meaning, transformations, sample_values
- SchemaTable: name, columns, relationships
- CustomerSchema: customer_id, customer_name, tables, semantic_notes, last_analyzed
- SemanticConcept: concept_id, concept_name, description, customer_mappings
- QueryFilter: concept, operator, value, semantic_note
- SemanticQueryPlan: intent, filters, aggregations, projections
- QueryResult: customer_id, data, sql_executed, execution_time_ms
- HarmonizedResult: results, total_count, customers_queried
Deliverable:
- schema_translator/models.py using Pydantic BaseModel
- Full type hints and validation
- JSON serialization support
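A minimal sketch of a few of these models, assuming Pydantic v2; the real models.py would add the remaining classes, validators, and docstrings:
# Sketch only: a subset of the models listed above, using Pydantic v2.
from typing import Any, Optional
from pydantic import BaseModel

class QueryFilter(BaseModel):
    concept: str
    operator: str
    value: Any
    semantic_note: Optional[str] = None

class SemanticQueryPlan(BaseModel):
    intent: str
    filters: list[QueryFilter] = []
    aggregations: Optional[list[str]] = None
    projections: list[str] = []

class QueryResult(BaseModel):
    customer_id: str
    data: list[dict[str, Any]]
    sql_executed: str
    execution_time_ms: float

class HarmonizedResult(BaseModel):
    results: list[dict[str, Any]]
    total_count: int
    customers_queried: list[str]

# JSON round-tripping comes for free with Pydantic:
plan = SemanticQueryPlan(
    intent="find_contracts",
    filters=[QueryFilter(concept="contract_value", operator="greater_than", value=500000)],
)
print(plan.model_dump_json(indent=2))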
3. Configuration Management
Purpose: Centralized configuration for API keys, paths, LLM settings
Requirements:
- Load from environment variables and .env file
- ANTHROPIC_API_KEY (required)
- Model name: claude-sonnet-4-20250514
- Database directory path
- Knowledge graph storage path
- LLM parameters: max_tokens, temperature
- Validation on startup
Deliverable:
- schema_translator/config.py - Config class
- .env.example file
- Configuration validation method
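One possible shape for config.py, assuming python-dotenv for .env loading; defaults other than the required ANTHROPIC_API_KEY are illustrative:
# Sketch only: default values and attribute names are assumptions.
import os
from dataclasses import dataclass
from pathlib import Path
from dotenv import load_dotenv

@dataclass
class Config:
    anthropic_api_key: str
    model_name: str = "claude-sonnet-4-20250514"
    database_dir: Path = Path("databases")
    knowledge_graph_path: Path = Path("knowledge_graph.json")
    max_tokens: int = 2048
    temperature: float = 0.0

    @classmethod
    def load(cls) -> "Config":
        load_dotenv()  # read .env if present
        cfg = cls(anthropic_api_key=os.getenv("ANTHROPIC_API_KEY", ""))
        cfg.validate()
        return cfg

    def validate(self) -> None:
        if not self.anthropic_api_key:
            raise ValueError("ANTHROPIC_API_KEY is required (set it in .env)")
        if not self.database_dir.exists():
            raise ValueError(f"Database directory not found: {self.database_dir}")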
4. Knowledge Graph
Purpose: Store semantic relationships between schemas and concepts
Requirements:
Core Concepts to Model:
- contract_expiration - When contract ends (DATE or INTEGER days)
- contract_value - Monetary value (LIFETIME vs ANNUAL semantics)
- contract_status - Current state (single table vs history table)
- industry_sector - Business vertical (various naming conventions)
- contract_identifier - Primary key (ID vs name)
- customer_name - Client organization
- contract_start - Contract begin date
Customer Mappings:
For each customer, map each concept to:
- Table name
- Column name
- Data type
- Semantic type (e.g., LIFETIME_TOTAL, ANNUAL_RECURRING_REVENUE)
- Required transformation (if any)
- Join requirements (for multi-table schemas)
Transformation Rules:
- days_remaining → DATE: DATE_ADD(NOW(), INTERVAL days_remaining DAY)
- DATE → days_remaining: DATEDIFF(expiry_date, NOW())
- ANNUAL → LIFETIME: contract_value * term_years
- LIFETIME → ANNUAL: contract_value / term_years
Deliverable:
- schema_translator/knowledge_graph.py - SchemaKnowledgeGraph class
- Methods: add_concept(), add_customer_mapping(), add_transformation(), get_mapping(), save(), load()
- JSON persistence
- NetworkX graph structure for relationship traversal
- Pre-populated with all 6 customer mappings
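A sketch of the knowledge graph API, assuming customer mappings are stored as edge attributes on a NetworkX DiGraph; attribute names are illustrative:
# Sketch only: storage layout (nodes, edge attributes) is an assumption.
import json
from pathlib import Path
import networkx as nx

class SchemaKnowledgeGraph:
    def __init__(self) -> None:
        self.graph = nx.DiGraph()

    def add_concept(self, concept_id: str, description: str) -> None:
        self.graph.add_node(concept_id, kind="concept", description=description)

    def add_customer_mapping(self, customer_id: str, concept_id: str, mapping: dict) -> None:
        # mapping keys: table, column, data_type, semantic_type, transformation, joins
        self.graph.add_node(customer_id, kind="customer")
        self.graph.add_edge(customer_id, concept_id, **mapping)

    def get_mapping(self, customer_id: str, concept_id: str) -> dict:
        return dict(self.graph.edges[customer_id, concept_id])

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(nx.node_link_data(self.graph), indent=2))

    def load(self, path: Path) -> None:
        self.graph = nx.node_link_graph(json.loads(path.read_text()))

# Example: Customer D stores expiry as an INTEGER days_remaining column.
kg = SchemaKnowledgeGraph()
kg.add_concept("contract_expiration", "When the contract ends")
kg.add_customer_mapping("customer_d", "contract_expiration", {
    "table": "contracts",
    "column": "days_remaining",
    "data_type": "INTEGER",
    "semantic_type": "DAYS_REMAINING",
    "transformation": "DATE(CURRENT_DATE, '+' || days_remaining || ' days')",
})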
5. LLM Schema Analyzer Agent
Purpose: Automatically analyze new customer schemas and propose mappings
Requirements:
Core Functionality:
Analyze Schema:
- Input: Customer database connection, customer_id
- Extract table structures, column names, data types
- Query sample data from each column
- Identify primary keys and foreign keys
Semantic Analysis:
- Use LLM to interpret column names and sample values
- Determine semantic meaning (is "contract_value" lifetime or annual?)
- Identify calculation needs (days_remaining vs date conversion)
- Detect normalization patterns (single vs multi-table)
Mapping Proposal:
- Generate proposed mappings to canonical concepts
- Suggest transformation logic
- Flag ambiguities requiring human confirmation
- Provide confidence scores
Validation:
- Test proposed mappings with sample queries
- Verify JOIN logic for multi-table schemas
- Confirm value ranges are reasonable
LLM Prompts Required:
- Schema interpretation prompt
- Semantic disambiguation prompt (for ambiguous columns)
- Transformation generation prompt
- Validation prompt
Deliverable:
- schema_translator/agents/schema_analyzer.py - SchemaAnalyzerAgent class
- Methods: analyze_schema(), propose_mappings(), validate_mappings()
- Integration with Anthropic API
- Structured output parsing
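A hedged sketch of the analyzer's LLM call, assuming the Anthropic Python SDK and an illustrative prompt; the real agent adds sample-value extraction, few-shot examples, and robust structured-output parsing:
# Sketch only: the prompt and JSON contract are illustrative assumptions.
import json
import sqlite3
import anthropic

def describe_schema(db_path: str) -> str:
    """Dump table/column structure from a SQLite database for the prompt."""
    conn = sqlite3.connect(db_path)
    tables = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
    lines = []
    for table in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        lines.append(f"{table}: " + ", ".join(f"{c[1]} {c[2]}" for c in cols))
    conn.close()
    return "\n".join(lines)

def propose_mappings(db_path: str, customer_id: str, model: str) -> dict:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    prompt = (
        f"Customer {customer_id} schema:\n{describe_schema(db_path)}\n\n"
        "Map each column to one of these canonical concepts: contract_expiration, "
        "contract_value, contract_status, industry_sector, contract_identifier, "
        "customer_name, contract_start. For contract_value, decide LIFETIME_TOTAL "
        "vs ANNUAL_RECURRING_REVENUE and give a confidence score. Respond as JSON."
    )
    response = client.messages.create(
        model=model, max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)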
6. Query Understanding Agent
Purpose: Convert natural language queries to Semantic Query Plans
Requirements:
Input Processing:
- Accept natural language questions about contracts
- Parse intent (find, count, aggregate, compare)
- Extract filters (time ranges, value thresholds, industry, status)
- Identify aggregations (sum, count, average)
- Determine projections (which fields to return)
Semantic Query Plan Generation:
Convert to schema-independent intermediate representation:
{
"intent": "find_contracts",
"filters": [
{
"concept": "contract_expiration",
"operator": "within_next_days",
"value": 30,
"semantic_note": "expiration may be date or days_remaining"
},
{
"concept": "contract_value",
"operator": "greater_than",
"value": 500000,
"semantic_note": "need to normalize to lifetime total"
},
{
"concept": "industry_sector",
"operator": "equals",
"value": "Technology"
}
],
"projections": ["contract_identifier", "contract_value", "contract_expiration", "customer_name"],
"aggregations": null
}
Operator Mapping:
- Natural language → standardized operators
- "expiring soon", "due in next X days" → within_next_days
- "over $X", "worth more than" → greater_than
- "in Q1", "between dates" → date_range
- "technology sector", "tech companies" → equals (with normalization)
LLM Prompts Required:
- Query understanding prompt
- Ambiguity resolution prompt
- Context-aware concept extraction
Deliverable:
- schema_translator/agents/query_understanding.py - QueryUnderstandingAgent class
- Methods: parse_query(), create_semantic_plan()
- Natural language processing with LLM
- Structured SemanticQueryPlan output
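A sketch of the query-understanding step, assuming the SemanticQueryPlan model from models.py and the Anthropic SDK; the prompt text is illustrative, not the final engineered prompt:
# Sketch only: prompt wording and retry/clarification handling are omitted.
import anthropic
from schema_translator.models import SemanticQueryPlan

QUERY_PROMPT = """Convert the user's question about contracts into a JSON semantic
query plan with keys: intent, filters (concept, operator, value, semantic_note),
projections, aggregations. Use only these concepts: contract_expiration,
contract_value, contract_status, industry_sector, contract_identifier,
customer_name, contract_start. Question: {question}"""

class QueryUnderstandingAgent:
    def __init__(self, model: str) -> None:
        self.client = anthropic.Anthropic()
        self.model = model

    def parse_query(self, question: str) -> SemanticQueryPlan:
        response = self.client.messages.create(
            model=self.model, max_tokens=1024,
            messages=[{"role": "user", "content": QUERY_PROMPT.format(question=question)}],
        )
        # model_validate_json raises if the LLM output does not match the schema,
        # which is the signal to retry or ask the user for clarification.
        return SemanticQueryPlan.model_validate_json(response.content[0].text)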
7. Query Compiler
Purpose: Compile Semantic Query Plans to customer-specific SQL
Requirements:
Core Compilation:
Concept Resolution:
- For each concept in plan, lookup customer mapping in knowledge graph
- Retrieve table, column, type, semantic type
- Identify required transformations
SQL Generation:
- Build SELECT clause with projections
- Build FROM clause (single table or JOINs)
- Build WHERE clause with filters
- Apply transformations in WHERE/SELECT as needed
- Build GROUP BY, HAVING for aggregations
Schema-Specific Handling:
- Single table (A, C, D, E, F): Direct SELECT
- Multi-table (B): Auto-generate JOINs
- Date fields (A, C, E, F): Direct date comparisons
- Days remaining (D): Convert to date calculations
- Lifetime value (A, C, D, E): Direct value comparison
- Annual value (F): Multiply by term_years for lifetime equivalent
Industry Normalization:
- Map "Technology" β customer-specific terms (Tech, Information Technology)
- Handle semantic equivalence
Example Compilations:
Query: "Contracts expiring in next 30 days worth over $500K"
Customer A SQL:
SELECT contract_id, contract_name, contract_value, expiry_date, customer_name
FROM contracts
WHERE expiry_date BETWEEN CURRENT_DATE AND DATE(CURRENT_DATE, '+30 days')
AND contract_value > 500000
Customer B SQL:
SELECT h.id, h.contract_name, h.contract_value, r.renewal_date, h.client_name
FROM contract_headers h
JOIN renewal_schedule r ON h.id = r.contract_id
JOIN contract_status_history s ON h.id = s.contract_id
WHERE r.renewal_date BETWEEN CURRENT_DATE AND DATE(CURRENT_DATE, '+30 days')
AND h.contract_value > 500000
AND s.status_date = (
SELECT MAX(status_date)
FROM contract_status_history
WHERE contract_id = h.id
)
Customer D SQL:
SELECT contract_id, contract_title, contract_value,
DATE(CURRENT_DATE, '+' || days_remaining || ' days') as expiry_date,
customer_org
FROM contracts
WHERE days_remaining BETWEEN 0 AND 30
AND contract_value > 500000
Customer F SQL:
SELECT contract_id, name,
(contract_value * term_years) as lifetime_value,
expiration_date, account
FROM contracts
WHERE expiration_date BETWEEN CURRENT_DATE AND DATE(CURRENT_DATE, '+30 days')
AND (contract_value * term_years) > 500000
Deliverable:
- schema_translator/query_compiler.py - QueryCompiler class
- Methods: compile_for_customer(), generate_select(), generate_where(), generate_joins()
- SQL injection prevention
- Query validation
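A minimal compilation sketch for two operators, assuming the mapping shape stored in the knowledge graph and the term_years column shown in the Customer F example above; the full QueryCompiler also handles joins, projections, aggregations, and query validation:
# Sketch only: mapping keys and semantic_type values are assumptions.
def compile_filter(mapping: dict, operator: str, value) -> tuple[str, list]:
    """Return a WHERE fragment and its bound parameters for one concept filter."""
    column = mapping["column"]
    if operator == "within_next_days":
        if mapping.get("semantic_type") == "DAYS_REMAINING":
            # Customer D style: integer days, no date math needed
            return f"{column} BETWEEN 0 AND ?", [value]
        return (f"{column} BETWEEN CURRENT_DATE AND "
                "DATE(CURRENT_DATE, '+' || ? || ' days')"), [value]
    if operator == "greater_than":
        if mapping.get("semantic_type") == "ANNUAL_RECURRING_REVENUE":
            # Customer F style: normalize annual value to lifetime before comparing
            return f"({column} * term_years) > ?", [value]
        return f"{column} > ?", [value]
    raise ValueError(f"Unsupported operator: {operator}")

# The same semantic filter compiles to different SQL per customer:
d_mapping = {"column": "days_remaining", "semantic_type": "DAYS_REMAINING"}
a_mapping = {"column": "expiry_date", "semantic_type": "DATE"}
print(compile_filter(d_mapping, "within_next_days", 30))
print(compile_filter(a_mapping, "within_next_days", 30))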
8. Database Executor
Purpose: Execute customer-specific SQL queries and collect results
Requirements:
Execution:
- Connect to customer SQLite databases
- Execute compiled SQL queries
- Measure execution time
- Handle database errors gracefully
- Return structured QueryResult objects
Multi-Customer Execution:
- Execute queries across multiple customers in parallel (optional: can be sequential for MVP)
- Collect results from all customers
- Tag results with customer_id
- Include executed SQL for debugging
Error Handling:
- Catch SQL errors (invalid column, syntax, etc.)
- Return partial results if some customers succeed
- Log errors with context
- Provide actionable error messages
Deliverable:
- schema_translator/database_executor.py - DatabaseExecutor class
- Methods: execute_query(), execute_for_customer(), execute_all_customers()
- Connection pooling
- Result aggregation
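An executor sketch using sqlite3, assuming the QueryResult model from models.py; parallel execution and connection pooling are omitted from this minimal version:
# Sketch only: sequential execution, no pooling, print-based error reporting.
import sqlite3
import time
from pathlib import Path
from schema_translator.models import QueryResult

class DatabaseExecutor:
    def __init__(self, database_dir: Path) -> None:
        self.database_dir = database_dir

    def execute_for_customer(self, customer_id: str, sql: str,
                             params: list | None = None) -> QueryResult:
        db_path = self.database_dir / f"{customer_id}.db"
        conn = sqlite3.connect(db_path)
        conn.row_factory = sqlite3.Row  # rows behave like dicts
        start = time.perf_counter()
        try:
            rows = [dict(r) for r in conn.execute(sql, params or [])]
        finally:
            conn.close()
        return QueryResult(
            customer_id=customer_id,
            data=rows,
            sql_executed=sql,
            execution_time_ms=(time.perf_counter() - start) * 1000,
        )

    def execute_all_customers(self, compiled: dict[str, tuple[str, list]]) -> list[QueryResult]:
        """compiled maps customer_id -> (sql, params); failures yield partial results."""
        results = []
        for customer_id, (sql, params) in compiled.items():
            try:
                results.append(self.execute_for_customer(customer_id, sql, params))
            except sqlite3.Error as exc:
                print(f"[{customer_id}] query failed: {exc}")  # real class would log with context
        return results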
9. Result Harmonizer
Purpose: Normalize and harmonize results from different customer schemas
Requirements:
Normalization:
Value Normalization:
- Convert all contract values to same semantic unit (lifetime total)
- Customer F: multiply annual by term_years
- Add metadata indicating original vs calculated
Date Normalization:
- Convert all expiration representations to ISO date format
- Customer D: calculate actual date from days_remaining
- Ensure consistent timezone handling
Field Name Normalization:
- Map customer-specific field names to canonical names
- contract_id/id → contract_identifier
- customer_name/client_name/account → customer_name
- expiry_date/expiration_date/renewal_date → expiration_date
Industry Name Normalization:
- Map variations to canonical terms
- "Tech", "Technology", "Information Technology" β "Technology"
- Maintain consistent taxonomy
Result Formatting:
- Unified schema for display
- Include source customer
- Include both original and normalized values where applicable
- Sort by specified criteria (e.g., expiration date, value)
- Support pagination
Deliverable:
- schema_translator/result_harmonizer.py - ResultHarmonizer class
- Methods: harmonize_results(), normalize_value(), normalize_date(), normalize_industry()
- Configurable normalization rules
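A normalization sketch, assuming semantic_type and term_years metadata are available from the knowledge graph; the alias tables are illustrative and would be configurable in the real class:
# Sketch only: alias dictionaries are examples, not the full normalization rules.
from datetime import date, timedelta

FIELD_ALIASES = {
    "contract_id": "contract_identifier", "id": "contract_identifier",
    "client_name": "customer_name", "account": "customer_name",
    "expiry_date": "expiration_date", "renewal_date": "expiration_date",
}
INDUSTRY_ALIASES = {"Tech": "Technology", "Information Technology": "Technology"}

def normalize_row(row: dict, semantic_type: str, term_years: float | None = None) -> dict:
    out = {FIELD_ALIASES.get(k, k): v for k, v in row.items()}
    # Value normalization: annual (ARR) -> lifetime total, keeping the original
    if semantic_type == "ANNUAL_RECURRING_REVENUE" and term_years:
        out["original_contract_value"] = out.get("contract_value")
        out["contract_value"] = out["contract_value"] * term_years
    # Date normalization: integer days_remaining -> ISO date
    if "days_remaining" in out:
        out["expiration_date"] = (date.today()
                                  + timedelta(days=out.pop("days_remaining"))).isoformat()
    if "industry" in out:
        out["industry"] = INDUSTRY_ALIASES.get(out["industry"], out["industry"])
    return out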
10. Chat Orchestrator
Purpose: Coordinate the entire query flow from natural language to harmonized results
Requirements:
Workflow:
- Receive natural language query from user
- Optionally filter by specific customer(s) or query all
- Call QueryUnderstandingAgent → SemanticQueryPlan
- For each customer:
- Call QueryCompiler → customer-specific SQL
- Call DatabaseExecutor → QueryResult
- Call ResultHarmonizer → HarmonizedResult
- Format for display
- Return to user interface
Additional Features:
- Query validation before execution
- Caching for repeated queries
- Query history tracking
- Performance metrics (end-to-end latency)
- Error aggregation and reporting
Conversation Context:
- Maintain conversation state for follow-up questions
- Reference previous queries ("show me more like this")
- Clarification handling ("Did you mean X or Y?")
Deliverable:
- schema_translator/orchestrator.py - ChatOrchestrator class
- Methods: process_query(), validate_query(), execute_pipeline()
- Integration with all agents and components
- Comprehensive logging
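A pipeline sketch showing how the orchestrator wires the components together; caching, query history, and conversation context are omitted here:
# Sketch only: constructor arguments mirror the components defined above.
class ChatOrchestrator:
    def __init__(self, understanding, compiler, executor, harmonizer, customer_ids):
        self.understanding = understanding
        self.compiler = compiler
        self.executor = executor
        self.harmonizer = harmonizer
        self.customer_ids = customer_ids

    def process_query(self, question: str, customers: list[str] | None = None):
        plan = self.understanding.parse_query(question)           # NL -> semantic plan
        targets = customers or self.customer_ids
        compiled = {cid: self.compiler.compile_for_customer(plan, cid)
                    for cid in targets}                            # plan -> per-customer SQL
        results = self.executor.execute_all_customers(compiled)    # SQL -> QueryResult list
        return self.harmonizer.harmonize_results(results)          # -> HarmonizedResult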
11. Chat Interface (Chainlit)
Purpose: Web-based chat UI for interacting with the Schema Translator
Requirements:
UI Features:
Chat Interface:
- Natural language input box
- Streaming responses
- Display harmonized results in tables
- Show per-customer breakdowns
- Display executed SQL (collapsible/expandable)
- Error messages with retry options
Customer Selection:
- Dropdown or checkboxes to select specific customers
- "All customers" option (default)
- Show customer schema summaries on hover
Result Display:
- Table view with sortable columns
- Highlight normalized values
- Show original vs calculated values
- Export to CSV option
- Pagination for large result sets
Query Examples:
- Pre-populated example queries
- "Show me contracts expiring in next 30 days"
- "Find technology contracts over $1M"
- "What's the total value of contracts expiring in Q1?"
- "Compare contract values across customers"
Debug Mode:
- Toggle to show semantic query plan
- Show customer-specific SQL for each customer
- Show knowledge graph lookups
- Execution time per customer
Chainlit-Specific:
- Use @cl.on_message for query handling
- Use cl.Message for responses
- Use cl.DataTable for results display (or custom HTML)
- Use cl.Action for interactive buttons
- Session management for conversation context
Deliverable:
- app.py - Main Chainlit application
- Integration with ChatOrchestrator
- Responsive UI with tables and formatting
- Error handling and user feedback
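A minimal app.py sketch, assuming a hypothetical build_orchestrator() factory that wires up the pipeline; results are rendered as a markdown table rather than a dedicated table element:
# Sketch only: build_orchestrator() is a hypothetical helper, not part of the spec.
import chainlit as cl
from schema_translator.orchestrator import ChatOrchestrator

@cl.on_chat_start
async def start():
    cl.user_session.set("orchestrator", build_orchestrator())  # hypothetical factory
    await cl.Message(content="Ask me about contracts across all customers.").send()

@cl.on_message
async def on_message(message: cl.Message):
    orchestrator: ChatOrchestrator = cl.user_session.get("orchestrator")
    try:
        result = orchestrator.process_query(message.content)
    except Exception as exc:
        await cl.Message(content=f"Query failed: {exc}").send()
        return
    rows = result.results[:20]
    if not rows:
        await cl.Message(content="No matching contracts found.").send()
        return
    # Render harmonized rows as a markdown table
    header = " | ".join(rows[0].keys())
    divider = " | ".join("---" for _ in rows[0])
    body = "\n".join(" | ".join(str(v) for v in r.values()) for r in rows)
    await cl.Message(content=f"{header}\n{divider}\n{body}").send()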
12. Learning and Feedback Loop
Purpose: Improve system over time based on query results and user feedback
Requirements:
Feedback Collection:
- Capture user corrections ("This result is wrong")
- Track query failures and errors
- Log semantic ambiguities that required clarification
- Record query patterns and frequency
Learning Mechanisms:
Schema Drift Detection:
- Periodically check customer schemas for changes
- Detect new columns, modified types
- Flag when mappings may be stale
Transformation Refinement:
- When queries fail, analyze error
- Use LLM to propose fixes
- Update knowledge graph with successful fixes
Semantic Learning:
- Track which industry name mappings are used most
- Learn customer-specific terminology preferences
- Improve concept disambiguation over time
Query Pattern Analysis:
- Identify common query types
- Pre-generate optimized mappings for frequent patterns
- Suggest query completions
Deliverable:
- schema_translator/learning/feedback_loop.py - FeedbackLoop class
- schema_translator/learning/schema_drift_detector.py - SchemaDriftDetector class
- Methods: record_feedback(), detect_drift(), refine_transformations()
- Periodic background tasks
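A sketch of schema-drift detection, assuming the knowledge graph records the column set seen at the last analysis:
# Sketch only: the known_columns structure is an assumption about stored state.
import sqlite3

def detect_drift(db_path: str, known_columns: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return columns present in the live schema but missing from the stored mappings."""
    conn = sqlite3.connect(db_path)
    drift: dict[str, set[str]] = {}
    tables = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
    for table in tables:
        current = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        new_columns = current - known_columns.get(table, set())
        if new_columns:
            drift[table] = new_columns  # flag mappings that may be stale
    conn.close()
    return drift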
Implementation Phases
Phase 1: Foundation (Week 1)
Goal: Core data structures and mock data
Tasks:
- Set up project structure
- Implement configuration management
- Define all Pydantic models
- Generate mock data for all 6 customers
- Test database creation and data access
Validation:
- All 6 SQLite databases created with 50 contracts each
- Can query each database directly
- Models serialize to/from JSON correctly
Phase 2: Knowledge Graph (Week 1-2)
Goal: Build and populate the knowledge graph
Tasks:
- Implement SchemaKnowledgeGraph class
- Define 7 core concepts with aliases
- Create mappings for all 6 customers
- Implement transformation rules
- Add JSON persistence
- Write unit tests
Validation:
- Knowledge graph loads and saves correctly
- Can query concept mappings for each customer
- Transformations are correctly defined
- No missing mappings for any customer
Phase 3: Query Compiler & Executor (Week 2)
Goal: Generate and execute customer-specific SQL
Tasks:
- Implement QueryCompiler
- Handle single-table schemas (A, C, D, E, F)
- Handle multi-table schema (B)
- Implement date/days_remaining conversions
- Implement lifetime/annual value conversions
- Implement DatabaseExecutor
- Write extensive unit tests with real queries
Validation:
- Same semantic query compiles to different SQL for each customer
- All compiled queries execute successfully
- Results are returned in QueryResult format
- Customer D date calculations are correct
- Customer F value normalization is correct
- Customer B joins work correctly
Phase 4: LLM Agents (Week 2-3)
Goal: Implement intelligent query understanding and schema analysis
Tasks:
- Set up Anthropic API integration
- Implement QueryUnderstandingAgent
- Test natural language → SemanticQueryPlan conversion
- Implement SchemaAnalyzerAgent
- Test auto-analysis on a mock "Customer G"
- Implement prompt engineering and few-shot examples
Validation:
- Natural language queries correctly parsed
- Semantic query plans are accurate
- Schema analyzer correctly interprets new schemas
- LLM proposes reasonable mappings
- Ambiguity detection works (flags when uncertain)
Phase 5: Result Harmonization (Week 3)
Goal: Normalize and present unified results
Tasks:
- Implement ResultHarmonizer
- Implement value normalization (annual → lifetime)
- Implement date normalization (days → date)
- Implement field name mapping
- Implement industry name normalization
- Test with multi-customer result sets
Validation:
- Customer F annual values correctly converted
- Customer D dates correctly calculated
- Field names unified across customers
- Industry names normalized
- Results sortable and filterable
Phase 6: Orchestration (Week 3)
Goal: Connect all components into cohesive pipeline
Tasks:
- Implement ChatOrchestrator
- Integrate all agents and components
- Implement error handling and logging
- Implement query validation
- Add performance monitoring
- Write integration tests
Validation:
- End-to-end query flow works
- Errors handled gracefully
- Can query all customers simultaneously
- Can filter by specific customer(s)
- Query history maintained
- Performance acceptable (<5s for simple queries)
Phase 7: UI Implementation (Week 4)
Goal: Build user-facing chat interface
Tasks:
- Set up Chainlit application
- Implement chat message handling
- Implement result table display
- Add customer selection controls
- Add example queries
- Implement debug mode
- Add error handling and user feedback
- Style and polish UI
Validation:
- Chat interface responsive and intuitive
- Results display in readable tables
- Customer filtering works
- Example queries load and execute
- Debug mode shows SQL and query plans
- Errors displayed clearly with retry options
Phase 8: Learning & Polish (Week 4)
Goal: Add learning capabilities and final refinements
Tasks:
- Implement FeedbackLoop
- Implement SchemaDriftDetector
- Add query failure analysis
- Add transformation refinement
- Comprehensive testing across all components
- Documentation and deployment guide
- Performance optimization
Validation:
- System learns from errors
- Schema drift detected
- Query failures analyzed and fixed
- End-to-end system robust and performant
- Documentation complete
File Structure
schema_translator/
├── README.md
├── requirements.txt
├── .env.example
├── .gitignore
│
├── databases/                         # SQLite databases
│   ├── customer_a.db
│   ├── customer_b.db
│   ├── customer_c.db
│   ├── customer_d.db
│   ├── customer_e.db
│   └── customer_f.db
│
├── schema_translator/
│   ├── __init__.py
│   ├── config.py                      # Configuration management
│   ├── models.py                      # Pydantic data models
│   ├── mock_data.py                   # Mock data generation
│   ├── knowledge_graph.py             # Schema knowledge graph
│   ├── query_compiler.py              # SQL generation
│   ├── database_executor.py           # Query execution
│   ├── result_harmonizer.py           # Result normalization
│   ├── orchestrator.py                # Main pipeline orchestrator
│   │
│   ├── agents/
│   │   ├── __init__.py
│   │   ├── schema_analyzer.py         # LLM schema analysis
│   │   └── query_understanding.py     # NLP → semantic query plan
│   │
│   └── learning/
│       ├── __init__.py
│       ├── feedback_loop.py           # User feedback processing
│       └── schema_drift_detector.py   # Schema change detection
│
├── tests/
│   ├── __init__.py
│   ├── test_models.py
│   ├── test_knowledge_graph.py
│   ├── test_query_compiler.py
│   ├── test_database_executor.py
│   ├── test_agents.py
│   └── test_integration.py
│
├── app.py                             # Chainlit application
├── knowledge_graph.json               # Persisted knowledge graph
└── query_history.json                 # Query logs
Testing Requirements
Unit Tests
- All data models validate correctly
- Knowledge graph operations (add, query, save, load)
- Query compiler for each customer schema type
- Database executor with mock databases
- Result harmonizer with various data types
- Each agent independently
Integration Tests
- End-to-end query flow (NL → results)
- Multi-customer queries
- Error scenarios (invalid SQL, missing mappings)
- Schema analysis on mock new customer
- Learning loop with feedback
Performance Tests
- Query latency (target: <5s for simple queries across 6 customers)
- Knowledge graph lookup speed
- LLM API call optimization (minimize calls)
- Database query optimization
Non-Functional Requirements
Security
- SQL injection prevention (parameterized queries)
- API key protection (environment variables)
- Input validation on all user queries
Scalability
- Knowledge graph should support 50+ customers
- Query execution should parallelize where possible
- Results pagination for large datasets
Maintainability
- Comprehensive docstrings
- Type hints throughout
- Clear separation of concerns
- Modular architecture for easy updates
Observability
- Logging at all pipeline stages
- Query performance metrics
- Error tracking with context
- Debug mode for development
Example Queries to Support
- "Show me contracts expiring in the next 30 days"
- "Find all technology contracts worth over $1 million"
- "What's the total value of contracts expiring in Q1 2025?"
- "Show me inactive or expired contracts"
- "Which customers have contracts expiring soon?" (group by customer)
- "Compare contract values between healthcare and technology sectors"
- "Find contracts for Customer A that expire before February"
- "What's the average contract value across all customers?"
- "Show me the top 10 highest value contracts"
- "Find contracts that started in 2023 and are still active"
Deliverables Summary
- ✅ Working system with all components integrated
- ✅ 6 customer mock databases with realistic data
- ✅ Knowledge graph with all mappings
- ✅ LLM-powered query understanding
- ✅ Customer-specific SQL generation
- ✅ Result harmonization and normalization
- ✅ Chainlit chat interface
- ✅ Comprehensive test suite
- ✅ Documentation and README
- ✅ Deployment instructions
Success Metrics
- Accuracy: >95% of test queries return correct results
- Coverage: All 10 example queries work across all 6 customers
- Performance: <5 seconds for simple queries, <15 seconds for complex
- Usability: Non-technical users can successfully query data
- Adaptability: New customer onboarding takes <1 hour
Tech Stack
- Language: Python 3.10+
- LLM: Anthropic Claude (claude-sonnet-4-20250514)
- Database: SQLite
- UI Framework: Chainlit
- Data Validation: Pydantic
- Graph: NetworkX
- Testing: pytest
- Environment: python-dotenv
Getting Started Instructions
# 1. Clone repository
git clone <repo-url>
cd schema_translator
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Set up environment
cp .env.example .env
# Edit .env and add ANTHROPIC_API_KEY
# 5. Generate mock data
python -m schema_translator.mock_data
# 6. Run tests
pytest tests/
# 7. Start application
chainlit run app.py
Notes for Claude Code
- Build incrementally, validating each component before moving to next
- Start with data layer (models, mock_data, config) before logic
- Test each customer schema type independently
- Use real LLM calls early to validate prompts
- Focus on Customer A, D, and F first (they demonstrate key differences)
- Customer B can be added later (multi-table complexity)
- Prioritize correctness over performance initially
- Add comprehensive error messages for debugging
- Use print/logging statements liberally during development