# **Schema Translator Requirements Document**
## **Project Overview**
Build an intelligent contract schema translation system that enables querying across multiple enterprise customers with heterogeneous database schemas using LLM-powered semantic understanding.
## **Core Problem**
* Multiple enterprise customers maintain different database schemas for contract data
* Same business concepts represented differently across customers (structure, data types, semantics)
* Need unified query interface that works across all customer schemas
* Must handle semantic ambiguity (e.g., "contract\_value" meaning lifetime total vs. annual revenue)
## **Success Criteria**
1. User can ask natural language questions about contracts across all customers
2. System correctly interprets semantic differences between customer schemas
3. Results are normalized and harmonized for cross-customer comparison
4. New customers can be onboarded with minimal manual configuration
5. System learns and adapts from query patterns and errors
---
## **Architecture Components**
### **1\. Mock Data Layer**
**Purpose:** Generate realistic contract databases for 6 different customer schemas
**Requirements:**
* Create SQLite databases for 6 customers (A, B, C, D, E, F)
* Each customer has unique schema characteristics:
* **Customer A:** Single table, DATE expiry, LIFETIME contract\_value
* **Customer B:** Normalized (3 tables: headers, status\_history, renewal\_schedule), LIFETIME value
* **Customer C:** Single table, DATE expiry, LIFETIME value, different column names
* **Customer D:** Single table, INTEGER days\_remaining instead of date, LIFETIME value
* **Customer E:** Single table, DATE expiry, LIFETIME contract\_value with explicit duration
* **Customer F:** Single table, DATE expiry, ANNUAL contract\_value (ARR)
**Data Requirements:**
* 50 contracts per customer
* Fields: contract ID/name, status, expiration (various formats), value, industry/sector, customer name, start date
* Realistic value ranges: $100K-$5M lifetime, $100K-$2M annual
* Mix of expiration dates: some already expired (up to 60 days in the past), some upcoming (0-365 days out)
* Industry variety: Technology, Healthcare, Finance, Manufacturing, Retail (with variations in naming)
**Deliverable:**
* `schema_translator/mock_data.py` \- MockDataGenerator class
* `databases/` directory with 6 SQLite .db files
* Script to regenerate all databases
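As a reference point, here is a minimal sketch of how `MockDataGenerator` might build Customer A's single-table database. The column names mirror the Customer A description above; the exact names, helper structure, and value distributions are illustrative assumptions.

```python
# Sketch of mock data generation for Customer A (single table, DATE expiry,
# LIFETIME contract_value). Column names are illustrative assumptions.
import random
import sqlite3
from datetime import date, timedelta
from pathlib import Path

INDUSTRIES = ["Technology", "Healthcare", "Finance", "Manufacturing", "Retail"]

def generate_customer_a(db_path: str = "databases/customer_a.db", n: int = 50) -> None:
    Path(db_path).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS contracts (
            contract_id    INTEGER PRIMARY KEY,
            contract_name  TEXT NOT NULL,
            status         TEXT NOT NULL,
            start_date     TEXT NOT NULL,   -- ISO date
            expiry_date    TEXT NOT NULL,   -- ISO date
            contract_value REAL NOT NULL,   -- lifetime total, USD
            industry       TEXT NOT NULL,
            customer_name  TEXT NOT NULL
        )
    """)
    conn.execute("DELETE FROM contracts")
    today = date.today()
    for i in range(1, n + 1):
        # Mix of expired (up to 60 days past) and upcoming (0-365 days) expirations
        expiry = today + timedelta(days=random.randint(-60, 365))
        start = expiry - timedelta(days=random.randint(365, 3 * 365))
        conn.execute(
            "INSERT INTO contracts VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (
                i,
                f"Contract A-{i:03d}",
                "expired" if expiry < today else "active",
                start.isoformat(),
                expiry.isoformat(),
                round(random.uniform(100_000, 5_000_000), 2),  # $100K-$5M lifetime
                random.choice(INDUSTRIES),
                f"Client {i:03d}",
            ),
        )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    generate_customer_a()
```

The other five generators would follow the same pattern, varying table layout, column names, and value semantics per customer.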
---
### **2\. Data Models**
**Purpose:** Define typed data structures for the entire system
**Requirements:**
* **SchemaColumn:** name, data\_type, semantic\_meaning, transformations, sample\_values
* **SchemaTable:** name, columns, relationships
* **CustomerSchema:** customer\_id, customer\_name, tables, semantic\_notes, last\_analyzed
* **SemanticConcept:** concept\_id, concept\_name, description, customer\_mappings
* **QueryFilter:** concept, operator, value, semantic\_note
* **SemanticQueryPlan:** intent, filters, aggregations, projections
* **QueryResult:** customer\_id, data, sql\_executed, execution\_time\_ms
* **HarmonizedResult:** results, total\_count, customers\_queried
**Deliverable:**
* `schema_translator/models.py` using Pydantic BaseModel
* Full type hints and validation
* JSON serialization support
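A minimal Pydantic sketch of a few of these models; the field names follow the bullets above, while defaults and optionality are assumptions.

```python
# Sketch of a subset of the data models (Pydantic v2 style).
from typing import Any, Optional
from pydantic import BaseModel, Field

class QueryFilter(BaseModel):
    concept: str                      # e.g. "contract_expiration"
    operator: str                     # e.g. "within_next_days", "greater_than"
    value: Any
    semantic_note: Optional[str] = None

class SemanticQueryPlan(BaseModel):
    intent: str                       # e.g. "find_contracts"
    filters: list[QueryFilter] = Field(default_factory=list)
    aggregations: Optional[dict[str, str]] = None
    projections: list[str] = Field(default_factory=list)

class QueryResult(BaseModel):
    customer_id: str
    data: list[dict[str, Any]]
    sql_executed: str
    execution_time_ms: float

class HarmonizedResult(BaseModel):
    results: list[dict[str, Any]]
    total_count: int
    customers_queried: list[str]

# JSON serialization/validation come with Pydantic:
plan = SemanticQueryPlan(
    intent="find_contracts",
    filters=[QueryFilter(concept="contract_value", operator="greater_than", value=500_000)],
    projections=["contract_identifier", "contract_value"],
)
restored = SemanticQueryPlan.model_validate_json(plan.model_dump_json())
```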
---
### **3\. Configuration Management**
**Purpose:** Centralized configuration for API keys, paths, LLM settings
**Requirements:**
* Load from environment variables and .env file
* ANTHROPIC\_API\_KEY (required)
* Model name: claude-sonnet-4-20250514
* Database directory path
* Knowledge graph storage path
* LLM parameters: max\_tokens, temperature
* Validation on startup
**Deliverable:**
* `schema_translator/config.py` \- Config class
* `.env.example` file
* Configuration validation method
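A sketch of what `Config` could look like with `python-dotenv`; attribute and environment-variable names other than `ANTHROPIC_API_KEY` are assumptions.

```python
# Sketch of config.py using python-dotenv; names other than ANTHROPIC_API_KEY are assumptions.
import os
from dataclasses import dataclass
from dotenv import load_dotenv

@dataclass
class Config:
    anthropic_api_key: str
    model_name: str = "claude-sonnet-4-20250514"
    database_dir: str = "databases"
    knowledge_graph_path: str = "knowledge_graph.json"
    max_tokens: int = 2048
    temperature: float = 0.0

    @classmethod
    def from_env(cls) -> "Config":
        load_dotenv()  # pulls values from .env into the environment
        return cls(
            anthropic_api_key=os.getenv("ANTHROPIC_API_KEY", ""),
            model_name=os.getenv("MODEL_NAME", cls.model_name),
            database_dir=os.getenv("DATABASE_DIR", cls.database_dir),
            knowledge_graph_path=os.getenv("KNOWLEDGE_GRAPH_PATH", cls.knowledge_graph_path),
            max_tokens=int(os.getenv("MAX_TOKENS", cls.max_tokens)),
            temperature=float(os.getenv("TEMPERATURE", cls.temperature)),
        )

    def validate(self) -> None:
        """Fail fast on startup if required settings are missing."""
        if not self.anthropic_api_key:
            raise ValueError("ANTHROPIC_API_KEY is required (set it in .env)")
        if not os.path.isdir(self.database_dir):
            raise ValueError(f"Database directory not found: {self.database_dir}")
```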
---
### **4\. Knowledge Graph**
**Purpose:** Store semantic relationships between schemas and concepts
**Requirements:**
#### **Core Concepts to Model:**
1. **contract\_expiration** \- When contract ends (DATE or INTEGER days)
2. **contract\_value** \- Monetary value (LIFETIME vs ANNUAL semantics)
3. **contract\_status** \- Current state (single table vs history table)
4. **industry\_sector** \- Business vertical (various naming conventions)
5. **contract\_identifier** \- Primary key (ID vs name)
6. **customer\_name** \- Client organization
7. **contract\_start** \- Contract begin date
#### **Customer Mappings:**
For each customer, map each concept to:
* Table name
* Column name
* Data type
* Semantic type (e.g., LIFETIME\_TOTAL, ANNUAL\_RECURRING\_REVENUE)
* Required transformation (if any)
* Join requirements (for multi-table schemas)
#### **Transformation Rules:**
* **days\_remaining → DATE:** `DATE(CURRENT_DATE, '+' || days_remaining || ' days')` (SQLite)
* **DATE → days\_remaining:** `CAST(julianday(expiry_date) - julianday(CURRENT_DATE) AS INTEGER)` (SQLite)
* **ANNUAL → LIFETIME:** `contract_value * term_years`
* **LIFETIME → ANNUAL:** `contract_value / term_years`
**Deliverable:**
* `schema_translator/knowledge_graph.py` \- SchemaKnowledgeGraph class
* Methods: add\_concept(), add\_customer\_mapping(), add\_transformation(), get\_mapping(), save(), load()
* JSON persistence
* NetworkX graph structure for relationship traversal
* Pre-populated with all 6 customer mappings
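A sketch of the `SchemaKnowledgeGraph` shape, modelling concepts and customers as NetworkX nodes with mappings stored on edges; the attribute names and mapping dictionary keys are assumptions.

```python
# Sketch of SchemaKnowledgeGraph with NetworkX and JSON persistence.
# Node/edge attribute names are assumptions.
import json
import networkx as nx

class SchemaKnowledgeGraph:
    def __init__(self) -> None:
        self.graph = nx.DiGraph()

    def add_concept(self, concept_id: str, description: str) -> None:
        self.graph.add_node(concept_id, kind="concept", description=description)

    def add_customer_mapping(self, customer_id: str, concept_id: str, mapping: dict) -> None:
        # mapping: {"table": ..., "column": ..., "data_type": ...,
        #           "semantic_type": ..., "transformation": ..., "joins": [...]}
        self.graph.add_node(customer_id, kind="customer")
        self.graph.add_edge(customer_id, concept_id, **mapping)

    def get_mapping(self, customer_id: str, concept_id: str) -> dict:
        return dict(self.graph.edges[customer_id, concept_id])

    def save(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump(nx.node_link_data(self.graph), f, indent=2)

    def load(self, path: str) -> None:
        with open(path) as f:
            self.graph = nx.node_link_graph(json.load(f), directed=True)

# Example: Customer F stores annual (ARR) values that need a lifetime conversion.
kg = SchemaKnowledgeGraph()
kg.add_concept("contract_value", "Monetary value of the contract")
kg.add_customer_mapping("customer_f", "contract_value", {
    "table": "contracts", "column": "contract_value", "data_type": "REAL",
    "semantic_type": "ANNUAL_RECURRING_REVENUE",
    "transformation": "contract_value * term_years",
})
```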
---
### **5\. LLM Schema Analyzer Agent**
**Purpose:** Automatically analyze new customer schemas and propose mappings
**Requirements:**
#### **Core Functionality:**
1. **Analyze Schema:**
* Input: Customer database connection, customer\_id
* Extract table structures, column names, data types
* Query sample data from each column
* Identify primary keys and foreign keys
2. **Semantic Analysis:**
* Use LLM to interpret column names and sample values
* Determine semantic meaning (is "contract\_value" lifetime or annual?)
* Identify calculation needs (days\_remaining vs date conversion)
* Detect normalization patterns (single vs multi-table)
3. **Mapping Proposal:**
* Generate proposed mappings to canonical concepts
* Suggest transformation logic
* Flag ambiguities requiring human confirmation
* Provide confidence scores
4. **Validation:**
* Test proposed mappings with sample queries
* Verify JOIN logic for multi-table schemas
* Confirm value ranges are reasonable
**LLM Prompts Required:**
* Schema interpretation prompt
* Semantic disambiguation prompt (for ambiguous columns)
* Transformation generation prompt
* Validation prompt
**Deliverable:**
* `schema_translator/agents/schema_analyzer.py` \- SchemaAnalyzerAgent class
* Methods: analyze\_schema(), propose\_mappings(), validate\_mappings()
* Integration with Anthropic API
* Structured output parsing
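A sketch of the schema-analysis call using the Anthropic Python SDK; the prompt wording and expected response shape are assumptions, and the real `SchemaAnalyzerAgent` would add retries, confidence handling, and structured-output parsing around this.

```python
# Sketch of schema analysis via the Anthropic SDK; prompt text and the
# expected JSON shape are assumptions.
import json
import sqlite3
import anthropic

def analyze_schema(db_path: str, customer_id: str, model: str = "claude-sonnet-4-20250514") -> dict:
    # Extract table structures and a few sample rows from the customer database.
    conn = sqlite3.connect(db_path)
    tables = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
    schema_summary = []
    for table in tables:
        columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
        samples = conn.execute(f"SELECT * FROM {table} LIMIT 3").fetchall()
        schema_summary.append({"table": table, "columns": columns, "samples": samples})
    conn.close()

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    prompt = (
        f"You are analyzing the contract database of {customer_id}.\n"
        f"Schema and sample rows:\n{json.dumps(schema_summary, default=str, indent=2)}\n\n"
        "For each column, propose a mapping to one of the canonical concepts "
        "(contract_expiration, contract_value, contract_status, industry_sector, "
        "contract_identifier, customer_name, contract_start). State whether "
        "contract_value looks LIFETIME or ANNUAL, give a confidence score 0-1, "
        "and flag anything ambiguous. Respond with JSON only."
    )
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)
```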
---
### **6\. Query Understanding Agent**
**Purpose:** Convert natural language queries to Semantic Query Plans
**Requirements:**
#### **Input Processing:**
* Accept natural language questions about contracts
* Parse intent (find, count, aggregate, compare)
* Extract filters (time ranges, value thresholds, industry, status)
* Identify aggregations (sum, count, average)
* Determine projections (which fields to return)
#### **Semantic Query Plan Generation:**
Convert to schema-independent intermediate representation:
```json
{
  "intent": "find_contracts",
  "filters": [
    {
      "concept": "contract_expiration",
      "operator": "within_next_days",
      "value": 30,
      "semantic_note": "expiration may be date or days_remaining"
    },
    {
      "concept": "contract_value",
      "operator": "greater_than",
      "value": 500000,
      "semantic_note": "need to normalize to lifetime total"
    },
    {
      "concept": "industry_sector",
      "operator": "equals",
      "value": "Technology"
    }
  ],
  "projections": ["contract_identifier", "contract_value", "contract_expiration", "customer_name"],
  "aggregations": null
}
```
#### **Operator Mapping:**
* Natural language → standardized operators
* "expiring soon", "due in next X days" → within\_next\_days
* "over $X", "worth more than" → greater\_than
* "in Q1", "between dates" → date\_range
* "technology sector", "tech companies" → equals (with normalization)
**LLM Prompts Required:**
* Query understanding prompt
* Ambiguity resolution prompt
* Context-aware concept extraction
**Deliverable:**
* `schema_translator/agents/query_understanding.py` \- QueryUnderstandingAgent class
* Methods: parse\_query(), create\_semantic\_plan()
* Natural language processing with LLM
* Structured SemanticQueryPlan output
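A sketch of `parse_query()` using a one-shot example in the conversation to steer the model toward the plan format above; the system prompt and example wording are assumptions.

```python
# Sketch of QueryUnderstandingAgent.parse_query(); prompt and one-shot example
# are assumptions based on the plan format above.
import json
import anthropic

SYSTEM = (
    "Convert the user's contract question into a semantic query plan. "
    "Use only these concepts: contract_expiration, contract_value, contract_status, "
    "industry_sector, contract_identifier, customer_name, contract_start. "
    "Use operators: within_next_days, greater_than, less_than, equals, date_range. "
    "Return JSON with keys: intent, filters, projections, aggregations."
)

EXAMPLE_QUESTION = "Contracts expiring in next 30 days worth over $500K in tech"
EXAMPLE_PLAN = {
    "intent": "find_contracts",
    "filters": [
        {"concept": "contract_expiration", "operator": "within_next_days", "value": 30},
        {"concept": "contract_value", "operator": "greater_than", "value": 500000},
        {"concept": "industry_sector", "operator": "equals", "value": "Technology"},
    ],
    "projections": ["contract_identifier", "contract_value",
                    "contract_expiration", "customer_name"],
    "aggregations": None,
}

def parse_query(question: str, model: str = "claude-sonnet-4-20250514") -> dict:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=SYSTEM,
        messages=[
            {"role": "user", "content": EXAMPLE_QUESTION},
            {"role": "assistant", "content": json.dumps(EXAMPLE_PLAN)},
            {"role": "user", "content": question},
        ],
    )
    # Validate downstream with models.SemanticQueryPlan before compiling.
    return json.loads(response.content[0].text)
```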
---
### **7\. Query Compiler**
**Purpose:** Compile Semantic Query Plans to customer-specific SQL
**Requirements:**
#### **Core Compilation:**
1. **Concept Resolution:**
* For each concept in plan, lookup customer mapping in knowledge graph
* Retrieve table, column, type, semantic type
* Identify required transformations
2. **SQL Generation:**
* Build SELECT clause with projections
* Build FROM clause (single table or JOINs)
* Build WHERE clause with filters
* Apply transformations in WHERE/SELECT as needed
* Build GROUP BY, HAVING for aggregations
3. **Schema-Specific Handling:**
* **Single table (A, C, D, E, F):** Direct SELECT
* **Multi-table (B):** Auto-generate JOINs
* **Date fields (A, C, E, F):** Direct date comparisons
* **Days remaining (D):** Convert to date calculations
* **Lifetime value (A, C, D, E):** Direct value comparison
* **Annual value (F):** Multiply by term\_years for lifetime equivalent
4. **Industry Normalization:**
* Map "Technology" β†’ customer-specific terms (Tech, Information Technology)
* Handle semantic equivalence
#### **Example Compilations:**
**Query:** "Contracts expiring in next 30 days worth over $500K"

**Customer A SQL:**
```sql
SELECT contract_id, contract_name, contract_value, expiry_date, customer_name
FROM contracts
WHERE expiry_date BETWEEN CURRENT_DATE AND DATE(CURRENT_DATE, '+30 days')
  AND contract_value > 500000
```

**Customer B SQL:**
```sql
SELECT h.id, h.contract_name, h.contract_value, r.renewal_date, h.client_name
FROM contract_headers h
JOIN renewal_schedule r ON h.id = r.contract_id
JOIN contract_status_history s ON h.id = s.contract_id
WHERE r.renewal_date BETWEEN CURRENT_DATE AND DATE(CURRENT_DATE, '+30 days')
  AND h.contract_value > 500000
  AND s.status_date = (
    SELECT MAX(status_date)
    FROM contract_status_history
    WHERE contract_id = h.id
  )
```

**Customer D SQL:**
```sql
SELECT contract_id, contract_title, contract_value,
       DATE(CURRENT_DATE, '+' || days_remaining || ' days') AS expiry_date,
       customer_org
FROM contracts
WHERE days_remaining BETWEEN 0 AND 30
  AND contract_value > 500000
```

**Customer F SQL:**
```sql
SELECT contract_id, name,
       (contract_value * term_years) AS lifetime_value,
       expiration_date, account
FROM contracts
WHERE expiration_date BETWEEN CURRENT_DATE AND DATE(CURRENT_DATE, '+30 days')
  AND (contract_value * term_years) > 500000
```
**Deliverable:**
* `schema_translator/query_compiler.py` \- QueryCompiler class
* Methods: compile\_for\_customer(), generate\_select(), generate\_where(), generate\_joins()
* SQL injection prevention
* Query validation
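A sketch of how filter compilation might branch on the knowledge-graph mapping, covering the days\_remaining and annual-value cases called out above; the mapping keys and function shape are assumptions, and real code should prefer parameterized queries over string interpolation.

```python
# Sketch of turning semantic filters into customer-specific SQL fragments,
# driven by knowledge-graph mappings. Mapping keys are assumptions.
def compile_filter(concept_mapping: dict, operator: str, value) -> str:
    column = concept_mapping["column"]
    semantic_type = concept_mapping.get("semantic_type", "")
    transformation = concept_mapping.get("transformation")

    if operator == "within_next_days":
        if semantic_type == "DAYS_REMAINING":            # Customer D style
            return f"{column} BETWEEN 0 AND {int(value)}"
        return (f"{column} BETWEEN CURRENT_DATE "
                f"AND DATE(CURRENT_DATE, '+{int(value)} days')")  # DATE style

    if operator == "greater_than":
        expr = transformation or column                   # e.g. "contract_value * term_years"
        return f"({expr}) > {float(value)}"

    if operator == "equals":
        escaped = str(value).replace("'", "''")           # naive quoting; use parameters in practice
        return f"{column} = '{escaped}'"

    raise ValueError(f"Unsupported operator: {operator}")

# Customer D: expiration stored as INTEGER days_remaining
print(compile_filter({"column": "days_remaining", "semantic_type": "DAYS_REMAINING"},
                     "within_next_days", 30))
# -> days_remaining BETWEEN 0 AND 30

# Customer F: annual value normalized to lifetime via transformation
print(compile_filter({"column": "contract_value",
                      "transformation": "contract_value * term_years"},
                     "greater_than", 500000))
# -> (contract_value * term_years) > 500000.0
```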
---
### **8\. Database Executor**
**Purpose:** Execute customer-specific SQL queries and collect results
**Requirements:**
#### **Execution:**
* Connect to customer SQLite databases
* Execute compiled SQL queries
* Measure execution time
* Handle database errors gracefully
* Return structured QueryResult objects
#### **Multi-Customer Execution:**
* Execute queries across multiple customers in parallel (optional: can be sequential for MVP)
* Collect results from all customers
* Tag results with customer\_id
* Include executed SQL for debugging
#### **Error Handling:**
* Catch SQL errors (invalid column, syntax, etc.)
* Return partial results if some customers succeed
* Log errors with context
* Provide actionable error messages
**Deliverable:**
* `schema_translator/database_executor.py` \- DatabaseExecutor class
* Methods: execute\_query(), execute\_for\_customer(), execute\_all\_customers()
* Connection pooling
* Result aggregation
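A sketch of `execute_for_customer()` with timing and graceful error capture; the return shape mirrors `QueryResult` plus an `error` field, which is an assumption.

```python
# Sketch of single-customer query execution against a SQLite database.
import sqlite3
import time
from typing import Any

def execute_for_customer(customer_id: str, db_path: str, sql: str) -> dict[str, Any]:
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row              # rows behave like dicts
    start = time.perf_counter()
    try:
        rows = [dict(r) for r in conn.execute(sql).fetchall()]
        error = None
    except sqlite3.Error as exc:                # invalid column, syntax error, etc.
        rows, error = [], str(exc)
    finally:
        conn.close()
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Shape mirrors the QueryResult model; partial failures surface via "error".
    return {
        "customer_id": customer_id,
        "data": rows,
        "sql_executed": sql,
        "execution_time_ms": round(elapsed_ms, 2),
        "error": error,
    }
```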
---
### **9\. Result Harmonizer**
**Purpose:** Normalize and harmonize results from different customer schemas
**Requirements:**
#### **Normalization:**
1. **Value Normalization:**
* Convert all contract values to same semantic unit (lifetime total)
* Customer F: multiply annual by term\_years
* Add metadata indicating original vs calculated
2. **Date Normalization:**
* Convert all expiration representations to ISO date format
* Customer D: calculate actual date from days\_remaining
* Ensure consistent timezone handling
3. **Field Name Normalization:**
* Map customer-specific field names to canonical names
* contract\_id/id → contract\_identifier
* customer\_name/client\_name/account → customer\_name
* expiry\_date/expiration\_date/renewal\_date → expiration\_date
4. **Industry Name Normalization:**
* Map variations to canonical terms
* "Tech", "Technology", "Information Technology" β†’ "Technology"
* Maintain consistent taxonomy
#### **Result Formatting:**
* Unified schema for display
* Include source customer
* Include both original and normalized values where applicable
* Sort by specified criteria (e.g., expiration date, value)
* Support pagination
**Deliverable:**
* `schema_translator/result_harmonizer.py` \- ResultHarmonizer class
* Methods: harmonize\_results(), normalize\_value(), normalize\_date(), normalize\_industry()
* Configurable normalization rules
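A sketch of the per-row normalization rules; the alias tables and the `value_is_annual` flag are assumptions drawn from the bullets above.

```python
# Sketch of per-row harmonization: field aliases, value/date normalization,
# and industry canonicalization. Alias maps are illustrative assumptions.
from datetime import date, timedelta

FIELD_ALIASES = {
    "contract_id": "contract_identifier", "id": "contract_identifier",
    "client_name": "customer_name", "account": "customer_name", "customer_org": "customer_name",
    "expiry_date": "expiration_date", "renewal_date": "expiration_date",
}
INDUSTRY_ALIASES = {"tech": "Technology", "information technology": "Technology",
                    "health care": "Healthcare", "fin": "Finance"}

def harmonize_row(row: dict, customer_id: str, value_is_annual: bool = False,
                  term_years_field: str = "term_years") -> dict:
    out = {"source_customer": customer_id}
    for key, value in row.items():
        out[FIELD_ALIASES.get(key, key)] = value
    # Value normalization: annual (ARR) -> lifetime total, keeping the original.
    if value_is_annual and "contract_value" in out and out.get(term_years_field):
        out["contract_value_original_annual"] = out["contract_value"]
        out["contract_value"] = out["contract_value"] * out[term_years_field]
    # Date normalization: days_remaining -> ISO expiration_date.
    if "days_remaining" in out and "expiration_date" not in out:
        out["expiration_date"] = (date.today()
                                  + timedelta(days=int(out["days_remaining"]))).isoformat()
    # Industry normalization to a canonical taxonomy.
    if "industry" in out:
        out["industry"] = INDUSTRY_ALIASES.get(str(out["industry"]).strip().lower(),
                                               out["industry"])
    return out
```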
---
### **10\. Chat Orchestrator**
**Purpose:** Coordinate the entire query flow from natural language to harmonized results
**Requirements:**
#### **Workflow:**
1. Receive natural language query from user
2. Optionally filter by specific customer(s) or query all
3. Call QueryUnderstandingAgent → SemanticQueryPlan
4. For each customer:
* Call QueryCompiler → customer-specific SQL
* Call DatabaseExecutor → QueryResult
5. Call ResultHarmonizer → HarmonizedResult
6. Format for display
7. Return to user interface
#### **Additional Features:**
* Query validation before execution
* Caching for repeated queries
* Query history tracking
* Performance metrics (end-to-end latency)
* Error aggregation and reporting
#### **Conversation Context:**
* Maintain conversation state for follow-up questions
* Reference previous queries ("show me more like this")
* Clarification handling ("Did you mean X or Y?")
**Deliverable:**
* `schema_translator/orchestrator.py` \- ChatOrchestrator class
* Methods: process\_query(), validate\_query(), execute\_pipeline()
* Integration with all agents and components
* Comprehensive logging
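A sketch of how `process_query()` could wire the pipeline together; the component interfaces follow the earlier sections, but the exact signatures are assumptions.

```python
# Sketch of the orchestration pipeline; component interfaces are assumptions.
import time

class ChatOrchestrator:
    def __init__(self, query_agent, compiler, executor, harmonizer, customer_ids):
        self.query_agent = query_agent
        self.compiler = compiler
        self.executor = executor
        self.harmonizer = harmonizer
        self.customer_ids = customer_ids
        self.history = []                                   # simple query history tracking

    def process_query(self, question: str, customers: list[str] | None = None) -> dict:
        started = time.perf_counter()
        plan = self.query_agent.parse_query(question)       # NL -> SemanticQueryPlan
        targets = customers or self.customer_ids            # optional customer filtering
        results = []
        for customer_id in targets:
            sql = self.compiler.compile_for_customer(plan, customer_id)
            results.append(self.executor.execute_for_customer(customer_id, sql))
        harmonized = self.harmonizer.harmonize_results(results)
        latency_ms = (time.perf_counter() - started) * 1000
        self.history.append({"question": question, "plan": plan, "latency_ms": latency_ms})
        return {"plan": plan, "results": harmonized, "latency_ms": round(latency_ms, 2)}
```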
---
### **11\. Chat Interface (Chainlit)**
**Purpose:** Web-based chat UI for interacting with the Schema Translator
**Requirements:**
#### **UI Features:**
1. **Chat Interface:**
* Natural language input box
* Streaming responses
* Display harmonized results in tables
* Show per-customer breakdowns
* Display executed SQL (collapsible/expandable)
* Error messages with retry options
2. **Customer Selection:**
* Dropdown or checkboxes to select specific customers
* "All customers" option (default)
* Show customer schema summaries on hover
3. **Result Display:**
* Table view with sortable columns
* Highlight normalized values
* Show original vs calculated values
* Export to CSV option
* Pagination for large result sets
4. **Query Examples:**
* Pre-populated example queries
* "Show me contracts expiring in next 30 days"
* "Find technology contracts over $1M"
* "What's the total value of contracts expiring in Q1?"
* "Compare contract values across customers"
5. **Debug Mode:**
* Toggle to show semantic query plan
* Show customer-specific SQL for each customer
* Show knowledge graph lookups
* Execution time per customer
#### **Chainlit-Specific:**
* Use @cl.on\_message for query handling
* Use cl.Message for responses
* Use cl.DataTable for results display (or custom HTML)
* Use cl.Action for interactive buttons
* Session management for conversation context
**Deliverable:**
* `app.py` \- Main Chainlit application
* Integration with ChatOrchestrator
* Responsive UI with tables and formatting
* Error handling and user feedback
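A sketch of the Chainlit entry point; `build_orchestrator()` is a hypothetical factory that wires up the components described above, and the plain-text result formatting stands in for the richer table display.

```python
# Sketch of app.py message handling with Chainlit.
# build_orchestrator() is a hypothetical factory, not part of any library.
import chainlit as cl
from schema_translator.orchestrator import build_orchestrator  # hypothetical helper

orchestrator = build_orchestrator()

@cl.on_message
async def handle_message(message: cl.Message):
    try:
        outcome = orchestrator.process_query(message.content)
        harmonized = outcome["results"]
        lines = [f"Matched {harmonized['total_count']} contracts across "
                 f"{len(harmonized['customers_queried'])} customer(s):"]
        for row in harmonized["results"][:20]:              # simple pagination cap
            lines.append(
                f"- {row.get('contract_identifier')} | "
                f"${row.get('contract_value', 0):,.0f} | "
                f"expires {row.get('expiration_date')} | "
                f"source: {row.get('source_customer')}"
            )
        await cl.Message(content="\n".join(lines)).send()
    except Exception as exc:                                 # surface errors with a retry hint
        await cl.Message(content=f"Query failed: {exc}. Please rephrase and retry.").send()
```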
---
### **12\. Learning and Feedback Loop**
**Purpose:** Improve system over time based on query results and user feedback
**Requirements:**
#### **Feedback Collection:**
* Capture user corrections ("This result is wrong")
* Track query failures and errors
* Log semantic ambiguities that required clarification
* Record query patterns and frequency
#### **Learning Mechanisms:**
1. **Schema Drift Detection:**
* Periodically check customer schemas for changes
* Detect new columns, modified types
* Flag when mappings may be stale
2. **Transformation Refinement:**
* When queries fail, analyze error
* Use LLM to propose fixes
* Update knowledge graph with successful fixes
3. **Semantic Learning:**
* Track which industry name mappings are used most
* Learn customer-specific terminology preferences
* Improve concept disambiguation over time
4. **Query Pattern Analysis:**
* Identify common query types
* Pre-generate optimized mappings for frequent patterns
* Suggest query completions
**Deliverable:**
* `schema_translator/learning/feedback_loop.py` \- FeedbackLoop class
* `schema_translator/learning/schema_drift_detector.py` \- SchemaDriftDetector class
* Methods: record\_feedback(), detect\_drift(), refine\_transformations()
* Periodic background tasks
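A sketch of feedback logging and a simple schema snapshot/diff for drift detection; the file names and record shape are assumptions.

```python
# Sketch of FeedbackLoop.record_feedback() and a schema snapshot/diff.
# File names and the feedback record shape are assumptions.
import json
import sqlite3
from datetime import datetime, timezone
from pathlib import Path

class FeedbackLoop:
    def __init__(self, log_path: str = "feedback_log.jsonl") -> None:
        self.log_path = Path(log_path)

    def record_feedback(self, query: str, customer_id: str, verdict: str, note: str = "") -> None:
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "query": query,
            "customer_id": customer_id,
            "verdict": verdict,          # e.g. "correct", "wrong_result", "query_failed"
            "note": note,
        }
        with self.log_path.open("a") as f:
            f.write(json.dumps(record) + "\n")

def snapshot_schema(db_path: str) -> dict[str, list[str]]:
    """Capture table -> column-name list; compare snapshots to detect drift."""
    conn = sqlite3.connect(db_path)
    tables = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
    snapshot = {t: [c[1] for c in conn.execute(f"PRAGMA table_info({t})")] for t in tables}
    conn.close()
    return snapshot

def detect_drift(old: dict, new: dict) -> list[str]:
    changes = []
    for table, columns in new.items():
        if table not in old:
            changes.append(f"new table: {table}")
        else:
            for col in set(columns) - set(old[table]):
                changes.append(f"new column: {table}.{col}")
    return changes
```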
---
## **Implementation Phases**
### **Phase 1: Foundation (Week 1\)**
**Goal:** Core data structures and mock data
**Tasks:**
1. Set up project structure
2. Implement configuration management
3. Define all Pydantic models
4. Generate mock data for all 6 customers
5. Test database creation and data access
**Validation:**
* All 6 SQLite databases created with 50 contracts each
* Can query each database directly
* Models serialize to/from JSON correctly
---
### **Phase 2: Knowledge Graph (Week 1-2)**
**Goal:** Build and populate the knowledge graph
**Tasks:**
1. Implement SchemaKnowledgeGraph class
2. Define 7 core concepts with aliases
3. Create mappings for all 6 customers
4. Implement transformation rules
5. Add JSON persistence
6. Write unit tests
**Validation:**
* Knowledge graph loads and saves correctly
* Can query concept mappings for each customer
* Transformations are correctly defined
* No missing mappings for any customer
---
### **Phase 3: Query Compiler & Executor (Week 2\)**
**Goal:** Generate and execute customer-specific SQL
**Tasks:**
1. Implement QueryCompiler
2. Handle single-table schemas (A, C, D, E, F)
3. Handle multi-table schema (B)
4. Implement date/days\_remaining conversions
5. Implement lifetime/annual value conversions
6. Implement DatabaseExecutor
7. Write extensive unit tests with real queries
**Validation:**
* Same semantic query compiles to different SQL for each customer
* All compiled queries execute successfully
* Results are returned in QueryResult format
* Customer D date calculations are correct
* Customer F value normalization is correct
* Customer B joins work correctly
---
### **Phase 4: LLM Agents (Week 2-3)**
**Goal:** Implement intelligent query understanding and schema analysis
**Tasks:**
1. Set up Anthropic API integration
2. Implement QueryUnderstandingAgent
3. Test natural language → SemanticQueryPlan conversion
4. Implement SchemaAnalyzerAgent
5. Test auto-analysis on a mock "Customer G"
6. Implement prompt engineering and few-shot examples
**Validation:**
* Natural language queries correctly parsed
* Semantic query plans are accurate
* Schema analyzer correctly interprets new schemas
* LLM proposes reasonable mappings
* Ambiguity detection works (flags when uncertain)
---
### **Phase 5: Result Harmonization (Week 3\)**
**Goal:** Normalize and present unified results
**Tasks:**
1. Implement ResultHarmonizer
2. Implement value normalization (annual → lifetime)
3. Implement date normalization (days → date)
4. Implement field name mapping
5. Implement industry name normalization
6. Test with multi-customer result sets
**Validation:**
* Customer F annual values correctly converted
* Customer D dates correctly calculated
* Field names unified across customers
* Industry names normalized
* Results sortable and filterable
---
### **Phase 6: Orchestration (Week 3\)**
**Goal:** Connect all components into cohesive pipeline
**Tasks:**
1. Implement ChatOrchestrator
2. Integrate all agents and components
3. Implement error handling and logging
4. Implement query validation
5. Add performance monitoring
6. Write integration tests
**Validation:**
* End-to-end query flow works
* Errors handled gracefully
* Can query all customers simultaneously
* Can filter by specific customer(s)
* Query history maintained
* Performance acceptable (\<5s for simple queries)
---
### **Phase 7: UI Implementation (Week 4\)**
**Goal:** Build user-facing chat interface
**Tasks:**
1. Set up Chainlit application
2. Implement chat message handling
3. Implement result table display
4. Add customer selection controls
5. Add example queries
6. Implement debug mode
7. Add error handling and user feedback
8. Style and polish UI
**Validation:**
* Chat interface responsive and intuitive
* Results display in readable tables
* Customer filtering works
* Example queries load and execute
* Debug mode shows SQL and query plans
* Errors displayed clearly with retry options
---
### **Phase 8: Learning & Polish (Week 4\)**
**Goal:** Add learning capabilities and final refinements
**Tasks:**
1. Implement FeedbackLoop
2. Implement SchemaDriftDetector
3. Add query failure analysis
4. Add transformation refinement
5. Comprehensive testing across all components
6. Documentation and deployment guide
7. Performance optimization
**Validation:**
* System learns from errors
* Schema drift detected
* Query failures analyzed and fixed
* End-to-end system robust and performant
* Documentation complete
---
## **File Structure**
```
schema_translator/
├── README.md
├── requirements.txt
├── .env.example
├── .gitignore
│
├── databases/                       # SQLite databases
│   ├── customer_a.db
│   ├── customer_b.db
│   ├── customer_c.db
│   ├── customer_d.db
│   ├── customer_e.db
│   └── customer_f.db
│
├── schema_translator/
│   ├── __init__.py
│   ├── config.py                    # Configuration management
│   ├── models.py                    # Pydantic data models
│   ├── mock_data.py                 # Mock data generation
│   ├── knowledge_graph.py           # Schema knowledge graph
│   ├── query_compiler.py            # SQL generation
│   ├── database_executor.py         # Query execution
│   ├── result_harmonizer.py         # Result normalization
│   ├── orchestrator.py              # Main pipeline orchestrator
│   │
│   ├── agents/
│   │   ├── __init__.py
│   │   ├── schema_analyzer.py       # LLM schema analysis
│   │   └── query_understanding.py   # NLP → semantic query plan
│   │
│   └── learning/
│       ├── __init__.py
│       ├── feedback_loop.py         # User feedback processing
│       └── schema_drift_detector.py # Schema change detection
│
├── tests/
│   ├── __init__.py
│   ├── test_models.py
│   ├── test_knowledge_graph.py
│   ├── test_query_compiler.py
│   ├── test_database_executor.py
│   ├── test_agents.py
│   └── test_integration.py
│
├── app.py                           # Chainlit application
├── knowledge_graph.json             # Persisted knowledge graph
└── query_history.json               # Query logs
```
---
## **Testing Requirements**
### **Unit Tests**
* All data models validate correctly
* Knowledge graph operations (add, query, save, load)
* Query compiler for each customer schema type
* Database executor with mock databases
* Result harmonizer with various data types
* Each agent independently
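One concrete unit test, sketched with pytest, for the Customer D compilation behavior described earlier; the `QueryCompiler` and knowledge-graph interfaces shown here are assumptions.

```python
# Sketch of a unit test (pytest); the interfaces used here are assumptions.
from schema_translator.query_compiler import QueryCompiler
from schema_translator.knowledge_graph import SchemaKnowledgeGraph
from schema_translator.models import QueryFilter, SemanticQueryPlan

def test_customer_d_uses_days_remaining():
    kg = SchemaKnowledgeGraph()
    kg.load("knowledge_graph.json")
    compiler = QueryCompiler(kg)
    plan = SemanticQueryPlan(
        intent="find_contracts",
        filters=[QueryFilter(concept="contract_expiration",
                             operator="within_next_days", value=30)],
        projections=["contract_identifier", "contract_expiration"],
    )
    sql = compiler.compile_for_customer(plan, "customer_d")
    # Customer D stores expiration as INTEGER days_remaining, so no date math is needed.
    assert "days_remaining BETWEEN 0 AND 30" in sql
```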
### **Integration Tests**
* End-to-end query flow (NL → results)
* Multi-customer queries
* Error scenarios (invalid SQL, missing mappings)
* Schema analysis on mock new customer
* Learning loop with feedback
### **Performance Tests**
* Query latency (target: \<5s for simple queries across 6 customers)
* Knowledge graph lookup speed
* LLM API call optimization (minimize calls)
* Database query optimization
---
## **Non-Functional Requirements**
### **Security**
* SQL injection prevention (parameterized queries)
* API key protection (environment variables)
* Input validation on all user queries
### **Scalability**
* Knowledge graph should support 50+ customers
* Query execution should parallelize where possible
* Results pagination for large datasets
### **Maintainability**
* Comprehensive docstrings
* Type hints throughout
* Clear separation of concerns
* Modular architecture for easy updates
### **Observability**
* Logging at all pipeline stages
* Query performance metrics
* Error tracking with context
* Debug mode for development
---
## **Example Queries to Support**
1. "Show me contracts expiring in the next 30 days"
2. "Find all technology contracts worth over $1 million"
3. "What's the total value of contracts expiring in Q1 2025?"
4. "Show me inactive or expired contracts"
5. "Which customers have contracts expiring soon?" (group by customer)
6. "Compare contract values between healthcare and technology sectors"
7. "Find contracts for Customer A that expire before February"
8. "What's the average contract value across all customers?"
9. "Show me the top 10 highest value contracts"
10. "Find contracts that started in 2023 and are still active"
---
## **Deliverables Summary**
1. ✅ Working system with all components integrated
2. ✅ 6 customer mock databases with realistic data
3. ✅ Knowledge graph with all mappings
4. ✅ LLM-powered query understanding
5. ✅ Customer-specific SQL generation
6. ✅ Result harmonization and normalization
7. ✅ Chainlit chat interface
8. ✅ Comprehensive test suite
9. ✅ Documentation and README
10. ✅ Deployment instructions
---
## **Success Metrics**
1. **Accuracy:** \>95% of test queries return correct results
2. **Coverage:** All 10 example queries work across all 6 customers
3. **Performance:** \<5 seconds for simple queries, \<15 seconds for complex
4. **Usability:** Non-technical users can successfully query data
5. **Adaptability:** New customer onboarding takes \<1 hour
---
## **Tech Stack**
* **Language:** Python 3.10+
* **LLM:** Anthropic Claude (claude-sonnet-4-20250514)
* **Database:** SQLite
* **UI Framework:** Chainlit
* **Data Validation:** Pydantic
* **Graph:** NetworkX
* **Testing:** pytest
* **Environment:** python-dotenv
---
## **Getting Started Instructions**
```bash
# 1. Clone repository
git clone <repo-url>
cd schema_translator

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Set up environment
cp .env.example .env
# Edit .env and add ANTHROPIC_API_KEY

# 5. Generate mock data
python -m schema_translator.mock_data

# 6. Run tests
pytest tests/

# 7. Start application
chainlit run app.py
```
---
## **Notes for Claude Code**
* Build incrementally, validating each component before moving to the next
* Start with data layer (models, mock\_data, config) before logic
* Test each customer schema type independently
* Use real LLM calls early to validate prompts
* Focus on Customer A, D, and F first (they demonstrate key differences)
* Customer B can be added later (multi-table complexity)
* Prioritize correctness over performance initially
* Add comprehensive error messages for debugging
* Use print/logging statements liberally during development