# Semantic Transformer on Meaning Tokens (UGDF)

## Overview

This repository implements a Semantic Transformer trained end-to-end on Meaning Tokens instead of traditional subword tokens (BPE, WordPiece, etc.). Meaning Tokens are validated, constrained concepts governed by a Universal Generalised Data Framework (UGDF).
The core idea is simple but foundational:
Modern AI fails not because models are weak, but because meaning is not structurally represented.
This system moves meaning out of model weights and into an explicit, versioned, auditable data framework.
Language becomes an input/output interface, not the reasoning substrate.
## Why This Exists

### The Problem With Current AI Systems
Modern LLMs operate on:
- Statistical tokens
- Learned correlations
- Unconstrained generation
As a result, they:
- Hallucinate facts
- Drift in meaning over time
- Cannot enforce truth or safety structurally
- Treat code, language, and policy as the same thing (text)
Even when an answer sounds correct, it is not guaranteed to be correct.
This repository demonstrates a different paradigm.
## Core Principles
- Meaning is authoritative, not probabilistic
- Models reason over concepts, not strings
- Constraints are enforced before and after inference
- Language is a rendering layer, not the source of truth
These principles are enforced architecturally, not via prompts or RLHF.
## What Is UGDF?
UGDF (Universal Generalised Data Framework) is a meaning-preserving data system.
Each atomic unit of data is a Concept, not a value or token.
Every concept must satisfy a fixed set of invariants:
- Identity (what it is)
- Definition (what it means)
- Scope (where it applies)
- Constraints (what it is not allowed to be)
- Purpose (why it exists)
- Time (version and history)
- Usage (where and how it is used)
If any invariant is missing, the concept is invalid.
This prevents silent semantic drift.
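As an illustration, the invariants above can be enforced at construction time, so an incomplete concept can never exist at all. The following is a minimal sketch; the `Concept` class and its field names are hypothetical, not this repository's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    # Hypothetical sketch: one field per UGDF invariant.
    identity: str        # what it is, e.g. "ENTITY:FELINE"
    definition: str      # what it means
    scope: str           # where it applies
    constraints: tuple   # what it is not allowed to be
    purpose: str         # why it exists
    version: int         # time: version and history
    usage: tuple         # where and how it is used

    def __post_init__(self):
        # If any invariant is missing, the concept is invalid.
        for name in ("identity", "definition", "scope", "purpose"):
            if not getattr(self, name):
                raise ValueError(f"invalid concept: missing invariant {name!r}")
        if self.version < 1:
            raise ValueError("invalid concept: version must start at 1")

feline = Concept(
    identity="ENTITY:FELINE",
    definition="A member of the cat family",
    scope="animals",
    constraints=("not furniture", "not inanimate"),
    purpose="refer to cats in animal-domain statements",
    version=1,
    usage=("language", "knowledge-base"),
)
```

Because the dataclass is frozen and validated on construction, a concept that passes `__post_init__` stays valid for its lifetime.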
## Meaning Tokens

### Definition
A Meaning Token is a stable identifier mapped to a UGDF concept.
Example:

    1001 → ENTITY:FELINE
    1002 → STATE:REST
    1003 → RELATION:ON
    1004 → OBJECT:SURFACE
Meaning Tokens are:
- Language-independent
- Versioned
- Constraint-aware
- Stable across time
They replace:
- Subwords
- Byte tokens
- Frequency-based vocabularies
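Such a vocabulary can be sketched as a registry of stable integer IDs mapped to versioned concept names. The `REGISTRY` table and `decode` helper below are illustrative assumptions, not the project's real data structures:

```python
# Hypothetical Meaning Token registry: stable IDs mapped to
# (concept, version) pairs, independent of any surface language.
REGISTRY = {
    1001: ("ENTITY:FELINE", 1),
    1002: ("STATE:REST", 1),
    1003: ("RELATION:ON", 1),
    1004: ("OBJECT:SURFACE", 1),
}

def decode(token_ids):
    """Map token IDs back to concept names; unknown IDs are rejected."""
    concepts = []
    for tid in token_ids:
        if tid not in REGISTRY:
            raise KeyError(f"unknown meaning token: {tid}")
        concepts.append(REGISTRY[tid][0])
    return concepts

print(decode([1001, 1002]))  # ['ENTITY:FELINE', 'STATE:REST']
```

Rejecting unknown IDs (instead of falling back to an `<unk>` token) is what keeps the vocabulary constraint-aware rather than frequency-based.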
## Real Example: Natural Language

### Input Text

    The cat sat on the mat.
### Traditional Tokenization (WordPiece-style subwords)

    ["The", "ca", "##t", "sat", "on", "the", "ma", "##t"]
No meaning. No constraints. No truth.
### Meaning Tokenization

    [
      ENTITY:FELINE,
      STATE:REST,
      RELATION:ON,
      OBJECT:SURFACE
    ]
Each token is validated against UGDF:
- FELINE → not furniture
- REST → not motion
- SURFACE → not living

Claims that violate these constraints can never be emitted, so this class of hallucination is structurally blocked.
## Real Example: Hallucination Prevention

### Question

    Is a cat a piece of furniture?

### UGDF Constraint

    FELINE.constraints = ["not furniture", "not inanimate"]

### Result

The claim is rejected before generation.
No probability. No guessing. No "helpful" but incorrect answer.
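One way to implement this pre-generation check is a direct constraint lookup. This sketch assumes a simple `"not <predicate>"` string encoding for constraints, which is an illustrative choice rather than the repository's actual scheme:

```python
# Hypothetical constraint table, keyed by concept identity.
CONSTRAINTS = {
    "ENTITY:FELINE": {"not furniture", "not inanimate"},
}

def claim_allowed(subject, predicate):
    """Reject the claim 'subject IS predicate' if the subject's
    constraints explicitly forbid that predicate."""
    return f"not {predicate}" not in CONSTRAINTS.get(subject, set())

assert not claim_allowed("ENTITY:FELINE", "furniture")  # rejected
assert claim_allowed("ENTITY:FELINE", "animal")         # allowed
```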
## Real Example: Python Code Safety

### Input Code

    eval("2 + 2")

### Meaning Tokens

    CALL:DYNAMIC_EXECUTION
    RISK:CODE_INJECTION

### UGDF Constraint

    DYNAMIC_EXEC → unsafe

### Result

Execution is blocked structurally.
No sandbox escape. No prompt override.
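A plausible way to derive these tokens is static analysis of the abstract syntax tree before anything runs. The token names mirror the example above; the detector itself is a hypothetical sketch, not this repository's tokenizer:

```python
import ast

# Built-ins whose call sites are treated as dynamic execution.
DYNAMIC_EXEC = {"eval", "exec", "compile"}

def meaning_tokens(source):
    """Walk the AST and emit risk-flagging Meaning Tokens for
    dynamic-execution call sites."""
    tokens = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in DYNAMIC_EXEC):
            tokens += ["CALL:DYNAMIC_EXECUTION", "RISK:CODE_INJECTION"]
    return tokens

def execution_blocked(source):
    # UGDF constraint: DYNAMIC_EXEC → unsafe, so execution is refused.
    return "RISK:CODE_INJECTION" in meaning_tokens(source)

assert execution_blocked('eval("2 + 2")')
assert not execution_blocked('2 + 2')
```

Because the check runs on the parsed tree rather than on executed code, no sandbox is involved at all.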
## Semantic Transformer Architecture

### High-Level Flow

    Meaning Token IDs
            ↓
    Semantic Embedding Layer
            ↓
    Transformer Encoder
            ↓
    Semantic Output Head
            ↓
    UGDF Constraint Validation
The transformer never sees raw text.
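The flow above can be sketched end-to-end in NumPy. The "encoder" here is a one-layer stub standing in for the transformer stack, and every dimension and weight is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 8, 16  # toy sizes; real values are model hyperparameters

embedding = rng.normal(size=(VOCAB, DIM))             # Semantic Embedding Layer
W_enc = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)    # stand-in for the encoder
W_out = rng.normal(size=(DIM, VOCAB)) / np.sqrt(DIM)  # Semantic Output Head

def forward(token_ids, forbidden):
    h = embedding[token_ids]               # Meaning Token IDs → embeddings
    h = np.tanh(h @ W_enc)                 # encoder (stub)
    logits = h @ W_out                     # output head
    logits[:, list(forbidden)] = -np.inf   # UGDF Constraint Validation
    return logits

logits = forward(np.array([1, 2]), forbidden={5})
# Column 5 is -inf at every position: the forbidden concept can never win.
```

Note that the only inputs are integer IDs; no string ever enters the model.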
## Training Data Format

This model is trained on meaning → meaning mappings.

### Example (Language Task)

    {
      "input": ["ENTITY:FELINE", "STATE:REST"],
      "output": ["ENTITY:FELINE", "STATE:REST", "RELATION:ON", "OBJECT:SURFACE"]
    }
### Example (Code Task)

    {
      "input": ["FUNCTION_DEF", "USER_INPUT"],
      "output": ["SECURITY_WARNING", "REQUIRES_SANITIZATION"]
    }
No text. No surface tokens. No ambiguity.
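Encoding such an example into trainable ID sequences is then a table lookup. The `CONCEPT_IDS` table below reuses the illustrative IDs from earlier and is an assumption, not the repository's vocabulary:

```python
import json

# Illustrative concept → ID table (same IDs as the earlier example).
CONCEPT_IDS = {
    "ENTITY:FELINE": 1001, "STATE:REST": 1002,
    "RELATION:ON": 1003, "OBJECT:SURFACE": 1004,
}

example = json.loads('''{
  "input": ["ENTITY:FELINE", "STATE:REST"],
  "output": ["ENTITY:FELINE", "STATE:REST", "RELATION:ON", "OBJECT:SURFACE"]
}''')

def encode(concepts):
    # A KeyError on an unknown concept is deliberate: invalid meaning
    # must fail loudly rather than degrade to an <unk> token.
    return [CONCEPT_IDS[c] for c in concepts]

src, tgt = encode(example["input"]), encode(example["output"])
print(src, tgt)  # [1001, 1002] [1001, 1002, 1003, 1004]
```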
## Constraint Masking (Critical Innovation)

UGDF constraints are applied directly to the output logits. If a token is forbidden in the current context, its logit is set to negative infinity:

    logit = -∞

After softmax, a forbidden token therefore has probability exactly zero and can never be sampled.
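The masking step fits in a few lines. This is a minimal NumPy sketch of the idea, not the repository's implementation:

```python
import numpy as np

def masked_softmax(logits, forbidden):
    """Softmax over logits with forbidden token positions masked to -inf."""
    masked = logits.copy()
    masked[list(forbidden)] = -np.inf
    # Subtract the max finite logit for numerical stability.
    z = np.exp(masked - masked[np.isfinite(masked)].max())
    return z / z.sum()

probs = masked_softmax(np.array([2.0, 1.0, 0.5]), forbidden={1})
# probs[1] is exactly 0.0: the forbidden token cannot be sampled.
```

Because `exp(-inf)` is exactly `0.0` in IEEE 754 arithmetic, the mask holds under any sampling temperature or top-k truncation applied afterwards.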
## Language Realization (Optional)
After semantic reasoning, output can be rendered into natural language using a separate language model.
Important:
- The language model has zero authority
- It cannot change meaning
- It only expresses meaning
This cleanly separates:
- Truth
- Reasoning
- Expression
## What This System Guarantees

| Property | Guaranteed |
|---|---|
| Hallucination prevention | ✓ |
| Semantic stability | ✓ |
| Explainability | ✓ |
| Code safety | ✓ |
| Multilingual consistency | ✓ |
| Regulatory auditability | ✓ |
These are architectural guarantees, not probabilistic ones.
## Trade-Offs (Honest)

### Costs
- Requires ontology curation
- Slower tokenization
- Smaller initial datasets
### Gains

- Orders-of-magnitude gains in reliability
- Structural safety
- Long-term meaning stability
- Cross-domain intelligence
This is infrastructure, not a prompt trick.
## License
This project is released under the Apache License 2.0.
Commercial use is explicitly permitted.
## Final Note
Transformers were never the problem.
Tokens were.
This repository demonstrates a path beyond statistical language models toward meaning-native intelligence.