# Semantic Transformer on Meaning Tokens (UGDF)

## Overview

This repository implements a Semantic Transformer trained end-to-end on Meaning Tokens instead of traditional subword tokens (BPE, WordPiece, etc.). Meaning Tokens are validated, constrained concepts governed by a Universal Generalised Data Framework (UGDF).
The core idea is simple but foundational:
Modern AI fails not because models are weak, but because meaning is not structurally represented.
This system moves meaning out of model weights and into an explicit, versioned, auditable data framework.
Language becomes an input/output interface, not the reasoning substrate.
## Why This Exists

### The Problem With Current AI Systems
Modern LLMs operate on:
- Statistical tokens
- Learned correlations
- Unconstrained generation
As a result, they:
- Hallucinate facts
- Drift in meaning over time
- Cannot enforce truth or safety structurally
- Treat code, language, and policy as the same thing (text)
Even when an answer sounds correct, it is not guaranteed to be correct.
This repository demonstrates a different paradigm.
## Core Principles
- Meaning is authoritative, not probabilistic
- Models reason over concepts, not strings
- Constraints are enforced before and after inference
- Language is a rendering layer, not the source of truth
These principles are enforced architecturally, not via prompts or RLHF.
## What Is UGDF?
UGDF (Universal Generalised Data Framework) is a meaning-preserving data system.
Each atomic unit of data is a Concept, not a value or token.
Every concept must satisfy a fixed set of invariants:
- Identity (what it is)
- Definition (what it means)
- Scope (where it applies)
- Constraints (what it is not allowed to be)
- Purpose (why it exists)
- Time (version and history)
- Usage (where and how it is used)
If any invariant is missing, the concept is invalid.
This prevents silent semantic drift.
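As an illustration, the invariants above can be enforced at construction time, so an incomplete concept can never exist at all. The following is a minimal sketch; the `Concept` class and its field names are hypothetical, not this repository's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    # Hypothetical sketch: one field per UGDF invariant.
    identity: str        # what it is, e.g. "ENTITY:FELINE"
    definition: str      # what it means
    scope: str           # where it applies
    constraints: tuple   # what it is not allowed to be
    purpose: str         # why it exists
    version: int         # time: version and history
    usage: tuple         # where and how it is used

    def __post_init__(self):
        # If any invariant is missing, the concept is invalid.
        for name in ("identity", "definition", "scope", "purpose"):
            if not getattr(self, name):
                raise ValueError(f"invalid concept: missing invariant {name!r}")
        if self.version < 1:
            raise ValueError("invalid concept: version must start at 1")

feline = Concept(
    identity="ENTITY:FELINE",
    definition="A member of the cat family",
    scope="animals",
    constraints=("not furniture", "not inanimate"),
    purpose="refer to cats in animal-domain statements",
    version=1,
    usage=("language", "knowledge-base"),
)
```

Because the dataclass is frozen and validated on construction, a concept that passes `__post_init__` stays valid for its lifetime.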
## Meaning Tokens

### Definition
A Meaning Token is a stable identifier mapped to a UGDF concept.
Example:

    1001 → ENTITY:FELINE
    1002 → STATE:REST
    1003 → RELATION:ON
    1004 → OBJECT:SURFACE
Meaning Tokens are:
- Language-independent
- Versioned
- Constraint-aware
- Stable across time
They replace:
- Subwords
- Byte tokens
- Frequency-based vocabularies
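Such a vocabulary can be sketched as a registry of stable integer IDs mapped to versioned concept names. The `REGISTRY` table and `decode` helper below are illustrative assumptions, not the project's real data structures:

```python
# Hypothetical Meaning Token registry: stable IDs mapped to
# (concept, version) pairs, independent of any surface language.
REGISTRY = {
    1001: ("ENTITY:FELINE", 1),
    1002: ("STATE:REST", 1),
    1003: ("RELATION:ON", 1),
    1004: ("OBJECT:SURFACE", 1),
}

def decode(token_ids):
    """Map token IDs back to concept names; unknown IDs are rejected."""
    concepts = []
    for tid in token_ids:
        if tid not in REGISTRY:
            raise KeyError(f"unknown meaning token: {tid}")
        concepts.append(REGISTRY[tid][0])
    return concepts

print(decode([1001, 1002]))  # ['ENTITY:FELINE', 'STATE:REST']
```

Rejecting unknown IDs (instead of falling back to an `<unk>` token) is what keeps the vocabulary constraint-aware rather than frequency-based.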
## Real Example: Natural Language

### Input Text

    The cat sat on the mat.
### Traditional Tokenization (WordPiece-style subwords)

    ["The", "ca", "##t", "sat", "on", "the", "ma", "##t"]
No meaning. No constraints. No truth.
### Meaning Tokenization

    [
      ENTITY:FELINE,
      STATE:REST,
      RELATION:ON,
      OBJECT:SURFACE
    ]
Each token is validated against UGDF:
- FELINE → not furniture
- REST → not motion
- SURFACE → not living

Claims that violate these constraints can never be emitted, so this class of hallucination is structurally blocked.
## Real Example: Hallucination Prevention

### Question

    Is a cat a piece of furniture?

### UGDF Constraint

    FELINE.constraints = ["not furniture", "not inanimate"]

### Result

The claim is rejected before generation.
No probability. No guessing. No "helpful" but incorrect answer.
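One way to implement this pre-generation check is a direct constraint lookup. This sketch assumes a simple `"not <predicate>"` string encoding for constraints, which is an illustrative choice rather than the repository's actual scheme:

```python
# Hypothetical constraint table, keyed by concept identity.
CONSTRAINTS = {
    "ENTITY:FELINE": {"not furniture", "not inanimate"},
}

def claim_allowed(subject, predicate):
    """Reject the claim 'subject IS predicate' if the subject's
    constraints explicitly forbid that predicate."""
    return f"not {predicate}" not in CONSTRAINTS.get(subject, set())

assert not claim_allowed("ENTITY:FELINE", "furniture")  # rejected
assert claim_allowed("ENTITY:FELINE", "animal")         # allowed
```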
## Real Example: Python Code Safety

### Input Code

    eval("2 + 2")

### Meaning Tokens

    CALL:DYNAMIC_EXECUTION
    RISK:CODE_INJECTION

### UGDF Constraint

    DYNAMIC_EXEC → unsafe

### Result

Execution is blocked structurally.
No sandbox escape. No prompt override.
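A plausible way to derive these tokens is static analysis of the abstract syntax tree before anything runs. The token names mirror the example above; the detector itself is a hypothetical sketch, not this repository's tokenizer:

```python
import ast

# Built-ins whose call sites are treated as dynamic execution.
DYNAMIC_EXEC = {"eval", "exec", "compile"}

def meaning_tokens(source):
    """Walk the AST and emit risk-flagging Meaning Tokens for
    dynamic-execution call sites."""
    tokens = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in DYNAMIC_EXEC):
            tokens += ["CALL:DYNAMIC_EXECUTION", "RISK:CODE_INJECTION"]
    return tokens

def execution_blocked(source):
    # UGDF constraint: DYNAMIC_EXEC → unsafe, so execution is refused.
    return "RISK:CODE_INJECTION" in meaning_tokens(source)

assert execution_blocked('eval("2 + 2")')
assert not execution_blocked('2 + 2')
```

Because the check runs on the parsed tree rather than on executed code, no sandbox is involved at all.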
## Semantic Transformer Architecture

### High-Level Flow

    Meaning Token IDs
            ↓
    Semantic Embedding Layer
            ↓
    Transformer Encoder
            ↓
    Semantic Output Head
            ↓
    UGDF Constraint Validation
The transformer never sees raw text.
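The flow above can be sketched end-to-end in NumPy. The "encoder" here is a one-layer stub standing in for the transformer stack, and every dimension and weight is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 8, 16  # toy sizes; real values are model hyperparameters

embedding = rng.normal(size=(VOCAB, DIM))             # Semantic Embedding Layer
W_enc = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)    # stand-in for the encoder
W_out = rng.normal(size=(DIM, VOCAB)) / np.sqrt(DIM)  # Semantic Output Head

def forward(token_ids, forbidden):
    h = embedding[token_ids]               # Meaning Token IDs → embeddings
    h = np.tanh(h @ W_enc)                 # encoder (stub)
    logits = h @ W_out                     # output head
    logits[:, list(forbidden)] = -np.inf   # UGDF Constraint Validation
    return logits

logits = forward(np.array([1, 2]), forbidden={5})
# Column 5 is -inf at every position: the forbidden concept can never win.
```

Note that the only inputs are integer IDs; no string ever enters the model.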
## Training Data Format

This model is trained on meaning → meaning mappings.

### Example (Language Task)

    {
      "input": ["ENTITY:FELINE", "STATE:REST"],
      "output": ["ENTITY:FELINE", "STATE:REST", "RELATION:ON", "OBJECT:SURFACE"]
    }
### Example (Code Task)

    {
      "input": ["FUNCTION_DEF", "USER_INPUT"],
      "output": ["SECURITY_WARNING", "REQUIRES_SANITIZATION"]
    }
No text. No surface tokens. No ambiguity.
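Encoding such an example into trainable ID sequences is then a table lookup. The `CONCEPT_IDS` table below reuses the illustrative IDs from earlier and is an assumption, not the repository's vocabulary:

```python
import json

# Illustrative concept → ID table (same IDs as the earlier example).
CONCEPT_IDS = {
    "ENTITY:FELINE": 1001, "STATE:REST": 1002,
    "RELATION:ON": 1003, "OBJECT:SURFACE": 1004,
}

example = json.loads('''{
  "input": ["ENTITY:FELINE", "STATE:REST"],
  "output": ["ENTITY:FELINE", "STATE:REST", "RELATION:ON", "OBJECT:SURFACE"]
}''')

def encode(concepts):
    # A KeyError on an unknown concept is deliberate: invalid meaning
    # must fail loudly rather than degrade to an <unk> token.
    return [CONCEPT_IDS[c] for c in concepts]

src, tgt = encode(example["input"]), encode(example["output"])
print(src, tgt)  # [1001, 1002] [1001, 1002, 1003, 1004]
```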
## Constraint Masking (Critical Innovation)

UGDF constraints are applied directly to the output logits. If a token is forbidden in the current context, its logit is set to negative infinity:

    logit = -∞

After softmax, a forbidden token therefore has probability exactly zero and can never be sampled.
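The masking step fits in a few lines. This is a minimal NumPy sketch of the idea, not the repository's implementation:

```python
import numpy as np

def masked_softmax(logits, forbidden):
    """Softmax over logits with forbidden token positions masked to -inf."""
    masked = logits.copy()
    masked[list(forbidden)] = -np.inf
    # Subtract the max finite logit for numerical stability.
    z = np.exp(masked - masked[np.isfinite(masked)].max())
    return z / z.sum()

probs = masked_softmax(np.array([2.0, 1.0, 0.5]), forbidden={1})
# probs[1] is exactly 0.0: the forbidden token cannot be sampled.
```

Because `exp(-inf)` is exactly `0.0` in IEEE 754 arithmetic, the mask holds under any sampling temperature or top-k truncation applied afterwards.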
## Language Realization (Optional)
After semantic reasoning, output can be rendered into natural language using a separate language model.
Important:
- The language model has zero authority
- It cannot change meaning
- It only expresses meaning
This cleanly separates:
- Truth
- Reasoning
- Expression
## What This System Guarantees

| Property | Guaranteed |
|---|---|
| Hallucination prevention | ✓ |
| Semantic stability | ✓ |
| Explainability | ✓ |
| Code safety | ✓ |
| Multilingual consistency | ✓ |
| Regulatory auditability | ✓ |
These are architectural guarantees, not probabilistic ones.
## Trade-Offs (Honest)

### Costs
- Requires ontology curation
- Slower tokenization
- Smaller initial datasets
### Gains

- Orders-of-magnitude gains in reliability
- Structural safety
- Long-term meaning stability
- Cross-domain intelligence
This is infrastructure, not a prompt trick.
## License
This project is released under the Apache License 2.0.
Commercial use is explicitly permitted.
## Final Note
Transformers were never the problem.
Tokens were.
This repository demonstrates a path beyond statistical language models toward meaning-native intelligence.