
Semantic Transformer on Meaning Tokens (UGDF)

Overview

This repository implements a Semantic Transformer trained end-to-end on Meaning Tokens instead of traditional subword tokens (BPE, WordPiece, etc.). Meaning Tokens are validated, constrained concepts governed by a Universal Generalised Data Framework (UGDF).

The core idea is simple but foundational:

Modern AI fails not because models are weak, but because meaning is not structurally represented.

This system moves meaning out of model weights and into an explicit, versioned, auditable data framework.

Language becomes an input/output interface, not the reasoning substrate.


Why This Exists

The Problem With Current AI Systems

Modern LLMs operate on:

  • Statistical tokens
  • Learned correlations
  • Unconstrained generation

As a result, they:

  • Hallucinate facts
  • Drift in meaning over time
  • Cannot enforce truth or safety structurally
  • Treat code, language, and policy as the same thing (text)

Even when an answer sounds correct, it is not guaranteed to be correct.

This repository demonstrates a different paradigm.


Core Principles

  1. Meaning is authoritative, not probabilistic
  2. Models reason over concepts, not strings
  3. Constraints are enforced before and after inference
  4. Language is a rendering layer, not the source of truth

These principles are enforced architecturally, not via prompts or RLHF.


What Is UGDF?

UGDF (Universal Generalised Data Framework) is a meaning-preserving data system.

Each atomic unit of data is a Concept, not a value or token.

Every concept must satisfy a fixed set of invariants:

  • Identity (what it is)
  • Definition (what it means)
  • Scope (where it applies)
  • Constraints (what it is not allowed to be)
  • Purpose (why it exists)
  • Time (version and history)
  • Usage (where and how it is used)

If any invariant is missing, the concept is invalid.

This prevents silent semantic drift.
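The invariants above can be sketched as a record type whose validity check requires every field. This is an illustrative sketch only; the class and field names are hypothetical, not the repository's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical UGDF Concept record; fields mirror the invariants listed
# above (identity, definition, scope, constraints, purpose, time, usage).
@dataclass
class Concept:
    identity: str        # what it is, e.g. "ENTITY:FELINE"
    definition: str      # what it means
    scope: str           # where it applies
    constraints: list    # what it is not allowed to be
    purpose: str         # why it exists
    version: str         # time: version and history
    usage: list = field(default_factory=list)  # where and how it is used

    def is_valid(self) -> bool:
        # If any invariant is missing (empty), the concept is invalid.
        required = [self.identity, self.definition, self.scope,
                    self.constraints, self.purpose, self.version]
        return all(bool(v) for v in required)

feline = Concept(
    identity="ENTITY:FELINE",
    definition="A domesticated cat",
    scope="animals",
    constraints=["not furniture", "not inanimate"],
    purpose="denote the animal in assertions",
    version="1.0.0",
)
print(feline.is_valid())  # True
```

A concept missing any field, such as one with an empty definition, fails validation rather than silently entering the vocabulary.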


Meaning Tokens

Definition

A Meaning Token is a stable identifier mapped to a UGDF concept.

Example:

1001 → ENTITY:FELINE
1002 → STATE:REST
1003 → RELATION:ON
1004 → OBJECT:SURFACE

Meaning Tokens are:

  • Language-independent
  • Versioned
  • Constraint-aware
  • Stable across time

They replace:

  • Subwords
  • Byte tokens
  • Frequency-based vocabularies
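The mapping above can be sketched as a bidirectional registry between stable IDs and concept identifiers. The `encode`/`decode` helpers are hypothetical; only the ID-to-concept pairs come from the example above.

```python
# Illustrative registry of stable Meaning Token IDs, mirroring the
# example mapping above. Unknown concepts raise a KeyError rather than
# being split into frequency-based subwords.
REGISTRY = {
    1001: "ENTITY:FELINE",
    1002: "STATE:REST",
    1003: "RELATION:ON",
    1004: "OBJECT:SURFACE",
}
INVERSE = {concept: tid for tid, concept in REGISTRY.items()}

def encode(concepts):
    """Map concept identifiers to stable token IDs; fails loudly on unknowns."""
    return [INVERSE[c] for c in concepts]

def decode(token_ids):
    """Map token IDs back to concept identifiers."""
    return [REGISTRY[t] for t in token_ids]

print(encode(["ENTITY:FELINE", "STATE:REST"]))  # [1001, 1002]
```

Because the mapping is fixed and versioned rather than learned from corpus frequencies, the same concept always receives the same ID across languages and over time.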

Real Example: Natural Language

Input Text

The cat sat on the mat.

Traditional Subword Tokenization (WordPiece-style)

["The", "ca", "##t", "sat", "on", "the", "mat"]

No meaning. No constraints. No truth.

Meaning Tokenization

[
  ENTITY:FELINE,
  STATE:REST,
  RELATION:ON,
  OBJECT:SURFACE
]

Each token is validated against UGDF:

  • FELINE β†’ not furniture
  • REST β†’ not motion
  • SURFACE β†’ not living

Hallucinations that violate these constraints are structurally impossible.


Real Example: Hallucination Prevention

Question

Is a cat a piece of furniture?

UGDF Constraint

FELINE.constraints = ["not furniture", "not inanimate"]

Result

❌ The claim is rejected before generation.

No probability. No guessing. No "helpful" but incorrect answer.
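A minimal sketch of this pre-generation check, assuming a simple constraint table keyed by concept (the helper name and `"not X"` parsing are illustrative, not the repository's implementation):

```python
# Hypothetical constraint table: FELINE carries the constraints from the
# example above.
CONSTRAINTS = {
    "ENTITY:FELINE": ["not furniture", "not inanimate"],
}

def claim_allowed(subject: str, predicate: str) -> bool:
    # Reject a claim whose predicate matches a "not X" constraint
    # on the subject, before any text is generated.
    forbidden = {c[4:].strip()
                 for c in CONSTRAINTS.get(subject, [])
                 if c.startswith("not ")}
    return predicate not in forbidden

print(claim_allowed("ENTITY:FELINE", "furniture"))  # False: rejected
print(claim_allowed("ENTITY:FELINE", "animal"))     # True: permitted
```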


Real Example: Python Code Safety

Input Code

eval("2 + 2")

Meaning Tokens

CALL:DYNAMIC_EXECUTION
RISK:CODE_INJECTION

UGDF Constraint

CALL:DYNAMIC_EXECUTION → unsafe

Result

❌ Execution blocked structurally.

No sandbox escape. No prompt override.
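One way such a mapping from source code to risk tokens could work is a static AST scan. This is a sketch under that assumption; the scanner below is illustrative, though the token names are the ones used above.

```python
import ast

# Illustrative scan: map dynamically-executing calls in Python source to
# the meaning tokens shown above. The table and function are hypothetical.
UNSAFE_CALLS = {
    "eval": "CALL:DYNAMIC_EXECUTION",
    "exec": "CALL:DYNAMIC_EXECUTION",
}

def meaning_tokens_for(source: str):
    tokens = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in UNSAFE_CALLS):
            tokens += [UNSAFE_CALLS[node.func.id], "RISK:CODE_INJECTION"]
    return tokens

print(meaning_tokens_for('eval("2 + 2")'))
# ['CALL:DYNAMIC_EXECUTION', 'RISK:CODE_INJECTION']
```

Because the check runs on structure rather than on generated text, a prompt cannot talk the system out of it.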


Semantic Transformer Architecture

High-Level Flow

Meaning Token IDs
  ↓
Semantic Embedding Layer
  ↓
Transformer Encoder
  ↓
Semantic Output Head
  ↓
UGDF Constraint Validation

The transformer never sees raw text.
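The flow above can be sketched end to end with toy stand-ins for each stage. Everything here is illustrative: the embeddings are random, the "encoder" is a mean-pool, and the head is a dot product; only the stage order matches the diagram.

```python
import math
import random

random.seed(0)

VOCAB = ["ENTITY:FELINE", "STATE:REST", "RELATION:ON", "OBJECT:SURFACE"]
DIM = 8

# Toy semantic embedding table for meaning-token IDs (random weights).
EMBED = {i: [random.gauss(0, 1) for _ in range(DIM)] for i in range(len(VOCAB))}

def embed(token_ids):
    # Semantic Embedding Layer: IDs in, vectors out; no raw text anywhere.
    return [EMBED[t] for t in token_ids]

def encode_stub(vectors):
    # Stand-in for the Transformer Encoder: mean-pool the sequence.
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def output_head(hidden):
    # Toy Semantic Output Head: score each meaning token by dot product.
    return [sum(h * e for h, e in zip(hidden, EMBED[i]))
            for i in range(len(VOCAB))]

def validate(logits, forbidden_ids):
    # UGDF Constraint Validation: forbidden tokens get logit = -inf.
    return [-math.inf if i in forbidden_ids else z
            for i, z in enumerate(logits)]

logits = validate(output_head(encode_stub(embed([0, 1]))), forbidden_ids={0})
```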


Training Data Format

This model is trained on meaning β†’ meaning mappings.

Example (Language Task)

{
  "input": ["ENTITY:FELINE", "STATE:REST"],
  "output": ["ENTITY:FELINE", "STATE:REST", "RELATION:ON", "OBJECT:SURFACE"]
}

Example (Code Task)

{
  "input": ["FUNCTION_DEF", "USER_INPUT"],
  "output": ["SECURITY_WARNING", "REQUIRES_SANITIZATION"]
}

No text. No surface tokens. No ambiguity.


Constraint Masking (Critical Innovation)

UGDF constraints are applied directly to logits.

If a token is forbidden:

logit = -∞

This makes sampling a forbidden token mathematically impossible: after softmax, its probability is exactly zero.
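A minimal numeric sketch of the mask (the function name is illustrative): a forbidden position receives logit -∞, so its softmax probability is exactly zero.

```python
import math

def masked_softmax(logits, forbidden):
    # Apply the constraint mask: forbidden positions get logit = -inf,
    # so their probability after softmax is exactly zero.
    masked = [-math.inf if i in forbidden else z for i, z in enumerate(logits)]
    m = max(z for z in masked if z != -math.inf)
    exps = [0.0 if z == -math.inf else math.exp(z - m) for z in masked]
    total = sum(exps)
    return [e / total for e in exps]

probs = masked_softmax([2.0, 1.0, 0.5], forbidden={1})
print(probs[1])  # 0.0: the forbidden token can never be sampled
```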


Language Realization (Optional)

After semantic reasoning, output can be rendered into natural language using a separate language model.

Important:

  • The language model has zero authority
  • It cannot change meaning
  • It only expresses meaning

This cleanly separates:

  • Truth
  • Reasoning
  • Expression

What This System Guarantees

Property                  | Guaranteed
--------------------------|-----------
Hallucination prevention  | ✅
Semantic stability        | ✅
Explainability            | ✅
Code safety               | ✅
Multilingual consistency  | ✅
Regulatory auditability   | ✅

These are architectural guarantees, not probabilistic ones.


Trade-Offs (Honest)

Costs

  • Requires ontology curation
  • Slower tokenization
  • Smaller initial datasets

Gains

  • Orders-of-magnitude gains in reliability
  • Structural safety
  • Long-term meaning stability
  • Cross-domain intelligence

This is infrastructure, not a prompt trick.


License

This project is released under the Apache License 2.0.

Commercial use is explicitly permitted.


Final Note

Transformers were never the problem.

Tokens were.

This repository demonstrates a path beyond statistical language models toward meaning-native intelligence.
