Tags: Text Generation · Transformers · Safetensors · gpt2 · text-generation-inference

Model Card for ae-314/promoter-gpt-ft-tata

This is a fine-tuned version of adehoffer/Promoter-GPT-model, adapted to human TATA/no-TATA promoter data, for generating synthetic promoter sequences.

Model Details

Model Description

This is a fine-tuned model of Adele de Hoffer's Promoter-GPT-model for generating synthetic promoter sequences. I trained my own base Promoter-GPT following Adele's guide (https://huggingface.co/blog/hugging-science/promoter-gpt), then loaded that base model and fine-tuned it on the TATA/no-TATA human promoter sequence subsets of InstaDeepAI/nucleotide_transformer_downstream_tasks_revised.

This model generates synthetic promoter sequences given a seed.

Disclaimer: Research use only; not for clinical or commercial use. Sequences require wet-lab validation—no direct improvement over the base model is claimed.

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

  • Developed by: ae-314
  • Model type: Causal language model (GPT-2 ~0.43M params), 3-mer WordLevel tokenizer, context length 298 tokens (see the tokenization sketch after this list)
  • Language(s) (NLP): DNA (A,C,T,G)
  • License: cc-by-nc-3.0
  • Finetuned from model: a custom base Promoter-GPT trained from scratch following https://huggingface.co/blog/hugging-science/promoter-gpt
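
As a quick illustration of the tokenization noted above, a 300 bp sequence yields 298 overlapping 3-mers, one WordLevel token each, which is where the 298-token context length comes from. This is a minimal sketch using this card's repo id; whether the tokenizer adds special tokens is not assumed here.

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("ae-314/promoter-gpt-ft-tata")

seq = "ACGT" * 75                                    # toy 300 bp sequence
kmers = [seq[i:i + 3] for i in range(len(seq) - 2)]  # overlapping 3-mers
print(len(kmers))                                    # 298 -> matches n_positions=298
ids = tok.encode(" ".join(kmers))                    # WordLevel: one id per 3-mer
print(len(ids))                                      # 298 if no special tokens are added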

Model Sources

  • Training guide (base model): https://huggingface.co/blog/hugging-science/promoter-gpt
  • Fine-tuning data: InstaDeepAI/nucleotide_transformer_downstream_tasks_revised

Evaluation

  • Test (balanced human promoters, 300 bp): loss = 1.2884 · perplexity = 3.63
  • Generation (N=50): GC% ≈ 60.6 ± 15.2; TATA 4-mer ≈ 26%; TATAWA ≈ 10%; unique 6-mer ratio ≈ 0.815; ≥6-bp homopolymer ≈ 74% (see the sketch after this list for how these stats can be computed).
  • Notes: Perplexity is on a mixed TATA+no-TATA domain (not directly comparable to an AT-rich 200 bp setup). Generation stats are unconditional (no control tokens).
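
The perplexity above follows directly from the reported loss, and the generation statistics are plain string statistics over the sampled sequences. The sketch below shows one way to compute them; the exact evaluation script is not included in this card, so treat the helper seq_stats and its thresholds as assumptions.

import math, re

# Perplexity is exp of the mean cross-entropy loss reported above
print(math.exp(1.2884))   # ≈ 3.63

def seq_stats(seq, k=6):
    gc = 100.0 * sum(c in "GC" for c in seq) / len(seq)                 # GC%
    has_tata = "TATA" in seq                                            # TATA 4-mer present
    has_tatawa = re.search(r"TATA[AT]A", seq) is not None               # TATAWA (W = A/T)
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    unique_ratio = len(set(kmers)) / len(kmers)                         # unique 6-mer ratio
    homopolymer = re.search(r"A{6}|C{6}|G{6}|T{6}", seq) is not None    # >=6-bp run
    return gc, has_tata, has_tatawa, unique_ratio, homopolymer

# Aggregate over generated sequences (e.g. the `seqs` list produced in the
# "How to Get Started" code below) to reproduce percentages like those above.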

Training Details

  • Base: custom Promoter-GPT (GPT-2 ~0.43M params)
  • Data: human promoters (300 bp), mixed promoter_tata + promoter_no_tata (positives), balanced
  • Tokenization & context: 3-mer WordLevel; 298 tokens (full 300 bp; positional embeddings expanded to n_positions=298)
  • Optimizer: AdamW (weight_decay=0.01), LR=1e-4, cosine schedule, warmup ≈10%
  • Batch / Accum: 128 / 8 · Epochs: 3 · Precision: fp32
  • Hardware: Google Colab T4 GPU
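
The fine-tuning script itself is not published; the following is a minimal sketch of a Trainer setup consistent with the hyperparameters above. The base-model path, the dataset column names ("sequence", "label"), the positive label value (1), and the reading of "Batch / Accum: 128 / 8" as per-device batch 128 with 8 accumulation steps are all assumptions.

from datasets import load_dataset, concatenate_datasets
from transformers import (GPT2LMHeadModel, PreTrainedTokenizerFast,
                          Trainer, TrainingArguments, default_data_collator)

base_repo = "path/to/custom-promoter-gpt-base"   # hypothetical path to the from-scratch base
tok = PreTrainedTokenizerFast.from_pretrained(base_repo)
model = GPT2LMHeadModel.from_pretrained(base_repo)

def kmerize(seq, k=3):
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

def prep(batch):
    # 300 bp -> 298 overlapping 3-mers; labels = input_ids for causal LM training
    enc = tok([kmerize(s) for s in batch["sequence"]], truncation=True, max_length=298)
    enc["labels"] = [list(ids) for ids in enc["input_ids"]]
    return enc

# Positive promoter sequences from the TATA and no-TATA tasks (assumed label == 1)
parts = [
    load_dataset("InstaDeepAI/nucleotide_transformer_downstream_tasks_revised",
                 name, split="train").filter(lambda x: x["label"] == 1)
    for name in ("promoter_tata", "promoter_no_tata")
]
train_ds = concatenate_datasets(parts).map(
    prep, batched=True, remove_columns=parts[0].column_names
)

args = TrainingArguments(
    output_dir="promoter-gpt-ft-tata",
    learning_rate=1e-4,
    weight_decay=0.01,                 # AdamW is the Trainer default optimizer
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                  # ~10% warmup
    per_device_train_batch_size=128,   # "Batch / Accum: 128 / 8" (assumed reading)
    gradient_accumulation_steps=8,
    num_train_epochs=3,                # fp16/bf16 left off -> fp32
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  data_collator=default_data_collator)
trainer.train()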

Direct Use

  • Unconditional generation of synthetic human promoter-like sequences from a short seed (research/education only).
  • Exploration of promoter sequence properties (e.g., GC%, k-mer distributions).

Downstream Use

  • Starting point for further fine-tuning on specific promoter subtypes or organism-specific data.
  • Conditioning or control-token experiments (e.g., motif presence) in future work.

Out-of-Scope Use

  • Clinical, diagnostic, or therapeutic applications.
  • Any wet-lab use without proper biosafety review and experimental validation.
  • Harmful/dual-use sequence design.

Bias, Risks, and Limitations

  • Reflects the mixed human promoter domain (TATA + no-TATA, 300 bp); may skew GC-rich and show simple repeats.
  • No experimental validation; outputs are not guaranteed functional or safe.
  • Improvements (or regressions) in perplexity may not correlate with biological realism.

Recommendations

  • Treat generations as hypotheses; validate in wet lab.
  • Use conservative sampling (lower temperature, repetition penalty) to reduce repeat bias; see the conservative-sampling variant after the code example below.
  • Do not use clinically or commercially.

How to Get Started with the Model

# Author: aaron e (ae-314)

from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast
import torch, re

# Load model & tokenizer
repo_id = "ae-314/promoter-gpt-ft-tata"
tok = PreTrainedTokenizerFast.from_pretrained(repo_id)
model = GPT2LMHeadModel.from_pretrained(repo_id).eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# K-mer helpers
def kmerize(seq, k=3):
    return " ".join(seq[i:i+k] for i in range(len(seq)-k+1))

def dekmerize_ids(ids, k=3):
    toks = tok.convert_ids_to_tokens(ids)
    kmers = [t for t in toks if re.fullmatch(r"[ACGT]{%d}" % k, t)]
    if not kmers: return ""
    seq = kmers[0]
    for t in kmers[1:]:
        seq += t[-1]
    return seq

# Generate N sequences (unconditional)
def generate_batch(seed="ATGG", N=50, k=3, temperature=0.9, top_p=0.9):
    inp = tok.encode(kmerize(seed, k), return_tensors="pt").to(device)
    max_new = 298 - inp.shape[1]  # 300 bp -> 298 3-mers
    assert max_new > 0, f"Seed too long in k-mer tokens ({inp.shape[1]})"
    pad_id = tok.pad_token_id if tok.pad_token_id is not None else (tok.eos_token_id or 0)
    with torch.no_grad():
        outs = model.generate(
            inp,
            max_new_tokens=max_new,
            do_sample=True, temperature=temperature, top_p=top_p,
            num_return_sequences=N,
            pad_token_id=pad_id
        )
    return [dekmerize_ids(outs[i].tolist(), k) for i in range(outs.shape[0])]

# Run and show first 5
seqs = generate_batch(seed="ATGG", N=50)
for i, s in enumerate(seqs[:5], 1):
    print(f"{i:02d}: {s}")