Tags: Text Generation · Transformers · Safetensors · gpt2 · text-generation-inference

Model Card for ae-314/promoter-gpt-ft-tata

This is a fine-tuned version of adehoffer/Promoter-GPT-model, adapted to human TATA/no-TATA promoter data, for generating synthetic promoter sequences.

Model Details

Model Description

This is a fine-tuned model of Adele de Hoffer's Promoter-GPT-model for generating synthetic promoter sequences. I trained my own base Promoter-GPT following Adele's guide (https://huggingface.co/blog/hugging-science/promoter-gpt), then loaded that base model and fine-tuned it on the TATA/no-TATA human promoter sequence subsets of InstaDeepAI/nucleotide_transformer_downstream_tasks_revised.

This model generates synthetic promoter sequences given a seed.

Disclaimer: Research use only; not for clinical or commercial use. Sequences require wet-lab validation—no direct improvement over the base model is claimed.

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

  • Developed by: ae-314
  • Model type: Causal language model (GPT-2 ~0.43M params), 3-mer WordLevel tokenizer, context length 298 tokens (see the tokenization sketch after this list)
  • Language(s) (NLP): DNA (A,C,T,G)
  • License: cc-by-nc-3.0
  • Finetuned from model: a custom base Promoter-GPT trained from scratch following https://huggingface.co/blog/hugging-science/promoter-gpt
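
As a quick illustration of the tokenization noted above, a 300 bp sequence yields 298 overlapping 3-mers, one WordLevel token each, which is where the 298-token context length comes from. This is a minimal sketch using this card's repo id; whether the tokenizer adds special tokens is not assumed here.

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("ae-314/promoter-gpt-ft-tata")

seq = "ACGT" * 75                                    # toy 300 bp sequence
kmers = [seq[i:i + 3] for i in range(len(seq) - 2)]  # overlapping 3-mers
print(len(kmers))                                    # 298 -> matches n_positions=298
ids = tok.encode(" ".join(kmers))                    # WordLevel: one id per 3-mer
print(len(ids))                                      # 298 if no special tokens are added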

Model Sources

  • Training guide (base model): https://huggingface.co/blog/hugging-science/promoter-gpt
  • Fine-tuning data: InstaDeepAI/nucleotide_transformer_downstream_tasks_revised

Evaluation

  • Test (balanced human promoters, 300 bp): loss = 1.2884 · perplexity = 3.63
  • Generation (N=50): GC% ≈ 60.6 ± 15.2; TATA 4-mer ≈ 26%; TATAWA ≈ 10%; unique 6-mer ratio ≈ 0.815; ≥6-bp homopolymer ≈ 74% (see the sketch after this list for how these stats can be computed).
  • Notes: Perplexity is on a mixed TATA+no-TATA domain (not directly comparable to an AT-rich 200 bp setup). Generation stats are unconditional (no control tokens).
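
The perplexity above follows directly from the reported loss, and the generation statistics are plain string statistics over the sampled sequences. The sketch below shows one way to compute them; the exact evaluation script is not included in this card, so treat the helper seq_stats and its thresholds as assumptions.

import math, re

# Perplexity is exp of the mean cross-entropy loss reported above
print(math.exp(1.2884))   # ≈ 3.63

def seq_stats(seq, k=6):
    gc = 100.0 * sum(c in "GC" for c in seq) / len(seq)                 # GC%
    has_tata = "TATA" in seq                                            # TATA 4-mer present
    has_tatawa = re.search(r"TATA[AT]A", seq) is not None               # TATAWA (W = A/T)
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    unique_ratio = len(set(kmers)) / len(kmers)                         # unique 6-mer ratio
    homopolymer = re.search(r"A{6}|C{6}|G{6}|T{6}", seq) is not None    # >=6-bp run
    return gc, has_tata, has_tatawa, unique_ratio, homopolymer

# Aggregate over generated sequences (e.g. the `seqs` list produced in the
# "How to Get Started" code below) to reproduce percentages like those above.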

Training Details

  • Base: custom Promoter-GPT (GPT-2 ~0.43M params)
  • Data: human promoters (300 bp), mixed promoter_tata + promoter_no_tata (positives), balanced
  • Tokenization & context: 3-mer WordLevel; 298 tokens (full 300 bp; positional embeddings expanded to n_positions=298)
  • Optimizer: AdamW (weight_decay=0.01), LR=1e-4, cosine schedule, warmup ≈10%
  • Batch / Accum: 128 / 8 · Epochs: 3 · Precision: fp32
  • Hardware: Google Colab T4 GPU
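
The fine-tuning script itself is not published; the following is a minimal sketch of a Trainer setup consistent with the hyperparameters above. The base-model path, the dataset column names ("sequence", "label"), the positive label value (1), and the reading of "Batch / Accum: 128 / 8" as per-device batch 128 with 8 accumulation steps are all assumptions.

from datasets import load_dataset, concatenate_datasets
from transformers import (GPT2LMHeadModel, PreTrainedTokenizerFast,
                          Trainer, TrainingArguments, default_data_collator)

base_repo = "path/to/custom-promoter-gpt-base"   # hypothetical path to the from-scratch base
tok = PreTrainedTokenizerFast.from_pretrained(base_repo)
model = GPT2LMHeadModel.from_pretrained(base_repo)

def kmerize(seq, k=3):
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

def prep(batch):
    # 300 bp -> 298 overlapping 3-mers; labels = input_ids for causal LM training
    enc = tok([kmerize(s) for s in batch["sequence"]], truncation=True, max_length=298)
    enc["labels"] = [list(ids) for ids in enc["input_ids"]]
    return enc

# Positive promoter sequences from the TATA and no-TATA tasks (assumed label == 1)
parts = [
    load_dataset("InstaDeepAI/nucleotide_transformer_downstream_tasks_revised",
                 name, split="train").filter(lambda x: x["label"] == 1)
    for name in ("promoter_tata", "promoter_no_tata")
]
train_ds = concatenate_datasets(parts).map(
    prep, batched=True, remove_columns=parts[0].column_names
)

args = TrainingArguments(
    output_dir="promoter-gpt-ft-tata",
    learning_rate=1e-4,
    weight_decay=0.01,                 # AdamW is the Trainer default optimizer
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                  # ~10% warmup
    per_device_train_batch_size=128,   # "Batch / Accum: 128 / 8" (assumed reading)
    gradient_accumulation_steps=8,
    num_train_epochs=3,                # fp16/bf16 left off -> fp32
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  data_collator=default_data_collator)
trainer.train()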

Direct Use

  • Unconditional generation of synthetic human promoter-like sequences from a short seed (research/education only).
  • Exploration of promoter sequence properties (e.g., GC%, k-mer distributions).

Downstream Use

  • Starting point for further fine-tuning on specific promoter subtypes or organism-specific data.
  • Conditioning or control-token experiments (e.g., motif presence) in future work.

Out-of-Scope Use

  • Clinical, diagnostic, or therapeutic applications.
  • Any wet-lab use without proper biosafety review and experimental validation.
  • Harmful/dual-use sequence design.

Bias, Risks, and Limitations

  • Reflects the mixed human promoter domain (TATA + no-TATA, 300 bp); may skew GC-rich and show simple repeats.
  • No experimental validation; outputs are not guaranteed functional or safe.
  • Improvements (or regressions) in perplexity may not correlate with biological realism.

Recommendations

  • Treat generations as hypotheses; validate in wet lab.
  • Use conservative sampling (lower temperature, repetition penalty) to reduce repeat bias; see the conservative-sampling variant after the code example below.
  • Do not use clinically or commercially.

How to Get Started with the Model

# Author: aaron e (ae-314)

from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast
import torch, re

# Load model & tokenizer
repo_id = "ae-314/promoter-gpt-ft-tata"
tok = PreTrainedTokenizerFast.from_pretrained(repo_id)
model = GPT2LMHeadModel.from_pretrained(repo_id).eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# K-mer helpers
def kmerize(seq, k=3):
    return " ".join(seq[i:i+k] for i in range(len(seq)-k+1))

def dekmerize_ids(ids, k=3):
    toks = tok.convert_ids_to_tokens(ids)
    kmers = [t for t in toks if re.fullmatch(r"[ACGT]{%d}" % k, t)]
    if not kmers: return ""
    seq = kmers[0]
    for t in kmers[1:]:
        seq += t[-1]
    return seq

# Generate N sequences (unconditional)
def generate_batch(seed="ATGG", N=50, k=3, temperature=0.9, top_p=0.9):
    inp = tok.encode(kmerize(seed, k), return_tensors="pt").to(device)
    max_new = 298 - inp.shape[1]  # 300 bp -> 298 3-mers
    assert max_new > 0, f"Seed too long in k-mer tokens ({inp.shape[1]})"
    pad_id = tok.pad_token_id if tok.pad_token_id is not None else (tok.eos_token_id or 0)
    with torch.no_grad():
        outs = model.generate(
            inp,
            max_new_tokens=max_new,
            do_sample=True, temperature=temperature, top_p=top_p,
            num_return_sequences=N,
            pad_token_id=pad_id
        )
    return [dekmerize_ids(outs[i].tolist(), k) for i in range(outs.shape[0])]

# Run and show first 5
seqs = generate_batch(seed="ATGG", N=50)
for i, s in enumerate(seqs[:5], 1):
    print(f"{i:02d}: {s}")