Model Card for ae-314/promoter-gpt-ft-tata
This is a fine-tuned (human TATA/no-TATA) version of adehoffer/Promoter-GPT-model for generating synthetic promoter sequences.
Model Details
Model Description
This is a fine-tuned version of Adele de Hoffer's Promoter-GPT model for generating synthetic promoter sequences. I trained my own base Promoter-GPT model following Adele's guide (https://huggingface.co/blog/hugging-science/promoter-gpt) and then fine-tuned that model on the TATA/no-TATA human promoter sequence datasets from InstaDeepAI/nucleotide_transformer_downstream_tasks_revised.
This model generates synthetic promoter sequences given a seed.
Disclaimer: Research use only; not for clinical or commercial use. Sequences require wet-lab validation; no direct improvement over the base model is claimed.
This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
- Developed by: ae-314
- Model type: Causal language model (GPT-2 ~0.43M params), 3-mer WordLevel tokenizer, context length 298 tokens (see the tokenization sketch after this list)
- Language(s) (NLP): DNA (A,C,T,G)
- License: cc-by-nc-3.0
- Finetuned from model: custom base Promoter-GPT trained from scratch using https://huggingface.co/blog/hugging-science/promoter-gpt
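To make the tokenizer entry above concrete, here is a minimal sketch (not the training code) of the overlapping 3-mer tokenization: a sequence of length L yields L - 3 + 1 k-mers, so a 300 bp promoter becomes 298 tokens, matching the model's context length. The `kmerize` helper mirrors the one in the quickstart code below.

```python
# Sketch: overlapping 3-mer ("k-mer") tokenization of a DNA string.
# A sequence of length L yields L - k + 1 overlapping k-mers,
# so 300 bp -> 300 - 3 + 1 = 298 tokens (the model's context length).

def kmerize(seq: str, k: int = 3) -> str:
    """Turn 'ACGTA' into 'ACG CGT GTA' (whitespace-separated k-mers)."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

toy_promoter = "ACGT" * 75           # toy 300 bp sequence, not a real promoter
tokens = kmerize(toy_promoter).split()
print(len(tokens))                   # 298
```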
Model Sources
- Repository: https://huggingface.co/ae-314/promoter-gpt-ft-tata
- Blog post by Adele de Hoffer: https://huggingface.co/blog/hugging-science/promoter-gpt
Evaluation
- Test (balanced human promoters, 300 bp): loss = 1.2884 · perplexity = 3.63
- Generation (N=50): GC% ≈ 60.6 ± 15.2; TATA 4-mer ≈ 26%; TATAWA ≈ 10%; unique 6-mer ratio ≈ 0.815; ≥6-bp homopolymer ≈ 74% (see the sketch after these notes).
- Notes: Perplexity is on a mixed TATA+no-TATA domain (not directly comparable to an AT-rich 200 bp setup). Generation stats are unconditional (no control tokens).
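For reference, perplexity here is exp(test loss), i.e. exp(1.2884) ≈ 3.63. The sketch below shows one way per-sequence generation statistics like those above (GC%, TATA presence, unique 6-mer ratio, homopolymer runs) can be computed; these helpers are illustrative and are not the exact evaluation scripts used for the reported numbers.

```python
# Illustrative per-sequence statistics (not the exact evaluation code).
import math, re

print(round(math.exp(1.2884), 2))  # perplexity = exp(test loss) ≈ 3.63

def gc_percent(seq):
    return 100.0 * sum(b in "GC" for b in seq) / len(seq)

def has_tata_4mer(seq):
    return "TATA" in seq

def has_tatawa(seq):
    return re.search(r"TATA[AT]A", seq) is not None  # W = A or T

def unique_kmer_ratio(seq, k=6):
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return len(set(kmers)) / len(kmers)

def has_homopolymer(seq, min_len=6):
    return re.search(r"A{%d,}|C{%d,}|G{%d,}|T{%d,}" % ((min_len,) * 4), seq) is not None

toy = "ATGG" + "GC" * 140 + "TATAAA" + "AAAAAA" + "ACGT"  # toy 300 bp string
print(round(gc_percent(toy), 1), has_tata_4mer(toy), has_tatawa(toy),
      round(unique_kmer_ratio(toy), 3), has_homopolymer(toy))
```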
Training Details
- Base: custom Promoter-GPT (GPT-2 ~0.43M params)
- Data: human promoters (300 bp), mixed promoter_tata + promoter_no_tata (positives), balanced
- Tokenization & context: 3-mer WordLevel; 298 tokens (full 300 bp; positional embeddings expanded to n_positions=298)
- Optimizer: AdamW (weight_decay=0.01), LR=1e-4, cosine schedule, warmup ≈10%
- Batch / Accum: 128 / 8 · Epochs: 3 · Precision: fp32 (a configuration sketch follows this list)
- Hardware: Google Colab T4 GPU
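The hyperparameters above map onto a standard 🤗 TrainingArguments configuration. The sketch below is not the exact training script: it assumes "Batch / Accum: 128 / 8" means a per-device batch of 128 with 8 gradient-accumulation steps, and it omits dataset preparation (loading the promoter_tata / promoter_no_tata positives, balancing, 3-mer tokenization) and the Trainer wiring.

```python
# Sketch of the fine-tuning configuration listed above (assumptions noted in comments).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="promoter-gpt-ft-tata",   # assumed local output directory name
    per_device_train_batch_size=128,     # assumes "Batch / Accum: 128 / 8"
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=1e-4,
    weight_decay=0.01,                   # AdamW weight decay (Trainer's default optimizer)
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                    # ~10% warmup
    fp16=False,                          # fp32 precision
)
print(args.learning_rate, args.lr_scheduler_type)
```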
Direct Use
- Unconditional generation of synthetic human promoter-like sequences from a short seed (research/education only).
- Exploration of promoter sequence properties (e.g., GC%, k-mer distributions).
Downstream Use
- Starting point for further fine-tuning on specific promoter subtypes or organism-specific data.
- Conditioning or control-token experiments (e.g., motif presence) in future work.
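As a hypothetical starting point for the control-token idea in the last bullet, one could add class tokens to the tokenizer, resize the embeddings, and prepend a class token to each training sequence before further fine-tuning. The `<TATA>` / `<NO_TATA>` token names below are illustrative and are not part of the released model.

```python
# Hypothetical sketch: adding control tokens for conditional fine-tuning.
# The <TATA> / <NO_TATA> names are illustrative, not part of this model's vocabulary.
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

repo_id = "ae-314/promoter-gpt-ft-tata"
tok = PreTrainedTokenizerFast.from_pretrained(repo_id)
model = GPT2LMHeadModel.from_pretrained(repo_id)

tok.add_special_tokens({"additional_special_tokens": ["<TATA>", "<NO_TATA>"]})
model.resize_token_embeddings(len(tok))  # make room for the new token embeddings

# During fine-tuning, each k-mer string would be prefixed with its class token, e.g.:
example = "<TATA> TAT ATA TAA"
print(tok.tokenize(example))
```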
Out-of-Scope Use
- Clinical, diagnostic, or therapeutic applications.
- Any wet-lab use without proper biosafety review and experimental validation.
- Harmful/dual-use sequence design.
Bias, Risks, and Limitations
- Reflects the mixed human promoter domain (TATA + no-TATA, 300 bp); may skew GC-rich and show simple repeats.
- No experimental validation; outputs are not guaranteed functional or safe.
- Improvements or regressions in perplexity may not correlate with biological realism.
Recommendations
- Treat generations as hypotheses; validate in wet lab.
- Use conservative sampling (lower temperature, repetition penalty) to reduce repeat bias.
- Do not use clinically or commercially.
How to Get Started with the Model
```python
# aaron e
# ae-314
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast
import torch, re

# Load model & tokenizer
repo_id = "ae-314/promoter-gpt-ft-tata"
tok = PreTrainedTokenizerFast.from_pretrained(repo_id)
model = GPT2LMHeadModel.from_pretrained(repo_id).eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# K-mer helpers
def kmerize(seq, k=3):
    return " ".join(seq[i:i+k] for i in range(len(seq)-k+1))

def dekmerize_ids(ids, k=3):
    toks = tok.convert_ids_to_tokens(ids)
    kmers = [t for t in toks if re.fullmatch(r"[ACGT]{%d}" % k, t)]
    if not kmers:
        return ""
    seq = kmers[0]
    for t in kmers[1:]:
        seq += t[-1]
    return seq

# Generate N sequences (unconditional)
def generate_batch(seed="ATGG", N=50, k=3, temperature=0.9, top_p=0.9):
    inp = tok.encode(kmerize(seed, k), return_tensors="pt").to(device)
    max_new = 298 - inp.shape[1]  # 300 bp -> 298 3-mers
    assert max_new > 0, f"Seed too long in k-mer tokens ({inp.shape[1]})"
    pad_id = tok.pad_token_id if tok.pad_token_id is not None else (tok.eos_token_id or 0)
    with torch.no_grad():
        outs = model.generate(
            inp,
            max_new_tokens=max_new,
            do_sample=True, temperature=temperature, top_p=top_p,
            num_return_sequences=N,
            pad_token_id=pad_id,
        )
    return [dekmerize_ids(outs[i].tolist(), k) for i in range(outs.shape[0])]

# Run and show the first 5
seqs = generate_batch(seed="ATGG", N=50)
for i, s in enumerate(seqs[:5], 1):
    print(f"{i:02d}: {s}")
```
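As suggested under Recommendations, more conservative sampling can reduce the repeat/homopolymer bias. The variant below reuses `model`, `tok`, `device`, and the k-mer helpers defined in the quickstart above; the temperature and repetition-penalty values are illustrative, not tuned settings.

```python
# Conservative sampling variant (assumes the quickstart code above has been run).
seed_ids = tok.encode(kmerize("ATGG", 3), return_tensors="pt").to(device)
pad_id = tok.pad_token_id if tok.pad_token_id is not None else (tok.eos_token_id or 0)
with torch.no_grad():
    outs = model.generate(
        seed_ids,
        max_new_tokens=298 - seed_ids.shape[1],
        do_sample=True,
        temperature=0.7,          # lower than the 0.9 used above
        top_p=0.9,
        repetition_penalty=1.2,   # penalizes repeated tokens (k-mers)
        num_return_sequences=10,
        pad_token_id=pad_id,
    )
print(dekmerize_ids(outs[0].tolist(), 3))
```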