Predicting end of sequence

#31

by pipparichter - opened Aug 1, 2023


Hello! I am trying to use the model to predict whether or not a protein sequence is complete (as opposed to erroneously truncated). It seems as though the ProtGPT2 model tends to extend already-complete protein sequences (i.e. it rarely predicts 0 as the next token). I wanted to double-check these results using the API in the browser, which led to the following observation: it never predicts an <|endoftext|> token; it simply shows a warning saying "no text generated." Does this mean that the model has concluded the input sequence is complete?

Thank you!

nferruz

Owner Aug 4, 2023

Hi!
The model terminates sequences by producing an <|endoftext|> token, but the token won't appear in the generated output because special tokens are not displayed unless you choose to show them. Bear in mind that this model tends to generate sequences that are a bit longer than natural ones on average (it likes to speak a lot), so it may continue sequences that are already complete. Other models we've trained later do not show this behavior.
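One way to look at this directly, rather than waiting for generation to stop, is to read the probability the model assigns to <|endoftext|> as the very next token. The snippet below is only a rough sketch: `eos_probability` and `load_next_token_logits` are hypothetical helper names, the repository id `nferruz/ProtGPT2` is an assumption, and the input is passed as a raw string without any special formatting. When decoding generated ids, `tokenizer.decode(ids, skip_special_tokens=False)` keeps special tokens visible.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def eos_probability(next_token_logits, eos_token_id):
    """Probability that the next token is the end-of-sequence token."""
    return softmax(next_token_logits)[eos_token_id]


def load_next_token_logits(sequence, model_name="nferruz/ProtGPT2"):
    """Hypothetical helper: fetch next-token logits from a causal LM.

    Assumes `torch` and `transformers` are installed and that `model_name`
    (an assumption here) points at the checkpoint. Imports are kept local
    so the pure-Python helpers above work without these packages.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    input_ids = tokenizer(sequence, return_tensors="pt").input_ids
    with torch.no_grad():
        # Logits for the position right after the last input token.
        logits = model(input_ids).logits[0, -1]
    return logits.tolist(), tokenizer.eos_token_id


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary where id 0 plays the role of EOS.
    print(eos_probability([2.0, 0.5, 0.5], eos_token_id=0))
```

Since the model tends to run long, a low end-of-sequence probability on its own is probably weak evidence that a sequence is truncated; comparing it against values obtained for known-complete sequences may be more informative.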

On a different note, what do you mean by the API in the browser? Do you mean here on HF? If so, I would not rely on that generation, because it's produced automatically by HF without optimal generation parameters. It also does not filter by perplexity. In any case, the HF API would not show the special token <|endoftext|>, so I guess the behaviour would be 'no text generated', as you said (I have no experience generating from the browser, though, so I'd do it locally instead :)).
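For reference, filtering by perplexity can be sketched as follows. Perplexity here is the exponential of the negative mean per-token log-probability, so lower values mean the model finds the sequence more plausible. This is an illustrative sketch only: the function names, the example sequences, the log-probabilities, and the threshold are all made up, and in practice the per-token log-probabilities would come from the model run locally.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def filter_by_perplexity(candidates, max_perplexity):
    """Keep candidates at or below the threshold, best (lowest) first.

    `candidates` is a list of (sequence, token_log_probs) pairs.
    Returns a list of (sequence, perplexity) pairs.
    """
    scored = [(seq, perplexity(lps)) for seq, lps in candidates]
    kept = [(seq, ppl) for seq, ppl in scored if ppl <= max_perplexity]
    return sorted(kept, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Made-up sequences and log-probabilities, purely for illustration.
    candidates = [
        ("MKTAYIAK", [math.log(0.5)] * 8),
        ("MGGGGGGG", [math.log(0.05)] * 8),
    ]
    # The threshold of 5.0 is arbitrary, not a recommended value.
    for seq, ppl in filter_by_perplexity(candidates, max_perplexity=5.0):
        print(seq, round(ppl, 3))
```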

Hope this helps,
Noelia