Inference seems to be very slow on A100 even when flash_attn is enabled
#7 · opened by boydcheung
Could you help test the latency/inference speed of this 2B model?
Any suggestions on what might be causing the problem? I've used the same version of transformers as specified in the model card for inference.
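In case it helps others reproduce, here is a minimal latency-measurement sketch. The model id is a placeholder, and it assumes a recent transformers version that accepts `attn_implementation="flash_attention_2"` (older releases used `use_flash_attention_2=True` instead):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder -- substitute the actual 2B checkpoint under discussion.
model_id = "your-org/your-2b-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # FlashAttention-2 requires fp16/bf16
    attn_implementation="flash_attention_2",
).to("cuda")
model.eval()

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

# Warm-up pass so one-time CUDA initialization doesn't skew the measurement.
with torch.inference_mode():
    model.generate(**inputs, max_new_tokens=32)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tok/s")
```

It may also be worth confirming which attention backend is actually active (e.g. by checking `model.config._attn_implementation`), since a silent fallback to eager attention is a common cause of unexpectedly slow inference.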