Inference seems to be very slow on A100 even when flash_attn is enabled

#7
by boydcheung - opened

Could you help test the latency/inference speed of this 2B model?

Any suggestions on what might be causing the problem? I've used the same version of transformers as in the model card for inference.
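In case it helps pin down where the time goes, here is the kind of minimal timing harness I'd use to measure per-call latency. The `benchmark` helper is my own sketch (not from the model card), and the `model.generate` call in the comment is only an illustration of how you'd plug in the actual model:

```python
import statistics
import time

def benchmark(fn, warmup=3, runs=10):
    """Time a callable: a few warmup iterations first (to exclude
    compilation/caching effects), then report per-run latency stats."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        # NOTE: for CUDA workloads, call torch.cuda.synchronize() here
        # before reading the clock, otherwise you only time kernel launch.
        times.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(times),
        "p50_s": statistics.median(times),
        "min_s": min(times),
    }

# Cheap stand-in workload; for the real model you would pass e.g.
#   lambda: model.generate(**inputs, max_new_tokens=64)
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

Comparing the numbers with `attn_implementation="flash_attention_2"` versus the default attention (same prompt, same `max_new_tokens`) should show whether flash_attn is actually being picked up.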
