Inference seems to be very slow on A100 even when flash_attn is enabled
#7 · opened by boydcheung
Could you help test the latency/inference speed of this 2B model?
Any suggestions on what might be causing the problem? I've used the same version of transformers as specified in the model card for inference.
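In case it helps others reproduce, here is a minimal latency-measurement sketch. The model id is a placeholder, and it assumes a recent transformers version that accepts `attn_implementation="flash_attention_2"` (older releases used `use_flash_attention_2=True` instead):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder -- substitute the actual 2B checkpoint under discussion.
model_id = "your-org/your-2b-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # FlashAttention-2 requires fp16/bf16
    attn_implementation="flash_attention_2",
).to("cuda")
model.eval()

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

# Warm-up pass so one-time CUDA initialization doesn't skew the measurement.
with torch.inference_mode():
    model.generate(**inputs, max_new_tokens=32)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tok/s")
```

It may also be worth confirming which attention backend is actually active (e.g. by checking `model.config._attn_implementation`), since a silent fallback to eager attention is a common cause of unexpectedly slow inference.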