# 06-technical-discussion
s
Hi, I'm really curious about options for LLMs for low-latency / high-throughput use cases - are there any good resources for these kinds of inference metrics on smaller models (such as Phi, Orca, or similar) or on specific hardware?
g
I think https://groq.com/ is a good example of the latency trade-off, but they trade it off against the number of requests per minute (though they're updating that soon). As for metrics, throughput is usually reported in tokens/s - and smaller models are much faster. However, their accuracy is significantly reduced by comparison. Pretty much what you'd expect 🙂
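If you want to get rough numbers on your own hardware, here's a minimal sketch of measuring end-to-end generation time and tokens/s with the Hugging Face transformers API - the model ID, prompt, and generation length are just placeholders, and it measures total generation time rather than time-to-first-token:

```python
# Minimal sketch: time one generation call and report tokens/s.
# Assumes transformers, torch, and accelerate are installed; the model ID is illustrative.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # placeholder: any small causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain the difference between latency and throughput in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Time the full generation (prompt processing + decoding).
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

# Throughput = newly generated tokens divided by wall-clock time.
new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s")
```

Running the same script against a couple of small models is a quick way to compare them on the hardware you actually care about.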