# 06-technical-discussion
s
Hi, I'm really curious about options for LLMs for low-latency / high-throughput use cases - are there any good resources for these kinds of inference metrics on smaller models (such as Phi, Orca, or similar) or on specific hardware?
g
I think https://groq.com/ is a good example of the latency trade-off, but they trade it off against the number of requests per minute (though they're updating that soon). As for metrics, throughput is usually reported in tokens/s - and smaller models are much faster. However, their accuracy is significantly reduced by comparison. Pretty much what you'd expect 🙂
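If you want to get rough numbers on your own hardware, here's a minimal sketch of measuring end-to-end generation time and tokens/s with the Hugging Face transformers API - the model ID, prompt, and generation length are just placeholders, and it measures total generation time rather than time-to-first-token:

```python
# Minimal sketch: time one generation call and report tokens/s.
# Assumes transformers, torch, and accelerate are installed; the model ID is illustrative.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # placeholder: any small causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain the difference between latency and throughput in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Time the full generation (prompt processing + decoding).
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

# Throughput = newly generated tokens divided by wall-clock time.
new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s")
```

Running the same script against a couple of small models is a quick way to compare them on the hardware you actually care about.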