# 06-technical-discussion
More thoughts on LLM pricing: gpt-3.5-turbo and gpt-4 launched around the same time. gpt-3.5-turbo offered a huge price reduction from the previous model and gpt-4 represented a huge quality increase from the previous model. Can we make guesses about whether gpt-4 already has the gpt-3.5-turbo style optimizations, and if it doesn't, can we estimate what kind of cost reductions might be coming in a hypothetical future gpt-4-turbo? tl;dr - it looks like gpt-4 already has the "turbo" style optimizations baked in. There will certainly be more optimizations to come, but it doesn't look like there is an "easy" 10x turbo optimization waiting ready to be dropped onto gpt-4. Math in 🧵
gpt-3.5-turbo reportedly has 154B parameters. If those are 4-byte values, it takes about 600GB of memory to hold the model. gpt-3.5-turbo generates about 10 words/second, or about 25 tokens/second, which is roughly 100K tokens/hour. Azure rents 880GB A100 instances for $19/hr short term or $10/hr on a long-term contract. $10/hr divided by 100K tokens/hr implies a GPU cost of $0.10/1K tokens per gpt-3.5-turbo evaluation. The actual billed cost for gpt-3.5-turbo is $0.002/1K tokens.

Going from 3 to 3.5-turbo the time to compute tokens didn't change much, but the cost to evaluate tokens dropped significantly, suggesting they are multiplexing more requests into each evaluation (a SIMD-like approach where you perform a single set of model memory accesses and then do parallel evaluations of multiple input streams on that single pull of model data). $0.10 / $0.002 = 50, suggesting 3.5-turbo is multiplexing 50 requests into each evaluation. 3.5-turbo is a 10x cost improvement over the previous generation, which suggests they increased the multiplexing 10x, from about 5 requests to about 50, when they released gpt-3.5-turbo. (Keep in mind there is a TON of work going on in how to optimize LLM evaluations, and describing this particular optimization as SIMD-like multiplexing is probably a gross oversimplification, but the observation that response times didn't change significantly suggests the amount of compute happening per response didn't change significantly between the models, which points to some kind of multiplexing optimization.)

So what about gpt-4? Most people seem to be of the opinion that gpt-4 is about a 1 trillion parameter model, or about 6x the size of gpt-3.5-turbo. Assuming 4 bytes per parameter, a 1T model requires 4TB of RAM, or 5x 880GB A100 GPUs.
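The gpt-3.5-turbo arithmetic above can be written out as a quick sanity-check script. All the inputs are the rough figures quoted in the thread (reported parameter count, assumed fp32 weights, Azure long-term A100 pricing), not measured values:

```python
# Back-of-envelope check of the gpt-3.5-turbo numbers in the thread.
params = 154e9                  # reported parameter count
bytes_per_param = 4             # assume 4-byte (fp32) weights
model_gb = params * bytes_per_param / 1e9
print(f"model size: {model_gb:.0f} GB")        # ~616 GB, fits one 880GB instance

# 25 tokens/sec x 3600 sec = 90K tokens/hour, rounded to 100K as in the thread
tokens_per_hour = 100_000
gpu_cost_per_hour = 10.0        # Azure 880GB A100, long-term contract rate
gpu_cost_per_1k_tokens = gpu_cost_per_hour / tokens_per_hour * 1000
print(f"GPU cost: ${gpu_cost_per_1k_tokens:.2f}/1K tokens")   # $0.10/1K tokens

billed_per_1k_tokens = 0.002    # actual gpt-3.5-turbo billing
implied_multiplex = gpu_cost_per_1k_tokens / billed_per_1k_tokens
print(f"implied multiplexing: {implied_multiplex:.0f} requests/evaluation")  # ~50
```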
gpt-4 is slower than gpt-3.5-turbo, generating about 4 words per second, or about 10 tokens/second (roughly 2.5x slower than gpt-3.5-turbo's 25 tokens/second, because it has to do more numerical operations per evaluation). For gpt-4 we need 5x 880GB A100s to hold the model in memory, at a cost of 5 x $10/hour = $50/hour. 10 tokens/second = 36K tokens/hour. $50/hour divided by 36K tokens/hour ≈ $1.4/1K tokens per evaluation of gpt-4.

The billing model for gpt-4 is more complicated than the gpt-3.5-turbo billing model; for this calculation I'm going to simplify the gpt-4 billing to "$0.05/1K tokens". Comparing the GPU cost per evaluation against the billed cost: $1.4/1K tokens divided by $0.05/1K tokens ≈ 28, call it 25 requests multiplexed together in each evaluation.

So: gpt-3.5-turbo (which can handle 4K of context) is being multiplexed at 50 requests per evaluation, and gpt-4 (which can handle 8K of context at the pricing I used) is being multiplexed at 25 requests per evaluation. There's a lot of hand-waving and approximation going into those numbers, but taken together these calculations strongly suggest that the current gpt-4 already has the "gpt-3.5-turbo class" optimizations baked in, because 4K context/request x 50 simultaneous requests = 8K context/request x 25 simultaneous requests.

I'll admit I went into this hoping there was an "easy" 10x or so of turbo-style optimization waiting to be applied to gpt-4. The numbers suggest that kind of quick, easy win on gpt-4 is less likely than I had hoped, and that the turbo-style optimization is already in place. That said, there is a ton of work happening all across the industry on how to optimize this kind of computation in both hardware and software, so we'll definitely get cost and speed improvements going forward; they are just likely to be more hard-fought than an "easy" increase in the multiplexing of requests per evaluation.
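The gpt-4 side of the estimate, plus the consistency check that is the punchline of the thread, looks like this in script form. Again, every input is one of the thread's rough estimates (1T parameters, 10 tokens/sec, the simplified "$0.05/1K tokens" billing), not a known figure:

```python
# Same back-of-envelope arithmetic for gpt-4.
params = 1e12                   # commonly guessed parameter count
model_tb = params * 4 / 1e12    # 4 bytes/param -> 4 TB of weights
gpus_needed = 5                 # 5 x 880GB A100 instances to hold 4TB
gpu_cost_per_hour = gpus_needed * 10.0          # $50/hour

tokens_per_hour = 10 * 3600                     # 36K tokens/hour
gpu_cost_per_1k = gpu_cost_per_hour / tokens_per_hour * 1000
print(f"GPU cost: ${gpu_cost_per_1k:.2f}/1K tokens")   # ~$1.4/1K tokens

billed_per_1k = 0.05            # simplified gpt-4 billing from the thread
implied_multiplex = gpu_cost_per_1k / billed_per_1k
print(f"implied multiplexing: {implied_multiplex:.0f}")  # ~28, call it 25

# The punchline: context x multiplexing is the same for both models,
# i.e. each evaluation batch covers ~200K tokens of context either way.
gpt35_budget = 4_000 * 50       # 4K context x 50 requests
gpt4_budget = 8_000 * 25        # 8K context x 25 requests
assert gpt35_budget == gpt4_budget == 200_000
```

That final assertion is the "turbo optimizations are already baked in" argument in one line: both models appear to be working with the same total context budget per evaluation.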
๐Ÿ‘๐Ÿพ 1
๐Ÿ‘ 2
That is a wonderfully thorough analysis. Thank you @Don Alvarez!
Wow. Amazing deep dive @Don Alvarez