# 06-technical-discussion
c
Question for people using multi-modal LLMs with image inputs: what are some best practices for optimizing images as inputs to the model so you can 1) reduce the number of tokens but 2) maintain response quality? I'm finding that my GPT-4o bill explodes when I start embedding base64-encoded images 🙂 Any known research or people working in this area?
And yes, I tried reducing the size of the image... but I'm kind of curious what other tricks people have attempted
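For reference, this is roughly the kind of preprocessing I've been trying (a minimal sketch assuming Pillow and the OpenAI Python SDK; the file name and the `max_side`/`quality` values are just placeholders I picked):

```python
import base64
import io

from openai import OpenAI
from PIL import Image


def shrink_and_encode(path: str, max_side: int = 512, quality: int = 80) -> str:
    """Downscale so the longest side is <= max_side, re-encode as JPEG,
    and return a base64 string suitable for a data URL."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))  # in-place, keeps aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return base64.b64encode(buf.getvalue()).decode("utf-8")


client = OpenAI()
b64 = shrink_and_encode("chart.png")  # placeholder path
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this chart show?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{b64}",
                    # low-detail mode caps each image at a small fixed token budget
                    "detail": "low",
                },
            },
        ],
    }],
)
print(resp.choices[0].message.content)
```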
h
ViTs split images into patches, and the patch dimensions can't be changed once the model is trained. Here's a notebook that might be helpful: https://github.com/lanl/vision_transformers_explained/blob/main/notebooks/VisionTransformersExplained.ipynb
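A rough back-of-envelope sketch of why resolution drives token count (assuming one token per non-overlapping patch and a 14-pixel patch size, which is common for CLIP-style ViTs but varies by model):

```python
import math


def patch_token_count(width: int, height: int, patch_size: int = 14) -> int:
    """Rough estimate: one token per non-overlapping patch_size x patch_size patch.
    Real pipelines may resize or tile the image first, so treat this as intuition,
    not an exact bill."""
    return math.ceil(width / patch_size) * math.ceil(height / patch_size)


# Halving each side roughly quarters the patch (token) count:
for side in (1024, 512, 256):
    print(side, patch_token_count(side, side))
```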
a
I'm just chiming in to say I see you and the struggle is real.