# 06-technical-discussion
c
Question for people using multi-modal LLMs with image inputs: what are some best practices for optimizing images as inputs to the model so you can 1) reduce the number of tokens but 2) maintain response quality? I'm finding that my GPT-4o bill explodes when I start embedding base64-encoded images 🙂 Any known research or people working in this area?
And yes, I tried reducing the size of the image... but I'm kind of curious what other tricks people have attempted
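For reference, this is roughly the kind of preprocessing I've been trying (a minimal sketch assuming Pillow and the OpenAI Python SDK; the file name and the `max_side`/`quality` values are just placeholders I picked):

```python
import base64
import io

from openai import OpenAI
from PIL import Image


def shrink_and_encode(path: str, max_side: int = 512, quality: int = 80) -> str:
    """Downscale so the longest side is <= max_side, re-encode as JPEG,
    and return a base64 string suitable for a data URL."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))  # in-place, keeps aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return base64.b64encode(buf.getvalue()).decode("utf-8")


client = OpenAI()
b64 = shrink_and_encode("chart.png")  # placeholder path
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this chart show?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{b64}",
                    # low-detail mode caps each image at a small fixed token budget
                    "detail": "low",
                },
            },
        ],
    }],
)
print(resp.choices[0].message.content)
```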
h
ViTs split images into patches, and the patch dimensions can't be changed once the model is trained. Here's a notebook that might be helpful: https://github.com/lanl/vision_transformers_explained/blob/main/notebooks/VisionTransformersExplained.ipynb
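A rough back-of-envelope sketch of why resolution drives token count (assuming one token per non-overlapping patch and a 14-pixel patch size, which is common for CLIP-style ViTs but varies by model):

```python
import math


def patch_token_count(width: int, height: int, patch_size: int = 14) -> int:
    """Rough estimate: one token per non-overlapping patch_size x patch_size patch.
    Real pipelines may resize or tile the image first, so treat this as intuition,
    not an exact bill."""
    return math.ceil(width / patch_size) * math.ceil(height / patch_size)


# Halving each side roughly quarters the patch (token) count:
for side in (1024, 512, 256):
    print(side, patch_token_count(side, side))
```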
a
I'm just chiming in to say I see you and the struggle is real.