Question for people using multi-modal LLMs with image inputs: what are some best practices for optimizing images as model inputs so you can 1) reduce the token count but 2) maintain response quality? I'm finding that my GPT-4o bill explodes once I start embedding base64-encoded images 🙂
Any known research or people working in this area?
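For context, here's a rough sketch of the per-image token cost model I believe GPT-4o uses (the 85/170-token figures and the 2048/768/512 resize rules are from OpenAI's vision pricing docs; treat this as an approximation, not an official calculator):

```python
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Rough GPT-4o vision token estimate, per OpenAI's documented cost model."""
    if detail == "low":
        return 85  # low-detail mode is a flat cost regardless of image size

    # High detail: the image is first scaled to fit within 2048x2048...
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # ...then scaled so the shortest side is at most 768
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # The model then processes 512x512 tiles: 170 tokens per tile + 85 base
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(estimate_image_tokens(1024, 1024))           # high detail
print(estimate_image_tokens(1024, 1024, "low"))    # low detail
```

If this model is right, the cheap wins are passing `detail: "low"` when fine detail doesn't matter, and downscaling before encoding so the image covers fewer 512px tiles: a huge photo and a 768px version can cost very different amounts with no change in the prompt.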