Hey, just saw this.
Excellent question. I don't think I've come across any papers on exactly this topic, but here are a couple I found that might be useful:
https://arxiv.org/html/2401.06827v1
https://arxiv.org/html/2403.11537v1
However, I'd also argue it comes down to the model's embeddings at the end of the day, so unless you know the embedding model, you're kind of comparing apples and oranges across different LLMs.
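If you do know (or can choose) the embedding model, you can at least measure how close two prompt variants sit in that space as a proxy. Rough sketch below; the model name is just an example, and I'm using an open sentence-embedding model as a stand-in, since you usually can't get at a proprietary LLM's actual embedding matrix:

```python
# Sketch: comparing two prompts in a *known* embedding space.
# Assumes sentence-transformers is installed; "all-MiniLM-L6-v2" is just an example model.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")

prompt_a = "Summarize the following article in three bullet points:"
prompt_b = "Give me a three-point TL;DR of this article:"

vecs = embedder.encode([prompt_a, prompt_b])
cos_sim = np.dot(vecs[0], vecs[1]) / (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1]))
print(f"cosine similarity between prompts: {cos_sim:.3f}")
```

That only tells you how similar the prompts look to *that* embedder, not how a given LLM will behave, but it's a cheap first signal.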
Beyond that, what you could do is:
• Either amass a ton of data, or generate synthetic data with the embedding model (biased, but it may work for you)
• Compare the results of different prompts by running them through the LLM and analyzing the outputs (rough sketch below)
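For that second bullet, here's roughly what I mean. `call_llm` is a placeholder for whatever client you're actually using, and the scoring is deliberately dumb (exact-match against references), so treat it as a skeleton, not a recipe:

```python
# Minimal sketch of comparing prompt variants on a small eval set.
# `call_llm` is a stand-in for your real client (OpenAI, Anthropic, a local model, ...).
from collections import defaultdict

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

eval_set = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

prompt_templates = {
    "terse": "Answer with a single word or number: {input}",
    "polite": "Please answer the following as briefly as possible: {input}",
}

scores = defaultdict(int)
for name, template in prompt_templates.items():
    for ex in eval_set:
        answer = call_llm(template.format(input=ex["input"]))
        scores[name] += int(ex["expected"].lower() in answer.lower())

for name, score in scores.items():
    print(f"{name}: {score}/{len(eval_set)}")
```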
If you're looking for something generalizable, I'd look at the papers I sent for inspiration, or at generalizable tokenization techniques. I'm less optimistic about that, though, since each model has its own sampling randomness (the same model won't even give the same response twice, e.g. ChatGPT), and each LLM is fine-tuned / trained with its own embeddings in mind. So for LLM A, Prompt A might beat Prompt B, while for LLM B, Prompt B might win; it really does come down to the embedding matrix.
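If you want to sanity-check that "the ranking flips per model" effect empirically, you'd just extend the harness above to a model × prompt grid. Again, `call_llm_for` is hypothetical; wire it up to whichever clients you have:

```python
# Sketch of a model x prompt grid to check whether prompt rankings flip across models.
# `call_llm_for(model, prompt)` is a hypothetical helper, not a real API.
def call_llm_for(model: str, prompt: str) -> str:
    raise NotImplementedError("route to the right client per model here")

models = ["llm_a", "llm_b"]
prompts = {"A": "Answer briefly: {input}", "B": "Think step by step, then answer: {input}"}
eval_set = [{"input": "2 + 2", "expected": "4"}]

for model in models:
    results = {}
    for name, template in prompts.items():
        hits = sum(
            ex["expected"] in call_llm_for(model, template.format(input=ex["input"]))
            for ex in eval_set
        )
        results[name] = hits
    best = max(results, key=results.get)
    print(f"{model}: {results} -> best prompt: {best}")
```

If the "best prompt" differs between models on the same eval set, that's the apples-and-oranges problem showing up directly.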
Anyway, just some thoughts. Curious if you end up finding anything. Great Q!