# 06-technical-discussion
d
Hi everyone, I need your help. Does anyone know of a platform or research paper that discusses LLM tokenization? I want to evaluate different prompt formats at the tokenization level. For example, I would compare these types of prompts to see which ones an LLM can understand better and which format gives results equivalent to the original prompt.
```
Meet Vevanshu, a 25-year-old student at the University of Texas at Arlington. He's passionate about badminton.

Meet Btakwxfu, a 57-year-old gfovzew at the Dafrrirrja da Eiqfs jn Tamtcqvub. He's also passionate about badminton. Also, meet Whoyein Arcwve
```
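For a concrete starting point, here's a minimal sketch of the kind of comparison I mean, just counting tokens per prompt variant. It assumes the tiktoken library and the cl100k_base encoding (used by the GPT-3.5/GPT-4 family); for a specific model you could use tiktoken.encoding_for_model instead.

```python
# Minimal sketch: compare two prompt variants by token count.
# Assumes tiktoken and the cl100k_base encoding (an assumption on my part).
import tiktoken

prompts = {
    "original": (
        "Meet Vevanshu, a 25-year-old student at the University of Texas "
        "at Arlington. He's passionate about badminton."
    ),
    "perturbed": (
        "Meet Btakwxfu, a 57-year-old gfovzew at the Dafrrirrja da Eiqfs "
        "jn Tamtcqvub. He's also passionate about badminton."
    ),
}

enc = tiktoken.get_encoding("cl100k_base")
for name, text in prompts.items():
    tokens = enc.encode(text)
    print(f"{name}: {len(tokens)} tokens -> first few ids: {tokens[:10]}")
```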
g
Hey, just saw this. Excellent question. I don't think I've encountered any papers on exactly this topic, but here are a couple I found that may be useful: https://arxiv.org/html/2401.06827v1 and https://arxiv.org/html/2403.11537v1. However, I would argue it ultimately depends on the LLM's embeddings, so unless you know the embedding model, you're somewhat comparing apples and oranges across different LLMs. What you could do is:
• Either amass a large amount of data, or generate synthetic data given the embedding model (biased, but it may work for you).
• Compare the results of different prompts through analysis after running them through the LLM (see the sketch below).
If you're looking for something generalizable, I'd look at the papers I sent for inspiration, or at general tokenization techniques. However, I'm less optimistic about that, since each model introduces its own randomness (the same model won't generate the same response every time, e.g. ChatGPT), and each LLM is fine-tuned / trained with its own embeddings in mind. So for LLM A, Prompt A might be better than Prompt B, while for LLM B, Prompt B might be better; it really does depend on the embedding matrix. Anyway, just some thoughts. Curious if you end up finding anything. Great Q
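To make the second bullet concrete, here's a rough sketch of one way to compare responses to two prompt variants via embedding similarity. The `embed()` function is a toy stand-in and the response strings are made-up placeholders; in practice you'd call a real embedding model and use actual outputs from your LLM.

```python
# Sketch: given responses from the same LLM to two prompt variants,
# compare them with cosine similarity over embeddings.
# embed() is a toy stand-in (hashed bag-of-words); replace it with a real
# embedding model. The response strings below are placeholders, not real output.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in embedding: hashed bag-of-words."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Placeholder responses gathered for the original and perturbed prompts.
response_original = "Vevanshu is a 25-year-old badminton enthusiast studying in Arlington."
response_perturbed = "Btakwxfu is a 57-year-old who is passionate about badminton."

score = cosine(embed(response_original), embed(response_perturbed))
print(f"response similarity: {score:.3f}")
```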
d
Thank you for your response. It is indeed an interesting problem. Currently, I am focusing on GPT models. From this exercise, I've gained two insights:
1. First, we should use a prompt format that consumes fewer tokens, as this helps preserve the context-length window.
2. Second, we should choose a format whose words are already in the tokenizer's vocabulary, making it easier for the language model to understand.
In both insights, the goal is to minimize the number of tokens used in the prompt. Of course, we need to test these outcomes to be certain, but based on our findings, this is a reasonable conclusion.
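As a quick sanity check on the second insight, individual in-vocabulary words often map to a single token, while made-up strings get split into several. A small sketch, again assuming tiktoken with the cl100k_base encoding:

```python
# Sketch: per-word token counts for in-vocabulary words vs. made-up strings.
# Assumes the tiktoken library and the cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["student", "university", "gfovzew", "Btakwxfu"]:
    ids = enc.encode(word)
    print(f"{word!r}: {len(ids)} token(s) -> {ids}")
```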
p
https://github.com/boundaryml/baml might be a great resource