I read the paper and thought about it last night. This paper sets off my BS radar... I’m hoping someone can help me understand it better. So here’s my thinking.
Let’s presume I have a 4096-token context limit. I then allocate 1024 of those tokens to a memory space, which I update as I parse/generate my sequence.
It seems to me that the interesting part of the paper is the strategy for managing that memory so it produces the required outcome.
If I have a particularly distinctive data point, like a note about a vivid purple cat in a sea of information about shades of gray, then perhaps I get lucky and that data point persists across naive recursive steps (compress the information at this step, then recurse).
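To make the purple cat case concrete, here is a rough sketch of the naive loop I have in mind. Everything in it is my own assumption rather than anything from the paper: tokens are approximated by whitespace-separated words, the memory is a list of short notes, and `compress` is a hypothetical rule that keeps whichever notes contain the rarest words seen so far.

```python
from collections import Counter

CONTEXT_LIMIT = 4096   # total token budget from the example above
MEMORY_BUDGET = 1024   # tokens reserved for the memory space
CHUNK_BUDGET = CONTEXT_LIMIT - MEMORY_BUDGET  # what is left for new input per step


def n_tokens(text):
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def compress(notes, budget, counts):
    """Keep the most unusual notes that fit in the budget (hypothetical rule).

    A note's score is the average rarity (1 / count) of its words, so a note
    made of rarely seen words outranks notes that repeat common ones.
    """
    def score(note):
        words = note.lower().split()
        if not words:
            return 0.0
        return sum(1.0 / counts[w] for w in words) / len(words)

    kept, used = [], 0
    for note in sorted(notes, key=score, reverse=True):
        cost = n_tokens(note)
        if used + cost <= budget:
            kept.append(note)
            used += cost
    return kept


def run(chunks, budget=MEMORY_BUDGET):
    """Naive recursion: merge old memory with the new chunk, compress, repeat."""
    memory, counts = [], Counter()
    for chunk in chunks:
        for note in chunk:
            counts.update(note.lower().split())
        memory = compress(memory + chunk, budget, counts)
    return memory


chunks = [
    ["the wall is light gray", "the floor is dark gray", "a vivid purple cat sat here"],
    ["the sky is pale gray", "the road is slate gray"],
    ["the roof is ash gray", "the door is dull gray"],
]
memory = run(chunks, budget=12)  # tiny budget so the compression actually bites
```

With this toy input the purple cat note survives every step, but only because its words are rare. A fact that matters for the task yet reads like everything around it would be dropped by the same rule, which is exactly the part I'd like to see addressed.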
I don’t know if I’m too naive and simply missed it, but I didn’t get the sense that the paper discussed how the memory might be tuned to the task. If I have real data, with real information density, how would I actually put this to use?
Take the example of a student: the student reads a long body of text and, in their head, generalizes the content into a running outline. On paper, they take notes that act as an “index” of highly relevant references.
When the student writes the paper, they use that outline and sample their notes to come up with a hypothetical theme or point for their paper. Then they build an outline and gather resources to construct it.
This sequence describes a hierarchy of working memory, much like the memory layers in a CPU architecture. Could memory for LLMs have a similar design? What would the memory management coprocessor look like?
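As a toy illustration of what I mean by a hierarchy, here is a minimal sketch that borrows the CPU cache idea directly. It is purely my own assumption, not a design from the paper: a tiny "outline" tier and a larger "notes" tier sit in front of the full source text, recalled items are promoted to the outline, and the least recently used outline entry is demoted to the notes. The class name `TieredMemory` and its parameters are hypothetical, and sizes are counted in entries rather than tokens to keep it short.

```python
from collections import OrderedDict


class TieredMemory:
    """Two small LRU tiers in front of a full backing store (hypothetical)."""

    def __init__(self, source, outline_size=2, notes_size=4):
        self.source = dict(source)    # everything that was read: large, slow
        self.notes = OrderedDict()    # the "index" of relevant references
        self.outline = OrderedDict()  # the running outline: tiny, fast
        self.outline_size = outline_size
        self.notes_size = notes_size

    def _put(self, tier, size, key, value):
        # Insert as most recently used; return the evicted (key, value), if any.
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > size:
            return tier.popitem(last=False)
        return None

    def recall(self, key):
        """Return (value, tier it was found in) and promote it to the outline."""
        if key in self.outline:
            self.outline.move_to_end(key)
            return self.outline[key], "outline"
        if key in self.notes:
            value, tier = self.notes.pop(key), "notes"
        else:
            value, tier = self.source[key], "source"
        evicted = self._put(self.outline, self.outline_size, key, value)
        if evicted is not None:
            # Demote the least recently used outline entry to the notes tier.
            self._put(self.notes, self.notes_size, *evicted)
        return value, tier


mem = TieredMemory({"a": 1, "b": 2, "c": 3})
first = [mem.recall(k) for k in ("a", "b", "c")]  # all come from the source
again = mem.recall("a")                           # "a" was demoted to notes
```

The eviction policy here is plain LRU, which is the weakest part of the analogy: the interesting question is what a task-aware policy (the "coprocessor") would use instead of recency.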