I was talking to a colleague about their exploration of chat as a data source and how noisy and low-quality the terse format makes the data, especially compared with denser sources such as an internal documentation page or PDF.
Clearly the latter is more structured. I have a theory that an LLM can be trained to anchor sparse chat data to a larger, denser corpus, thereby both improving the fidelity of the overall corpus and providing a means of explainability for cogniq's generations.
Does this theory have any merit? How could it be tested?
06/15/2023, 5:01 PM
I'm not a specialist in this area, but training on LLM output is definitely both possible to do well (e.g., Microsoft's Orca model) and hard to do well.
On a hand-wavy note, I recall reading a paper arguing that the "wisdom of the crowds" does not apply to LLMs: if you ask 1,000 people how many jellybeans are in a jar, their guesses average out to roughly the right answer, but if you give 1,000 variations of a question to an LLM, its answers cluster around a few wrong local minima.
06/15/2023, 8:14 PM
What do you mean by "trained to"? Is the goal to fine-tune an LLM to be more effective at extracting semantic meaning from sparse chat data?
06/16/2023, 2:24 AM
Yes, that's the goal. The approach is what I'm wrestling with.
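One cheap way to probe whether the theory has merit before any fine-tuning: anchor each chat message to its nearest passage in the dense corpus and check whether the links are meaningful. The sketch below uses bag-of-words cosine similarity as a stand-in for a real embedding model; the function names and the threshold value are hypothetical, not part of any existing system.

```python
import math
from collections import Counter

def vectorize(text):
    # Naive term-frequency vector; a real test would swap in an
    # embedding model, but the anchoring logic stays the same.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def anchor(chat_messages, corpus_passages, threshold=0.2):
    """Link each chat message to its best-matching dense-corpus passage.

    Returns (message, passage_index_or_None, score) tuples; messages
    whose best match falls below the threshold stay unanchored, which
    is itself a useful signal about how much of the chat is noise.
    """
    passage_vecs = [vectorize(p) for p in corpus_passages]
    results = []
    for msg in chat_messages:
        mv = vectorize(msg)
        scores = [cosine(mv, pv) for pv in passage_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            results.append((msg, best, scores[best]))
        else:
            results.append((msg, None, 0.0))
    return results
```

If human raters agree with the anchors at a useful rate, that supports the fidelity claim, and the (message, passage) links double as the explainability trail; fine-tuning a model to produce the anchors directly would be the next step.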