I was talking to a colleague about their exploration of chat as a data source, and how noisy and low-quality the terse chat format makes the data. This is especially true compared with denser sources, such as an internal documentation page or a PDF.
Clearly the latter is more structured. My theory is that an LLM could be trained to anchor sparse chat data to a larger, denser corpus, thereby both improving the fidelity of the overall corpus and providing a means of explainability for cogniq's generations.
Does this theory have any merit? How could it be tested?
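As a first sanity check before training anything, the anchoring step could be probed with a cheap retrieval baseline: embed each chat message and each corpus passage, link a message to its best-matching passage above a similarity threshold, and then manually judge whether the links are sensible. The sketch below is purely illustrative, it uses a toy bag-of-words cosine similarity and made-up data rather than a learned model, but the same harness works if you swap in real embeddings:

```python
# Toy sketch of "anchoring" terse chat messages to a denser corpus.
# The similarity function (bag-of-words cosine) and all data here are
# illustrative assumptions, not part of any real system.
import math
import re
from collections import Counter

def vectorize(text):
    """Lowercased bag-of-words term counts, punctuation stripped."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def anchor(chat_messages, passages, threshold=0.2):
    """Link each chat message to the index of its best-matching
    passage, or None if nothing clears the threshold."""
    passage_vecs = [vectorize(p) for p in passages]
    links = []
    for msg in chat_messages:
        v = vectorize(msg)
        scores = [cosine(v, pv) for pv in passage_vecs]
        best = max(range(len(scores)), key=lambda i: scores[i])
        links.append(best if scores[best] >= threshold else None)
    return links

passages = [
    "The deployment pipeline builds the container image and pushes it to the registry.",
    "Quarterly revenue figures are compiled by the finance team each January.",
]
chats = ["pipeline failed again, image push to registry timed out", "lunch?"]
print(anchor(chats, passages))  # → [0, None]
```

A test of the theory would then compare this baseline against a fine-tuned model on a hand-labelled set of chat-to-document links: if the trained model's links agree with human judgments substantially more often than the baseline's, the anchoring claim has empirical support, and the links themselves double as the explainability trail.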