@Bill Metangmo - seems like there are a few things that might be helpful.
In terms of frameworks for working with LLMs, I'm sure you've seen/played with LangChain (although you can get quite far just calling the LLMs directly).
In terms of improving the quality of question generation or assessing the quality of content, that's where it can get tricky. Usually, people iterate on a handful of prompts against a small reference dataset until they're happy with the quality of the generations. Generation quality goes up the more clearly you can describe what you want from the LLM, and the more examples you provide (i.e. few-shot prompting). Of course, this just boils down to understanding your problem domain really well.
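To make the few-shot idea concrete, here's a rough sketch of assembling a question-generation prompt from a couple of examples. The passages and questions below are placeholders, not from any real dataset — you'd swap in pairs from your own reference set that reflect the quality you want:

```python
# Sketch: building a few-shot prompt for question generation.
# The example pairs are placeholders -- replace them with passage/question
# pairs from your own reference dataset.

FEW_SHOT_EXAMPLES = [
    {
        "passage": "Water boils at 100 degrees Celsius at sea level.",
        "question": "At what temperature does water boil at sea level?",
    },
    {
        "passage": "The Nile is the longest river in Africa.",
        "question": "Which river is the longest in Africa?",
    },
]

def build_prompt(passage: str) -> str:
    """Assemble an instruction, the few-shot examples, and the new passage."""
    parts = [
        "Write one clear comprehension question for the given passage.",
        "",
    ]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Passage: {ex['passage']}")
        parts.append(f"Question: {ex['question']}")
        parts.append("")
    # End with an open "Question:" so the model completes it.
    parts.append(f"Passage: {passage}")
    parts.append("Question:")
    return "\n".join(parts)

prompt = build_prompt("Photosynthesis converts sunlight into chemical energy.")
print(prompt)
```

You'd then send `prompt` to whatever LLM you're using (directly or via LangChain) — the point is just that the instruction plus a few concrete examples usually beats an instruction alone.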