Despite their impressive capabilities, large language models (LLMs) are not without flaws. These AI models sometimes "hallucinate," producing incorrect or unsupported information in response to queries.
Due to this hallucination issue, LLM responses often require verification by human fact-checkers, especially when deployed in high-stakes areas like healthcare or finance. However, the validation process typically involves reading lengthy documents cited by the model, a task so cumbersome and error-prone that it can deter some users from deploying generative AI models in the first place.
To assist human validators, MIT researchers have developed a user-friendly system that enables quicker verification of LLM responses. This tool, named SymGen, allows an LLM to generate responses with citations that directly point to specific locations in a source document, such as a particular cell in a database.
Users can hover over highlighted sections of the text response to view the data the model used to generate that specific word or phrase. At the same time, non-highlighted sections show users which phrases need additional checking.
"We empower people to selectively focus on the parts of the text that concern them the most. Ultimately, SymGen can instill greater confidence in a model’s responses, as users can easily scrutinize them to ensure the information is verified," says Shannon Shen, a graduate student in electrical engineering and computer science and co-lead author of a paper on SymGen.
Through a user study, Shen and colleagues found that SymGen reduced verification time by about 20% compared to manual procedures. By enabling quicker and easier validation of model outputs, SymGen could help users identify errors in LLMs deployed in various real-world scenarios, from generating clinical notes to synthesizing financial market reports.
Shen co-authored the paper with Lucas Torroba Hennigen, co-lead author and EECS graduate student; Aniruddha "Ani" Nrusimha, EECS graduate student; Bernhard Gapp, president of the Good Data Initiative; and senior authors David Sontag, EECS professor, MIT Jameel Clinic member, and head of the Clinical Machine Learning Group at CSAIL, and Yoon Kim, assistant professor of EECS and CSAIL member. The research was recently presented at the Conference on Language Modeling.
Symbolic References
To facilitate validation, many LLMs are designed to generate citations pointing to external documents alongside their language-based responses, allowing users to verify them. However, these verification systems are often designed as an afterthought, without considering the effort required for users to sift through numerous citations, Shen explains.
"Generative AI aims to reduce the time it takes for a user to complete a task. If you have to spend hours reading all these documents to verify that the model is saying something reasonable, then it’s less useful to put the generations into practice," Shen says.
The researchers approached the validation problem from the perspective of the humans performing the work.
A SymGen user first provides the LLM with data it can reference in its response, such as a table containing basketball game statistics. Then, instead of immediately asking the model to perform a task, like generating a game summary from this data, the researchers introduce an intermediate step. They prompt the model to generate its response in a symbolic form.
With this prompt, whenever the model wants to cite words in its response, it must write the specific cell from the data table containing the referenced information. For example, if the model wants to cite "Portland Trailblazers" in its response, it replaces this text with the name of the data table cell containing those words.
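To make this concrete, here is a minimal sketch of what such a symbolic response might look like. The double-brace placeholder syntax, the game_stats table, and its field names are illustrative assumptions for this article, not SymGen's actual notation.

```python
# Illustrative sketch only: placeholder syntax and field names are assumptions.

# Source data the LLM is given to reference (e.g., one row of game statistics).
game_stats = {
    "home_team": "Portland Trailblazers",
    "away_team": "Los Angeles Lakers",
    "home_score": 110,
    "away_score": 102,
}

# Instead of copying values into its answer, the model cites the table cell
# that contains each piece of referenced information.
symbolic_response = (
    "The {{home_team}} beat the {{away_team}} {{home_score}} to {{away_score}}."
)
```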
"Through this intermediate step of presenting the text in a symbolic format, we can have very fine-grained references. We can say that for every stretch of text in the output, this is exactly where in the data it corresponds," Torroba Hennigen explains.
SymGen then resolves each reference using a rule-based tool that copies the corresponding text from the data table into the model’s response.
"In this way, we know it’s a textual copy, so we know there won’t be any errors in the part of the text that corresponds to the actual data variable," Shen adds.
Streamlining Validation
The model can create symbolic responses due to its training method. Large language models are fed vast amounts of data from the internet, some of which is recorded in a "placeholder format" where codes replace actual values.
When SymGen prompts the model to generate a symbolic response, it uses a similar structure.
"We design the prompt in a specific way to leverage the LLM’s capabilities," Shen adds.
In a user study, most participants reported that SymGen made it easier to verify LLM-generated text. They could validate model responses about 20% faster than using standard methods.
However, SymGen is limited by the quality of the source data. The LLM might cite an incorrect variable, and a human validator might not notice.
Additionally, the user must have source data in a structured format, like a table, to feed into SymGen. Currently, the system only works with tabular data.
In the future, researchers aim to enhance SymGen to handle arbitrary text and other data forms. With this capability, it could help validate parts of AI-generated legal document summaries, for example. They also plan to test SymGen with physicians to explore how it might identify errors in AI-generated clinical summaries.
This work is funded, in part, by Liberty Mutual and the MIT Quest for Intelligence Initiative.