Can Language Models Understand Visual Concepts? MIT Researchers Think So
You’ve likely heard the saying, "a picture is worth a thousand words," but can a large language model (LLM) grasp visual concepts without ever having seen an image? Surprisingly, the answer is yes. Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have found that language models trained solely on text possess a robust understanding of the visual world. These models can generate image-rendering code to create complex scenes with intriguing objects and compositions. Even when their initial attempts are not perfect, LLMs can refine their images through self-correction.
Visual Knowledge from Text
The visual understanding of these language models stems from how concepts like shapes and colors are described across the internet, whether in text or code. When given an instruction such as "draw a parrot in the jungle," an LLM draws on the many descriptions of such scenes it has read. To evaluate the visual knowledge of LLMs, the CSAIL team created a "vision benchmark" built around a "visual aptitude dataset," which tested the models’ abilities to draw, recognize, and self-correct visual concepts. By collecting the final versions of these illustrations, the researchers trained a computer vision system to identify content in real photos.
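To make those three abilities concrete, here is a minimal sketch of the kind of prompt templates such a benchmark might use. The wording, the DRAW_PROMPT/RECOGNIZE_PROMPT/CORRECT_PROMPT names, and the example concepts are illustrative assumptions, not the actual dataset.

    # Illustrative prompt templates for the three abilities the benchmark probes:
    # drawing, recognizing, and self-correcting visual concepts. The exact wording,
    # variable names, and example concepts are assumptions for exposition, not the
    # dataset itself.

    DRAW_PROMPT = (
        "Write a self-contained Python matplotlib script that draws {concept} "
        "and saves it to out.png. Return only the code."
    )

    RECOGNIZE_PROMPT = (
        "Here is image-rendering code:\n{code}\n"
        "In a few words, what visual concept does it depict?"
    )

    CORRECT_PROMPT = (
        "This script was supposed to draw {concept}:\n{code}\n"
        "Revise the code so the drawing looks more like {concept}. Return only the code."
    )

    if __name__ == "__main__":
        for concept in ["a parrot in the jungle", "a row of bicycles", "a car-shaped cake"]:
            print(DRAW_PROMPT.format(concept=concept))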
Training Without Visual Data
"We are essentially training a vision system without directly using visual data," explains Tamar Rott Shaham, co-lead author of the study and a postdoctoral researcher in Electrical Engineering and Computer Science (EECS) at MIT CSAIL. "Our team queried language models to write image-rendering codes to generate data for us, then trained the vision system to evaluate natural images. We were inspired by the question of how visual concepts are represented through other mediums, like text. To express their visual knowledge, LLMs can use code as a common ground between text and vision."
Generating Visual Data
To create this dataset, the researchers first queried the models to generate code for various shapes, objects, and scenes. They then compiled this code to render simple digital illustrations, such as a row of bicycles, demonstrating that LLMs understand spatial relationships well enough to draw two-wheelers in a horizontal line. Another example included a car-shaped cake, combining two random concepts. The language model also produced a glowing light bulb, indicating its ability to create visual effects.
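For a sense of what such model-written rendering code can look like, below is a hand-written sketch in the spirit of the "row of bicycles" example: wheels drawn as circles and a simple frame, repeated at evenly spaced positions along a horizontal line. It is illustrative only, not output taken from the study.

    # A hand-written example of the kind of rendering code an LLM might emit for
    # "draw a row of bicycles": wheel circles and a crude frame, repeated at
    # evenly spaced positions along a horizontal line. Not output from the study.
    import matplotlib.pyplot as plt
    from matplotlib.patches import Circle

    fig, ax = plt.subplots(figsize=(8, 2))

    for i in range(4):                       # four bicycles side by side
        x0 = i * 3.0                         # left wheel of bicycle i
        for wx in (x0, x0 + 1.2):            # two wheels per bicycle
            ax.add_patch(Circle((wx, 0.5), 0.5, fill=False, lw=2))
        # crude frame: left wheel -> saddle/handlebar -> right wheel
        ax.plot([x0, x0 + 0.6, x0 + 1.2], [0.5, 1.2, 0.5], color="black", lw=2)

    ax.set_xlim(-1, 11)
    ax.set_ylim(-0.2, 1.8)
    ax.set_aspect("equal")
    ax.axis("off")
    plt.savefig("bicycles.png", dpi=150, bbox_inches="tight")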
"Our work shows that when you ask an LLM (without multimodal pre-training) to create an image, it knows much more than it initially appears," says Pratyusha Sharma, co-lead author and a PhD student at EECS and CSAIL. "For instance, if you ask it to draw a chair, the model knows more about this piece of furniture than it might immediately render. Users can then query the model to improve the visual output iteratively. Surprisingly, the model can enrich the drawing significantly by refining the rendering code."
Enhancing Computer Vision Systems
The researchers compiled these illustrations to train a computer vision system capable of recognizing objects in real photos it had never seen before. With this text-generated synthetic data as its sole training source, the system outperformed counterparts trained on other procedurally generated image datasets, including ones built with the help of authentic photos.
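As a rough illustration of that last step, the sketch below trains a small image classifier purely on LLM-rendered images, assuming they have been saved to disk with one folder per concept (llm_renders/<concept>/*.png). The folder layout, the supervised setup, and the ResNet-18 backbone are assumptions for exposition, not the training recipe used in the study.

    # Minimal sketch: train a vision model only on the LLM-rendered images,
    # assumed to be stored as llm_renders/<concept>/*.png. The supervised setup
    # and ResNet-18 backbone are illustrative, not the paper's actual recipe.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, models, transforms

    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    data = datasets.ImageFolder("llm_renders", transform=transform)
    loader = DataLoader(data, batch_size=64, shuffle=True, num_workers=4)

    model = models.resnet18(weights=None)                    # no photo pretraining
    model.fc = nn.Linear(model.fc.in_features, len(data.classes))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(10):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()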
The CSAIL team believes that combining the hidden visual knowledge of LLMs with the artistic capabilities of other AI tools, such as diffusion models, could be beneficial. Systems like Midjourney sometimes lack the finesse to systematically refine the finer details of an image, making it difficult to handle requests like reducing the number of cars in a photo or placing an object behind another. If an LLM sketched the requested change for the diffusion model beforehand, the resulting modification could be more satisfactory.
Challenges and Future Directions
Ironically, as Rott Shaham and Sharma acknowledge, LLMs sometimes fail to recognize the very concepts they can draw. This became evident when the models incorrectly identified human re-creations of images from the dataset; the many different ways the same visual concepts can be depicted likely confused the language models.
While the models struggled to perceive these abstract representations, they showed creativity in drawing the same concepts differently each time. When the researchers repeatedly asked the LLMs to draw concepts like strawberries and arches, the models produced images from various angles with different shapes and colors, suggesting that they may hold genuine mental imagery of visual concepts rather than merely reciting examples they had seen before.
The CSAIL team believes this procedure could serve as a benchmark for evaluating how well a generative AI model can train a computer vision system. Additionally, the researchers aim to expand the tasks on which they challenge the language models. For their recent study, the MIT group notes that they did not have access to the training set of the LLMs they used, making it difficult to further investigate the origins of their visual knowledge. In the future, they plan to explore training an even better vision model by allowing the LLM to work directly with it.
Sharma and Rott Shaham are joined on the paper by former CSAIL affiliate Stephanie Fu ’22, MNG ’23, and EECS PhD students Manel Baradad, Adrián Rodríguez-Muñoz ’22, and Shivam Duggal, all affiliated with CSAIL, as well as MIT Associate Professor Phillip Isola and Professor Antonio Torralba. Their work was supported, in part, by a grant from the MIT-IBM Watson AI Lab, a LaCaixa fellowship, the Zuckerman STEM Leadership Program, and the Viterbi Fellowship. They are presenting their paper this week at the IEEE/CVF Conference on Computer Vision and Pattern Recognition.