Using GPT-4 with Vision as an Art Critic

How OpenAI’s latest model can provide insights into visual arts

**minimalist painting of an AI art critic**, Image created using an AI image creation program, Midjourney, and edited by the author

About two weeks ago, OpenAI started rolling out access to their GPT-4 model with vision (GPT-4V) [1]. As a subscriber to ChatGPT Plus, which costs US$20 per month, I got early access to the new model. The system is available here: chat.openai.com.

With GPT-4V, I can submit a text prompt with up to four images. The system will analyze the prompt and images and respond accordingly. Since I have been writing about the creative uses of AI and ML, my first instinct was to see if GPT-4V can effectively function as an art critic.

Here is my first experiment. I had the system review an acrylic painting of a New England landscape that I made back in 2002 with the help of a reference image, some custom software, and an overhead projector. I took a photo of my painting and sent it to GPT-4V for review.

***ChatGPT User Interface*** *(left), **Warren Park in Autumn** (right), Screenshot and Painting by Author*

Here’s my prompt, followed by GPT-4V’s review.

RobG: Based on the photo of this painting, write a review with constructive criticism. Warren Park in Autumn Robert A. Gonsalves Acrylic on Canvas Board 24” x 32”

GPT-4V: “Warren Park in Autumn” by Robert A. Gonsalves is a mesmerizing depiction of nature’s beauty in its prime. The artist’s choice of vibrant autumnal colors captures the essence of the season, making the scene come alive with fiery oranges and rich browns. The pointillist technique employed lends a dreamy, almost ethereal quality to the piece, allowing viewers to feel as if they are being engulfed in a whirlwind of leaves.

Constructive Criticism:

1. While the pointillism creates a captivating texture, a bit more variation in dot sizes might add further depth and realism to the foliage.

2. The sky, although beautifully rendered with its cool blues, could benefit from a smoother transition or gradient to give it a more natural feel.

3. The signature, “Robert Arthur 11/02”, while important, appears slightly prominent. It might be more aesthetically pleasing if it were integrated more subtly into the painting.

Overall, this artwork is a brilliant representation of autumn, capturing the viewer’s attention and transporting them to the heart of Warren Park.

The first part is quite flattering. I’ve noticed that ChatGPT generally offers upbeat responses. It pointed out some objective characteristics of my painting, like the pointillist style and autumnal colors. And I liked the enthusiasm when it said the painting gave viewers the feeling of “being engulfed in a whirlwind of leaves.”

***Details of Warren Park in Autumn, Foliage*** *(left), **Sky** (middle), **and Signature** (right), Images by Author*

In my prompt, I asked for some “constructive criticism,” and it provided some. The note about the dot sizes was fair, but suggesting a smooth gradient in the sky seemed off, as pointillist paintings are typically made with only dots of colors. However, I was impressed with the system’s ability to read my signature and date since I chose colors with barely any contrast. And, yes, my signature is unusually large for the painting.

In this article, I will provide some background on the GPT-4V model and see how well it works when reviewing other forms of visual art, including multimedia installations and AI-generated art.

The GPT-4V Model

In March 2023, OpenAI launched GPT-4, with the non-vision variant offered to ChatGPT Plus users. Here’s what the authors said about the vision feature.

GPT-4 accepts prompts consisting of both images and text, which — parallel to the text-only setting — lets the user specify any vision or language task. Specifically, the model generates text outputs given inputs consisting of arbitrarily interlaced text and images. Over a range of domains — including documents with text and photographs, diagrams, or screenshots — GPT-4 exhibits similar capabilities as it does on text-only inputs. - OpenAI [1]

Model Size and Architecture

Unlike how OpenAI released previous models, it did not disclose much information about the size and architecture of its GPT-4 models. However, there are leaked reports that GPT-4 has about 1.8 trillion parameters. It uses a mixture of experts (MoE) system, with 16 experts within its model, each with about 111B parameters. For reference, the earlier GPT-3 models use 175 billion parameters [3].

Training Process

OpenAI did not disclose much about what data they used to train the GPT-4 models. On their research site, they said that their GPT-4 base model was trained with a corpus of public and licensed data, encompassing diverse reasoning abilities, statements, and ideologies. To ensure the model’s responses align with user intent, they performed fine-tuning via Reinforcement Learning with Human Feedback (RLHF).

Experimental Results

To test the models, OpenAI ran a diverse set of benchmarks, including simulating exams initially designed for humans. They did not run specific training for these exams. The models saw a minority of the problems in the exams during training; for each exam, they ran a variant with these questions removed and reported the lower score [1]. Here are some of the results.

**Test Results of GPT Models**, Data from OpenAI, Graph by Author

You can see that the GPT-4 models outperformed the GPT-3.5 model significantly for most of the tests. It’s interesting to see how GPT-4V did better than the standard GPT-4 on some tests, like the GRE Quantitative test. From their paper, I read that the researchers at OpenAI transcribed any images and diagrams in the tests as objectively as possible to be used for both the vision and no-vision models [1]. However, they didn’t offer any insights as to why GPT-4 with vision did better on tests than the standard GPT-4 model. I hypothesize that GPT-4V was better equipped to answer these “word” problems that had descriptions of images.

Limitations of the Models

The authors describe the shortcomings of GPT-4 and GPT-4V. They report that the models, despite their advancements, share some weaknesses with their predecessors: they aren’t always reliable, can produce “hallucinations,” have a restricted context window, and don’t learn from experience. These limitations, especially in crucial contexts, warrant caution when using them. The models’ capabilities and constraints introduce new safety concerns, emphasizing the need for research due to its potential societal impact. The paper delves into potential risks, including bias, disinformation, and more, and outlines measures taken to minimize these risks, such as adversarial testing and a safety pipeline [1].

Initial Release of GPT-4V

Initially, the visual version, GPT-4V, was exclusively accessible through Be My Eyes, an app created by a Danish startup. This app aids individuals who are blind or have low vision by connecting them with volunteers for tasks like product identification and airport navigation. Harnessing the capabilities of GPT-4V, the startup introduced a “Virtual Volunteer” feature in the app, designed to match the understanding level of a human assistant [2].

OpenAI’s Data Collection Policy

OpenAI changed its data collection policy. By default, your prompts and responses can be used by OpenAI to train future models. They previously used an online form where users could opt out of the data collection. With their new policy, data collection is now tied to chat history. If you turn off chat history, they will not use your prompts and responses to train their models. You can read about their new policy here.

Next, I’ll show more experiments using GPT-4V as an art critic.

**ChatGPT User Interface**, Screenshot by Author

GPT-4V as an Art Critic

Can GPT-4V act as an art critic? I asked it this very question in the screenshot above, and it answered, “Absolutely!” and explained why it could. Let’s see how it does in a few examples.

**Swarming by Anna Kristina Goransson and Rob Gonsalves**, Images by Author

Installation Art

The first example is a review of an interactive installation called “Swarming,” which I co-constructed with my friend and artist, Anna Kristina Goransson. The installation showcases felted and dyed forms, which act as homes for “boids,” digital creatures that appear to fly out of the forms and fly around through projection, eventually going back into the forms. The boids can be disrupted and moved by the viewer through a Microsoft Kinect that is watching the viewer.

I sent GPT-4V the three images and a description of the installation to get a review. Here’s how the interaction went.

RobG: Based on the photos and description of this interactive installation, write a review with constructive criticism. [description]

GPT-4V: Review of “Swarming” by Anna Kristina Goransson and Rob Gonsalves

“Swarming” serves as an ambitious testament to the possibilities of blending tactile artistry with the immersive capabilities of modern technology. The vividly hued felt sculptures, designed by Anna Kristina Goransson, lend a whimsical texture to the exhibit. These organic forms juxtaposed against their animated counterparts offer a surreal escape from the mundane, channeling a world where art literally comes to life.

The four modes of interaction — Sleeping, Scouting, Swarming, and Interacting — have been thoughtfully designed to give visitors a layered experience. The mere presence of an audience can change the installation’s behavior, bridging the gap between passive viewing and active participation. The use of the Microsoft Kinect interface to track visitors’ movements and trigger visual and auditory responses is both innovative and engaging. This immediate feedback loop allows visitors to feel truly integrated within the artwork.

However, there are areas that might benefit from further refinement:

1. Ambient Sound: The silent modes, specifically Sleeping and Scouting, could potentially utilize subtle ambient sounds. This would enhance the experience by ensuring that the atmosphere doesn’t feel too stark or empty during these quieter moments.

2. Clarity on Interaction: A simple guide or indication on how visitors can transition the exhibit from one mode to another might be beneficial. This can help ensure that visitors can fully explore the depth of interactivity that “Swarming” offers.

3. Lighting: Considering the importance of video projectors in this installation, special attention should be given to the ambient lighting of the exhibit room. Proper lighting can enhance the visibility of projections and ensure the felt sculptures are appropriately illuminated.

4. Diversity in Animation: While the “boids” offer a dynamic element to the exhibit, introducing varied patterns or behaviors could further enrich the viewer’s experience. This could be as simple as diversifying the colors or sizes of the boids.

In conclusion, “Swarming” is an evocative blend of traditional art and cutting-edge technology, offering a multi-sensory experience that is both innovative and nostalgic. With a few adjustments, this installation has the potential to captivate audiences even further, making each visit a unique journey of discovery.

GPT-4V’s review of the piece was very positive. It started with a highly favorable assessment of the installation using words like “ambitious,” “whimsical,” “innovative,” and “engaging.” It’s almost as if I paid it to write the review. Oh, wait. I did. 🙂

But it did return with some reasonable points of constructive criticism. The first point, about making sounds during the sleeping and scouting modes, doesn’t take the art sitter’s sanity into account. The points about having instructions for viewers and the lighting conditions are valid. The final point about having more diversity in the animation is good, too. But the suggestion to vary the sizes of the boids sounds a lot like its suggestion to change the size of the pointillist dots in my opening example. Hmm, let’s see if this trend continues in other examples.

**User Interfaces for Dream Studio** (right), **DALL-E 3** (center), **and Midjourney** (left), Screenshots by Author

AI-Generated Art

To test the system’s comparative art criticism abilities, I used the same prompt as input into three text-to-image generation models, asking GPT-4V to compare the generated images. The image generation models I used are the latest available at Dream Studio from Stability AI, DALL-E 3 from OpenAI, and Midjourney. I have a paid account for each of these services. You can see the user interfaces in the screenshots above.

Of these three image generation systems, DALL-E 3 has an interesting twist; it uses ChatGPT as the interface. When I asked it to generate a painting using the prompt, “Impressionist painting of autumn woods,” it first expanded it to a much longer prompt, “Oil painting in the Impressionist style, depicting a dense forest during autumn. Golden leaves carpet the ground while the trees are adorned with vibrant shades of orange, red, and yellow. Dappled sunlight filters through the canopy, casting shimmering patches of light onto the forest floor.” This is a pretty cool feature, so I used it. If I had typed, “Create an Impressionist painting of autumn wood. Use this exact prompt.” it would skip the prompt expansion step.

Here are the results of generating images with “Impressionist painting of autumn woods.”

***“Impressionist painting of autumn woods,” Generated in Dream Studio*** *(left), **DALL-E 3** (middle), **and Midjourney** (right), Images by Author*

Here’s the prompt I sent into GPT-4V, along with the three images.

RobG: Analyze these three images created with the prompt, “Impressionist painting of autumn woods.” Provide contrastive feedback referring to the images as left, middle, and right.

GPT-4V: Of course! Let’s delve into the analysis of these three impressionist paintings of autumn woods.

Left Image: The left image utilizes a muted color palette, predominantly showcasing soft oranges, yellows, and cool grays. The trees are painted with slender trunks and have a smooth texture, giving the entire scene a delicate, ethereal vibe. The foreground has a clear path covered with fallen leaves, guiding the viewer’s eye into the depth of the woods. There’s a sense of serenity and quiet in this painting, as if one might hear the whisper of the wind rustling through the leaves.

Middle Image: The middle image is bolder and more vibrant in its color choices, with deep golds, fiery oranges, and hints of green. There is an intense play of light and shadow, particularly how the sunlight filters through the canopy, casting beams of light onto the forest floor. Unlike the left image, this scene is denser, with a concentration of trees that appear closer together. The focus on the forest floor, covered with scattered leaves illuminated by the sun, creates a magical, almost dreamy atmosphere.

Right Image: The right image is the most dynamic and vivid of the three, with a color palette that pops, showcasing bright oranges, reds, and contrasting blues. There’s a reflection in what appears to be a stream or a puddle, doubling the impact of the autumn colors and creating a mirror image of the trees. This painting feels more abstract than the others, with bolder brushstrokes and a play of reflections, making it seem as if the viewer is witnessing the scene after a rainfall. There’s an emphasis on movement, perhaps due to the water’s ripple effect and the directionality of the fallen leaves.

Contrastive Feedback: While all three paintings beautifully capture the essence of autumn woods, they do so in distinct ways. The left painting offers a calm and muted representation, focusing on the tranquility of the woods. In contrast, the middle painting immerses the viewer in a radiant forest bathed in sunlight, exuding warmth and enchantment. The right image stands out for its bold colors and abstraction, using reflections and vibrant hues to convey the dynamic nature of the season. Each has its own unique charm, making them all captivating in their own right.

Wow! That’s an excellent, comprehensive analysis of the three paintings. Again, the reviews are all very positive. I liked how it started comparing the images beginning with the description of the second image, “Unlike the left image, this scene is denser …” And it kinda crowned the third image as the winner early on, “The right image is the most dynamic and vivid of the three …” But it ended on an upbeat note, where everyone gets a star, “Each has its own unique charm …”

Looking for some concrete constructive criticism, I asked it this.

RobG: Briefly describe how each of the paintings could be improved.

GPT-4V: Left Image: To enhance its allure, the artist could introduce a bit more contrast in the colors to give the scene more depth. A touch of brighter, more vibrant autumn colors might also help certain elements pop, ensuring that the image doesn’t appear too monochromatic or washed out.

Middle Image: While the play of light and shadow is beautifully executed, the artist could work on defining the trees a bit more, giving them more texture or variance in bark patterns. This would break up the uniformity and provide additional points of interest.

Right Image: The dynamic nature of this painting is its strength, but it could benefit from a clearer focal point. Whether it’s a specific tree, reflection, or another element, giving the viewer’s eye a place to rest might bring a bit more balance to the scene. The artist could also refine the reflections in the water, ensuring they align more naturally with the trees they mirror.

Here are the three images again for reference.

***Generated Paintings from Dream Studio*** *(left), **DALL-E 3** (middle), **and Midjourney** (right), Images by Author*

I agree with all three of these pieces of advice. The left image could use a boost of contrast and color saturation. The middle image could use some more definition in the trees. And the right image is missing a central object of focus.

Conclusion and Next Steps

In this exploration, I found GPT-4V’s capabilities as an art critic both promising and insightful. By juxtaposing its analytical abilities against various forms of visual art, it was evident that the model can provide both positive feedback and constructive criticism with a fair amount of depth.

From the feedback on my painting to more complex interactive installations and AI-generated art, GPT-4V demonstrated a nuanced understanding of art and a surprisingly discerning eye. Although the model was notably positive in its reviews, which may reflect OpenAI’s design decisions to avoid overly negative responses, it did not hesitate to offer suggestions for improvement when asked.

However, it is crucial to keep in mind that while GPT-4V can offer an analysis, the actual value of art often lies in the subjective experience of the viewer. AI tools like GPT-4V can provide feedback, but they shouldn’t replace the human touch, which brings personal perspectives, emotions, and experiences into interpreting art.

In the future, it would be intriguing to explore how GPT-4V might respond to more abstract art forms or even delve into a comparative analysis from historical art periods. For example, it would be interesting to have it critique two French impressionist paintings and see if it recognizes them. Furthermore, as the model continues to evolve and improve, there’s potential for even deeper insights into art, design, and other visual mediums.

In conclusion, GPT-4V offers a unique and valuable tool for artists seeking feedback. Still, it is essential to approach its critiques with a balanced perspective, remembering that art is as much about the human experience as it is about technique or visual appeal.

Acknowledgments

I thank Jennifer Lim and Oliver Strimpel for reviewing this article and providing feedback.

References

[1] OpenAI, GPT-4 Technical Report (2023)

[2] OpenAI, Be My Eyes — Be My Eyes uses GPT-4 to transform visual accessibility (2023)

[3] T. Brown et al., Language Models are Few-Shot Learners (2020)

Postscript

**Fan Correction: It’s Not 90 Feet From The Pitcher’s Mound To Home Plate!**, Image by Author

Although GPT-4V liked my “Warren Park in Autumn” painting, Conan O’Brien was not so kind. You can hear his assessment of my paintings in this “Fan Correction” clip from 2013, starting at the 01:36 mark.