Getting started with Azure GPT-4-Turbo Vision

Some Examples in 5 Industries

On December 12, Microsoft announced the Public Preview of the GPT-4-turbo Vision model in Azure.

GPT-4-turbo vision is a Large Multimodal Model (LMM) developed by OpenAI, which is able to take as input texts and images. Additionally, within the Azure AI Studio, you can also integrate the model capabilities with Azure AI Services for vision, in order to:

Enhance OCR
Enhance object detection
Enhance video analysis

In this article, we will explore x scenarios of application of the GPT-4-turbo vision within the following contexts: Education, Manufacturing, Healthcare and Lice Science, Cultural Heritage and Fashion.

Education

LLMs have already proved exceptional capabilities within education. One success story is that of Khan Academy, a learning platform that have explored the capabilities of GPT-4 with a personalized learning assistant called Khanmigo that helps students learning given topics.

Incorporating the model’s vision capabilities, we can bring this approach to the next level. Let’s explore a simple example with a physics problem.

This is the prompt I’ll me using:

You are an AI assistant that helps students do their homework. Your goal is to let the student get to the problem’s solution providing hints, not the final solution. Be sure to accompany the student through the learning process, providing examples and also checking here and there the student’s understanding.

The scenario will be the following physics problem:

Source: Learn AP Physics — Problem of the Day — Solution

I’ll ask my Homework Assistant some support in solving this problem:

Let’s see the response:

Let’s say that I really cannot make it, and I just want the solution to the problem:

As you can see, the assistant is very well aligned with our system message, and it won’t give the final solution to the problem. Since I cannot see and shortcut from here, I’ll do my math and find the solution. Let’s check with our assistant whether it is correct!

Cool! The Assistant was able to guide me trough the learning process of getting the right result. Plus, it was also able to confirm to me the final solution. That is a great example of model’s alignment with human instructions (system message).

Manufacturing

Manufacturing is one of the most important sectors of the economy, as it contributes to employment, innovation, and trade. According to the World Bank, the global value added by the manufacturing industry was about 16% of the world GDP in 2019.

Generative AI has the potential to impact the manufacturing industry in various ways, such as accelerating product development, automating repetitive processes, improving quality control, enhancing innovation and creativity, and optimizing supply chain and logistics.

According to a recent survey made by BCG to manufacturers, those use cases can be grouped into three main areas of innovation: assistance systems, recommendation systems, and autonomous systems — that correspond to maturity levels in the “factory of the future”.

GPT4-turbo vision can definitely accelerate this innovation process in two ways: enhancing existing Computer Vision tasks (such as quality check) or introducing new AI assistants, such as a copilot for plant operators to produce remediation tutorial given picture of the current environment.

Let’s consider an example where we feed the model with a picture of a damaged cross-section of an electric cable. In this case, I’m also incorporating the Azure AI services for vision, so that we can enhance the model with Object detection capabilities.

First, let’s set our system message:

You are an AI assistant for manufacturers that helps in tasks like defect detection, plant operator’s assistant, and remediation tutorials.

Then, let’s see some interactions:

As you can see, in this scenario the model invoked the Object Dection AI service, which is visible due to the bounding boxes it produced and referenced.

Let’s now ask whether there are some defects within this cable:

And finally, we can ask to generate a remediation tutorial as follows:

Now, imagine you have specific documentations about your machineries. You could embed your knowledge base and build a RAG-based application, incorporating the vision capabilities of the new GPT-4-turbo Vision. This would create the perfect AI assistant for plant operators!

Healthcare and Life Science

Generative AI has the potential to impact the HCLS industry across all the segments, from drug discovery in Pharmaceutical firms, to patients care for healthcare providers (such as Hospitals or private clinics). According to a recent research from BCG, there are already many use cases HCLS companies are experimenting, some already validated (such as accelerating drug discovery and design, as Insilico Medicine did), while others are still under validation or conceptual.

Leveraging the vision capabilities of the GPT-4 turbo, we can unlock even further scenarios. In this paragraph, I’m going trough a couple of examples in the field of healthcare providers.

Prompt used:

You are a medicine expert. Your role is to support the doctor in its exams analysis and diagnoses. The doctor can provide you with pictures, x-rays, blood exams, and other data. The doctor might want to brainstorm with you, so use all the knowledge you have to answer.

Let’s start with an orthopedic scenario. In this case, we have an X-rays scan of a post-surgery right knee exhibiting a hardware system due to a tibial plateau fracture.

Imagine I’m an orthopedic who receive this patient for the first time, who share with me this pictures showing its clinical history. I might be curious to interact with my AI assistant to reason about the surgery my patient went trough in the past:

Let’s now see another example with blood tests. In this case, this application might be extremely useful for patients with poor or no knowledge in the field, who might want to understand the meaning of their test’s results.

Here I’m providing a sample blood test that indicates a severe anemia with iron deficiency.

As you can see, the model was able to perfectly identify the disease, specifying also that in this case it is an iron-deficiency anemia. Now, since I’m a non-expert and I don’t know what anemia is, neither how should I treat it. Let’s see how it works:

Ideally, such a patient assistant could be implemented by Healthcare structures, so that patients can be fully informed about their disease and also book their appointments with doctors directly via the assistant. Namely, in the above example, with a fully integrated solution I might be suggested to accomplish step one, with full visibility upon the doctor’s calendar and the possibility to book an appointment as soon as possible.

Fashion

Let’s now see an application within the Fashion industry. Within this context, Computer Vision is not new: many companies have invested in AI models that are able to perform brand detection in pictures. However, also in this case we still suffer from the “traditional AI curse”: the lack of generalization.

What if I want a model that is able to examine a whole outfit, identify brands, sharing suggestions about possible improvements?

Let’s say we are attending a fashion show and we have to write a review for each outfit. We take a picture and feed it to our model (took from a Gucci’s show in 2015):

Also in this case, we used the Vision enhancements from Azure AI Services (as you can see from the bounding boxes). Let’s see the response:

Now I want to see whether it is able to recognize the brand:

It did it! Note that there is no evident logo on top of the bag, so the model was able to retrieve it just inferring the style and fabric of the item. Finally, let’s ask the model a suggestion about what to change in the outfit:

Great, now I have all the elements to write my article about the fashion show:

Ready to be published to the most popular fashion magazines!

Cultural Heritage

The last example I want to provide is in the context of Cultural Heritage, the sector that deals with the preservation, promotion and transmission of cultural heritage, such as monuments, artefacts, traditions, languages and arts.

In recent years, digital innovation has already impacted this sector in various ways, such as providing new methods and tools to document, record, reconstruct, display, interpret and preserve different forms of heritage (especially those that are at risk of disappearance or damage) or enhancing cross-sector, cross-border cooperation and capacity building among different stakeholders (such as museums, heritage sites, governmental bodies, academic institutions and communities).

In this example, we are going to leverage the GPT-4-turbo vision to further engage a tourist or citizen while visiting iconic places such as monuments or museums. To do so, we will use the following prompt:

You are an expert touristic guide. You answer user’s questions about monuments, museums, historical places, and similar. You can provide historical context and share suggestions on how to enjoy the experience at the best. Feel free to suggest additional activities users can do to fully experience what they are visiting.

Let’s start by sharing a picture of the well-known Duomo di Milano:

What if I’m interested in knowing more about the cathedral’s spires?

Now let’s do a different exercise. I’m studying Caravaggio’s paintings and there is one item, “The Cardsharps”, I’ve been particularly captured by . Let’s engage GPT-4 vision in a conversation about that.

We can also further investigate where to find this painting and what is the message the artist wanted to convey:

Now, imagine to be a tour operator providing an application with such capabilities. As a tourist, I’d be super happy and engaged by having this assistant, that also allows me a great flexibility in terms of timing, style, availability etc.

Conclusion

GPT-4-turbo vision and, in general, Large Multimodal Models, are unlocking a new wave of scenarios across different industries. The above examples are just a sample of what we can achieve with this new model, and I’m looking forwards to witnessing the digital transformation this will bring in the market.