Dariusz Gross #DATAsculptor

Summary

Microsoft's Turing Bletchley is a state-of-the-art AI model that interprets images and text across 94 languages, marking a significant advancement in image-language understanding.

Abstract

The Turing Bletchley model, introduced by Microsoft, represents a breakthrough in AI by enabling image-language tasks across a wide array of languages. Trained on billions of words and images using a large dataset, the model employs a language-agnostic vector representation and contrastive loss to align semantically similar pictures and texts. It leverages the ZeRO optimizer and DeepSpeed library for PyTorch, resulting in superior performance over previous architectures. T-Bletchley excels in image understanding, captioning, translation, and object tagging, and it encodes images and text into vector representations that map with each other, facilitating a bottom-up approach to language understanding through computational analysis of visual data. The model's architecture, which includes transformers for image and text encodings, allows for fine-tuning to capture subtle image content, such as facial expressions, paving the way for potential applications in AI art and beyond.

Opinions

  • The Turing Bletchley model is seen as a significant breakthrough in AI, particularly in the field of image-language understanding.
  • The model's ability to handle a wide variety of images effortlessly and understand text and images as seamlessly as humans is highlighted.
  • T-Bletchley's potential to revolutionize AI is emphasized, with its capacity to understand a wide variety of objects, actions, and concepts in the real world.
  • The model's training on a large, diverse dataset is believed to have resulted in a robust system capable of handling images from all kinds of web sources.
  • The opinion that T-Bletchley could lead to an AI art movement is expressed, with the model's sensitivity to facial expressions and subtle elements of human behavior seen as key to creating meaningful art.
  • The model is considered to have advantages over previous architectures, such as OpenAI’s CLIP, due to its universal image-language representation capabilities.

November 2021 - State of the Art: Transformers

Power to the Pixel

DEMO: Turing Bletchley

source: https://turing.microsoft.com/bletchley

Turing Bletchley is a model that can interpret pictures in a groundbreaking way. Microsoft moves beyond language for AI with the introduction of T-Bletchley, a model that can perform image-language tasks in 94 languages.

TL;DR: The model was trained on billions of publicly available words and images. For this purpose, data scientists at Microsoft’s Project Turing used a large dataset containing sentences chosen randomly from the web. They built a language-agnostic vector representation for each sentence and then applied a contrastive loss to those vectors. The model was trained using the ZeRO optimizer and the DeepSpeed library for PyTorch. The result is a state-of-the-art image understanding model with universal capabilities and significantly better performance than previous architectures.
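To make the contrastive loss idea concrete, here is a minimal sketch in plain Python. It assumes the symmetric, CLIP-style form of the loss (cross-entropy over image-to-text and text-to-image similarities) and a temperature of 0.07; Microsoft's exact formulation is not given in this article, and the vectors and function names below are made up for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text vectors.

    Assumption: pair i is (image_vecs[i], text_vecs[i]); every other
    combination in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_vecs]
    txts = [l2_normalize(v) for v in text_vecs]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Hypothetical 3-d embeddings for two image/caption pairs.
images = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
matched_texts = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.3]]
swapped_texts = [matched_texts[1], matched_texts[0]]

loss_matched = contrastive_loss(images, matched_texts)
loss_swapped = contrastive_loss(images, swapped_texts)
print(f"matched pairs: {loss_matched:.4f}, swapped pairs: {loss_swapped:.4f}")
```

The loss is low when each image sits closest to its own caption and high when the captions are swapped, which is the signal that pulls matching pictures and texts together during training.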

source: https://turing.microsoft.com/bletchley

Are there any art exhibitions worthy of a photo coming up? Do you want your memories to last a lifetime? This model has been trained to understand text and images as seamlessly as humans do, and the first version of T-Bletchley represents a significant breakthrough in this mission. This latest release from Microsoft handles a wide variety of images with ease. T-Bletchley can also be used for image captioning, image translation, and object tagging applications.

T-Bletchley does a remarkable job of encoding images and text into vector representations that map onto each other. This allows semantically similar pictures and texts to align, leading to breakthroughs in image-language understanding. An image with text is represented as a vector of parameters denoting the presence or absence of content in different parts of the picture. This enables the model to quickly encode thousands of other images while maintaining high accuracy. Traditionally, models have had to account for the underlying data in a language-specific process to create a system that can perform better than human experts. That is a top-down approach, which requires an understanding of language and complex knowledge representations. In contrast, T-Bletchley works through image analysis and deep learning, with less reliance on grammatical rules and expert experience. This bottom-up approach requires minimal understanding of grammar and relies more on the computational analysis of visual data.
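The idea of images and texts landing near each other in one vector space can be sketched with cosine similarity. The 4-dimensional vectors below are invented for illustration (real embeddings have hundreds of dimensions and come from the trained encoders); the point is only that a caption in any language should score high against the image it describes.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embedding of a photo showing a dog on a beach.
image_embedding = [0.8, 0.1, 0.5, 0.0]

# Hypothetical caption embeddings; the English and German captions describe
# the same scene, so a language-agnostic encoder should place them close together.
caption_embeddings = {
    "a dog on the beach": [0.7, 0.2, 0.6, 0.1],
    "ein Hund am Strand": [0.75, 0.15, 0.55, 0.05],
    "a bowl of soup": [0.0, 0.9, 0.1, 0.4],
}

scores = {text: cosine(image_embedding, vec) for text, vec in caption_embeddings.items()}
best = max(scores, key=scores.get)

for text, score in scores.items():
    print(f"{score:.3f}  {text}")
```

Both dog captions score close to 1 while the unrelated caption scores far lower, which is what "aligning" semantically similar pictures and texts means in practice.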

T-Bletchley brings together two different AI systems: (1) deep learning over images (e.g., image captioning, object recognition) and (2) hierarchical/graph-based probabilistic models for language processing (e.g., neural machine translation, syntactic parsing). T-Bletchley is general purpose and can be applied to a wide variety of tasks. Microsoft’s tool is one component of a bigger picture that considers the various modalities of language and their representation in images, instead of or in addition to text. In T-Bletchley, language is extracted from images using a fundamental feature transform from image pixel values to vector elements, designed specifically for image-to-language modeling.

OpenAI’s CLIP model has been the gold standard in zero-shot image classification. Still, this latest release from Microsoft is poised to revolutionize Artificial Intelligence because it can understand a wide variety of objects, actions, and many other real-world concepts (dancing, programming, racing). Today, image retrieval relies heavily on the metadata that comes with images. For example, the text in the caption or the link to a photo could be enough to find it in an extensive database. T-Bletchley is unique because it can actually understand the image itself, so retrieval systems built on it can give greater weight to image content than to the surrounding metadata. The T-Bletchley model was trained using billions of image-caption pairs drawn from the web. This large, diverse training dataset resulted in a robust model that can handle a wide variety of images.
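Here is a rough sketch of what retrieval and zero-shot classification look like once images and text share an embedding space. Nothing here is the T-Bletchley API: the file names, vectors, and helper functions are hypothetical, and in a real system the vectors would come from the image and text encoders rather than being typed in by hand.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_vec, image_index, top_k=2):
    """Rank images by similarity to a text query vector, ignoring all metadata."""
    q = normalize(query_vec)
    scored = [(dot(q, normalize(vec)), name) for name, vec in image_index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

def zero_shot_classify(image_vec, label_vecs):
    """Pick the label whose text embedding is closest to the image embedding."""
    img = normalize(image_vec)
    return max(label_vecs, key=lambda label: dot(img, normalize(label_vecs[label])))

# Hypothetical image embeddings; the file names carry no useful metadata.
image_index = {
    "IMG_0001.jpg": [0.9, 0.1, 0.0],  # people dancing
    "IMG_0002.jpg": [0.1, 0.9, 0.1],  # a person programming
    "IMG_0003.jpg": [0.0, 0.2, 0.9],  # cars racing
}

# Hypothetical text embeddings for three concept labels.
label_vecs = {
    "dancing": [1.0, 0.0, 0.0],
    "programming": [0.0, 1.0, 0.0],
    "racing": [0.0, 0.0, 1.0],
}

racing_query = [0.0, 0.1, 1.0]  # stand-in for the embedding of "a car race"
top_hit = retrieve(racing_query, image_index, top_k=1)
predicted = zero_shot_classify(image_index["IMG_0002.jpg"], label_vecs)
print(top_hit, predicted)
```

The classifier never sees labeled training images for these three concepts, which is the sense in which the classification is "zero-shot": the labels are just more text embedded into the same space.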

Model architecture

source: https://www.microsoft.com/en-us/research/blog/turing-bletchley-a-universal-image-language-representation-model-by-microsoft/

The T-Bletchley architecture is composed of transformers that perform image and text encodings. It is similar to the BERT architecture in that sense but also includes powerful attention-based interactions between models. Like BERT, T-Bletchley can perceive multiple objects across far-apart images. It can also see the commonalities between images and captions. Like most modern deep learning models, T-Bletchley can be fine-tuned to learn subtler shifts in image content. For example, the model is sensitive to facial expressions in images and can therefore infer emotions or intent with greater precision than previously published models.
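Transformers, whether they encode text as in BERT or image patches, are built around self-attention. The sketch below shows single-head scaled dot-product self-attention in plain Python. It is a simplification: it uses the inputs directly as queries, keys, and values, whereas a real transformer layer applies learned projections and several heads, and the article does not describe T-Bletchley's layer configuration. The "patch" vectors are made up.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections.

    Each token (or image patch) vector is replaced by a weighted average of
    all token vectors, weighted by how similar they are to it.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
            for key in tokens
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w_j * value[i] for w_j, value in zip(w, tokens)) for i in range(d)]
        )
    return outputs, weights

# Hypothetical 2-d vectors for three image patches; the first two are similar.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(patches)

for row in weights:
    print([round(w, 3) for w in row])
```

Every patch can attend to every other patch regardless of position, which is how a transformer relates content that sits far apart in the input.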

source: https://www.microsoft.com/en-us/research/blog/turing-bletchley-a-universal-image-language-representation-model-by-microsoft/

This sensitivity to facial expressions matters for AI art, which depends on exactly that kind of subtlety. If an AI art model can understand and represent the subtlest elements of human behavior and motivation, it can create beautiful and meaningful art. I think we will see an AI art movement with T-Bletchley as its primary model.

AI Artists will be able to explore new creative possibilities beyond human intuition.

The T-Bletchley model was trained on a large, diverse dataset, resulting in a robust system that can handle images from all kinds of web sources.

DEMO: https://turing.microsoft.com/bletchley

Reference: https://www.microsoft.com/en-us/research/blog/turing-bletchley-a-universal-image-language-representation-model-by-microsoft/

I’m curious about your opinion
