
An Illustrative Guide to Masked Image Modelling
by Farin Habiba

Today in machine learning, methods and models from one field can often carry out tasks from another. For instance, some computer vision tasks can be completed by models that were primarily designed for natural language processing. In this article, we will talk about a technique carried over from NLP to computer vision: applying masking to images for computer vision tasks, an approach known as Masked Image Modelling. We will try to understand how this technique works and cover some of its key applications. The following list includes the main points that will be covered in this article.
Table of Contents
What is Masked Image Modelling?
The framework of Masked Image Modelling
Works Related to Masked Image Modelling
Applications of Masked Image Modelling
Let’s begin the discussion by understanding what masked image modelling is.
What is Masked Image Modelling?
Masked signal modelling is a type of machine learning in which a model learns from the visible part of an input to predict the part that has been masked out. This kind of learning has well-established use cases in self-supervised learning for NLP, and various studies show masked signal modelling being used to learn from vast amounts of unannotated data. For computer vision, the method can produce results comparable to those of approaches like contrastive learning. Masked image modelling is this idea applied to images: parts of an image are masked, and a model is trained to predict them in order to perform computer vision tasks.
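To make the idea concrete, here is a minimal, self-contained sketch in NumPy. Everything in it is illustrative rather than taken from any particular paper: random patches of an image are hidden, a trivial “predictor” (the mean of the visible pixels) fills them in, and the loss is computed only on the masked region, which is the defining trait of masked modelling.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((224, 224))          # stand-in for a grayscale image
patch, mask_ratio = 16, 0.6

# Build a boolean mask over the 14x14 patch grid, then expand it to pixels.
grid = 224 // patch
masked = rng.random((grid, grid)) < mask_ratio
pixel_mask = np.repeat(np.repeat(masked, patch, axis=0), patch, axis=1)

# A trivial "predictor": fill every masked pixel with the visible mean.
prediction = np.full_like(image, image[~pixel_mask].mean())

# As in masked modelling, the loss is computed only on the masked part.
l1 = np.abs(prediction - image)[pixel_mask].mean()
print(f"masked {pixel_mask.mean():.0%} of pixels, L1 on masked region = {l1:.3f}")
```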
Applying masked image modelling raises the following difficulties:
Pixels next to one another are strongly correlated, so a masked pixel can often be recovered trivially from its neighbours.
In contrast to the signals (tokens) in NLP data, the signals in images are unprocessed and low-level.
Whereas text signals are discrete, visual signals are continuous.
So, to keep the model from simply exploiting local correlation, the masking must be designed very carefully when this method is applied to image data. The approach must also show that predicting low-level pixel data can benefit high-level visual tasks, and it must accommodate the continuous nature of visual signals.
We can see numerous examples of works that address these issues in masked image modelling, including:

Pre-Trained Image Processing Transformer: This work demonstrates the use of continuous signals from images for classification problems while also using color clustering methods.
Swin Transformer V2: This work demonstrates a method for scaling a Swin Transformer up to 3 billion parameters, enabling it to learn from images with a resolution of up to 1536 × 1536 and perform a range of computer vision tasks. The authors apply adaptation techniques for the continuous signals derived from images.
BEiT: BERT Pre-Training of Image Transformers: This work can be seen as a BERT-style model for computer vision. It employs a similar tokenization strategy, using an extra network to tokenize image data, and block-wise image masking to break the short-range correlation between pixels (a sketch of block-wise masking follows this list).
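As a rough illustration of block-wise masking, here is a hedged sketch in the spirit of BEiT: contiguous rectangular blocks of patches are masked rather than isolated patches, so a masked patch cannot simply be guessed from its immediate neighbours. The grid size, block sizes, and mask ratio below are illustrative choices, not the exact values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = 14                        # 224/16 patches per side for a ViT-style input
mask = np.zeros((grid, grid), dtype=bool)
target = int(0.4 * grid * grid)  # aim to mask roughly 40% of the patches

# Keep stamping random rectangular blocks until enough patches are masked.
while mask.sum() < target:
    h, w = rng.integers(2, 6), rng.integers(2, 6)   # block size in patches
    top = rng.integers(0, grid - h + 1)
    left = rng.integers(0, grid - w + 1)
    mask[top:top + h, left:left + w] = True

print(f"masked {mask.sum()} of {grid * grid} patches in contiguous blocks")
```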
From the works above, we can see a few illustrations of methods that address these problems, and we can appreciate the degree of complexity needed to create a model or framework that manages these challenges and carries out the required task. The core concept of a model or framework for masked image modelling can be illustrated by the following picture:
[Figure: input image patches, with masked patches and a linear layer regressing the masked pixel values. Image source.]
In the figure above, we can see the input image patches, along with a linear layer that performs regression on the pixel values of the masked area, where the loss is applied. A simple model of this design consists of the following:
Masking applied to the input images
A regression model on raw pixel values
A portable, lightweight prediction head
For a transformer, we can keep the procedure simple by performing basic random masking on the images. The regression task is well matched to the continuous nature of visual signals, and a light prediction head can significantly speed up pre-training. Although heavier heads can produce stronger generation quality, they may suffer in downstream fine-tuning tasks.
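The design above can be sketched in a few lines of PyTorch. This is a minimal illustration under assumptions of my own (a single Linear layer stands in for a real ViT or Swin backbone, and all sizes are arbitrary): masked patches are replaced by a learned mask token, a light linear head regresses raw pixel values, and the L1 loss is computed on the masked patches only.

```python
import torch
import torch.nn as nn

B, N, P = 2, 196, 16 * 16 * 3          # batch, 14x14 patches, pixels per patch
encoder = nn.Linear(P, 256)            # placeholder for a ViT/Swin backbone
head = nn.Linear(256, P)               # the light, portable prediction head
mask_token = nn.Parameter(torch.zeros(P))

patches = torch.rand(B, N, P)          # flattened image patches
mask = torch.rand(B, N) < 0.6          # roughly 60% of patches masked

# Replace masked patches with a learned mask token, encode, and predict.
inp = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, P), patches)
pred = head(encoder(inp))

# Raw-pixel regression: L1 loss computed on the masked patches only.
loss = (pred - patches).abs()[mask].mean()
loss.backward()
print(f"L1 loss on masked patches: {loss.item():.4f}")
```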
The Framework of Masked Image Modelling
Masked Image Modelling (MIM) is a framework for generating high-quality images from incomplete or corrupted inputs. The framework works by iteratively refining an initial estimate of the complete image using a series of conditional models.
The MIM framework can be broken down into the following steps:
Input Masking: The input image is masked, meaning that a portion of it is intentionally removed or obscured. The goal is to generate a complete image that matches the original input as closely as possible, even in areas where information was removed.
Initialization: An initial estimate of the complete image is generated, based on the available information in the unmasked portion of the input image.
Iterative Refinement: The initial estimate is iteratively refined using a series of conditional models. Each model is trained to predict a specific portion of the complete image, given the available information at that point in the refinement process.
Output Generation: Once the refinement process is complete, the final estimate of the complete image is generated by combining the results of each of the conditional models.
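Here is a toy, runnable stand-in for these four steps using NumPy. The “conditional model” is just a neighbourhood average that repeatedly re-estimates the masked pixels; a real MIM system would use learned models instead, but the control flow of mask, initialize, refine, and output is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))

# 1) Input masking: hide a square region of the image.
mask = np.zeros_like(image, dtype=bool)
mask[20:44, 20:44] = True

# 2) Initialization: fill the hole with the mean of the visible pixels.
estimate = image.copy()
estimate[mask] = image[~mask].mean()

# 3) Iterative refinement: re-predict each masked pixel from its four
#    neighbours, repeatedly (a stand-in for learned conditional models).
for _ in range(200):
    padded = np.pad(estimate, 1, mode="edge")
    neighbours = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                  padded[1:-1, :-2] + padded[1:-1, 2:]) / 4
    estimate[mask] = neighbours[mask]    # only the masked region is updated

# 4) Output generation: visible pixels are untouched, the hole is filled.
print(f"mean L1 error inside the hole: {np.abs(estimate - image)[mask].mean():.4f}")
```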
There are several variations of the MIM framework, each with its own set of conditional models and training strategies. Some common types of conditional models used in MIM include:
Autoregressive models, which generate each pixel in the image sequentially, based on the values of previously generated pixels.
Variational Autoencoder (VAE) models, which generate the complete image by sampling from a learned probability distribution.
Generative Adversarial Network (GAN) models, which generate the complete image by training a generator network to produce realistic images that can fool a discriminator network.
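To illustrate the autoregressive flavour in the simplest possible form, the toy sketch below completes a row of pixels one at a time, each conditioned on the pixel before it. The “model” here is just a hand-written rule standing in for a learned conditional network such as a PixelCNN-style model.

```python
import numpy as np

rng = np.random.default_rng(0)
row = rng.integers(0, 256, size=32).astype(float)   # one image row, 8-bit range
known = 20                                          # pixels 0..19 are visible

completed = row.copy()
for i in range(known, len(row)):
    # Toy conditional "model": next pixel is the previous pixel plus noise.
    completed[i] = np.clip(completed[i - 1] + rng.normal(0, 10), 0, 255)

print("last visible pixel:", completed[known - 1].round(1))
print("generated tail:", completed[known:].round(1))
```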
MIM has a wide range of applications, including image inpainting, image super-resolution, and image denoising. It has proven to be an effective method for generating high-quality images from incomplete or corrupted inputs.

In the section above, we have seen the basic architecture of the masked image modelling framework and the components that let it perform computer vision tasks. Let’s look at some examples where we can see masked image modelling at work.
Works Related to Masked Image Modelling
There have been many works related to Masked Image Modelling (MIM) in recent years. Here are a few notable examples:

"Generative Image Inpainting with Contextual Attention" by Yu et al. (2018) - This paper introduced a MIM approach for image inpainting that uses contextual attention to guide the refinement process. The method achieves state-of-the-art results on several benchmarks.
"Deep Image Prior" by Ulyanov et al. (2018) - This paper proposes a MIM approach based on the assumption that convolutional neural networks can learn the structure of natural images without any training data. The method achieves impressive results on a range of image restoration tasks.

"Plug-and-Play Generative Networks: Conditional Iterative Generation of Images in Latent Space" by Nguyen et al. (2019) - This paper introduces a MIM approach that generates images in a learned latent space. The method is capable of generating high-quality images and can be applied to a wide range of image generation tasks.
"MaskGAN: Better Text Generation via Filling in the ______" by Fedus et al. (2018) - This paper applies MIM to the task of text generation. The method uses a GAN-based architecture to generate text that fills in missing words in a given context.
"Conditional Variational Autoencoder with Soft-Attention for Multi-Modal Image Inpainting" by Li et al. (2021) - This paper introduces a MIM approach that uses a conditional variational autoencoder with soft attention for multi-modal image inpainting. The method achieves state-ofthe-art results on several benchmarks. These are just a few examples of the many works related to MIM. The field is rapidly evolving, and new approaches are being developed and refined all the time.
Another closely related work is Generative Pretraining from Pixels (iGPT), which applies the GPT approach to images by training a transformer to predict pixel sequences. GPT itself is a family of language models developed by OpenAI that are trained using unsupervised learning on large amounts of text data. The models are based on deep neural networks and can generate text that is coherent and often indistinguishable from text written by humans.
The first model in the GPT family, GPT-1, was introduced in 2018 and had 117 million parameters. It was trained on a large corpus of books and was capable of generating coherent text in a variety of styles and genres.
Subsequent versions of the model, including GPT-2 and GPT-3, have greatly increased the number of parameters, with GPT-3 having 175 billion parameters, making it one of the largest language models ever developed. The larger models have been shown to be capable of generating even more impressive text, with the ability to mimic different writing styles, summarize text, and even translate between languages.
The key innovation of GPT is the use of unsupervised learning to train the model. Unlike traditional supervised learning, which requires labeled data to train the model, GPT is trained on raw text data using a self-supervised learning approach. The model is trained to predict the next word in a sequence of text, given the previous words in the sequence. By doing this, the model learns to generate text that is coherent and follows the rules of language.
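The next-word objective can be sketched in a few lines of PyTorch. This minimal example uses a tiny embedding plus a linear layer as a placeholder model (not GPT's actual architecture); what matters is the shift-by-one setup, in which every position is trained to predict the token that follows it.

```python
import torch
import torch.nn as nn

vocab, dim, seq = 100, 32, 10
embed = nn.Embedding(vocab, dim)       # placeholder model, not GPT itself
to_logits = nn.Linear(dim, vocab)

tokens = torch.randint(0, vocab, (1, seq))
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift by one position

# Each position predicts the token that follows it in the sequence.
logits = to_logits(embed(inputs))                 # (1, seq-1, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()
print(f"next-token cross-entropy: {loss.item():.4f}")
```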
GPT has many practical applications, including in chatbots, automated content generation, and text completion. However, the technology also raises ethical concerns, including the potential for misuse in generating fake news, propaganda, and disinformation. As a result, the development of large-scale language models like GPT has prompted discussions about responsible AI and the need for transparency and accountability in AI research.
Applications of Masked Image Modelling
Masked image modeling is a technique used in computer vision and machine learning to train neural networks to fill in missing parts of images. Here are some applications of masked image modeling:

Image inpainting: Masked image modeling can be used to fill in missing parts of images caused by scratches, stains, or other types of damage. For example, in the field of art restoration, it can be used to repair old or damaged paintings (a concrete inpainting example follows this list).
Object removal: Masked image modeling can be used to remove objects from images. This is useful in applications such as photo editing or video processing, where it is necessary to remove unwanted objects from an image or video.
Image generation: Masked image modeling can be used to generate new images by filling in missing parts of existing images. This can be useful in applications such as creating realistic images of people or objects that do not exist in real life.
Image completion: Masked image modeling can be used to complete partially visible images. For example, in medical imaging, it can be used to complete scans where only part of the body is visible.
Image segmentation: Masked image modeling can be used to segment images into different regions based on their characteristics. For example, it can be used to identify and separate different types of tissue in medical images.
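To make the inpainting application tangible, the sketch below uses OpenCV's classical (non-learned) cv2.inpaint on a synthetic image. This is only a baseline to show the task setup; a masked-image-modelling system would replace the cv2.inpaint call with a learned reconstruction. It assumes the opencv-python package is installed.

```python
import numpy as np
import cv2

# Synthetic image: a horizontal gradient, converted to 3-channel BGR.
image = np.tile(np.arange(128, dtype=np.uint8), (128, 1))
image = cv2.cvtColor(image, cv2.COLOR_GRAY2BGR)

# The mask marks the "damaged" pixels to be filled (nonzero = fill here).
mask = np.zeros((128, 128), dtype=np.uint8)
mask[40:80, 40:80] = 255
image[mask > 0] = 0                      # simulate the damage

# Classical inpainting baseline; a MIM model would replace this call.
restored = cv2.inpaint(image, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
print("mean value restored in the hole:", restored[mask > 0].mean())
```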
Overall, masked image modeling is a versatile technique that has many applications in various fields, including art restoration, photo editing, video processing, medical imaging, and more.