Involution: Inverting the Inherence of Convolution for Visual Recognition

Inverting the inherent principles of Convolution.

Convolutional Neural Networks have dominated the computer vision domain for almost a decade now. They have been the go-to networks for any computer vision task, ranging from image classification to instance segmentation.

The basic building block of CNNs is the convolution operation. It works similarly to traditional computer vision filters for detecting edges, shapes, and so on. The basic idea is the same, except that here the filter is learned from data.

Have a look at this post for an intuitive understanding of Convolution.

Intuitively Understanding Convolutions for Deep Learning: Exploring the strong visual hierarchies that makes them work (towardsdatascience.com)

The two basic principles of the convolution operation are that it is spatial-agnostic and channel-specific.

Spatial Agnostic

The same filter is used for convolution across spatial dimensions.

Intuition: filters should learn the features of objects irrespective of their location in the image. For example, a filter should identify a cat whether it appears in the top-left or the bottom-right of the image.
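To make the spatial-agnostic idea concrete, here is a minimal NumPy sketch. The helper `conv2d_single`, the 8x8 image size and the toy 3x3 pattern are all assumptions made up for illustration. It slides one shared kernel over two images that contain the same pattern at different locations.

```python
import numpy as np

def conv2d_single(image, kernel):
    """Valid cross-correlation of a 2D image with one shared 2D kernel."""
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are reused at every position (i, j).
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# A hypothetical 3x3 pattern placed at two different locations.
pattern = np.array([[1.0, 2.0, 1.0],
                    [0.0, 3.0, 0.0],
                    [1.0, 2.0, 1.0]])
top_left = np.zeros((8, 8))
top_left[0:3, 0:3] = pattern
bottom_right = np.zeros((8, 8))
bottom_right[5:8, 5:8] = pattern

kernel = pattern.copy()  # a filter that matches the pattern
resp_a = conv2d_single(top_left, kernel)
resp_b = conv2d_single(bottom_right, kernel)

# Peak response and where it occurs, for each image.
print(resp_a.max(), np.unravel_index(resp_a.argmax(), resp_a.shape))
print(resp_b.max(), np.unravel_index(resp_b.argmax(), resp_b.shape))
```

The peak value is the same in both cases and only its position moves, which is what sharing one filter across all spatial positions gives you.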

Channel Specific

Different filters are used to learn different properties, and the output of each filter is what we call a channel in a CNN.
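As a rough illustration of what channel-specific means for the weights, the sketch below counts the parameters of a standard convolution layer. The function names and layer sizes are hypothetical, and the (C_out, C_in, K, K) layout is the PyTorch-style convention, assumed here only for illustration.

```python
def conv_weight_shape(c_in, c_out, k):
    """Shape of a standard convolution weight tensor (PyTorch-style layout)."""
    # One K x K x C_in filter for every output channel.
    return (c_out, c_in, k, k)

def conv_param_count(c_in, c_out, k):
    """Number of weights in the layer, ignoring biases."""
    return c_out * c_in * k * k

# Hypothetical layer: 64 input channels, 128 output channels, 3x3 filters.
print(conv_weight_shape(64, 128, 3))
print(conv_param_count(64, 128, 3))
```

Because every output channel has its own filter over all input channels, the weight count grows with the product of the input and output channel counts.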

There are a few limitations to these principles:

Filters can’t adapt to diverse visual patterns at different spatial positions.
Long-range spatial interactions can’t be captured in a single shot.
There is a lot of redundancy among channels (filters).

The solutions for the above limitations are:

Make spatial-specific filters to learn diverse visual patterns across spatial positions.
Use some sort of attention mechanism to capture long-range spatial interactions, for example DETR.
Share filters across channels (channel-agnostic) to reduce the redundancy.

Involution

To address these limitations of convolution, an operation named involution was introduced in this research paper (https://arxiv.org/abs/2103.06255) from Hong Kong University of Science & Technology.

Involution inverts the design principles of convolution. Its characteristics are the opposite of convolution's, namely spatial-specific and channel-agnostic.

Channel-agnostic

Being channel-agnostic is pretty simple: use the same filter across all the input channels, or at least use a few filters and broadcast them to match the input channels. This should reduce the redundancy found in convolution.
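A minimal NumPy sketch of this broadcasting at a single spatial position is shown below. The sizes and the names `patch` and `shared_kernel` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K = 4, 3                                   # hypothetical channel count and kernel size
patch = rng.standard_normal((C, K, K))        # K x K neighborhood for all C channels
shared_kernel = rng.standard_normal((K, K))   # a single channel-agnostic filter

# Broadcasting (K, K) against (C, K, K) reuses the same weights for every channel.
# The sum runs over the K x K window only, so each channel keeps its own output.
out = (patch * shared_kernel).sum(axis=(1, 2))
print(out.shape)
```

Only K x K weights are involved here, regardless of how many channels the input has.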

Spatial Specific

In convolution, we use the same 3x3 filter and slide it across the input spatially. This 3x3 filter is agnostic to spatial position. How can we make spatial-specific filters?

One way is to create a filter with the same spatial dimensions as the input, like the one below. There is no more sliding; just broadcast it across the C channels and multiply.

Example of spatial-specific filter (Image by Author)

The above approach might not work well, and it also ties the filter to a single image resolution, so we can’t use images of different resolutions.
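The resolution problem can be seen in a small sketch. The helper `apply_full_size_filter` and the sizes are hypothetical and are not from the paper. A filter with the same height and width as the input works only for that one resolution.

```python
import numpy as np

def apply_full_size_filter(x, full_filter):
    """x: (C, H, W) input; full_filter: (H, W), broadcast across the C channels."""
    if x.shape[1:] != full_filter.shape:
        raise ValueError("the filter is tied to one resolution: "
                         f"expected {full_filter.shape}, got {x.shape[1:]}")
    # No sliding: one weight per spatial position, shared by all channels.
    return x * full_filter

full_filter = np.ones((4, 4))                       # learned for 4x4 inputs
y = apply_full_size_filter(np.ones((3, 4, 4)), full_filter)
print(y.shape)

try:
    apply_full_size_filter(np.ones((3, 8, 8)), full_filter)
except ValueError as err:
    print("fails on a different resolution:", err)
```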

So, is there any other way?

Can we come up with a method that keeps a small filter, say 3x3, but uses a different one at each sliding position on the input?

Spatial-specific filters example (Image by Author)

The solution is to generate the filter dynamically at every spatial position, based on the input features there.

Kernel Generation

To generate the kernel (filter) dynamically, a convolution layer is used.

Take the 1x1xC feature block at a position and convert it into 1x1xK² features using a convolution layer.
The Phi function is just a convolution layer (in the paper, a small bottleneck of 1x1 convolutions).

The 1x1xK² features are then rearranged (reshaped) into a KxKx1 filter.

The resulting KxKx1 filter is multiplied element-wise with the KxK neighborhood around that position and the products are summed, just as in convolution.

The same KxK filter is broadcast across the C channels of the input (channel-agnostic).
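The steps above can be sketched in NumPy as follows. This is a simplified sketch rather than the paper's implementation. It assumes a single 1x1 convolution for Phi (written as the matrix `w_phi`, a made-up name), one channel group, stride 1, an odd K and zero padding.

```python
import numpy as np

def involution_single_group(x, w_phi, K):
    """x: (C, H, W) input; w_phi: (K*K, C), a 1x1 convolution written as a matrix."""
    C, H, W = x.shape
    pad = K // 2
    x_pad = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((C, H, W))
    for i in range(H):
        for j in range(W):
            # Kernel generation: 1x1xC features -> 1x1xK^2 features -> K x K filter.
            kernel = (w_phi @ x[:, i, j]).reshape(K, K)
            # K x K neighborhood around (i, j), for all C channels.
            patch = x_pad[:, i:i + K, j:j + K]
            # The same kernel is broadcast over the C channels; multiply and sum.
            out[:, i, j] = (patch * kernel).sum(axis=(1, 2))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 5))        # hypothetical C=8, H=W=5
w_phi = rng.standard_normal((3 * 3, 8))   # hypothetical weights for K=3
y = involution_single_group(x, w_phi, K=3)
print(y.shape)
```

Note that the kernel is recomputed inside the loop from `x[:, i, j]`, so every position gets its own filter, while the learned weights `w_phi` are still shared across positions.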

The GIF below demonstrates the dynamic kernel generation at every spatial location.

Spatial specific kernel generation method (Gif by Author)

The final output after performing the Involution operation looks like below.

To keep the demonstration simple, I have used a single KxKx1 filter, but in the actual operation there are KxKxG filters, where G is the number of channel groups. Instead of using a single filter and broadcasting it across all C channels of the input, we create G filters and broadcast each of them across C/G channels.
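A small NumPy sketch of the group broadcasting at one spatial position follows. The sizes and variable names are assumptions, and the channels are assumed to be grouped contiguously (channels 0 to C/G - 1 form the first group, and so on).

```python
import numpy as np

rng = np.random.default_rng(0)
C, G, K = 8, 2, 3                                   # hypothetical sizes; C divisible by G
patch = rng.standard_normal((C, K, K))              # K x K neighborhood for all channels
group_kernels = rng.standard_normal((G, K, K))      # one generated kernel per group

# Repeat each group's kernel for the C // G channels that belong to that group.
per_channel = np.repeat(group_kernels, C // G, axis=0)   # (C, K, K)
out = (patch * per_channel).sum(axis=(1, 2))             # (C,)
print(per_channel.shape, out.shape)
```

With G = 1 this reduces to the single shared filter, and with G = C every channel would get its own kernel.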

Pseudocode of Involution

Below is the pseudocode of the involution operation from the paper. I have added comments to make it clearer.

This is the fundamental operation that replaces the actual convolution operation.
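For readers who prefer runnable code, here is a vectorized NumPy sketch of the same idea. It is an approximation under stated assumptions, not the authors' code: a single image with no batch dimension, stride 1, an odd K, zero padding, and a reduce-then-span pair of 1x1 convolutions with a ReLU in between (any normalization layer is omitted). The names `w_reduce`, `w_span` and the reduction ratio `r` are chosen here for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def involution(x, w_reduce, w_span, K, G):
    """Stride-1 involution for one image.

    x: (C, H, W); w_reduce: (C // r, C); w_span: (G * K * K, C // r).
    """
    C, H, W = x.shape
    pad = K // 2
    x_pad = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    # "Unfold": every K x K neighborhood, shape (C, H, W, K, K).
    patches = sliding_window_view(x_pad, (K, K), axis=(1, 2))
    patches = patches.reshape(G, C // G, H, W, K, K)
    # Kernel generation: reduce -> ReLU -> span (both act as 1x1 convolutions).
    hidden = np.maximum(np.einsum("rc,chw->rhw", w_reduce, x), 0.0)
    kernel = np.einsum("kr,rhw->khw", w_span, hidden)            # (G*K*K, H, W)
    kernel = kernel.reshape(G, K, K, H, W).transpose(0, 3, 4, 1, 2)  # (G, H, W, K, K)
    # Multiply-add: each group's kernel is broadcast over its C // G channels.
    out = (patches * kernel[:, None]).sum(axis=(-2, -1))         # (G, C // G, H, W)
    return out.reshape(C, H, W)

rng = np.random.default_rng(0)
C, H, W, K, G, r = 8, 6, 6, 3, 2, 4     # hypothetical sizes
x = rng.standard_normal((C, H, W))
w_reduce = rng.standard_normal((C // r, C))
w_span = rng.standard_normal((G * K * K, C // r))
y = involution(x, w_reduce, w_span, K, G)
print(y.shape)
```

The output has the same shape as the input, so in this sketch the operation can stand in where a same-padded, stride-1 convolution with equal input and output channels would be used.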

Experiments

RedNet is a mirror of ResNet in which the 3x3 convolutions are replaced by involutions, while the 1x1 convolutions used for channel projection are kept. RedNet was trained on the ImageNet dataset with 224x224 images. Below are the accuracy and speed benchmarks.

Accuracy and parameters comparison (Image from paper)

The authors have also run experiments on object detection and segmentation tasks. The RedNet backbone performed better than ResNet in both cases.

Below are the results of the ablation studies on the kernel size (K) and the number of channel groups (G) in the involution operation.

Conclusion

We have seen that the involution operation is efficient and effective for visual representation learning. This basic operation could become a building block for upcoming architectures. More details on the experiments and ablation studies can be found in the paper. I am sharing the links to both the paper and the code in the references.

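References

https://arxiv.org/abs/2103.06255
https://github.com/d-li14/involution
https://github.com/facebookresearch/detr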

Originally published at https://mlfornerd.com on May 17, 2021.