Paper Review: Grokking-generalization and over-fitting

Summary

The "Paper Review: Grokking-generalization and over-fitting" article discusses the importance of understanding overfitting in deep learning models and the phenomenon of "grokking," where models improve generalization long after overfitting, as observed in research from OpenAI.

Abstract

The article reviews a research paper from OpenAI that addresses the critical concepts of generalization and overfitting in deep learning. It highlights that despite the trend of creating larger neural networks, there is a growing need to reduce the number of parameters to prevent overfitting. The paper introduces "grokking," a process where neural networks can significantly improve their generalization performance from random chance to near-perfect accuracy, even after severely overfitting. Key findings include the discovery that smaller datasets necessitate more optimization for generalization, and that weight decay is particularly effective in enhancing generalization on specific tasks. The article states that training on small, algorithmically generated datasets can lead to better generalization than naturally available data, and cites improved face detection with GAN-generated faces, reported by other researchers, as a related example. The article also notes that certain hyperparameters and architectural choices can impact the model's ability to generalize, with transformers showing promise in learning symmetric functions.

Opinions

The article suggests that the trend of increasing network size may not always be beneficial due to the risk of overfitting.
It emphasizes the importance of data efficiency, memorization, and the speed of learning in deep learning models.
The authors advocate for the effectiveness of weight decay in improving generalization.
They also imply that the choice of dataset and the symmetry of problems can significantly affect the success of generalization.
The article posits that algorithmically generated datasets can sometimes outperform naturally available data in terms of generalization capabilities.
It is suggested that certain deep learning architectures, like transformers, may inherently be better suited for learning specific types of functions.

Paper Review: Grokking-generalization and over-fitting

After the release of VGG16 in 2014, networks kept getting larger and larger for the next few years. But in the last two years, we have seen a trend toward shrinking networks, whether through depthwise separable convolutions or other techniques. So let’s ask ourselves: why do we need to reduce the number of parameters in our current models? Given the advancements in nanofabrication technology, hardware can certainly keep pace with the advancements in the field of deep learning. The one-word answer to our question is OVERFITTING. The larger we make a model, the more likely it is to overfit rather than learn the general rule behind the problem. Always remember, the goal of any deep learning architecture is not to memorize a massive amount of data but to learn the underlying pattern. Once it has learned the pattern, it can easily handle many cases that it has not seen previously. So, without further ado, let’s jump into grokking and understand the behavior of small networks on specially crafted datasets.

The Grokking paper is the result of research at OpenAI, and it deals with two very important topics: generalization and overfitting. It proposes studying small, algorithmically generated datasets, a setting in which data efficiency, memorization, generalization, and speed of learning can be studied in great detail.

Long after severely overfitting, validation accuracy sometimes suddenly begins to increase from chance level toward perfect generalization. We call this phenomenon ‘grokking’.

Three key findings of the paper:

In some situations neural networks learn through a process of “grokking” a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting.

Smaller datasets require increasing amounts of optimization (number of iterations) for generalization.

Weight decay is particularly effective at improving generalization on the tasks we study.
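
One way to make the gap between fitting and generalizing concrete is to compare the first step at which each accuracy curve crosses a threshold. The sketch below is only an illustration: the helper name first_step_reaching, the 0.99 threshold, and the logged values are all assumptions made up for this example, not numbers taken from the paper.

```python
def first_step_reaching(curve, threshold=0.99):
    """Return the first logged step whose accuracy is >= threshold, or None."""
    for step, accuracy in curve:
        if accuracy >= threshold:
            return step
    return None


# Made-up (step, accuracy) logs, for illustration only; not results from the paper.
train_curve = [(10, 0.20), (100, 0.80), (1_000, 1.00), (100_000, 1.00), (1_000_000, 1.00)]
val_curve = [(10, 0.01), (100, 0.01), (1_000, 0.02), (100_000, 0.10), (1_000_000, 0.99)]

fit_step = first_step_reaching(train_curve)        # step at which training data is fitted
generalize_step = first_step_reaching(val_curve)   # step at which validation catches up
delay_factor = generalize_step / fit_step          # how much longer generalization took
```

A large delay_factor on real training logs would be the signature of grokking: the model fits the training set early and only generalizes much later.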

Detailed overview

Let’s dig a little deeper into this wonderful paper. The researchers train networks on small, algorithmically generated datasets, a setting in which generalization can be studied in great detail. The appeal of generated data is not limited to the small datasets in this paper: other researchers have reported better face detection when training on faces generated by a GAN.

In the figure above, a network was trained on the binary operation of division modulo 97. Training accuracy reached almost 100% before 10³ optimization steps, but validation accuracy barely moved until about 10⁵ steps, and the network reached true generalization only after close to 10⁶ steps.

The datasets considered in the paper are binary operation tables of the form a ◦ b = c, where a, b, c are discrete symbols with no internal structure and ◦ is a binary operation. Examples of binary operations include addition, composition of permutations, and bivariate polynomials. The right-hand image in the figure above shows an example of such a dataset. Distinct abstract symbols are used for all distinct elements a, b, c involved in the equations, so the network is not aware of any internal structure of the elements and has to learn their properties only from their interactions with other elements. The network doesn’t see numbers in decimal notation or permutations in line notation.
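
A binary operation table like the one described above can be sketched in a few lines of Python. This is a minimal sketch, not the paper's code: the helper names (mod_div, build_table, split), the random seed, and the 50% training fraction are assumptions chosen for illustration. The integers stand in for opaque symbol ids; a model would only ever see them as tokens, never as numbers.

```python
import random

P = 97  # prime modulus from the division example above


def mod_div(a, b, p=P):
    """Return a / b modulo p, i.e. the c that satisfies (c * b) % p == a % p."""
    # pow(b, -1, p) computes the modular inverse of b (Python 3.8+).
    return (a * pow(b, -1, p)) % p


def build_table(p=P):
    """All equations a / b = c with 0 <= a < p and 1 <= b < p (b = 0 has no inverse)."""
    return [(a, b, mod_div(a, b, p)) for a in range(p) for b in range(1, p)]


def split(table, train_fraction, seed=0):
    """Shuffle the equations and split them into training and validation sets."""
    rows = list(table)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


table = build_table()
train, val = split(table, train_fraction=0.5)
```

Every equation lands in exactly one of the two sets, so validation accuracy measures whether the network can fill in table cells it has never seen.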

If we look at the chart above, we can see that adding weight noise and weight decay helps the network reach generalization. Weight decay improves generalization the most, but some generalization happens even with full-batch optimizers and with models without weight or activation noise, provided the percentage of training data is high. A suboptimal choice of hyperparameters, on the other hand, severely limits generalization.
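
For readers who have not met weight decay before, here is a minimal sketch of what it does to the weights. It shows a plain gradient step with decoupled weight decay (the form used by optimizers such as AdamW). The function name, learning rate, and decay value are assumptions for illustration, and the sketch does not claim to be the optimizer setup used in the paper.

```python
def sgd_step_with_weight_decay(weights, grads, lr=0.1, weight_decay=0.01):
    """One gradient step plus decoupled weight decay.

    Each weight moves against its gradient and is also pulled toward zero
    by an amount proportional to its own value.
    """
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]


# With a zero gradient, the decay term alone shrinks every weight toward zero.
initial = [1.0, -2.0, 0.5]
w = list(initial)
for _ in range(100):
    w = sgd_step_with_weight_decay(w, grads=[0.0, 0.0, 0.0])
```

With these settings each step multiplies every weight by roughly 0.999, so the weights keep shrinking unless the gradient pushes back.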

Among the smaller things they noted: operations that are symmetric with respect to the order of the operands (x + y, x ∗ y, x² + y² and x² + x∗y + y²) are easier to generalize than their non-symmetric counterparts (x − y, x/y, x² + x∗y + y² + x). This might be partially architecture-dependent, since it’s easy for a transformer to learn a symmetric function of the operands by ignoring the positional embedding.
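
The symmetry property above is easy to check by brute force over all pairs of residues modulo 97. The sketch below is only an illustration; the helper name is_symmetric and the dictionary of operations are assumptions, and division is left out because it is undefined when the second operand is 0.

```python
P = 97


def is_symmetric(op, p=P):
    """True if op(x, y) and op(y, x) agree modulo p for every pair of residues."""
    return all(op(x, y) % p == op(y, x) % p for x in range(p) for y in range(p))


operations = {
    "x + y": lambda x, y: x + y,
    "x * y": lambda x, y: x * y,
    "x^2 + y^2": lambda x, y: x * x + y * y,
    "x^2 + x*y + y^2": lambda x, y: x * x + x * y + y * y,
    "x - y": lambda x, y: x - y,
    "x^2 + x*y + y^2 + x": lambda x, y: x * x + x * y + y * y + x,
}

# Maps each operation name to whether swapping the operands leaves the result unchanged.
symmetry = {name: is_symmetric(op) for name, op in operations.items()}
```

The first four operations come out symmetric and the last two do not, matching the grouping in the text.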

Some operations (for example x³ + xy² + y (mod 97)) didn’t lead to generalization within the allowed optimization budget at any percentage of data up to 95%. The converged models effectively just memorized the training dataset without finding any real patterns in the data. To such a model, the data is effectively random.

The paper covers a few other things, like topology; for those, see the original paper.

The goal of this blog post is to present the original paper in a simple and easy-to-read format and also give you the key takeaways in bullet points.
