How should we evaluate Meta's new paper "Transformers without Normalization"?

This paper features two heavyweight authors, Kaiming He and Yann LeCun, which naturally draws attention. The key finding: a Transformer can match or even exceed the performance of its normalized counterpart using a simple Dynamic Tanh (DyT) operation, without any normalization layer.

1. Introduction to normalization layers

1.1 Internal Covariate Shift

When a deep neural network is trained, the distribution of the inputs to each layer keeps changing, a phenomenon known as "Internal Covariate Shift". Loosely speaking, it resembles the familiar train/test covariate-shift problem, except that it occurs between layers inside the network during training.

Internal covariate shift can cause the following problems:

  • Vanishing or exploding gradients: because the input distribution changes at every layer, gradients can vanish or explode, making the network hard to train.
  • Learning-rate sensitivity: a very small learning rate is needed to keep training stable.
  • Slow convergence: since the input distribution keeps shifting at every layer, the network takes longer to converge.

1.2 Common Normalization Methods

To mitigate internal covariate shift, the inputs to each layer are normalized so that their distribution stays within a stable range. Common normalization methods include the following (a code sketch follows the comparison table below):

  • BatchNorm (2015): normalizes along the batch dimension and targets a single neuron (channel) of a hidden layer. For each mini-batch, the mean and variance of that neuron are computed across all samples in the mini-batch and used to normalize it. By standardizing batch-level statistics, it alleviated the gradient problems of deep-network training.
  • LayerNorm (2016): normalizes along the feature (layer) dimension and targets a single sample. For each sample, the mean and variance over all neurons in the layer are computed and used to normalize those neurons. It is better suited to sequence data.
  • RMSNorm (2019): can be seen as a simplified LayerNorm that normalizes only by the Root Mean Square (RMS), omitting the mean-subtraction step. It has been adopted by large models such as LLaMA.
Characteristic | BatchNorm | LayerNorm | RMSNorm
Normalization dimension | batch dimension | feature (layer) dimension | feature (layer) dimension
Typical scenarios | image classification and similar tasks | sequence-data tasks | large language models
Depends on batch size | highly dependent | independent | independent
Train/test behavior | inconsistent | consistent | consistent
Computational cost | higher | higher | lower
Centers the input (subtracts the mean) | yes | yes | no
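To make the differences above concrete, here is a minimal PyTorch sketch of the three kinds of statistics (the tensor shape and epsilon are chosen purely for illustration):

import torch

x = torch.randn(4, 16, 512)  # (batch, sequence, features); an arbitrary example tensor

# BatchNorm-style: statistics per feature, computed across the batch (and sequence) dimensions
bn_mean = x.mean(dim=(0, 1), keepdim=True)
bn_var = x.var(dim=(0, 1), unbiased=False, keepdim=True)
x_bn = (x - bn_mean) / torch.sqrt(bn_var + 1e-5)

# LayerNorm: statistics per token, computed across the feature dimension (subtract mean, divide by std)
ln_mean = x.mean(dim=-1, keepdim=True)
ln_var = x.var(dim=-1, unbiased=False, keepdim=True)
x_ln = (x - ln_mean) / torch.sqrt(ln_var + 1e-5)

# RMSNorm: per token, divide by the root mean square only; no mean subtraction
rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + 1e-5)
x_rms = x / rms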

2. The S-shaped curve of the normalization layer

The central observation here is that LayerNorm's input-output mapping exhibits a tanh-like S-shaped curve, so LayerNorm is not strictly a linear transformation. This nonlinearity is not designed in; it emerges naturally during training, because each token is standardized independently and different tokens have different statistics. The S-curve can alleviate vanishing/exploding gradients and improve generalization, but it may also discard some information and make training harder.

2.1 A common misconception: is LayerNorm a linear transformation?

From the perspective of a single neuron, LayerNorm is an affine (linear) transformation, because a single activation x_i can be viewed as passing through the following transform:

\mathrm{LayerNorm}(x_i) = \frac{\gamma}{\sqrt{\sigma^2+\epsilon}} \cdot x_i + \left(\beta - \frac{\gamma \mu}{\sqrt{\sigma^2+\epsilon}}\right)

where \frac{\gamma}{\sqrt{\sigma^2+\epsilon}} plays the role of a weight and \beta - \frac{\gamma \mu}{\sqrt{\sigma^2+\epsilon}} plays the role of a bias.

However, from the perspective of the whole layer, LayerNorm is not a linear transformation, because the mean \mu and variance \sigma^2 used for each neuron are computed from all neurons in that layer. In other words, the LayerNorm transforms of different neurons influence one another.

2.2 How the S-shaped curve arises

The S-curve arises mainly for the following two reasons:

  • Each token is standardized independently: every token in a Transformer goes through LayerNorm on its own, so the statistics (mean and variance) can differ from token to token.
  • Differences in token statistics create overall nonlinearity: because different tokens have different statistics, their LayerNorm transforms differ. When the input-output mappings of all tokens are plotted together, the aggregate traces out an S-shaped curve.

A simple example illustrates this. Suppose there are two tokens with inputs x_1 and x_2, whose means and variances are \mu_1, \sigma_1^2 and \mu_2, \sigma_2^2 respectively. Their LayerNorm outputs are:

\mathrm{LayerNorm}(x_1) = \gamma \cdot \frac{x_1-\mu_1}{\sqrt{\sigma_1^2+\epsilon}} + \beta

\mathrm{LayerNorm}(x_2) = \gamma \cdot \frac{x_2-\mu_2}{\sqrt{\sigma_2^2+\epsilon}} + \beta

Since \mu_1 \neq \mu_2 and \sigma_1^2 \neq \sigma_2^2, the transforms applied to x_1 and x_2 are different. Plotting x_1 versus \mathrm{LayerNorm}(x_1) and x_2 versus \mathrm{LayerNorm}(x_2) on the same graph, the aggregate shows an S-shaped curve.
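A self-contained sketch of this aggregation effect, using synthetic activations rather than a trained model (the paper plots real Transformer activations):

import torch
import matplotlib.pyplot as plt

torch.manual_seed(0)

# Synthetic "tokens" whose per-token mean and variance differ widely (illustration only)
x = torch.randn(256, 768) * torch.rand(256, 1) * 3 + torch.randn(256, 1)

ln = torch.nn.LayerNorm(768)
with torch.no_grad():
    y = ln(x)

# Each token alone is an affine map, but high-variance tokens get flatter slopes,
# so the combined scatter of all (input, output) pairs bends into a tanh-like S shape
plt.scatter(x.flatten().numpy(), y.flatten().numpy(), s=1, alpha=0.3)
plt.xlabel("input")
plt.ylabel("LayerNorm output")
plt.show()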

2.3 Polarization of feature distributions in deep networks

Polarization of the feature distribution in a deep network means that the outputs of some neurons become very large while the outputs of others become very small. This phenomenon causes gradients to vanish or explode and hampers training.

This is mainly due to the compounding effect of depth. In a deep network, the output of each layer is affected by all the layers before it; if earlier layers produce outliers, those outliers are amplified layer by layer, eventually making some neurons' outputs extremely large.

The S-shaped curve of LayerNorm alleviates this polarization, because it compresses extreme values into a small range and prevents them from being amplified layer by layer.

2.4 Approximating LayerNorm with \tanh(\alpha \mathbf{x})

The paper notes that \tanh(\alpha \mathbf{x}) can be used to approximate the behavior of LayerNorm, where \alpha is a learnable parameter that controls the steepness of the tanh function.

The tanh function works as an approximation because it is also an S-shaped curve, compressing its input into the range (-1, 1). By adjusting \alpha, the shape of \tanh(\alpha x) can be brought close to LayerNorm's S-curve.

\alpha can be understood as the gain of LayerNorm. When \alpha is large, the tanh curve is steep and the output is sensitive to small changes in the input; when \alpha is small, the curve is flat and the output is less sensitive to small changes in the input.
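A hedged sketch of how such an \alpha could be fitted by least squares; the activations here are synthetic, and a global scale is fitted alongside \alpha because LayerNorm outputs are not confined to (-1, 1):

import torch

torch.manual_seed(0)

# Synthetic pre-LayerNorm activations; in practice one would collect them from a trained model
x = torch.randn(256, 768) * torch.rand(256, 1) * 3 + torch.randn(256, 1)
with torch.no_grad():
    y = torch.nn.LayerNorm(768)(x)

# Fit scale * tanh(alpha * x) to the LayerNorm outputs
alpha = torch.tensor(0.5, requires_grad=True)
scale = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.Adam([alpha, scale], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    loss = ((scale * torch.tanh(alpha * x) - y) ** 2).mean()
    loss.backward()
    opt.step()
print(f"fitted alpha = {alpha.item():.3f}, scale = {scale.item():.3f}")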

2.5 Effects of the nonlinearity

The impact of LayerNorm's nonlinearity on Transformers is mixed, with both benefits and drawbacks.

Advantages

  • Mitigates vanishing/exploding gradients: the S-shaped curve compresses extreme values.
  • Improves generalization: the S-curve keeps the network from over-relying on the training data, improving generalization.

Disadvantages

  • Possible loss of information: the nonlinear compression of the S-curve may discard some information.
  • Harder optimization: the nonlinear transformation can make training more difficult.

3. The design philosophy of Dynamic Tanh (DyT)

3.1 Core Idea

DyT is the alternative to normalization proposed in this paper: it replaces the normalization layer with a tanh function that has a learnable scale, achieving high computational efficiency, adaptive feature scaling, and preserved representational capacity. DyT is attractive in scenarios with limited compute, a need for fast training, or high demands on feature expressiveness.

Calculation formula:

\mathrm{DyT}(\mathbf{x}) = \gamma \cdot \tanh(\alpha \mathbf{x}) + \beta

3.2 Implementation Details

import torch
import torch.nn as nn

class DyT(nn.Module):
    def __init__(self, dim, init_alpha=0.5):  # the paper recommends alpha_0 = 0.5 as the default
        super().__init__()
        # A single learnable steepness factor, shared across all channels
        self.alpha = nn.Parameter(torch.ones(1) * init_alpha)
        # Per-channel affine parameters, as in LayerNorm
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        x = torch.tanh(self.alpha * x)
        return self.gamma * x + self.beta
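A usage sketch showing where DyT would sit in a (hypothetical) pre-norm Transformer block, assuming the DyT class above; the point is simply that it is a drop-in replacement for nn.LayerNorm:

import torch
import torch.nn as nn

class Block(nn.Module):
    # An illustrative pre-norm Transformer block with DyT in place of nn.LayerNorm
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.norm1 = DyT(dim)  # was: nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = DyT(dim)  # was: nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

print(Block()(torch.randn(2, 16, 384)).shape)  # torch.Size([2, 16, 384])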

3.3 Key Innovations

  1. Element-wise operation: no statistics need to be computed, which improves efficiency (a timing sketch follows this list)
  2. Dynamic scaling factor α: adaptively adjusts the feature scale
  3. Preserved affine transform: maintains the ability to express features
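A rough way to probe the first point is to time the two modules directly; results depend heavily on hardware, tensor shape, and whether the model is compiled, so treat this only as a sketch:

import time
import torch
import torch.nn as nn

x = torch.randn(64, 196, 768)  # an arbitrary activation tensor
ln, dyt = nn.LayerNorm(768), DyT(768)  # DyT as defined above

def bench(mod, iters=100):
    mod(x)  # warm-up
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(iters):
            mod(x)
    return (time.perf_counter() - start) / iters

print(f"LayerNorm: {bench(ln) * 1e3:.3f} ms   DyT: {bench(dyt) * 1e3:.3f} ms")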

4. Experimental verification

4.1 Key Results

DyT is validated across a broad set of tasks (image classification, generation, speech, language modeling, etc.):

  • Performance matching: with DyT, ViT-B on ImageNet reaches top-1 accuracy on par with its LayerNorm counterpart
  • Efficiency gains: DyT reduces both inference time and training time for LLaMA 7B
  • Robust initialization: \alpha_0 = 0.5 works as a general default in non-LLM tasks, while LLMs need layer-wise tuning (e.g., a higher \alpha_0 in attention modules; see the sketch below)
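A sketch of what such layer-wise initialization could look like; the helper and the concrete values below are placeholders of my own, not the paper's tuned settings:

# Hypothetical helper: give DyT layers inside attention blocks a higher alpha_0
# than the remaining DyT layers, as suggested for LLM training (values are placeholders)
def make_dyt(dim, in_attention_block):
    return DyT(dim, init_alpha=0.8 if in_attention_block else 0.2)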

4.2 Why does it not work as well in CNNs?

Possible reasons for the accuracy drop after replacing BN with DyT in ResNet-50:

  • Architectural differences: a CNN's convolutional outputs have strong spatial correlation, and the statistics at different positions within the same channel differ, so BN's local, per-channel normalization is needed.
  • Normalization frequency: BN appears at nearly every layer of a CNN, while in a Transformer the LayerNorm layers are spaced apart by attention and MLP blocks; DyT's global scaling struggles to adapt to such frequent statistical changes.
  • Initialization coupling: CNNs often rely on BN's initialization properties (such as zero-initialized bias), so a direct replacement breaks the initialization balance.

One could try combining channel attention (e.g., an SE module) to learn an independent \alpha for each channel, at the expense of extra parameters; a minimal variant is sketched below.
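A minimal sketch of such a per-channel-\alpha variant, which is my own illustration rather than something proposed in the paper:

import torch
import torch.nn as nn

class ChannelwiseDyT(nn.Module):
    # DyT variant with one learnable alpha per channel instead of a single scalar (illustrative only)
    def __init__(self, dim, init_alpha=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim) * init_alpha)  # dim extra parameters vs. scalar DyT
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # Broadcasting applies a different steepness per channel (assumes channels-last layout)
        return self.gamma * torch.tanh(self.alpha * x) + self.beta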

5. Reflection and summary

5.1 Is the "dynamic" nature of DyT overrated?

The paper emphasizes that DyT achieves dynamic scaling through the learnable parameter α, but the experiments show that the learned α ends up highly correlated with the reciprocal of the input standard deviation (1/σ). This suggests that DyT may simply convert explicit statistical computation (LayerNorm's σ) into implicit parameter learning, rather than truly discarding the statistical logic of normalization.

The learning target of α resembles LayerNorm's 1/σ, but the key difference is how it is computed: LayerNorm computes σ dynamically for every token, whereas DyT learns one fixed α globally, trading fine-grained statistical adaptivity for computational efficiency.

LayerNorm's normalization is data-dependent, changing in real time with the input; DyT is parameter-dependent, with α adjusted to the statistics of the training set as a whole. This may make DyT perform worse under distribution shift (out-of-distribution, OOD) scenarios, which the paper does not evaluate.

5.2 Layer-wise tuning of α in LLMs may create cascading sensitivity issues

The LLaMA experiments show that the wider the model, the smaller α needs to be, and that attention modules need a higher α. This indicates that DyT is sensitive to network width, depth, and module type, and that parameter coupling may cause cascading errors.

In deep Transformers, activation magnitudes grow or decay exponentially with depth, which calls for layer-specific adaptation of α. However, DyT's α is learned independently per layer, without a cross-layer coordination mechanism.

Compared with LayerNorm, LN naturally adapts to the magnitude differences across layers by computing σ per token, whereas DyT's fixed α must be adjusted passively through training, forcing careful initialization in deep networks (such as LLaMA's layer-wise α).

This may mean that efficiency (a fixed α) and dynamic adaptivity (LayerNorm's on-the-fly computation) are hard to achieve at the same time. DyT's success in LLMs relies on substantial trial-and-error tuning (such as the grid search in Table 12), which raises questions about its practicality.

5.3 A biological parallel

The sigmoidal firing characteristics of biological neurons (Adrian, 1926) coincide with DyT's design:

  • A near-linear response to weak signals
  • Saturating suppression of strong signals
  • Dynamic adjustment of the sensitive range

5.4 Comparison with traditional methods

Method | Computational complexity | Interpretability | Hardware friendliness
LayerNorm | O(n) | high | so-so
RMSNorm | O(n) | medium | better
Fixup initialization | O(1) | low | excellent
DyT | O(1) | high | excellent

5.5 Limitations

  • Limited benefit when substituting for BatchNorm
  • Very large models require careful tuning of the α initialization
  • The theoretical explanation still needs to be deepened