This paper comes from two heavyweights, Kaiming He and Yann LeCun, which naturally draws attention. The key finding: Transformers can match or even exceed their usual performance with a simple Dynamic Tanh (DyT) operation, without using any normalization layer.
When a deep neural network is trained, the distribution of each layer's inputs keeps changing as the parameters of earlier layers update, a phenomenon known as "internal covariate shift". Colloquially, every layer has to keep adapting to a moving input distribution.
Internal covariate shift can cause the following problems:
To address internal covariate shift, we normalize the inputs to each layer so that their distribution stays within a stable range. Common normalization methods include:
Characteristic | BatchNorm | LayerNorm | RMSNorm |
---|---|---|---|
Normalization dimension | batch dimension | feature (layer) dimension | feature (layer) dimension |
Typical use cases | image classification and similar tasks | sequence data | large language models |
Dependence on batch size | strong | none | none |
Train/test behavior | different | identical | identical |
Computational cost | higher | higher | lower |
Centering (mean subtraction) | yes | yes | no |
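To make the table concrete, here is a minimal PyTorch sketch (shapes and layer choices are illustrative assumptions, not from the paper) showing which dimension each method normalizes over:

```python
import torch
import torch.nn as nn

batch, tokens, dim = 8, 16, 64
x = torch.randn(batch, tokens, dim)

# BatchNorm1d expects (N, C, L); it normalizes each channel over batch and length,
# so its statistics depend on what else is in the batch.
bn = nn.BatchNorm1d(dim)
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)

# LayerNorm normalizes each token over its feature dimension,
# independently of every other sample in the batch.
ln = nn.LayerNorm(dim)
y_ln = ln(x)

# RMSNorm rescales by the root mean square only: no mean subtraction (no centering).
rms = nn.RMSNorm(dim)  # available in recent PyTorch; otherwise easy to write by hand
y_rms = rms(x)
```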
The central finding of this article is that LayerNorm is not strictly a linear transformation: its input-output mapping exhibits a tanh-like S-curve. This nonlinearity is not designed in; it emerges naturally from training, because each token is standardized independently and different tokens have different statistics. The S-curve can alleviate gradient vanishing/explosion and improve generalization, but it may also discard some information and make training harder.
From the perspective of a single neuron, LayerNorm is a linear transformation, because a single neuron x_i can be seen as going through the following linear transformation:
\mathrm{LayerNorm}(x_i) = \frac{\gamma}{\sqrt{\sigma^2+\epsilon}} \cdot x_i + \left(\beta - \frac{\gamma \mu}{\sqrt{\sigma^2+\epsilon}}\right)
where \frac{\gamma}{\sqrt{\sigma^2+\epsilon}} acts as the weight and \beta - \frac{\gamma \mu}{\sqrt{\sigma^2+\epsilon}} acts as the bias.
However, from the perspective of the whole layer, LayerNorm is not a linear transformation, because the mean \mu and variance \sigma^2 applied to each neuron are computed from all neurons of that layer. In other words, the LayerNorm transformations of different neurons influence one another.
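A tiny PyTorch sketch (illustrative, not from the paper) makes this coupling visible: the output for neuron 0 changes even when only the other neurons change, because \mu and \sigma are shared within the token:

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(4)
a = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
b = torch.tensor([[1.0, 2.0, 3.0, 40.0]])  # neuron 0 has the same value in both inputs

# The output for neuron 0 differs, because mu and sigma are computed over the whole layer.
print(ln(a)[0, 0].item())
print(ln(b)[0, 0].item())
```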
The S-curve is mainly due to the following two reasons:
We can illustrate this with a simple example. Suppose there are two tokens with inputs x_1 and x_2, whose means and variances are \mu_1, \sigma_1^2 and \mu_2, \sigma_2^2 respectively. Their LayerNorm outputs are:

\mathrm{LayerNorm}(x_1) = \gamma \cdot \frac{x_1-\mu_1}{\sqrt{\sigma_1^2+\epsilon}} + \beta

\mathrm{LayerNorm}(x_2) = \gamma \cdot \frac{x_2-\mu_2}{\sqrt{\sigma_2^2+\epsilon}} + \beta

Since \mu_1 \neq \mu_2 and \sigma_1^2 \neq \sigma_2^2, the linear transformations applied to x_1 and x_2 are different. When we plot x_1 against \mathrm{LayerNorm}(x_1) and x_2 against \mathrm{LayerNorm}(x_2) on the same graph, the aggregate traces out an S-shaped curve.
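The effect can be reproduced with synthetic data. The sketch below (PyTorch and matplotlib assumed; the per-token scales are made up, not taken from a trained model) scatters element-wise (input, output) pairs of LayerNorm over many tokens; because each token is normalized by its own \sigma, the aggregate cloud bends into a tanh-like S shape:

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

torch.manual_seed(0)
tokens, dim = 512, 256
scales = torch.rand(tokens, 1) * 5 + 0.5      # per-token scale varies widely (synthetic)
x = torch.randn(tokens, dim) * scales

# Normalize each token by its own statistics; no affine so the raw mapping is visible.
y = nn.LayerNorm(dim, elementwise_affine=False)(x)

plt.scatter(x.flatten(), y.flatten(), s=1, alpha=0.3)
plt.xlabel("input")
plt.ylabel("LayerNorm output")
plt.show()
```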
Bipolarization of the feature distribution in a deep network means that the outputs of some neurons become very large while the outputs of others become very small. This phenomenon can cause gradients to vanish or explode, hurting training.
This is mainly due to the compounding effect of depth. In a deep network, each layer's output is affected by all layers before it; if earlier layers produce some extreme values, those extremes are amplified layer by layer, eventually making the outputs of some neurons very large.
The S-shaped curve of LayerNorm can alleviate this bipolarization, because it compresses extreme values into a small range and prevents them from being amplified layer by layer.
The paper notes that \tanh(\alpha \mathbf{x}) can be used to approximate the behavior of LayerNorm, where \alpha is a learnable parameter that controls the steepness of the tanh function.
The tanh function works as an approximation because it is also an S-shaped curve, compressing the input into the range (-1, 1). By adjusting \alpha, the shape of the tanh function can be brought closer to LayerNorm's S-curve.
\alpha can be understood as the gain of LayerNorm. The larger \alpha is, the steeper the tanh curve and the more sensitive the output is to small changes in the input; the smaller \alpha is, the flatter the curve and the less sensitive the output is to small input changes.
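A small illustrative sketch (not from the paper) showing that \alpha is literally the slope of \tanh(\alpha x) at the origin, i.e. the gain on small inputs, while large inputs saturate toward \pm 1:

```python
import torch

for alpha in (0.2, 1.0, 5.0):
    x = torch.tensor(0.0, requires_grad=True)
    torch.tanh(alpha * x).backward()
    # d/dx tanh(alpha * x) at x = 0 is exactly alpha; far from zero the output saturates.
    print(f"alpha={alpha}: slope at 0 = {x.grad.item():.2f}, "
          f"tanh(3*alpha) = {torch.tanh(torch.tensor(3.0 * alpha)).item():.3f}")
```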
The impact of LayerNorm's nonlinear properties on transformers is complex, with both benefits and disadvantages.
Benefits:
Disadvantages:
DyT is the innovative replacement for normalization proposed in this paper: it swaps the normalization layer for a tanh function with a learnable scale, achieving high computational efficiency and adaptive feature scaling while preserving feature expressiveness. DyT is attractive when compute is limited, fast training matters, or strong feature expressiveness is required.
Calculation formula:
\mathrm{DyT}(\mathbf{x}) = \gamma \cdot \tanh(\alpha \mathbf{x}) + \beta
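A sketch of a DyT layer that follows the formula above (PyTorch assumed; the scalar \alpha with default initialization 0.5 and the per-channel \gamma, \beta layout reflect the paper's description, but treat the exact details as assumptions):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learnable scalar gain
        self.gamma = nn.Parameter(torch.ones(dim))                # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))                # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # DyT(x) = gamma * tanh(alpha * x) + beta -- no batch or token statistics needed.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```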
Validation across 8 types of tasks (image classification, generation, speech, language modeling, etc.):
Possible reasons for the accuracy drop when DyT replaces BN in ResNet-50:
One could try combining channel attention (e.g., the SE module) to learn an independent \alpha for each channel, at the cost of extra parameters, as sketched below.
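A hypothetical per-channel variant (not from the paper): one \alpha per channel instead of a single scalar, trading extra parameters for finer-grained gain control; an SE-style gate that predicts \alpha from the input would be a further step in the same direction:

```python
import torch
import torch.nn as nn

class PerChannelDyT(nn.Module):
    """Hypothetical: one learnable alpha per channel rather than a single scalar."""

    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((dim,), alpha_init))  # per-channel gains
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```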
The paper emphasizes that DyT achieves dynamic scaling through the learnable parameter α, but experiments show that the learned α ends up highly correlated with the reciprocal of the input standard deviation (1/σ). This suggests DyT may simply convert an explicit statistical computation (LayerNorm's σ) into implicit parameter learning, rather than truly escaping the statistical logic of normalization.
What α learns is similar to LayerNorm's 1/σ, but the key difference is in how it is computed: LayerNorm dynamically computes σ for each token, while DyT learns a single fixed α globally, trading fine-grained statistical adaptivity for computational efficiency.
LayerNorm's normalization is "data-dependent", changing in real time with the input; DyT is "parameter-dependent", with α tuned to the statistics of the training set as a whole. This may make DyT perform worse under distribution shift (OOD) scenarios (not validated in the paper).
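A diagnostic sketch of this difference (assumptions: you have a trained DyT layer's scalar α and a batch of its input activations x with shape (tokens, dim)); it compares the single learned α with the per-token 1/σ that LayerNorm would have computed:

```python
import torch

def compare_alpha_to_inv_std(alpha: torch.Tensor, x: torch.Tensor) -> None:
    """alpha: the DyT layer's learned scalar; x: (tokens, dim) activations feeding it."""
    inv_std = 1.0 / x.std(dim=-1)                    # what LayerNorm would divide by, per token
    print("learned alpha:        ", alpha.item())
    print("mean 1/sigma (tokens):", inv_std.mean().item())
    print("std of 1/sigma:       ", inv_std.std().item())  # the per-token spread DyT cannot track
```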
The LLaMA experiments show that the wider the model, the smaller α needs to be, and attention modules need a higher α. This indicates that DyT is sensitive to model width and module type, and parameter coupling may cause cascading errors.
In deep Transformers, activation magnitudes can grow or decay exponentially with depth, which calls for each layer's α to adapt accordingly. However, DyT's α values are learned independently per layer, with no cross-layer coordination mechanism.
By contrast, LayerNorm naturally adapts to the magnitude changes across layers by computing σ per token, while DyT's fixed α can only adjust passively through training, which requires careful initialization in deep networks (such as the layer-wise α initialization for LLaMA).
This may mean that efficiency (a fixed α) and dynamic adaptivity (LayerNorm's real-time statistics) are hard to achieve at the same time. DyT's success in LLMs relies on extensive trial and error over hyperparameters (such as the grid search in Table 12), which casts doubt on its practicality.
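A hedged sketch of what the width- and module-dependent α initialization mentioned above could look like; the threshold and multiplier are placeholders, not the paper's actual schedule:

```python
def init_alpha(model_width: int, is_attention_block: bool) -> float:
    """Placeholder schedule: wider models get a smaller alpha_0; attention blocks get a larger one."""
    base = 0.5 if model_width <= 2048 else 0.2   # placeholder threshold and values
    return 2.0 * base if is_attention_block else base
```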
The sigmoid-like activation characteristics of biological neurons (Adrian, 1926) happen to coincide with the design of DyT:
Method | Computational complexity | Interpretability | Hardware friendliness |
---|---|---|---|
LayerNorm | O(n) | High | Moderate |
RMSNorm | O(n) | Medium | Good |
Fixup initialization | O(1) | Low | Excellent |
DyT | O(1) | High | Excellent |