Gated recurrent unit

Memory unit used in neural networks

Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al.[1] The GRU is similar to a long short-term memory (LSTM) unit, with a gating mechanism to input or forget certain features,[2] but it lacks a context vector and an output gate, resulting in fewer parameters than the LSTM.[3] The GRU's performance on certain tasks of polyphonic music modeling, speech signal modeling and natural language processing was found to be similar to that of the LSTM.[4][5] GRUs showed that gating is indeed helpful in general, although Bengio's team came to no concrete conclusion on which of the two gating units was better.[6][7]

Architecture

There are several variations on the full gated unit, with gating done using the previous hidden state and the bias in various combinations, as well as a simplified form called the minimal gated unit.[8]

The operator ⊙ denotes the Hadamard (element-wise) product in the following.

Fully gated unit

[Diagram: gated recurrent unit, fully gated version]

Initially, for t = 0, the output vector is h_0 = 0.

\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\hat{h}_t &= \phi(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t
\end{aligned}

Variables (d denotes the number of input features and e the number of output features):

  • x_t ∈ ℝ^d: input vector
  • h_t ∈ ℝ^e: output vector
  • ĥ_t ∈ ℝ^e: candidate activation vector
  • z_t ∈ (0, 1)^e: update gate vector
  • r_t ∈ (0, 1)^e: reset gate vector
  • W ∈ ℝ^{e×d}, U ∈ ℝ^{e×e} and b ∈ ℝ^e: parameter matrices and vector which need to be learned during training
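The forward step above can be sketched directly in code. The following is a minimal NumPy illustration of one step of the fully gated unit, assuming tanh for φ and the logistic function for σ; the function name gru_step and the parameter dictionary p are illustrative, not taken from any particular library.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_step(x_t, h_prev, p):
        """One forward step of a fully gated GRU cell (illustrative sketch)."""
        z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])            # update gate
        r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])            # reset gate
        h_hat = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r_t * h_prev) + p["bh"])  # candidate activation
        return (1 - z_t) * h_prev + z_t * h_hat                              # new hidden state h_t

    # Example usage with d = 3 input features and e = 4 output features.
    d, e = 3, 4
    rng = np.random.default_rng(0)
    p = {name: rng.standard_normal((e, d)) for name in ("Wz", "Wr", "Wh")}
    p.update({name: rng.standard_normal((e, e)) for name in ("Uz", "Ur", "Uh")})
    p.update({name: np.zeros(e) for name in ("bz", "br", "bh")})

    h = np.zeros(e)                          # h_0 = 0
    for x in rng.standard_normal((5, d)):    # a length-5 input sequence
        h = gru_step(x, h, p)

In this sketch the element-wise NumPy product `*` plays the role of the Hadamard product ⊙.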

Activation functions

Alternative activation functions are possible, provided that σ(x) ∈ [0, 1].

[Diagrams: Type 1, Type 2 and Type 3 gate variants]

Alternate forms can be created by changing z_t and r_t:[9]

  • Type 1: each gate depends only on the previous hidden state and the bias.
    \begin{aligned} z_t &= \sigma(U_z h_{t-1} + b_z) \\ r_t &= \sigma(U_r h_{t-1} + b_r) \end{aligned}
  • Type 2: each gate depends only on the previous hidden state.
    \begin{aligned} z_t &= \sigma(U_z h_{t-1}) \\ r_t &= \sigma(U_r h_{t-1}) \end{aligned}
  • Type 3: each gate is computed using only the bias.
    \begin{aligned} z_t &= \sigma(b_z) \\ r_t &= \sigma(b_r) \end{aligned}
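Since the three variants above differ only in how the gates are computed, they can be written as drop-in replacements for the gate computation in the earlier sketch; the candidate activation and the state update stay as in the full GRU. This is again an illustrative sketch, reusing sigmoid and the parameter dictionary p from the example above.

    # Gate computations for the three reduced variants; each returns (z_t, r_t).
    def gates_type1(h_prev, p):   # previous hidden state and bias
        return sigmoid(p["Uz"] @ h_prev + p["bz"]), sigmoid(p["Ur"] @ h_prev + p["br"])

    def gates_type2(h_prev, p):   # previous hidden state only
        return sigmoid(p["Uz"] @ h_prev), sigmoid(p["Ur"] @ h_prev)

    def gates_type3(h_prev, p):   # bias only
        return sigmoid(p["bz"]), sigmoid(p["br"])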

Minimal gated unit

The minimal gated unit (MGU) is similar to the fully gated unit, except that the update and reset gate vectors are merged into a single forget gate. This also implies that the equation for the output vector must be changed:[10]

\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
\hat{h}_t &= \phi(W_h x_t + U_h (f_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - f_t) \odot h_{t-1} + f_t \odot \hat{h}_t
\end{aligned}

Variables

  • x_t: input vector
  • h_t: output vector
  • ĥ_t: candidate activation vector
  • f_t: forget gate vector
  • W, U and b: parameter matrices and vector
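In the same minimal NumPy sketch as above, one MGU forward step could look as follows, assuming tanh for φ; the parameter names Wf, Uf and bf are illustrative.

    def mgu_step(x_t, h_prev, p):
        """One forward step of a minimal gated unit (illustrative sketch)."""
        f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])            # forget gate
        h_hat = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (f_t * h_prev) + p["bh"])  # candidate activation
        return (1 - f_t) * h_prev + f_t * h_hat                              # new hidden state h_t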

Light gated recurrent unit

The light gated recurrent unit (LiGRU)[4] removes the reset gate altogether, replaces tanh with the ReLU activation, and applies batch normalization (BN):

\begin{aligned}
z_t &= \sigma(\operatorname{BN}(W_z x_t) + U_z h_{t-1}) \\
\tilde{h}_t &= \operatorname{ReLU}(\operatorname{BN}(W_h x_t) + U_h h_{t-1}) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
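Because batch normalization is computed over a mini-batch, a corresponding sketch operates on a batch of inputs at once. The following is a minimal NumPy illustration of one LiGRU step, with a simplified batch normalization (no learned scale and shift) and weight shapes as in the earlier example; all names are illustrative.

    def batch_norm(a, eps=1e-5):
        """Per-feature normalization over a (batch, features) array; real BN adds a learned scale/shift."""
        return (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)

    def ligru_step(X_t, H_prev, p):
        """One forward step of a LiGRU on a mini-batch (rows are examples); illustrative sketch."""
        z_t = sigmoid(batch_norm(X_t @ p["Wz"].T) + H_prev @ p["Uz"].T)              # update gate
        h_tilde = np.maximum(0.0, batch_norm(X_t @ p["Wh"].T) + H_prev @ p["Uh"].T)  # ReLU candidate
        return z_t * H_prev + (1 - z_t) * h_tilde                                    # new hidden states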

LiGRU has been studied from a Bayesian perspective.[11] This analysis yielded a variant called the light Bayesian recurrent unit (LiBRU), which showed slight improvements over the LiGRU on speech recognition tasks.

References

  1. ^ Cho, Kyunghyun; van Merrienboer, Bart; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". Association for Computational Linguistics. arXiv:1406.1078.
  2. ^ Felix Gers; Jürgen Schmidhuber; Fred Cummins (1999). "Learning to forget: Continual prediction with LSTM". 9th International Conference on Artificial Neural Networks: ICANN '99. Vol. 1999. pp. 850–855. doi:10.1049/cp:19991218. ISBN 0-85296-721-7.
  3. ^ "Recurrent Neural Network Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python and Theano – WildML". Wildml.com. 2015-10-27. Archived from the original on 2021-11-10. Retrieved May 18, 2016.
  4. ^ a b Ravanelli, Mirco; Brakel, Philemon; Omologo, Maurizio; Bengio, Yoshua (2018). "Light Gated Recurrent Units for Speech Recognition". IEEE Transactions on Emerging Topics in Computational Intelligence. 2 (2): 92–102. arXiv:1803.10225. doi:10.1109/TETCI.2017.2762739. S2CID 4402991.
  5. ^ Su, Yuahang; Kuo, Jay (2019). "On extended long short-term memory and dependent bidirectional recurrent neural network". Neurocomputing. 356: 151–161. arXiv:1803.01686. doi:10.1016/j.neucom.2019.04.044. S2CID 3675055.
  6. ^ Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555 [cs.NE].
  7. ^ Gruber, N.; Jockisch, A. (2020), "Are GRU cells more specific and LSTM cells more sensitive in motive classification of text?", Frontiers in Artificial Intelligence, 3: 40, doi:10.3389/frai.2020.00040, PMC 7861254, PMID 33733157, S2CID 220252321
  8. ^ Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555 [cs.NE].
  9. ^ Dey, Rahul; Salem, Fathi M. (2017-01-20). "Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks". arXiv:1701.05923 [cs.NE].
  10. ^ Heck, Joel; Salem, Fathi M. (2017-01-12). "Simplified Minimal Gated Unit Variations for Recurrent Neural Networks". arXiv:1701.03452 [cs.NE].
  11. ^ Bittar, Alexandre; Garner, Philip N. (May 2021). "A Bayesian Interpretation of the Light Gated Recurrent Unit". ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, ON, Canada: IEEE. pp. 2965–2969. doi:10.1109/ICASSP39728.2021.9414259.