Showing 1–20 of 38 results
Date          Name
Jun 12, 2017  Attention Is All You Need
Nov 6, 2019   Fast Transformer Decoding: One Write-Head is All You Need
Feb 12, 2020  GLU Variants Improve Transformer
Jan 23, 2017  Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Mar 5, 2020   Talking-Heads Attention
Nov 5, 2018   Mesh-TensorFlow: Deep Learning for Supercomputers
Apr 11, 2018  Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
Jun 4, 2010   Variational Program Inference
Mar 31, 2022  Scaling Up Models and Data with $\texttt{t5x}$ and $\texttt{seqio}$
Oct 31, 2018  Weakly Supervised Grammatical Error Correction using Iterative Decoding
Nov 7, 2018   Blockwise Parallel Decoding for Deep Autoregressive Models
Sep 17, 2021  Primer: Searching for Efficient Transformers for Language Modeling
Feb 15, 2018  Image Transformer
Jan 14, 2020  Faster Transformer Decoding: N-gram Masked Self-Attention
Jun 9, 2015   Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
Apr 5, 2022   PaLM: Scaling Language Modeling with Pathways
Sep 6, 2019   High Resolution Medical Image Analysis with Spatial Partitioning
Feb 6, 2016   Swivel: Improving Embeddings by Noticing What's Missing
Feb 10, 2020  How Much Knowledge Can You Pack Into the Parameters of a Language Model?
Sep 12, 2018  Music Transformer