Summary
People who worked on language and NLP about 5+ years ago are left scratching their heads about where all the encoder models went. In this post, I try to unpack what is going on in this new era of LLMs. I hope it will be helpful.
A denoising objective is any variation of the "span corruption" task, sometimes known as "infilling" or "fill in the blank". T5-11B works pretty well even after being aligned/SFT-ed.
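To make that concrete, here is a toy sketch of span corruption. The `<extra_id_k>` sentinel naming follows T5's convention; the function itself is made up for illustration and is not T5's actual preprocessing code.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Toy T5-style span corruption: drop random contiguous spans, replace
    each with a sentinel, and build a target that spells the spans back out."""
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(tokens) * corruption_rate))
    masked = set()
    while len(masked) < n_to_mask:
        start = rng.randrange(len(tokens))
        for i in range(start, min(start + mean_span_len, len(tokens))):
            masked.add(i)

    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if i in masked:
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and i in masked:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

corrupted_input, denoise_target = span_corrupt(
    "the quick brown fox jumps over the lazy dog".split())
# corrupted_input: original tokens with spans replaced by <extra_id_k> sentinels
# denoise_target: each sentinel followed by the tokens it replaced
```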
During 2018-2021, there was an implicit paradigm shift from single-task finetuning to massively multi-task models. BERT-style models are cumbersome, but the real reason BERT models were deprecated was that people wanted to do all tasks at once. This led to a better way of doing denoising: with autoregressive models.
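The "all tasks at once" point is easiest to see in the text-to-text format popularized by T5: every task, classification included, becomes a (prompt, target) string pair trained with the same objective. A hypothetical batch might look like this (the task prefixes mirror T5's; the rest is illustrative):

```python
# One model, one objective: every task is just sequence prediction on a
# string pair, instead of a separately finetuned BERT + task-specific head.
multitask_batch = [
    ("translate English to German: The house is small.", "Das Haus ist klein."),
    ("summarize: <article text here>", "<short summary here>"),
    ("sst2 sentence: this movie was fantastic!", "positive"),
    ("The quick brown <extra_id_0> over the lazy dog.", "<extra_id_0> fox jumps"),
]
```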
Bidirectional attention is an interesting "inductive bias" for language models, one that is commonly conflated with objectives and model backbones. The usefulness of an inductive bias changes across compute regimes and can have different effects on scaling curves in different regimes.
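To separate the things that often get conflated (objective, backbone, attention pattern), here is a minimal sketch of the attention masks alone; the helper names are mine, not from any library.

```python
import numpy as np

def causal_mask(n):
    # Decoder-style: position i attends only to positions <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n):
    # Encoder-style (BERT): every position attends to every position.
    return np.ones((n, n), dtype=bool)

def prefix_lm_mask(n, prefix_len):
    # PrefixLM: bidirectional over the first `prefix_len` input tokens,
    # causal over the generated continuation.
    mask = causal_mask(n)
    mask[:prefix_len, :prefix_len] = True
    return mask

# The mask is orthogonal to the training objective: you can train a
# denoising objective or plain next-token prediction on top of any of these.
print(prefix_lm_mask(6, 3).astype(int))
```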
Encoder-decoder and decoder-only models are both autoregressive models with implementation-level differences and pros/cons. They are subtly different inductive biases. Which one is optimal really depends on the downstream use-case and on application constraints. Meanwhile, for most LLM usage (niche use-cases aside), BERT-style encoder models are mostly considered deprecated.
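As a rough picture of the implementation-level difference, here is an illustrative pseudo-config (not a real API; the parameter comment is the usual back-of-the-envelope argument):

```python
# Encoder-decoder: two stacks plus cross-attention. The input is encoded once
# with bidirectional attention; the decoder generates the target causally
# while cross-attending to the encoder outputs. Roughly 2x the parameters of
# a single 12-layer stack, but each token only passes through one of the stacks.
encoder_decoder = {
    "encoder": {"layers": 12, "attention": "bidirectional over the input"},
    "decoder": {"layers": 12, "attention": "causal over the target",
                "cross_attention": "over encoder outputs"},
}

# Decoder-only: one shared stack; input and target live in the same sequence,
# with a causal (or PrefixLM) mask deciding what can see what.
decoder_only = {
    "decoder": {"layers": 12, "attention": "causal or PrefixLM over input + target"},
}
```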