
What happened to BERT & T5? On Transformer Encoders, PrefixLM and Denoising Objectives

yitay.net
submitted a year ago by boredgamer to programming

Summary

People who worked on language and NLP 5+ years ago are left scratching their heads about where all the encoder models went. Today I try to unpack what is going on in this new era of LLMs. I hope this post will be helpful.

A denoising objective is any variation of the “span corruption” task, sometimes known as “infilling” or “fill in the blank.” T5-11B works pretty well even after being aligned/SFT-ed.
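To make “span corruption” concrete, here is a minimal, illustrative sketch in Python (not T5's actual preprocessing): random spans are replaced with sentinels in the input, and the target is each sentinel followed by the tokens it replaced. The sentinel names follow T5's `<extra_id_N>` convention; the function name, corruption probability, and fixed span length are made up for this toy.

```python
import random

def span_corrupt(tokens, corrupt_prob=0.3, span_len=2, seed=0):
    """Toy T5-style span corruption: drop random spans from the input,
    replace each with a sentinel, and build a target that lists each
    sentinel followed by the tokens it replaced."""
    rng = random.Random(seed)
    inputs, targets = [], []
    i, sentinel_id = 0, 0
    while i < len(tokens):
        if rng.random() < corrupt_prob:
            sentinel = f"<extra_id_{sentinel_id}>"
            inputs.append(sentinel)                 # placeholder in the corrupted input
            targets.append(sentinel)
            targets.extend(tokens[i:i + span_len])  # the dropped span goes to the target
            i += span_len
            sentinel_id += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append("</s>")
    return inputs, targets

toks = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(toks)
print(" ".join(inp))  # e.g. the quick brown <extra_id_0> over the lazy dog
print(" ".join(tgt))  # e.g. <extra_id_0> fox jumps </s>
```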

During 2018-2021, there was an implicit paradigm shift from single-task finetuning to massively multi-task models. BERT-style models are cumbersome, but the real deprecation of BERT models came about because people wanted to do all tasks at once. This led to a better way of doing denoising: using autoregressive models.

Bidirectional attention is an interesting “inductive bias” for language models, one that is commonly conflated with objectives and model backbones. The usefulness of inductive biases changes across compute regimes and can have different effects on scaling curves in different regimes.
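As a mechanical illustration of causal versus bidirectional attention (a sketch under my own naming, not code from the post), the masks below contrast a decoder-only causal mask with a PrefixLM mask, where the prefix/input tokens attend to each other bidirectionally while the continuation stays causal:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Decoder-only mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """PrefixLM mask: the first `prefix_len` tokens attend to each other
    bidirectionally; the remaining tokens attend causally."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True  # full attention inside the prefix
    return mask

print(causal_mask(5).astype(int))
print(prefix_lm_mask(5, prefix_len=3).astype(int))
```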

Encoder-decoder and decoder-only models are both autoregressive models with implementation-level differences and their own pros and cons; they impose subtly different inductive biases. Optimal usage really depends on the downstream use-case and application constraints. Meanwhile, aside from niche use-cases, BERT-style encoder models are mostly considered deprecated for most LLM usage.
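To make the implementation-level difference concrete, here is a hypothetical sketch (function names and dict keys are invented for illustration) of how the same denoising example could be packed for each backbone: an encoder-decoder keeps the corrupted input and the target in separate stacks joined by cross-attention, while a decoder-only PrefixLM concatenates them into one sequence, attends bidirectionally over the input portion, and computes the LM loss only on the target tokens.

```python
def pack_encoder_decoder(inputs, targets):
    """Encoder-decoder: the corrupted input feeds the encoder; the target
    is predicted autoregressively by the decoder via cross-attention."""
    return {"encoder_input": inputs, "decoder_target": targets}

def pack_decoder_only(inputs, targets):
    """Decoder-only PrefixLM: one concatenated sequence; bidirectional
    attention over the first `prefix_len` tokens, loss only on the target."""
    tokens = inputs + targets
    loss_mask = [0] * len(inputs) + [1] * len(targets)  # 1 = contributes to the LM loss
    return {"tokens": tokens, "prefix_len": len(inputs), "loss_mask": loss_mask}

example_in = ["the", "quick", "<extra_id_0>", "dog"]
example_tgt = ["<extra_id_0>", "brown", "fox", "</s>"]
print(pack_encoder_decoder(example_in, example_tgt))
print(pack_decoder_only(example_in, example_tgt))
```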

