A conventional neural network, i.e. one using a stack of dense layers, can't unroll across a sequence the way a transformer does. It could compute the relative importance and interactions of the features it sees, but it couldn't do that across arbitrary-length sequences without a mechanism for the sequence elements to interact, which is what self-attention provides.
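To make that concrete, here's a minimal single-head self-attention sketch in NumPy (the function name, weight matrices, and dimensions are illustrative, not from any particular library). The key point is that the same learned weights apply to a sequence of any length, because the pairwise score matrix is built from the input itself:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Minimal single-head self-attention: every position attends to
    every other position, so the same weights work for any length n."""
    q, k, v = x @ wq, x @ wk, x @ wv            # each (n, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])      # (n, n) pairwise interactions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ v                           # (n, d) mixed representations

rng = np.random.default_rng(0)
d = 4
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
# Unlike a dense layer, whose input size is fixed, the same parameters
# handle a length-3 and a length-7 sequence alike.
for n in (3, 7):
    out = self_attention(rng.standard_normal((n, d)), wq, wk, wv)
    print(out.shape)  # (3, 4) then (7, 4)
```

A stack of dense layers, by contrast, would need a fixed input width and a separate set of weights for every pair of positions it wanted to relate.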
Practical attention implementations don't actually work over arbitrary-length sequences either; they operate within a bounded context. The universal approximation theorem still holds, IMO: information will mix as you go through fully connected MLP layers. Attention is apparently a prior structure that is needed to really reduce training costs.