
The History of LLMs | A History of Large Language Models

📚 From distributed representations to the Transformer and RLHF, this post walks in order through the research that produced today's LLMs.

KEYWORDS
LLM, history of LLMs, what is an LLM, meaning of LLM in AI, large language model, Language Model, distributed representation, Word Embedding, Word2Vec, Attention, Transformer, Attention is all you need, RLHF, Bengio 2003, what is attention, Transformer paper review



Introduction

  • Large language models (LLMs) did not appear overnight; they are the product of some 40 years of research that began with work on distributed representations in the 1980s [1].
    • To understand LLMs, it helps to ask not what today's GPT or Claude can do, but which ideas were linked together, one after another, to produce the current models.
  • This post traces the core ideas behind LLMs in chronological order.

    • Distributed representation | Bengio 2003
    • Autoregressive framework
    • Word2Vec and linguistic regularities
    • Seq2Seq models and adaptive context
    • The branching of attention mechanisms
    • Transformer (Attention is all you need, 2017)
    • Generative pre-training and alignment (RLHF)


A quick look at LLMs and GenAI


Distributed Representation

  • Key question | How can human language be modeled on a computer?
    • Until the 1980s, natural language processing (NLP) relied on hand-designed rules and features.
    • Statistical machine-learning methods began to be adopted in the early 1990s [2].
  • The core of statistical NLP is to model language as a probability distribution over all possible sequences.
    • This distribution is usually factorized so that each word depends on all the words before it:
\[p(w_{1:T}) = \prod_{t=1}^{T} p(w_t \mid w_{1:t-1})\]
  • ์ข‹์€ ์–ธ์–ด ๋ชจ๋ธ \(p(w_{1:T})\)๊ฐ€ ์žˆ์œผ๋ฉด ์‹œํ€€์Šค์˜ ๊ฐ€๋Šฅ๋„ ๋น„๊ต, ๋ฒˆ์—ญ, ์กฐ๊ฑด๋ถ€ ์ƒ์„ฑ ๋“ฑ ๋‹ค์–‘ํ•œ ์ž‘์—…์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.


์ฐจ์›์˜ ์ €์ฃผ Curse of Dimensionality

  • ์œ„ ํ™•๋ฅ ์„ ์ถ”์ •ํ•˜๋Š” ์ผ์€ ๋งค์šฐ ์–ด๋ ต์Šต๋‹ˆ๋‹ค.
    • ์˜์–ด ์–ดํœ˜๋Š” ๋Œ€๋žต ๋ฐฑ๋งŒ ๋‹จ์–ด ์ˆ˜์ค€์ด๋ฉฐ, ๋ฒˆ์—ญ์ฒ˜๋Ÿผ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ์ด ๋งŽ์€ ์ž‘์—…์—์„œ๋Š” ๋ชจ๋“  ์กฐํ•ฉ์„ ๊ด€์ธกํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.
    • ๋ฐ์ดํ„ฐ ํฌ์†Œ์„ฑ(Data Sparsity) ๋ฌธ์ œ๋กœ, ์‹ค์ œ ํ™•๋ฅ ์„ ์ถ”์ •ํ•˜๋Š” ๊ฒƒ์ด ์‚ฌ์‹ค์ƒ ๋ถˆ๊ฐ€๋Šฅํ•ด์ง‘๋‹ˆ๋‹ค.
  • ๊ฐ€์žฅ ์˜ค๋ž˜๋œ ์ ‘๊ทผ์€ Markov ๊ฐ€์ •์œผ๋กœ, ์ด๋Š” ๊ฐ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ์ด ์ง์ „ \(N\)๊ฐœ ๋‹จ์–ด์—๋งŒ ์˜์กดํ•œ๋‹ค๊ณ  ๋‹จ์ˆœํ™”ํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค 3:
\[p(w_{1:T}) \approx \prod_{t=1}^{T} p(w_t \mid w_{t-N+1:t-1})\]
  • ์ด๊ฒƒ์ด ์œ ๋ช…ํ•œ N-gram ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
    • \(N=2\) (bigram), \(N=3\) (trigram) ์ •๋„์—์„œ๋Š” ์ถ”์ •์ด ๊ฐ€๋Šฅํ•˜๋‚˜, Markov ๊ฐ€์ •์€ ๋ฌธ๋งฅ(Context)์„ ํŒŒ๊ดดํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ž์—ฐ์–ด์˜ ๋ณต์žก๋„, ๋‰˜์•™์Šค๋ฅผ ์žฌํ˜„ํ•˜๊ธฐ ์–ด๋ ต์Šต๋‹ˆ๋‹ค.
    • 2000๋…„ ๋ฌด๋ ต๊นŒ์ง€ ์ด๊ฒƒ์ด NLP์˜ ํ‘œ์ค€์ด์—ˆ์Šต๋‹ˆ๋‹ค.
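
A minimal sketch of the bigram case (\(N=2\)), assuming whitespace tokenization and hypothetical `<s>`/`</s>` boundary markers. Probabilities are plain relative frequencies with no smoothing, so any sentence containing an unseen bigram gets probability zero, which is the data-sparsity problem in miniature.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Estimate p(w_t | w_{t-1}) by relative frequency (maximum likelihood)."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            pair_counts[prev][cur] += 1
    # Normalize each row of counts into a conditional distribution.
    return {
        prev: {w: c / sum(counter.values()) for w, c in counter.items()}
        for prev, counter in pair_counts.items()
    }

def sentence_prob(model, sentence):
    """Product of bigram probabilities; 0.0 if any bigram was never observed."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= model.get(prev, {}).get(cur, 0.0)
    return prob

corpus = ["the cat is walking", "the dog is sleeping"]  # toy corpus (assumption)
model = train_bigram(corpus)
p_seen = sentence_prob(model, "the cat is walking")
p_unseen = sentence_prob(model, "the cat is running")  # "is running" never observed
```

Here `p_unseen` collapses to zero because of a single unobserved word pair, even though the sentence is perfectly natural.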


Neural Language Model

  • In 2003, Bengio and colleagues proposed a neural probabilistic language model built on distributed representations [4].
    • The model combines three ideas:
      • Represent each word as a real-valued vector (an embedding)
      • Express the probability function as a function of those embeddings
      • Learn the embeddings and the parameters of the probability function jointly with a neural network (back-propagation)
  • If each word in the vocabulary \(V = \{1, 2, \ldots, V\}\) is represented by a \(D\)-dimensional vector, the whole vocabulary can be written as a matrix:
\[C \in \mathbb{R}^{V \times D}\]
  • The \(i\)-th row \(c_i\) is the word embedding of the \(i\)-th word.

fig1 Structure of the distributed-representation matrix \(C\). Each row is the \(D\)-dimensional embedding of one word [4].


  • The probability function is implemented as a feed-forward neural network, where \(I(w)\) denotes the vocabulary index of word \(w\):
\[f_\theta(w_{t-1}, \ldots, w_{t-N}) = g_\theta\bigl(c_{I(w_{t-1})}, \ldots, c_{I(w_{t-N})}\bigr)\]
  • The set of learnable parameters consists of the word embeddings \(C\) and the network parameters \(\theta\):
\[\Theta := \{C, \theta\}\]
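
A rough sketch of the forward pass, under simplifying assumptions: a single tanh hidden layer, no bias terms, no direct embedding-to-output connections, and small random weights. The sizes and the names `W_h` and `W_o` are illustrative and are not taken from the original paper.

```python
import math
import random

random.seed(0)
V, D, H, N = 6, 4, 5, 2  # vocab size, embedding dim, hidden dim, context size (toy values)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

C = rand_matrix(V, D)        # embedding matrix, one row per word
W_h = rand_matrix(H, N * D)  # hidden-layer weights (part of theta)
W_o = rand_matrix(V, H)      # output-layer weights (part of theta)

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_distribution(context_ids):
    """g_theta: concatenate context embeddings, apply tanh layer, softmax over V words."""
    x = [value for idx in context_ids for value in C[idx]]
    hidden = [math.tanh(v) for v in matvec(W_h, x)]
    return softmax(matvec(W_o, hidden))

probs = next_word_distribution([3, 1])  # indices of the N previous words
```

Both `C` and the weight matrices would be updated together during training, which is what \(\Theta = \{C, \theta\}\) expresses.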


๋ชจ๋ธ ์ž‘๋™ ๋ฐฉ์‹

  • ์› ๋…ผ๋ฌธ์˜ ํ•ต์‹ฌ ๋…ผ์ฆ์€ ์ผ๋ฐ˜ํ™”(Generalization) ๊ฐ€๋Šฅ์„ฑ์— ์žˆ์Šต๋‹ˆ๋‹ค.
    • ์˜๋ฏธ, ๋ฌธ๋ฒ•์ ์œผ๋กœ ์œ ์‚ฌํ•œ ๋‹จ์–ด๋Š” ๋น„์Šทํ•œ ์ž„๋ฒ ๋”ฉ์„ ๊ฐ€์ง€๋ฉฐ, ํ™•๋ฅ  ํ•จ์ˆ˜๋Š” ์ด ์ž„๋ฒ ๋”ฉ์˜ ๋งค๋„๋Ÿฌ์šด(Smooth) ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค.
    • ๋”ฐ๋ผ์„œ ์ž„๋ฒ ๋”ฉ์ด ์กฐ๊ธˆ ๋ณ€ํ•˜๋ฉด ํ™•๋ฅ ๋„ ์กฐ๊ธˆ ๋ณ€ํ•˜๋ฉฐ, ํ•™์Šต ๋ฐ์ดํ„ฐ์— ํ•œ ๋ฌธ์žฅ๋งŒ ์žˆ์–ด๋„ ๊ทธ ๋ฌธ์žฅ์˜ ์ž„๋ฒ ๋”ฉ ๊ณต๊ฐ„ ์ด์›ƒ ๋ฌธ์žฅ๋“ค์— ๋Œ€ํ•ด ํ™•๋ฅ ์ด ๋™์‹œ์— ์˜ฌ๋ผ๊ฐ‘๋‹ˆ๋‹ค.
  • ex. โ€œdogโ€๊ณผ โ€œcatโ€์˜ ์ž„๋ฒ ๋”ฉ์ด ๊ฐ€๊น๋‹ค๋ฉด, The cat is walking on the sidewalk๊ณผ The dog is walking on the sidewalk์€ ๋น„์Šทํ•œ ํ™•๋ฅ ์„ ๊ฐ€์ง‘๋‹ˆ๋‹ค.
    • ํ•™์Šต ๋ฐ์ดํ„ฐ์— ํ•œ ๋ฌธ์žฅ๋งŒ ์žˆ์–ด๋„ ๋‹ค๋ฅธ ๋ฌธ์žฅ์œผ๋กœ ์ผ๋ฐ˜ํ™” ๊ฐ€๋Šฅํ•˜๋‹ค๋Š” ์˜๋ฏธ์ž…๋‹ˆ๋‹ค.


์ž๊ธฐํšŒ๊ท€ ํ”„๋ ˆ์ž„์›Œํฌ Autoregressive Framework

  • ํ•ต์‹ฌ ์งˆ๋ฌธ ใ…ฃ Bengio ๋ชจ๋ธ์„ ์–ด๋–ป๊ฒŒ ํ•™์Šตํ•˜๊ณ , ์ƒˆ ๋ฌธ์žฅ์„ ์–ด๋–ป๊ฒŒ ์ƒ์„ฑํ•˜๋Š”๊ฐ€?
    • ์ด ํ”„๋ ˆ์ž„์›Œํฌ๋Š” ํ˜„์žฌ์˜ LLM๊ณผ ๊ฐœ๋…์ ์œผ๋กœ ๋™์ผํ•œ ํ•™์Šต ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.
  • ex. Virginia Woolf์˜ ๋ฌธ์žฅ "Intellectual freedom depends upon material things."์„ ํ•™์Šตํ•œ๋‹ค๊ณ  ํ•  ๋•Œ:
    • Context window \(N=2\)๋กœ ๋‘๋ฉด, ์ฒซ ๋ฒˆ์งธ Non-zero ์ž…๋ ฅ์€ โ€œintellectualโ€ ๋‹จ์–ด์˜ ์ž„๋ฒ ๋”ฉ \(c_{I(\text{intellectual})}\)์ž…๋‹ˆ๋‹ค.
    • ๋ชจ๋ธ์€ \(V\)์ฐจ์› ํ™•๋ฅ  ๋ถ„ํฌ \(p(w_2 \mid w_1)\)๋ฅผ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค.
    • ์ •๋‹ต ๋‹จ์–ด โ€œfreedomโ€์— ๋Œ€์‘๋˜๋Š” One-hot ๋ฒกํ„ฐ์™€ ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ(Cross-Entropy) ์†์‹ค์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
  • ํ•œ ๋‹จ์–ด์”ฉ shiftํ•˜๋ฉฐ ๋ฐ˜๋ณตํ•ฉ๋‹ˆ๋‹ค. \(N=2\) ์ œ์•ฝ ๋•Œ๋ฌธ์— ์„ธ ๋ฒˆ์งธ ์ž…๋ ฅ์—์„œ๋Š” โ€œintellectualโ€์ด ๋ฌธ๋งฅ์„ ๋ฒ—์–ด๋‚˜ ์†์‹ค๋ฉ๋‹ˆ๋‹ค.
    • ์ด๊ฒƒ์ด Context Window์˜ ๊ทผ๋ณธ์  ํ•œ๊ณ„์ด๋ฉฐ, ์ดํ›„ ์—ฐ๊ตฌ๋“ค์˜ ํ•ต์‹ฌ ๋™๊ธฐ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.


Objective Function and Generation

  • Minimizing cross-entropy is equivalent to maximizing log-likelihood, so training generalizes to solving:
\[\Theta^{*} = \arg\max_{\Theta} \sum_{t=1}^{T} \log g_\theta\bigl(c_{I(w_{t-N})}, \ldots, c_{I(w_{t-1})}\bigr)\]
  • The parameters \(\Theta\) are estimated with back-propagation and gradient descent.

  • After training, a sentence is generated as follows:
    • Sample the first word \(w_1\) from the vocabulary
    • Sample the second word from \(p(w_2 \mid w_1)\)
    • Sample the third word from \(p(w_3 \mid w_{1:2})\)
    • Repeat until an end token is reached
  • This is why LLMs can both understand and generate natural language: a language model is at once a descriptive model and a generative model.

  • A model trained this way is called an autoregressive model.
    • In statistics, autoregression refers to a model in which a variable is predicted from its own previous values.
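
The generation loop above can be sketched with a hypothetical lookup table in place of a trained network. The table `NEXT` and its probabilities are invented for illustration, and it conditions only on the previous word to keep the example short.

```python
import random

# Hypothetical next-word table standing in for a trained model p(w_t | w_{t-1}).
NEXT = {
    "<s>": {"intellectual": 1.0},
    "intellectual": {"freedom": 1.0},
    "freedom": {"depends": 0.7, "</s>": 0.3},
    "depends": {"upon": 1.0},
    "upon": {"material": 1.0},
    "material": {"things": 1.0},
    "things": {"</s>": 1.0},
}

def generate(next_table, rng, max_len=20):
    """Sample one word at a time, feeding each sample back in as context."""
    words, prev = [], "<s>"
    for _ in range(max_len):
        candidates = list(next_table[prev])
        weights = [next_table[prev][w] for w in candidates]
        prev = rng.choices(candidates, weights=weights, k=1)[0]
        if prev == "</s>":  # end token reached
            break
        words.append(prev)
    return words

sample = generate(NEXT, random.Random(0))
```

The same distribution that scores a sentence during training is the one sampled from here, which is the sense in which the model is both descriptive and generative.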


The Shift

  • Key question | If Bengio 2003 was such a landmark, why did N-grams remain mainstream in practice for nearly a decade afterward?
    • The answer is simple: training neural networks was too hard at the time.
    • The Bengio model was trained on CPUs, without any automatic-differentiation library.


AlexNet

  • AlexNet [5], which appeared in the 2012 ImageNet competition, sharply changed the direction of computer vision.
    • ILSVRC-2012 top-5 test error of 15.3% versus 26.2% for the runner-up, a relative error reduction of about 40%.
    • One of the first deep CNNs trained end-to-end on GPUs with a large-scale dataset (ImageNet).
  • "In 2003 Bengio set the conceptual stage, and in 2012 Krizhevsky set the technical stage."
    • After this, NLP researchers began in earnest to train neural networks at scale.


Word2Vec

  • In 2013, Mikolov et al. published two papers that addressed the scalability problem of distributed representations [6] [7].

  • Consider the computational cost of the Bengio model. The complexity of predicting a single word is roughly:

\[\mathcal{O}(ND + VND + VH + HND)\]
  • \(V\) is the vocabulary size, \(N\) the context window, \(D\) the embedding dimension, and \(H\) the hidden dimension.

    • Because the vocabulary \(V\) is very large, the terms that scale with \(V\), in particular the output-layer term \(VH\), dominate; with softmax normalization on top of that, training was very slow.
  • Two techniques used by Mikolov et al.:
    • Hierarchical Softmax | Normalization based on a binary tree. Reduces the complexity from \(\mathcal{O}(V)\) to \(\mathcal{O}(\log_2 V)\).
    • Negative Sampling | Draw \(K\) samples from a noise distribution and train the model to tell observed data from noise. The normalizing constant is never computed explicitly.
  • The model architecture is also radically simplified: Bengio's nonlinear hidden layer is removed, leaving only a log-linear model.


CBOW and Skip-gram

fig2 CBOW (left) and Skip-gram (right). Both are shallow log-linear models [6].

  • CBOW (Continuous Bag-of-Words) | The surrounding words predict the center word
  • Skip-gram | The center word predicts the surrounding words

  • The Skip-gram objective (window \(N=2C\)):
\[\frac{1}{T} \sum_{t=1}^{T} \sum_{-C \leq j \leq C,\, j \neq 0} \log p(w_{t+j} \mid w_t)\]
  • The conditional probability is modeled log-linearly:
\[p(w_{t+j} \mid w_t) = \frac{\exp\bigl(u_{I(w_{t+j})}^\top c_{I(w_t)}\bigr)}{\sum_{i \in V} \exp\bigl(u_i^\top c_{I(w_t)}\bigr)}\]
  • Taking the log exposes the linear form:
\[\log p(w_{t+j} \mid w_t) = u_{I(w_{t+j})}^\top c_{I(w_t)} - Z\]
  • \(Z\) is the log of the normalizing constant, and with negative sampling it does not need to be computed explicitly.

  • An important subtlety | CBOW and Skip-gram are not full language models. They are only auxiliary objectives for learning good word embeddings.

    • Even so, these shallow models could be trained at large scale, and the results were striking.
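
As a hedged illustration of how negative sampling sidesteps \(Z\), the sketch below evaluates the per-pair objective from the second Word2Vec paper [7]: the log-sigmoid of the observed pair's score plus the log-sigmoid of the negated score for each of \(K\) noise words. The vectors are made-up toy values, and drawing from the noise distribution is left out.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def negative_sampling_objective(center, positive, negatives):
    """log sigma(u_pos . c) + sum_k log sigma(-u_neg_k . c); larger is better."""
    score = log_sigmoid(dot(positive, center))
    score += sum(log_sigmoid(-dot(neg, center)) for neg in negatives)
    return score

c_center = [1.0, 0.0]
u_context = [2.0, 0.0]               # observed context word, aligned with the center
u_noise = [[-2.0, 0.0], [0.0, 1.0]]  # K = 2 noise words (hypothetical values)

good = negative_sampling_objective(c_center, u_context, u_noise)
bad = negative_sampling_objective(c_center, u_noise[0], [u_context, u_noise[1]])
```

Only \(K+1\) dot products are needed per training pair, instead of a sum over the whole vocabulary.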


Emergent Linguistic Regularities

  • Key question | Why does a simple linear model capture semantic and syntactic structure?

  • Mikolov et al. showed that meaningful linear offsets can be observed in word embeddings [8].
    • That is, many semantic and syntactic relations are expressed as nearly constant vector differences in embedding space.
  • ex. "king is to queen as man is to woman":
\[\text{vec}(\text{"king"}) - \text{vec}(\text{"man"}) + \text{vec}(\text{"woman"}) \approx \text{vec}(\text{"queen"})\]

fig3 In a high-dimensional word-embedding space, a single word can express several semantic relations at once, such as gender and singular/plural, as vector directions [8].

  • ex:
\[\text{vec}(\text{"Russia"}) + \text{vec}(\text{"river"}) \approx \text{vec}(\text{"Volga River"})\]
  • Words are discrete objects, so a "small change" to a word is only an intuition and has no mathematical definition.
    • Embedding space makes this intuition concrete. That words with similar meanings have nearby vectors means that discrete semantic structure is preserved almost linearly in a continuous vector space.
  • Later came the family of contextualized embeddings.
    • ELMo, from Peters et al. (2018), uses the hidden states of a bidirectional LSTM as contextualized embeddings [9].
    • This work foreshadowed the approach of combining pre-trained embeddings with supervised fine-tuning.
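
The king/queen offset above can be reproduced on a toy example. The two-dimensional embeddings below are invented for illustration (real Word2Vec vectors have hundreds of dimensions), and the nearest word is chosen by cosine similarity with the three input words excluded, which is an assumption about the evaluation convention.

```python
import math

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
EMB = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def analogy(a, b, c, emb):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

result = analogy("king", "man", "woman", EMB)
```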


์ ์‘์  ๋ฌธ๋งฅ Adaptive Context

  • ํ•ต์‹ฌ ์งˆ๋ฌธ ใ…ฃ ๊ณ ์ • ํฌ๊ธฐ Context Window๋ฅผ ๋„˜์–ด ์ž„์˜ ๊ธธ์ด ์‹œํ€€์Šค๋ฅผ ์–ด๋–ป๊ฒŒ ๋‹ค๋ฃฐ ๊ฒƒ์ธ๊ฐ€?

  • 2013๋…„๊ฒฝ๊นŒ์ง€ ์ž„๋ฒ ๋”ฉ์€ ์ž˜ ์ž‘๋™ํ–ˆ์ง€๋งŒ, ์—ฌ์ „ํžˆ ๊ณ ์ • Window ์•ˆ์—์„œ๋งŒ ์œ ํšจํ–ˆ์Šต๋‹ˆ๋‹ค.

    • ์ด ํ•œ๊ณ„๋ฅผ ๊ทน๋ณตํ•œ ๋ชจ๋ธ์ด Sequence-to-Sequence(Seq2Seq) ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.


RNN Encoder-Decoder

  • Structure of a Seq2Seq model:
    • Encoder | Compresses a variable-length input sequence into a fixed-length vector
    • Decoder | Expands that vector back into a variable-length output sequence
  • Three representative papers are [10] [11] [12]:
    • Kalchbrenner & Blunsom (2013) | CSM encoder + RNN decoder
    • Cho et al. (2014) | Two-RNN architecture (both encoder and decoder are RNNs)
    • Sutskever et al. (2014) | LSTM-based encoder-decoder, mitigating the vanishing-gradient problem

fig4 RNN encoder-decoder architecture. The encoder's hidden states \(H\) are compressed into a fixed-length context vector \(c\) and then passed to the decoder [11].


RNN State Equations

  • Define a variable-length input \(X = \{x_1, x_2, \ldots, x_{T_x}\}\) and output \(Y = \{y_1, y_2, \ldots, y_{T_y}\}\).
  • The encoder's hidden state is computed recursively:
\[h_t = f_{\text{enc}}(h_{t-1}, x_t)\]
  • A simple RNN unit makes this concrete with the following nonlinear function:
\[h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)\]
  • The context vector \(c\) is defined as a function of the hidden states:
\[c = q(H), \quad H = \{h_1, h_2, \ldots, h_{T_x}\}\]
  • The simplest choice is \(c = h_{T_x}\), that is, using the last hidden state as it is.

  • The decoder also follows a recurrence:

\[s_t = f_{\text{dec}}(s_{t-1}, y_{t-1}, c)\]
  • The training objective is to maximize the log-likelihood:
\[\log p(Y) = \sum_{t=1}^{T_y} \log p(y_t \mid y_{1:t-1}) = \sum_{t=1}^{T_y} \log f_{\text{dec}}(s_{t-1}, y_{t-1}, c)\]
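
A minimal sketch of the encoder recurrence, assuming the simple tanh unit above, a zero initial state, and the choice \(c = h_{T_x}\). The weights are toy values chosen for readability, not trained parameters.

```python
import math

def rnn_step(h_prev, x, W_hh, W_xh):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t) for a single time step."""
    return [
        math.tanh(
            sum(w * h for w, h in zip(W_hh[i], h_prev))
            + sum(w * xi for w, xi in zip(W_xh[i], x))
        )
        for i in range(len(h_prev))
    ]

def encode(xs, W_hh, W_xh, hidden_size):
    """Run the recurrence over the whole input; return all states H and c = h_{T_x}."""
    h = [0.0] * hidden_size
    states = []
    for x in xs:
        h = rnn_step(h, x, W_hh, W_xh)
        states.append(h)
    return states, states[-1]

# Hypothetical toy weights: hidden size 2, input size 2.
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_xh = [[1.0, 0.0], [0.0, 1.0]]
short_H, short_c = encode([[1.0, 0.0]], W_hh, W_xh, 2)
long_H, long_c = encode([[1.0, 0.0]] * 5, W_hh, W_xh, 2)
```

Whether the input has one token or five, the context vector has the same size, so longer inputs have to fit more information into the same number of values.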


The Fixed-Vector Bottleneck

  • The RNN encoder-decoder framework was powerful, but it had a major limitation.
    • The longer the sentence, the more information had to be squeezed into the fixed-size context vector \(c\), and long-range dependencies were lost.
    • Cho et al. (2014) confirmed experimentally that BLEU scores degrade sharply as sentence length grows.
  • The answer that broke this bottleneck was the attention mechanism.


Attention

  • Key question | Instead of compressing everything into a fixed vector, could the decoder look up specific parts of the encoder whenever it needs them?


The Emergence of Attention in NMT

  • Bahdanau et al. (2014), in Neural Machine Translation by jointly learning to align and translate, were the first to successfully apply a differentiable attention layer to NMT [13].
    • In the paper's words, the model automatically (soft-)searches for "parts of a source sentence that are relevant to predicting a target word".
  • Each decoder hidden state \(s_i\) has its own context vector \(c_i\), which is a weighted sum of all encoder hidden states:
\[s_i = f_{\text{dec}}(s_{i-1}, y_{i-1}, c_i), \quad c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j\]
  • The attention weights \(\alpha_{ij}\) are softmax-normalized alignment scores:
\[\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \quad e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)\]
  • \(\alpha_i\) is the alignment vector; it determines which parts of the encoder the decoder attends to, and how strongly.
  • The model parameters \(v_a, W_a, U_a\) are learned end-to-end.


Organizing the Dimensions of Attention

  • Luong et al. (2015) simplified Bahdanau's idea and systematized the different forms of attention [14].

fig5 Attention type 1. Global: attends to all source states [14]

fig6 Attention type 2. Local: attends to only some states [14]

  • Dimension 1 (scope): Global vs Local Attention
    • Global | Attends to every encoder hidden state (window bounds \(a=1, b=T_x\))
    • Local | Attends only to a fixed window
  • Dimension 2 (score function): Alignment Score Function
    • Three main score functions:
\[e_{ij} = \text{score}(h_j, s_{i-1}) = \begin{cases} h_j^\top s_{i-1} & \text{dot} \\ h_j^\top W_a s_{i-1} & \text{general} \\ v_a^\top \tanh(W_a s_{i-1} + U_a h_j) & \text{additive (Bahdanau)} \end{cases}\]
  • The form the Transformer later adopts is dot-product attention, because the inner product of two vectors is a natural measure of similarity.

  • ์ฐจ์› 3: ๊ด€์‹ฌ ๋ณ€์ˆ˜์˜ ์ถœ์ฒ˜ โ€” Cross vs Self Attention
    • Query(Q), Key(K), Value(V) ๋Š” ์ •๋ณด ๊ฒ€์ƒ‰(Information Retrieval)์—์„œ ์ฐจ์šฉ๋œ ๊ฐœ๋…์ž…๋‹ˆ๋‹ค:
      • Query ใ…ฃ ์‚ฌ์šฉ์ž๊ฐ€ ์ฐพ๋Š” ๊ฒƒ
      • Key ใ…ฃ ๊ฒ€์ƒ‰ ๋Œ€์ƒ์˜ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ
      • Value ใ…ฃ ์‹ค์ œ๋กœ ๋ฐ˜ํ™˜๋˜๋Š” ๋‚ด์šฉ
    • Cross-Attention ใ…ฃ Query๋Š” ํ•œ ์ง‘ํ•ฉ์—์„œ, KeyยทValue๋Š” ๋‹ค๋ฅธ ์ง‘ํ•ฉ์—์„œ (Bahdanau์™€ ๋™์ผ)
    • Self-Attention ใ…ฃ Query, Key, Value ๋ชจ๋‘ ๊ฐ™์€ ์ง‘ํ•ฉ์—์„œ
  • Self-Attention์„ NLP์— ์ตœ์ดˆ๋กœ ์ ์šฉํ•œ ๊ฒƒ์€ Cheng et al. (2016)์˜ โ€œLSTM-Networks for Machine Readingโ€์œผ๋กœ ์•Œ๋ ค์ ธ ์žˆ์Šต๋‹ˆ๋‹ค 15.
    • ์‹œํ€€์Šค๊ฐ€ ์ž๊ธฐ ์ž์‹ ์˜ ์–ด๋А ๋ถ€๋ถ„์— ์ฃผ๋ชฉํ• ์ง€๋ฅผ ๊ฒฐ์ •ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.


Transformer

  • Key question | Can we remove recurrence (RNNs) entirely and model sequences with attention alone?

  • In 2017, Vaswani et al. proposed exactly this in Attention is all you need [16].
    • Verbatim: "We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely."
  • Why is this a good idea?
    • The sequential nature of RNNs prevents training from being parallelized.
    • Attention can be computed in parallel. The premise is that if a model can be scaled up, it can become good enough.
    • Indeed, a model trained for 12 hours on eight P100 GPUs reached the state-of-the-art translation quality of the time.


์•„ํ‚คํ…์ฒ˜

fig7 Transformer ์•„ํ‚คํ…์ฒ˜. Encoder์™€ Decoder ๋ชจ๋‘ Positional Encoding๊ณผ Multi-head Self-attention์„ ์‚ฌ์šฉํ•œ๋‹ค 16.

  • Transformer๋Š” Encoder-decoder ๊ตฌ์กฐ๋ฅผ ์œ ์ง€ํ•˜๋˜, ๋‚ด๋ถ€๋ฅผ ๋ชจ๋‘ Attention์œผ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค:
    • Positional Encoding ใ…ฃ Attention ์ž์ฒด์—” ์ˆœ์„œ ์ •๋ณด๊ฐ€ ์—†์œผ๋ฏ€๋กœ, ์ž…๋ ฅ ๋ฒกํ„ฐ์— ์œ„์น˜ ์˜์กด์  ์ •๋ณด๋ฅผ ๋”ํ•จ
    • Multi-Head Self-Attention (Encoder) ใ…ฃ ์ž…๋ ฅ ์‹œํ€€์Šค ๋‚ด๋ถ€์˜ ์˜์กด์„ฑ ํฌ์ฐฉ
    • Masked Multi-Head Self-Attention (Decoder) ใ…ฃ ๋””์ฝ”๋”ฉ ์‹œ ๋ฏธ๋ž˜ ํ† ํฐ์„ ๊ฐ€๋ ค Autoregressive ๊ตฌ์กฐ ์œ ์ง€
    • Cross-Attention (Encoderโ€“Decoder) ใ…ฃ Bahdanau ์Šคํƒ€์ผ๋กœ ์ธ์ฝ”๋” ์ถœ๋ ฅ์„ ๋””์ฝ”๋”์—์„œ ์ฐธ์กฐ
    • Layer Normalization + Residual Connection ใ…ฃ ๊ธฐ์กด ๊ธฐ๋ฒ•์„ ๊ทธ๋Œ€๋กœ ์ฐจ์šฉ

Positional Encoding ์ด๋ž€?
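
As a hedged illustration of the positional encoding item above, the sketch below computes the fixed sinusoidal encoding described in the Transformer paper: even feature indices use a sine, odd indices a cosine, of the position divided by a power of 10000. The function name and the toy sizes are illustrative, and learned positional embeddings are a common alternative.

```python
import math

def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table that is added to the input embeddings."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Feature pairs (2k, 2k+1) share one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = sinusoidal_positional_encoding(num_positions=4, d_model=6)
```

Each position gets a distinct vector, so identical tokens at different positions no longer look the same to the attention layers.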


Scaled Dot-Product Attention

  • The core operation of the Transformer is scaled dot-product attention.
    • For a query matrix \(Q \in \mathbb{R}^{M \times D_k}\), a key matrix \(K \in \mathbb{R}^{N \times D_k}\), and a value matrix \(V \in \mathbb{R}^{N \times D_v}\):
\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{D_k}}\right) V\]
  • This is Luong's dot-product attention with only a scaling factor of \(1/\sqrt{D_k}\) added, packaged in matrix form.
    • The matrix form lets the same operation be computed in parallel for many samples at once.
  • How Q/K/V are used in the Transformer:
    • Encoder Self-attention | Q, K, and V are all derived from the same input sequence
    • Decoder Self-attention | Q, K, and V are all derived from the output sequence
    • Encoder-Decoder Attention | Q comes from the decoder state, K and V from the encoder output
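
A plain-Python sketch of the formula above, written row by row rather than with matrix libraries so that each step is visible. The optional `causal` flag is an illustrative way to show the decoder-side masking; the learned projections that would normally produce Q, K, and V are omitted.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(D_k)) V for Q: M x D_k, K: N x D_k, V: N x D_v."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Hide future positions so position i only attends to positions <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[d] for wj, v in zip(w, V)) for d in range(len(V[0]))])
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy sequence of 3 token vectors
# Self-attention: Q, K, V all come from the same sequence.
out, attn = scaled_dot_product_attention(X, X, X, causal=True)
```

Passing a different sequence for `K` and `V` than for `Q` turns the same function into cross-attention.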


Multi-Head Attention

  • ๋‹จ์ผ Attention ๋Œ€์‹  ์—ฌ๋Ÿฌ ๊ฐœ Attention์„ ๋ณ‘๋ ฌ๋กœ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
    • ๊ฐ Head๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ํŒŒ๋ผ๋ฏธํ„ฐ ์ง‘ํ•ฉ \(\{W_a, U_a, v_a\}_{a=1, \ldots, A}\)๋ฅผ ๊ฐ€์ง‘๋‹ˆ๋‹ค.
    • ๊ฐ Head๋Š” ์–ธ์–ด์˜ ์„œ๋กœ ๋‹ค๋ฅธ ์ธก๋ฉด์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค (๊ฒฝํ—˜์ ์œผ๋กœ).
  • ์„ฑ๋Šฅ ๋ฐ ํšจ์œจ
    • ๊ธฐ์กด ConvS2S Ensemble์€ ์˜์–ดโ†’ํ”„๋ž‘์Šค์–ด ํ•™์Šต์— ์•ฝ \(1.2 \times 10^{21}\) FLOPs๊ฐ€ ํ•„์š”ํ–ˆ์Šต๋‹ˆ๋‹ค.
    • Transformer๋Š” \(3.3 \times 10^{18}\) FLOPs๋กœ ๋™์ผ ์ˆ˜์ค€์˜ BLEU๋ฅผ ๋‹ฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค.
    • ์•ฝ 360๋ฐฐ ๊ณ„์‚ฐ ์ ˆ๊ฐ. ๋‹จ์ˆœํžˆ ์„ฑ๋Šฅ์ด ์ข‹์•„์ง„ ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ๋ชจ๋ธ๋ง ์ •ํ™•๋„์™€ ํ™•์žฅ์„ฑ์˜ ํŒŒ๋ ˆํ†  ๊ฒฝ๊ณ„๋ฅผ ์˜ฎ๊ฒผ๋‹ค๋Š” ๋ฐ์— ์˜์˜๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.
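
To make the multi-head idea concrete, here is a small sketch under simplifying assumptions: two heads, each with its own hypothetical Q/K/V projection matrices, followed by concatenation and an output projection `W_o`. The matrices are hand-picked so that each head looks at a different input feature; real heads learn their projections.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(wj * v[d] for wj, v in zip(w, V)) for d in range(len(V[0]))])
    return out

def multi_head_self_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) projection triples, one per head."""
    per_head = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                for W_q, W_k, W_v in heads]
    # Concatenate head outputs along the feature axis, then mix them with W_o.
    concat = [sum((head[t] for head in per_head), []) for t in range(len(X))]
    return matmul(concat, W_o)

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                   # 3 tokens, model dim 2
head_a = ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]])  # projects onto feature 0
head_b = ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]])  # projects onto feature 1
W_o = [[1.0, 0.0], [0.0, 1.0]]
Y = multi_head_self_attention(X, [head_a, head_b], W_o)
```

Because the heads are independent of one another, they can be computed in parallel.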


์ƒ์„ฑ์  ์‚ฌ์ „ํ•™์Šต Generative Pre-training

  • Transformer ์•„ํ‚คํ…์ฒ˜๋งŒ์œผ๋กœ๋Š” ์˜ค๋Š˜์˜ LLM์ด ์™„์„ฑ๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.
    • ์› Transformer๋Š” ์ตœ๋Œ€ 2.13M ํŒŒ๋ผ๋ฏธํ„ฐ, WMT 2014 ๋ฐ์ดํ„ฐ์…‹(์•ฝ 3,600๋งŒ ๋ฌธ์žฅ) ๊ทœ๋ชจ์˜€์Šต๋‹ˆ๋‹ค.
    • ํ•™์Šต ๋ฐฉ์‹์˜ ์ง„ํ™”๊ฐ€ ์žˆ์–ด์•ผ ์˜ค๋Š˜๋‚  ์‚ฌ๋žŒ๋“ค์ด ๋งŒ๋‚˜๋Š” LLM์ด ๊ฐ€๋Šฅํ–ˆ์Šต๋‹ˆ๋‹ค.
  • OpenAI๊ฐ€ GPT ์‹œ๋ฆฌ์ฆˆ๋ฅผ ํ†ตํ•ด ์ œ์‹œํ•œ ์„ธ ๋‹จ๊ณ„ ํ›ˆ๋ จ ํŒŒ์ดํ”„๋ผ์ธ์ด ๋Œ€ํ‘œ์ ์ž…๋‹ˆ๋‹ค:
    • Generative Pre-training ใ…ฃ ๋Œ€๋Ÿ‰ Unlabeled ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ Next-word Prediction
    • Discriminative Fine-tuning ใ…ฃ ํŠน์ • Task์— ๋Œ€ํ•œ ์ง€๋„ ํ•™์Šต
    • RLHF ใ…ฃ ์ธ๊ฐ„ ํ”ผ๋“œ๋ฐฑ ๊ธฐ๋ฐ˜ ๊ฐ•ํ™” ํ•™์Šต


GPT | Pre-training + Fine-tuning

  • In 2018, OpenAI published Improving Language Understanding by Generative Pre-Training [17].
    • Pre-train a Transformer on as much unlabeled data as possible, then fine-tune it for a task with a small amount of labeled data.
  • The pre-training objective (the autoregressive framework, unchanged)
\[L_{\text{GPT}}(\Theta) = \sum_{t=1}^{T} \log p_\Theta(w_t \mid w_{t-N:t-1})\]
  • Because no labels are needed, the model can be trained on vast amounts of data.


BERT | Masked Language Model

  • Because GPT is autoregressive from left to right, it is at a disadvantage on downstream tasks that need bidirectional context.
  • In 2019, Google AI proposed BERT [18]. Its core is the Masked Language Model (MLM) objective.
    • A set of positions \(M \subset \{1, \ldots, T\}\) in the input \(w_{1:T}\) is masked at random, and the masked tokens are predicted from the unmasked tokens \(w_{\neg M}\):
\[L_{\text{MLM}}(\Theta) = \sum_{i \in M} \log p_\Theta(w_i \mid w_{\neg M})\]
  • This lets the model use left and right context at the same time.
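
The data side of the MLM objective can be sketched as follows. The masking rate of 0.15 and the `[MASK]` token are illustrative choices, and the function keeps only the basic "hide and predict" idea; it does not reproduce every detail of BERT's masking procedure.

```python
import random

def mask_tokens(tokens, mask_prob, rng, mask_token="[MASK]"):
    """Pick a random position set M, hide those tokens, and return the targets to predict."""
    masked_positions = [i for i in range(len(tokens)) if rng.random() < mask_prob]
    if not masked_positions:  # guarantee at least one prediction target
        masked_positions = [rng.randrange(len(tokens))]
    corrupted = [mask_token if i in masked_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return corrupted, targets

tokens = "intellectual freedom depends upon material things".split()
corrupted, targets = mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0))
```

The model sees `corrupted` in full, including the words to the right of each mask, and is scored only on the positions stored in `targets`.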


Discriminative Fine-tuning

  • Pre-training alone is not enough to apply a model to real use cases.
    • ex. Given the question "I am having trouble getting a date. Any advice?", a model that only predicts the next word may produce an odd answer such as "You'll never find true love!".
  • So pre-training is followed by supervised fine-tuning on task-specific data:
\[L_{\text{DFT}}(\Theta) = \sum_{(x_{1:T}, y)} \log p_\Theta(y \mid x_{1:T})\]
  • To avoid losing pre-trained knowledge, the two objectives are sometimes combined with a weight:
\[L_{\text{final}} = L_{\text{DFT}} + \lambda L_{\text{GPT}}\]
  • That said, from GPT-2 and GPT-3 onward, zero-shot and few-shot performance began to emerge without fine-tuning, and the scale of pre-training itself became far more important.


Alignment

  • Examples of aligned behavior: not lying, not telling racist jokes, not making sexual remarks

    • The autoregressive framework has no such constraints built in.

    • Some properties can be addressed with fine-tuning datasets (ETHICS, RealToxicityPrompts, and so on), but many values are hard even to define, which makes building datasets for them hard.

  • RLHF (Reinforcement Learning from Human Feedback) emerged as the answer.

    • It was originally a method for RL problems whose reward function is hard to specify [19].


RLHF

  • The three stages of RLHF
    • (1) Preference data collection | Generate several candidate responses and have people label their preferences as a ranking
    • (2) Reward model training | Learn a reward function that predicts which response people prefer
    • (3) RL policy fine-tuning | Fine-tune the LLM with a standard RL algorithm such as PPO (Proximal Policy Optimization) [20], using the reward model's signal
  • Applications
    • GPT-2 | "Fine-tuning language models from human preferences"
    • GPT-3 | "Training language models to follow instructions with human feedback"
    • GPT-4 | The official technical report also states that RLHF was used
  • Anthropic set out the HHH criteria, "helpful, honest, harmless", and has explored a range of alignment techniques such as imitation learning, binary discrimination, and ranked preference modeling.
    • Alignment nevertheless remains an open problem.
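
For stage (2), one commonly used formulation trains the reward model on pairs of responses with a logistic (Bradley-Terry style) loss on the difference of their scalar rewards. The sketch below shows only that loss; it is an assumption about the setup, not a detail given in this post, and the reward values are made up.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected): small when the preferred response scores higher."""
    margin = reward_chosen - reward_rejected
    # Numerically stable log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

agree = preference_loss(2.0, -1.0)     # reward model ranks the pair like the human did
disagree = preference_loss(-1.0, 2.0)  # reward model ranks the pair the other way round
tie = preference_loss(0.5, 0.5)        # reward model cannot tell them apart
```

Minimizing this loss over many labeled pairs pushes the reward model to score human-preferred responses higher, and stage (3) then optimizes the LLM against that learned reward.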


Bitter Lesson | Scale Wins [21]

  • Richard Sutton's well-known blog post The Bitter Lesson argues that the history of AI shows, again and again, that general, scalable methods that leverage computation win out over methods built on domain knowledge.
  • Even chain-of-thought reasoning shows gains only in models of roughly 100B parameters or more [22].
  • This is the empirical observation that expert domain knowledge and hand-crafted features lose out to raw computation and learned representations.

  • That does not mean all expertise is pointless.
    • AlphaFold [23] combined black-box deep learning with prior biological knowledge (evolutionarily related sequences, 3D coordinates of homologous proteins) to reach near-experimental accuracy in protein structure prediction.
    • Combining powerful machine learning with domain expertise is still a sound strategy.
  • In a 2024 BBC interview, Hinton argued that LLMs really do understand natural language.
    • In his view, LLMs are also our current best theory of how the brain understands language.


Summary

  • The 40-year academic lineage, summarized in one line per milestone:
    • 1980s | Distributed representations and back-propagation (Rumelhart, Hinton)
    • 2003 | Bengio: neural probabilistic language model based on distributed representations
    • 2012 | AlexNet: the start of large-scale neural network training
    • 2013 | Word2Vec: scalable embedding learning
    • 2014 | Seq2Seq, RNN encoder-decoder
    • 2014 | Bahdanau: attention in NMT
    • 2015 | Luong: organizing the forms of attention
    • 2017 | Transformer: "Attention is all you need"
    • 2018 | GPT: generative pre-training
    • 2019 | BERT: bidirectional Masked Language Model
    • 2017~ | RLHF: alignment based on human preferences
  • Today's LLMs
    • OpenAI GPT family (GPT-1 ~ GPT-4)
    • Google Gemini, PaLM, LaMDA, Gopher, BERT
    • Anthropic Claude (Haiku, Sonnet, Opus)
    • Meta LLaMA
    • Open-weight: DeepSeek-R1 and others
    • At their core, all are large-scale pre-trained models in the Transformer family, based on next-word prediction
  • Growth in size and scale
    • GPT-1 | ~117M parameters
    • GPT-2 | ~1.5B
    • GPT-3 | ~175B
    • Gopher (2021) | 280B
    • PaLM (2022) | 540B
    • Current models are estimated to have reached the trillion-parameter scale.


Conclusion

  • LLMs are not a single breakthrough but the accumulated result of some 40 years of research.
    • Each step was close to the minimal change needed to overcome a limitation of the previous one, and combined with the power of scale, these steps produced today's performance.
  • Two principles behind LLMs:
    • Attention | The evolution of ways to get past the limits of the context window: fixed vector (RNN) → weighted sum (Bahdanau) → parallelizable dot-product (Transformer)
    • Bitter Lesson | Simple ideas win when trained at scale
  • Looking at today's LLMs again:
    • The autoregressive framework is the same as in Bengio 2003.
    • What differs is the scale, the data, the pre-training procedure, and alignment (RLHF).
  • Many open challenges remain:
    • Why does scale work? There is no clear theoretical explanation.
    • How does alignment generalize? It is an open problem.
    • At what point will the next paradigm be needed? No one knows.



References

  1. A History of Large Language Models (2025).

  2. Brown, Peter F., et al. "A statistical approach to machine translation." Computational Linguistics 16.2 (1990): 79-85.

  3. Markov, Andrey. "Example of a statistical investigation of the text Eugene Onegin concerning the connection of samples in chains." (1913).

  4. Bengio, Yoshua, et al. "A neural probabilistic language model." Journal of Machine Learning Research 3 (2003): 1137-1155.

  5. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems 25 (2012).

  6. Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).

  7. Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems 26 (2013).

  8. Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." NAACL-HLT (2013).

  9. Peters, Matthew E., et al. "Deep contextualized word representations." NAACL-HLT (2018).

  10. Kalchbrenner, Nal, and Phil Blunsom. "Recurrent continuous translation models." EMNLP (2013).

  11. Cho, Kyunghyun, et al. "Learning phrase representations using RNN encoder–decoder for statistical machine translation." EMNLP (2014).

  12. Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NeurIPS (2014).

  13. Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

  14. Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015).

  15. Cheng, Jianpeng, Li Dong, and Mirella Lapata. "Long short-term memory-networks for machine reading." EMNLP (2016).

  16. Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).

  17. Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).

  18. Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." NAACL-HLT (2019).

  19. Christiano, Paul F., et al. "Deep reinforcement learning from human preferences." Advances in Neural Information Processing Systems 30 (2017).

  20. Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).

  21. Sutton, Richard. "The bitter lesson." (2019).

  22. Wei, Jason, et al. "Chain-of-thought prompting elicits reasoning in large language models." Advances in Neural Information Processing Systems 35 (2022).

  23. Jumper, John, et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596.7873 (2021): 583-589.

This post is licensed under CC BY 4.0 by the author.