Step 13 Attention: FastText Fine-Tuning

Focusing on important words + subword learning (solving OOV)

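1. Is every word equally important?

An LSTM reads the whole sentence, but it has no explicit mechanism for weighting one word more than another. In practice, only a few words are decisive!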

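❌ The limitation of LSTM

[Figure: example sentence ending in "not bad but amazing", with the words "The movie", "not", and "amazing" shown separately. LSTM: every word is treated equally 😢]

Accurate classification requires focusing on key words such as "not" and "amazing"!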

2. Attention: Focus on the Important Words

The attention mechanism assigns a weight to each word: large for important words, small for less important ones!

🎯 The core idea of attention

Attention equations:

1. Score: the importance of each hidden state $h_i$
$$ e_i = W \cdot h_i + b $$
2. Softmax normalization: so that the weights sum to 1
$$ \alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)} $$
3. Weighted average: produces the context vector
$$ c = \sum_i \alpha_i h_i $$
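
To make the three steps concrete, here is a minimal sketch in plain Python for a single sequence. The hidden states, the weight vector `w`, and the bias `b` are made-up toy values (assumptions for illustration only); in a real model they come from the LSTM and from training.

```python
import math

def attention_pool(hidden_states, w, b):
    """Score -> softmax -> weighted average for one sequence."""
    # 1. Score: e_i = w . h_i + b (one scalar per time step)
    scores = [sum(wk * hk for wk, hk in zip(w, h)) + b for h in hidden_states]
    # 2. Softmax: subtracting the max keeps exp() numerically stable
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # 3. Weighted average: c = sum_i alpha_i * h_i
    dim = len(hidden_states[0])
    context = [
        sum(a * h[k] for a, h in zip(weights, hidden_states))
        for k in range(dim)
    ]
    return weights, context

# Toy hidden states for a 3-token sentence (hypothetical numbers)
hidden_states = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
w = [1.0, 1.0]
b = 0.0

weights, context = attention_pool(hidden_states, w, b)
print(weights)  # the second token gets the largest weight
print(context)  # context vector, pulled toward the second hidden state
```

Because the second hidden state has the highest score, the context vector ends up closest to it. That is all "focusing" means here.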

3. FastText: Subword Learning

Unlike Word2Vec and GloVe, FastText learns vectors for subwords (character n-grams) as well as for whole words. Why does that help?
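
A rough way to see it: FastText wraps each word in `<` and `>` boundary markers and splits it into character n-grams, so an unseen word still shares many n-grams with words the model does know. The sketch below only illustrates that splitting. The helper `char_ngrams` is a hypothetical name, the 3 to 6 range mirrors the library's default `minn`/`maxn`, and the real library additionally hashes n-grams into a fixed number of buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

known = set(char_ngrams("running"))
unseen = set(char_ngrams("runninggggg"))

# N-grams that the misspelled word shares with the known word
shared = known & unseen
print(sorted(shared))
```

The shared n-grams (for example `<ru`, `run`, `ing`) are what allow a vector to be estimated for a word that never appeared in training.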

🔤 The core idea of FastText

# Load a pretrained FastText model
from gensim.models.fasttext import load_facebook_vectors

# Load the pretrained vectors (returns FastTextKeyedVectors)
# Download: https://fasttext.cc/docs/en/crawl-vectors.html
ft = load_facebook_vectors('cc.en.300.bin')

# In-vocabulary word
print(ft['running'].shape)  # (300,)

# Out-of-vocabulary words work too!
print(ft['runninggggg'].shape)  # (300,) ← estimated from subwords!

# Similar words
print(ft.most_similar('running'))
# e.g. [('runner', 0.72), ('run', 0.68), ...]
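
To use these vectors in a PyTorch embedding layer, they have to be stacked into one matrix whose row `i` holds the vector for token id `i`. The sketch below shows one possible way to do that. `build_embedding_matrix`, the `<pad>`/`<unk>` tokens, and `fake_lookup` are all hypothetical names introduced for illustration; the stand-in lookup is there only so the example runs without downloading `cc.en.300.bin`.

```python
import torch

def build_embedding_matrix(vocab, lookup, dim):
    """Stack one vector per vocabulary entry into a [len(vocab), dim] tensor.

    vocab:  list of tokens; a token's position in the list is its id
    lookup: callable that maps a token to a sequence of `dim` floats
    """
    matrix = torch.zeros(len(vocab), dim)
    for idx, token in enumerate(vocab):
        if token == "<pad>":
            continue  # keep the padding row at all zeros
        matrix[idx] = torch.as_tensor(lookup(token), dtype=torch.float32)
    return matrix

# Stand-in for the FastText lookup (assumption: 4-dim toy vectors)
def fake_lookup(token):
    return [float(len(token))] * 4

vocab = ["<pad>", "<unk>", "running", "runninggggg"]
fasttext_weights = build_embedding_matrix(vocab, fake_lookup, dim=4)
print(fasttext_weights.shape)
```

With the real model, the lookup would presumably be `lambda token: ft[token]` with `dim=300`. Because FastText can produce a vector for any token, every row of the matrix gets a meaningful starting point, including words missing from the pretrained vocabulary.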

4. Attention + FastText (Fine-Tuning)

Now let's combine attention with FastText and fine-tune the embeddings!

🏗️ Model architecture

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes, pretrained_embeddings=None):
        super().__init__()
        
        # FastText embeddings (fine-tuned!)
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        if pretrained_embeddings is not None:
            self.embedding.weight.data.copy_(pretrained_embeddings)
            self.embedding.weight.requires_grad = True  # 🔓 Fine-tune!
        
        # Bi-directional LSTM
        self.lstm = nn.LSTM(
            embedding_dim, 
            hidden_dim, 
            bidirectional=True,  # reads the sequence in both directions!
            batch_first=True
        )
        
        # Attention
        self.attention = nn.Linear(hidden_dim * 2, 1)  # *2 because the LSTM is bidirectional
        
        # Classifier
        self.fc = nn.Linear(hidden_dim * 2, num_classes)
        self.dropout = nn.Dropout(0.5)
    
    def forward(self, x):
        # x: [batch, seq_len]
        embedded = self.embedding(x)  # [batch, seq_len, embed_dim]
        
        # LSTM
        lstm_out, _ = self.lstm(embedded)  
        # lstm_out: [batch, seq_len, hidden_dim*2]
        
        # Compute attention weights
        attn_scores = self.attention(lstm_out).squeeze(2)  
        # [batch, seq_len]
        attn_weights = F.softmax(attn_scores, dim=1).unsqueeze(2)  
        # [batch, seq_len, 1]
        
        # Weighted average (context vector)
        context = torch.sum(lstm_out * attn_weights, dim=1)  
        # [batch, hidden_dim*2]
        
        return self.fc(self.dropout(context)), attn_weights.squeeze(2)

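# Create the model
# fasttext_weights is assumed to be a float tensor of shape
# [vocab_size, embedding_dim]; row i holds the FastText vector of token id i.
model = AttentionClassifier(
    vocab_size=20000,
    embedding_dim=300,
    hidden_dim=128,
    num_classes=4,  # AG News has 4 classes
    pretrained_embeddings=fasttext_weights
)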
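
One caveat when batching: `forward` above applies softmax over every position, so padded positions also receive some attention weight. A possible fix, sketched below, is to set the scores of padded positions to negative infinity before the softmax. The function `masked_attention_pool` and the `lengths` tensor are assumptions for illustration and are not part of the model above.

```python
import torch
import torch.nn.functional as F

def masked_attention_pool(lstm_out, attn_scores, lengths):
    """Attention pooling that ignores padded positions.

    lstm_out:    [batch, seq_len, feat] LSTM outputs
    attn_scores: [batch, seq_len] raw attention scores
    lengths:     [batch] number of real tokens per sequence (each >= 1)
    """
    seq_len = lstm_out.size(1)
    positions = torch.arange(seq_len, device=lstm_out.device).unsqueeze(0)
    mask = positions < lengths.unsqueeze(1)  # True where the token is real
    # exp(-inf) is 0, so padded positions get exactly zero weight
    scores = attn_scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=1)
    context = torch.sum(lstm_out * weights.unsqueeze(2), dim=1)
    return context, weights

torch.manual_seed(0)
lstm_out = torch.randn(2, 5, 8)     # toy batch: 2 sequences, 5 steps, 8 features
attn_scores = torch.randn(2, 5)
lengths = torch.tensor([5, 3])      # the second sequence has 2 padded steps

context, weights = masked_attention_pool(lstm_out, attn_scores, lengths)
print(weights)
```

In the second row the last two weights are zero, and the remaining three still sum to 1.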

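✅ Advantages of Attention + FastText

🎯 Focus on important words:
High weights on key words such as "not" and "amazing"
🔤 OOV handling:
Out-of-vocabulary words can still be represented through their subwords
🔓 Fine-tuning:
Embeddings are optimized for the AG News data
🔄 Bidirectional LSTM:
Considers context on both sides → more accurate understanding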
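
When fine-tuning pretrained embeddings, one common practice (an assumption here, not something this step prescribes) is to give the embedding layer a smaller learning rate than the rest of the network, so the pretrained vectors are adjusted gently. The sketch below uses a tiny stand-in model, `ToyModel`, and the learning rates `1e-4` and `1e-3` are illustrative values only.

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Tiny stand-in with an embedding layer and a classifier head."""
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(10, 4)
        self.fc = nn.Linear(4, 2)

toy_model = ToyModel()

# Split the parameters into the embedding layer and everything else
embedding_params = list(toy_model.embedding.parameters())
embedding_ids = {id(p) for p in embedding_params}
other_params = [p for p in toy_model.parameters() if id(p) not in embedding_ids]

# Per-group learning rates: small steps for pretrained vectors, larger elsewhere
optimizer = torch.optim.Adam([
    {"params": embedding_params, "lr": 1e-4},
    {"params": other_params, "lr": 1e-3},
])

for group in optimizer.param_groups:
    print(len(group["params"]), group["lr"])
```

The same split should work for `AttentionClassifier`, since it also exposes its embedding layer as `self.embedding`.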

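🤔 What's next?

Attention improved things a lot! But RNNs/LSTMs still have fundamental limitations:

Remaining problems:
• Sequential processing → time steps cannot be computed in parallel → slow
• Still weak on long-range dependencies
• Only one attention → limited variety of perspectives

In the next step, Step 14, we'll learn self-attention with the Transformer (learnable embeddings)! 🚀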