Session 13: Generative NLP, from n-grams to RNN-LSTM, and a Brief Look at LLMs

Objective: understand classical (n-gram) and neural (RNN-LSTM) language models (LMs) for text generation, measure their quality with perplexity, and place modern LLMs in the context of the NLP pipeline.

Learning Outcomes: (1) Build a trigram LM with smoothing; (2) Train an LSTM-based character/word generator; (3) Measure perplexity and negative log-likelihood; (4) Explain the core LLM concepts (transformer, self-attention, pretraining, fine-tuning) in a structured way.

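1) Core Concepts

Language Model: models \(P(w_1,\dots,w_T)\) or \(P(w_t \mid w_{<t})\); the two are linked by the chain rule \(P(w_1,\dots,w_T)=\prod_{t=1}^{T}P(w_t \mid w_{<t})\).
n-gram LM: approximates \(P(w_t \mid w_{<t})\) by \(P(w_t \mid w_{t-n+1}^{t-1})\), i.e. it conditions only on the previous \(n-1\) tokens. Needs smoothing for rare words and contexts.
RNN-LSTM: captures long-range dependencies through gates; suitable for character-level or word-level modelling.
Perplexity (PP): \(\mathrm{PP}=\exp\big(-\tfrac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_{<i})\big)\), the exponential of the average negative log-likelihood; lower is better.
LLM (brief): transformer architecture, self-attention, pretraining (causal/denoising), instruction fine-tuning, and RLHF/RLAIF.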
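
As a small worked example of the perplexity formula above, the sketch below computes the average negative log-likelihood (NLL) and the perplexity from a list of per-token probabilities. The function name `perplexity_from_probs` and the probability values are made up for illustration; they are not part of the practicum code.

```python
import math

def perplexity_from_probs(probs):
    """Return (average NLL, perplexity) for a list of per-token probabilities."""
    nll = -sum(math.log(p) for p in probs) / len(probs)
    return nll, math.exp(nll)

# Hypothetical probabilities that a model assigned to 4 consecutive tokens.
# A model that always assigns 0.25 is as uncertain as a uniform choice among 4 options.
uniform_nll, uniform_pp = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])
print(round(uniform_nll, 4), round(uniform_pp, 4))

# A more confident (hypothetical) model gets a lower perplexity.
confident_nll, confident_pp = perplexity_from_probs([0.9, 0.8, 0.9, 0.7])
print(round(confident_nll, 4), round(confident_pp, 4))
```

Perplexity can be read as the effective number of equally likely choices per token, which is why the uniform case above comes out at 4.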

2) Google Colab Practice: n-gram LM & LSTM Generator

This project uses the preprocessed corpus from Session 3 (falling back to Session 2). We build a trigram LM with Add-k smoothing and simple backoff, then a lightweight char-level LSTM for text generation.

A. Setup & Data

!pip -q install pandas numpy scikit-learn tensorflow==2.*

import numpy as np, pandas as pd, re, math, random

# 1) Load the corpus (Session 3 variants; fall back to the Session 2 normalized file)
try:
    df = pd.read_csv('corpus_sessi3_variants.csv')
    texts = df['v2_stop_stemID'].dropna().astype(str).tolist()
except Exception:  # e.g. missing file or missing column
    texts = pd.read_csv('corpus_sessi2_normalized.csv')['text'].dropna().astype(str).tolist()

# 2) Prepare sentences (token level) for the n-gram LM and characters for the LSTM
sents = [t for t in texts if len(t.split())>3]
print('Documents for LM:', len(sents))

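# Simple tokenization utilities.
# The special tokens must be distinct, non-empty strings; these conventional markers are assumed.
BOS = '<s>'
EOS = '</s>'
UNK = '<unk>'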

# Build the vocabulary with a minimum frequency threshold
from collections import Counter
cnt = Counter(w for t in sents for w in t.split())
vocab = {w for w,c in cnt.items() if c>=2}

# Replace rare tokens with UNK; pad with two BOS tokens (trigram context) and one EOS
def to_sent_tokens(t):
    toks = [w if w in vocab else UNK for w in t.split()]
    return [BOS,BOS] + toks + [EOS]

corpus_tok = [to_sent_tokens(t) for t in sents]

B. Trigram LM + Add‑k Smoothing & Backoff

k = 0.5  # small Add-k smoothing constant

uni = Counter()
bi  = Counter()
tri = Counter()

for toks in corpus_tok:
    for i in range(2, len(toks)):
        uni[toks[i-1]] += 1
        bi[(toks[i-2], toks[i-1])] += 1
        tri[(toks[i-2], toks[i-1], toks[i])] += 1

V = len(vocab) + 3  # add BOS/EOS/UNK

def prob_trigram(w3, w1, w2):
    # P(w3|w1,w2) with Add-k smoothing; back off to bigram/unigram when the context is unseen.
    # The smoothed denominator is always > 0 (k*V > 0), so test the raw context count instead.
    # This is a simple heuristic backoff, not a fully normalised scheme.
    if bi[(w1,w2)] > 0:
        return (tri[(w1,w2,w3)] + k) / (bi[(w1,w2)] + k*V)
    # backoff → bigram
    if uni[w2] > 0:
        return (bi[(w2,w3)] + k) / (uni[w2] + k*V)
    # backoff → unigram
    return (uni[w3] + k) / (sum(uni.values()) + k*V)

# Compute perplexity on a simple holdout split
random.seed(42)
random.shuffle(corpus_tok)
split = int(0.8*len(corpus_tok))
train_tok = corpus_tok[:split]
valid_tok = corpus_tok[split:]

# (recompute the counts on the training split only, for a fair evaluation)
uni=Counter(); bi=Counter(); tri=Counter()
for toks in train_tok:
    for i in range(2, len(toks)):
        uni[toks[i-1]] += 1
        bi[(toks[i-2], toks[i-1])] += 1
        tri[(toks[i-2], toks[i-1], toks[i])] += 1

V = len(vocab) + 3

def perplexity(sent_tokens):
    ll=0; N=0
    for i in range(2,len(sent_tokens)):
        p = prob_trigram(sent_tokens[i], sent_tokens[i-2], sent_tokens[i-1])
        ll += math.log(p+1e-12)
        N += 1
    return math.exp(-ll/max(1,N))

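# Mean of per-sentence perplexities (a simple summary; not identical to corpus-level perplexity over all tokens)
pp = np.mean([perplexity(t) for t in valid_tok])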
print('Valid perplexity (trigram+backoff):', round(pp,2))

# Sentence sampling
# Candidates: the 200 most frequent in-vocab words, plus EOS/UNK
# (a set has no frequency order, so rank with the Counter instead)
cand_words = [w for w,_ in cnt.most_common() if w in vocab][:200] + [EOS, UNK]

def sample_sentence(max_len=30, temp=1.0):
    w1, w2 = BOS, BOS
    out=[]
    for _ in range(max_len):
        ps = np.array([prob_trigram(w3,w1,w2) for w3 in cand_words])
        ps = ps ** (1.0/temp)   # temperature: <1 sharpens, >1 flattens
        ps = ps/ps.sum()        # renormalise over the candidate set
        w3 = str(np.random.choice(cand_words, p=ps))
        if w3==EOS: break
        out.append(w3)
        w1,w2 = w2,w3
    return ' '.join(out)

for _ in range(3):
    print('Trigram sample:', sample_sentence(temp=0.9))
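
To make the Add-k estimate concrete, here is a minimal, self-contained sketch on a toy corpus. The names `addk_prob`, `toy_tri`, `toy_ctx`, and `toy_vocab`, as well as the toy sentences, are hypothetical and independent of the notebook variables above. Context counts are derived from the trigram counts so that the smoothed probabilities over the vocabulary sum to 1.

```python
from collections import Counter

def addk_prob(w3, context, tri_counts, ctx_counts, vocab, k=0.5):
    """Add-k estimate of P(w3 | context) over a fixed vocabulary."""
    numer = tri_counts[context + (w3,)] + k
    denom = ctx_counts[context] + k * len(vocab)
    return numer / denom

# Hypothetical toy corpus, already tokenised
toy = "saya suka kopi . saya suka teh . saya minum kopi .".split()
toy_vocab = sorted(set(toy))

# Trigram counts, and context counts obtained by summing trigram counts per context
toy_tri = Counter(zip(toy, toy[1:], toy[2:]))
toy_ctx = Counter()
for (w1, w2, _w3), c in toy_tri.items():
    toy_ctx[(w1, w2)] += c

ctx = ("saya", "suka")
dist = {w: addk_prob(w, ctx, toy_tri, toy_ctx, toy_vocab) for w in toy_vocab}
for w, p in dist.items():
    print(w, round(p, 3))
print("total:", round(sum(dist.values()), 6))
```

The two continuations seen after this context ("kopi", "teh") each get 0.3, while every unseen word gets 0.1; a larger k moves more probability mass to the unseen words.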

C. Char-level LSTM Generator (lightweight)

import tensorflow as tf
from tensorflow.keras import layers

# Character dataset (join all texts into one string)
raw_text = '\n'.join(sents)
chars = sorted(list(set(raw_text)))
stoi = {c:i for i,c in enumerate(chars)}
itos = {i:c for c,i in stoi.items()}

seq_len = 120
step = 3
sequences = []
next_chars = []
for i in range(0, len(raw_text)-seq_len, step):
    sequences.append(raw_text[i:i+seq_len])
    next_chars.append(raw_text[i+seq_len])

X = np.zeros((len(sequences), seq_len, len(chars)), dtype=np.float32)
y = np.zeros((len(sequences), len(chars)), dtype=np.float32)
for i,seq in enumerate(sequences):
    for t,ch in enumerate(seq):
        X[i,t,stoi.get(ch,0)] = 1.0
    y[i, stoi.get(next_chars[i],0)] = 1.0

model = tf.keras.Sequential([
    layers.Input((seq_len, len(chars))),
    layers.LSTM(256),
    layers.Dense(len(chars), activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()

history = model.fit(X, y, epochs=5, batch_size=128, verbose=1)

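# Sampling function
def sample_lstm(seed, temp=1.0, length=280):
    out = seed
    seq = seed[-seq_len:]
    for _ in range(length):
        x = np.zeros((1, seq_len, len(chars)), dtype=np.float32)
        # Right-align the context: a seed shorter than seq_len is left-padded with zeros,
        # so the most recent character sits at the last time step, as in training
        offset = seq_len - len(seq)
        for t,ch in enumerate(seq):
            x[0, offset+t, stoi.get(ch,0)] = 1.0
        p = model.predict(x, verbose=0)[0]
        # Cast to float64 before renormalising so np.random.choice accepts the probabilities
        p = np.asarray(p, dtype=np.float64) ** (1.0/temp)
        p = p / np.sum(p)
        idx = np.random.choice(len(chars), p=p)
        ch = itos[idx]
        out += ch
        seq = (seq + ch)[-seq_len:]
    return out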

print('\nLSTM sample (temp=0.8):\n', sample_lstm('pengalaman belanja ', temp=0.8)[:400])
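
Both samplers control randomness by raising the probabilities to the power 1/temp and renormalising. The sketch below isolates that step on a made-up three-way distribution; `apply_temperature` and the values in `base` are illustrative only.

```python
def apply_temperature(probs, temp):
    """Rescale a distribution as p_i ** (1/temp), then renormalise to sum to 1."""
    scaled = [p ** (1.0 / temp) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

base = [0.6, 0.3, 0.1]  # hypothetical next-character probabilities

sharp = apply_temperature(base, 0.5)  # temp < 1: mass concentrates on the top choice
same = apply_temperature(base, 1.0)   # temp = 1: distribution unchanged
flat = apply_temperature(base, 2.0)   # temp > 1: distribution moves toward uniform

for name, dist in (("0.5", sharp), ("1.0", same), ("2.0", flat)):
    print("temp", name, [round(p, 3) for p in dist])
```

Lower temperatures tend to give safer, more repetitive text; higher temperatures give more varied text with more errors.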

D. Brief Comparison & LLM Notes

# Qualitative comparison: trigram vs LSTM
print('\nSummary: the trigram model tends to produce local, common phrases; the LSTM captures longer patterns.')

# LLM notes (brief theory):
notes = {
  'Transformer': 'Self-attention computes global context in parallel.',
  'Pretraining': 'Causal LM (next-token) or denoising; needs large data.',
  'Fine-tuning': 'Instruction-based, task-driven; RLHF/RLAIF for alignment.',
  'Inference': 'Decoding: greedy, beam, top-k, nucleus; controlled via temperature.'
}
print(notes)
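
The decoding strategies named in the notes can be illustrated on a single next-token distribution. This is a minimal sketch, not how any particular LLM library implements them; `top_k_filter`, `nucleus_filter`, and the distribution `next_probs` are hypothetical.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[w] for w in keep)
    return {w: probs[w] / total for w in keep}

def nucleus_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose total mass reaches top_p."""
    kept, mass = [], 0.0
    for w in sorted(probs, key=probs.get, reverse=True):
        kept.append(w)
        mass += probs[w]
        if mass >= top_p:
            break
    return {w: probs[w] / mass for w in kept}

# Hypothetical next-token distribution
next_probs = {"kopi": 0.5, "teh": 0.3, "susu": 0.15, "air": 0.05}

greedy = max(next_probs, key=next_probs.get)  # greedy decoding: always take the argmax
top2 = top_k_filter(next_probs, 2)            # top-k with k=2
nucleus = nucleus_filter(next_probs, 0.9)     # nucleus (top-p) with p=0.9

print(greedy)
print(top2)
print(nucleus)
```

After filtering, a token is drawn from the renormalised distribution; temperature scaling can be applied before the filter.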

E. Save Artifacts

import joblib
joblib.dump({'uni':uni,'bi':bi,'tri':tri,'vocab':list(vocab)}, 'trigram_lm_sessi13.joblib')
model.save('char_lstm_sessi13.h5')
print('Saved: trigram_lm_sessi13.joblib, char_lstm_sessi13.h5')

3) Case Studies & Analysis

Case	Approach	Notes
Internal auto-complete	Trigram LM + smoothing	Lightweight and quick to train for a narrow domain
Product description generator	Char-LSTM/word-LSTM	Needs curation and filtering; suitable for experiments
LLM in production	Transformer (brief)	Needs larger infrastructure and more data; outside the scope of this practicum

4) Mini Assignment (Graded)

Build a trigram LM (Add-k, k∈{0.1,0.5,1.0}); report the validation perplexity and 3 example sentences from sampling (temp∈{0.7,1.0}).
Train a char-LSTM (256 units, 5 epochs) and show 2 text samples at different temperatures; explain the difference.
Write a 1-page summary of the differences between n-gram, LSTM, and LLM (transformer) models in terms of capacity, data requirements, and output control.


5) Ethics & Safety (Brief)

Generative content must be supervised: avoid bias, data leakage, and misleading information.
Apply filters and post-processing (bad-words list, domain-specific regex) to the output.
State the provenance of the training data when publishing; respect copyright.
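
The filtering and post-processing step can be sketched as follows. The block list `BAD_WORDS`, the pattern `PHONE_RE`, and the function `postprocess` are hypothetical placeholders; a real deployment would need a curated list and patterns suited to its domain.

```python
import re

BAD_WORDS = {"bodoh", "goblok"}          # hypothetical block list; replace with a curated one
PHONE_RE = re.compile(r"\b\d{10,13}\b")  # hypothetical pattern: long digit runs such as phone numbers

def postprocess(text):
    """Mask block-listed words and long digit sequences in generated text."""
    def mask_word(match):
        word = match.group(0)
        return "*" * len(word) if word.lower() in BAD_WORDS else word
    text = re.sub(r"\w+", mask_word, text)  # mask block-listed words, case-insensitively
    return PHONE_RE.sub("[REDACTED]", text)  # redact possible personal data

cleaned = postprocess("Penjual bodoh, hubungi 081234567890 sekarang")
print(cleaned)  # Penjual *****, hubungi [REDACTED] sekarang
```

A word list and regex only catch known patterns, so a filter like this complements human review rather than replacing it.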