Session 10 — Sentiment Analysis: TF–IDF vs Word2Vec
Goal: build an end-to-end sentiment analysis system and compare a sparse representation (TF–IDF) against dense representations (Word2Vec) built with mean pooling and TF–IDF weighted pooling.
Learning Outcomes: (1) understand the difference between sparse and dense text features; (2) train a small Word2Vec model and derive document vectors from it; (3) train classifiers (LogReg/KNN) and evaluate them; (4) perform error analysis and recommend improvements.
1) Core Concepts
- TF–IDF: sparse, interpretable features; strong for short n-grams.
- Word2Vec: distributional learning (CBOW/Skip-gram) that produces dense word vectors capturing semantic similarity.
- Document pooling: a simple average (mean) or a TF–IDF weighted average composes a document vector from word vectors; both schemes are written out below.
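In symbols, for a document with in-vocabulary tokens $w_1,\dots,w_n$ and word vectors $v_{w_i}$, the two pooled document vectors implemented in section C are

$$v_{\text{mean}} = \frac{1}{n}\sum_{i=1}^{n} v_{w_i}, \qquad v_{\text{tfidf}} = \frac{\sum_{i=1}^{n} \mathrm{idf}(w_i)\, v_{w_i}}{\sum_{i=1}^{n} \mathrm{idf}(w_i)},$$

where $\mathrm{idf}$ comes from a fitted TfidfVectorizer.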
2) Google Colab Hands-On — TF–IDF & Word2Vec Pipeline
Use the labels from Sessions 6/7, or create weak labels and then correct them quickly. Use a dataset of at least 30–50 examples.
A. Setup & Data
!pip -q install pandas numpy scikit-learn gensim matplotlib
import numpy as np, pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, roc_auc_score, average_precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
# Load labeled data; fall back to weak labels if the Session 7 file is missing
try:
    df = pd.read_csv('logreg_dataset_sessi7.csv')
    assert {'text','y'}.issubset(df.columns)
except Exception:
    # Fallback: build the dataset from the Session 3 variants plus weak lexicon labels
    base = pd.read_csv('corpus_sessi3_variants.csv')['v2_stop_stemID'].dropna().astype(str).tolist()
    POS = {"bagus","mantap","menyenangkan","cepat","baik","excellent","impressive","friendly","tajam","bersih"}
    NEG = {"buruk","lambat","telat","downtime","lemah","weak","late","dented","smelled","failed","delay"}
    def weak_label(t):
        # Label by counting lexicon hits; ties and no-hits stay None and are dropped
        w = set(t.split()); p = len(w & POS); n = len(w & NEG)
        if p > n: return 1
        if n > p: return 0
        return None
    df = pd.DataFrame({'text': base, 'y': [weak_label(t) for t in base]}).dropna()
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['y'].astype(int), test_size=0.25,
    stratify=df['y'].astype(int), random_state=42)
print('Dataset:', df.shape)
B. Baseline: TF–IDF → LogReg
vec = TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.95, sublinear_tf=True, norm='l2')
Xtr = vec.fit_transform(X_train)
Xte = vec.transform(X_test)
lr = LogisticRegression(max_iter=200, class_weight='balanced')
lr.fit(Xtr, y_train)
print('TF–IDF + LogReg Test Report')
print(classification_report(y_test, lr.predict(Xte), digits=3))
print('AP=', round(average_precision_score(y_test, lr.predict_proba(Xte)[:,1]), 3),
      'ROC-AUC=', round(roc_auc_score(y_test, lr.predict_proba(Xte)[:,1]), 3))
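Because a linear model over TF–IDF features is directly interpretable (see the case-study table in section 3), the strongest n-grams can be read straight off the coefficients. A minimal sketch, reusing vec and lr from above:

# Inspect the most influential n-grams of the linear baseline
feat = vec.get_feature_names_out()
coef = lr.coef_[0]
order = np.argsort(coef)
print('Most negative n-grams:', [(feat[i], round(coef[i], 3)) for i in order[:10]])
print('Most positive n-grams:', [(feat[i], round(coef[i], 3)) for i in order[-10:][::-1]])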
C. Training Word2Vec & Deriving Document Vectors
from gensim.models import Word2Vec
# Simple whitespace tokenization (the data was already preprocessed in Session 3)
train_tokens = [t.split() for t in X_train]
w2v = Word2Vec(
    sentences=train_tokens,
    vector_size=200,
    window=5,
    min_count=1,   # keep every token; the corpus is tiny
    sg=1,          # skip-gram (usually the better choice on small data)
    negative=10,
    workers=2,
    epochs=20,
    seed=42
)
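A quick sanity check that the embeddings capture similarity, as claimed in section 1. A sketch; the probe word is an assumption, so pick any token that actually occurs in your corpus:

probe = 'bagus'  # hypothetical probe word, taken from the POS lexicon above
if probe in w2v.wv:
    print(w2v.wv.most_similar(probe, topn=5))
else:
    print(f'"{probe}" is not in the vocabulary; try another token')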
# Pooling helpers
# Build an IDF map from a unigram TfidfVectorizer so we can do tfidf-weighted pooling
vec_idf = TfidfVectorizer(ngram_range=(1,1), min_df=1)
vec_idf.fit(X_train)
idf_map = dict(zip(vec_idf.get_feature_names_out(), vec_idf.idf_))
def doc_vector_mean(tokens):
    # Average the vectors of in-vocabulary tokens; zero vector if none are known
    vs = [w2v.wv[w] for w in tokens if w in w2v.wv]
    return np.mean(vs, axis=0) if len(vs) else np.zeros(w2v.vector_size)

def doc_vector_tfidf(tokens):
    # IDF-weighted average: rarer words contribute more to the document vector
    ws = []; ws_weight = []
    for w in tokens:
        if w in w2v.wv and w in idf_map:
            ws.append(w2v.wv[w])
            ws_weight.append(idf_map[w])
    if not ws:
        return np.zeros(w2v.vector_size)
    ws = np.array(ws); ws_weight = np.array(ws_weight)
    return (ws * ws_weight[:, None]).sum(axis=0) / (ws_weight.sum() + 1e-12)
Xtr_mean = np.vstack([doc_vector_mean(t) for t in train_tokens])
Xte_mean = np.vstack([doc_vector_mean(t.split()) for t in X_test])
Xtr_tfidf= np.vstack([doc_vector_tfidf(t) for t in train_tokens])
Xte_tfidf= np.vstack([doc_vector_tfidf(t.split()) for t in X_test])
print('Shapes:', Xtr_mean.shape, Xtr_tfidf.shape)
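Documents whose tokens are all out-of-vocabulary collapse to the zero vector (see the fallbacks in both pooling functions); it is worth checking how often that happens on the test split. A small diagnostic sketch:

# Count test documents that pooled to the all-zero vector
n_zero = int((np.linalg.norm(Xte_mean, axis=1) == 0).sum())
print(f'{n_zero} of {Xte_mean.shape[0]} test documents have no in-vocabulary tokens')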
D. Classification on Dense (Word2Vec) Vectors
# LogReg on the dense document vectors
lr_mean = LogisticRegression(max_iter=500, class_weight='balanced').fit(Xtr_mean, y_train)
lr_wtd = LogisticRegression(max_iter=500, class_weight='balanced').fit(Xtr_tfidf, y_train)
print('\nWord2Vec mean-pool Test Report')
print(classification_report(y_test, lr_mean.predict(Xte_mean), digits=3))
print('AP=', round(average_precision_score(y_test, lr_mean.predict_proba(Xte_mean)[:,1]), 3),
      'ROC-AUC=', round(roc_auc_score(y_test, lr_mean.predict_proba(Xte_mean)[:,1]), 3))
print('\nWord2Vec tfidf-weighted Test Report')
print(classification_report(y_test, lr_wtd.predict(Xte_tfidf), digits=3))
print('AP=', round(average_precision_score(y_test, lr_wtd.predict_proba(Xte_tfidf)[:,1]), 3),
      'ROC-AUC=', round(roc_auc_score(y_test, lr_wtd.predict_proba(Xte_tfidf)[:,1]), 3))
E. Results Comparison & Quick Visualization
rows = [
    ['TF–IDF + LogReg',    average_precision_score(y_test, lr.predict_proba(Xte)[:,1]),
                           roc_auc_score(y_test, lr.predict_proba(Xte)[:,1])],
    ['W2V mean + LogReg',  average_precision_score(y_test, lr_mean.predict_proba(Xte_mean)[:,1]),
                           roc_auc_score(y_test, lr_mean.predict_proba(Xte_mean)[:,1])],
    ['W2V tfidf + LogReg', average_precision_score(y_test, lr_wtd.predict_proba(Xte_tfidf)[:,1]),
                           roc_auc_score(y_test, lr_wtd.predict_proba(Xte_tfidf)[:,1])],
]
res = pd.DataFrame(rows, columns=['Model','AP','ROC-AUC'])
print(res)
plt.figure(figsize=(7,4))
plt.bar(res['Model'], res['AP'])
plt.title('Average Precision (AP) Comparison')
plt.xticks(rotation=10); plt.tight_layout(); plt.show()
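The comparison above rests on a single small test split. Since StratifiedKFold and cross_val_score were imported in section A, a cross-validated estimate is a more stable read on a dataset this size. A sketch for the TF–IDF baseline, assuming the same hyperparameters as section B:

# 5-fold cross-validated AP for the TF–IDF baseline
from sklearn.pipeline import make_pipeline
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.95, sublinear_tf=True, norm='l2'),
    LogisticRegression(max_iter=200, class_weight='balanced'))
scores = cross_val_score(pipe, df['text'], df['y'].astype(int), cv=cv, scoring='average_precision')
print('CV AP: %.3f +/- %.3f' % (scores.mean(), scores.std()))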
F. (Optional) KNN on Dense Vectors
knn = KNeighborsClassifier(n_neighbors=5, metric='cosine')
knn.fit(Xtr_tfidf, y_train)
print('\nKNN(cosine) pada W2V tfidf-weighted — Test Report')
print(classification_report(y_test, knn.predict(Xte_tfidf), digits=3))
G. Save Artifacts
import joblib
joblib.dump(vec, 'tfidf_vec_sessi10.joblib')
joblib.dump(lr, 'tfidf_logreg_sessi10.joblib')
w2v.save('word2vec_sessi10.model')  # use gensim's native save for Word2Vec models
joblib.dump(lr_wtd, 'w2v_tfidf_logreg_sessi10.joblib')
print('Artifacts saved: tfidf_vec_sessi10.joblib, tfidf_logreg_sessi10.joblib, word2vec_sessi10.model, w2v_tfidf_logreg_sessi10.joblib')
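To reuse the artifacts later (e.g., in a fresh Colab session), a minimal reload-and-predict sketch; the sample sentence is a hypothetical input:

# Reload the saved artifacts and score one new text with the TF–IDF baseline
import joblib
from gensim.models import Word2Vec
vec_loaded = joblib.load('tfidf_vec_sessi10.joblib')
lr_loaded = joblib.load('tfidf_logreg_sessi10.joblib')
w2v_loaded = Word2Vec.load('word2vec_sessi10.model')
sample = ['pelayanan cepat dan ramah']  # hypothetical preprocessed input
print('P(positive) =', lr_loaded.predict_proba(vec_loaded.transform(sample))[:, 1])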
3) Case Studies & Analysis
| Case | Approach | Notes |
|---|---|---|
| E-commerce reviews | TF–IDF(1,2) + LogReg | Stable on small data, interpretable (top coefficients) |
| Social media opinions | W2V tfidf-weighted + LogReg | More robust to synonymy and spelling variation |
| Mixed language / typos | Add char n-grams (Session 4) or fastText | Subword units help (beyond the scope of this session) |
4) Mini Assignment (Graded)
- Train Word2Vec (200D) on your own corpus; compare mean pooling vs TF–IDF weighted pooling against the TF–IDF baseline using **AP** & **ROC-AUC**.
- Analyze the 5 largest errors of your best model: show the text, prediction, probability, and suspected cause (a starter sketch follows this list).
- (Optional) Try **KNN(cosine)** on the dense vectors and compare its F1 against LogReg.
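A minimal starter for the error-analysis item, assuming lr_wtd on Xte_tfidf is your best model; swap in whichever model and features win in your runs:

# Rank test documents by how far the predicted probability is from the true label
proba = lr_wtd.predict_proba(Xte_tfidf)[:, 1]
err = pd.DataFrame({'text': X_test.values, 'y': y_test.values, 'p_pos': proba})
err['abs_error'] = (err['p_pos'] - err['y']).abs()
worst = err.sort_values('abs_error', ascending=False).head(5)
print(worst[['text', 'y', 'p_pos']])  # inspect these and note a suspected cause for each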