Session 9 — SVM for Text Classification
Goal: understand maximum margin & hinge loss, apply LinearSVC and SVC (RBF kernel) appropriately to text, tune the hyperparameter C (& gamma for RBF), and compare against a LogReg baseline.
Learning Outcomes: (1) explain the concepts of margin & support vectors; (2) apply LinearSVC to TF–IDF features; (3) test SVC-RBF on an LSA projection; (4) evaluate & interpret important features (linear coefficients) and calibrated PR/ROC curves.
1) Core Concepts
- Maximum margin: SVM seeks the hyperplane that separates the classes with the largest margin.
- Hinge loss: \(\max(0, 1 - y\,w^Tx)\). Regularization is controlled by \(C\): small → wide margin (more regularized), large → focus on training error.
- Linear vs kernel: TF–IDF text is usually close to linearly separable → LinearSVC is very effective & fast. An RBF kernel is useful when working with dense low-dimensional features (e.g., 100-D LSA).
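As a quick illustration of the objective above, a minimal NumPy sketch (hypothetical toy points, labels in {-1, +1}) of the soft-margin objective \(\tfrac{1}{2}\lVert w\rVert^2 + C\sum_i \max(0, 1 - y_i\,w^Tx_i)\):

```python
import numpy as np

# Toy sketch of the soft-margin SVM objective (labels y in {-1, +1}):
# 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * w.x_i)
def svm_objective(w, X, y, C=1.0):
    margins = y * (X @ w)                    # signed functional margins
    hinge = np.maximum(0.0, 1.0 - margins)   # only points inside the margin contribute
    return 0.5 * np.dot(w, w) + C * hinge.sum()

w = np.array([1.0, -1.0])
X = np.array([[2.0, 0.0],    # well outside the margin -> zero hinge loss
              [0.5, 0.0],    # inside the margin       -> hinge loss 0.5
              [0.0, 1.0]])   # exactly on the margin   -> zero hinge loss
y = np.array([1.0, 1.0, -1.0])
print(svm_objective(w, X, y, C=1.0))  # 0.5*2 + 1.0*0.5 = 1.5
```

A small C shrinks the weight of the hinge term (wider margin, more regularization); a large C penalizes margin violations more heavily, which is exactly what the C grids tuned in this session trade off.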
2) Google Colab Practice — LinearSVC & SVC-RBF
Use the labeled dataset from the previous sessions. We compare LinearSVC (TF–IDF features) against SVC-RBF (100-D LSA features).
A. Setup & Data
!pip -q install pandas numpy scikit-learn matplotlib
import numpy as np, pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, roc_auc_score, average_precision_score
from sklearn.calibration import CalibratedClassifierCV
import matplotlib.pyplot as plt
# Load the labeled data
df = None
candidates = ['knn_dataset_sessi6.csv', 'logreg_dataset_sessi7.csv']
for p in candidates:
    try:
        cand = pd.read_csv(p)
        if {'text', 'y'}.issubset(cand.columns):
            df = cand
            break
    except FileNotFoundError:
        pass
if df is None:
    # Fallback: weak labels from keyword lists (as in sessions 6–7)
    base = pd.read_csv('corpus_sessi3_variants.csv')['v2_stop_stemID'].dropna().astype(str).tolist()
    POS = {"bagus","mantap","menyenangkan","cepat","baik","excellent","impressive","friendly","tajam","bersih"}
    NEG = {"buruk","lambat","telat","downtime","lemah","weak","late","dented","smelled","failed","delay"}
    def weak_label(t):
        w = set(t.split()); p = len(w & POS); n = len(w & NEG)
        if p > n: return 1
        if n > p: return 0
        return None  # ambiguous -> dropped below
    df = pd.DataFrame({'text': base, 'y': [weak_label(t) for t in base]}).dropna()
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['y'].astype(int), test_size=0.25, stratify=df['y'].astype(int), random_state=42)
print('Dataset:', df.shape)
B. LinearSVC on TF–IDF
from sklearn.svm import LinearSVC
pipe_lin = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.95, sublinear_tf=True, norm='l2')),
    ('clf', LinearSVC(dual=True))  # dual=True suits n_features >> n_samples
])
param_grid = {
    'clf__C': [0.25, 0.5, 1.0, 2.0],
    'clf__loss': ['hinge', 'squared_hinge']
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search_lin = GridSearchCV(pipe_lin, param_grid, cv=cv, scoring='f1', n_jobs=-1)
search_lin.fit(X_train, y_train)
print('LinearSVC best:', search_lin.best_params_, 'CV F1=', round(search_lin.best_score_,3))
# Calibrate probabilities for PR/ROC
cal_lin = CalibratedClassifierCV(search_lin.best_estimator_, method='sigmoid', cv=5)
cal_lin.fit(X_train, y_train)
y_lin = cal_lin.predict(X_test)
proba_lin = cal_lin.predict_proba(X_test)[:,1]
print('LinearSVC Test Report')
print(classification_report(y_test, y_lin, digits=3))
print('ROC-AUC=', round(roc_auc_score(y_test, proba_lin),3), ' AP=', round(average_precision_score(y_test, proba_lin),3))
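The single AP number printed above summarizes the whole precision–recall curve; `precision_recall_curve` exposes the curve itself. A minimal sketch with hypothetical toy scores standing in for `proba_lin`:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Hypothetical scores standing in for calibrated probabilities (proba_lin)
y_true = np.array([0, 0, 1, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.90])

# prec/rec trace the trade-off as the decision threshold sweeps the scores
prec, rec, thr = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)  # area-like summary of that curve
print(np.round(ap, 3))  # 0.917: one positive (0.35) is ranked below a negative (0.40)
```

Plotting `rec` against `prec` shows where the tuned model loses precision, which is more informative than the scalar AP when choosing an operating threshold.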
C. SVC (RBF) on 100-D LSA Features
from sklearn.svm import SVC
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
pipe_rbf = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.95, sublinear_tf=True, norm='l2')),
    ('svd', TruncatedSVD(n_components=100, random_state=42)),
    ('norm', Normalizer(copy=False)),
    ('clf', SVC(kernel='rbf', probability=True))
])
param_grid_rbf = {
    'clf__C': [1.0, 2.0, 4.0],
    'clf__gamma': ['scale', 0.1, 0.01]
}
search_rbf = GridSearchCV(pipe_rbf, param_grid_rbf, cv=cv, scoring='f1', n_jobs=-1)
search_rbf.fit(X_train, y_train)
print('RBF best:', search_rbf.best_params_, 'CV F1=', round(search_rbf.best_score_,3))
y_rbf = search_rbf.best_estimator_.predict(X_test)
proba_rbf = search_rbf.best_estimator_.predict_proba(X_test)[:,1]
print('RBF Test Report')
print(classification_report(y_test, y_rbf, digits=3))
print('ROC-AUC=', round(roc_auc_score(y_test, proba_rbf),3), ' AP=', round(average_precision_score(y_test, proba_rbf),3))
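The `gamma` grid above controls how quickly similarity decays with distance in the RBF kernel \(k(x,z)=\exp(-\gamma\lVert x-z\rVert^2)\); a tiny numeric sketch with toy vectors:

```python
import numpy as np

# RBF kernel: k(x, z) = exp(-gamma * ||x - z||^2)
def rbf_kernel(x, z, gamma):
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

x = np.array([1.0, 0.0])
z = np.array([0.0, 0.0])   # squared distance ||x - z||^2 = 1
for g in (0.01, 0.1, 1.0):
    print(g, round(rbf_kernel(x, z, g), 3))
# Larger gamma -> similarity drops off faster -> a more wiggly decision
# boundary (risk of overfit); smaller gamma -> smoother, closer to linear.
```

This is why gamma and C are tuned jointly via CV: a large gamma with a large C can memorize the training set, while very small gamma makes the RBF model behave almost like a linear one.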
D. Feature Interpretation (LinearSVC)
# Show the top 20 positive & negative features from the tuned linear model.
# Read them from the refit GridSearchCV pipeline: CalibratedClassifierCV only
# stores an unfitted clone of the estimator, so it has no fitted vocabulary.
vec = search_lin.best_estimator_.named_steps['tfidf']
clf = search_lin.best_estimator_.named_steps['clf']
feat = vec.get_feature_names_out()
coef = clf.coef_.ravel()
ix_pos = coef.argsort()[::-1][:20]
ix_neg = coef.argsort()[:20]
print('\nTop features (+):')
print([(feat[i], round(float(coef[i]), 3)) for i in ix_pos])
print('\nTop features (-):')
print([(feat[i], round(float(coef[i]), 3)) for i in ix_neg])
E. Comparison & Quick Visual
rows = [
    ['LinearSVC (calibrated)', roc_auc_score(y_test, proba_lin), average_precision_score(y_test, proba_lin)],
    ['SVC-RBF (LSA 100D)', roc_auc_score(y_test, proba_rbf), average_precision_score(y_test, proba_rbf)]
]
res = pd.DataFrame(rows, columns=['Model', 'ROC-AUC', 'AP'])
print(res)
plt.figure(figsize=(6,4))
plt.bar(res['Model'], res['AP'])
plt.title('AP (Average Precision) Comparison')
plt.xticks(rotation=10)
plt.tight_layout(); plt.show()
F. Save Artifacts
import joblib
joblib.dump(cal_lin, 'svm_linear_calibrated_sessi9.joblib')
joblib.dump(search_rbf.best_estimator_, 'svm_rbf_lsa100_sessi9.joblib')
print('Saved: svm_linear_calibrated_sessi9.joblib, svm_rbf_lsa100_sessi9.joblib')
3) Case Studies & Analysis
| Case | Goal | Approach | Notes |
|---|---|---|---|
| Review sentiment | Strong, fast baseline | TF–IDF (1,2) + LinearSVC (tuned C) | Calibrate for PR/ROC |
| Topic classification | Large-scale linear separation | LinearSVC (hinge) | dual=True when features ≫ samples |
| Non-linear data (after reduction) | Capture curved boundaries | LSA 100D + SVC-RBF | Tune gamma & C via CV |
4) Mini Assignment (Graded)
- LinearSVC: grid C∈{0.25, 0.5, 1, 2} & loss∈{hinge, squared_hinge}; report F1 (5-fold CV) & the test report.
- SVC-RBF: use 100-D LSA; grid C∈{1, 2, 4}, gamma∈{scale, 0.1, 0.01}; compare AP & ROC-AUC against LinearSVC.
- Show the 20 most pro-positive & pro-negative LinearSVC features and analyze the 5 largest prediction errors.
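For the last item, one hedged sketch of "largest prediction errors" is to rank test documents by |p(y=1) − y| from the calibrated model. The toy frame below uses hypothetical `texts`/`proba` values standing in for the real test split and `cal_lin.predict_proba(...)[:, 1]`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the test split and calibrated probabilities
texts = [f'doc{i}' for i in range(6)]
y_true = np.array([1, 0, 1, 0, 1, 0])
proba = np.array([0.20, 0.90, 0.60, 0.40, 0.95, 0.10])

err = pd.DataFrame({'text': texts, 'y': y_true, 'p1': proba})
err['abs_error'] = (err['p1'] - err['y']).abs()   # distance from the true label
worst = err.sort_values('abs_error', ascending=False).head(5)
print(worst)  # doc1 (confident false positive) and doc0 (confident false negative) top the list
```

Reading the actual texts of the top-ranked rows usually reveals systematic failure modes (negation, sarcasm, mixed sentiment) worth discussing in the report.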