Session 15 — Final Project: A Complete Guide
In this session you will prepare the final NLP project, building on the material from Sessions 1–14. Focus on integrating the concepts (preprocessing → representation → model → evaluation → ethics) and on reproducibility.
Learning Outcomes: (1) design a structured experiment; (2) build a strong baseline plus comparison models; (3) run a thorough evaluation (metrics, ablations, fairness); (4) produce a concise scientific report and a demo.
1) Recommended Topics & Scope
- Multi-domain sentiment analysis (e-commerce, public services, education).
- Topic classification of helpdesk tickets (multi-class, imbalanced).
- Semantic search over FAQs/lecture notes (TF–IDF vs fastText/Doc2Vec/SIF).
- Simple spam/abuse detection (comment moderation).
- Document clustering plus centroid labeling (K-Means/DBSCAN, LSA/W2V).
Note: avoid sensitive data. If you use public data, include the source and license.
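To get a feel for the scope of the semantic-search topic, a TF–IDF baseline fits in a few lines. This is a minimal sketch; the documents and query are made-up examples, not project data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy FAQ corpus; replace with your own documents.
docs = [
    "how to reset my account password",
    "course schedule for the final exam",
    "refund policy for online purchases",
]
vec = TfidfVectorizer(ngram_range=(1, 2))
D = vec.fit_transform(docs)

def search(query, k=2):
    """Return the top-k documents by cosine similarity to the query."""
    q = vec.transform([query])
    scores = cosine_similarity(q, D).ravel()
    top = scores.argsort()[::-1][:k]
    return [(docs[i], round(float(scores[i]), 3)) for i in top]

print(search("password reset"))
```

For the project itself, the interesting comparison is swapping this vectorizer for dense representations (fastText/Doc2Vec/SIF) while keeping the same retrieval loop.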
3) Repository / Notebook Structure
project/
├─ data/
│  ├─ raw.csv            # raw data (if public, include the URL/DOI)
│  ├─ processed.csv      # after preprocessing
│  └─ meta_labels.csv    # sensitive/group attributes (optional)
├─ notebooks/
│  ├─ 01_prep.ipynb      # preprocessing (Sessions 2–3)
│  ├─ 02_baselines.ipynb # TF–IDF + (KNN/LogReg/SVM)
│  ├─ 03_dense.ipynb     # W2V/fastText/Doc2Vec & pooling
│  ├─ 04_eval.ipynb      # evaluation (PR/ROC/confusion)
│  ├─ 05_fairness.ipynb  # fairness & SHAP (Session 14)
│  └─ 06_report.ipynb    # final figures & tables
├─ models/               # .joblib/.model artifacts
└─ README.md             # run instructions & summary
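The layout above can be scaffolded in a few lines. A sketch, assuming the folder names from the tree (a `logs/` folder is added because the experiment template below writes there; notebooks are created by hand in Colab):

```python
import os

# Directory layout from the tree above, plus logs/ for the experiment CSV.
DIRS = ["data", "notebooks", "models", "logs"]

def scaffold(root="project"):
    """Create the project skeleton and a placeholder README; return its contents."""
    for d in DIRS:
        os.makedirs(os.path.join(root, d), exist_ok=True)
    readme = os.path.join(root, "README.md")
    if not os.path.exists(readme):
        with open(readme, "w") as f:
            f.write("# Final Project\n\nRun instructions and summary go here.\n")
    return sorted(os.listdir(root))

print(scaffold())
```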
4) Colab Notebook Template — Experiment Orchestration
Use this template to track experiments without external services (log to CSV).
!pip -q install pandas numpy scikit-learn gensim matplotlib shap
import os, time, json, numpy as np, pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, average_precision_score, roc_auc_score, f1_score
os.makedirs('logs', exist_ok=True)
LOG = 'logs/experiments.csv'
if not os.path.exists(LOG):
    pd.DataFrame(columns=['ts','note','vec','model','params','cv_metric','test_metric','seed']).to_csv(LOG, index=False)
# === Load & Preprocess ===
df = pd.read_csv('data/processed.csv')  # columns: text, y
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['y'].astype(int), test_size=0.25, stratify=df['y'], random_state=42)
vec = TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.95, sublinear_tf=True, norm='l2')
Xtr = vec.fit_transform(X_train)
Xte = vec.transform(X_test)
# === Baselines: LogReg & LinearSVC ===
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# 1) LogReg
grid_lr = GridSearchCV(LogisticRegression(max_iter=300, class_weight='balanced', solver='liblinear'),
{'penalty':['l1','l2'], 'C':[0.5,1,2,4]}, cv=cv, scoring='f1', n_jobs=-1)
grid_lr.fit(Xtr, y_train)
proba = grid_lr.best_estimator_.predict_proba(Xte)[:,1]
res_lr = {
'ts': int(time.time()),
'note':'tfidf-logreg',
'vec':'tfidf(1,2)',
'model':'logreg',
'params': json.dumps(grid_lr.best_params_),
'cv_metric': round(grid_lr.best_score_,3),
'test_metric': round(f1_score(y_test, (proba>=0.5).astype(int)),3),
'seed': 42
}
print('LR best:', res_lr)
# 2) LinearSVC
grid_svm = GridSearchCV(LinearSVC(dual=True), {'C':[0.5,1,2,4], 'loss':['hinge','squared_hinge']}, cv=cv, scoring='f1', n_jobs=-1)
grid_svm.fit(Xtr, y_train)
yhat = grid_svm.best_estimator_.predict(Xte)  # best_estimator_ is already refit on the full training split
res_svm = {
'ts': int(time.time()),
'note':'tfidf-linearsvc',
'vec':'tfidf(1,2)',
'model':'linearSVC',
'params': json.dumps(grid_svm.best_params_),
'cv_metric': round(grid_svm.best_score_,3),
'test_metric': round(f1_score(y_test, yhat),3),
'seed': 42
}
print('SVM best:', res_svm)
pd.concat([pd.DataFrame([res_lr]), pd.DataFrame([res_svm])]).to_csv(LOG, mode='a', index=False, header=False)
print('Logs appended →', LOG)
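Once a few runs have been appended, the CSV log can be summarized directly, for example to pick the best configuration per model family. A sketch, assuming the column names used above:

```python
import pandas as pd

def best_per_model(log: pd.DataFrame) -> pd.DataFrame:
    """Pick the row with the best test_metric per model family (ties -> latest ts)."""
    return (log.sort_values(['test_metric', 'ts'], ascending=False)
               .groupby('model', as_index=False)
               .first())

# Toy frame with the same columns; in practice:
#   best_per_model(pd.read_csv('logs/experiments.csv'))
demo = pd.DataFrame({
    'model': ['logreg', 'logreg', 'linearSVC'],
    'ts': [1, 2, 3],
    'test_metric': [0.70, 0.80, 0.75],
})
print(best_per_model(demo))
```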
5) Required Evaluation & Analysis
- Primary metrics: macro F1, AP (Average Precision), ROC-AUC (optional), confusion matrix.
- Ablation study: compare 2–3 representations (TF–IDF vs W2V/fastText/Doc2Vec) and 2–3 models (KNN/LogReg/SVM).
- Fairness & explainability: report per-group metrics (Session 14), top features/SHAP, and mitigation recommendations.
- Reproducibility: record random_state and package versions, and save the model + vectorizer artifacts.
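The per-group reporting above can be as simple as slicing predictions by a group attribute. A sketch on toy arrays; `groups` is a hypothetical column you would load from meta_labels.csv:

```python
import numpy as np
from sklearn.metrics import f1_score, average_precision_score

def report(y_true, y_pred, scores, groups):
    """Overall macro F1 and AP, plus macro F1 per group, for a binary task."""
    out = {
        'f1_macro': f1_score(y_true, y_pred, average='macro'),
        'ap': average_precision_score(y_true, scores),
    }
    for g in np.unique(groups):
        m = groups == g
        out[f'f1_{g}'] = f1_score(y_true[m], y_pred[m], average='macro')
    return {k: round(float(v), 3) for k, v in out.items()}

# Toy example with two groups:
y = np.array([0, 1, 1, 0, 1, 0])
p = np.array([0, 1, 0, 0, 1, 1])
s = np.array([0.1, 0.9, 0.4, 0.2, 0.8, 0.6])
g = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
res = report(y, p, s, g)
print(res)
```

A large gap between `f1_A` and `f1_B` is exactly the kind of finding the fairness section of the report should discuss.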
6) Final Project Grading Rubric
| Component | Weight | Indicators |
|---|---|---|
| Problem & Dataset | 15% | Clear problem definition; complete data card; ethics & licensing addressed. |
| Methodology | 25% | Clean pipeline; adequate baselines; planned experiments (CV, small grid). |
| Results & Evaluation | 25% | Comprehensive metrics; ablations; error analysis; effective visualizations. |
| Fairness & Explainability | 15% | Per-group metrics; SHAP/coefficient analysis; mitigation recommendations. |
| Repro & Code Quality | 10% | Repo structure; experiment logging; artifacts; complete run instructions. |
| Presentation/Demo | 10% | Concise storytelling; working demo; Q&A. |
Bonus (up to 5%) for including a small semantic search mini-app or applying SIF/probability calibration.
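For the probability-calibration bonus, one standard route is wrapping LinearSVC (which has no predict_proba) in CalibratedClassifierCV. A sketch on synthetic data; in the project you would fit on the Xtr/y_train from the template instead:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features from the template.
X, y = make_classification(n_samples=400, n_features=20, random_state=42)

# Sigmoid (Platt) calibration via internal 5-fold CV gives predict_proba
# to a margin-only classifier like LinearSVC.
clf = CalibratedClassifierCV(LinearSVC(C=1.0), method='sigmoid', cv=5)
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]
print(proba.shape)
```

With calibrated probabilities you can also tune the decision threshold instead of hard-coding 0.5 as in the LogReg baseline.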
7) Deliverables & Submission Format
- Repo/ZIP containing: data (or a download script), notebooks, models, and README.md.
- PDF report, ≤ 8 pages: background, method, results (tables/figures), analysis, limitations, ethics.
- Presentation slides, 6–10 pages, plus a 3–5 minute demo.
8) Pre-Exam Checklist (Self-Review)
- ✔ Data card & license included.
- ✔ End-to-end pipeline can be re-run without errors.
- ✔ Comparison table of models/representations, plus ablations.
- ✔ Per-group fairness report & explainability.
- ✔ Model & vectorizer artifacts saved (.joblib/.model).
- ✔ README with run instructions & package versions.
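Saving the artifacts named in the checklist takes one call each with joblib. A sketch with tiny stand-in models; in the project, dump the fitted `vec` and `best_estimator_` from the template:

```python
import os
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

os.makedirs('models', exist_ok=True)

# Tiny stand-ins so the round trip is demonstrable end to end.
texts = ["good product", "bad service", "great quality", "poor support"]
labels = [1, 0, 1, 0]
vec = TfidfVectorizer().fit(texts)
clf = LogisticRegression().fit(vec.transform(texts), labels)

# Persist both artifacts: the vectorizer must be saved with the model,
# or the feature space cannot be reproduced at inference time.
joblib.dump(vec, 'models/tfidf.joblib')
joblib.dump(clf, 'models/logreg.joblib')

# Reload and predict to confirm the round trip works.
vec2 = joblib.load('models/tfidf.joblib')
clf2 = joblib.load('models/logreg.joblib')
print(clf2.predict(vec2.transform(texts)))
```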