Session 11 — Text Similarity & Document Clustering
Objective: build similarity matrices (cosine/Euclidean) from TF–IDF and dense vectors (Word2Vec/LSA), apply K-Means and DBSCAN, and evaluate cluster quality (Silhouette/DB/CH; ARI when labels are available).
Learning Outcomes: (1) Compute similarity/distance between documents; (2) Apply K-Means and DBSCAN; (3) Choose the number of clusters (elbow/silhouette) and the eps/min_samples parameters; (4) Evaluate clusters (internal and external) and analyze the top terms per cluster.
1) Core Concepts
- Similarity/Distance: cosine similarity compares vector direction and is the usual choice for sparse TF–IDF; Euclidean distance fits L2-normalized spaces, where it ranks pairs the same way as cosine (see the sketch after this list).
- K-Means: minimizes the distance from each point to its cluster centroid; requires the number of clusters k up front.
- DBSCAN: density-based clustering; no k required, and it can flag noise/outliers. Key parameters: eps, min_samples.
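A quick numeric check of the cosine/Euclidean relationship (a minimal standalone sketch; the random vectors are purely illustrative): for unit vectors, ||a−b||² = 2(1 − cos(a,b)), which is why Euclidean K-Means on L2-normalized features implicitly optimizes a cosine criterion.
import numpy as np
rng = np.random.default_rng(42)
a, b = rng.random(5), rng.random(5)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # L2-normalize both vectors
cos = float(a @ b)                                   # cosine similarity of unit vectors
d2 = float(np.sum((a - b) ** 2))                     # squared Euclidean distance
print(round(d2, 6), round(2 * (1 - cos), 6))         # equal up to rounding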
2) Google Colab Practice — Similarity & Clustering
Use the corpus from Sessions 3/4/10. Optionally include weak labels for external evaluation (ARI).
A. Setup & Data
!pip -q install pandas numpy scikit-learn matplotlib seaborn gensim
import numpy as np, pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score, adjusted_rand_score
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
# Load the corpus; fall back to the Session 2 file if the variants file is absent
try:
    df = pd.read_csv('corpus_sessi3_variants.csv')
    texts = df['v2_stop_stemID'].fillna('').astype(str).tolist()
except Exception:
    texts = pd.read_csv('corpus_sessi2_normalized.csv')['text'].fillna('').astype(str).tolist()
print('Documents:', len(texts))
B. TF–IDF & Similarity Matrix
vec = TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.95, sublinear_tf=True, norm='l2')
X = vec.fit_transform(texts)
S = cosine_similarity(X)
print('TF–IDF shape:', X.shape, ' | Similarity matrix:', S.shape)
# Small heatmap (first 10x10 block)
plt.figure(figsize=(5,4))
sns.heatmap(S[:10,:10], cmap='viridis')
plt.title('Cosine Similarity (subset)'); plt.tight_layout(); plt.show()
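The similarity matrix can also be queried for retrieval. A minimal sketch (reusing S and texts from the cells above) that, for the first few documents, reports the most similar other document:
# Nearest neighbor by cosine: mask the diagonal so a document never matches itself
S_no_self = S.copy()
np.fill_diagonal(S_no_self, -1.0)
for i in range(min(5, len(texts))):
    j = int(np.argmax(S_no_self[i]))
    print(f'doc {i} -> doc {j} | cosine={S[i, j]:.3f}')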
C. Dimensionality Reduction (LSA 100D) — Optional
svd = TruncatedSVD(n_components=min(100, max(2, X.shape[1]-1)), random_state=42)
Z = svd.fit_transform(X)
Z = Normalizer(copy=False).fit_transform(Z)
print('LSA shape:', Z.shape, 'Explained variance:', svd.explained_variance_ratio_.sum().round(3))
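To justify (rather than assume) the 100-component choice, a small optional sketch plotting cumulative explained variance for the fitted svd; the 0.8 reference line is illustrative, not a rule:
# Cumulative explained variance vs. number of components
cum = np.cumsum(svd.explained_variance_ratio_)
plt.figure(figsize=(6,3))
plt.plot(cum)
plt.axhline(0.8, linestyle='--', color='gray')  # illustrative threshold, not a rule
plt.xlabel('n_components'); plt.ylabel('cumulative explained variance')
plt.tight_layout(); plt.show()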
D. K-Means (on TF–IDF & LSA)
def run_kmeans(feats, k):
    km = KMeans(n_clusters=k, n_init=20, random_state=42)
    lab = km.fit_predict(feats)
    # cosine silhouette for the high-dimensional (sparse TF–IDF) space, Euclidean otherwise
    sil = silhouette_score(feats, lab, metric='cosine') if feats.shape[1] > 200 else silhouette_score(feats, lab)
    # DB/CH require dense input, so densify sparse TF–IDF (fine for a small classroom corpus)
    dense = feats.toarray() if hasattr(feats, 'toarray') else feats
    db = davies_bouldin_score(dense, lab)
    ch = calinski_harabasz_score(dense, lab)
    return km, lab, sil, db, ch
for k in [2, 3, 4, 5]:
    kmX, labX, silX, dbX, chX = run_kmeans(X, k)
    kmZ, labZ, silZ, dbZ, chZ = run_kmeans(Z, k)
    print(f'k={k} | TF–IDF: Sil={silX:.3f} DB={dbX:.2f} CH={int(chX)} | LSA: Sil={silZ:.3f} DB={dbZ:.2f} CH={int(chZ)}')
# Simple 2D view: project onto the first two SVD components
# (labZ holds the labels from the last k of the loop above)
plt.figure(figsize=(6,5))
plt.scatter(Z[:,0], Z[:,1], c=labZ, s=18)
plt.title('K-Means on LSA (last k from the loop)'); plt.xlabel('SVD1'); plt.ylabel('SVD2'); plt.tight_layout(); plt.show()
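The elbow method named in the learning outcomes complements the silhouette numbers above; a minimal sketch (reusing Z from section C) that plots K-Means inertia against k so you can eyeball the bend:
# Elbow plot: within-cluster sum of squares (inertia) vs. k on the LSA features
ks = list(range(2, 9))
inertias = [KMeans(n_clusters=k, n_init=20, random_state=42).fit(Z).inertia_ for k in ks]
plt.figure(figsize=(6,3))
plt.plot(ks, inertias, marker='o')
plt.xlabel('k'); plt.ylabel('inertia'); plt.title('Elbow plot (LSA)')
plt.tight_layout(); plt.show()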
E. DBSCAN (on LSA)
# Estimate eps from the distribution of nearest-neighbor distances (k-distance plot)
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=5, metric='euclidean').fit(Z)
dists, _ = neigh.kneighbors(Z)
kdist = np.sort(dists[:,-1])
plt.figure(figsize=(6,3))
plt.plot(kdist); plt.title('k-distance plot (k=5)'); plt.tight_layout(); plt.show()
# Pick eps near the knee of the curve (a manual estimate), e.g. eps ≈ kdist[int(0.9*len(kdist))]
eps = float(kdist[int(0.9*len(kdist))])
print('Estimated eps ≈', round(eps,3))
cl = DBSCAN(eps=eps, min_samples=5, metric='euclidean')
labD = cl.fit_predict(Z)
n_clusters = len(set(labD)) - (1 if -1 in labD else 0)
print('Clusters:', n_clusters, '| noise:', int(np.sum(labD==-1)))
if n_clusters > 1:
    # note: DBSCAN noise points (label -1) are counted as one extra group in these scores
    silD = silhouette_score(Z, labD)
    dbD = davies_bouldin_score(Z, labD)
    chD = calinski_harabasz_score(Z, labD)
    print(f'DBSCAN metrics → Sil={silD:.3f} DB={dbD:.2f} CH={int(chD)}')
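Because eps and min_samples interact, it helps to scan a small grid around the estimate before settling on one setting. A rough sketch (the candidate values are illustrative only):
# Sensitivity check: cluster count and noise share across eps / min_samples combinations
for e in [0.5 * eps, eps, 1.5 * eps]:
    for ms in [3, 5, 10]:
        l = DBSCAN(eps=e, min_samples=ms).fit_predict(Z)
        nc = len(set(l)) - (1 if -1 in l else 0)
        print(f'eps={e:.3f} min_samples={ms:2d} -> clusters={nc} noise={int(np.sum(l == -1))}')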
F. Top Terms per Cluster (K-Means)
def top_terms_per_cluster(km, feat_names, topn=12):
    # cluster centroids live in TF–IDF space; sort each centroid's weights descending
    order = np.argsort(km.cluster_centers_, axis=1)[:, ::-1]
    tops = []
    for i in range(km.n_clusters):
        terms = [(feat_names[j], round(float(km.cluster_centers_[i,j]), 4)) for j in order[i,:topn]]
        tops.append((i, terms))
    return tops
feat = vec.get_feature_names_out()
kmX, labs, *_ = run_kmeans(X, 3)
for cid, terms in top_terms_per_cluster(kmX, feat):
    print('\nCluster', cid, 'top terms:')
    print(terms)
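For the per-cluster document analysis required in the assignment, a minimal sketch that prints the 3 cluster members closest to each centroid (reusing kmX, labs, X, and texts from above):
# Representative documents: the 3 members nearest each centroid in TF–IDF space
from sklearn.metrics.pairwise import euclidean_distances
D = euclidean_distances(X, kmX.cluster_centers_)  # doc-to-centroid distances
for cid in range(kmX.n_clusters):
    members = np.where(labs == cid)[0]
    nearest = members[np.argsort(D[members, cid])[:3]]
    print(f'\nCluster {cid} representative docs:')
    for i in nearest:
        print(' ', texts[i][:120])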
G. External Evaluation (Optional: ARI)
# If you have labels (e.g. sentiment or ticket category), compute ARI
try:
    y_true = pd.read_csv('labels_true.csv')['y'].astype(int).values
    print('ARI(KMeans, k=3) =', round(adjusted_rand_score(y_true, labs), 3))
except Exception:
    print('No external labels available → skipping ARI')
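Even without ground-truth labels, ARI can measure agreement between two clusterings; a one-line sketch comparing the K-Means partition from section F (on TF–IDF) with the DBSCAN partition from section E (on LSA; noise points, label -1, count as one extra group):
# Agreement between two clusterings; no ground truth needed
print('ARI(KMeans vs DBSCAN) =', round(adjusted_rand_score(labs, labD), 3))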
3) Case Studies & Analysis
| Case | Approach | Notes |
|---|---|---|
| Grouping helpdesk tickets | TF–IDF(1,2) → LSA 100D → K-Means | Label each centroid → name the cluster (from top terms) |
| Product review groups | LSA 100D + DBSCAN | Find outliers (odd/spam reviews) |
| Exploring curricula/class notes | Cosine TF–IDF + K-Means | Map themes and similar teaching materials |
4) Mini Assignment (Graded)
- Find the best k for K-Means via silhouette (try k∈{2..6}) on both TF–IDF and LSA; compare the metrics.
- Run DBSCAN — pick eps from the k-distance plot; report the number of clusters and the noise proportion.
- Show the top terms per cluster and name each cluster; analyze 3 documents per cluster.