Session 11 — Text Similarity & Document Clustering

Goal: build similarity matrices (cosine/Euclidean) from TF–IDF and dense vectors (Word2Vec/LSA), apply K‑Means & DBSCAN, and evaluate cluster quality (Silhouette/DB/CH; ARI when labels are available).

Learning Outcomes: (1) Compute similarity/distance between documents; (2) Apply K‑Means & DBSCAN; (3) Choose the number of clusters (elbow/silhouette) and the eps/min_samples parameters; (4) Evaluate (internally & externally) and analyze the top terms per cluster.

1) Core Concepts

  • Similarity/Distance: cosine (vector direction) is the usual choice for TF–IDF; Euclidean works well in a normalized space.
  • K‑Means: minimizes distance to cluster centroids; requires the number of clusters k.
  • DBSCAN: density-based clustering; needs no k and can flag noise/outliers. Key parameters: eps, min_samples.
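The cosine/Euclidean distinction above can be illustrated with a minimal NumPy sketch (the vectors are toy values, not corpus data): cosine ignores document length, and after L2 normalization squared Euclidean distance reduces to 2·(1 − cosine), so both rank pairs the same way in a normalized space.

```python
import numpy as np

# Toy "document" vectors (illustrative term counts only)
a = np.array([3.0, 0.0, 1.0])
b = np.array([6.0, 0.0, 2.0])   # same direction as a, twice the length
c = np.array([0.0, 5.0, 0.0])   # orthogonal to a

def cos_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cos_sim(a, b))  # 1.0 — cosine ignores document length
print(cos_sim(a, c))  # 0.0 — no shared terms

# After L2 normalization: ||a_n - b_n||^2 == 2 * (1 - cos(a, b))
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.allclose(np.sum((an - bn) ** 2), 2 * (1 - cos_sim(a, b))))  # True
```

This is why the LSA pipeline below applies a Normalizer before Euclidean-based clustering.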

2) Google Colab Practice — Similarity & Clustering

Use the corpus from Sessions 3/4/10. Optionally include weak labels for external evaluation (ARI).

A. Setup & Data

!pip -q install pandas numpy scikit-learn matplotlib seaborn gensim

import numpy as np, pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score, adjusted_rand_score
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns

# Load the corpus; fall back to the normalized file if the variant file is missing
try:
    df = pd.read_csv('corpus_sessi3_variants.csv')
    texts = df['v2_stop_stemID'].fillna('').astype(str).tolist()
except (FileNotFoundError, KeyError):
    texts = pd.read_csv('corpus_sessi2_normalized.csv')['text'].fillna('').astype(str).tolist()

print('Documents:', len(texts))

B. TF–IDF & Similarity Matrix

vec = TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.95, sublinear_tf=True, norm='l2')
X = vec.fit_transform(texts)

S = cosine_similarity(X)
print('TF–IDF shape:', X.shape, ' | Similarity matrix:', S.shape)

# Small heatmap (first 10x10 block)
plt.figure(figsize=(5,4))
sns.heatmap(S[:10,:10], cmap='viridis')
plt.title('Cosine Similarity (subset)'); plt.tight_layout(); plt.show()
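A common follow-up on a similarity matrix like S is nearest-neighbor lookup: which document is each document most similar to? A minimal sketch on a toy corpus (so it runs without the CSV files; with real data, reuse the S computed above):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (illustrative only): two "password" tickets, two "printer" tickets
docs = ['reset password akun email',
        'lupa password email kantor',
        'printer tidak bisa mencetak',
        'kertas printer macet terus']
S_toy = cosine_similarity(TfidfVectorizer().fit_transform(docs))

np.fill_diagonal(S_toy, -1.0)   # exclude trivial self-similarity
nearest = S_toy.argmax(axis=1)  # index of the most similar other document
for i, j in enumerate(nearest):
    print(f'doc {i} <-> doc {j} (sim={S_toy[i, j]:.3f})')
```

As expected, the two password tickets pair up (shared terms "password"/"email") and the two printer tickets pair up (shared term "printer").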

C. Dimensionality Reduction (LSA 100D) — Optional

svd = TruncatedSVD(n_components=min(100, max(2, X.shape[1]-1)), random_state=42)
Z = svd.fit_transform(X)
Z = Normalizer(copy=False).fit_transform(Z)
print('LSA shape:', Z.shape, 'Explained variance:', svd.explained_variance_ratio_.sum().round(3))
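Rather than fixing n_components at 100, one option is to pick the smallest dimensionality that retains a target share of the variance, read off the cumulative explained-variance curve. A sketch on synthetic data (a random matrix stands in for the TF–IDF matrix here):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)
X_demo = rng.random((200, 50))  # stand-in for a TF–IDF matrix

# Fit with (almost) all components, then read off the cumulative curve
svd_full = TruncatedSVD(n_components=49, random_state=42).fit(X_demo)
cum = np.cumsum(svd_full.explained_variance_ratio_)

# Smallest number of components that retains at least 80% of the variance
n_80 = int(np.searchsorted(cum, 0.80) + 1)
print('components for 80% variance:', n_80)
```

On the real TF–IDF matrix, fit TruncatedSVD once with a generous n_components and apply the same cumulative-sum trick.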

D. K‑Means (on TF–IDF & LSA)

def run_kmeans(feats, k):
    km = KMeans(n_clusters=k, n_init=20, random_state=42)
    lab = km.fit_predict(feats)
    # davies_bouldin_score / calinski_harabasz_score reject sparse input,
    # so densify the TF–IDF features first
    dense = feats.toarray() if hasattr(feats, 'toarray') else feats
    # Cosine silhouette in high-dimensional (TF–IDF) space, Euclidean otherwise
    sil = silhouette_score(dense, lab, metric='cosine') if dense.shape[1] > 200 else silhouette_score(dense, lab)
    db  = davies_bouldin_score(dense, lab)
    ch  = calinski_harabasz_score(dense, lab)
    return km, lab, sil, db, ch

for k in [2,3,4,5]:
    kmX, labX, silX, dbX, chX = run_kmeans(X, k)
    kmZ, labZ, silZ, dbZ, chZ = run_kmeans(Z, k)
    print(f'k={k} | TF–IDF: Sil={silX:.3f} DB={dbX:.2f} CH={int(chX)} | LSA: Sil={silZ:.3f} DB={dbZ:.2f} CH={int(chZ)}')

# Simple 2D visualization (first two SVD components; labZ holds labels from the last k in the loop)
plt.figure(figsize=(6,5))
plt.scatter(Z[:,0], Z[:,1], c=labZ, s=18)
plt.title('K-Means on LSA (last k tried)'); plt.xlabel('SVD1'); plt.ylabel('SVD2'); plt.tight_layout(); plt.show()
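The assignment below also mentions the elbow heuristic. K-Means exposes its objective (within-cluster sum of squares) as inertia_, so an elbow curve is a one-liner per k; a sketch on synthetic blobs (standing in for the corpus features):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Inertia drops sharply up to the true k, then flattens (the "elbow")
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=42)
                 .fit(X_demo).inertia_
            for k in range(1, 7)}
for k, v in inertias.items():
    print(k, round(v, 1))
```

Plot k against inertia (plt.plot(list(inertias), list(inertias.values()))) and look for the bend; combine with the silhouette scores above rather than relying on the elbow alone.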

E. DBSCAN (on LSA)

# Estimate eps from the distribution of nearest-neighbor distances (k-distance plot)
from sklearn.neighbors import NearestNeighbors

neigh = NearestNeighbors(n_neighbors=5, metric='euclidean').fit(Z)
dists, _ = neigh.kneighbors(Z)
kdist = np.sort(dists[:,-1])
plt.figure(figsize=(6,3))
plt.plot(kdist); plt.title('k-distance plot (k=5)'); plt.tight_layout(); plt.show()

# Pick eps at the knee of the curve (manual estimate), e.g. eps ≈ kdist[int(0.9*len(kdist))]
eps = float(kdist[int(0.9*len(kdist))])
print('Estimated eps ≈', round(eps,3))

cl = DBSCAN(eps=eps, min_samples=5, metric='euclidean')
labD = cl.fit_predict(Z)

n_clusters = len(set(labD)) - (1 if -1 in labD else 0)
print('Clusters:', n_clusters, '| noise:', int(np.sum(labD==-1)))

if n_clusters>1:
    mask = labD != -1  # internal metrics are only meaningful on non-noise points
    silD = silhouette_score(Z[mask], labD[mask])
    dbD  = davies_bouldin_score(Z[mask], labD[mask])
    chD  = calinski_harabasz_score(Z[mask], labD[mask])
    print(f'DBSCAN metrics → Sil={silD:.3f} DB={dbD:.2f} CH={int(chD)}')
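To see what DBSCAN buys over K-Means, a sketch on the classic two-moons dataset (synthetic, not corpus data; eps and min_samples chosen for this toy geometry): density-based clustering recovers the two non-convex shapes, which centroid-based K-Means cannot separate.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-circles — non-convex clusters
X_demo, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
lab = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_demo)

# Label -1 marks noise; everything else is a discovered cluster
n_found = len(set(lab)) - (1 if -1 in lab else 0)
print('clusters:', n_found, '| noise points:', int(np.sum(lab == -1)))
```

On the LSA document vectors, eps must instead come from the k-distance plot above, since meaningful densities differ per dataset.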

F. Top Terms per Cluster (K‑Means)

def top_terms_per_cluster(km, feat_names, topn=12):
    # Centroids live in TF–IDF space; sort the vocabulary by centroid weight
    order = np.argsort(km.cluster_centers_, axis=1)[:, ::-1]
    tops = []
    for i in range(km.n_clusters):
        terms = [(feat_names[j], round(float(km.cluster_centers_[i,j]),4)) for j in order[i,:topn]]
        tops.append((i, terms))
    return tops

feat = vec.get_feature_names_out()
kmX, labs, *_ = run_kmeans(X, 3)
for cid, terms in top_terms_per_cluster(kmX, feat):
    print('\nCluster', cid, 'top terms:')
    print(terms)

G. External Evaluation (Optional: ARI)

# If you have ground-truth labels (e.g. sentiment/ticket category), compute ARI
try:
    y_true = pd.read_csv('labels_true.csv')['y'].astype(int).values
    print('ARI(KMeans, k=3)=', round(adjusted_rand_score(y_true, labs),3))
except (FileNotFoundError, KeyError):
    print('External labels not available → skipping ARI')
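ARI compares groupings, not label names, so a clustering never gets penalized for calling a group "2" where the ground truth says "0". A minimal sketch with hand-made labels (illustrative values only):

```python
from sklearn.metrics import adjusted_rand_score

y_true_demo = [0, 0, 1, 1, 2, 2]

# Same grouping, permuted label names → still a perfect score
print(adjusted_rand_score(y_true_demo, [2, 2, 0, 0, 1, 1]))  # 1.0

# One item assigned to the wrong group → the score drops below 1
print(adjusted_rand_score(y_true_demo, [0, 0, 1, 1, 1, 2]))
```

ARI is also chance-adjusted: random assignments score near 0, which makes it more informative than raw accuracy-style matching for clustering.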

3) Case Studies & Analysis

  • Helpdesk ticket grouping: TF–IDF(1,2) → LSA 100D → K‑Means; label the centroids via their top terms → cluster names.
  • Product review groups: LSA 100D + DBSCAN; find outliers (odd/spam reviews).
  • Curriculum/class-notes exploration: cosine TF–IDF + K‑Means; map themes and similar teaching materials.

4) Mini Assignment (Graded)

  1. Find the best k for K‑Means via silhouette (try k∈{2..6}) on TF–IDF & LSA; compare the metrics.
  2. Run DBSCAN: choose eps from the k-distance plot; report the number of clusters and the noise proportion.
  3. Show the top terms per cluster and name each cluster; analyze 3 documents per cluster.