Sessi 5 — Cosine Similarity & Mini Search Engine
Goals: master computing cosine similarity in TF–IDF/LSA spaces, build a mini search engine (index → query → rank), and evaluate it with retrieval metrics (Precision@k, MAP, nDCG).
Learning Outcomes: (1) Formulate and compute cosine similarity; (2) Design a search pipeline; (3) Assess result quality with standard metrics; (4) Construct a simple ground truth & perform error analysis.
1) The Cosine Similarity Concept
With L2 normalization, the similarity of two vectors \(\vec{x},\vec{y}\) is defined as \(\cos(\theta) = \frac{\vec{x}\cdot\vec{y}}{\|\vec{x}\|\,\|\vec{y}\|}\). For L2-normalized TF–IDF vectors the denominator is 1, so cosine reduces to a plain dot product and can be computed efficiently.
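As a quick numeric check (a minimal NumPy sketch with made-up vectors), the full formula and the dot product of L2-normalized vectors give the same value:

```python
import numpy as np

# Two toy vectors (illustrative values only)
x = np.array([1.0, 2.0, 0.0, 3.0])
y = np.array([2.0, 1.0, 1.0, 0.0])

# Full formula: dot product divided by the product of the L2 norms
cos_full = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# After L2 normalization, cosine reduces to a plain dot product
xn, yn = x / np.linalg.norm(x), y / np.linalg.norm(y)
cos_dot = xn @ yn

print(round(cos_full, 4), round(cos_dot, 4))  # both values are identical
```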
Vector Space Model for Search
- Indexing: build the document feature matrix \(X\) (TF–IDF).
- Query: vectorize the query \(\vec{q}\) with the same vectorizer.
- Ranking: sort documents by the score \(s_i = \cos(\vec{x}_i, \vec{q})\).
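The three steps can be sketched end-to-end on a tiny toy corpus (illustrative documents and query; the full pipeline follows in section 2):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Indexing: TF-IDF matrix X over a toy corpus (made-up review snippets)
docs = ['pengiriman cepat dan aman', 'layar terang tapi silau', 'kurir cepat sampai']
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Query: vectorize with the SAME fitted vectorizer
q = vec.transform(['pengiriman cepat'])

# Ranking: sort documents by cosine score s_i, highest first
scores = cosine_similarity(X, q).ravel()
for i in np.argsort(scores)[::-1]:
    print(i, round(float(scores[i]), 3), docs[i])
```

Document 0 ranks first because it matches both query terms; document 1 shares no terms with the query, so its score is exactly 0.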
2) Google Colab Practice — Building a Mini Search Engine
Use the corpus from sessi3 or sessi2. The code below builds a TF–IDF index, a search function, and a retrieval evaluation.
A. Setup & Data
!pip -q install pandas numpy scikit-learn matplotlib tabulate
import numpy as np, pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from tabulate import tabulate
import matplotlib.pyplot as plt
# Load the corpus (prefer the v2 variant from Sessi 3: stopword-filtered, Indonesian-stemmed)
try:
    df = pd.read_csv('corpus_sessi3_variants.csv')
    corpus = df['v2_stop_stemID'].fillna('').astype(str).tolist()
    print('Loaded corpus_sessi3_variants.csv (v2_stop_stemID):', len(corpus))
except FileNotFoundError:
    df = pd.read_csv('corpus_sessi2_normalized.csv')
    corpus = df['text'].fillna('').astype(str).tolist()
    print('Loaded corpus_sessi2_normalized.csv:', len(corpus))
for i, t in enumerate(corpus[:5]):
    print(i, t[:100])
B. TF–IDF Indexing
# TF–IDF vectorizer: word unigrams + bigrams, sublinear TF, L2 norm
vec = TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.9, sublinear_tf=True, norm='l2')
X = vec.fit_transform(corpus)
print('Index shape:', X.shape)
C. Search Function
def search(query, topk=5):
    q = vec.transform([query])
    sims = cosine_similarity(X, q).ravel()
    order = np.argsort(sims)[::-1][:topk]
    return [(i, float(sims[i]), corpus[i]) for i in order]
# Example queries
for q in ['pengiriman cepat', 'screen glare', 'refund cepat', 'bluetooth stabil', 'login delay']:
    print('\nQUERY:', q)
    rows = search(q, topk=3)
    print(tabulate([(i, f"{s:.3f}", d[:80]) for i, s, d in rows], headers=['doc_id', 'sim', 'preview']))
D. Ground Truth & Retrieval Evaluation
# 1) Build a set of queries and the list of relevant documents (GT).
# You can start with a keyword-based heuristic, then refine it manually.
QUERIES = {
    'pengiriman cepat': ['pengiriman', 'cepat', 'kirim'],
    'screen glare': ['screen', 'glare', 'outdoor'],
    'refund cepat': ['refund', 'komplain', 'proses'],
    'bluetooth stabil': ['bluetooth', 'stabil'],
    'login delay': ['login', 'delay', 'lambat']
}
# Heuristic: a document is relevant if every keyword matches (as a substring of some token)
# after stemming/normalization
GT = {}
for q, kws in QUERIES.items():
    rel = []
    for i, doc in enumerate(corpus):
        ok = all(any(kw in tok for tok in doc.split()) for kw in kws)
        if ok:
            rel.append(i)
    # If that is too strict and yields nothing, relax to: at least 1 keyword
    if not rel:
        rel = [i for i, doc in enumerate(corpus) if any(kw in doc.split() for kw in kws)]
    GT[q] = sorted(set(rel))
print('GT preview:')
for q in list(QUERIES)[:3]:
    print(q, '→', GT[q][:5])
# 2) Metrics: Precision@k, Recall@k, AP (Average Precision), MAP, nDCG@k
def precision_at_k(ranked, relset, k=5):
    hit = sum(1 for i in ranked[:k] if i in relset)
    return hit / k

def recall_at_k(ranked, relset, k=5):
    if not relset:
        return 0.0
    hit = sum(1 for i in ranked[:k] if i in relset)
    return hit / len(relset)

def average_precision(ranked, relset):
    if not relset:
        return 0.0
    hits, s = 0, 0.0
    for j, i in enumerate(ranked, 1):
        if i in relset:
            hits += 1
            s += hits / j
    return s / max(1, len(relset))

def ndcg_at_k(ranked, relset, k=5):
    import math
    gains = [1.0 if i in relset else 0.0 for i in ranked[:k]]
    dcg = sum(g / math.log2(idx + 2) for idx, g in enumerate(gains))
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(idx + 2) for idx, g in enumerate(ideal)) or 1.0
    return dcg / idcg
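A hand-worked sanity check helps confirm the definitions. This self-contained sketch mirrors the functions above on a made-up ranking and relevance set, with the arithmetic spelled out in comments:

```python
import math

# Made-up ranking (doc ids in score order) and relevant set
ranked, rel = [3, 1, 4, 2, 5], {1, 2}

# P@5: 2 of the top 5 are relevant -> 2/5 = 0.4
p5 = sum(i in rel for i in ranked[:5]) / 5

# AP: hits at ranks 2 and 4 -> (1/2 + 2/4) / |rel| = 0.5
hits, s = 0, 0.0
for j, i in enumerate(ranked, 1):
    if i in rel:
        hits += 1
        s += hits / j
ap = s / len(rel)

# nDCG@5: gains [0,1,0,1,0]; DCG = 1/log2(3) + 1/log2(5); IDCG = 1 + 1/log2(3)
gains = [1.0 if i in rel else 0.0 for i in ranked[:5]]
dcg = sum(g / math.log2(idx + 2) for idx, g in enumerate(gains))
idcg = sum(g / math.log2(idx + 2) for idx, g in enumerate(sorted(gains, reverse=True)))
ndcg = dcg / idcg

print(p5, ap, round(ndcg, 3))  # 0.4 0.5 0.651
```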
# 3) Batch evaluation
rows = []
for q in QUERIES.keys():
    res = search(q, topk=10)
    ranked = [i for i, _, _ in res]
    rel = set(GT[q])
    p5 = precision_at_k(ranked, rel, 5)
    r5 = recall_at_k(ranked, rel, 5)
    ap = average_precision(ranked, rel)
    nd = ndcg_at_k(ranked, rel, 5)
    rows.append([q, len(rel), p5, r5, ap, nd])
report = pd.DataFrame(rows, columns=['query', '|rel|', 'P@5', 'R@5', 'AP', 'nDCG@5'])
print(report)
# Compact visual: P@5 vs nDCG@5
plt.figure(figsize=(6, 4))
plt.scatter(report['P@5'], report['nDCG@5'])
for i, row in report.iterrows():
    plt.text(row['P@5'] + 0.01, row['nDCG@5'] + 0.01, row['query'], fontsize=9)
plt.xlabel('Precision@5'); plt.ylabel('nDCG@5'); plt.title('Retrieval Quality Summary'); plt.tight_layout(); plt.show()
E. Error Analysis
# Inspect results for a poorly scoring query and see which terms dominate the score
q = 'refund cepat'
res = search(q, topk=10)
print('Result preview for:', q)
for i, s, doc in res[:5]:
    print(f'  {i} sim={s:.3f} {doc[:120]}')
# Show the highest-weighted TF–IDF features in a few top documents
feat = vec.get_feature_names_out()
for i, s, doc in res[:3]:
    row = X[i].toarray().ravel()
    topix = row.argsort()[::-1][:10]
    print('\nDoc', i, 'top features:')
    print([(feat[j], round(row[j], 3)) for j in topix])
F. (Optional) LSA for Search
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
svd = TruncatedSVD(n_components=min(100, X.shape[1] - 1), random_state=42)
Xk = svd.fit_transform(X)
Xk = Normalizer(copy=False).fit_transform(Xk)
# Search in the latent space: after L2 normalization, the dot product equals cosine
def search_lsa(query, topk=5):
    qv = vec.transform([query])
    qk = svd.transform(qv)
    qk = Normalizer(copy=False).fit_transform(qk)
    sims = (Xk @ qk.T).ravel()
    order = sims.argsort()[::-1][:topk]
    return [(i, float(sims[i]), corpus[i]) for i in order]
print('\nCompare the original space vs LSA:')
for q in QUERIES.keys():
    base = [i for i, _, _ in search(q, topk=5)]
    lsa = [i for i, _, _ in search_lsa(q, topk=5)]
    inter = len(set(base) & set(lsa))
    print(f'{q:18s} overlap@5 =', inter)
G. Save Artifacts
import joblib
joblib.dump(vec, 'tfidf_vec_sessi5.joblib')
joblib.dump({'GT': GT, 'QUERIES': QUERIES}, 'retrieval_gt_sessi5.joblib')
print('Saved: tfidf_vec_sessi5.joblib, retrieval_gt_sessi5.joblib')
3) Case Studies
| Case | Goal | Approach | Notes |
|---|---|---|---|
| E-commerce review search | Find reviews relevant to warranty claims | TF–IDF + cosine; query phrased as the problem | Strengthen synonymy via LSA if spelling variation is high |
| Helpdesk & FAQ | Retrieve answers from a knowledge base | Bigrams to capture phrases | Add domain keywords as a boost |
| Early plagiarism screening | Detect document similarity | Cosine on TF–IDF / char n-grams | Use a threshold & manual inspection to confirm |
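For the plagiarism row, character n-grams make cosine robust to small edits and punctuation changes. A minimal sketch with made-up sentences (text b is a lightly edited copy of a, c is unrelated):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up texts: b is a near-duplicate of a, c is unrelated
a = 'pengiriman barang sangat cepat dan aman sampai tujuan'
b = 'pengiriman barang sangat cepat & aman sampai di tujuan'
c = 'aplikasi sering logout sendiri saat dibuka'

# char_wb n-grams (3-5 chars inside word boundaries) tolerate small edits
vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5))
X = vec.fit_transform([a, b, c])
sims = cosine_similarity(X[0], X[1:]).ravel()
print(round(float(sims[0]), 2), round(float(sims[1]), 2))  # a-vs-b is far higher than a-vs-c
```

In practice you would set a decision threshold between the two score ranges and manually inspect anything above it, as the table suggests.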
4) Mini Assignment (Graded)
- Build a GT of at least 5 queries (the examples above are fine); manually revise the lists of relevant documents.
- Report P@5, R@5, MAP, and nDCG@5 for TF–IDF; compare against LSA.
- Perform error analysis on the 2 lowest-scoring queries; show the dominant features & suggest improvements (min_df, n-grams, normalization).