Session 9 – Introduction to Machine Learning
Machine Learning (ML) is a data-driven approach that learns patterns from data in order to predict or classify new, unseen inputs. This session focuses on the basic pipeline (data preparation → split → training → validation → testing), two core models (linear and logistic regression), and common issues such as overfitting.
Goals: understand the ML workflow, distinguish regression from classification, and evaluate simple models.
Material Review: Definitions, Intuition, and Examples
What is ML?
ML learns a mapping from features to a target automatically from data. Regression predicts a continuous value (e.g. a house price); classification predicts a discrete label (e.g. pass/fail).
ML Pipeline
- Prepare: clean the data, select relevant features, and scale them if needed.
- Split: divide the data into train/val/test sets so that evaluation is not biased by data the model has already seen.
- Train: choose a model and fit its parameters.
- Validate: choose hyperparameters and guard against overfitting.
- Test: measure the final performance.
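The five steps above can be sketched end to end in a few lines. This is a minimal illustration, not a prescribed recipe: the synthetic data, the choice of `Ridge` as the model, and the three candidate regularization strengths are all assumptions made for the example. The variable is named `reg` so it is not confused with the learning rate α used elsewhere in this session.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Prepare: hypothetical synthetic data (one feature, linear trend plus noise).
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.5, size=100)

# Split: 60% train, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Scale using statistics from the training set only.
scaler = StandardScaler().fit(X_train)
Xtr, Xva, Xte = (scaler.transform(A) for A in (X_train, X_val, X_test))

# Train + validate: fit one model per candidate, compare them on the validation set.
candidates = [0.01, 1.0, 100.0]  # assumed regularization strengths
val_mse = {}
for reg in candidates:
    model = Ridge(alpha=reg).fit(Xtr, y_train)
    val_mse[reg] = mean_squared_error(y_val, model.predict(Xva))
best_reg = min(val_mse, key=val_mse.get)

# Test: evaluate the chosen model once.
final_model = Ridge(alpha=best_reg).fit(Xtr, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(Xte))
print("validation MSE per candidate:", val_mse)
print("chosen regularization:", best_reg, "| test MSE:", round(test_mse, 3))
```

Note that the scaler is fitted on the training set only and the test set is touched once, after the hyperparameter has been chosen.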
Core Formulas
Formula Summary & Intuition:
1) Linear Regression (1 feature): y ≈ w*x + b.
• Loss: MSE = (1/m) ∑(y_i − (w x_i + b))^2.
• Gradient: ∂MSE/∂w = −(2/m)∑ x_i (y_i − y_hat_i), ∂MSE/∂b = −(2/m)∑ (y_i − y_hat_i), where y_hat_i = w x_i + b.
• Update (GD): w ← w − α ∂MSE/∂w, b ← b − α ∂MSE/∂b, where α is the learning rate.
2) Logistic Regression (binary): P(y=1|x)=σ(z)=1/(1+e^{−z}), z = w·x + b.
• Class prediction: y_hat = 1 if σ(z) ≥ 0.5, otherwise y_hat = 0.
• Loss: Log Loss (cross-entropy) = −(1/m) ∑ [y_i log p_i + (1 − y_i) log(1 − p_i)], with p_i = σ(z_i).
3) Data Split: Train / Validation / Test.
• Train: fit the parameters.
• Val: choose hyperparameters (e.g. α, regularization strength).
• Test: final evaluation, performed only once.
4) Overfitting vs Underfitting:
• Overfit: accurate on train, poor on val/test (model too complex).
• Underfit: poor on both train and val (model too simple).
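The overfitting/underfitting contrast in item 4 can be made concrete by fitting polynomials of increasing degree and comparing train and validation MSE. The sketch below is illustrative only: the data (a noisy sine curve), the degrees 1, 3 and 15, and the 30/30 split are assumptions chosen for the demonstration, and polynomial features are a swapped-in way of varying model complexity that this session does not otherwise cover.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)

# Hypothetical nonlinear data: y = sin(3x) + noise, with x in [-1, 1].
X = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.2, size=60)
X_train, y_train = X[:30], y[:30]
X_val, y_val = X[30:], y[30:]

results = {}
for degree in (1, 3, 15):
    # Higher degree = more flexible (more complex) model.
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    results[degree] = (train_mse, val_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  val MSE={val_mse:.3f}")
```

Degree 1 is expected to underfit (high error on both sets). Train error can only shrink as the degree grows, so a high-degree model that is much better on train than on validation is the overfitting pattern described above. The exact numbers depend on the random draw.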
Case Studies
- Linear Regression: Study hours → Exam score; House area → Price.
- Logistic Regression: Pass/Fail based on study hours and attendance.
Lab: Data Split, Linear & Logistic Regression
# ====== Visualizing the Train/Val/Test Split (synthetic) ======
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
np.random.seed(42)
# Synthetic data: linear trend plus Gaussian noise
X = np.linspace(0, 10, 100).reshape(-1,1)
y = 3.5*X[:,0] + 5 + np.random.randn(100)*2.0
# First split: 60% train, 40% held out
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
# Second split: halve the held-out part into validation and test (20% each)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
plt.figure(figsize=(5,3))
plt.scatter(X_train, y_train, label='Train')
plt.scatter(X_val, y_val, label='Val')
plt.scatter(X_test, y_test, label='Test')
plt.legend(); plt.title('Data Split'); plt.xlabel('X'); plt.ylabel('y')
plt.show()
print('Sizes:', len(X_train), len(X_val), len(X_test))
# ====== Linear Regression from Scratch (GD) ======
import numpy as np
np.random.seed(1)
# Case-study dataset: study hours → exam score
X = np.linspace(0, 10, 80)
y_true = 8*X + 20
noise = np.random.randn(80)*8
y = y_true + noise
# Simple normalization (optional, but it helps gradient descent converge)
Xn = (X - X.mean())/X.std()
# Initialize parameters
w, b = 0.0, 0.0
alpha = 0.05  # learning rate
m = len(Xn)
for epoch in range(600):
    y_hat = w*Xn + b
    err = y - y_hat
    # Gradients of the MSE with respect to w and b
    dw = -(2/m)*np.sum(Xn*err)
    db = -(2/m)*np.sum(err)
    w -= alpha*dw
    b -= alpha*db
    if epoch % 100 == 0:
        mse = np.mean(err**2)  # MSE before this epoch's update
        print(f"epoch {epoch} mse={mse:.2f} w={w:.3f} b={b:.3f}")
print("Final parameters:", w, b)
# Predict the score for 7 study hours (apply the same normalization first)
x_new = (7 - X.mean())/X.std()
pred = w*x_new + b
print(f"Predicted score at hours=7: {pred:.2f}")
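One way to sanity-check a hand-written gradient-descent loop is to compare its result with the closed-form least-squares fit. The sketch below regenerates the same synthetic data as the cell above and uses `np.polyfit` as the reference. `np.polyfit` is a swapped-in check, not part of the lab itself, and the conversion back to original units is added here for illustration.

```python
import numpy as np

np.random.seed(1)
X = np.linspace(0, 10, 80)
y = 8 * X + 20 + np.random.randn(80) * 8
Xn = (X - X.mean()) / X.std()

# Gradient descent with the same settings as the lab cell.
w, b, alpha, m = 0.0, 0.0, 0.05, len(Xn)
for _ in range(600):
    err = y - (w * Xn + b)
    w -= alpha * (-(2 / m) * np.sum(Xn * err))
    b -= alpha * (-(2 / m) * np.sum(err))

# Closed-form least squares on the same normalized feature.
w_ls, b_ls = np.polyfit(Xn, y, deg=1)
print(f"GD:          w={w:.4f}  b={b:.4f}")
print(f"closed form: w={w_ls:.4f}  b={b_ls:.4f}")

# Convert back to original units (score per study hour, score at 0 hours).
slope = w / X.std()
intercept = b - w * X.mean() / X.std()
print(f"original units: slope={slope:.3f}  intercept={intercept:.3f}")
```

Because the feature is normalized to mean 0 and standard deviation 1, each update shrinks the remaining error in w and b by a factor of 1 − 2α = 0.9, so 600 epochs are far more than enough for the two results to agree.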
# ====== Linear Regression with scikit-learn ======
# If the module is missing: !pip install scikit-learn
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# Case study: house area (m^2) → price (millions)
np.random.seed(7)
X = np.random.uniform(30, 150, size=120).reshape(-1,1)
y = 2.2*X[:,0] + 50 + np.random.randn(120)*10
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('w, b =', model.coef_[0], model.intercept_)
print('MSE =', mean_squared_error(y_test, y_pred))
print('R^2 =', r2_score(y_test, y_pred))
If the module is not available, run !pip install scikit-learn in Colab.
# ====== Binary Classification: Logistic Regression ======
# Example: Pass (1) / Fail (0) based on study hours & attendance
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
np.random.seed(0)
N=200
hours = np.random.uniform(0,10,N)
attendance = np.random.uniform(60,100,N)
# Synthetic labels: compute a probability from a linear combination of the features,
# then threshold it at 0.5 (deterministic, so the two classes are linearly separable)
logit = -6 + 0.7*hours + 0.06*attendance
prob = 1/(1+np.exp(-logit))
y = (prob>0.5).astype(int)
X = np.vstack([hours, attendance]).T
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=42)
# Fit the scaler on the training set only, then apply it to both sets
sc = StandardScaler()
Xtr = sc.fit_transform(X_train)
Xte = sc.transform(X_test)
clf = LogisticRegression()
clf.fit(Xtr,y_train)
y_pred = clf.predict(Xte)
print('Accuracy:', accuracy_score(y_test,y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test,y_pred))
print('Classification Report:\n', classification_report(y_test,y_pred))
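The lab relies on `clf.predict`, which hides the sigmoid, the 0.5 threshold, and the log loss from the formula summary. The sketch below recomputes those three pieces by hand from the fitted coefficients, on the same synthetic data as the cell above. The helper `sigmoid` and the names `z`, `p`, `y_hat`, and `manual_ll` are introduced here for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same synthetic data as the lab cell.
np.random.seed(0)
N = 200
hours = np.random.uniform(0, 10, N)
attendance = np.random.uniform(60, 100, N)
logit = -6 + 0.7 * hours + 0.06 * attendance
y = (1 / (1 + np.exp(-logit)) > 0.5).astype(int)
X = np.column_stack([hours, attendance])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
sc = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(sc.transform(X_train), y_train)
Xte = sc.transform(X_test)

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# z = w·x + b, computed from the fitted coefficients.
z = Xte @ clf.coef_[0] + clf.intercept_[0]
p = sigmoid(z)                    # P(y=1 | x)
y_hat = (p >= 0.5).astype(int)    # class prediction: threshold at 0.5

# Log loss written out from its definition.
manual_ll = -np.mean(y_test * np.log(p) + (1 - y_test) * np.log(1 - p))
print("manual log loss :", manual_ll)
print("sklearn log loss:", log_loss(y_test, p, labels=[0, 1]))
```

For a binary problem, `p` should match `clf.predict_proba(Xte)[:, 1]` and `y_hat` should match `clf.predict(Xte)`. Comparing them is a quick check that the formulas and the library agree.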
# ====== Quiz 9 (self-check) ======
# Each entry: (question, answer options, key of the correct option)
qs=[
    ("The purpose of the validation set is...",{"a":"Training the parameters","b":"Choosing hyperparameters","c":"Measuring final performance","d":"Adding training data"},"b"),
    ("Overfitting occurs when...",{"a":"Train error is high, val error is low","b":"Train error is low, val error is high","c":"Both are low","d":"Both are high"},"b"),
    ("In logistic regression, the activation function used is...",{"a":"ReLU","b":"Tanh","c":"Sigmoid","d":"Softmax"},"c"),
    ("MSE is most commonly used for...",{"a":"Classification","b":"Regression","c":"Clustering","d":"PCA"},"b"),
]
print('Answer key:')
for i,(_,__,ans) in enumerate(qs,1):
    print(f'Q{i}: {ans}')
Assignments & References
Coding Assignment 5: (i) Choose either regression (linear) or classification (logistic). (ii) Build a small dataset (≥ 120 samples) or use a small public dataset. (iii) Perform a train/val/test split, train the model, and report metrics (MSE/R² for regression; accuracy + confusion matrix for classification). (iv) Describe any signs of over/underfitting and how you would mitigate them (regularization, more data, better features).
- Geron, A. (2019). *Hands-On Machine Learning with Scikit-Learn & TensorFlow* (chapters on linear/logistic regression).
- James et al. (2021). *An Introduction to Statistical Learning* (ISL, chapters 3–4).