Session 11 - Basic Classification
Learning Objectives
- Understand supervised learning concepts for classification
- Master the K-Nearest Neighbors (KNN) algorithm
- Implement a Decision Tree classifier
- Understand Logistic Regression for binary classification
- Evaluate models with a confusion matrix and related metrics
- Handle imbalanced datasets
What Is Classification?
Classification is a supervised learning task that predicts a discrete category (label) for each data instance.
- Binary Classification: 2 classes (Yes/No, Pass/Fail)
- Multiclass Classification: more than 2 classes (A/B/C/D/E)
- Multilabel Classification: each instance can carry several labels at once
Applications: email spam detection, disease diagnosis, customer churn prediction, image recognition
2. K-Nearest Neighbors (KNN)
KNN Algorithm
Principle: "Birds of a feather flock together"
- Calculate the distance from the query point to every training point
- Find the K nearest neighbors
- Predict the most common class among those neighbors (majority vote)
Pros: Simple, no explicit training phase, works well with local patterns
Cons: Slow prediction on large datasets, sensitive to feature scale, suffers from the curse of dimensionality
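The three KNN steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation: the function name knn_predict and the toy points are made up for this example, and Euclidean distance is assumed as the distance measure.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups of students
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
                (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["Fail", "Fail", "Fail", "Pass", "Pass", "Pass"]

print(knn_predict(train_points, train_labels, (1.2, 1.4), k=3))
print(knn_predict(train_points, train_labels, (6.4, 6.2), k=3))
```

Because every prediction scans all training points, the cost grows with the size of the training set, which is the "slow for large datasets" drawback noted above.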
3. Decision Tree Classifier
Decision Tree Algorithm
Principle: A series of if-then rules learned from data
- Split the data on the feature (and threshold) that gives the best information gain
- Recursively build the tree until a stopping criterion is met (for example, maximum depth or pure leaves)
- Leaf nodes represent class predictions
Pros: Interpretable, handles non-linear relationships, no feature scaling needed
Cons: Prone to overfitting, unstable (small changes in the data can produce a different tree), biased toward the majority class on imbalanced data
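To make "best information gain" concrete, the sketch below computes Shannon entropy and the gain of one candidate split. It is an illustration under stated assumptions: the helper names entropy and information_gain and the toy labels are hypothetical, and only a binary split is considered.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_child_entropy = (
        len(left) / n * entropy(left) + len(right) / n * entropy(right)
    )
    return entropy(parent) - weighted_child_entropy

# Toy node (hypothetical): 4 Pass and 4 Fail, so the parent entropy is 1 bit
parent = ["Pass"] * 4 + ["Fail"] * 4

# A split that separates the classes perfectly
gain_perfect = information_gain(parent, ["Pass"] * 4, ["Fail"] * 4)

# A split that leaves each child with the same 50/50 mix as the parent
mixed = ["Pass", "Pass", "Fail", "Fail"]
gain_useless = information_gain(parent, mixed, mixed)

print(gain_perfect, gain_useless)
```

A tree learner evaluates many candidate splits this way and keeps the one with the highest gain. Note that scikit-learn's DecisionTreeClassifier uses Gini impurity by default; passing criterion='entropy' selects the entropy-based measure sketched here.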
4. Logistic Regression
Probability of Class 1: $p = \sigma(z) = \frac{1}{1 + e^{-z}}$
where $z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$
Decision Rule: Class 1 if $p > 0.5$, else Class 0
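The linear score z and the decision rule above can be traced by hand with a small sketch. It assumes the standard sigmoid link, p = 1 / (1 + e^(-z)), and the intercept, coefficients, and student values are hypothetical numbers chosen only for illustration, not fitted values.

```python
import math

def sigmoid(z):
    """Map a linear score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, intercept, coefficients):
    """Probability of Class 1 for one instance."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)

def predict_class(features, intercept, coefficients, threshold=0.5):
    """Apply the decision rule: Class 1 if p > threshold, else Class 0."""
    return 1 if predict_proba(features, intercept, coefficients) > threshold else 0

# Hypothetical model: z = -4.0 + 0.8 * study_hours + 0.02 * attendance
intercept = -4.0
coefficients = [0.8, 0.02]
student = [5.0, 80.0]  # 5 study hours, 80% attendance

p = predict_proba(student, intercept, coefficients)
print(f"p = {p:.3f}, predicted class = {predict_class(student, intercept, coefficients)}")
```

Since sigmoid(0) = 0.5, the rule p > 0.5 is equivalent to z > 0, so the decision boundary is the hyperplane where the linear score equals zero.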
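# Imports used by the code in this session
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_curve, auc, roc_auc_score,
                             confusion_matrix, ConfusionMatrixDisplay)

# Logistic Regression for Binary Classification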
print("="*60)
print("LOGISTIC REGRESSION - STUDENT PERFORMANCE")
print("="*60)
# Create binary classification dataset
np.random.seed(42)
n_students = 300
# Features
study_hours = np.random.normal(5, 2, n_students)
attendance = np.random.uniform(60, 100, n_students)
assignments = np.random.uniform(50, 100, n_students)
# Create target (Pass/Fail)
score = (study_hours * 5 + attendance * 0.3 + assignments * 0.2 +
         np.random.normal(0, 10, n_students))
pass_fail = (score > np.median(score)).astype(int)
# DataFrame
df_students = pd.DataFrame({
    'Study_Hours': study_hours,
    'Attendance': attendance,
    'Assignments': assignments,
    'Pass': pass_fail
})
print("\nDataset Overview:")
print(df_students.head())
print(f"\nClass distribution:")
print(f" Fail (0): {(pass_fail == 0).sum()} students")
print(f" Pass (1): {(pass_fail == 1).sum()} students")
# Prepare data
X = df_students[['Study_Hours', 'Attendance', 'Assignments']]
y = df_students['Pass']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train Logistic Regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)
# Predictions
y_pred_log = log_reg.predict(X_test_scaled)
y_pred_proba = log_reg.predict_proba(X_test_scaled)[:, 1]
print(f"\nTest Accuracy: {accuracy_score(y_test, y_pred_log):.3f}")
# Visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Logistic Regression Analysis', fontsize=16, fontweight='bold')
# 1. ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
axes[0, 0].plot(fpr, tpr, color='darkorange', lw=2,
                label=f'ROC curve (AUC = {roc_auc:.2f})')
axes[0, 0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[0, 0].set_xlabel('False Positive Rate')
axes[0, 0].set_ylabel('True Positive Rate')
axes[0, 0].set_title('ROC Curve')
axes[0, 0].legend(loc="lower right")
axes[0, 0].grid(True, alpha=0.3)
# 2. Confusion Matrix
cm_log = confusion_matrix(y_test, y_pred_log)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_log,
                              display_labels=['Fail', 'Pass'])
disp.plot(ax=axes[0, 1], cmap='Blues')
axes[0, 1].set_title('Confusion Matrix')
# 3. Probability Distribution
axes[0, 2].hist(y_pred_proba[y_test == 0], bins=20, alpha=0.5,
                label='Fail', color='red')
axes[0, 2].hist(y_pred_proba[y_test == 1], bins=20, alpha=0.5,
                label='Pass', color='green')
axes[0, 2].axvline(0.5, color='black', linestyle='--', label='Threshold')
axes[0, 2].set_xlabel('Predicted Probability')
axes[0, 2].set_ylabel('Frequency')
axes[0, 2].set_title('Probability Distribution by Class')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)
# 4. Feature Coefficients
coefs = log_reg.coef_[0]
features = X.columns
axes[1, 0].bar(features, coefs, color=['green' if c > 0 else 'red' for c in coefs])
axes[1, 0].set_ylabel('Coefficient')
axes[1, 0].set_title('Feature Importance (Logistic Regression)')
axes[1, 0].grid(True, alpha=0.3)
# 5. Threshold Analysis
precisions = []
recalls = []
thresholds_range = np.linspace(0.1, 0.9, 20)
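for thresh in thresholds_range:
    y_pred_thresh = (y_pred_proba > thresh).astype(int)
    # zero_division=0 avoids a warning when a threshold yields no positive predictions
    precisions.append(precision_score(y_test, y_pred_thresh, zero_division=0))
    recalls.append(recall_score(y_test, y_pred_thresh, zero_division=0))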
axes[1, 1].plot(thresholds_range, precisions, 'b-', label='Precision')
axes[1, 1].plot(thresholds_range, recalls, 'r-', label='Recall')
axes[1, 1].set_xlabel('Threshold')
axes[1, 1].set_ylabel('Score')
axes[1, 1].set_title('Precision-Recall vs Threshold')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
# 6. Learning Curve
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(random_state=42), X_train_scaled, y_train,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 10)
)
axes[1, 2].plot(train_sizes, np.mean(train_scores, axis=1), 'o-',
                color='blue', label='Training score')
axes[1, 2].plot(train_sizes, np.mean(val_scores, axis=1), 'o-',
                color='red', label='Validation score')
axes[1, 2].set_xlabel('Training Set Size')
axes[1, 2].set_ylabel('Accuracy')
axes[1, 2].set_title('Learning Curves')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
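The confusion matrix plotted above is the source of the other evaluation metrics. As a minimal sketch of how accuracy, precision, recall, and F1 follow from its four counts, the helpers below recompute them in plain Python. The function names and the toy label lists are hypothetical; in the session's code, scikit-learn's accuracy_score, precision_score, and recall_score do this work.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN, treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive=1):
    """Derive the usual metrics from the confusion-matrix counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels (hypothetical): 4 actual positives, 6 actual negatives
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

print(confusion_counts(y_true_toy, y_pred_toy))
print(classification_metrics(y_true_toy, y_pred_toy))
```

Precision asks "of the instances predicted positive, how many were right?", while recall asks "of the actual positives, how many were found?". Raising the decision threshold usually trades recall for precision, which is what the threshold analysis panel shows.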
5. Case Study: Customer Churn Prediction
Predicting Customer Churn
Build a model that predicts which customers will churn (cancel their subscription) based on their behavior and characteristics.
# COMPREHENSIVE CASE STUDY: CUSTOMER CHURN
print("="*60)
print("CASE STUDY: CUSTOMER CHURN PREDICTION")
print("="*60)
# Generate synthetic customer data
np.random.seed(42)
n_customers = 1000
# Features
data = {
    'Tenure': np.random.uniform(0, 72, n_customers),  # Months
    'MonthlyCharges': np.random.uniform(20, 120, n_customers),
    'TotalCharges': np.random.uniform(100, 8000, n_customers),
    'NumServices': np.random.randint(1, 7, n_customers),
    'Contract': np.random.choice(['Month', 'Year', '2Years'], n_customers, p=[0.5, 0.3, 0.2]),
    'PaymentMethod': np.random.choice(['Bank', 'Credit', 'Electronic', 'Mail'],
                                      n_customers, p=[0.3, 0.3, 0.3, 0.1]),
    'SupportCalls': np.random.poisson(2, n_customers),
    'SatisfactionScore': np.random.uniform(1, 5, n_customers)
}
df_churn = pd.DataFrame(data)
# Create churn based on logical rules
churn_prob = (
    0.3 +  # Base probability
    (df_churn['Tenure'] < 12) * 0.2 +  # New customers more likely
    (df_churn['MonthlyCharges'] > 80) * 0.1 +  # High charges
    (df_churn['Contract'] == 'Month') * 0.2 +  # Month-to-month risky
    (df_churn['SupportCalls'] > 3) * 0.15 +  # Many complaints
    (df_churn['SatisfactionScore'] < 2.5) * 0.2  # Low satisfaction
)
df_churn['Churn'] = (np.random.random(n_customers) < churn_prob).astype(int)
print("\nDataset Overview:")
print(df_churn.head())
print(f"\nChurn Rate: {df_churn['Churn'].mean():.2%}")
# Encode categorical variables
# Note: LabelEncoder assigns integer codes in alphabetical order, which imposes an
# arbitrary ordering; for nominal features, one-hot encoding is usually a safer
# choice for linear and distance-based models.
from sklearn.preprocessing import LabelEncoder
le_contract = LabelEncoder()
le_payment = LabelEncoder()
df_encoded = df_churn.copy()
df_encoded['Contract_Encoded'] = le_contract.fit_transform(df_churn['Contract'])
df_encoded['PaymentMethod_Encoded'] = le_payment.fit_transform(df_churn['PaymentMethod'])
# Prepare features
features = ['Tenure', 'MonthlyCharges', 'TotalCharges', 'NumServices',
            'Contract_Encoded', 'PaymentMethod_Encoded', 'SupportCalls',
            'SatisfactionScore']
X = df_encoded[features]
y = df_encoded['Churn']
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42, stratify=y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train multiple models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}
results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    # Use class-1 probabilities for AUC when available, else fall back to hard predictions
    y_proba = (model.predict_proba(X_test_scaled)[:, 1]
               if hasattr(model, 'predict_proba') else y_pred)
    results[name] = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'AUC': roc_auc_score(y_test, y_proba)
    }
# Display results
results_df = pd.DataFrame(results).T.round(3)
print("\nModel Comparison:")
print(results_df)
# Best model analysis
best_model_name = results_df['F1'].idxmax()
best_model = models[best_model_name]
print(f"\nBest Model (by F1 Score): {best_model_name}")
# Comprehensive visualization
fig, axes = plt.subplots(3, 3, figsize=(20, 18))
fig.suptitle('Customer Churn Prediction Analysis', fontsize=18, fontweight='bold')
# 1. Model Performance Comparison
metrics = ['Accuracy', 'Precision', 'Recall', 'F1']
x_pos = np.arange(len(models))
width = 0.2
for i, metric in enumerate(metrics):
    values = [results[name][metric] for name in models]
    axes[0, 0].bar(x_pos + i*width, values, width, label=metric)
axes[0, 0].set_xlabel('Models')
axes[0, 0].set_ylabel('Score')
axes[0, 0].set_title('Model Performance Metrics')
axes[0, 0].set_xticks(x_pos + width * 1.5)
axes[0, 0].set_xticklabels(list(models.keys()), rotation=45)
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# 2. ROC Curves Comparison
for name, model in models.items():
    if hasattr(model, 'predict_proba'):
        y_proba = model.predict_proba(X_test_scaled)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_proba)
        auc_score = auc(fpr, tpr)
        axes[0, 1].plot(fpr, tpr, label=f'{name} (AUC={auc_score:.2f})')
axes[0, 1].plot([0, 1], [0, 1], 'k--')
axes[0, 1].set_xlabel('False Positive Rate')
axes[0, 1].set_ylabel('True Positive Rate')
axes[0, 1].set_title('ROC Curves Comparison')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# 3. Feature Importance (Random Forest)
rf_model = models['Random Forest']
importances = rf_model.feature_importances_
indices = np.argsort(importances)[::-1]
axes[0, 2].barh(range(len(importances)), importances[indices], color='teal')
axes[0, 2].set_yticks(range(len(importances)))
axes[0, 2].set_yticklabels([features[i] for i in indices])
axes[0, 2].set_xlabel('Importance')
axes[0, 2].set_title('Feature Importance (Random Forest)')
axes[0, 2].grid(True, alpha=0.3)
# 4. Churn Distribution by Tenure
tenure_bins = pd.cut(df_churn['Tenure'], bins=[0, 12, 24, 48, 72],
                     labels=['0-1yr', '1-2yr', '2-4yr', '4-6yr'],
                     include_lowest=True)
# observed=False keeps every tenure bin in the result, even an empty one
churn_by_tenure = df_churn.groupby(tenure_bins, observed=False)['Churn'].agg(['mean', 'count'])
axes[1, 0].bar(range(len(churn_by_tenure)), churn_by_tenure['mean'],
               color='coral', alpha=0.7)
axes[1, 0].set_xticks(range(len(churn_by_tenure)))
axes[1, 0].set_xticklabels(churn_by_tenure.index)
axes[1, 0].set_ylabel('Churn Rate')
axes[1, 0].set_title('Churn Rate by Tenure')
axes[1, 0].grid(True, alpha=0.3)
# 5. Confusion Matrix (Best Model)
y_pred_best = best_model.predict(X_test_scaled)
cm_best = confusion_matrix(y_test, y_pred_best)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_best,
                              display_labels=['Stay', 'Churn'])
disp.plot(ax=axes[1, 1], cmap='Blues')
axes[1, 1].set_title(f'Confusion Matrix ({best_model_name})')
# 6. Churn by Contract Type
contract_churn = df_churn.groupby('Contract')['Churn'].agg(['mean', 'count'])
axes[1, 2].bar(contract_churn.index, contract_churn['mean'], color='lightgreen')
axes[1, 2].set_ylabel('Churn Rate')
axes[1, 2].set_title('Churn Rate by Contract Type')
axes[1, 2].grid(True, alpha=0.3)
# 7. Cost-Benefit Analysis
# Assuming: Cost of retention = $50, Lost revenue from churn = $500
retention_cost = 50
churn_loss = 500
y_proba_best = best_model.predict_proba(X_test_scaled)[:, 1]
thresholds = np.linspace(0.1, 0.9, 20)
profits = []
for thresh in thresholds:
    y_pred_thresh = (y_proba_best > thresh).astype(int)
    tp = ((y_pred_thresh == 1) & (y_test == 1)).sum()  # Correctly predicted churn
    fp = ((y_pred_thresh == 1) & (y_test == 0)).sum()  # False churn prediction
    fn = ((y_pred_thresh == 0) & (y_test == 1)).sum()  # Missed churn
    profit = tp * (churn_loss - retention_cost) - fp * retention_cost - fn * churn_loss
    profits.append(profit)
axes[2, 0].plot(thresholds, profits, 'g-o')
axes[2, 0].axhline(0, color='red', linestyle='--')
axes[2, 0].set_xlabel('Probability Threshold')
axes[2, 0].set_ylabel('Expected Profit ($)')
axes[2, 0].set_title('Profit vs Decision Threshold')
axes[2, 0].grid(True, alpha=0.3)
optimal_threshold = thresholds[np.argmax(profits)]
axes[2, 0].axvline(optimal_threshold, color='blue', linestyle='--',
                   label=f'Optimal: {optimal_threshold:.2f}')
axes[2, 0].legend()
# 8. Satisfaction Score vs Churn
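axes[2, 1].boxplot([df_churn[df_churn['Churn'] == 0]['SatisfactionScore'],
                    df_churn[df_churn['Churn'] == 1]['SatisfactionScore']])
# Tick labels are set separately because newer Matplotlib releases rename boxplot's `labels` argument
axes[2, 1].set_xticklabels(['Stay', 'Churn'])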
axes[2, 1].set_ylabel('Satisfaction Score')
axes[2, 1].set_title('Satisfaction Score by Churn Status')
axes[2, 1].grid(True, alpha=0.3)
# 9. Calibration Plot
from sklearn.calibration import calibration_curve
fraction_pos, mean_pred = calibration_curve(y_test, y_proba_best, n_bins=10)
axes[2, 2].plot(mean_pred, fraction_pos, 's-', label=best_model_name)
axes[2, 2].plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
axes[2, 2].set_xlabel('Mean Predicted Probability')
axes[2, 2].set_ylabel('Fraction of Positives')
axes[2, 2].set_title('Calibration Plot')
axes[2, 2].legend()
axes[2, 2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Business Insights
print("\nBusiness Insights:")
print(f"1. Optimal threshold for intervention: {optimal_threshold:.2f}")
print(f"2. Expected profit with optimal strategy: ${max(profits):,.0f}")
print("3. Key churn indicators: Low tenure, month-to-month contract")
print("4. High-risk segment: New customers with high support calls")
print("5. Retention focus: Customers with <12 months tenure")
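The cost-benefit step in the case study can be isolated into one small function, which makes the threshold trade-off easier to check by hand. The sketch below reuses the same profit formula and the same assumed costs ($50 per retention offer, $500 lost per churned customer); the function name expected_profit and the six-customer toy data are hypothetical. As written, the formula appears to assume that every correctly targeted customer is retained.

```python
def expected_profit(y_true, churn_proba, threshold,
                    retention_cost=50, churn_loss=500):
    """Profit of contacting every customer whose churn probability exceeds `threshold`."""
    tp = fp = fn = 0
    for actual, proba in zip(y_true, churn_proba):
        flagged = proba > threshold
        if flagged and actual == 1:
            tp += 1   # churner correctly targeted
        elif flagged:
            fp += 1   # loyal customer targeted unnecessarily
        elif actual == 1:
            fn += 1   # churner missed
    return tp * (churn_loss - retention_cost) - fp * retention_cost - fn * churn_loss

# Toy data (hypothetical): 1 = churned, with model probabilities for each customer
y_true_toy = [1, 1, 0, 0, 1, 0]
churn_proba_toy = [0.9, 0.6, 0.7, 0.2, 0.3, 0.1]

for threshold in (0.25, 0.5, 0.8):
    profit = expected_profit(y_true_toy, churn_proba_toy, threshold)
    print(f"threshold={threshold:.2f} -> profit={profit}")
```

Because a missed churner costs ten times as much as an unnecessary offer under these assumptions, lower thresholds tend to score better, which is why the profit-maximizing threshold can sit well below the default of 0.5.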
Sample Problems & Exercises
Exercise 1: Binary Classification
- Load a credit card fraud detection dataset
- Handle class imbalance with SMOTE
- Train Logistic Regression, KNN, and Decision Tree models
- Optimize hyperparameters with GridSearchCV
- Evaluate with a precision-recall curve
- Calculate cost-sensitive metrics
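As background for the SMOTE step in Exercise 1, the core idea is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors. The sketch below is a simplified, SMOTE-style illustration in plain Python, not the reference algorithm or the imbalanced-learn implementation; the function name, parameters, and toy points are all hypothetical.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create `n_new` synthetic points by interpolating between minority neighbors.

    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen point (excluding itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbor = rng.choice(neighbors)
        # Pick a random position on the segment between base and neighbor
        lam = rng.random()
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic

# Toy minority class (hypothetical 2-D feature vectors)
minority = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (1.4, 1.1)]
new_points = smote_like_oversample(minority, n_new=6)
print(len(new_points), "synthetic points created")
```

Oversampling of any kind should be applied to the training split only, after the train/test split, so that synthetic points never leak into the evaluation data. For the exercise itself, the imbalanced-learn package provides a full SMOTE implementation.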
Exercise 2: Multiclass Classification
- Use a wine quality dataset (3+ classes)
- Implement the one-vs-rest strategy
- Compare it with the one-vs-one strategy
- Create a confusion matrix heatmap
- Calculate macro- vs micro-averaged metrics
- Implement a voting classifier ensemble
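The difference between macro and micro averaging in Exercise 2 is easiest to see on a small imbalanced example. The sketch below computes both for recall in plain Python; the helper names and the toy labels are hypothetical, and scikit-learn exposes the same choice through the average parameter of recall_score.

```python
def per_class_recall(y_true, y_pred, labels):
    """Recall of each class: correct predictions divided by actual instances."""
    recalls = {}
    for label in labels:
        support = sum(1 for t in y_true if t == label)
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls[label] = tp / support if support else 0.0
    return recalls

def macro_recall(y_true, y_pred, labels):
    """Unweighted mean of per-class recall: every class counts equally."""
    recalls = per_class_recall(y_true, y_pred, labels)
    return sum(recalls.values()) / len(labels)

def micro_recall(y_true, y_pred):
    """Pool all instances first: every instance counts equally."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels (hypothetical): class A dominates, B and C are rare
labels = ["A", "B", "C"]
y_true_toy = ["A"] * 6 + ["B"] * 2 + ["C"] * 2
y_pred_toy = ["A"] * 6 + ["A", "B"] + ["A", "C"]

print(per_class_recall(y_true_toy, y_pred_toy, labels))
print(f"macro = {macro_recall(y_true_toy, y_pred_toy, labels):.3f}")
print(f"micro = {micro_recall(y_true_toy, y_pred_toy):.3f}")
```

Here the majority class is predicted perfectly while half of each rare class is missed, so the macro average is lower than the micro average. Macro averaging is therefore the more revealing summary when minority classes matter.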
Exercise 3: Model Selection & Tuning
- Implement cross-validation properly
- Use RandomizedSearchCV for hyperparameter tuning
- Compare model performance with statistical tests
- Analyze learning curves to detect overfitting
- Implement early stopping for iterative algorithms
- Create an automated model selection pipeline
Best Practices for Classification:
- Always check class balance: imbalanced data needs special handling
- Use appropriate metrics: accuracy is not everything; also look at precision and recall
- Stratified splitting: maintain the class distribution in the train and test sets
- Feature scaling: important for distance-based algorithms such as KNN
- Cross-validation: for a robust performance estimate
- Threshold optimization: the default of 0.5 is not always optimal
- Domain knowledge: understand the cost of false positives vs false negatives
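To illustrate the stratified splitting practice listed above, the sketch below splits indices class by class so that the train and test sets keep roughly the same class proportions. It is a simplified stand-in written in plain Python; the function name and the toy labels are hypothetical, and the case study obtains the same effect with train_test_split(..., stratify=y).

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx), sampling `test_size` of each class for the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy imbalanced labels (hypothetical): 90 negatives, 10 positives
labels_toy = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels_toy, test_size=0.3)

train_rate = sum(labels_toy[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels_toy[i] for i in test_idx) / len(test_idx)
print(f"positive rate: train={train_rate:.2f}, test={test_rate:.2f}")
```

With a purely random split, a rare class can end up over- or under-represented in the test set by chance, which makes metrics such as recall unstable; splitting within each class avoids this.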
Conclusion
In this session, students have learned:
- Basic concepts of supervised classification
- KNN implementation with optimization
- Decision Trees and interpretability
- Logistic Regression for binary classification
- Evaluation metrics and the confusion matrix
- ROC curve and AUC analysis
- A real-world case study with business context
Next session: Clustering (Unsupervised Learning)