📘 Session 11 - Basic Classification

🎯 Learning Objectives

Understand the concept of supervised learning for classification
Master the K-Nearest Neighbors (KNN) algorithm
Implement a Decision Tree classifier
Understand Logistic Regression for binary classification
Evaluate models with a confusion matrix and related metrics
Handle imbalanced datasets

🧩 1. What Is Classification?

Classification is a supervised learning task that predicts a discrete category or label for each data point.

🏷️ Binary Classification: 2 kelas (Yes/No, Pass/Fail)
🎨 Multiclass Classification: >2 kelas (A/B/C/D/E)
🏅 Multilabel Classification: Multiple labels per instance

Applications: Email spam detection, disease diagnosis, customer churn, image recognition
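
As a rough illustration of the three task types, the sketch below builds one small target array per type and asks scikit-learn's `type_of_target` helper how it would interpret each one. The label values are invented for illustration and do not come from any dataset in this session.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets, invented for illustration only
y_binary = np.array([0, 1, 1, 0])         # two classes, e.g. Fail/Pass
y_multiclass = np.array([0, 2, 1, 3, 2])  # more than two classes, e.g. grades
y_multilabel = np.array([[1, 0, 1],       # one row per instance,
                         [0, 1, 0],       # one 0/1 column per label
                         [1, 1, 0]])

for name, y in [("binary", y_binary),
                ("multiclass", y_multiclass),
                ("multilabel", y_multilabel)]:
    # type_of_target reports how scikit-learn will treat this target
    print(f"{name:<10} shape={y.shape} -> {type_of_target(y)}")
```

The shape is the practical difference: binary and multiclass targets hold one label per instance, while a multilabel target holds one indicator per label for every instance.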

🤖 2. K-Nearest Neighbors (KNN)

KNN Algorithm

Principle: "Birds of a feather flock together"

Calculate distance to all training points
Find K nearest neighbors
Vote for most common class (classification)

Pros: Simple, no explicit training phase (lazy learner), works well with local patterns

Cons: Slow for large datasets, sensitive to scale, curse of dimensionality
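
The three steps above can be sketched in plain Python. This is a minimal illustration rather than a replacement for `KNeighborsClassifier`; the function name `knn_predict` and the toy data are assumptions made for this example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the query to every training point
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 2: keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote among the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): [study_hours, attendance_fraction] -> outcome
train_X = [[1.0, 0.60], [2.0, 0.65], [6.0, 0.90], [7.0, 0.95], [8.0, 0.85]]
train_y = ["Fail", "Fail", "Pass", "Pass", "Pass"]

prediction = knn_predict(train_X, train_y, query=[6.5, 0.90], k=3)
print(f"Predicted class: {prediction}")
```

In this toy data the study-hours column spans a much wider range than the attendance column, so it dominates the distance. That is the scale sensitivity listed under Cons, and the reason features are usually standardized before KNN is applied.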

🌳 3. Decision Tree Classifier

Decision Tree Algorithm

Principle: Series of if-then rules learned from data

Split the data on the feature (and threshold) that gives the highest information gain or the largest reduction in Gini impurity
Recursively build tree until stopping criteria
Leaf nodes represent class predictions

Pros: Interpretable, handles non-linear relationships, no feature scaling needed

Cons: Prone to overfitting, unstable, biased with imbalanced data
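
To make the splitting criterion concrete, the sketch below computes entropy and information gain for two candidate splits of the same ten labels. The helper names and the label counts are assumptions chosen for illustration; scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default, which ranks splits in a similar way.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children

# Hypothetical node with 5 Pass and 5 Fail students
parent = ["Pass"] * 5 + ["Fail"] * 5

# Split A separates the classes perfectly
gain_a = information_gain(parent, ["Pass"] * 5, ["Fail"] * 5)

# Split B leaves both children almost as mixed as the parent
left_b = ["Pass"] * 3 + ["Fail"] * 2
right_b = ["Pass"] * 2 + ["Fail"] * 3
gain_b = information_gain(parent, left_b, right_b)

print(f"Gain of split A: {gain_a:.3f}")
print(f"Gain of split B: {gain_b:.3f}")
```

The tree builder would choose split A here, because it removes all uncertainty about the class, while split B removes almost none.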

📈 4. Logistic Regression

Logistic Function: p = 1 / (1 + e^(-z))
where z = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
Decision Rule: Class 1 if p > 0.5, else Class 0
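
The formula and decision rule above can be traced by hand for a single observation. The sketch below does so in plain Python; the intercept, coefficients, and feature values are hypothetical numbers picked for illustration, not values fitted to any data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, intercept, coefs):
    """p = sigmoid(b0 + b1*x1 + ... + bn*xn) for one observation x."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return sigmoid(z)

# Hypothetical model: b0 = -4.0, b1 = 0.8 (study hours), b2 = 0.5 (assignment flag)
intercept = -4.0
coefs = [0.8, 0.5]
student = [5.0, 1.0]  # 5 study hours, assignment submitted

p = predict_proba(student, intercept, coefs)  # z = -4.0 + 4.0 + 0.5 = 0.5
label = 1 if p > 0.5 else 0                   # decision rule with threshold 0.5

print(f"p = {p:.4f}, predicted class = {label}")
```

Because z = 0 gives exactly p = 0.5, the 0.5 threshold is equivalent to checking whether z is positive.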

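# Imports used by the code in this section
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_curve, auc, confusion_matrix,
                             ConfusionMatrixDisplay)

# Logistic Regression for Binary Classification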
print("="*60)
print("LOGISTIC REGRESSION - STUDENT PERFORMANCE")
print("="*60)

# Create binary classification dataset
np.random.seed(42)
n_students = 300

# Features
study_hours = np.random.normal(5, 2, n_students)
attendance = np.random.uniform(60, 100, n_students)
assignments = np.random.uniform(50, 100, n_students)

# Create target (Pass/Fail)
score = (study_hours * 5 + attendance * 0.3 + assignments * 0.2 + 
         np.random.normal(0, 10, n_students))
pass_fail = (score > np.median(score)).astype(int)

# DataFrame
df_students = pd.DataFrame({
    'Study_Hours': study_hours,
    'Attendance': attendance,
    'Assignments': assignments,
    'Pass': pass_fail
})

print("\nDataset Overview:")
print(df_students.head())
print("\nClass distribution:")
print(f"  Fail (0): {(pass_fail == 0).sum()} students")
print(f"  Pass (1): {(pass_fail == 1).sum()} students")

# Prepare data
X = df_students[['Study_Hours', 'Attendance', 'Assignments']]
y = df_students['Pass']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Predictions
y_pred_log = log_reg.predict(X_test_scaled)
y_pred_proba = log_reg.predict_proba(X_test_scaled)[:, 1]

print(f"\nTest Accuracy: {accuracy_score(y_test, y_pred_log):.3f}")

# Visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Logistic Regression Analysis', fontsize=16, fontweight='bold')

# 1. ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

axes[0, 0].plot(fpr, tpr, color='darkorange', lw=2, 
                label=f'ROC curve (AUC = {roc_auc:.2f})')
axes[0, 0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[0, 0].set_xlabel('False Positive Rate')
axes[0, 0].set_ylabel('True Positive Rate')
axes[0, 0].set_title('ROC Curve')
axes[0, 0].legend(loc="lower right")
axes[0, 0].grid(True, alpha=0.3)

# 2. Confusion Matrix
cm_log = confusion_matrix(y_test, y_pred_log)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_log, 
                              display_labels=['Fail', 'Pass'])
disp.plot(ax=axes[0, 1], cmap='Blues')
axes[0, 1].set_title('Confusion Matrix')

# 3. Probability Distribution
axes[0, 2].hist(y_pred_proba[y_test == 0], bins=20, alpha=0.5, 
                label='Fail', color='red')
axes[0, 2].hist(y_pred_proba[y_test == 1], bins=20, alpha=0.5, 
                label='Pass', color='green')
axes[0, 2].axvline(0.5, color='black', linestyle='--', label='Threshold')
axes[0, 2].set_xlabel('Predicted Probability')
axes[0, 2].set_ylabel('Frequency')
axes[0, 2].set_title('Probability Distribution by Class')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)

# 4. Feature Coefficients
coefs = log_reg.coef_[0]
features = X.columns
axes[1, 0].bar(features, coefs, color=['green' if c > 0 else 'red' for c in coefs])
axes[1, 0].set_ylabel('Coefficient')
axes[1, 0].set_title('Feature Importance (Logistic Regression)')
axes[1, 0].grid(True, alpha=0.3)

# 5. Threshold Analysis
precisions = []
recalls = []
thresholds_range = np.linspace(0.1, 0.9, 20)

for thresh in thresholds_range:
    y_pred_thresh = (y_pred_proba > thresh).astype(int)
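    # zero_division=0 avoids a warning when a high threshold yields no positive predictions
    precisions.append(precision_score(y_test, y_pred_thresh, zero_division=0))
    recalls.append(recall_score(y_test, y_pred_thresh, zero_division=0))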

axes[1, 1].plot(thresholds_range, precisions, 'b-', label='Precision')
axes[1, 1].plot(thresholds_range, recalls, 'r-', label='Recall')
axes[1, 1].set_xlabel('Threshold')
axes[1, 1].set_ylabel('Score')
axes[1, 1].set_title('Precision-Recall vs Threshold')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# 6. Learning Curve
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(random_state=42), X_train_scaled, y_train, 
    cv=5, train_sizes=np.linspace(0.1, 1.0, 10)
)

axes[1, 2].plot(train_sizes, np.mean(train_scores, axis=1), 'o-', 
                color='blue', label='Training score')
axes[1, 2].plot(train_sizes, np.mean(val_scores, axis=1), 'o-', 
                color='red', label='Validation score')
axes[1, 2].set_xlabel('Training Set Size')
axes[1, 2].set_ylabel('Accuracy')
axes[1, 2].set_title('Learning Curves')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

🎯 5. Case Study: Customer Churn Prediction

📋 Customer Churn Prediction

Build a model that predicts which customers will churn (cancel their subscription) based on their behavior and characteristics.

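# COMPREHENSIVE CASE STUDY: CUSTOMER CHURN
# Additional imports needed by this case study
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score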
print("="*60)
print("CASE STUDY: CUSTOMER CHURN PREDICTION")
print("="*60)

# Generate synthetic customer data
np.random.seed(42)
n_customers = 1000

# Features
data = {
    'Tenure': np.random.uniform(0, 72, n_customers),  # Months
    'MonthlyCharges': np.random.uniform(20, 120, n_customers),
    'TotalCharges': np.random.uniform(100, 8000, n_customers),
    'NumServices': np.random.randint(1, 7, n_customers),
    'Contract': np.random.choice(['Month', 'Year', '2Years'], n_customers, p=[0.5, 0.3, 0.2]),
    'PaymentMethod': np.random.choice(['Bank', 'Credit', 'Electronic', 'Mail'], 
                                     n_customers, p=[0.3, 0.3, 0.3, 0.1]),
    'SupportCalls': np.random.poisson(2, n_customers),
    'SatisfactionScore': np.random.uniform(1, 5, n_customers)
}

df_churn = pd.DataFrame(data)

# Create churn based on logical rules
churn_prob = (
    0.3 +  # Base probability
    (df_churn['Tenure'] < 12) * 0.2 +  # New customers more likely
    (df_churn['MonthlyCharges'] > 80) * 0.1 +  # High charges
    (df_churn['Contract'] == 'Month') * 0.2 +  # Month-to-month risky
    (df_churn['SupportCalls'] > 3) * 0.15 +  # Many complaints
    (df_churn['SatisfactionScore'] < 2.5) * 0.2  # Low satisfaction
)
df_churn['Churn'] = (np.random.random(n_customers) < churn_prob).astype(int)

print("\n📊 Dataset Overview:")
print(df_churn.head())
print(f"\nChurn Rate: {df_churn['Churn'].mean():.2%}")

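# Encode categorical variables
# Note: LabelEncoder maps categories to integer codes in sorted order, which imposes
# an arbitrary ordering on nominal features such as PaymentMethod. One-hot encoding
# is usually a safer choice for Logistic Regression and KNN.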
from sklearn.preprocessing import LabelEncoder
le_contract = LabelEncoder()
le_payment = LabelEncoder()

df_encoded = df_churn.copy()
df_encoded['Contract_Encoded'] = le_contract.fit_transform(df_churn['Contract'])
df_encoded['PaymentMethod_Encoded'] = le_payment.fit_transform(df_churn['PaymentMethod'])

# Prepare features
features = ['Tenure', 'MonthlyCharges', 'TotalCharges', 'NumServices', 
           'Contract_Encoded', 'PaymentMethod_Encoded', 'SupportCalls', 
           'SatisfactionScore']
X = df_encoded[features]
y = df_encoded['Churn']

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=42, stratify=y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train multiple models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    y_proba = model.predict_proba(X_test_scaled)[:, 1] if hasattr(model, 'predict_proba') else y_pred
    
    results[name] = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'AUC': roc_auc_score(y_test, y_proba)
    }

# Display results
results_df = pd.DataFrame(results).T.round(3)
print("\n📊 Model Comparison:")
print(results_df)

# Best model analysis
best_model_name = results_df['F1'].idxmax()
best_model = models[best_model_name]
print(f"\n🏆 Best Model (by F1 Score): {best_model_name}")

# Comprehensive visualization
fig, axes = plt.subplots(3, 3, figsize=(20, 18))
fig.suptitle('Customer Churn Prediction Analysis', fontsize=18, fontweight='bold')

# 1. Model Performance Comparison
metrics = ['Accuracy', 'Precision', 'Recall', 'F1']
x_pos = np.arange(len(models))
width = 0.2

for i, metric in enumerate(metrics):
    values = [results[model][metric] for model in models.keys()]
    axes[0, 0].bar(x_pos + i*width, values, width, label=metric)

axes[0, 0].set_xlabel('Models')
axes[0, 0].set_ylabel('Score')
axes[0, 0].set_title('Model Performance Metrics')
axes[0, 0].set_xticks(x_pos + width * 1.5)
axes[0, 0].set_xticklabels(models.keys(), rotation=45)
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. ROC Curves Comparison
for name, model in models.items():
    if hasattr(model, 'predict_proba'):
        y_proba = model.predict_proba(X_test_scaled)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_proba)
        auc_score = auc(fpr, tpr)
        axes[0, 1].plot(fpr, tpr, label=f'{name} (AUC={auc_score:.2f})')

axes[0, 1].plot([0, 1], [0, 1], 'k--')
axes[0, 1].set_xlabel('False Positive Rate')
axes[0, 1].set_ylabel('True Positive Rate')
axes[0, 1].set_title('ROC Curves Comparison')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Feature Importance (Random Forest)
rf_model = models['Random Forest']
importances = rf_model.feature_importances_
indices = np.argsort(importances)[::-1]

axes[0, 2].barh(range(len(importances)), importances[indices], color='teal')
axes[0, 2].set_yticks(range(len(importances)))
axes[0, 2].set_yticklabels([features[i] for i in indices])
axes[0, 2].set_xlabel('Importance')
axes[0, 2].set_title('Feature Importance (Random Forest)')
axes[0, 2].grid(True, alpha=0.3)

# 4. Churn Distribution by Tenure
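tenure_bins = pd.cut(df_churn['Tenure'], bins=[0, 12, 24, 48, 72],
                     labels=['0-1yr', '1-2yr', '2-4yr', '4-6yr'],
                     include_lowest=True)  # keep Tenure == 0 in the first bin
# observed=False keeps all four bins and avoids the pandas FutureWarning for categorical groupers
churn_by_tenure = df_churn.groupby(tenure_bins, observed=False)['Churn'].agg(['mean', 'count'])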

axes[1, 0].bar(range(len(churn_by_tenure)), churn_by_tenure['mean'], 
               color='coral', alpha=0.7)
axes[1, 0].set_xticks(range(len(churn_by_tenure)))
axes[1, 0].set_xticklabels(churn_by_tenure.index)
axes[1, 0].set_ylabel('Churn Rate')
axes[1, 0].set_title('Churn Rate by Tenure')
axes[1, 0].grid(True, alpha=0.3)

# 5. Confusion Matrix (Best Model)
y_pred_best = best_model.predict(X_test_scaled)
cm_best = confusion_matrix(y_test, y_pred_best)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_best, 
                              display_labels=['Stay', 'Churn'])
disp.plot(ax=axes[1, 1], cmap='Blues')
axes[1, 1].set_title(f'Confusion Matrix ({best_model_name})')

# 6. Churn by Contract Type
contract_churn = df_churn.groupby('Contract')['Churn'].agg(['mean', 'count'])
axes[1, 2].bar(contract_churn.index, contract_churn['mean'], color='lightgreen')
axes[1, 2].set_ylabel('Churn Rate')
axes[1, 2].set_title('Churn Rate by Contract Type')
axes[1, 2].grid(True, alpha=0.3)

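# 7. Cost-Benefit Analysis
# Assumptions: a retention offer costs $50, a lost customer costs $500 in revenue,
# and every churner who receives an offer is retained (an optimistic simplification)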
retention_cost = 50
churn_loss = 500

y_proba_best = best_model.predict_proba(X_test_scaled)[:, 1]
thresholds = np.linspace(0.1, 0.9, 20)
profits = []

for thresh in thresholds:
    y_pred_thresh = (y_proba_best > thresh).astype(int)
    tp = ((y_pred_thresh == 1) & (y_test == 1)).sum()  # Correctly predicted churn
    fp = ((y_pred_thresh == 1) & (y_test == 0)).sum()  # False churn prediction
    fn = ((y_pred_thresh == 0) & (y_test == 1)).sum()  # Missed churn
    
    profit = tp * (churn_loss - retention_cost) - fp * retention_cost - fn * churn_loss
    profits.append(profit)

axes[2, 0].plot(thresholds, profits, 'g-o')
axes[2, 0].axhline(0, color='red', linestyle='--')
axes[2, 0].set_xlabel('Probability Threshold')
axes[2, 0].set_ylabel('Expected Profit ($)')
axes[2, 0].set_title('Profit vs Decision Threshold')
axes[2, 0].grid(True, alpha=0.3)

optimal_threshold = thresholds[np.argmax(profits)]
axes[2, 0].axvline(optimal_threshold, color='blue', linestyle='--', 
                   label=f'Optimal: {optimal_threshold:.2f}')
axes[2, 0].legend()

# 8. Satisfaction Score vs Churn
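axes[2, 1].boxplot([df_churn[df_churn['Churn']==0]['SatisfactionScore'],
                    df_churn[df_churn['Churn']==1]['SatisfactionScore']])
# Set tick labels explicitly; boxplot's `labels` argument is deprecated in recent Matplotlib
axes[2, 1].set_xticklabels(['Stay', 'Churn'])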
axes[2, 1].set_ylabel('Satisfaction Score')
axes[2, 1].set_title('Satisfaction Score by Churn Status')
axes[2, 1].grid(True, alpha=0.3)

# 9. Calibration Plot
from sklearn.calibration import calibration_curve
fraction_pos, mean_pred = calibration_curve(y_test, y_proba_best, n_bins=10)
axes[2, 2].plot(mean_pred, fraction_pos, 's-', label=best_model_name)
axes[2, 2].plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
axes[2, 2].set_xlabel('Mean Predicted Probability')
axes[2, 2].set_ylabel('Fraction of Positives')
axes[2, 2].set_title('Calibration Plot')
axes[2, 2].legend()
axes[2, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Business Insights
print("\n💡 Business Insights:")
print(f"1. Optimal threshold for intervention: {optimal_threshold:.2f}")
print(f"2. Expected profit with optimal strategy: ${max(profits):,.0f}")
print("3. Key churn indicators: Low tenure, month-to-month contract")
print("4. High-risk segment: New customers with high support calls")
print("5. Retention focus: Customers with <12 months tenure")
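
The comparison table in the case study reports accuracy, precision, recall, and F1. As a worked example of how those numbers follow from the four cells of a confusion matrix, the sketch below computes them by hand. The counts are hypothetical and are not the output of the models above; the helper name `classification_metrics` is likewise an assumption for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the standard metrics from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Precision: of the customers flagged as churners, how many really churned?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the customers who really churned, how many were flagged?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall, "F1": f1}

# Hypothetical counts for 300 test customers (Churn is the positive class)
metrics_example = classification_metrics(tp=90, fp=30, fn=60, tn=120)

for metric_name, value in metrics_example.items():
    print(f"{metric_name:<10} {value:.3f}")
```

With these counts accuracy is 0.700, precision 0.750, and recall 0.600: the model is right about most customers it flags, yet it still misses 40% of actual churners, which accuracy alone would not reveal.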

🧪 Practice Problems & Exercises

Exercise 1: Binary Classification

Load a credit card fraud detection dataset
Handle class imbalance with SMOTE
Train Logistic Regression, KNN, and Decision Tree models
Optimize hyperparameters with GridSearchCV
Evaluate with a precision-recall curve
Calculate cost-sensitive metrics

Exercise 2: Multiclass Classification

Use a wine quality dataset (3+ classes)
Implement a one-vs-rest strategy
Compare it with a one-vs-one strategy
Create a confusion matrix heatmap
Calculate macro- vs micro-averaged metrics
Implement a voting classifier ensemble

Exercise 3: Model Selection & Tuning

Implement cross-validation properly
Use RandomizedSearchCV for hyperparameter tuning
Compare model performance with statistical tests
Analyze learning curves to detect overfitting
Implement early stopping for iterative algorithms
Create an automated model selection pipeline

💡 Best Practices for Classification:

Always check class balance: Imbalanced data needs special handling
Use appropriate metrics: Accuracy is not everything; also look at precision and recall
Stratified splitting: Maintain the class distribution in the train and test sets
Feature scaling: Important for distance-based algorithms such as KNN
Cross-validation: Gives a more robust performance estimate
Threshold optimization: The default of 0.5 is not always optimal
Domain knowledge: Understand the cost of false positives vs false negatives
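
Several of these practices can be combined in a few lines. The sketch below is one possible setup, not the only one: it generates a synthetic dataset with roughly 10% positives via `make_classification`, scales features inside a pipeline so each cross-validation fold is scaled on its own training part, uses stratified folds, reweights classes with `class_weight="balanced"` (a simpler alternative to resampling), and scores with F1 instead of accuracy. The dataset parameters are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (assumption: about 90% negatives, 10% positives)
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     weights=[0.9, 0.1], random_state=42)
print(f"Positive class share: {y_demo.mean():.2%}")

# Scaling lives inside the pipeline, so it is refit on the training part of every fold
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the class ratio roughly constant across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# F1 on the positive class is more informative than accuracy for imbalanced data
f1_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="f1")
print(f"F1 per fold: {f1_scores.round(3)}")
print(f"Mean F1: {f1_scores.mean():.3f} (std {f1_scores.std():.3f})")
```

Keeping the scaler inside the pipeline matters: fitting it on the full dataset before cross-validation would leak information from the validation folds into training.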

✍️ Conclusion

In this session, students have learned:

✅ Basic concepts of supervised classification
✅ KNN implementation with optimization
✅ Decision Trees and interpretability
✅ Logistic Regression for binary classification
✅ Evaluation metrics and the confusion matrix
✅ ROC curve and AUC analysis
✅ A real-world case study with business context

Next session: Clustering (Unsupervised Learning)