📘 Session 11 - Basic Classification

🎯 Learning Objectives

  1. Understand the concept of supervised learning for classification
  2. Master the K-Nearest Neighbors (KNN) algorithm
  3. Implement a Decision Tree classifier
  4. Understand Logistic Regression for binary classification
  5. Evaluate models with a confusion matrix and related metrics
  6. Handle imbalanced datasets

🧩 1. What Is Classification?

Classification is a supervised learning task that predicts a discrete category (label) for each data point.

  • 🏷️ Binary Classification: 2 classes (Yes/No, Pass/Fail)
  • 🎨 Multiclass Classification: more than 2 classes (A/B/C/D/E), exactly one per instance
  • 🏅 Multilabel Classification: several labels can apply to the same instance

Applications: email spam detection, disease diagnosis, customer churn prediction, image recognition
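To make the three problem types concrete, the sketch below shows how their targets are typically laid out. All values and the label names (`sports`, `politics`, `tech`) are made up for illustration, and `labels_of` is a hypothetical helper, not part of any library.

```python
# Toy targets illustrating the three problem types (all values are made up).

# Binary: one label per sample, two possible values.
y_binary = [0, 1, 1, 0]            # e.g. 0 = Fail, 1 = Pass

# Multiclass: one label per sample, more than two possible values.
y_multiclass = ["A", "C", "B", "E"]

# Multilabel: each sample can carry several labels at once,
# commonly stored as one 0/1 indicator column per label.
label_names = ["sports", "politics", "tech"]
y_multilabel = [
    [1, 0, 0],   # sports only
    [0, 1, 1],   # politics and tech
    [0, 0, 0],   # no label applies
]

def labels_of(row):
    """Translate an indicator row back into label names."""
    return [name for name, flag in zip(label_names, row) if flag]

print(labels_of(y_multilabel[1]))  # ['politics', 'tech']
```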

🤖 2. K-Nearest Neighbors (KNN)

KNN Algorithm

Principle: "Birds of a feather flock together"

  1. Calculate distance to all training points
  2. Find K nearest neighbors
  3. Vote for most common class (classification)

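Pros: Simple, no explicit training phase (the model just stores the training data), works well with local patterns

Cons: Slow prediction on large datasets (every query is compared against all training points), sensitive to feature scale, suffers from the curse of dimensionality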
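The three steps of the algorithm can be sketched in a few lines of plain Python. This is a minimal illustration, not a replacement for scikit-learn's `KNeighborsClassifier`: the function name `knn_predict`, the toy points, and the choice of Euclidean distance are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from x_new to every training point.
    distances = [(math.dist(x, x_new), label) for x, label in zip(X_train, y_train)]
    # Step 2: keep the k closest points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over their labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up): two features per point, two classes.
X_train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]
y_train = ["Fail", "Fail", "Fail", "Pass", "Pass", "Pass"]

print(knn_predict(X_train, y_train, (2.0, 2.0)))  # Fail
print(knn_predict(X_train, y_train, (6.0, 6.0)))  # Pass
```

Because the vote depends entirely on raw distances, a feature measured on a much larger scale would dominate the result, which is why KNN is usually combined with feature scaling.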

🌳 3. Decision Tree Classifier

Decision Tree Algorithm

Principle: Series of if-then rules learned from data

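  • Split the data on the feature (and threshold) that gives the best information gain, or the largest reduction in Gini impurity, which is scikit-learn's default criterion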
  • Recursively build tree until stopping criteria
  • Leaf nodes represent class predictions

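Pros: Interpretable, captures non-linear relationships, no feature scaling needed

Cons: Prone to overfitting, unstable (small changes in the data can produce a very different tree), biased toward the majority class on imbalanced data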
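The splitting rule is easier to follow with numbers. The sketch below computes entropy-based information gain for two candidate splits of a toy node. The function names and the toy labels are assumptions for illustration; a real tree evaluates many candidate splits per feature.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    weight_left = len(left) / len(parent)
    weight_right = len(right) / len(parent)
    return entropy(parent) - (weight_left * entropy(left) + weight_right * entropy(right))

# A node with 4 Pass and 4 Fail samples has the maximum entropy of 1 bit.
parent = ["Pass"] * 4 + ["Fail"] * 4

# A split that separates the classes perfectly removes all uncertainty.
perfect = information_gain(parent, ["Pass"] * 4, ["Fail"] * 4)

# A split that leaves both children as mixed as the parent gains nothing.
mixed = ["Pass", "Pass", "Fail", "Fail"]
useless = information_gain(parent, mixed, mixed)

print(perfect)  # 1.0
print(useless)  # 0.0
```

The tree picks the split with the highest gain at each node and repeats the process on the children.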

📈 4. Logistic Regression

Logistic Function: p = 1 / (1 + e^(-z))
where z = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
Decision Rule: Class 1 if p > 0.5, else Class 0
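
As a quick worked example of the formula and decision rule above, the sketch below evaluates one sample by hand. The intercept and coefficients are hypothetical values chosen only for illustration; in practice they are learned from the training data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, intercept, coefs, threshold=0.5):
    """Return (p, label) for one sample under the decision rule p > threshold."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    p = sigmoid(z)
    return p, int(p > threshold)

# Hypothetical coefficients, chosen only for illustration.
intercept, coefs = -4.0, [0.5, 0.02]

# One sample: 6 study hours, 80% attendance -> z = -4 + 3 + 1.6 = 0.6
p, label = predict_class([6.0, 80.0], intercept, coefs)
print(round(p, 3), label)  # 0.646 1
```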
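
# Imports needed to run this example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_curve, auc, confusion_matrix,
                             ConfusionMatrixDisplay)

# Logistic Regression for Binary Classification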
print("="*60)
print("LOGISTIC REGRESSION - STUDENT PERFORMANCE")
print("="*60)

# Create binary classification dataset
np.random.seed(42)
n_students = 300

# Features
study_hours = np.random.normal(5, 2, n_students)
attendance = np.random.uniform(60, 100, n_students)
assignments = np.random.uniform(50, 100, n_students)

# Create target (Pass/Fail)
score = (study_hours * 5 + attendance * 0.3 + assignments * 0.2 + 
         np.random.normal(0, 10, n_students))
pass_fail = (score > np.median(score)).astype(int)

# DataFrame
df_students = pd.DataFrame({
    'Study_Hours': study_hours,
    'Attendance': attendance,
    'Assignments': assignments,
    'Pass': pass_fail
})

print("\nDataset Overview:")
print(df_students.head())
print(f"\nClass distribution:")
print(f"  Fail (0): {(pass_fail == 0).sum()} students")
print(f"  Pass (1): {(pass_fail == 1).sum()} students")

# Prepare data
X = df_students[['Study_Hours', 'Attendance', 'Assignments']]
y = df_students['Pass']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Predictions
y_pred_log = log_reg.predict(X_test_scaled)
y_pred_proba = log_reg.predict_proba(X_test_scaled)[:, 1]

print(f"\nTest Accuracy: {accuracy_score(y_test, y_pred_log):.3f}")

# Visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Logistic Regression Analysis', fontsize=16, fontweight='bold')

# 1. ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

axes[0, 0].plot(fpr, tpr, color='darkorange', lw=2, 
                label=f'ROC curve (AUC = {roc_auc:.2f})')
axes[0, 0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[0, 0].set_xlabel('False Positive Rate')
axes[0, 0].set_ylabel('True Positive Rate')
axes[0, 0].set_title('ROC Curve')
axes[0, 0].legend(loc="lower right")
axes[0, 0].grid(True, alpha=0.3)

# 2. Confusion Matrix
cm_log = confusion_matrix(y_test, y_pred_log)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_log, 
                              display_labels=['Fail', 'Pass'])
disp.plot(ax=axes[0, 1], cmap='Blues')
axes[0, 1].set_title('Confusion Matrix')

# 3. Probability Distribution
axes[0, 2].hist(y_pred_proba[y_test == 0], bins=20, alpha=0.5, 
                label='Fail', color='red')
axes[0, 2].hist(y_pred_proba[y_test == 1], bins=20, alpha=0.5, 
                label='Pass', color='green')
axes[0, 2].axvline(0.5, color='black', linestyle='--', label='Threshold')
axes[0, 2].set_xlabel('Predicted Probability')
axes[0, 2].set_ylabel('Frequency')
axes[0, 2].set_title('Probability Distribution by Class')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)

# 4. Feature Coefficients
coefs = log_reg.coef_[0]
features = X.columns
axes[1, 0].bar(features, coefs, color=['green' if c > 0 else 'red' for c in coefs])
axes[1, 0].set_ylabel('Coefficient')
axes[1, 0].set_title('Feature Importance (Logistic Regression)')
axes[1, 0].grid(True, alpha=0.3)

# 5. Threshold Analysis
precisions = []
recalls = []
thresholds_range = np.linspace(0.1, 0.9, 20)

for thresh in thresholds_range:
    y_pred_thresh = (y_pred_proba > thresh).astype(int)
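    # zero_division=0 avoids warnings when a high threshold yields no positive predictions
    precisions.append(precision_score(y_test, y_pred_thresh, zero_division=0))
    recalls.append(recall_score(y_test, y_pred_thresh, zero_division=0))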

axes[1, 1].plot(thresholds_range, precisions, 'b-', label='Precision')
axes[1, 1].plot(thresholds_range, recalls, 'r-', label='Recall')
axes[1, 1].set_xlabel('Threshold')
axes[1, 1].set_ylabel('Score')
axes[1, 1].set_title('Precision-Recall vs Threshold')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# 6. Learning Curve
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(random_state=42), X_train_scaled, y_train, 
    cv=5, train_sizes=np.linspace(0.1, 1.0, 10)
)

axes[1, 2].plot(train_sizes, np.mean(train_scores, axis=1), 'o-', 
                color='blue', label='Training score')
axes[1, 2].plot(train_sizes, np.mean(val_scores, axis=1), 'o-', 
                color='red', label='Validation score')
axes[1, 2].set_xlabel('Training Set Size')
axes[1, 2].set_ylabel('Accuracy')
axes[1, 2].set_title('Learning Curves')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
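
The confusion matrix plotted above is also the source of the precision, recall, and F1 values used throughout this session. The sketch below derives them by hand from the four cell counts. The function names and toy labels are assumptions for illustration; scikit-learn's `precision_score`, `recall_score`, and `f1_score` perform the same calculations.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def metrics_from_counts(tp, fp, fn, tn):
    """Derive the usual metrics; undefined ratios are reported as 0.0."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels (made up): 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 5
print(metrics_from_counts(tp, fp, fn, tn))
```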

🎯 5. Case Study: Customer Churn Prediction

📋 Predicting Customer Churn

Build a model that predicts which customers will churn (cancel their subscription) based on their behavior and characteristics.

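# COMPREHENSIVE CASE STUDY: CUSTOMER CHURN
# Imports needed to run this case study on its own
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, roc_curve, auc,
                             confusion_matrix, ConfusionMatrixDisplay)
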
print("="*60)
print("CASE STUDY: CUSTOMER CHURN PREDICTION")
print("="*60)

# Generate synthetic customer data
np.random.seed(42)
n_customers = 1000

# Features
data = {
    'Tenure': np.random.uniform(0, 72, n_customers),  # Months
    'MonthlyCharges': np.random.uniform(20, 120, n_customers),
    'TotalCharges': np.random.uniform(100, 8000, n_customers),
    'NumServices': np.random.randint(1, 7, n_customers),
    'Contract': np.random.choice(['Month', 'Year', '2Years'], n_customers, p=[0.5, 0.3, 0.2]),
    'PaymentMethod': np.random.choice(['Bank', 'Credit', 'Electronic', 'Mail'], 
                                     n_customers, p=[0.3, 0.3, 0.3, 0.1]),
    'SupportCalls': np.random.poisson(2, n_customers),
    'SatisfactionScore': np.random.uniform(1, 5, n_customers)
}

df_churn = pd.DataFrame(data)

# Create churn based on logical rules
churn_prob = (
    0.3 +  # Base probability
    (df_churn['Tenure'] < 12) * 0.2 +  # New customers more likely
    (df_churn['MonthlyCharges'] > 80) * 0.1 +  # High charges
    (df_churn['Contract'] == 'Month') * 0.2 +  # Month-to-month risky
    (df_churn['SupportCalls'] > 3) * 0.15 +  # Many complaints
    (df_churn['SatisfactionScore'] < 2.5) * 0.2  # Low satisfaction
).clip(upper=1.0)  # the rule weights can add up to 1.15, so cap at a valid probability
df_churn['Churn'] = (np.random.random(n_customers) < churn_prob).astype(int)

print("\n📊 Dataset Overview:")
print(df_churn.head())
print(f"\nChurn Rate: {df_churn['Churn'].mean():.2%}")

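# Encode categorical variables as integer codes.
# Note: this imposes an arbitrary order on the categories; one-hot encoding is a common alternative.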
from sklearn.preprocessing import LabelEncoder
le_contract = LabelEncoder()
le_payment = LabelEncoder()

df_encoded = df_churn.copy()
df_encoded['Contract_Encoded'] = le_contract.fit_transform(df_churn['Contract'])
df_encoded['PaymentMethod_Encoded'] = le_payment.fit_transform(df_churn['PaymentMethod'])

# Prepare features
features = ['Tenure', 'MonthlyCharges', 'TotalCharges', 'NumServices', 
           'Contract_Encoded', 'PaymentMethod_Encoded', 'SupportCalls', 
           'SatisfactionScore']
X = df_encoded[features]
y = df_encoded['Churn']

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=42, stratify=y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train multiple models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    y_proba = model.predict_proba(X_test_scaled)[:, 1] if hasattr(model, 'predict_proba') else y_pred
    
    results[name] = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'AUC': roc_auc_score(y_test, y_proba)
    }

# Display results
results_df = pd.DataFrame(results).T.round(3)
print("\n📊 Model Comparison:")
print(results_df)

# Best model analysis
best_model_name = results_df['F1'].idxmax()
best_model = models[best_model_name]
print(f"\n🏆 Best Model (by F1 Score): {best_model_name}")

# Comprehensive visualization
fig, axes = plt.subplots(3, 3, figsize=(20, 18))
fig.suptitle('Customer Churn Prediction Analysis', fontsize=18, fontweight='bold')

# 1. Model Performance Comparison
metrics = ['Accuracy', 'Precision', 'Recall', 'F1']
x_pos = np.arange(len(models))
width = 0.2

for i, metric in enumerate(metrics):
    values = [results[model][metric] for model in models.keys()]
    axes[0, 0].bar(x_pos + i*width, values, width, label=metric)

axes[0, 0].set_xlabel('Models')
axes[0, 0].set_ylabel('Score')
axes[0, 0].set_title('Model Performance Metrics')
axes[0, 0].set_xticks(x_pos + width * 1.5)
axes[0, 0].set_xticklabels(list(models.keys()), rotation=45)
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. ROC Curves Comparison
for name, model in models.items():
    if hasattr(model, 'predict_proba'):
        y_proba = model.predict_proba(X_test_scaled)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_proba)
        auc_score = auc(fpr, tpr)
        axes[0, 1].plot(fpr, tpr, label=f'{name} (AUC={auc_score:.2f})')

axes[0, 1].plot([0, 1], [0, 1], 'k--')
axes[0, 1].set_xlabel('False Positive Rate')
axes[0, 1].set_ylabel('True Positive Rate')
axes[0, 1].set_title('ROC Curves Comparison')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Feature Importance (Random Forest)
rf_model = models['Random Forest']
importances = rf_model.feature_importances_
indices = np.argsort(importances)[::-1]

axes[0, 2].barh(range(len(importances)), importances[indices], color='teal')
axes[0, 2].set_yticks(range(len(importances)))
axes[0, 2].set_yticklabels([features[i] for i in indices])
axes[0, 2].set_xlabel('Importance')
axes[0, 2].set_title('Feature Importance (Random Forest)')
axes[0, 2].grid(True, alpha=0.3)

# 4. Churn Distribution by Tenure
# include_lowest keeps a tenure of exactly 0 in the first bin
tenure_bins = pd.cut(df_churn['Tenure'], bins=[0, 12, 24, 48, 72], 
                     labels=['0-1yr', '1-2yr', '2-4yr', '4-6yr'],
                     include_lowest=True)
churn_by_tenure = df_churn.groupby(tenure_bins, observed=False)['Churn'].agg(['mean', 'count'])

axes[1, 0].bar(range(len(churn_by_tenure)), churn_by_tenure['mean'], 
               color='coral', alpha=0.7)
axes[1, 0].set_xticks(range(len(churn_by_tenure)))
axes[1, 0].set_xticklabels(churn_by_tenure.index)
axes[1, 0].set_ylabel('Churn Rate')
axes[1, 0].set_title('Churn Rate by Tenure')
axes[1, 0].grid(True, alpha=0.3)

# 5. Confusion Matrix (Best Model)
y_pred_best = best_model.predict(X_test_scaled)
cm_best = confusion_matrix(y_test, y_pred_best)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_best, 
                              display_labels=['Stay', 'Churn'])
disp.plot(ax=axes[1, 1], cmap='Blues')
axes[1, 1].set_title(f'Confusion Matrix ({best_model_name})')

# 6. Churn by Contract Type
contract_churn = df_churn.groupby('Contract')['Churn'].agg(['mean', 'count'])
axes[1, 2].bar(contract_churn.index, contract_churn['mean'], color='lightgreen')
axes[1, 2].set_ylabel('Churn Rate')
axes[1, 2].set_title('Churn Rate by Contract Type')
axes[1, 2].grid(True, alpha=0.3)

# 7. Cost-Benefit Analysis
# Assumptions: a retention offer costs $50, a churned customer costs $500 in lost revenue,
# and every correctly flagged churner who receives the offer is retained.
retention_cost = 50
churn_loss = 500

y_proba_best = best_model.predict_proba(X_test_scaled)[:, 1]
thresholds = np.linspace(0.1, 0.9, 20)
profits = []

for thresh in thresholds:
    y_pred_thresh = (y_proba_best > thresh).astype(int)
    tp = ((y_pred_thresh == 1) & (y_test == 1)).sum()  # Correctly predicted churn
    fp = ((y_pred_thresh == 1) & (y_test == 0)).sum()  # False churn prediction
    fn = ((y_pred_thresh == 0) & (y_test == 1)).sum()  # Missed churn
    
    profit = tp * (churn_loss - retention_cost) - fp * retention_cost - fn * churn_loss
    profits.append(profit)

axes[2, 0].plot(thresholds, profits, 'g-o')
axes[2, 0].axhline(0, color='red', linestyle='--')
axes[2, 0].set_xlabel('Probability Threshold')
axes[2, 0].set_ylabel('Expected Profit ($)')
axes[2, 0].set_title('Profit vs Decision Threshold')
axes[2, 0].grid(True, alpha=0.3)

optimal_threshold = thresholds[np.argmax(profits)]
axes[2, 0].axvline(optimal_threshold, color='blue', linestyle='--', 
                   label=f'Optimal: {optimal_threshold:.2f}')
axes[2, 0].legend()

# 8. Satisfaction Score vs Churn
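axes[2, 1].boxplot([df_churn[df_churn['Churn']==0]['SatisfactionScore'],
                    df_churn[df_churn['Churn']==1]['SatisfactionScore']])
# Set tick labels separately: boxplot's `labels=` keyword is deprecated in recent Matplotlib
axes[2, 1].set_xticklabels(['Stay', 'Churn'])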
axes[2, 1].set_ylabel('Satisfaction Score')
axes[2, 1].set_title('Satisfaction Score by Churn Status')
axes[2, 1].grid(True, alpha=0.3)

# 9. Calibration Plot
from sklearn.calibration import calibration_curve
fraction_pos, mean_pred = calibration_curve(y_test, y_proba_best, n_bins=10)
axes[2, 2].plot(mean_pred, fraction_pos, 's-', label=best_model_name)
axes[2, 2].plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
axes[2, 2].set_xlabel('Mean Predicted Probability')
axes[2, 2].set_ylabel('Fraction of Positives')
axes[2, 2].set_title('Calibration Plot')
axes[2, 2].legend()
axes[2, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Business Insights
print("\n💡 Business Insights:")
print(f"1. Optimal threshold for intervention: {optimal_threshold:.2f}")
print(f"2. Expected profit with optimal strategy: ${max(profits):,.0f}")
print("3. Key churn indicators: Low tenure, month-to-month contract")
print("4. High-risk segment: New customers with high support calls")
print("5. Retention focus: Customers with <12 months tenure")
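
To see how the profit formula in the cost-benefit step reacts to the threshold, it helps to run it on a handful of customers. The sketch below reuses the same formula and the same $50 / $500 assumptions. The function name `campaign_profit` and the six toy customers are made up for illustration.

```python
def campaign_profit(y_true, y_proba, threshold, retention_cost=50, churn_loss=500):
    """Profit of offering retention to every customer with proba > threshold.

    Same simplifying assumption as the case study: every flagged churner
    who receives the offer is retained.
    """
    tp = fp = fn = 0
    for actual, proba in zip(y_true, y_proba):
        flagged = proba > threshold
        if flagged and actual == 1:
            tp += 1          # churner caught: save the revenue, pay the offer
        elif flagged and actual == 0:
            fp += 1          # loyal customer flagged: offer cost is wasted
        elif not flagged and actual == 1:
            fn += 1          # churner missed: revenue is lost
    return tp * (churn_loss - retention_cost) - fp * retention_cost - fn * churn_loss

# Toy data (made up): 3 churners, 3 loyal customers.
y_true = [1, 1, 1, 0, 0, 0]
y_proba = [0.9, 0.6, 0.3, 0.7, 0.4, 0.1]

for threshold in (0.2, 0.5, 0.8):
    print(threshold, campaign_profit(y_true, y_proba, threshold))
# 0.2 1250
# 0.5 350
# 0.8 -550
```

Because a missed churner costs ten times as much as a wasted offer under these assumptions, a low threshold is the most profitable choice here.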

🧪 Practice Problems & Exercises

Exercise 1: Binary Classification

  1. Load a credit card fraud detection dataset
  2. Handle class imbalance with SMOTE
  3. Train Logistic Regression, KNN, and Decision Tree models
  4. Optimize hyperparameters with GridSearchCV
  5. Evaluate with a precision-recall curve
  6. Calculate cost-sensitive metrics

Exercise 2: Multiclass Classification

  1. Use a wine quality dataset (3+ classes)
  2. Implement the one-vs-rest strategy
  3. Compare it with the one-vs-one strategy
  4. Create a confusion matrix heatmap
  5. Calculate macro- vs micro-averaged metrics
  6. Implement a voting classifier ensemble
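
As a starting point for the macro vs micro comparison in Exercise 2, the sketch below computes both averages of F1 by hand on a small imbalanced example. The class names and predictions are made up, and the helper names are hypothetical; scikit-learn's `f1_score` exposes the same choice through its `average` parameter.

```python
def per_class_counts(y_true, y_pred, classes):
    """Return {class: (tp, fp, fn)} treating each class as 'positive' in turn."""
    pairs = list(zip(y_true, y_pred))
    counts = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        counts[c] = (tp, fp, fn)
    return counts

def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); reported as 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy data (made up): 6 'low', 3 'mid', 1 'high'; the rare class is never predicted.
y_true = ["low"] * 6 + ["mid"] * 3 + ["high"]
y_pred = ["low"] * 6 + ["mid", "mid", "low"] + ["low"]
classes = ["low", "mid", "high"]

counts = per_class_counts(y_true, y_pred, classes)

# Macro: average the per-class F1 scores, so every class weighs the same.
macro_f1 = sum(f1_from_counts(*counts[c]) for c in classes) / len(classes)

# Micro: pool the counts over all classes first, so every sample weighs the same.
tp_all = sum(counts[c][0] for c in classes)
fp_all = sum(counts[c][1] for c in classes)
fn_all = sum(counts[c][2] for c in classes)
micro_f1 = f1_from_counts(tp_all, fp_all, fn_all)

print(round(macro_f1, 3))  # 0.552
print(round(micro_f1, 3))  # 0.8
```

The macro average drops sharply because the rare class contributes an F1 of 0, while the micro average stays close to overall accuracy.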

Exercise 3: Model Selection & Tuning

  1. Implement cross-validation properly
  2. Use RandomizedSearchCV for hyperparameter tuning
  3. Compare model performance with statistical tests
  4. Analyze learning curves to detect overfitting
  5. Implement early stopping for iterative algorithms
  6. Create an automated model selection pipeline

💡 Best Practices for Classification:

  • Always check class balance: Imbalanced data needs special handling
  • Use appropriate metrics: Accuracy is not everything; also look at precision and recall
  • Stratified splitting: Maintain the class distribution in train/test
  • Feature scaling: Important for distance-based algorithms such as KNN
  • Cross-validation: For a robust performance estimate
  • Threshold optimization: The default of 0.5 is not always optimal
  • Domain knowledge: Understand the cost of false positives vs false negatives
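
Several of these practices can be combined in a few lines. The sketch below is one possible arrangement, not the only one. It assumes a synthetic imbalanced dataset from `make_classification` (about 10% positives), uses `class_weight="balanced"` as a simple way to handle the imbalance, and puts the scaler inside a pipeline so that it is re-fit on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (assumption for illustration): roughly 90% / 10%.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=42)
print(f"Positive class share: {y.mean():.2%}")  # check class balance first

# Scaling lives inside the pipeline, so each CV fold is scaled using
# statistics from its own training part only (no leakage from validation data).
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# F1 on the positive class is more informative than accuracy here.
f1_scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print(f"F1 per fold: {f1_scores.round(3)}")
print(f"Mean F1: {f1_scores.mean():.3f} (std {f1_scores.std():.3f})")
```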

โœ๏ธ Kesimpulan

Pada pertemuan ini, mahasiswa telah mempelajari:

  • โœ… Konsep dasar klasifikasi supervised learning
  • โœ… Implementasi KNN dengan optimization
  • โœ… Decision Tree dan interpretability
  • โœ… Logistic Regression untuk binary classification
  • โœ… Evaluation metrics dan confusion matrix
  • โœ… ROC curve dan AUC analysis
  • โœ… Real-world case study dengan business context

Pertemuan selanjutnya: Clustering (Unsupervised Learning)