📘 Session 12 - Clustering (Unsupervised Learning)

🎯 Learning Objectives

  1. Understand the concepts of unsupervised learning and clustering
  2. Master the K-Means clustering algorithm
  3. Implement hierarchical clustering
  4. Understand DBSCAN for density-based clustering
  5. Evaluate clustering with the silhouette score and the elbow method
  6. Apply clustering to customer segmentation

🧩 What is Clustering?

Clustering is an unsupervised learning technique for grouping data by similarity, without predefined labels.

  • 🎯 Goal: Discover hidden structure in the data
  • 📊 No Labels: No target variable is required
  • 🔍 Exploratory: Used to understand patterns in the data

Applications: Customer segmentation, image segmentation, anomaly detection, document clustering

🎨 1. K-Means Clustering

K-Means Algorithm

Steps:

  1. Initialize K centroids randomly
  2. Assign each point to nearest centroid
  3. Update centroids as mean of assigned points
  4. Repeat until convergence

Pros: Fast, scalable, simple to implement

Cons: Need to specify K, sensitive to initialization, assumes spherical clusters

Objective Function (Within-Cluster Sum of Squares):
WCSS = Σᵢ Σₓ∈Cᵢ ||x − μᵢ||²
where μᵢ is the centroid of cluster Cᵢ (computed directly in the sketch below)
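The four steps and the WCSS objective above map directly onto a few lines of NumPy. Below is a minimal from-scratch sketch (fixed iteration count, Euclidean distances, no handling of empty clusters); the full analysis that follows uses scikit-learn's KMeans instead.

# Minimal from-scratch K-Means sketch (illustrative only; use sklearn's KMeans in practice)
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: initialize K centroids by picking K random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: update each centroid as the mean of its assigned points
        # (empty clusters are not handled here; this is only a sketch)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Objective: within-cluster sum of squares, WCSS = Σᵢ Σₓ∈Cᵢ ||x − μᵢ||²
    wcss = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    return labels, centroids, wcss
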
# K-Means Clustering Implementation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.datasets import make_blobs, make_moons
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("="*60)
print("K-MEANS CLUSTERING ANALYSIS")
print("="*60)

# Generate synthetic data for demonstration
np.random.seed(42)
X_blobs, y_true = make_blobs(n_samples=300, centers=4, n_features=2, 
                             cluster_std=0.5, random_state=42)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_blobs)

print(f"Dataset shape: {X_scaled.shape}")
print(f"True number of clusters: {len(np.unique(y_true))}")

# Elbow Method - Find optimal K
wcss = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))

# Visualization
fig, axes = plt.subplots(3, 3, figsize=(20, 18))
fig.suptitle('K-Means Clustering Analysis', fontsize=18, fontweight='bold')

# 1. Elbow Method
axes[0, 0].plot(K_range, wcss, 'bo-')
axes[0, 0].set_xlabel('Number of Clusters (K)')
axes[0, 0].set_ylabel('Within-Cluster Sum of Squares')
axes[0, 0].set_title('Elbow Method')
axes[0, 0].grid(True, alpha=0.3)

# Mark elbow point
elbow_point = 4  # Visual inspection
axes[0, 0].axvline(x=elbow_point, color='red', linestyle='--', label=f'Elbow at K={elbow_point}')
axes[0, 0].legend()

# 2. Silhouette Score
axes[0, 1].plot(K_range, silhouette_scores, 'go-')
axes[0, 1].set_xlabel('Number of Clusters (K)')
axes[0, 1].set_ylabel('Silhouette Score')
axes[0, 1].set_title('Silhouette Score vs K')
axes[0, 1].grid(True, alpha=0.3)

best_k = K_range[np.argmax(silhouette_scores)]
axes[0, 1].axvline(x=best_k, color='red', linestyle='--', label=f'Best K={best_k}')
axes[0, 1].legend()

# 3. Original Data
axes[0, 2].scatter(X_scaled[:, 0], X_scaled[:, 1], c='gray', alpha=0.5)
axes[0, 2].set_xlabel('Feature 1')
axes[0, 2].set_ylabel('Feature 2')
axes[0, 2].set_title('Original Data (No Clustering)')
axes[0, 2].grid(True, alpha=0.3)

# Train K-Means with optimal K
optimal_k = 4
kmeans_optimal = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
clusters = kmeans_optimal.fit_predict(X_scaled)

print(f"\n๐Ÿ“Š K-Means with K={optimal_k}:")
print(f"Silhouette Score: {silhouette_score(X_scaled, clusters):.3f}")
print(f"Davies-Bouldin Score: {davies_bouldin_score(X_scaled, clusters):.3f}")
print(f"Calinski-Harabasz Score: {calinski_harabasz_score(X_scaled, clusters):.1f}")

# 4. K-Means Clustering Result
scatter = axes[1, 0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, 
                             cmap='viridis', alpha=0.6, edgecolors='black')
axes[1, 0].scatter(kmeans_optimal.cluster_centers_[:, 0], 
                  kmeans_optimal.cluster_centers_[:, 1],
                  c='red', marker='*', s=300, edgecolors='black', linewidths=2,
                  label='Centroids')
axes[1, 0].set_xlabel('Feature 1')
axes[1, 0].set_ylabel('Feature 2')
axes[1, 0].set_title(f'K-Means Clustering (K={optimal_k})')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# 5. Cluster Sizes
cluster_counts = pd.Series(clusters).value_counts().sort_index()
axes[1, 1].bar(cluster_counts.index, cluster_counts.values, color='skyblue')
axes[1, 1].set_xlabel('Cluster ID')
axes[1, 1].set_ylabel('Number of Points')
axes[1, 1].set_title('Cluster Size Distribution')
axes[1, 1].grid(True, alpha=0.3)

# Add percentages
for i, v in enumerate(cluster_counts.values):
    axes[1, 1].text(i, v + 1, f'{v/len(clusters)*100:.1f}%', ha='center')

# 6. Iterations Visualization
# Show K-Means progress: refit from the same random initialization, allowing one
# more iteration each time, so the overlaid centroids trace their movement
axes[1, 2].scatter(X_scaled[:, 0], X_scaled[:, 1], c='gray', alpha=0.3)

for step in range(1, 6):
    kmeans_steps = KMeans(n_clusters=optimal_k, init='random', n_init=1,
                          max_iter=step, random_state=42)
    kmeans_steps.fit(X_scaled)
    
    # Plot centroid positions; later iterations are drawn more opaque
    axes[1, 2].scatter(kmeans_steps.cluster_centers_[:, 0],
                      kmeans_steps.cluster_centers_[:, 1],
                      c='red', marker='*', s=100, alpha=step/5,
                      edgecolors='black', linewidths=1)

axes[1, 2].set_xlabel('Feature 1')
axes[1, 2].set_ylabel('Feature 2')
axes[1, 2].set_title('K-Means Iterations (Centroid Movement)')
axes[1, 2].grid(True, alpha=0.3)

# 7. Compare different K values
for idx, k in enumerate([2, 3, 5]):
    ax = axes[2, idx]
    kmeans_k = KMeans(n_clusters=k, random_state=42)
    labels_k = kmeans_k.fit_predict(X_scaled)
    
    ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels_k, cmap='viridis', alpha=0.6)
    ax.scatter(kmeans_k.cluster_centers_[:, 0], kmeans_k.cluster_centers_[:, 1],
              c='red', marker='*', s=200, edgecolors='black', linewidths=2)
    
    silhouette = silhouette_score(X_scaled, labels_k)
    ax.set_title(f'K={k} (Silhouette: {silhouette:.3f})')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

🌲 2. Hierarchical Clustering

Hierarchical Clustering Types

Agglomerative (Bottom-up):

  • Start with each point as separate cluster
  • Merge closest clusters iteratively
  • Stop when desired number of clusters reached

Linkage Criteria: Single, Complete, Average, Ward

Pros: No need to specify K, dendrogram visualization

Cons: Computationally expensive (O(n²) memory, up to O(n³) time), sensitive to outliers

# Hierarchical Clustering
print("\n" + "="*60)
print("HIERARCHICAL CLUSTERING ANALYSIS")
print("="*60)

from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

# Create sample data for better dendrogram visualization
np.random.seed(42)
n_samples = 50
X_hier = np.random.randn(n_samples, 2)
X_hier[:15] += [2, 2]  # Cluster 1
X_hier[15:30] += [-2, 2]  # Cluster 2
X_hier[30:40] += [2, -2]  # Cluster 3
X_hier[40:] += [-2, -2]  # Cluster 4

# Standardize
X_hier_scaled = scaler.fit_transform(X_hier)

# Perform hierarchical clustering
linkage_matrix = linkage(X_hier_scaled, method='ward')

# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Hierarchical Clustering Analysis', fontsize=16, fontweight='bold')

# 1. Dendrogram
dendrogram(linkage_matrix, ax=axes[0, 0], truncate_mode='level', p=5)
axes[0, 0].set_title('Dendrogram (Ward Linkage)')
axes[0, 0].set_xlabel('Sample Index')
axes[0, 0].set_ylabel('Distance')

# 2. Different Linkage Methods
linkage_methods = ['single', 'complete', 'average', 'ward']
silhouette_hier = []

for i, method in enumerate(linkage_methods):
    agg_clustering = AgglomerativeClustering(n_clusters=4, linkage=method)
    labels = agg_clustering.fit_predict(X_hier_scaled)
    silhouette = silhouette_score(X_hier_scaled, labels)
    silhouette_hier.append(silhouette)

axes[0, 1].bar(linkage_methods, silhouette_hier, color='coral')
axes[0, 1].set_ylabel('Silhouette Score')
axes[0, 1].set_title('Comparison of Linkage Methods')
axes[0, 1].grid(True, alpha=0.3)

# 3. Best linkage result
best_linkage = linkage_methods[np.argmax(silhouette_hier)]
agg_best = AgglomerativeClustering(n_clusters=4, linkage=best_linkage)
labels_hier = agg_best.fit_predict(X_hier_scaled)

axes[0, 2].scatter(X_hier_scaled[:, 0], X_hier_scaled[:, 1], 
                  c=labels_hier, cmap='viridis', alpha=0.6, edgecolors='black')
axes[0, 2].set_title(f'Hierarchical Clustering ({best_linkage} linkage)')
axes[0, 2].set_xlabel('Feature 1')
axes[0, 2].set_ylabel('Feature 2')
axes[0, 2].grid(True, alpha=0.3)

print(f"Best linkage method: {best_linkage}")
print(f"Silhouette Score: {max(silhouette_hier):.3f}")

# 4-6. Compare different number of clusters
for idx, n_clust in enumerate([2, 3, 5]):
    ax = axes[1, idx]
    agg_n = AgglomerativeClustering(n_clusters=n_clust, linkage='ward')
    labels_n = agg_n.fit_predict(X_hier_scaled)
    
    ax.scatter(X_hier_scaled[:, 0], X_hier_scaled[:, 1], 
              c=labels_n, cmap='viridis', alpha=0.6)
    
    if n_clust > 1:
        silhouette = silhouette_score(X_hier_scaled, labels_n)
        ax.set_title(f'{n_clust} Clusters (Silhouette: {silhouette:.3f})')
    else:
        ax.set_title(f'{n_clust} Cluster')
    
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
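Instead of requesting a fixed number of clusters, the dendrogram can also be cut at a distance threshold. A minimal sketch using scipy's fcluster on the linkage_matrix computed above (the threshold value is illustrative and should be read off the dendrogram):

# Cut the dendrogram at a distance threshold instead of a fixed cluster count
from scipy.cluster.hierarchy import fcluster

distance_threshold = 2.5  # illustrative cut height; read the appropriate value off the dendrogram
labels_cut = fcluster(linkage_matrix, t=distance_threshold, criterion='distance')
print(f"Clusters at threshold {distance_threshold}: {len(np.unique(labels_cut))}")

# Equivalent flat clustering with a fixed number of clusters from the same linkage matrix
labels_flat = fcluster(linkage_matrix, t=4, criterion='maxclust')
print(f"Cluster sizes (maxclust=4): {np.bincount(labels_flat)[1:]}")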

🌀 3. DBSCAN (Density-Based Clustering)

DBSCAN Algorithm

Parameters:

  • eps (ε): Maximum distance between two points to be neighbors
  • min_samples: Minimum points to form a dense region

Point Types: Core points, Border points, Noise points

Pros: No need to specify K, finds arbitrary shapes, handles noise

Cons: Sensitive to parameters, struggles with varying densities

# DBSCAN Clustering
print("\n" + "="*60)
print("DBSCAN CLUSTERING ANALYSIS")
print("="*60)

# Generate non-spherical data
X_moons, y_moons = make_moons(n_samples=300, noise=0.1, random_state=42)

# Also create data with noise
X_noise = np.vstack([X_moons, np.random.uniform(-2, 2, (20, 2))])

# Standardize
X_dbscan_scaled = scaler.fit_transform(X_noise)

# Parameter tuning
eps_values = np.linspace(0.1, 1.0, 10)
min_samples_values = [3, 5, 10]

# Visualization
fig, axes = plt.subplots(3, 3, figsize=(18, 18))
fig.suptitle('DBSCAN Clustering Analysis', fontsize=16, fontweight='bold')

# 1. Original data
axes[0, 0].scatter(X_dbscan_scaled[:, 0], X_dbscan_scaled[:, 1], 
                  c='gray', alpha=0.5)
axes[0, 0].set_title('Original Data with Noise')
axes[0, 0].set_xlabel('Feature 1')
axes[0, 0].set_ylabel('Feature 2')
axes[0, 0].grid(True, alpha=0.3)

# 2. K-means on non-spherical data (for comparison)
kmeans_moons = KMeans(n_clusters=2, random_state=42)
labels_kmeans_moons = kmeans_moons.fit_predict(X_dbscan_scaled)

axes[0, 1].scatter(X_dbscan_scaled[:, 0], X_dbscan_scaled[:, 1], 
                  c=labels_kmeans_moons, cmap='viridis', alpha=0.6)
axes[0, 1].set_title('K-Means on Non-Spherical Data')
axes[0, 1].set_xlabel('Feature 1')
axes[0, 1].set_ylabel('Feature 2')
axes[0, 1].grid(True, alpha=0.3)

# 3. DBSCAN with optimal parameters
dbscan_optimal = DBSCAN(eps=0.3, min_samples=5)
labels_dbscan = dbscan_optimal.fit_predict(X_dbscan_scaled)

# Plot with noise points highlighted
unique_labels = set(labels_dbscan)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))

for k, col in zip(unique_labels, colors):
    if k == -1:
        # Noise points in black
        col = 'black'
        marker = 'x'
    else:
        marker = 'o'
    
    class_member_mask = (labels_dbscan == k)
    xy = X_dbscan_scaled[class_member_mask]
    axes[0, 2].scatter(xy[:, 0], xy[:, 1], c=[col], marker=marker, 
                      alpha=0.6, label=f'Cluster {k}' if k != -1 else 'Noise')

axes[0, 2].set_title('DBSCAN Clustering (Optimal)')
axes[0, 2].set_xlabel('Feature 1')
axes[0, 2].set_ylabel('Feature 2')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)

n_clusters = len(set(labels_dbscan)) - (1 if -1 in labels_dbscan else 0)
n_noise = list(labels_dbscan).count(-1)
print(f"\nDBSCAN Results (eps=0.3, min_samples=5):")
print(f"Number of clusters: {n_clusters}")
print(f"Number of noise points: {n_noise}")

# 4-9. Parameter sensitivity analysis
param_combinations = [
    (0.1, 5), (0.3, 5), (0.5, 5),
    (0.3, 3), (0.3, 5), (0.3, 10)
]

for idx, (eps, min_samp) in enumerate(param_combinations):
    ax = axes[1 + idx//3, idx%3]
    
    dbscan = DBSCAN(eps=eps, min_samples=min_samp)
    labels = dbscan.fit_predict(X_dbscan_scaled)
    
    # Count clusters and noise
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(labels).count(-1)
    
    # Plot
    unique_labels = set(labels)
    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    
    for k, col in zip(unique_labels, colors):
        if k == -1:
            col = 'black'
        class_member_mask = (labels == k)
        xy = X_dbscan_scaled[class_member_mask]
        ax.scatter(xy[:, 0], xy[:, 1], c=[col], alpha=0.6, s=30)
    
    ax.set_title(f'eps={eps}, min_samples={min_samp}\n'
                f'Clusters: {n_clusters}, Noise: {n_noise}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
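The eps=0.3 used above was chosen by hand. A common heuristic is the k-distance plot: sort every point's distance to its k-th nearest neighbor (with k equal to min_samples) and look for the "knee", which suggests a reasonable eps. A minimal sketch on the same scaled data (the knee still has to be read off the plot; it is not detected automatically here):

# k-distance plot: a heuristic for choosing DBSCAN's eps
from sklearn.neighbors import NearestNeighbors

k = 5  # match min_samples
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_dbscan_scaled)
distances, _ = nn.kneighbors(X_dbscan_scaled)  # column 0 is each point itself (distance 0)

k_distances = np.sort(distances[:, k])  # distance to the k-th nearest other point, ascending

plt.figure(figsize=(6, 4))
plt.plot(k_distances)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to {k}-th nearest neighbor')
plt.title('k-distance Plot (look for the knee to pick eps)')
plt.grid(True, alpha=0.3)
plt.show()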

🎯 4. Case Study: Customer Segmentation

📋 RFM Analysis & Customer Segmentation

Customer segmentation based on purchasing behavior, using RFM (Recency, Frequency, Monetary) analysis and clustering algorithms.
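In a real project the RFM features come from a transaction log rather than being generated synthetically as below. A minimal sketch of that step, assuming a hypothetical transactions DataFrame with columns CustomerID, InvoiceDate (datetime), and Amount (all names are illustrative, not from an actual dataset):

# Hypothetical RFM computation from a raw transaction log (column names are assumptions)
# transactions: DataFrame with columns CustomerID, InvoiceDate (datetime64), Amount
snapshot_date = transactions['InvoiceDate'].max() + pd.Timedelta(days=1)

rfm = transactions.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda d: (snapshot_date - d.max()).days),  # days since last purchase
    Frequency=('InvoiceDate', 'count'),                                 # number of purchases
    Monetary=('Amount', 'sum')                                          # total spending
).reset_index()

The case study below skips this step and generates five synthetic customer segments directly.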

# CUSTOMER SEGMENTATION CASE STUDY
print("\n" + "="*60)
print("CASE STUDY: CUSTOMER SEGMENTATION")
print("="*60)

# Generate synthetic customer data
np.random.seed(42)
n_customers = 500

# Create realistic customer segments
# Segment 1: High-value frequent buyers (Champions)
seg1 = pd.DataFrame({
    'Recency': np.random.uniform(1, 30, 100),  # Recent purchases
    'Frequency': np.random.uniform(10, 20, 100),  # Many purchases
    'Monetary': np.random.uniform(1000, 3000, 100)  # High spending
})

# Segment 2: New customers
seg2 = pd.DataFrame({
    'Recency': np.random.uniform(1, 15, 100),  # Very recent
    'Frequency': np.random.uniform(1, 3, 100),  # Few purchases
    'Monetary': np.random.uniform(50, 300, 100)  # Low spending
})

# Segment 3: Lost customers
seg3 = pd.DataFrame({
    'Recency': np.random.uniform(180, 365, 100),  # Long time ago
    'Frequency': np.random.uniform(1, 5, 100),  # Few purchases
    'Monetary': np.random.uniform(100, 500, 100)  # Low-medium spending
})

# Segment 4: Regular customers
seg4 = pd.DataFrame({
    'Recency': np.random.uniform(30, 90, 100),  # Moderate recency
    'Frequency': np.random.uniform(5, 10, 100),  # Moderate frequency
    'Monetary': np.random.uniform(300, 800, 100)  # Moderate spending
})

# Segment 5: Big spenders (but infrequent)
seg5 = pd.DataFrame({
    'Recency': np.random.uniform(60, 120, 100),
    'Frequency': np.random.uniform(2, 5, 100),  # Low frequency
    'Monetary': np.random.uniform(1500, 4000, 100)  # Very high spending
})

# Combine all segments
df_customers = pd.concat([seg1, seg2, seg3, seg4, seg5], ignore_index=True)
df_customers['CustomerID'] = range(1, len(df_customers) + 1)

print("\n๐Ÿ“Š Customer Data Overview:")
print(df_customers.describe())

# Feature engineering (illustrative; only the raw RFM features are used for clustering below)
df_customers['AvgOrderValue'] = df_customers['Monetary'] / df_customers['Frequency']

# Prepare for clustering
features = ['Recency', 'Frequency', 'Monetary']
X_customers = df_customers[features]

# Scale features (important for clustering)
scaler_rfm = StandardScaler()
X_customers_scaled = scaler_rfm.fit_transform(X_customers)

# Find optimal number of clusters using multiple methods
print("\n๐Ÿ” Finding Optimal Number of Clusters:")

# Method 1: Elbow Method
wcss_rfm = []
k_range_rfm = range(2, 11)

for k in k_range_rfm:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_customers_scaled)
    wcss_rfm.append(kmeans.inertia_)

# Method 2: Silhouette Score
silhouette_rfm = []
for k in k_range_rfm:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_customers_scaled)
    silhouette_rfm.append(silhouette_score(X_customers_scaled, labels))

optimal_k_rfm = k_range_rfm[np.argmax(silhouette_rfm)]
print(f"Optimal K by Silhouette: {optimal_k_rfm}")

# Apply K-Means with K=5, matching the five generated segments
# (compare with the silhouette-optimal K printed above)
kmeans_final = KMeans(n_clusters=5, random_state=42, n_init=10)
df_customers['Cluster'] = kmeans_final.fit_predict(X_customers_scaled)

# Analyze clusters
cluster_summary = df_customers.groupby('Cluster')[features].mean()
cluster_sizes = df_customers['Cluster'].value_counts().sort_index()

print("\n๐Ÿ“Š Cluster Characteristics (Mean Values):")
print(cluster_summary)
print("\n๐Ÿ“Š Cluster Sizes:")
print(cluster_sizes)

# Comprehensive Visualization
fig, axes = plt.subplots(3, 4, figsize=(24, 18))
fig.suptitle('Customer Segmentation Analysis', fontsize=20, fontweight='bold')

# 1. Elbow Method
axes[0, 0].plot(k_range_rfm, wcss_rfm, 'bo-')
axes[0, 0].set_xlabel('Number of Clusters')
axes[0, 0].set_ylabel('WCSS')
axes[0, 0].set_title('Elbow Method')
axes[0, 0].grid(True, alpha=0.3)

# 2. Silhouette Analysis
axes[0, 1].plot(k_range_rfm, silhouette_rfm, 'go-')
axes[0, 1].set_xlabel('Number of Clusters')
axes[0, 1].set_ylabel('Silhouette Score')
axes[0, 1].set_title('Silhouette Analysis')
axes[0, 1].axvline(optimal_k_rfm, color='red', linestyle='--')
axes[0, 1].grid(True, alpha=0.3)

# 3. 3D Scatter plot of clusters
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection)
axes[0, 2].remove()  # free grid slot 3 so the 3D axes does not overlap the empty 2D axes
ax3d = fig.add_subplot(3, 4, 3, projection='3d')
scatter = ax3d.scatter(df_customers['Recency'], 
                      df_customers['Frequency'], 
                      df_customers['Monetary'],
                      c=df_customers['Cluster'], 
                      cmap='viridis', alpha=0.6)
ax3d.set_xlabel('Recency')
ax3d.set_ylabel('Frequency')
ax3d.set_zlabel('Monetary')
ax3d.set_title('3D Customer Segments')

# 4. Cluster sizes
axes[0, 3].bar(cluster_sizes.index, cluster_sizes.values, 
              color=plt.cm.viridis(np.linspace(0, 1, len(cluster_sizes))))
axes[0, 3].set_xlabel('Cluster')
axes[0, 3].set_ylabel('Number of Customers')
axes[0, 3].set_title('Cluster Size Distribution')
axes[0, 3].grid(True, alpha=0.3)

# 5-7. RFM distributions by cluster
for idx, feature in enumerate(features):
    ax = axes[1, idx]
    for cluster in range(5):
        cluster_data = df_customers[df_customers['Cluster'] == cluster][feature]
        ax.hist(cluster_data, alpha=0.5, label=f'Cluster {cluster}', bins=20)
    ax.set_xlabel(feature)
    ax.set_ylabel('Frequency')
    ax.set_title(f'{feature} Distribution by Cluster')
    ax.legend()
    ax.grid(True, alpha=0.3)

# 8. Cluster characteristics heatmap (slot [1, 3] is left empty; the heatmap is drawn in slot [2, 0])
axes[1, 3].axis('off')
cluster_norm = cluster_summary.div(cluster_summary.max())
sns.heatmap(cluster_norm.T, annot=True, fmt='.2f', cmap='YlOrRd', 
           ax=axes[2, 0], cbar_kws={'label': 'Normalized Value'})
axes[2, 0].set_title('Cluster Characteristics Heatmap')
axes[2, 0].set_xlabel('Cluster')

# 9. PCA visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_customers_scaled)

axes[2, 1].scatter(X_pca[:, 0], X_pca[:, 1], c=df_customers['Cluster'], 
                  cmap='viridis', alpha=0.6)
axes[2, 1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)')
axes[2, 1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)')
axes[2, 1].set_title('PCA Visualization of Clusters')
axes[2, 1].grid(True, alpha=0.3)

# 10. Customer Value Matrix
axes[2, 2].scatter(df_customers['Frequency'], df_customers['Monetary'], 
                  c=df_customers['Cluster'], cmap='viridis', alpha=0.6)
axes[2, 2].set_xlabel('Frequency')
axes[2, 2].set_ylabel('Monetary Value')
axes[2, 2].set_title('Customer Value Matrix')
axes[2, 2].grid(True, alpha=0.3)

# Add quadrant lines
axes[2, 2].axhline(df_customers['Monetary'].median(), color='red', 
                   linestyle='--', alpha=0.5)
axes[2, 2].axvline(df_customers['Frequency'].median(), color='red', 
                   linestyle='--', alpha=0.5)

# 11. Business Interpretation
# Assign descriptive segment names based on each cluster's RFM profile
cluster_interpretation = []
for i in range(5):
    cluster_data = cluster_summary.loc[i]
    if cluster_data['Recency'] < 30 and cluster_data['Frequency'] > 10:
        name = 'Champions'
    elif cluster_data['Recency'] < 30 and cluster_data['Frequency'] < 5:
        name = 'New Customers'
    elif cluster_data['Recency'] > 150:
        name = 'Lost Customers'
    elif cluster_data['Monetary'] > 1500:
        name = 'Big Spenders'
    else:
        name = 'Regular Customers'
    cluster_interpretation.append(name)

# Display interpretation
axes[2, 3].axis('off')
interpretation_text = "SEGMENT INTERPRETATION\n" + "="*30 + "\n\n"
for i, name in enumerate(cluster_interpretation):
    interpretation_text += f"Cluster {i}: {name}\n"
    interpretation_text += f"  Size: {cluster_sizes[i]} customers\n"
    interpretation_text += f"  Recency: {cluster_summary.loc[i, 'Recency']:.0f} days\n"
    interpretation_text += f"  Frequency: {cluster_summary.loc[i, 'Frequency']:.0f} orders\n"
    interpretation_text += f"  Monetary: ${cluster_summary.loc[i, 'Monetary']:.0f}\n\n"

axes[2, 3].text(0.1, 0.9, interpretation_text, transform=axes[2, 3].transAxes,
               fontsize=10, verticalalignment='top', fontfamily='monospace')

plt.tight_layout()
plt.show()

# Business Recommendations
print("\n๐Ÿ’ก Business Recommendations by Segment:")
print("="*50)
recommendations = {
    'Champions': 'Reward loyalty, VIP treatment, early access to new products',
    'New Customers': 'Welcome series, onboarding support, first-purchase discounts',
    'Lost Customers': 'Win-back campaigns, special offers, feedback surveys',
    'Regular Customers': 'Upsell/cross-sell, loyalty programs, personalized offers',
    'Big Spenders': 'Premium services, exclusive events, personalized account management'
}

for segment, recommendation in recommendations.items():
    print(f"\n{segment}:")
    print(f"  Strategy: {recommendation}")

🧪 Example Problems & Exercises

Exercise 1: K-Means Optimization

  1. Generate a dataset with make_blobs (varied cluster_std)
  2. Implement the elbow method to find the optimal K
  3. Calculate the silhouette score for each K
  4. Visualize the gap statistic
  5. Compare K-means++ vs random initialization
  6. Analyze the effect of outliers on clustering

Exercise 2: Hierarchical Clustering

  1. Create dendrograms with different linkage methods
  2. Cut the dendrogram at different heights
  3. Compare agglomerative vs divisive clustering
  4. Implement the cophenetic correlation coefficient
  5. Visualize the cluster hierarchy as a circular dendrogram

Exercise 3: Advanced Clustering

  1. Implement Gaussian Mixture Models (GMM)
  2. Compare hard vs soft clustering
  3. Use OPTICS for varying-density clusters
  4. Implement Mean Shift clustering
  5. Apply clustering to image segmentation
  6. Combine clustering with dimensionality reduction (t-SNE)

💡 Best Practices for Clustering:

  • Scale your features: Clustering is sensitive to scale differences
  • Choose the right algorithm: Consider data shape and density
  • Validate results: Use multiple metrics (silhouette, Davies-Bouldin)
  • Domain knowledge: Interpret clusters in a business context
  • Stability check: Run multiple times with different seeds (see the sketch after this list)
  • Handle outliers: Consider DBSCAN or preprocessing
  • Visualization: Use PCA/t-SNE for high-dimensional data
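A minimal sketch of the stability check mentioned above, assuming the scaled blob data X_scaled from the K-Means section is still in scope: cluster with several random seeds and compare the labelings pairwise with the adjusted Rand index (values close to 1 indicate a stable clustering).

# Stability check: compare K-Means labelings across random seeds
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

labelings = [
    KMeans(n_clusters=4, n_init=10, random_state=seed).fit_predict(X_scaled)
    for seed in range(5)
]

ari_scores = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
print(f"Mean pairwise ARI across 5 seeds: {np.mean(ari_scores):.3f}")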

โœ๏ธ Kesimpulan

Pada pertemuan ini, mahasiswa telah mempelajari:

  • โœ… Konsep unsupervised learning dan clustering
  • โœ… K-Means clustering dengan elbow method
  • โœ… Hierarchical clustering dan dendrogram
  • โœ… DBSCAN untuk non-spherical clusters
  • โœ… Evaluation metrics untuk clustering
  • โœ… Customer segmentation case study
  • โœ… Business interpretation of clusters

Pertemuan selanjutnya: Analisis Teks Sederhana