📘 Session 10 - Simple Linear Regression

🎯 Learning Objectives

  1. Understand the concepts and theory of linear regression
  2. Implement linear regression in Python
  3. Evaluate models with various metrics
  4. Validate the assumptions of linear regression
  5. Make predictions and interpret the results
  6. Handle overfitting and underfitting

🧩 What is Linear Regression?

Linear regression is a statistical method for modeling the linear relationship between a dependent variable (Y) and one or more independent variables (X).

  • 📈 Simple Linear Regression: One independent variable (Y = β₀ + β₁X)
  • 📊 Multiple Linear Regression: Several independent variables (Y = β₀ + β₁X₁ + β₂X₂ + ...)
  • 🎯 Goal: Predict continuous values and understand the relationships between variables

📐 1. Linear Regression Theory

Linear Regression Model: Y = β₀ + β₁X + ε

β₀ = Intercept (constant)
β₁ = Slope (coefficient)
ε = Error term (residual)

The Ordinary Least Squares (OLS) Method

OLS minimizes the sum of squared residuals (SSE); a small worked check of these formulas follows the list below:

  • SSE = Σ(yᵢ - ŷᵢ)²
  • β₁ = Σ((xᵢ - x̄)(yᵢ - ȳ)) / Σ(xᵢ - x̄)²
  • β₀ = ȳ - β₁x̄
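
To make these formulas concrete, here is a minimal sketch that computes β₁ and β₀ by hand with NumPy and cross-checks them against scikit-learn; the tiny x and y arrays are made up purely for illustration.

# Hand-computed OLS coefficients vs scikit-learn (illustrative data only)
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([50.0, 80.0, 120.0, 200.0, 260.0])                            # e.g. house sizes in m²
y = np.array([120_000, 160_000, 230_000, 340_000, 420_000], dtype=float)   # e.g. prices

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0 = y.mean() - beta1 * x.mean()
print(f"Manual OLS  : beta0 = {beta0:,.2f}, beta1 = {beta1:,.2f}")

lr = LinearRegression().fit(x.reshape(-1, 1), y)
print(f"scikit-learn: beta0 = {lr.intercept_:,.2f}, beta1 = {lr.coef_[0]:,.2f}")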

💻 2. Implementing Linear Regression

A. Setup and Data Preparation

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Generate sample data - House Price Prediction
np.random.seed(42)
n_samples = 200

# Features
size = np.random.uniform(50, 300, n_samples)  # House size in m²
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.uniform(0, 50, n_samples)
location_score = np.random.uniform(1, 10, n_samples)

# Target variable with realistic relationship
noise = np.random.normal(0, 20000, n_samples)
price = 50000 + size * 1500 + bedrooms * 10000 - age * 800 + location_score * 5000 + noise

# Create DataFrame
df = pd.DataFrame({
    'Size_m2': size,
    'Bedrooms': bedrooms,
    'Age_years': age,
    'Location_Score': location_score,
    'Price': price
})

print("="*60)
print("HOUSE PRICE DATASET")
print("="*60)
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"Price range: ${df['Price'].min():,.0f} - ${df['Price'].max():,.0f}")

B. Simple Linear Regression - Size vs Price

# Simple Linear Regression
print("\n" + "="*60)
print("SIMPLE LINEAR REGRESSION: Size vs Price")
print("="*60)

# Prepare data
X_simple = df[['Size_m2']]
y_simple = df['Price']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_simple, y_simple, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Train model
model_simple = LinearRegression()
model_simple.fit(X_train, y_train)

# Make predictions
y_train_pred = model_simple.predict(X_train)
y_test_pred = model_simple.predict(X_test)

# Model parameters
print(f"\n📈 Model Equation:")
print(f"Price = {model_simple.intercept_:,.2f} + {model_simple.coef_[0]:,.2f} × Size_m2")
print(f"\nInterpretation:")
print(f"- Base price (0 m²): ${model_simple.intercept_:,.0f}")
print(f"- Price per m²: ${model_simple.coef_[0]:,.0f}")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# 1. Scatter plot with regression line
axes[0].scatter(X_train, y_train, alpha=0.5, label='Training data')
axes[0].scatter(X_test, y_test, alpha=0.5, label='Test data')
axes[0].plot(X_train, y_train_pred, color='red', linewidth=2, label='Regression line')
axes[0].set_xlabel('Size (m²)')
axes[0].set_ylabel('Price ($)')
axes[0].set_title('Linear Regression: Size vs Price')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Add an approximate 95% prediction band (±1.96 × RMSE around the fitted line)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
margin = 1.96 * rmse_train
order = np.argsort(X_train.values.ravel())
axes[0].fill_between(X_train.values.ravel()[order],
                     y_train_pred[order] - margin,
                     y_train_pred[order] + margin,
                     color='red', alpha=0.2, label='~95% prediction band')
axes[0].legend()

# 2. Residual plot
residuals_train = y_train - y_train_pred
axes[1].scatter(y_train_pred, residuals_train, alpha=0.5)
axes[1].axhline(y=0, color='red', linestyle='--')
axes[1].set_xlabel('Predicted Price')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot (Training)')
axes[1].grid(True, alpha=0.3)

# Add ±2 std lines
std_residuals = np.std(residuals_train)
axes[1].axhline(y=2*std_residuals, color='orange', linestyle='--', alpha=0.5)
axes[1].axhline(y=-2*std_residuals, color='orange', linestyle='--', alpha=0.5)
axes[1].text(0.02, 0.98, f'Std: {std_residuals:.0f}', transform=axes[1].transAxes)

# 3. Actual vs Predicted
axes[2].scatter(y_test, y_test_pred, alpha=0.5)
axes[2].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
             'r--', label='Perfect prediction')
axes[2].set_xlabel('Actual Price')
axes[2].set_ylabel('Predicted Price')
axes[2].set_title('Actual vs Predicted (Test)')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

📊 3. Model Evaluation Metrics

Key Metrics for Regression

Metric   | Formula                            | Range    | Interpretation
---------|------------------------------------|----------|------------------------
MAE      | Σ|yᵢ - ŷᵢ| / n                     | [0, ∞)   | Average absolute error
MSE      | Σ(yᵢ - ŷᵢ)² / n                    | [0, ∞)   | Penalizes large errors
RMSE     | √MSE                               | [0, ∞)   | Same unit as target
R²       | 1 - (SSE/SST)                      | (-∞, 1]  | Variance explained
Adj R²   | 1 - [(1 - R²)(n - 1)/(n - p - 1)]  | (-∞, 1]  | Penalizes complexity

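As a quick sanity check of the R² formula in the table, the sketch below recomputes R² = 1 - SSE/SST directly with NumPy and compares it with r2_score; it assumes the y_test and y_test_pred arrays from the simple regression above.

# Sanity check: R² from its definition vs scikit-learn's r2_score
sse = np.sum((y_test - y_test_pred)**2)      # residual sum of squares
sst = np.sum((y_test - y_test.mean())**2)    # total sum of squares
print(f"Manual R²  : {1 - sse/sst:.4f}")
print(f"sklearn R² : {r2_score(y_test, y_test_pred):.4f}")
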
# Comprehensive Model Evaluation
print("="*60)
print("MODEL EVALUATION METRICS")
print("="*60)

def evaluate_model(y_true, y_pred, X, model_name="Model"):
    """Calculate and display regression metrics"""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    
    # Adjusted R²
    n = len(y_true)
    p = X.shape[1] if len(X.shape) > 1 else 1
    adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - p - 1))
    
    # MAPE
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    
    print(f"\n{model_name} Performance:")
    print(f"  MAE        : ${mae:,.2f}")
    print(f"  MSE        : {mse:,.2f}")
    print(f"  RMSE       : ${rmse:,.2f}")
    print(f"  R² Score   : {r2:.4f} ({r2*100:.2f}% variance explained)")
    print(f"  Adj R²     : {adj_r2:.4f}")
    print(f"  MAPE       : {mape:.2f}%")
    
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2, 'Adj_R2': adj_r2, 'MAPE': mape}

# Evaluate on both sets
train_metrics = evaluate_model(y_train, y_train_pred, X_train, "Training Set")
test_metrics = evaluate_model(y_test, y_test_pred, X_test, "Test Set")

# Check for overfitting
print("\n" + "="*40)
print("OVERFITTING CHECK")
print("="*40)
r2_diff = train_metrics['R2'] - test_metrics['R2']
print(f"R² difference (Train - Test): {r2_diff:.4f}")
if r2_diff > 0.1:
    print("⚠️ Warning: Possible overfitting detected")
elif r2_diff < -0.05:
    print("⚠️ Warning: Possible underfitting detected")
else:
    print("✅ Model generalizes well")

# Visualize metrics comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Metrics comparison
metrics_names = ['MAE', 'RMSE', 'MAPE']
train_values = [train_metrics[m] for m in metrics_names]
test_values = [test_metrics[m] for m in metrics_names]

x_pos = np.arange(len(metrics_names))
width = 0.35

axes[0].bar(x_pos - width/2, train_values, width, label='Training', color='skyblue')
axes[0].bar(x_pos + width/2, test_values, width, label='Test', color='lightcoral')
axes[0].set_xlabel('Metrics')
axes[0].set_ylabel('Value')
axes[0].set_title('Error Metrics Comparison')
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(metrics_names)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# R² comparison
axes[1].bar(['Training', 'Test'], 
           [train_metrics['R2'], test_metrics['R2']], 
           color=['skyblue', 'lightcoral'])
axes[1].set_ylabel('R² Score')
axes[1].set_title('R² Score Comparison')
axes[1].set_ylim([0, 1])
axes[1].grid(True, alpha=0.3)

for i, v in enumerate([train_metrics['R2'], test_metrics['R2']]):
    axes[1].text(i, v + 0.01, f'{v:.3f}', ha='center')

plt.tight_layout()
plt.show()

🔍 4. Validating Regression Assumptions

Four Key Assumptions of Linear Regression

  1. Linearity: Linear relationship between X and Y
  2. Independence: Residuals are independent
  3. Homoscedasticity: Constant variance of residuals
  4. Normality: Residuals are normally distributed
# Assumption Testing
print("="*60)
print("REGRESSION ASSUMPTIONS VALIDATION")
print("="*60)

residuals = y_test - y_test_pred

# Create diagnostic plots
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Regression Diagnostics', fontsize=16, fontweight='bold')

# 1. Linearity - Component + Residual Plot
axes[0, 0].scatter(X_test, y_test, alpha=0.5, label='Actual')
axes[0, 0].plot(X_test, y_test_pred, 'r-', label='Fitted', linewidth=2)
axes[0, 0].set_xlabel('Size (m²)')
axes[0, 0].set_ylabel('Price ($)')
axes[0, 0].set_title('1. Linearity Check')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Independence - Residuals vs Order
axes[0, 1].plot(residuals, marker='o', linestyle='-', alpha=0.5)
axes[0, 1].axhline(y=0, color='red', linestyle='--')
axes[0, 1].set_xlabel('Observation Order')
axes[0, 1].set_ylabel('Residuals')
axes[0, 1].set_title('2. Independence Check')
axes[0, 1].grid(True, alpha=0.3)

# Durbin-Watson test
from statsmodels.stats.stattools import durbin_watson
dw = durbin_watson(residuals)
axes[0, 1].text(0.02, 0.98, f'Durbin-Watson: {dw:.2f}', 
                transform=axes[0, 1].transAxes, fontsize=10)

# 3. Homoscedasticity - Scale-Location Plot
axes[0, 2].scatter(y_test_pred, np.sqrt(np.abs(residuals)), alpha=0.5)
z = np.polyfit(y_test_pred, np.sqrt(np.abs(residuals)), 1)
p = np.poly1d(z)
axes[0, 2].plot(y_test_pred, p(y_test_pred), "r--", alpha=0.8)
axes[0, 2].set_xlabel('Fitted Values')
axes[0, 2].set_ylabel('√|Residuals|')
axes[0, 2].set_title('3. Homoscedasticity Check')
axes[0, 2].grid(True, alpha=0.3)

# 4. Normality - Q-Q Plot
stats.probplot(residuals, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('4. Normality Check (Q-Q Plot)')
axes[1, 0].grid(True, alpha=0.3)

# 5. Normality - Histogram
axes[1, 1].hist(residuals, bins=20, density=True, alpha=0.7, color='skyblue', edgecolor='black')
mu, std = residuals.mean(), residuals.std()
xmin, xmax = axes[1, 1].get_xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, std)
axes[1, 1].plot(x, p, 'r-', linewidth=2, label='Normal distribution')
axes[1, 1].set_xlabel('Residuals')
axes[1, 1].set_ylabel('Density')
axes[1, 1].set_title('5. Residual Distribution')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# 6. Leverage - Cook's Distance
n = len(y_test)
X_design = np.column_stack([np.ones(n), X_test.values])  # design matrix with intercept column
k = X_design.shape[1]
hat_matrix = X_design @ np.linalg.inv(X_design.T @ X_design) @ X_design.T
leverage = np.diag(hat_matrix)
standardized_residuals = residuals / (np.std(residuals) * np.sqrt(1 - leverage))
cooks_distance = (standardized_residuals**2 / k) * (leverage / (1 - leverage))

axes[1, 2].stem(range(len(cooks_distance)), cooks_distance, markerfmt='o')
axes[1, 2].axhline(y=4/n, color='red', linestyle='--', label=f'Threshold (4/n={4/n:.3f})')
axes[1, 2].set_xlabel('Observation Index')
axes[1, 2].set_ylabel("Cook's Distance")
axes[1, 2].set_title("6. Influential Points (Cook's Distance)")
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Statistical tests
print("\n📊 Statistical Tests for Assumptions:")

# Normality test
shapiro_stat, shapiro_p = stats.shapiro(residuals)
print(f"\n1. Normality (Shapiro-Wilk Test):")
print(f"   Statistic: {shapiro_stat:.4f}, p-value: {shapiro_p:.4f}")
if shapiro_p > 0.05:
    print("   ✅ Residuals appear normally distributed")
else:
    print("   ⚠️ Residuals may not be normally distributed")

# Homoscedasticity test: Spearman correlation between fitted values and squared residuals
# (a simple stand-in for the formal Breusch-Pagan test)
residuals_squared = residuals**2
_, bp_p = stats.spearmanr(y_test_pred, residuals_squared)
print(f"\n2. Homoscedasticity (Spearman correlation test):")
print(f"   p-value: {bp_p:.4f}")
if bp_p > 0.05:
    print("   ✅ Constant variance assumption appears satisfied")
else:
    print("   ⚠️ Heteroscedasticity may be present")

# Independence test
print(f"\n3. Independence (Durbin-Watson Test):")
print(f"   Statistic: {dw:.4f}")
if 1.5 < dw < 2.5:
    print("   ✅ No significant autocorrelation")
else:
    print("   ⚠️ Autocorrelation may be present")

🚀 5. Multiple Linear Regression

# Multiple Linear Regression
print("\n" + "="*60)
print("MULTIPLE LINEAR REGRESSION")
print("="*60)

# Prepare data with all features
X_multi = df[['Size_m2', 'Bedrooms', 'Age_years', 'Location_Score']]
y_multi = df['Price']

# Split data
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(
    X_multi, y_multi, test_size=0.2, random_state=42
)

# Train model
model_multi = LinearRegression()
model_multi.fit(X_train_m, y_train_m)

# Predictions
y_train_pred_m = model_multi.predict(X_train_m)
y_test_pred_m = model_multi.predict(X_test_m)

# Model equation
print("📈 Model Equation:")
print(f"Price = {model_multi.intercept_:,.2f}")
for i, col in enumerate(X_multi.columns):
    sign = "+" if model_multi.coef_[i] >= 0 else ""
    print(f"        {sign} {model_multi.coef_[i]:,.2f} × {col}")

# Feature importance
feature_importance = pd.DataFrame({
    'Feature': X_multi.columns,
    'Coefficient': model_multi.coef_,
    'Abs_Coefficient': np.abs(model_multi.coef_)
}).sort_values('Abs_Coefficient', ascending=False)

print("\n📊 Feature Importance (by coefficient magnitude):")
print(feature_importance.to_string(index=False))

# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# 1. Feature coefficients
axes[0, 0].barh(feature_importance['Feature'], feature_importance['Coefficient'], 
                color=['green' if x > 0 else 'red' for x in feature_importance['Coefficient']])
axes[0, 0].set_xlabel('Coefficient Value')
axes[0, 0].set_title('Feature Coefficients')
axes[0, 0].grid(True, alpha=0.3)

# 2. Actual vs Predicted
axes[0, 1].scatter(y_test_m, y_test_pred_m, alpha=0.5)
axes[0, 1].plot([y_test_m.min(), y_test_m.max()], 
                [y_test_m.min(), y_test_m.max()], 'r--')
axes[0, 1].set_xlabel('Actual Price')
axes[0, 1].set_ylabel('Predicted Price')
axes[0, 1].set_title('Multiple Regression: Actual vs Predicted')
axes[0, 1].grid(True, alpha=0.3)

# 3. Residuals
residuals_m = y_test_m - y_test_pred_m
axes[0, 2].scatter(y_test_pred_m, residuals_m, alpha=0.5)
axes[0, 2].axhline(y=0, color='red', linestyle='--')
axes[0, 2].set_xlabel('Predicted Price')
axes[0, 2].set_ylabel('Residuals')
axes[0, 2].set_title('Residual Plot')
axes[0, 2].grid(True, alpha=0.3)

# 4-6. Per-feature scatter plots with simple fits (marginal views, not true partial regression plots)
for i, col in enumerate(['Size_m2', 'Bedrooms', 'Location_Score']):
    ax = axes[1, i]
    ax.scatter(X_test_m[col], y_test_m, alpha=0.5)
    # Simple linear fit for visualization
    z = np.polyfit(X_test_m[col], y_test_m, 1)
    p = np.poly1d(z)
    ax.plot(X_test_m[col], p(X_test_m[col]), "r--", alpha=0.8)
    ax.set_xlabel(col)
    ax.set_ylabel('Price')
    ax.set_title(f'Marginal: {col} vs Price')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Model comparison
print("\n" + "="*40)
print("MODEL COMPARISON: Simple vs Multiple")
print("="*40)

# Evaluate multiple regression
multi_metrics = evaluate_model(y_test_m, y_test_pred_m, X_test_m, "Multiple Regression")

# Compare with simple regression
print(f"\n📊 Performance Improvement:")
print(f"  R² improvement: {(multi_metrics['R2'] - test_metrics['R2'])*100:.2f}%")
print(f"  RMSE reduction: ${(test_metrics['RMSE'] - multi_metrics['RMSE']):,.2f}")

if multi_metrics['Adj_R2'] > test_metrics['Adj_R2']:
    print("  ✅ Multiple regression performs better (higher Adj R²)")
else:
    print("  ⚠️ Simple regression might be sufficient")

🎯 6. Case Study: House Price Prediction

📋 Real Estate Price Prediction

Build a house price prediction model from various property features to help real estate agents set competitive prices.

# COMPREHENSIVE CASE STUDY
print("="*60)
print("CASE STUDY: REAL ESTATE PRICE PREDICTION")
print("="*60)

# Generate more realistic dataset
np.random.seed(42)
n_houses = 500

# Create features
data_case = {
    'Size_m2': np.random.gamma(4, 30, n_houses),
    'Bedrooms': np.random.choice([1, 2, 3, 4, 5], n_houses, p=[0.1, 0.3, 0.35, 0.2, 0.05]),
    'Bathrooms': np.random.choice([1, 2, 3], n_houses, p=[0.4, 0.5, 0.1]),
    'Age_years': np.random.uniform(0, 50, n_houses),
    'Garage': np.random.choice([0, 1, 2], n_houses, p=[0.3, 0.5, 0.2]),
    'Location_Score': np.random.beta(5, 2, n_houses) * 10,
    'School_Distance': np.random.exponential(2, n_houses),
    'Crime_Rate': np.random.beta(2, 5, n_houses) * 10
}

df_case = pd.DataFrame(data_case)

# Create realistic price
df_case['Price'] = (
    80000 +
    df_case['Size_m2'] * 1200 +
    df_case['Bedrooms'] * 8000 +
    df_case['Bathrooms'] * 5000 +
    df_case['Garage'] * 10000 +
    df_case['Location_Score'] * 6000 -
    df_case['Age_years'] * 500 -
    df_case['School_Distance'] * 2000 -
    df_case['Crime_Rate'] * 3000 +
    np.random.normal(0, 15000, n_houses)
)

print("\n📊 Dataset Overview:")
print(df_case.head())
print(f"\nDataset shape: {df_case.shape}")
print(f"Price range: ${df_case['Price'].min():,.0f} - ${df_case['Price'].max():,.0f}")
print(f"Average price: ${df_case['Price'].mean():,.0f}")

# Feature engineering
df_case['Total_Rooms'] = df_case['Bedrooms'] + df_case['Bathrooms']
df_case['Size_per_Room'] = df_case['Size_m2'] / df_case['Total_Rooms']
df_case['Age_Category'] = pd.cut(df_case['Age_years'], 
                                  bins=[0, 10, 25, 50], 
                                  labels=['New', 'Medium', 'Old'])
df_case['Location_Category'] = pd.cut(df_case['Location_Score'], 
                                       bins=[0, 3, 7, 10], 
                                       labels=['Poor', 'Average', 'Excellent'])

# Prepare for modeling
features = ['Size_m2', 'Bedrooms', 'Bathrooms', 'Age_years', 'Garage', 
           'Location_Score', 'School_Distance', 'Crime_Rate']
X = df_case[features]
y = df_case['Price']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train multiple models
from sklearn.linear_model import Ridge, Lasso, ElasticNet

models = {
    'Linear Regression': LinearRegression(),
    'Ridge (α=1)': Ridge(alpha=1),
    'Lasso (α=1)': Lasso(alpha=1),
    'ElasticNet (α=1)': ElasticNet(alpha=1)
}

results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    results[name] = {
        'R2': r2_score(y_test, y_pred),
        'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
        'MAE': mean_absolute_error(y_test, y_pred)
    }

# Results comparison
results_df = pd.DataFrame(results).T
print("\n📊 Model Comparison:")
print(results_df.round(4))

# Best model analysis
best_model_name = results_df['R2'].idxmax()
print(f"\n🏆 Best Model: {best_model_name}")
print(f"   R² Score: {results_df.loc[best_model_name, 'R2']:.4f}")
print(f"   RMSE: ${results_df.loc[best_model_name, 'RMSE']:,.2f}")

# Feature importance for best model
best_model = models[best_model_name]
best_model.fit(X_train_scaled, y_train)

# Visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Real Estate Price Prediction Analysis', fontsize=16, fontweight='bold')

# 1. Model comparison
axes[0, 0].bar(results_df.index, results_df['R2'], color='skyblue')
axes[0, 0].set_ylabel('R² Score')
axes[0, 0].set_title('Model Performance Comparison')
axes[0, 0].tick_params(axis='x', rotation=45)
axes[0, 0].grid(True, alpha=0.3)

# 2. Feature importance
if hasattr(best_model, 'coef_'):
    feature_imp = pd.DataFrame({
        'Feature': features,
        'Importance': np.abs(best_model.coef_)
    }).sort_values('Importance', ascending=True)
    
    axes[0, 1].barh(feature_imp['Feature'], feature_imp['Importance'], color='coral')
    axes[0, 1].set_xlabel('Absolute Coefficient')
    axes[0, 1].set_title('Feature Importance')

# 3. Prediction accuracy
y_pred_best = best_model.predict(X_test_scaled)
axes[0, 2].scatter(y_test, y_pred_best, alpha=0.5)
axes[0, 2].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
axes[0, 2].set_xlabel('Actual Price')
axes[0, 2].set_ylabel('Predicted Price')
axes[0, 2].set_title(f'Best Model: {best_model_name}')
axes[0, 2].grid(True, alpha=0.3)

# 4. Error distribution
errors = y_test - y_pred_best
axes[1, 0].hist(errors, bins=30, color='lightgreen', edgecolor='black', alpha=0.7)
axes[1, 0].axvline(errors.mean(), color='red', linestyle='--', 
                   label=f'Mean: ${errors.mean():,.0f}')
axes[1, 0].set_xlabel('Prediction Error ($)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Error Distribution')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# 5. Price distribution by location
axes[1, 1].boxplot([df_case[df_case['Location_Category'] == cat]['Price'].values 
                    for cat in ['Poor', 'Average', 'Excellent']],
                   labels=['Poor', 'Average', 'Excellent'])
axes[1, 1].set_xlabel('Location Category')
axes[1, 1].set_ylabel('Price ($)')
axes[1, 1].set_title('Price by Location Quality')
axes[1, 1].grid(True, alpha=0.3)

# 6. Learning curves
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    best_model, X_train_scaled, y_train, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='r2'
)

axes[1, 2].plot(train_sizes, np.mean(train_scores, axis=1), 
                'o-', color='blue', label='Training score')
axes[1, 2].plot(train_sizes, np.mean(val_scores, axis=1), 
                'o-', color='red', label='Cross-validation score')
axes[1, 2].set_xlabel('Training Set Size')
axes[1, 2].set_ylabel('R² Score')
axes[1, 2].set_title('Learning Curves')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Business insights
print("\n💡 Business Insights:")
print("1. Size and Location are the strongest price predictors")
print("2. Each additional bedroom adds ~$8,000 to price")
print("3. Houses lose ~$500 in value per year of age")
print("4. Properties near schools command premium prices")
print("5. Crime rate significantly impacts property values")

🧪 Example Problems & Exercises

Exercise 1: Simple Linear Regression

  1. Load the advertising dataset (TV, Radio, Newspaper, Sales)
  2. Build a regression model of TV spending vs Sales
  3. Compute and interpret the slope and intercept
  4. Evaluate with R², RMSE, and MAE
  5. Construct a 95% prediction interval (see the sketch after this list)
  6. Check all regression assumptions
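
A hint for item 5: statsmodels can produce proper prediction intervals via get_prediction. A minimal sketch, assuming the advertising data is in a file named advertising.csv with TV and Sales columns (both names are assumptions):

# Hint: 95% prediction interval with statsmodels (file and column names are assumptions)
import pandas as pd
import statsmodels.api as sm

ads = pd.read_csv('advertising.csv')                 # assumed columns: TV, Radio, Newspaper, Sales
X_tv = sm.add_constant(ads[['TV']])                  # add an intercept column
ols_fit = sm.OLS(ads['Sales'], X_tv).fit()

new_X = sm.add_constant(pd.DataFrame({'TV': [100.0, 200.0]}), has_constant='add')
pred = ols_fit.get_prediction(new_X)
print(pred.summary_frame(alpha=0.05)[['mean', 'obs_ci_lower', 'obs_ci_upper']])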

Exercise 2: Multiple Regression

  1. Use all features (TV, Radio, Newspaper)
  2. Compare with the simple regression model
  3. Identify the most important feature
  4. Check for multicollinearity (VIF) (see the sketch after this list)
  5. Apply feature scaling
  6. Compare Ridge, Lasso, and ElasticNet
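
A hint for item 4: statsmodels provides variance_inflation_factor. A minimal sketch, reusing the assumed ads DataFrame from the previous hint:

# Hint: variance inflation factors (VIF) for multicollinearity
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_vif = sm.add_constant(ads[['TV', 'Radio', 'Newspaper']])   # 'ads' assumed loaded as in Exercise 1
vif = pd.DataFrame({
    'Feature': X_vif.columns,
    'VIF': [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
})
print(vif)   # a VIF above roughly 5-10 usually signals problematic multicollinearity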

Exercise 3: Model Selection

  1. Implement forward selection (see the sketch after this list)
  2. Implement backward elimination
  3. Use cross-validation for model selection
  4. Plot learning curves
  5. Analyze the bias-variance tradeoff
  6. Create an ensemble model
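
A hint for items 1 and 3: scikit-learn's SequentialFeatureSelector performs forward (or backward) selection with cross-validation. A minimal sketch, reusing the X and y from the case study above:

# Hint: forward feature selection with cross-validation (reuses X, y from the case study)
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                     direction='forward', cv=5, scoring='r2')
selector.fit(X, y)
selected = list(X.columns[selector.get_support()])
print("Selected features:", selected)

# Compare full vs reduced model with 5-fold cross-validation
print("Full model CV R²   :", cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2').mean())
print("Reduced model CV R²:", cross_val_score(LinearRegression(), X[selected], y, cv=5, scoring='r2').mean())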

💡 Best Practices:

  • Always split data: use a train/test split or cross-validation
  • Scale features: when feature ranges differ significantly
  • Check assumptions: before interpreting the results
  • Handle outliers: they can strongly distort the fit
  • Feature engineering: can improve performance
  • Regularization: helps prevent overfitting
  • Domain knowledge: essential for interpretation

✍️ Conclusion

In this session, students have learned:

  • ✅ Theory and concepts of linear regression
  • ✅ Implementation with scikit-learn
  • ✅ Model evaluation with various metrics
  • ✅ Validation of regression assumptions
  • ✅ Multiple regression and feature importance
  • ✅ Regularization techniques
  • ✅ Real-world case study

Next session: Basic Classification