📘 Session 10 - Simple Linear Regression

🎯 Learning Objectives

  1. Understand the concepts and theory of linear regression
  2. Implement linear regression in Python
  3. Evaluate models with various metrics
  4. Validate the assumptions of linear regression
  5. Make predictions and interpret the results
  6. Handle overfitting and underfitting

🧩 What is Linear Regression?

Linear regression is a statistical method for modeling the linear relationship between a dependent variable (Y) and one or more independent variables (X).

  • 📈 Simple Linear Regression: One independent variable (Y = β₀ + β₁X)
  • 📊 Multiple Linear Regression: Several independent variables (Y = β₀ + β₁X₁ + β₂X₂ + ...)
  • 🎯 Goal: Predict continuous values and understand the relationships between variables

📐 1. Linear Regression Theory

Linear Regression Model: Y = β₀ + β₁X + ε

β₀ = Intercept (constant)
β₁ = Slope (coefficient)
ε = Error term (residual)

The Ordinary Least Squares (OLS) Method

OLS minimizes the sum of squared residuals (SSE); a small worked check of these formulas follows the list below:

  • SSE = Σ(yᵢ - ŷᵢ)²
  • β₁ = Σ((xᵢ - x̄)(yᵢ - ȳ)) / Σ(xᵢ - x̄)²
  • β₀ = ȳ - β₁x̄
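
To make these formulas concrete, here is a minimal sketch that computes β₁ and β₀ by hand with NumPy and cross-checks them against scikit-learn; the tiny x and y arrays are made up purely for illustration.

# Hand-computed OLS coefficients vs scikit-learn (illustrative data only)
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([50.0, 80.0, 120.0, 200.0, 260.0])                            # e.g. house sizes in m²
y = np.array([120_000, 160_000, 230_000, 340_000, 420_000], dtype=float)   # e.g. prices

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0 = y.mean() - beta1 * x.mean()
print(f"Manual OLS  : beta0 = {beta0:,.2f}, beta1 = {beta1:,.2f}")

lr = LinearRegression().fit(x.reshape(-1, 1), y)
print(f"scikit-learn: beta0 = {lr.intercept_:,.2f}, beta1 = {lr.coef_[0]:,.2f}")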

💻 2. Implementing Linear Regression

A. Setup and Data Preparation

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Generate sample data - House Price Prediction
np.random.seed(42)
n_samples = 200

# Features
size = np.random.uniform(50, 300, n_samples)  # House size in m²
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.uniform(0, 50, n_samples)
location_score = np.random.uniform(1, 10, n_samples)

# Target variable with realistic relationship
noise = np.random.normal(0, 20000, n_samples)
price = 50000 + size * 1500 + bedrooms * 10000 - age * 800 + location_score * 5000 + noise

# Create DataFrame
df = pd.DataFrame({
    'Size_m2': size,
    'Bedrooms': bedrooms,
    'Age_years': age,
    'Location_Score': location_score,
    'Price': price
})

print("="*60)
print("HOUSE PRICE DATASET")
print("="*60)
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"Price range: ${df['Price'].min():,.0f} - ${df['Price'].max():,.0f}")

B. Simple Linear Regression - Size vs Price

# Simple Linear Regression
print("\n" + "="*60)
print("SIMPLE LINEAR REGRESSION: Size vs Price")
print("="*60)

# Prepare data
X_simple = df[['Size_m2']]
y_simple = df['Price']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_simple, y_simple, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Train model
model_simple = LinearRegression()
model_simple.fit(X_train, y_train)

# Make predictions
y_train_pred = model_simple.predict(X_train)
y_test_pred = model_simple.predict(X_test)

# Model parameters
print(f"\n📈 Model Equation:")
print(f"Price = {model_simple.intercept_:,.2f} + {model_simple.coef_[0]:,.2f} × Size_m2")
print(f"\nInterpretation:")
print(f"- Base price (0 m²): ${model_simple.intercept_:,.0f}")
print(f"- Price per m²: ${model_simple.coef_[0]:,.0f}")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# 1. Scatter plot with regression line
axes[0].scatter(X_train, y_train, alpha=0.5, label='Training data')
axes[0].scatter(X_test, y_test, alpha=0.5, label='Test data')
axes[0].plot(X_train, y_train_pred, color='red', linewidth=2, label='Regression line')
axes[0].set_xlabel('Size (m²)')
axes[0].set_ylabel('Price ($)')
axes[0].set_title('Linear Regression: Size vs Price')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Add an approximate 95% prediction band (±1.96 × RMSE around the fitted line)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
margin = 1.96 * rmse_train
order = np.argsort(X_train.values.ravel())
axes[0].fill_between(X_train.values.ravel()[order],
                     y_train_pred[order] - margin,
                     y_train_pred[order] + margin,
                     color='red', alpha=0.2, label='~95% prediction band')
axes[0].legend()

# 2. Residual plot
residuals_train = y_train - y_train_pred
axes[1].scatter(y_train_pred, residuals_train, alpha=0.5)
axes[1].axhline(y=0, color='red', linestyle='--')
axes[1].set_xlabel('Predicted Price')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot (Training)')
axes[1].grid(True, alpha=0.3)

# Add ±2 std lines
std_residuals = np.std(residuals_train)
axes[1].axhline(y=2*std_residuals, color='orange', linestyle='--', alpha=0.5)
axes[1].axhline(y=-2*std_residuals, color='orange', linestyle='--', alpha=0.5)
axes[1].text(0.02, 0.98, f'Std: {std_residuals:.0f}', transform=axes[1].transAxes)

# 3. Actual vs Predicted
axes[2].scatter(y_test, y_test_pred, alpha=0.5)
axes[2].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
             'r--', label='Perfect prediction')
axes[2].set_xlabel('Actual Price')
axes[2].set_ylabel('Predicted Price')
axes[2].set_title('Actual vs Predicted (Test)')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

📊 3. Model Evaluation Metrics

Key Metrics for Regression

Metric   | Formula                            | Range    | Interpretation
---------|------------------------------------|----------|------------------------
MAE      | Σ|yᵢ - ŷᵢ| / n                     | [0, ∞)   | Average absolute error
MSE      | Σ(yᵢ - ŷᵢ)² / n                    | [0, ∞)   | Penalizes large errors
RMSE     | √MSE                               | [0, ∞)   | Same unit as target
R²       | 1 - (SSE/SST)                      | (-∞, 1]  | Variance explained
Adj R²   | 1 - [(1 - R²)(n - 1)/(n - p - 1)]  | (-∞, 1]  | Penalizes complexity

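As a quick sanity check of the R² formula in the table, the sketch below recomputes R² = 1 - SSE/SST directly with NumPy and compares it with r2_score; it assumes the y_test and y_test_pred arrays from the simple regression above.

# Sanity check: R² from its definition vs scikit-learn's r2_score
sse = np.sum((y_test - y_test_pred)**2)      # residual sum of squares
sst = np.sum((y_test - y_test.mean())**2)    # total sum of squares
print(f"Manual R²  : {1 - sse/sst:.4f}")
print(f"sklearn R² : {r2_score(y_test, y_test_pred):.4f}")
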
# Comprehensive Model Evaluation
print("="*60)
print("MODEL EVALUATION METRICS")
print("="*60)

def evaluate_model(y_true, y_pred, X, model_name="Model"):
    """Calculate and display regression metrics"""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    
    # Adjusted R²
    n = len(y_true)
    p = X.shape[1] if len(X.shape) > 1 else 1
    adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - p - 1))
    
    # MAPE
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    
    print(f"\n{model_name} Performance:")
    print(f"  MAE        : ${mae:,.2f}")
    print(f"  MSE        : {mse:,.2f}")
    print(f"  RMSE       : ${rmse:,.2f}")
    print(f"  R² Score   : {r2:.4f} ({r2*100:.2f}% variance explained)")
    print(f"  Adj R²     : {adj_r2:.4f}")
    print(f"  MAPE       : {mape:.2f}%")
    
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2, 'Adj_R2': adj_r2, 'MAPE': mape}

# Evaluate on both sets
train_metrics = evaluate_model(y_train, y_train_pred, X_train, "Training Set")
test_metrics = evaluate_model(y_test, y_test_pred, X_test, "Test Set")

# Check for overfitting
print("\n" + "="*40)
print("OVERFITTING CHECK")
print("="*40)
r2_diff = train_metrics['R2'] - test_metrics['R2']
print(f"R² difference (Train - Test): {r2_diff:.4f}")
if r2_diff > 0.1:
    print("⚠️ Warning: Possible overfitting detected")
elif r2_diff < -0.05:
    print("⚠️ Warning: Possible underfitting detected")
else:
    print("✅ Model generalizes well")

# Visualize metrics comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Metrics comparison
metrics_names = ['MAE', 'RMSE', 'MAPE']
train_values = [train_metrics[m] for m in metrics_names]
test_values = [test_metrics[m] for m in metrics_names]

x_pos = np.arange(len(metrics_names))
width = 0.35

axes[0].bar(x_pos - width/2, train_values, width, label='Training', color='skyblue')
axes[0].bar(x_pos + width/2, test_values, width, label='Test', color='lightcoral')
axes[0].set_xlabel('Metrics')
axes[0].set_ylabel('Value')
axes[0].set_title('Error Metrics Comparison')
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(metrics_names)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# R² comparison
axes[1].bar(['Training', 'Test'], 
           [train_metrics['R2'], test_metrics['R2']], 
           color=['skyblue', 'lightcoral'])
axes[1].set_ylabel('R² Score')
axes[1].set_title('R² Score Comparison')
axes[1].set_ylim([0, 1])
axes[1].grid(True, alpha=0.3)

for i, v in enumerate([train_metrics['R2'], test_metrics['R2']]):
    axes[1].text(i, v + 0.01, f'{v:.3f}', ha='center')

plt.tight_layout()
plt.show()

🔍 4. Validating Regression Assumptions

Four Key Assumptions of Linear Regression

  1. Linearity: Linear relationship between X and Y
  2. Independence: Residuals are independent
  3. Homoscedasticity: Constant variance of residuals
  4. Normality: Residuals are normally distributed
# Assumption Testing
print("="*60)
print("REGRESSION ASSUMPTIONS VALIDATION")
print("="*60)

residuals = y_test - y_test_pred

# Create diagnostic plots
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Regression Diagnostics', fontsize=16, fontweight='bold')

# 1. Linearity - Component + Residual Plot
axes[0, 0].scatter(X_test, y_test, alpha=0.5, label='Actual')
axes[0, 0].plot(X_test, y_test_pred, 'r-', label='Fitted', linewidth=2)
axes[0, 0].set_xlabel('Size (m²)')
axes[0, 0].set_ylabel('Price ($)')
axes[0, 0].set_title('1. Linearity Check')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Independence - Residuals vs Order
axes[0, 1].plot(residuals, marker='o', linestyle='-', alpha=0.5)
axes[0, 1].axhline(y=0, color='red', linestyle='--')
axes[0, 1].set_xlabel('Observation Order')
axes[0, 1].set_ylabel('Residuals')
axes[0, 1].set_title('2. Independence Check')
axes[0, 1].grid(True, alpha=0.3)

# Durbin-Watson test
from statsmodels.stats.stattools import durbin_watson
dw = durbin_watson(residuals)
axes[0, 1].text(0.02, 0.98, f'Durbin-Watson: {dw:.2f}', 
                transform=axes[0, 1].transAxes, fontsize=10)

# 3. Homoscedasticity - Scale-Location Plot
axes[0, 2].scatter(y_test_pred, np.sqrt(np.abs(residuals)), alpha=0.5)
z = np.polyfit(y_test_pred, np.sqrt(np.abs(residuals)), 1)
p = np.poly1d(z)
axes[0, 2].plot(y_test_pred, p(y_test_pred), "r--", alpha=0.8)
axes[0, 2].set_xlabel('Fitted Values')
axes[0, 2].set_ylabel('√|Residuals|')
axes[0, 2].set_title('3. Homoscedasticity Check')
axes[0, 2].grid(True, alpha=0.3)

# 4. Normality - Q-Q Plot
stats.probplot(residuals, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('4. Normality Check (Q-Q Plot)')
axes[1, 0].grid(True, alpha=0.3)

# 5. Normality - Histogram
axes[1, 1].hist(residuals, bins=20, density=True, alpha=0.7, color='skyblue', edgecolor='black')
mu, std = residuals.mean(), residuals.std()
xmin, xmax = axes[1, 1].get_xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, std)
axes[1, 1].plot(x, p, 'r-', linewidth=2, label='Normal distribution')
axes[1, 1].set_xlabel('Residuals')
axes[1, 1].set_ylabel('Density')
axes[1, 1].set_title('5. Residual Distribution')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# 6. Leverage - Cook's Distance
n = len(y_test)
X_design = np.column_stack([np.ones(n), X_test.values])  # design matrix with intercept column
k = X_design.shape[1]
hat_matrix = X_design @ np.linalg.inv(X_design.T @ X_design) @ X_design.T
leverage = np.diag(hat_matrix)
standardized_residuals = residuals / (np.std(residuals) * np.sqrt(1 - leverage))
cooks_distance = (standardized_residuals**2 / k) * (leverage / (1 - leverage))

axes[1, 2].stem(range(len(cooks_distance)), cooks_distance, markerfmt='o')
axes[1, 2].axhline(y=4/n, color='red', linestyle='--', label=f'Threshold (4/n={4/n:.3f})')
axes[1, 2].set_xlabel('Observation Index')
axes[1, 2].set_ylabel("Cook's Distance")
axes[1, 2].set_title("6. Influential Points (Cook's Distance)")
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Statistical tests
print("\n📊 Statistical Tests for Assumptions:")

# Normality test
shapiro_stat, shapiro_p = stats.shapiro(residuals)
print(f"\n1. Normality (Shapiro-Wilk Test):")
print(f"   Statistic: {shapiro_stat:.4f}, p-value: {shapiro_p:.4f}")
if shapiro_p > 0.05:
    print("   ✅ Residuals appear normally distributed")
else:
    print("   ⚠️ Residuals may not be normally distributed")

# Homoscedasticity test: Spearman correlation between fitted values and squared residuals
# (a simple stand-in for the formal Breusch-Pagan test)
residuals_squared = residuals**2
_, bp_p = stats.spearmanr(y_test_pred, residuals_squared)
print(f"\n2. Homoscedasticity (Spearman correlation test):")
print(f"   p-value: {bp_p:.4f}")
if bp_p > 0.05:
    print("   ✅ Constant variance assumption appears satisfied")
else:
    print("   ⚠️ Heteroscedasticity may be present")

# Independence test
print(f"\n3. Independence (Durbin-Watson Test):")
print(f"   Statistic: {dw:.4f}")
if 1.5 < dw < 2.5:
    print("   ✅ No significant autocorrelation")
else:
    print("   ⚠️ Autocorrelation may be present")

🚀 5. Multiple Linear Regression

# Multiple Linear Regression
print("\n" + "="*60)
print("MULTIPLE LINEAR REGRESSION")
print("="*60)

# Prepare data with all features
X_multi = df[['Size_m2', 'Bedrooms', 'Age_years', 'Location_Score']]
y_multi = df['Price']

# Split data
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(
    X_multi, y_multi, test_size=0.2, random_state=42
)

# Train model
model_multi = LinearRegression()
model_multi.fit(X_train_m, y_train_m)

# Predictions
y_train_pred_m = model_multi.predict(X_train_m)
y_test_pred_m = model_multi.predict(X_test_m)

# Model equation
print("📈 Model Equation:")
print(f"Price = {model_multi.intercept_:,.2f}")
for i, col in enumerate(X_multi.columns):
    sign = "+" if model_multi.coef_[i] >= 0 else ""
    print(f"        {sign} {model_multi.coef_[i]:,.2f} × {col}")

# Feature importance
feature_importance = pd.DataFrame({
    'Feature': X_multi.columns,
    'Coefficient': model_multi.coef_,
    'Abs_Coefficient': np.abs(model_multi.coef_)
}).sort_values('Abs_Coefficient', ascending=False)

print("\n📊 Feature Importance (by coefficient magnitude):")
print(feature_importance.to_string(index=False))

# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# 1. Feature coefficients
axes[0, 0].barh(feature_importance['Feature'], feature_importance['Coefficient'], 
                color=['green' if x > 0 else 'red' for x in feature_importance['Coefficient']])
axes[0, 0].set_xlabel('Coefficient Value')
axes[0, 0].set_title('Feature Coefficients')
axes[0, 0].grid(True, alpha=0.3)

# 2. Actual vs Predicted
axes[0, 1].scatter(y_test_m, y_test_pred_m, alpha=0.5)
axes[0, 1].plot([y_test_m.min(), y_test_m.max()], 
                [y_test_m.min(), y_test_m.max()], 'r--')
axes[0, 1].set_xlabel('Actual Price')
axes[0, 1].set_ylabel('Predicted Price')
axes[0, 1].set_title('Multiple Regression: Actual vs Predicted')
axes[0, 1].grid(True, alpha=0.3)

# 3. Residuals
residuals_m = y_test_m - y_test_pred_m
axes[0, 2].scatter(y_test_pred_m, residuals_m, alpha=0.5)
axes[0, 2].axhline(y=0, color='red', linestyle='--')
axes[0, 2].set_xlabel('Predicted Price')
axes[0, 2].set_ylabel('Residuals')
axes[0, 2].set_title('Residual Plot')
axes[0, 2].grid(True, alpha=0.3)

# 4-6. Per-feature scatter plots with simple fits (marginal views, not true partial regression plots)
for i, col in enumerate(['Size_m2', 'Bedrooms', 'Location_Score']):
    ax = axes[1, i]
    ax.scatter(X_test_m[col], y_test_m, alpha=0.5)
    # Simple linear fit for visualization
    z = np.polyfit(X_test_m[col], y_test_m, 1)
    p = np.poly1d(z)
    ax.plot(X_test_m[col], p(X_test_m[col]), "r--", alpha=0.8)
    ax.set_xlabel(col)
    ax.set_ylabel('Price')
    ax.set_title(f'Marginal: {col} vs Price')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Model comparison
print("\n" + "="*40)
print("MODEL COMPARISON: Simple vs Multiple")
print("="*40)

# Evaluate multiple regression
multi_metrics = evaluate_model(y_test_m, y_test_pred_m, X_test_m, "Multiple Regression")

# Compare with simple regression
print(f"\n📊 Performance Improvement:")
print(f"  R² improvement: {(multi_metrics['R2'] - test_metrics['R2'])*100:.2f}%")
print(f"  RMSE reduction: ${(test_metrics['RMSE'] - multi_metrics['RMSE']):,.2f}")

if multi_metrics['Adj_R2'] > test_metrics['Adj_R2']:
    print("  ✅ Multiple regression performs better (higher Adj R²)")
else:
    print("  ⚠️ Simple regression might be sufficient")

🎯 6. Case Study: House Price Prediction

📋 Real Estate Price Prediction

Build a house price prediction model from various property features to help real estate agents set competitive prices.

# COMPREHENSIVE CASE STUDY
print("="*60)
print("CASE STUDY: REAL ESTATE PRICE PREDICTION")
print("="*60)

# Generate more realistic dataset
np.random.seed(42)
n_houses = 500

# Create features
data_case = {
    'Size_m2': np.random.gamma(4, 30, n_houses),
    'Bedrooms': np.random.choice([1, 2, 3, 4, 5], n_houses, p=[0.1, 0.3, 0.35, 0.2, 0.05]),
    'Bathrooms': np.random.choice([1, 2, 3], n_houses, p=[0.4, 0.5, 0.1]),
    'Age_years': np.random.uniform(0, 50, n_houses),
    'Garage': np.random.choice([0, 1, 2], n_houses, p=[0.3, 0.5, 0.2]),
    'Location_Score': np.random.beta(5, 2, n_houses) * 10,
    'School_Distance': np.random.exponential(2, n_houses),
    'Crime_Rate': np.random.beta(2, 5, n_houses) * 10
}

df_case = pd.DataFrame(data_case)

# Create realistic price
df_case['Price'] = (
    80000 +
    df_case['Size_m2'] * 1200 +
    df_case['Bedrooms'] * 8000 +
    df_case['Bathrooms'] * 5000 +
    df_case['Garage'] * 10000 +
    df_case['Location_Score'] * 6000 -
    df_case['Age_years'] * 500 -
    df_case['School_Distance'] * 2000 -
    df_case['Crime_Rate'] * 3000 +
    np.random.normal(0, 15000, n_houses)
)

print("\n📊 Dataset Overview:")
print(df_case.head())
print(f"\nDataset shape: {df_case.shape}")
print(f"Price range: ${df_case['Price'].min():,.0f} - ${df_case['Price'].max():,.0f}")
print(f"Average price: ${df_case['Price'].mean():,.0f}")

# Feature engineering
df_case['Total_Rooms'] = df_case['Bedrooms'] + df_case['Bathrooms']
df_case['Size_per_Room'] = df_case['Size_m2'] / df_case['Total_Rooms']
df_case['Age_Category'] = pd.cut(df_case['Age_years'], 
                                  bins=[0, 10, 25, 50], 
                                  labels=['New', 'Medium', 'Old'])
df_case['Location_Category'] = pd.cut(df_case['Location_Score'], 
                                       bins=[0, 3, 7, 10], 
                                       labels=['Poor', 'Average', 'Excellent'])

# Prepare for modeling
features = ['Size_m2', 'Bedrooms', 'Bathrooms', 'Age_years', 'Garage', 
           'Location_Score', 'School_Distance', 'Crime_Rate']
X = df_case[features]
y = df_case['Price']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train multiple models
from sklearn.linear_model import Ridge, Lasso, ElasticNet

models = {
    'Linear Regression': LinearRegression(),
    'Ridge (α=1)': Ridge(alpha=1),
    'Lasso (α=1)': Lasso(alpha=1),
    'ElasticNet (α=1)': ElasticNet(alpha=1)
}

results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    results[name] = {
        'R2': r2_score(y_test, y_pred),
        'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
        'MAE': mean_absolute_error(y_test, y_pred)
    }

# Results comparison
results_df = pd.DataFrame(results).T
print("\n📊 Model Comparison:")
print(results_df.round(4))

# Best model analysis
best_model_name = results_df['R2'].idxmax()
print(f"\n🏆 Best Model: {best_model_name}")
print(f"   R² Score: {results_df.loc[best_model_name, 'R2']:.4f}")
print(f"   RMSE: ${results_df.loc[best_model_name, 'RMSE']:,.2f}")

# Feature importance for best model
best_model = models[best_model_name]
best_model.fit(X_train_scaled, y_train)

# Visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Real Estate Price Prediction Analysis', fontsize=16, fontweight='bold')

# 1. Model comparison
axes[0, 0].bar(results_df.index, results_df['R2'], color='skyblue')
axes[0, 0].set_ylabel('R² Score')
axes[0, 0].set_title('Model Performance Comparison')
axes[0, 0].tick_params(axis='x', rotation=45)
axes[0, 0].grid(True, alpha=0.3)

# 2. Feature importance
if hasattr(best_model, 'coef_'):
    feature_imp = pd.DataFrame({
        'Feature': features,
        'Importance': np.abs(best_model.coef_)
    }).sort_values('Importance', ascending=True)
    
    axes[0, 1].barh(feature_imp['Feature'], feature_imp['Importance'], color='coral')
    axes[0, 1].set_xlabel('Absolute Coefficient')
    axes[0, 1].set_title('Feature Importance')

# 3. Prediction accuracy
y_pred_best = best_model.predict(X_test_scaled)
axes[0, 2].scatter(y_test, y_pred_best, alpha=0.5)
axes[0, 2].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
axes[0, 2].set_xlabel('Actual Price')
axes[0, 2].set_ylabel('Predicted Price')
axes[0, 2].set_title(f'Best Model: {best_model_name}')
axes[0, 2].grid(True, alpha=0.3)

# 4. Error distribution
errors = y_test - y_pred_best
axes[1, 0].hist(errors, bins=30, color='lightgreen', edgecolor='black', alpha=0.7)
axes[1, 0].axvline(errors.mean(), color='red', linestyle='--', 
                   label=f'Mean: ${errors.mean():,.0f}')
axes[1, 0].set_xlabel('Prediction Error ($)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Error Distribution')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# 5. Price distribution by location
axes[1, 1].boxplot([df_case[df_case['Location_Category'] == cat]['Price'].values 
                    for cat in ['Poor', 'Average', 'Excellent']],
                   labels=['Poor', 'Average', 'Excellent'])
axes[1, 1].set_xlabel('Location Category')
axes[1, 1].set_ylabel('Price ($)')
axes[1, 1].set_title('Price by Location Quality')
axes[1, 1].grid(True, alpha=0.3)

# 6. Learning curves
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    best_model, X_train_scaled, y_train, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='r2'
)

axes[1, 2].plot(train_sizes, np.mean(train_scores, axis=1), 
                'o-', color='blue', label='Training score')
axes[1, 2].plot(train_sizes, np.mean(val_scores, axis=1), 
                'o-', color='red', label='Cross-validation score')
axes[1, 2].set_xlabel('Training Set Size')
axes[1, 2].set_ylabel('R² Score')
axes[1, 2].set_title('Learning Curves')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Business insights
print("\n💡 Business Insights:")
print("1. Size and Location are the strongest price predictors")
print("2. Each additional bedroom adds ~$8,000 to price")
print("3. Houses lose ~$500 in value per year of age")
print("4. Properties near schools command premium prices")
print("5. Crime rate significantly impacts property values")

🧪 Example Problems & Exercises

Exercise 1: Simple Linear Regression

  1. Load the advertising dataset (TV, Radio, Newspaper, Sales)
  2. Build a regression model of TV spending vs Sales
  3. Compute and interpret the slope and intercept
  4. Evaluate with R², RMSE, and MAE
  5. Construct a 95% prediction interval (see the sketch after this list)
  6. Check all regression assumptions
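
A hint for item 5: statsmodels can produce proper prediction intervals via get_prediction. A minimal sketch, assuming the advertising data is in a file named advertising.csv with TV and Sales columns (both names are assumptions):

# Hint: 95% prediction interval with statsmodels (file and column names are assumptions)
import pandas as pd
import statsmodels.api as sm

ads = pd.read_csv('advertising.csv')                 # assumed columns: TV, Radio, Newspaper, Sales
X_tv = sm.add_constant(ads[['TV']])                  # add an intercept column
ols_fit = sm.OLS(ads['Sales'], X_tv).fit()

new_X = sm.add_constant(pd.DataFrame({'TV': [100.0, 200.0]}), has_constant='add')
pred = ols_fit.get_prediction(new_X)
print(pred.summary_frame(alpha=0.05)[['mean', 'obs_ci_lower', 'obs_ci_upper']])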

Exercise 2: Multiple Regression

  1. Use all features (TV, Radio, Newspaper)
  2. Compare with the simple regression model
  3. Identify the most important feature
  4. Check for multicollinearity (VIF) (see the sketch after this list)
  5. Apply feature scaling
  6. Compare Ridge, Lasso, and ElasticNet
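
A hint for item 4: statsmodels provides variance_inflation_factor. A minimal sketch, reusing the assumed ads DataFrame from the previous hint:

# Hint: variance inflation factors (VIF) for multicollinearity
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_vif = sm.add_constant(ads[['TV', 'Radio', 'Newspaper']])   # 'ads' assumed loaded as in Exercise 1
vif = pd.DataFrame({
    'Feature': X_vif.columns,
    'VIF': [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
})
print(vif)   # a VIF above roughly 5-10 usually signals problematic multicollinearity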

Exercise 3: Model Selection

  1. Implement forward selection (see the sketch after this list)
  2. Implement backward elimination
  3. Use cross-validation for model selection
  4. Plot learning curves
  5. Analyze the bias-variance tradeoff
  6. Create an ensemble model
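
A hint for items 1 and 3: scikit-learn's SequentialFeatureSelector performs forward (or backward) selection with cross-validation. A minimal sketch, reusing the X and y from the case study above:

# Hint: forward feature selection with cross-validation (reuses X, y from the case study)
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                     direction='forward', cv=5, scoring='r2')
selector.fit(X, y)
selected = list(X.columns[selector.get_support()])
print("Selected features:", selected)

# Compare full vs reduced model with 5-fold cross-validation
print("Full model CV R²   :", cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2').mean())
print("Reduced model CV R²:", cross_val_score(LinearRegression(), X[selected], y, cv=5, scoring='r2').mean())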

💡 Best Practices:

  • Always split data: use a train/test split or cross-validation
  • Scale features: when feature ranges differ significantly
  • Check assumptions: before interpreting the results
  • Handle outliers: they can strongly distort the fit
  • Feature engineering: can improve performance
  • Regularization: helps prevent overfitting
  • Domain knowledge: essential for interpretation

✍️ Conclusion

In this session, students have learned:

  • ✅ Theory and concepts of linear regression
  • ✅ Implementation with scikit-learn
  • ✅ Model evaluation with various metrics
  • ✅ Validation of regression assumptions
  • ✅ Multiple regression and feature importance
  • ✅ Regularization techniques
  • ✅ Real-world case study

Next session: Basic Classification