📘 Meeting 10 - Simple Linear Regression
🎯 Learning Objectives
- Understand the concepts and theory of linear regression
- Implement linear regression in Python
- Evaluate models with a range of metrics
- Validate the assumptions of linear regression
- Make predictions and interpret the results
- Handle overfitting and underfitting
🧩 What is Linear Regression?
Linear regression is a statistical method for modeling a linear relationship between a dependent variable (Y) and one or more independent variables (X).
- 📈 Simple Linear Regression: one independent variable (Y = β₀ + β₁X)
- 📊 Multiple Linear Regression: several independent variables (Y = β₀ + β₁X₁ + β₂X₂ + ...)
- 🎯 Goal: predict continuous values and understand relationships between variables
📐 1. Linear Regression Theory
Linear regression model: Y = β₀ + β₁X + ε
β₀ = Intercept (constant)
β₁ = Slope (coefficient)
ε = Error term (residual)
The Ordinary Least Squares (OLS) Method
OLS minimizes the sum of squared residuals (SSE):
- SSE = Σ(yᵢ - ŷᵢ)²
- β₁ = Σ((xᵢ - x̄)(yᵢ - ȳ)) / Σ(xᵢ - x̄)²
- β₀ = ȳ - β₁x̄
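The OLS formulas above can be checked directly with NumPy. The sketch below uses a small toy dataset; the arrays x and y are made up purely for illustration:
# Manual OLS estimates for y = β₀ + β₁x (illustrative toy data)
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.0, 9.9])
x_mean, y_mean = x.mean(), y.mean()
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
beta0 = y_mean - beta1 * x_mean                                          # intercept
sse = np.sum((y - (beta0 + beta1 * x)) ** 2)                             # sum of squared residuals
print(f"β₀ = {beta0:.3f}, β₁ = {beta1:.3f}, SSE = {sse:.3f}")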
💻 2. Implementing Linear Regression
A. Setup and Data Preparation
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
# Generate sample data - House Price Prediction
np.random.seed(42)
n_samples = 200
# Features
size = np.random.uniform(50, 300, n_samples) # House size in m²
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.uniform(0, 50, n_samples)
location_score = np.random.uniform(1, 10, n_samples)
# Target variable with realistic relationship
noise = np.random.normal(0, 20000, n_samples)
price = 50000 + size * 1500 + bedrooms * 10000 - age * 800 + location_score * 5000 + noise
# Create DataFrame
df = pd.DataFrame({
'Size_m2': size,
'Bedrooms': bedrooms,
'Age_years': age,
'Location_Score': location_score,
'Price': price
})
print("="*60)
print("HOUSE PRICE DATASET")
print("="*60)
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"Price range: ${df['Price'].min():,.0f} - ${df['Price'].max():,.0f}")
B. Simple Linear Regression - Size vs Price
# Simple Linear Regression
print("\n" + "="*60)
print("SIMPLE LINEAR REGRESSION: Size vs Price")
print("="*60)
# Prepare data
X_simple = df[['Size_m2']]
y_simple = df['Price']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X_simple, y_simple, test_size=0.2, random_state=42
)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
# Train model
model_simple = LinearRegression()
model_simple.fit(X_train, y_train)
# Make predictions
y_train_pred = model_simple.predict(X_train)
y_test_pred = model_simple.predict(X_test)
# Model parameters
print(f"\n📈 Model Equation:")
print(f"Price = {model_simple.intercept_:,.2f} + {model_simple.coef_[0]:,.2f} × Size_m2")
print(f"\nInterpretation:")
print(f"- Base price (0 m²): ${model_simple.intercept_:,.0f}")
print(f"- Price per m²: ${model_simple.coef_[0]:,.0f}")
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# 1. Scatter plot with regression line
axes[0].scatter(X_train, y_train, alpha=0.5, label='Training data')
axes[0].scatter(X_test, y_test, alpha=0.5, label='Test data')
axes[0].plot(X_train, y_train_pred, color='red', linewidth=2, label='Regression line')
axes[0].set_xlabel('Size (m²)')
axes[0].set_ylabel('Price ($)')
axes[0].set_title('Linear Regression: Size vs Price')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Add an approximate 95% prediction band (±1.96 × RMSE, normal approximation)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
margin = 1.96 * rmse_train
sort_idx = np.argsort(X_train.values.ravel())
axes[0].fill_between(X_train.values.ravel()[sort_idx],
                     y_train_pred[sort_idx] - margin,
                     y_train_pred[sort_idx] + margin,
                     color='red', alpha=0.2, label='≈95% prediction band')
# 2. Residual plot
residuals_train = y_train - y_train_pred
axes[1].scatter(y_train_pred, residuals_train, alpha=0.5)
axes[1].axhline(y=0, color='red', linestyle='--')
axes[1].set_xlabel('Predicted Price')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot (Training)')
axes[1].grid(True, alpha=0.3)
# Add ±2 std lines
std_residuals = np.std(residuals_train)
axes[1].axhline(y=2*std_residuals, color='orange', linestyle='--', alpha=0.5)
axes[1].axhline(y=-2*std_residuals, color='orange', linestyle='--', alpha=0.5)
axes[1].text(0.02, 0.98, f'Std: {std_residuals:.0f}', transform=axes[1].transAxes)
# 3. Actual vs Predicted
axes[2].scatter(y_test, y_test_pred, alpha=0.5)
axes[2].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
'r--', label='Perfect prediction')
axes[2].set_xlabel('Actual Price')
axes[2].set_ylabel('Predicted Price')
axes[2].set_title('Actual vs Predicted (Test)')
axes[2].legend()
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
📊 3. Model Evaluation Metrics
Key Metrics for Regression
| Metric | Formula | Range | Interpretation |
|---|---|---|---|
| MAE | Σ|yᵢ - ŷᵢ|/n | [0, ∞) | Average absolute error |
| MSE | Σ(yᵢ - ŷᵢ)²/n | [0, ∞) | Penalizes large errors |
| RMSE | √MSE | [0, ∞) | Same unit as target |
| R² | 1 - (SSE/SST) | (-∞, 1] | Variance explained |
| Adj R² | 1 - [(1-R²)(n-1)/(n-p-1)] | (-∞, 1] | Penalizes complexity |
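As a cross-check of the formulas in the table, the same metrics can be computed by hand with NumPy; y_true and y_pred below are illustrative arrays, and p is assumed to be the number of predictors:
# Computing the table's metrics directly from their formulas (illustrative arrays)
import numpy as np
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])
n, p = len(y_true), 1                                   # p = number of predictors (assumed 1)
mae = np.mean(np.abs(y_true - y_pred))
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
sse = np.sum((y_true - y_pred) ** 2)
sst = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"MAE={mae:.2f}, RMSE={rmse:.2f}, R²={r2:.4f}, Adj R²={adj_r2:.4f}")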
# Comprehensive Model Evaluation
print("="*60)
print("MODEL EVALUATION METRICS")
print("="*60)
def evaluate_model(y_true, y_pred, X, model_name="Model"):
    """Calculate and display regression metrics"""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    # Adjusted R²
    n = len(y_true)
    p = X.shape[1] if len(X.shape) > 1 else 1
    adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - p - 1))
    # MAPE
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    print(f"\n{model_name} Performance:")
    print(f"  MAE      : ${mae:,.2f}")
    print(f"  MSE      : {mse:,.2f}")
    print(f"  RMSE     : ${rmse:,.2f}")
    print(f"  R² Score : {r2:.4f} ({r2*100:.2f}% variance explained)")
    print(f"  Adj R²   : {adj_r2:.4f}")
    print(f"  MAPE     : {mape:.2f}%")
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2, 'Adj_R2': adj_r2, 'MAPE': mape}
# Evaluate on both sets
train_metrics = evaluate_model(y_train, y_train_pred, X_train, "Training Set")
test_metrics = evaluate_model(y_test, y_test_pred, X_test, "Test Set")
# Check for overfitting
print("\n" + "="*40)
print("OVERFITTING CHECK")
print("="*40)
r2_diff = train_metrics['R2'] - test_metrics['R2']
print(f"R² difference (Train - Test): {r2_diff:.4f}")
if r2_diff > 0.1:
    print("⚠️ Warning: Possible overfitting detected")
elif r2_diff < -0.05:
    print("⚠️ Note: Test R² is noticeably higher than training R² (unusual; re-check the split)")
else:
    print("✅ Model generalizes well")
# Visualize metrics comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Metrics comparison
metrics_names = ['MAE', 'RMSE', 'MAPE']
train_values = [train_metrics[m] for m in metrics_names]
test_values = [test_metrics[m] for m in metrics_names]
x_pos = np.arange(len(metrics_names))
width = 0.35
axes[0].bar(x_pos - width/2, train_values, width, label='Training', color='skyblue')
axes[0].bar(x_pos + width/2, test_values, width, label='Test', color='lightcoral')
axes[0].set_xlabel('Metrics')
axes[0].set_ylabel('Value')
axes[0].set_title('Error Metrics Comparison')
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(metrics_names)
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# R² comparison
axes[1].bar(['Training', 'Test'],
[train_metrics['R2'], test_metrics['R2']],
color=['skyblue', 'lightcoral'])
axes[1].set_ylabel('R² Score')
axes[1].set_title('R² Score Comparison')
axes[1].set_ylim([0, 1])
axes[1].grid(True, alpha=0.3)
for i, v in enumerate([train_metrics['R2'], test_metrics['R2']]):
    axes[1].text(i, v + 0.01, f'{v:.3f}', ha='center')
plt.tight_layout()
plt.show()
🔍 4. Validating Regression Assumptions
Four Key Assumptions of Linear Regression
- Linearity: Linear relationship between X and Y
- Independence: Residuals are independent
- Homoscedasticity: Constant variance of residuals
- Normality: Residuals are normally distributed
# Assumption Testing
print("="*60)
print("REGRESSION ASSUMPTIONS VALIDATION")
print("="*60)
residuals = y_test - y_test_pred
# Create diagnostic plots
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Regression Diagnostics', fontsize=16, fontweight='bold')
# 1. Linearity - Component + Residual Plot
axes[0, 0].scatter(X_test, y_test, alpha=0.5, label='Actual')
axes[0, 0].plot(X_test, y_test_pred, 'r-', label='Fitted', linewidth=2)
axes[0, 0].set_xlabel('Size (m²)')
axes[0, 0].set_ylabel('Price ($)')
axes[0, 0].set_title('1. Linearity Check')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# 2. Independence - Residuals vs Order
axes[0, 1].plot(residuals, marker='o', linestyle='-', alpha=0.5)
axes[0, 1].axhline(y=0, color='red', linestyle='--')
axes[0, 1].set_xlabel('Observation Order')
axes[0, 1].set_ylabel('Residuals')
axes[0, 1].set_title('2. Independence Check')
axes[0, 1].grid(True, alpha=0.3)
# Durbin-Watson test
from statsmodels.stats.stattools import durbin_watson
dw = durbin_watson(residuals)
axes[0, 1].text(0.02, 0.98, f'Durbin-Watson: {dw:.2f}',
transform=axes[0, 1].transAxes, fontsize=10)
# 3. Homoscedasticity - Scale-Location Plot
axes[0, 2].scatter(y_test_pred, np.sqrt(np.abs(residuals)), alpha=0.5)
z = np.polyfit(y_test_pred, np.sqrt(np.abs(residuals)), 1)
p = np.poly1d(z)
axes[0, 2].plot(y_test_pred, p(y_test_pred), "r--", alpha=0.8)
axes[0, 2].set_xlabel('Fitted Values')
axes[0, 2].set_ylabel('√|Residuals|')
axes[0, 2].set_title('3. Homoscedasticity Check')
axes[0, 2].grid(True, alpha=0.3)
# 4. Normality - Q-Q Plot
stats.probplot(residuals, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('4. Normality Check (Q-Q Plot)')
axes[1, 0].grid(True, alpha=0.3)
# 5. Normality - Histogram
axes[1, 1].hist(residuals, bins=20, density=True, alpha=0.7, color='skyblue', edgecolor='black')
mu, std = residuals.mean(), residuals.std()
xmin, xmax = axes[1, 1].get_xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, std)
axes[1, 1].plot(x, p, 'r-', linewidth=2, label='Normal distribution')
axes[1, 1].set_xlabel('Residuals')
axes[1, 1].set_ylabel('Density')
axes[1, 1].set_title('5. Residual Distribution')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
# 6. Leverage - Cook's Distance
n = len(y_test)
k = X_test.shape[1]
# Design matrix with an intercept column (NumPy arrays, to avoid DataFrame alignment errors)
X_design = np.column_stack([np.ones(n), X_test.values])
hat_matrix = X_design @ np.linalg.inv(X_design.T @ X_design) @ X_design.T
leverage = np.diag(hat_matrix)
residuals_arr = residuals.values
standardized_residuals = residuals_arr / (np.std(residuals_arr) * np.sqrt(1 - leverage))
cooks_distance = (standardized_residuals**2 / (k + 1)) * (leverage / (1 - leverage))
axes[1, 2].stem(range(len(cooks_distance)), cooks_distance, markerfmt='o')
axes[1, 2].axhline(y=4/n, color='red', linestyle='--', label=f'Threshold (4/n={4/n:.3f})')
axes[1, 2].set_xlabel('Observation Index')
axes[1, 2].set_ylabel("Cook's Distance")
axes[1, 2].set_title("6. Influential Points (Cook's Distance)")
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Statistical tests
print("\n📊 Statistical Tests for Assumptions:")
# Normality test
shapiro_stat, shapiro_p = stats.shapiro(residuals)
print(f"\n1. Normality (Shapiro-Wilk Test):")
print(f" Statistic: {shapiro_stat:.4f}, p-value: {shapiro_p:.4f}")
if shapiro_p > 0.05:
    print("   ✅ Residuals appear normally distributed")
else:
    print("   ⚠️ Residuals may not be normally distributed")
# Homoscedasticity check: Spearman correlation between fitted values and squared residuals
# (a simple proxy; see the Breusch-Pagan sketch below for a formal test)
residuals_squared = residuals**2
_, bp_p = stats.spearmanr(y_test_pred, residuals_squared)
print(f"\n2. Homoscedasticity (Spearman correlation test):")
print(f"   p-value: {bp_p:.4f}")
if bp_p > 0.05:
    print("   ✅ Constant variance assumption appears satisfied")
else:
    print("   ⚠️ Heteroscedasticity may be present")
# Independence test
print(f"\n3. Independence (Durbin-Watson Test):")
print(f" Statistic: {dw:.4f}")
if 1.5 < dw < 2.5:
    print("   ✅ No significant autocorrelation")
else:
    print("   ⚠️ Autocorrelation may be present")
🚀 5. Multiple Linear Regression
# Multiple Linear Regression
print("\n" + "="*60)
print("MULTIPLE LINEAR REGRESSION")
print("="*60)
# Prepare data with all features
X_multi = df[['Size_m2', 'Bedrooms', 'Age_years', 'Location_Score']]
y_multi = df['Price']
# Split data
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(
X_multi, y_multi, test_size=0.2, random_state=42
)
# Train model
model_multi = LinearRegression()
model_multi.fit(X_train_m, y_train_m)
# Predictions
y_train_pred_m = model_multi.predict(X_train_m)
y_test_pred_m = model_multi.predict(X_test_m)
# Model equation
print("📈 Model Equation:")
print(f"Price = {model_multi.intercept_:,.2f}")
for i, col in enumerate(X_multi.columns):
    sign = "+" if model_multi.coef_[i] >= 0 else ""
    print(f"   {sign} {model_multi.coef_[i]:,.2f} × {col}")
# Feature importance
feature_importance = pd.DataFrame({
'Feature': X_multi.columns,
'Coefficient': model_multi.coef_,
'Abs_Coefficient': np.abs(model_multi.coef_)
}).sort_values('Abs_Coefficient', ascending=False)
print("\n📊 Feature Importance (by coefficient magnitude):")
print(feature_importance.to_string(index=False))
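# Note: the raw coefficients above are on different feature scales, so their magnitudes
# are not directly comparable. An illustrative way to compare importance is to refit on
# standardized features (names below such as model_std are illustrative):
from sklearn.preprocessing import StandardScaler
X_train_m_std = StandardScaler().fit_transform(X_train_m)
model_std = LinearRegression().fit(X_train_m_std, y_train_m)
std_importance = pd.DataFrame({
    'Feature': X_multi.columns,
    'Std_Coefficient': model_std.coef_
}).sort_values('Std_Coefficient', key=np.abs, ascending=False)
print("\n📊 Standardized coefficients (effect of a 1-std-dev change in each feature):")
print(std_importance.to_string(index=False))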
# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# 1. Feature coefficients
axes[0, 0].barh(feature_importance['Feature'], feature_importance['Coefficient'],
color=['green' if x > 0 else 'red' for x in feature_importance['Coefficient']])
axes[0, 0].set_xlabel('Coefficient Value')
axes[0, 0].set_title('Feature Coefficients')
axes[0, 0].grid(True, alpha=0.3)
# 2. Actual vs Predicted
axes[0, 1].scatter(y_test_m, y_test_pred_m, alpha=0.5)
axes[0, 1].plot([y_test_m.min(), y_test_m.max()],
[y_test_m.min(), y_test_m.max()], 'r--')
axes[0, 1].set_xlabel('Actual Price')
axes[0, 1].set_ylabel('Predicted Price')
axes[0, 1].set_title('Multiple Regression: Actual vs Predicted')
axes[0, 1].grid(True, alpha=0.3)
# 3. Residuals
residuals_m = y_test_m - y_test_pred_m
axes[0, 2].scatter(y_test_pred_m, residuals_m, alpha=0.5)
axes[0, 2].axhline(y=0, color='red', linestyle='--')
axes[0, 2].set_xlabel('Predicted Price')
axes[0, 2].set_ylabel('Residuals')
axes[0, 2].set_title('Residual Plot')
axes[0, 2].grid(True, alpha=0.3)
# 4-6. Per-feature scatter plots (simple bivariate fits for visualization, not true partial-regression plots)
for i, col in enumerate(['Size_m2', 'Bedrooms', 'Location_Score']):
    ax = axes[1, i]
    ax.scatter(X_test_m[col], y_test_m, alpha=0.5)
    # Simple linear fit for visualization
    z = np.polyfit(X_test_m[col], y_test_m, 1)
    p = np.poly1d(z)
    ax.plot(X_test_m[col], p(X_test_m[col]), "r--", alpha=0.8)
    ax.set_xlabel(col)
    ax.set_ylabel('Price')
    ax.set_title(f'{col} vs Price')
    ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Model comparison
print("\n" + "="*40)
print("MODEL COMPARISON: Simple vs Multiple")
print("="*40)
# Evaluate multiple regression
multi_metrics = evaluate_model(y_test_m, y_test_pred_m, X_test_m, "Multiple Regression")
# Compare with simple regression
print(f"\n📊 Performance Improvement:")
print(f" R² improvement: {(multi_metrics['R2'] - test_metrics['R2'])*100:.2f}%")
print(f" RMSE reduction: ${(test_metrics['RMSE'] - multi_metrics['RMSE']):,.2f}")
if multi_metrics['Adj_R2'] > test_metrics['Adj_R2']:
    print("   ✅ Multiple regression performs better (higher Adj R²)")
else:
    print("   ⚠️ Simple regression might be sufficient")
🎯 6. Case Study: House Price Prediction
📋 Real Estate Price Prediction
Build a house price prediction model from a range of property features to help real estate agents set competitive prices.
# COMPREHENSIVE CASE STUDY
print("="*60)
print("CASE STUDY: REAL ESTATE PRICE PREDICTION")
print("="*60)
# Generate more realistic dataset
np.random.seed(42)
n_houses = 500
# Create features
data_case = {
'Size_m2': np.random.gamma(4, 30, n_houses),
'Bedrooms': np.random.choice([1, 2, 3, 4, 5], n_houses, p=[0.1, 0.3, 0.35, 0.2, 0.05]),
'Bathrooms': np.random.choice([1, 2, 3], n_houses, p=[0.4, 0.5, 0.1]),
'Age_years': np.random.uniform(0, 50, n_houses),
'Garage': np.random.choice([0, 1, 2], n_houses, p=[0.3, 0.5, 0.2]),
'Location_Score': np.random.beta(5, 2, n_houses) * 10,
'School_Distance': np.random.exponential(2, n_houses),
'Crime_Rate': np.random.beta(2, 5, n_houses) * 10
}
df_case = pd.DataFrame(data_case)
# Create realistic price
df_case['Price'] = (
80000 +
df_case['Size_m2'] * 1200 +
df_case['Bedrooms'] * 8000 +
df_case['Bathrooms'] * 5000 +
df_case['Garage'] * 10000 +
df_case['Location_Score'] * 6000 -
df_case['Age_years'] * 500 -
df_case['School_Distance'] * 2000 -
df_case['Crime_Rate'] * 3000 +
np.random.normal(0, 15000, n_houses)
)
print("\n📊 Dataset Overview:")
print(df_case.head())
print(f"\nDataset shape: {df_case.shape}")
print(f"Price range: ${df_case['Price'].min():,.0f} - ${df_case['Price'].max():,.0f}")
print(f"Average price: ${df_case['Price'].mean():,.0f}")
# Feature engineering
df_case['Total_Rooms'] = df_case['Bedrooms'] + df_case['Bathrooms']
df_case['Size_per_Room'] = df_case['Size_m2'] / df_case['Total_Rooms']
df_case['Age_Category'] = pd.cut(df_case['Age_years'],
bins=[0, 10, 25, 50],
labels=['New', 'Medium', 'Old'])
df_case['Location_Category'] = pd.cut(df_case['Location_Score'],
bins=[0, 3, 7, 10],
labels=['Poor', 'Average', 'Excellent'])
# Prepare for modeling
features = ['Size_m2', 'Bedrooms', 'Bathrooms', 'Age_years', 'Garage',
'Location_Score', 'School_Distance', 'Crime_Rate']
X = df_case[features]
y = df_case['Price']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Feature scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train multiple models
from sklearn.linear_model import Ridge, Lasso, ElasticNet
models = {
'Linear Regression': LinearRegression(),
'Ridge (α=1)': Ridge(alpha=1),
'Lasso (α=1)': Lasso(alpha=1),
'ElasticNet (α=1)': ElasticNet(alpha=1)
}
results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    results[name] = {
        'R2': r2_score(y_test, y_pred),
        'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
        'MAE': mean_absolute_error(y_test, y_pred)
    }
# Results comparison
results_df = pd.DataFrame(results).T
print("\n📊 Model Comparison:")
print(results_df.round(4))
# Best model analysis
best_model_name = results_df['R2'].idxmax()
print(f"\n🏆 Best Model: {best_model_name}")
print(f" R² Score: {results_df.loc[best_model_name, 'R2']:.4f}")
print(f" RMSE: ${results_df.loc[best_model_name, 'RMSE']:,.2f}")
# Feature importance for best model
best_model = models[best_model_name]
best_model.fit(X_train_scaled, y_train)
# Visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Real Estate Price Prediction Analysis', fontsize=16, fontweight='bold')
# 1. Model comparison
axes[0, 0].bar(results_df.index, results_df['R2'], color='skyblue')
axes[0, 0].set_ylabel('R² Score')
axes[0, 0].set_title('Model Performance Comparison')
axes[0, 0].set_xticklabels(results_df.index, rotation=45)
axes[0, 0].grid(True, alpha=0.3)
# 2. Feature importance
if hasattr(best_model, 'coef_'):
    feature_imp = pd.DataFrame({
        'Feature': features,
        'Importance': np.abs(best_model.coef_)
    }).sort_values('Importance', ascending=True)
    axes[0, 1].barh(feature_imp['Feature'], feature_imp['Importance'], color='coral')
    axes[0, 1].set_xlabel('Absolute Coefficient')
    axes[0, 1].set_title('Feature Importance')
# 3. Prediction accuracy
y_pred_best = best_model.predict(X_test_scaled)
axes[0, 2].scatter(y_test, y_pred_best, alpha=0.5)
axes[0, 2].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
axes[0, 2].set_xlabel('Actual Price')
axes[0, 2].set_ylabel('Predicted Price')
axes[0, 2].set_title(f'Best Model: {best_model_name}')
axes[0, 2].grid(True, alpha=0.3)
# 4. Error distribution
errors = y_test - y_pred_best
axes[1, 0].hist(errors, bins=30, color='lightgreen', edgecolor='black', alpha=0.7)
axes[1, 0].axvline(errors.mean(), color='red', linestyle='--',
label=f'Mean: ${errors.mean():,.0f}')
axes[1, 0].set_xlabel('Prediction Error ($)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Error Distribution')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# 5. Price distribution by location
axes[1, 1].boxplot([df_case[df_case['Location_Category'] == cat]['Price'].values
for cat in ['Poor', 'Average', 'Excellent']],
labels=['Poor', 'Average', 'Excellent'])
axes[1, 1].set_xlabel('Location Category')
axes[1, 1].set_ylabel('Price ($)')
axes[1, 1].set_title('Price by Location Quality')
axes[1, 1].grid(True, alpha=0.3)
# 6. Learning curves
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(
best_model, X_train_scaled, y_train, cv=5,
train_sizes=np.linspace(0.1, 1.0, 10),
scoring='r2'
)
axes[1, 2].plot(train_sizes, np.mean(train_scores, axis=1),
'o-', color='blue', label='Training score')
axes[1, 2].plot(train_sizes, np.mean(val_scores, axis=1),
'o-', color='red', label='Cross-validation score')
axes[1, 2].set_xlabel('Training Set Size')
axes[1, 2].set_ylabel('R² Score')
axes[1, 2].set_title('Learning Curves')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Business insights
print("\n💡 Business Insights:")
print("1. Size and Location are the strongest price predictors")
print("2. Each additional bedroom adds ~$8,000 to price")
print("3. Houses lose ~$500 in value per year of age")
print("4. Properties near schools command premium prices")
print("5. Crime rate significantly impacts property values")
🧪 Example Problems & Exercises
Exercise 1: Simple Linear Regression
- Load the advertising dataset (TV, Radio, Newspaper, Sales)
- Build a regression model of TV spending vs Sales
- Compute and interpret the slope and intercept
- Evaluate with R², RMSE, and MAE
- Construct a 95% prediction interval (see the sketch after this list)
- Check all regression assumptions
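For the 95% prediction interval, statsmodels' get_prediction is one option. A minimal sketch, assuming the advertising data has been loaded into a DataFrame named ads with columns TV and Sales (illustrative names):
# 95% prediction interval with statsmodels (sketch; the 'ads' DataFrame is assumed)
import pandas as pd
import statsmodels.api as sm
X_ad = sm.add_constant(ads[['TV']])
ols_fit = sm.OLS(ads['Sales'], X_ad).fit()
new_X = sm.add_constant(pd.DataFrame({'TV': [100, 200, 300]}))    # new TV spending values
pred = ols_fit.get_prediction(new_X)
print(pred.summary_frame(alpha=0.05)[['mean', 'obs_ci_lower', 'obs_ci_upper']])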
Exercise 2: Multiple Regression
- Use all features (TV, Radio, Newspaper)
- Compare with the simple regression
- Identify the most important feature
- Check for multicollinearity (VIF) (see the sketch after this list)
- Apply feature scaling
- Compare Ridge, Lasso, ElasticNet
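For the multicollinearity check, statsmodels provides variance_inflation_factor. A minimal sketch under the same assumption that the advertising data is in a DataFrame named ads:
# Variance Inflation Factor check (sketch; the 'ads' DataFrame is assumed)
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_vif = sm.add_constant(ads[['TV', 'Radio', 'Newspaper']])
vif = pd.DataFrame({
    'Feature': X_vif.columns,
    'VIF': [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
})
print(vif)  # as a rule of thumb, VIF above ~5-10 signals problematic multicollinearity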
Exercise 3: Model Selection
- Implement forward selection (see the sketch after this list)
- Implement backward elimination
- Use cross-validation for model selection
- Plot learning curves
- Analyze bias-variance tradeoff
- Create ensemble model
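For the selection steps, scikit-learn's SequentialFeatureSelector covers both forward selection and backward elimination with cross-validation. A minimal sketch, again assuming the advertising data is in a DataFrame named ads:
# Forward selection with cross-validation (sketch; the 'ads' DataFrame is assumed)
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
X_sel = ads[['TV', 'Radio', 'Newspaper']]
y_sel = ads['Sales']
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2,
                                direction='forward', cv=5, scoring='r2')
sfs.fit(X_sel, y_sel)
print("Selected features:", list(X_sel.columns[sfs.get_support()]))
# direction='backward' gives backward elimination with the same API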
💡 Best Practices:
- Always split your data: use a train/test split or cross-validation
- Scale features: when feature ranges differ significantly
- Check assumptions: before interpreting the results
- Handle outliers: they can strongly affect the results
- Feature engineering: can improve performance
- Regularization: to prevent overfitting
- Domain knowledge: essential for interpretation
✍️ Conclusion
In this meeting, students have learned:
- ✅ Linear regression theory and concepts
- ✅ Implementation with scikit-learn
- ✅ Model evaluation with various metrics
- ✅ Validation of regression assumptions
- ✅ Multiple regression and feature importance
- ✅ Regularization techniques
- ✅ Real-world case study
Next meeting: Basic Classification