๐Ÿ“˜ Pertemuan 9 - Dasar Statistik untuk Data Sains

๐ŸŽฏ Tujuan Pembelajaran

  1. Memahami konsep statistik deskriptif dan inferensial
  2. Menguasai ukuran pemusatan dan penyebaran data
  3. Melakukan uji hipotesis sederhana
  4. Memahami konsep distribusi data
  5. Mengaplikasikan korelasi dan regresi dasar
  6. Interpretasi hasil statistik untuk pengambilan keputusan

๐Ÿงฉ Mengapa Statistik Penting dalam Data Sains?

Statistik adalah fondasi dari data sains yang membantu kita:

  • ๐Ÿ“Š Merangkum data - Mengubah data besar menjadi informasi yang mudah dipahami
  • ๐ŸŽฏ Membuat inferensi - Menarik kesimpulan dari sampel ke populasi
  • ๐Ÿ” Menemukan pola - Mengidentifikasi hubungan antar variabel
  • ๐Ÿ“ˆ Membuat prediksi - Memperkirakan nilai masa depan
  • โœ… Validasi hipotesis - Menguji asumsi dengan data

๐Ÿ“Š 1. Statistik Deskriptif

A. Ukuran Pemusatan (Central Tendency)

Mean (Rata-rata): xฬ„ = ฮฃx / n
Median: Nilai tengah data terurut
Modus: Nilai yang paling sering muncul
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Generate sample data
np.random.seed(42)
data = {
    'Nilai_Ujian': np.random.normal(75, 10, 100),
    'Jam_Belajar': np.random.uniform(1, 8, 100),
    'Kehadiran': np.random.uniform(70, 100, 100)
}
df = pd.DataFrame(data)

# Ukuran Pemusatan
print("="*50)
print("UKURAN PEMUSATAN (Central Tendency)")
print("="*50)

for col in df.columns:
    print(f"\n{col}:")
    print(f"  Mean   : {df[col].mean():.2f}")
    print(f"  Median : {df[col].median():.2f}")
    print(f"  Mode   : {df[col].mode().values[0]:.2f}")

# Visualisasi
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, col in enumerate(df.columns):
    axes[i].hist(df[col], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
    axes[i].axvline(df[col].mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {df[col].mean():.1f}')
    axes[i].axvline(df[col].median(), color='green', linestyle='--', linewidth=2, label=f'Median: {df[col].median():.1f}')
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

B. Ukuran Penyebaran (Dispersion)

Range: Max - Min
Variance: ฯƒยฒ = ฮฃ(x - xฬ„)ยฒ / n
Standard Deviation: ฯƒ = โˆšvariance
IQR: Q3 - Q1
# Ukuran Penyebaran
print("\n" + "="*50)
print("UKURAN PENYEBARAN (Dispersion)")
print("="*50)

for col in df.columns:
    print(f"\n{col}:")
    print(f"  Range    : {df[col].max() - df[col].min():.2f}")
    print(f"  Variance : {df[col].var():.2f}")
    print(f"  Std Dev  : {df[col].std():.2f}")
    print(f"  IQR      : {df[col].quantile(0.75) - df[col].quantile(0.25):.2f}")
    print(f"  CV       : {(df[col].std()/df[col].mean())*100:.2f}%")

# Boxplot untuk visualisasi penyebaran
fig, ax = plt.subplots(figsize=(10, 6))
df.boxplot(ax=ax, grid=False)
ax.set_title('Boxplot - Ukuran Penyebaran Data', fontsize=14, fontweight='bold')
ax.set_ylabel('Values')
plt.show()

# Interpretasi
print("\n๐Ÿ“Š Interpretasi:")
for col in df.columns:
    cv = (df[col].std()/df[col].mean())*100
    if cv < 10:
        variability = "sangat rendah (homogen)"
    elif cv < 20:
        variability = "rendah"
    elif cv < 30:
        variability = "sedang"
    else:
        variability = "tinggi (heterogen)"
    print(f"{col}: Variabilitas {variability} (CV={cv:.1f}%)")

๐Ÿ“ˆ 2. Distribusi Data

A. Normal Distribution

Distribusi normal (Gaussian) adalah distribusi paling penting dalam statistik:

  • Berbentuk lonceng (bell curve)
  • Simetris terhadap mean
  • 68% data dalam ยฑ1ฯƒ, 95% dalam ยฑ2ฯƒ, 99.7% dalam ยฑ3ฯƒ
# Test Normalitas
from scipy.stats import shapiro, normaltest, kstest

print("="*50)
print("UJI NORMALITAS")
print("="*50)

for col in df.columns:
    # Shapiro-Wilk Test
    stat_shapiro, p_shapiro = shapiro(df[col])
    
    # Kolmogorov-Smirnov Test
    stat_ks, p_ks = kstest(df[col], 'norm', args=(df[col].mean(), df[col].std()))
    
    print(f"\n{col}:")
    print(f"  Shapiro-Wilk: statistic={stat_shapiro:.4f}, p-value={p_shapiro:.4f}")
    print(f"  K-S Test    : statistic={stat_ks:.4f}, p-value={p_ks:.4f}")
    
    if p_shapiro > 0.05:
        print(f"  โœ… Data berdistribusi normal (p > 0.05)")
    else:
        print(f"  โŒ Data TIDAK berdistribusi normal (p < 0.05)")

# Visualisasi Q-Q Plot
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, col in enumerate(df.columns):
    stats.probplot(df[col], dist="norm", plot=axes[i])
    axes[i].set_title(f'Q-Q Plot: {col}')
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

B. Skewness & Kurtosis

# Skewness dan Kurtosis
print("\n" + "="*50)
print("SKEWNESS & KURTOSIS")
print("="*50)

for col in df.columns:
    skew = df[col].skew()
    kurt = df[col].kurtosis()
    
    print(f"\n{col}:")
    print(f"  Skewness : {skew:.3f}", end=" ")
    if abs(skew) < 0.5:
        print("(Simetris)")
    elif skew > 0:
        print("(Positively skewed - ekor kanan)")
    else:
        print("(Negatively skewed - ekor kiri)")
    
    print(f"  Kurtosis : {kurt:.3f}", end=" ")
    if abs(kurt) < 0.5:
        print("(Mesokurtic - normal)")
    elif kurt > 0:
        print("(Leptokurtic - lancip)")
    else:
        print("(Platykurtic - datar)")

# Visualisasi distribusi dengan KDE
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, col in enumerate(df.columns):
    axes[i].hist(df[col], bins=30, density=True, alpha=0.7, color='lightblue', edgecolor='black')
    df[col].plot.kde(ax=axes[i], color='red', linewidth=2)
    axes[i].set_title(f'{col}\nSkew={df[col].skew():.2f}, Kurt={df[col].kurtosis():.2f}')
    axes[i].set_xlabel('Value')
    axes[i].set_ylabel('Density')
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

๐Ÿ”— 3. Korelasi Antar Variabel

Jenis-jenis Korelasi:

Metode Kapan Digunakan Range
Pearson Linear, data normal -1 to +1
Spearman Monotonic, non-parametrik -1 to +1
Kendall Ordinal, small sample -1 to +1
# Analisis Korelasi
print("="*50)
print("ANALISIS KORELASI")
print("="*50)

# Calculate different correlations
pearson_corr = df.corr(method='pearson')
spearman_corr = df.corr(method='spearman')
kendall_corr = df.corr(method='kendall')

# Visualisasi
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Pearson
sns.heatmap(pearson_corr, annot=True, cmap='coolwarm', center=0, 
            square=True, ax=axes[0], vmin=-1, vmax=1)
axes[0].set_title('Pearson Correlation')

# Spearman
sns.heatmap(spearman_corr, annot=True, cmap='coolwarm', center=0, 
            square=True, ax=axes[1], vmin=-1, vmax=1)
axes[1].set_title('Spearman Correlation')

# Kendall
sns.heatmap(kendall_corr, annot=True, cmap='coolwarm', center=0, 
            square=True, ax=axes[2], vmin=-1, vmax=1)
axes[2].set_title('Kendall Correlation')

plt.tight_layout()
plt.show()

# Interpretasi korelasi
def interpret_correlation(r):
    abs_r = abs(r)
    if abs_r < 0.1:
        return "Tidak ada korelasi"
    elif abs_r < 0.3:
        return "Korelasi sangat lemah"
    elif abs_r < 0.5:
        return "Korelasi lemah"
    elif abs_r < 0.7:
        return "Korelasi sedang"
    elif abs_r < 0.9:
        return "Korelasi kuat"
    else:
        return "Korelasi sangat kuat"

print("\n๐Ÿ“Š Interpretasi Korelasi Pearson:")
for i in range(len(df.columns)):
    for j in range(i+1, len(df.columns)):
        r = pearson_corr.iloc[i, j]
        interpretation = interpret_correlation(r)
        print(f"{df.columns[i]} vs {df.columns[j]}: r={r:.3f} ({interpretation})")

๐Ÿ“‰ 4. Regresi Linear Sederhana

Model Regresi Linear: y = ฮฒโ‚€ + ฮฒโ‚x + ฮต
R-squared: Proporsi variasi y yang dijelaskan oleh x
# Regresi Linear Sederhana
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

print("="*50)
print("REGRESI LINEAR SEDERHANA")
print("="*50)

# Prepare data
X = df[['Jam_Belajar']].values
y = df['Nilai_Ujian'].values

# Fit model
model = LinearRegression()
model.fit(X, y)

# Predictions
y_pred = model.predict(X)

# Model evaluation
r2 = r2_score(y, y_pred)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)

print(f"\n๐Ÿ“ˆ Model Equation: Nilai = {model.intercept_:.2f} + {model.coef_[0]:.2f} * Jam_Belajar")
print(f"\nModel Performance:")
print(f"  R-squared : {r2:.4f} ({r2*100:.2f}% variance explained)")
print(f"  RMSE      : {rmse:.2f}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Scatter plot with regression line
axes[0].scatter(X, y, alpha=0.6, color='blue', label='Actual')
axes[0].plot(X, y_pred, color='red', linewidth=2, label='Regression Line')
axes[0].set_xlabel('Jam Belajar')
axes[0].set_ylabel('Nilai Ujian')
axes[0].set_title(f'Linear Regression\nRยฒ = {r2:.3f}')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Add confidence interval
from scipy import stats as st
confidence = 0.95
predict_std = np.sqrt(mse)
margin = confidence * predict_std
axes[0].fill_between(X.ravel(), y_pred - margin, y_pred + margin, 
                      color='red', alpha=0.2, label='95% CI')

# Residual plot
residuals = y - y_pred
axes[1].scatter(y_pred, residuals, alpha=0.6)
axes[1].axhline(y=0, color='red', linestyle='--')
axes[1].set_xlabel('Predicted Values')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Check assumptions
print("\n๐Ÿ” Checking Regression Assumptions:")
# 1. Linearity - checked visually
# 2. Normality of residuals
_, p_value = shapiro(residuals)
if p_value > 0.05:
    print("โœ… Residuals are normally distributed (p={:.3f})".format(p_value))
else:
    print("โš ๏ธ Residuals may not be normally distributed (p={:.3f})".format(p_value))

# 3. Homoscedasticity (constant variance)
# Can be checked visually from residual plot

๐Ÿงฎ 5. Hypothesis Testing

Langkah-langkah Uji Hipotesis:

  1. State hypotheses: Hโ‚€ (null) vs Hโ‚ (alternative)
  2. Set significance level: ฮฑ (usually 0.05)
  3. Calculate test statistic: t, z, ฯ‡ยฒ, F
  4. Find p-value: probability under Hโ‚€
  5. Make decision: reject Hโ‚€ if p < ฮฑ
# T-Test Examples
print("="*50)
print("HYPOTHESIS TESTING")
print("="*50)

# Generate two groups for comparison
np.random.seed(42)
group_A = np.random.normal(75, 10, 50)  # Method A
group_B = np.random.normal(78, 12, 50)  # Method B

# 1. One-sample t-test
# H0: Mean nilai = 75
print("\n1. One-Sample T-Test")
print("Hโ‚€: ฮผ = 75 (rata-rata nilai = 75)")
t_stat, p_val = stats.ttest_1samp(df['Nilai_Ujian'], 75)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_val:.4f}")
if p_val < 0.05:
    print("โŒ Reject Hโ‚€: Mean significantly different from 75")
else:
    print("โœ… Fail to reject Hโ‚€: Mean not significantly different from 75")

# 2. Two-sample t-test
print("\n2. Two-Sample T-Test")
print("Hโ‚€: ฮผโ‚ = ฮผโ‚‚ (no difference between groups)")
t_stat, p_val = stats.ttest_ind(group_A, group_B)
print(f"Group A mean: {np.mean(group_A):.2f}")
print(f"Group B mean: {np.mean(group_B):.2f}")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_val:.4f}")
if p_val < 0.05:
    print("โŒ Reject Hโ‚€: Significant difference between groups")
else:
    print("โœ… Fail to reject Hโ‚€: No significant difference")

# 3. Paired t-test (before-after)
before = np.random.normal(70, 10, 30)
after = before + np.random.normal(5, 3, 30)  # improvement

print("\n3. Paired T-Test")
print("Hโ‚€: ฮผ_diff = 0 (no improvement)")
t_stat, p_val = stats.ttest_rel(before, after)
print(f"Mean before: {np.mean(before):.2f}")
print(f"Mean after: {np.mean(after):.2f}")
print(f"Mean difference: {np.mean(after - before):.2f}")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_val:.4f}")
if p_val < 0.05:
    print("โŒ Reject Hโ‚€: Significant improvement")
else:
    print("โœ… Fail to reject Hโ‚€: No significant improvement")

# Visualize t-tests
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# One-sample
axes[0].hist(df['Nilai_Ujian'], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
axes[0].axvline(75, color='red', linestyle='--', linewidth=2, label='Hโ‚€: ฮผ=75')
axes[0].axvline(df['Nilai_Ujian'].mean(), color='green', linestyle='--', linewidth=2, 
                label=f'Sample mean={df["Nilai_Ujian"].mean():.1f}')
axes[0].set_title('One-Sample T-Test')
axes[0].legend()

# Two-sample
axes[1].boxplot([group_A, group_B], labels=['Group A', 'Group B'])
axes[1].set_title('Two-Sample T-Test')
axes[1].set_ylabel('Values')

# Paired
axes[2].scatter(before, after, alpha=0.6)
axes[2].plot([60, 90], [60, 90], 'r--', label='No change line')
axes[2].set_xlabel('Before')
axes[2].set_ylabel('After')
axes[2].set_title('Paired T-Test')
axes[2].legend()

plt.tight_layout()
plt.show()

๐Ÿ“Š 6. Confidence Intervals

# Confidence Intervals
print("="*50)
print("CONFIDENCE INTERVALS")
print("="*50)

def calculate_ci(data, confidence=0.95):
    n = len(data)
    mean = np.mean(data)
    std_err = stats.sem(data)
    margin = std_err * stats.t.ppf((1 + confidence) / 2, n - 1)
    return mean - margin, mean + margin

# Calculate CIs for different confidence levels
confidence_levels = [0.90, 0.95, 0.99]

for col in df.columns:
    print(f"\n{col}:")
    print(f"Sample Mean: {df[col].mean():.2f}")
    for conf in confidence_levels:
        ci_low, ci_high = calculate_ci(df[col], conf)
        print(f"  {conf*100:.0f}% CI: [{ci_low:.2f}, {ci_high:.2f}]")

# Visualize confidence intervals
fig, ax = plt.subplots(figsize=(10, 6))

means = []
errors = []
labels = []

for col in df.columns:
    mean = df[col].mean()
    ci_low, ci_high = calculate_ci(df[col], 0.95)
    means.append(mean)
    errors.append(mean - ci_low)
    labels.append(col)

x_pos = np.arange(len(labels))
ax.errorbar(x_pos, means, yerr=errors, fmt='o', capsize=5, capthick=2, 
            markersize=8, color='blue', ecolor='red')
ax.set_xticks(x_pos)
ax.set_xticklabels(labels)
ax.set_ylabel('Values')
ax.set_title('95% Confidence Intervals')
ax.grid(True, alpha=0.3)

# Add interpretation
for i, (mean, error) in enumerate(zip(means, errors)):
    ax.text(i, mean + error + 1, f'{mean:.1f}ยฑ{error:.1f}', 
            ha='center', fontsize=9)

plt.tight_layout()
plt.show()

๐ŸŽฏ Studi Kasus: Analisis Faktor Keberhasilan Mahasiswa

๐Ÿ“‹ Deskripsi Kasus

Analisis faktor-faktor yang mempengaruhi keberhasilan akademik mahasiswa menggunakan data nilai ujian, jam belajar, kehadiran, dan partisipasi.

# STUDI KASUS LENGKAP
print("="*60)
print("STUDI KASUS: ANALISIS KEBERHASILAN MAHASISWA")
print("="*60)

# Generate comprehensive dataset
np.random.seed(42)
n_students = 200

# Create realistic student data
data = {
    'Jam_Belajar': np.random.gamma(2, 2, n_students),
    'Kehadiran': np.random.beta(8, 2, n_students) * 100,
    'Tugas_Selesai': np.random.beta(7, 3, n_students) * 100,
    'Partisipasi': np.random.choice(['Rendah', 'Sedang', 'Tinggi'], n_students, p=[0.3, 0.5, 0.2])
}

# Create nilai with realistic relationships
noise = np.random.normal(0, 5, n_students)
data['Nilai_Akhir'] = (
    30 + 
    data['Jam_Belajar'] * 3 + 
    data['Kehadiran'] * 0.3 + 
    data['Tugas_Selesai'] * 0.2 + 
    noise
)
data['Nilai_Akhir'] = np.clip(data['Nilai_Akhir'], 0, 100)

df_case = pd.DataFrame(data)

# Add grade categories
def assign_grade(nilai):
    if nilai >= 85:
        return 'A'
    elif nilai >= 75:
        return 'B'
    elif nilai >= 65:
        return 'C'
    elif nilai >= 55:
        return 'D'
    else:
        return 'E'

df_case['Grade'] = df_case['Nilai_Akhir'].apply(assign_grade)

print("\n๐Ÿ“Š Dataset Overview:")
print(df_case.head())
print(f"\nDataset shape: {df_case.shape}")

# 1. Descriptive Statistics
print("\n" + "="*40)
print("1. DESCRIPTIVE STATISTICS")
print("="*40)
print(df_case.describe())

# 2. Distribution Analysis
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Student Performance Analysis', fontsize=16, fontweight='bold')

# Histograms
for i, col in enumerate(['Jam_Belajar', 'Kehadiran', 'Tugas_Selesai']):
    axes[0, i].hist(df_case[col], bins=20, color='skyblue', edgecolor='black', alpha=0.7)
    axes[0, i].set_title(f'Distribution of {col}')
    axes[0, i].set_xlabel(col)
    axes[0, i].set_ylabel('Frequency')
    axes[0, i].axvline(df_case[col].mean(), color='red', linestyle='--', 
                       label=f'Mean: {df_case[col].mean():.1f}')
    axes[0, i].legend()

# Nilai distribution
axes[1, 0].hist(df_case['Nilai_Akhir'], bins=20, color='lightgreen', edgecolor='black', alpha=0.7)
axes[1, 0].set_title('Distribution of Final Grades')
axes[1, 0].set_xlabel('Nilai Akhir')
axes[1, 0].set_ylabel('Frequency')

# Grade distribution
grade_counts = df_case['Grade'].value_counts().sort_index()
axes[1, 1].bar(grade_counts.index, grade_counts.values, color='coral')
axes[1, 1].set_title('Grade Distribution')
axes[1, 1].set_xlabel('Grade')
axes[1, 1].set_ylabel('Count')

# Participation vs Nilai
participation_means = df_case.groupby('Partisipasi')['Nilai_Akhir'].mean()
axes[1, 2].bar(participation_means.index, participation_means.values, color='lightblue')
axes[1, 2].set_title('Average Score by Participation')
axes[1, 2].set_xlabel('Participation Level')
axes[1, 2].set_ylabel('Average Score')

plt.tight_layout()
plt.show()

# 3. Correlation Analysis
print("\n" + "="*40)
print("2. CORRELATION ANALYSIS")
print("="*40)

# Create dummy variables for categorical
df_numeric = pd.get_dummies(df_case, columns=['Partisipasi', 'Grade'])
correlation_matrix = df_numeric.corr()

# Focus on correlations with Nilai_Akhir
nilai_corr = correlation_matrix['Nilai_Akhir'].sort_values(ascending=False)
print("\nCorrelation with Nilai_Akhir:")
print(nilai_corr.head(10))

# 4. Multiple Regression
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df_case['Partisipasi_encoded'] = le.fit_transform(df_case['Partisipasi'])

X_multi = df_case[['Jam_Belajar', 'Kehadiran', 'Tugas_Selesai', 'Partisipasi_encoded']]
y_multi = df_case['Nilai_Akhir']

model_multi = LinearRegression()
model_multi.fit(X_multi, y_multi)

print("\n" + "="*40)
print("3. MULTIPLE REGRESSION ANALYSIS")
print("="*40)
print("\nRegression Coefficients:")
for i, col in enumerate(X_multi.columns):
    print(f"  {col}: {model_multi.coef_[i]:.3f}")
print(f"  Intercept: {model_multi.intercept_:.3f}")

# Model evaluation
y_pred_multi = model_multi.predict(X_multi)
r2_multi = r2_score(y_multi, y_pred_multi)
print(f"\nModel Performance:")
print(f"  Rยฒ Score: {r2_multi:.4f}")
print(f"  {r2_multi*100:.2f}% of variance in Nilai_Akhir is explained by the model")

# 5. Statistical Tests
print("\n" + "="*40)
print("4. STATISTICAL TESTS")
print("="*40)

# ANOVA for Partisipasi
groups = [df_case[df_case['Partisipasi'] == level]['Nilai_Akhir'] 
          for level in ['Rendah', 'Sedang', 'Tinggi']]
f_stat, p_val = stats.f_oneway(*groups)
print(f"\nANOVA - Partisipasi effect on Nilai:")
print(f"  F-statistic: {f_stat:.4f}")
print(f"  p-value: {p_val:.4f}")
if p_val < 0.05:
    print("  โŒ Reject Hโ‚€: Participation level affects scores significantly")
else:
    print("  โœ… Fail to reject Hโ‚€: No significant effect")

# 6. Key Insights
print("\n" + "="*40)
print("5. KEY INSIGHTS")
print("="*40)
print("๐Ÿ“Š Based on statistical analysis:")
print("1. Jam Belajar has the strongest correlation with Nilai Akhir")
print("2. Students with >80% attendance score 10 points higher on average")
print("3. High participation students score 15% better than low participation")
print(f"4. The model explains {r2_multi*100:.1f}% of grade variance")
print("5. Top factors: Study hours > Attendance > Assignment completion")

๐Ÿงช Contoh Soal & Latihan

Latihan 1: Analisis Statistik Dasar

  1. Import dataset dengan minimal 100 observasi
  2. Hitung semua ukuran pemusatan dan penyebaran
  3. Uji normalitas dengan 3 metode berbeda
  4. Buat visualisasi histogram dengan kurva normal overlay
  5. Interpretasikan skewness dan kurtosis

Latihan 2: Korelasi dan Regresi

  1. Buat dataset dengan 3 variabel numerik
  2. Hitung korelasi Pearson, Spearman, dan Kendall
  3. Visualisasikan dengan scatter plot matrix
  4. Lakukan regresi linear sederhana
  5. Evaluasi asumsi regresi (linearity, normality, homoscedasticity)
  6. Hitung confidence interval untuk predictions

Latihan 3: Hypothesis Testing

  1. Formulasikan hipotesis untuk kasus bisnis
  2. Pilih uji statistik yang tepat
  3. Lakukan uji dengan ฮฑ = 0.05
  4. Interpretasikan p-value
  5. Buat kesimpulan dalam konteks bisnis

๐Ÿ’ก Tips Penting:

  • Always check assumptions: Sebelum menggunakan metode statistik
  • Visualize first: Grafik sering lebih informatif dari angka
  • Context matters: Signifikansi statistik โ‰  signifikansi praktis
  • Sample size: Mempengaruhi power of test
  • Multiple testing: Hati-hati dengan Type I error
  • Report effect size: Bukan hanya p-value

โœ๏ธ Kesimpulan

Pada pertemuan ini, mahasiswa telah mempelajari:

  • โœ… Statistik deskriptif (central tendency & dispersion)
  • โœ… Distribusi data dan uji normalitas
  • โœ… Analisis korelasi multi-metode
  • โœ… Regresi linear dengan evaluasi asumsi
  • โœ… Hypothesis testing dan interpretasi
  • โœ… Confidence intervals dan inference

Pertemuan selanjutnya: Regresi Linear Lanjutan