Session 6 - Data Cleaning and Transformation
Learning Objectives
- Understand why data cleaning matters
- Identify and handle missing values
- Detect and handle duplicate data
- Apply normalization and standardization
- Apply a cleaning pipeline to a real dataset
1. Why Does Data Cleaning Matter?
"Garbage In, Garbage Out" - bad data means bad analysis!
Raw data often contains:
- Missing Values - empty/NULL entries
- Duplicate Data - repeated rows
- Inconsistent Format - mixed formats
- Outliers - extreme values
- Invalid Data - values outside the valid range
Impact of dirty data:
- Biased analysis
- Inaccurate ML models
- Wasted resources
- Wrong business decisions
2. Importing Libraries and the Dataset
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
print("Titanic dataset")
print(df.head())
print(f"\nShape: {df.shape}")
print(f"Total: {len(df):,} rows")
3. Handling Missing Values
Detecting Missing Values
# Check missing values
print("\nMISSING VALUES CHECK")
print("="*50)
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing': missing,
    'Percentage': missing_pct
})
missing_df = missing_df[missing_df['Missing'] > 0].sort_values('Missing', ascending=False)
print(missing_df)
# Visualization
plt.figure(figsize=(10, 6))
missing_df['Percentage'].plot(kind='barh', color='crimson')
plt.title('Missing Values per Column (%)', fontweight='bold')
plt.xlabel('Percentage')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
Strategies for Handling Missing Values
| Strategy | When to Use | Advantage |
|---|---|---|
| Drop Rows | Missing < 5% | Simple, fast |
| Drop Columns | Missing > 70% | Removes useless columns |
| Fill Mean/Median | Numeric data | Preserves data size |
| Fill Mode | Categorical data | Sensible for categories |
# Cleaning implementation
df_clean = df.copy()
# 1. Drop columns with more than 70% missing.
# dropna's thresh is the minimum number of NON-missing values a column must
# have to survive, so keeping columns with <= 70% missing means thresh = 30% of rows.
threshold = int(0.3 * len(df))
df_clean = df_clean.dropna(thresh=threshold, axis=1)
# 2. Fill Age with the median (direct assignment; column-level
# fillna(inplace=True) is deprecated in recent pandas)
df_clean['Age'] = df_clean['Age'].fillna(df_clean['Age'].median())
# 3. Fill Embarked with the mode
df_clean['Embarked'] = df_clean['Embarked'].fillna(df_clean['Embarked'].mode()[0])
# 4. Drop any remaining rows with missing values
df_clean = df_clean.dropna()
print("\nCleaning complete!")
print(f"Before: {df.shape}")
print(f"After: {df_clean.shape}")
print(f"Rows removed: {df.shape[0] - df_clean.shape[0]}")
4. Removing Duplicates
# Check for duplicates
duplicates = df_clean.duplicated().sum()
print(f"\nDuplicates: {duplicates} rows")
if duplicates > 0:
    df_clean = df_clean.drop_duplicates()
    print(f"{duplicates} duplicates removed")
else:
    print("No duplicates found")
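Beyond exact full-row duplicates, `drop_duplicates` can also deduplicate on selected key columns via its `subset` and `keep` parameters. A quick sketch on a made-up frame (the column names here are purely illustrative):

```python
import pandas as pd

# Toy frame: row 2 repeats row 0 exactly; 'Ana' also appears with a different city
toy = pd.DataFrame({
    'name': ['Ana', 'Ben', 'Ana', 'Ana'],
    'city': ['Jakarta', 'Bandung', 'Jakarta', 'Surabaya'],
})

# Default: only rows identical in EVERY column count as duplicates
print(toy.duplicated().sum())  # 1 (row 2 repeats row 0)

# Deduplicate on 'name' only, keeping the last occurrence of each name
deduped = toy.drop_duplicates(subset=['name'], keep='last')
print(deduped)
```

With `subset=['name']` all three 'Ana' rows collapse into one, which can silently discard information (the two different cities), so subset-based deduplication should only be used when the subset really is a unique key.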
5. Normalization & Standardization
Normalization vs Standardization
| Aspect | Normalization | Standardization |
|---|---|---|
| Formula | (X - min) / (max - min) | (X - μ) / σ |
| Range | [0, 1] | Unbounded |
| When | Uniformly distributed data | Normally distributed data |
| Outliers | Very sensitive | Less sensitive |
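The two formulas in the table can be checked by hand with NumPy before reaching for scikit-learn (the sample array below is made up for illustration):

```python
import numpy as np

x = np.array([10.0, 20.0, 40.0, 50.0])

# Normalization: (X - min) / (max - min) -> values land in [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: (X - mean) / std -> mean 0, std 1
# numpy's default is the population std (ddof=0), the same as StandardScaler
x_std = (x - x.mean()) / x.std()

print(x_norm)                                # 0, 0.25, 0.75, 1
print(x_std.mean(), x_std.std())             # approximately 0 and 1
```

`MinMaxScaler` and `StandardScaler` apply exactly these formulas column by column, while also remembering the fitted min/max or mean/std so the same transform can be reused on new data.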
# Normalization and standardization
numeric_cols = ['Age', 'Fare']
# Normalization (0-1)
scaler_norm = MinMaxScaler()
df_clean[['Age_norm', 'Fare_norm']] = scaler_norm.fit_transform(df_clean[numeric_cols])
# Standardization (mean=0, std=1)
scaler_std = StandardScaler()
df_clean[['Age_std', 'Fare_std']] = scaler_std.fit_transform(df_clean[numeric_cols])
# Compare distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Original vs Normalized vs Standardized', fontsize=16, fontweight='bold')
# Original
axes[0, 0].hist(df_clean['Age'], bins=30, color='blue', alpha=0.7, edgecolor='black')
axes[0, 0].set_title('Age - Original')
axes[0, 1].hist(df_clean['Fare'], bins=30, color='blue', alpha=0.7, edgecolor='black')
axes[0, 1].set_title('Fare - Original')
# Normalized
axes[1, 0].hist(df_clean['Age_norm'], bins=30, color='green', alpha=0.7, edgecolor='black')
axes[1, 0].set_title('Age - Normalized')
axes[1, 1].hist(df_clean['Fare_norm'], bins=30, color='green', alpha=0.7, edgecolor='black')
axes[1, 1].set_title('Fare - Normalized')
plt.tight_layout()
plt.show()
# Summary statistics
for col in ['Age', 'Fare']:
    print(f"\n{col}:")
    print(f"  Original:     Mean={df_clean[col].mean():.2f}, Std={df_clean[col].std():.2f}")
    print(f"  Normalized:   Mean={df_clean[col+'_norm'].mean():.2f}, Std={df_clean[col+'_norm'].std():.2f}")
    print(f"  Standardized: Mean={df_clean[col+'_std'].mean():.2f}, Std={df_clean[col+'_std'].std():.2f}")
6. Case Study - Titanic Cleaning Pipeline
Complete Data Cleaning Pipeline
A full pipeline for cleaning the Titanic dataset
# FULL PIPELINE
print("="*60)
print("DATA CLEANING PIPELINE - TITANIC")
print("="*60)
# 1. Load
df = pd.read_csv(url)
print(f"\n1. Load: {df.shape}")
# 2. Drop uninformative columns
drop_cols = ['PassengerId', 'Name', 'Ticket', 'Cabin']
df = df.drop(columns=drop_cols)
print(f"2. Drop columns: {df.shape}")
# 3. Handle missing values (fill Age and Fare within each passenger class)
df['Age'] = df.groupby('Pclass')['Age'].transform(lambda x: x.fillna(x.median()))
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Fare'] = df.groupby('Pclass')['Fare'].transform(lambda x: x.fillna(x.median()))
print(f"3. Fill missing: {df.isnull().sum().sum()} remaining")
# 4. Remove duplicates
df = df.drop_duplicates()
print(f"4. Remove duplicates: {df.shape}")
# 5. Feature engineering
df['Family_Size'] = df['SibSp'] + df['Parch'] + 1
df['Is_Alone'] = (df['Family_Size'] == 1).astype(int)
print("5. Features added: Family_Size, Is_Alone")
# 6. Normalize
scaler = MinMaxScaler()
df[['Age_scaled', 'Fare_scaled']] = scaler.fit_transform(df[['Age', 'Fare']])
print("6. Normalized: Age, Fare")
print(f"\nPipeline complete: {df.shape}")
print(f"Missing values: {df.isnull().sum().sum()}")
# Save
df.to_csv('titanic_cleaned.csv', index=False)
print("\nSaved: titanic_cleaned.csv")
Data Cleaning Best Practices:
- Understand your data first - explore before cleaning
- Document everything - record every decision you make
- Keep the original data - back up the raw dataset
- Be consistent - apply one strategy throughout
- Validate results - check with visualizations
- Domain knowledge matters - understand the context
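The "validate results" point can be made concrete with a few cheap assertions run at the end of a pipeline. A minimal sketch, using a small hypothetical frame in place of the cleaned Titanic data:

```python
import pandas as pd

# Hypothetical cleaned frame standing in for the pipeline output
cleaned = pd.DataFrame({
    'Age': [22.0, 28.0, 35.0],
    'Fare': [7.25, 13.0, 52.0],
})

# Sanity checks after cleaning
assert cleaned.isnull().sum().sum() == 0      # no missing values remain
assert cleaned.duplicated().sum() == 0        # no duplicate rows remain
assert cleaned['Age'].between(0, 100).all()   # ages inside a plausible range
print("All validation checks passed")
```

Checks like these fail loudly if a later change to the pipeline reintroduces a problem, which is much easier to notice than a subtly shifted histogram.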
7. Exercises
Exercise 1: Cleaning Pipeline
- Download a dataset from Kaggle
- Identify all data quality issues
- Build a complete cleaning pipeline
- Document every step
- Visualize before and after
Exercise 2: Missing Values
- Create a dataset with missing values
- Implement 3 different strategies
- Compare the results
- Decide which strategy works best
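A possible starting point for Exercise 2 - the column names and values below are made up; the point is the three strategies side by side:

```python
import pandas as pd
import numpy as np

# Small synthetic dataset with missing values
raw = pd.DataFrame({
    'score': [80.0, np.nan, 90.0, 70.0, np.nan, 100.0],
    'grade': ['B', 'A', None, 'C', 'A', 'A'],
})

# Strategy 1: drop every row that has any missing value
dropped = raw.dropna()

# Strategy 2: fill the numeric column with its median
filled_median = raw.assign(score=raw['score'].fillna(raw['score'].median()))

# Strategy 3: fill the categorical column with its mode
filled_mode = raw.assign(grade=raw['grade'].fillna(raw['grade'].mode()[0]))

print(len(raw), len(dropped))                 # dropping loses rows
print(filled_median['score'].isna().sum())    # filling keeps all rows
print(filled_mode['grade'].isna().sum())
```

Comparing `raw.describe()` against each filled variant shows how the chosen strategy shifts the summary statistics, which is exactly the comparison the exercise asks for.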
Conclusion
In this session, students have learned:
- Why data cleaning matters
- Techniques for handling missing values
- Detecting and removing duplicates
- Normalization vs standardization
- Building a complete cleaning pipeline
- Data cleaning best practices
Next session: a case study with a real dataset