Data Wrangling — Cleaning, Merging, and Grouping

I thought I just had to load a CSV and press "Train." I was wrong. My data was scattered across two different files. It had empty rows. It had text where numbers should be.

If I feed this into a model, it will crash. Today is about Data Wrangling: The art of Merging, Cleaning, and Shaping data.

1. Assembling the Data: Merging & Concat

Before I can clean, I need to get everything in one place.

Scenario A: Stacking (`pd.concat`)

I have data_january.csv and data_february.csv. They have the same columns, just different rows. I need to stack them on top of each other.

df1 = pd.read_csv('jan_data.csv')
df2 = pd.read_csv('feb_data.csv')

# Axis 0 = Stack vertically (add rows)
full_df = pd.concat([df1, df2], axis=0)

Scenario B: The Database Join (`pd.merge`)

This is like a SQL Join. I have a Users Table (ID, Name) and a Sales Table (ID, Purchase). I need to connect them via the ID.

users = pd.DataFrame({'ID': [1, 2], 'Name': ['Max', 'Jaden']})
sales = pd.DataFrame({'ID': [1, 2], 'Amount': [500, 700]})

# Merge on the 'ID' column
# 'inner' means only keep rows where ID exists in BOTH tables
merged_df = pd.merge(users, sales, on='ID', how='inner')

2. Handling Empty Cells (`NaN`)

"NaN" stands for Not a Number. It’s a hole in the dataset. We have two choices: Remove it or Fix it.

Option A: The Nuclear Option (`dropna`)

If a row is missing critical data, it's garbage. Throw it out.

# Drop rows containing ANY missing values
new_df = df.dropna()

# Drop rows only if 'Calories' is missing
df.dropna(subset=['Calories'], inplace=True)

Pros: 100% clean data.
Cons: You might delete 50% of your dataset if you aren't careful.

Option B: The Surgeon's Option (`fillna`)

Instead of deleting, we make an educated guess. This is called Imputation. For numerical data, we usually fill the hole with the Mean (average) or Median (middle value).

# Calculate the mean (average)
x = df["Calories"].mean()

# Fill empty cells with that average
df["Calories"].fillna(x, inplace=True)

Why Median? If you have outliers (e.g., one person burned 10,000 calories), the Mean will be skewed. The Median is safer.

2. Cleaning Wrong Formats

Sometimes data exists, but it's in the wrong costume.

Dates as Text: "2023/12/01" is a String, not a Date.
Numbers as Text: "450" (string) cannot be multiplied.

# Convert to DateTime format
df['Date'] = pd.to_datetime(df['Date'])

# Fix specific errors (e.g., typo in row 7)
df.loc[7, 'Date'] = '2023-12-15'

3. Converting Text to Numbers

This was the hardest concept for me to grasp. Machine Learning models is just math. You cannot multiply "Cat" * 5. You have to turn "Cat" into a number.

Why not just label them 0, 1, 2?

If I have Red, Green, Blue, and I make them 0, 1, 2:

The model thinks Blue (2) is "greater than" Red (0).
It thinks Green (1) is the average of Red and Blue.
This implies a ranking that doesn't exist.

The Solution: One-Hot Encoding

We create a new column for every category and mark it with a 1 or 0.

Color	Is_Red	Is_Green	Is_Blue
Red	1	0	0
Green	0	1	0
Blue	0	0	1

# Convert categorical variables into dummy/indicator variables
df = pd.get_dummies(df, columns=['Color'])

print(df.head())
# Now I have columns like 'Color_Red', 'Color_Green' with 1s and 0s.

Pro Tip: In Scikit-Learn (which I'll use later), this is done using OneHotEncoder, but for data analysis, Pandas get_dummies is the quick fix.

4. Removing Duplicates

Sometimes you accidentally scrape the same data twice. Duplicate rows bias the model (it memorizes that specific example).

# Check for duplicates
print(df.duplicated().sum())

# Remove them
df.drop_duplicates(inplace=True)

5. Grouping (`groupby`)

This is where Pandas becomes powerful. groupby allows us to split the data into categories, apply a function, and combine the results. It’s essential for analysis and smarter cleaning.

Basic Analysis: "What is the average calorie burn for each Workout Type?"

# Group by 'Type' and calculate the mean of 'Calories'
print(df.groupby('Type')['Calories'].mean())

Advanced Cleaning (Smart Imputation): The problem: Filling a missing "Age" with the global average might be wrong. The solution: Fill missing "Age" with the average age of that person's specific Job Title.

# Fill missing values based on group averages
df['Age'] = df['Age'].fillna(df.groupby('Job')['Age'].transform('mean'))

The Output

My dataset started as a mess. Now:

Merged into a single Master DataFrame.
Empty cells are filled with the Median.
Dates are actual Date Objects.
Imputed missing values (using Group statistics).
"Categories" are now mathematically readable "One-Hot" vectors.
Duplicates are gone.

Data Wrangling — Cleaning, Merging, and Grouping

1. Assembling the Data: Merging & Concat

Scenario A: Stacking (`pd.concat`)

Scenario B: The Database Join (`pd.merge`)

2. Handling Empty Cells (`NaN`)

Option A: The Nuclear Option (`dropna`)

Option B: The Surgeon's Option (`fillna`)

2. Cleaning Wrong Formats

3. Converting Text to Numbers

Why not just label them 0, 1, 2?

The Solution: One-Hot Encoding

4. Removing Duplicates

5. Grouping (`groupby`)

The Output

Comments

Machine Learning

Pandas: A Deep Dive

More from this blog

Convexity, Logistic Loss, and the Overfitting Battle

Logistic Regression and Decision Surfaces

Optimization Mechanics — Feature Scaling, Convergence Testing, and Polynomial Mapping

Upgrading to Multiple Linear Regression

gradient_descent.py

Command Palette

1. Assembling the Data: Merging & Concat

Scenario A: Stacking (pd.concat)

Scenario B: The Database Join (pd.merge)

2. Handling Empty Cells (NaN)

Option A: The Nuclear Option (dropna)

Option B: The Surgeon's Option (fillna)

2. Cleaning Wrong Formats

3. Converting Text to Numbers

Why not just label them 0, 1, 2?

The Solution: One-Hot Encoding

4. Removing Duplicates

5. Grouping (groupby)

The Output

Comments

Machine Learning

Pandas: A Deep Dive

More from this blog

Scenario A: Stacking (`pd.concat`)

Scenario B: The Database Join (`pd.merge`)

2. Handling Empty Cells (`NaN`)

Option A: The Nuclear Option (`dropna`)

Option B: The Surgeon's Option (`fillna`)

5. Grouping (`groupby`)