Day 25: Data Cleaning - The Most Important Skill in Analytics
It's often said that data analysts spend about 80% of their time cleaning and preparing data. Messy, inconsistent data leads to incorrect analysis and bad business decisions. Learning to clean data effectively is a true superpower.
Common Data Cleaning Tasks
- Correcting Data Types: Columns are often loaded with the wrong type (e.g., a 'Price' column with '$' symbols is read as a string). Use `.astype()` to convert columns to the correct type (e.g., `float`, `datetime64[ns]`).
- String Manipulation: Use the `.str` accessor on a Series to apply string methods to every element at once (e.g., `df['Category'].str.lower()`, `df['Region'].str.strip()`).
- Standardizing Categories: Use the `.replace()` method to consolidate inconsistent values (e.g., mapping "USA" and "United States" to a single category).
- Handling Duplicates: Use `df.drop_duplicates()` to remove duplicate rows. The `subset` parameter lets you define which columns to check for duplicates (e.g., `subset=['OrderID']`). A short sketch combining all four tasks follows below.
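To make these tasks concrete, here is a minimal sketch on a small, made-up DataFrame (the column names and values are illustrative only, not the lesson's dataset):

```python
import pandas as pd

# Tiny, made-up DataFrame used only to illustrate the four tasks above
df = pd.DataFrame(
    {
        "OrderID": [1, 1, 2],
        "Price": ["$1,200.50", "$1,200.50", "$99.99"],
        "Category": ["Laptops ", "Laptops ", "ACCESSORIES"],
        "Region": ["USA", "USA", "United States"],
    }
)

# Correcting data types: strip '$' and ',' so the column can become a float
df["Price"] = df["Price"].str.replace(r"[$,]", "", regex=True).astype(float)

# String manipulation: the .str accessor applies a method to every element
df["Category"] = df["Category"].str.strip().str.lower()

# Standardizing categories: map variants onto a single label
df["Region"] = df["Region"].replace({"United States": "USA"})

# Handling duplicates: keep only the first row for each OrderID
df = df.drop_duplicates(subset=["OrderID"])
print(df)
```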
Environment Setup
Before you begin, ensure you have followed the setup instructions in the main README.md to set up your virtual environment and install the required libraries.
Exploring the Refactored Code
The script for this lesson, data_cleaning.py, has been refactored to encapsulate the entire cleaning process into a single, reusable function.
- Review the Code: Open `Day_25_Data_Cleaning/data_cleaning.py`. Examine the `clean_sales_data()` function, which performs all the cleaning steps on a DataFrame.
- Run the Script: From the root directory of the project (`Coding-For-MBA`), run the script. It will load the messy CSV, pass it to the cleaning function, and print the results.

  `python Day_25_Data_Cleaning/data_cleaning.py`

- Run the Tests: The tests use a sample messy DataFrame created in memory to verify that the entire cleaning pipeline works as expected.

  `pytest tests/test_day_25.py`
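As a rough illustration of what such an in-memory test could look like (the actual `tests/test_day_25.py` may differ; the import path below is an assumption):

```python
import pandas as pd

# Import path is an assumption; adjust to match the project layout.
from Day_25_Data_Cleaning.data_cleaning import clean_sales_data


def test_clean_sales_data_pipeline():
    messy = pd.DataFrame(
        {
            "Order ID": [101, 101, 102],
            "Order Date": ["2023-01-05", "2023-01-05", "2023-01-06"],
            "Price": ["$1,200.50", "$1,200.50", "$99.99"],
            "Product": ["Laptop", "Laptop", "MOUSE"],
            "Region": [" USA ", " USA ", "United States"],
        }
    )

    cleaned = clean_sales_data(messy)

    # The fully duplicated row (and duplicate Order ID) is removed
    assert len(cleaned) == 2
    # Data types are corrected
    assert pd.api.types.is_datetime64_any_dtype(cleaned["Order Date"])
    assert pd.api.types.is_float_dtype(cleaned["Price"])
    # Text columns are standardized
    assert set(cleaned["Region"]) == {"united states"}
    assert set(cleaned["Product"]) == {"laptop", "mouse"}
```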
Exercises: Day 25
For these exercises, you will use the provided `messy_sales_data.csv` file.
- Load and Clean:
  - In a new script (`my_solutions_25.py`), import `pandas` and the `clean_sales_data` function from the lesson script (a sketch of a possible solution follows this list).
  - Load the `messy_sales_data.csv` file into a DataFrame.
  - Pass your DataFrame to the `clean_sales_data` function to get a cleaned version.
- Verify the Cleaning:
  - On your new `cleaned_df`, perform the following checks and print the results:
    - Use `.info()` to confirm that 'Order Date' is a datetime and 'Price' is a float.
    - Print the unique values of the 'Product' column (`cleaned_df['Product'].unique()`) to confirm they are all lowercase.
    - Check the shape of the original DataFrame versus the cleaned one to see how many rows were removed.
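A minimal sketch of what `my_solutions_25.py` could look like, assuming the project layout described above (the import path and CSV location are assumptions):

```python
"""my_solutions_25.py -- sketch of a possible Day 25 solution."""

from pathlib import Path

import pandas as pd

# Import path and CSV location are assumptions based on the lesson layout.
from Day_25_Data_Cleaning.data_cleaning import clean_sales_data

data_path = Path("Day_25_Data_Cleaning") / "messy_sales_data.csv"
df = pd.read_csv(data_path)

# Clean the data with the lesson's reusable function
cleaned_df = clean_sales_data(df)

# Verify the cleaning
cleaned_df.info()  # 'Order Date' should be datetime64[ns], 'Price' should be float64
print(cleaned_df["Product"].unique())  # all values should be lowercase
print(f"Rows before: {df.shape[0]}, rows after: {cleaned_df.shape[0]}")
```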
Incredible work! Being able to take a messy, real-world dataset and turn it into a clean, analysis-ready format is arguably the most valuable skill a data analyst can possess.
Previous: Day 24: Advanced Pandas - Working with Real Data • Next: Day 26: Practical Statistics for Business Analysis
You are on lesson 25 of 108.
Additional Materials
- data_cleaning.ipynb - View on GitHub, Run in Google Colab, or Run in Binder
- solutions.ipynb - View on GitHub, Run in Google Colab, or Run in Binder
data_cleaning.py
"""
Day 25: Data Cleaning in Practice (Optimized)
This script demonstrates common data cleaning techniques on a
messy, real-world-style dataset using Pandas. This version includes
performance optimizations.
"""
from pathlib import Path
import pandas as pd
def clean_sales_data(df: pd.DataFrame) -> pd.DataFrame:
"""
Cleans the sales data by correcting data types, standardizing text,
and removing duplicates.
"""
df_clean = df.copy()
# --- 1. Correcting Data Types ---
df_clean["Order Date"] = pd.to_datetime(df_clean["Order Date"])
# Optimized price cleaning using a single regex
df_clean["Price"] = (
df_clean["Price"].str.replace(r"[$,]", "", regex=True).astype(float)
)
# --- 2. Cleaning and Standardizing Text Data ---
df_clean["Region"] = df_clean["Region"].str.strip().str.lower()
df_clean["Product"] = df_clean["Product"].str.lower()
df_clean["Region"] = df_clean["Region"].replace({"usa": "united states"})
# --- 3. Handling Duplicates ---
df_clean.drop_duplicates(inplace=True)
df_clean.drop_duplicates(subset=["Order ID"], keep="first", inplace=True)
return df_clean
def main(): # pragma: no cover
"""
Main function to load, clean, and inspect the data.
"""
# --- Load the Messy Data ---
resource_dir = Path(__file__).resolve().parent
data_path = resource_dir / "messy_sales_data.csv"
print("--- Loading and Inspecting Messy Data ---")
try:
df = pd.read_csv(data_path)
print("Original data types (df.info()):")
df.info()
print("\nOriginal data head:")
print(df.head())
except FileNotFoundError:
print(
"Error: messy_sales_data.csv not found in the Day_25_Data_Cleaning folder."
)
return
# --- Clean the Data ---
    df_cleaned = clean_sales_data(df)  # clean_sales_data works on its own copy, so df is left untouched
# --- Inspect Cleaned Data ---
print("\n--- Inspecting Cleaned Data ---")
print("\nCleaned data types (df.info()):")
df_cleaned.info()
print("\nCleaned data head:")
print(df_cleaned.head())
print("\nUnique values in 'Region' column:", df_cleaned["Region"].unique())
if __name__ == "__main__":
main()
solutions.py
"""
Day 25: Solutions to Exercises
"""
from pathlib import Path
import pandas as pd
# --- Exercise 1: Load and Initial Clean ---
print("--- Solution to Exercise 1 ---")
resource_dir = Path(__file__).resolve().parent
data_path = resource_dir / "messy_sales_data.csv"
try:
# Load the data
df = pd.read_csv(data_path)
print("Original DataFrame info:")
df.info()
# Convert 'Order Date' to datetime
df["Order Date"] = pd.to_datetime(df["Order Date"])
print("\n'Order Date' column converted to datetime.")
# Clean and convert 'Price' to float
df["Price"] = df["Price"].str.replace("$", "").str.replace(",", "").astype(float)
print("'Price' column cleaned and converted to float.")
# Clean 'Region' column whitespace
df["Region"] = df["Region"].str.strip()
print("'Region' column whitespace stripped.")
print("\nDataFrame info after initial cleaning:")
df.info()
except FileNotFoundError:
print(
"Error: messy_sales_data.csv not found in the Day_25_Data_Cleaning folder."
" Keep the CSV beside this script."
)
df = pd.DataFrame()
print("-" * 20)
# --- Exercise 2: Standardize Categories ---
print("--- Solution to Exercise 2 ---")
if not df.empty:
# Standardize 'Product' column to lowercase
df["Product"] = df["Product"].str.lower()
print("'Product' column standardized to lowercase.")
print(f"Unique product values: {df['Product'].unique()}")
# Standardize 'Region' column to 'USA'
df["Region"] = df["Region"].replace({"United States": "USA"})
print("'Region' column standardized to 'USA'.")
print(f"Unique region values: {df['Region'].unique()}")
else:
print("DataFrame not available for this exercise.")
print("-" * 20)
# --- Exercise 3: Handle Duplicates ---
print("--- Solution to Exercise 3 ---")
if not df.empty:
# Check for and count fully duplicate rows
num_duplicates = df.duplicated().sum()
print(f"Number of fully duplicate rows found: {num_duplicates}")
# Create df_cleaned by removing full duplicates
df_cleaned = df.drop_duplicates()
print(f"Shape of original df: {df.shape}")
print(f"Shape after dropping duplicates (df_cleaned): {df_cleaned.shape}")
# Check for duplicate Order IDs
num_duplicate_ids = df_cleaned.duplicated(subset=["Order ID"]).sum()
print(f"\nNumber of duplicate Order IDs found: {num_duplicate_ids}")
# Create df_final by removing duplicate Order IDs
df_final = df_cleaned.drop_duplicates(subset=["Order ID"], keep="first")
print(f"Shape after dropping duplicate Order IDs (df_final): {df_final.shape}")
print("\nFinal cleaned DataFrame head:")
print(df_final.head())
else:
print("DataFrame not available for this exercise.")
print("-" * 20)