Build a Data Cleaning Pipeline
Problem Statement
Write a Python script that reads a messy CSV file `raw_employees.csv` and produces a cleaned version `clean_employees.csv`.

The raw data has these problems:
- Some rows have missing salary (empty string)
- Email column has inconsistent casing (ALICE@gmail.com, bob@GMAIL.COM)
- Some names have extra whitespace (' Alice ')
- Salary column has a currency symbol ('₹90,000' instead of 90000)

Your pipeline should:
1. Remove rows with missing salary
2. Lowercase all emails
3. Strip whitespace from names
4. Convert salary to integer (remove ₹ and commas)
5. Save the cleaned data to `clean_employees.csv`
6. Print a summary: total rows read, rows dropped, rows saved
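One possible shape for the six pipeline steps, as a minimal sketch and not a reference solution. It uses only the standard-library `csv` module. The helper names `clean_salary`, `clean_row`, and `run_pipeline` are hypothetical. The `quotechar="'"` default is an assumption that the single quotes shown around the sample fields are literally present in the file.

```python
import csv


def clean_salary(raw):
    """Turn a string such as '₹90,000' into the integer 90000 (step 4)."""
    return int(raw.replace("₹", "").replace(",", "").strip())


def clean_row(row):
    """Return a cleaned copy of one row, or None if its salary is missing."""
    salary = (row.get("salary") or "").strip()
    if not salary:
        return None  # step 1: the caller drops this row
    cleaned = dict(row)
    cleaned["email"] = row["email"].lower()   # step 2
    cleaned["name"] = row["name"].strip()     # step 3
    cleaned["salary"] = clean_salary(salary)  # step 4
    return cleaned


def run_pipeline(src="raw_employees.csv", dst="clean_employees.csv", quotechar="'"):
    """Read src, clean every row, write dst, and print the summary."""
    # Assumption: fields in src are wrapped in single quotes, as in the sample.
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, quotechar=quotechar)
        fieldnames = reader.fieldnames
        rows = list(reader)

    cleaned = [c for c in map(clean_row, rows) if c is not None]

    # Step 5: write with the default dialect, so fields are quoted only when needed.
    with open(dst, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(cleaned)

    # Step 6: report the counts.
    summary = {
        "read": len(rows),
        "dropped": len(rows) - len(cleaned),
        "saved": len(cleaned),
    }
    print(f"Rows read: {summary['read']}")
    print(f"Rows dropped: {summary['dropped']}")
    print(f"Rows saved: {summary['saved']}")
    return summary
```

Calling `run_pipeline()` from the directory that contains `raw_employees.csv` runs all six steps. Keeping the per-row logic in `clean_row` separates it from file I/O, so it can be checked on plain dictionaries. The sketch does not handle an empty input file or a salary that is not numeric once the symbol and commas are removed.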
Sample Data
raw_employees.csv:

name,email,department,salary
' Alice ','ALICE@gmail.com','Engineering','₹90,000'
'Bob','bob@GMAIL.COM','Marketing',''
' Carol','carol@x.com','Engineering','₹85,000'
'Dave','dave@x.com','HR','₹70,000'
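A note on parsing the sample: the salary values contain a comma, so how the fields are quoted matters. The short sketch below assumes the single quotes are literally in the file (the assignment does not say so explicitly) and compares the default `csv.reader` dialect with `quotechar="'"`.

```python
import csv
import io

# Header plus the first data row from the sample, held in memory.
sample = (
    "name,email,department,salary\n"
    "' Alice ','ALICE@gmail.com','Engineering','₹90,000'\n"
)

# Default dialect: the quote character is '"', so single quotes are ordinary
# characters and the comma inside the salary splits the field in two.
default_rows = list(csv.reader(io.StringIO(sample)))

# Treating "'" as the quote character keeps '₹90,000' as one field
# and removes the surrounding quotes.
quoted_rows = list(csv.reader(io.StringIO(sample), quotechar="'"))

print(len(default_rows[1]))  # 5 fields: the salary was split
print(quoted_rows[1])        # 4 fields, quotes removed
```

If the real file uses standard double quotes instead, the default dialect is the right choice and no `quotechar` argument is needed.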
Expected Output
clean_employees.csv:

name,email,department,salary
Alice,alice@gmail.com,Engineering,90000
Carol,carol@x.com,Engineering,85000
Dave,dave@x.com,HR,70000

Summary printed:

Rows read: 4
Rows dropped: 1
Rows saved: 3