SQL Data Cleaning Project: Layoffs Data

Welcome to my SQL data cleaning project! In this project, I tackled a dataset of 2,361 records detailing layoffs across various industries. Using MySQL, I imported, cleaned, and standardized the data to ensure its usability and accuracy. Below, you can read about the challenges I faced, the steps I took, and the tools and techniques I used along the way.

Objective

Clean and standardize a dataset of 2,361 records detailing layoffs across various industries to ensure data quality and usability.

Challenges

Data Import Issues: Initially, MySQL imported only 540 of the 2,361 records from the .csv file. After troubleshooting, I resolved this by converting the file to .json format, which allowed all records to be imported successfully.
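A quick sanity check after the import confirms the full record count arrived; the staging-table name layoffs_staging is an assumption for illustration:

```sql
-- Should return 2361 once the .json import succeeds
-- (layoffs_staging is a hypothetical table name).
SELECT COUNT(*) FROM layoffs_staging;
```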

Steps Taken

  1. Data Import: Imported the converted .json file, successfully loading all 2,361 records into the MySQL database.

  2. Data Cleaning:
    • Removing Duplicates: Identified and deleted duplicate records using ROW_NUMBER() with a PARTITION BY clause (see the deduplication sketch after this list).
    • Standardizing Data (sketched after this list):
      • Trimmed unnecessary spaces from text fields.
      • Standardized industry names (e.g., consolidated variations of "Crypto").
      • Reformatted date values to a consistent DATE format using STR_TO_DATE.
    • Handling Missing/Null Values (see the null-handling sketch after this list):
      • Identified and replaced missing or blank values where possible by cross-referencing other entries.
      • Deleted records where critical fields (e.g., total_laid_off, percentage_laid_off) were missing.
    • Column Optimization: Dropped helper columns once they were no longer needed (e.g., row_num after deduplication was complete).
  3. Validation: Verified data consistency and integrity through targeted spot-check queries (examples after this list).
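Below is a minimal sketch of the deduplication step. The staging-table names (layoffs_staging, layoffs_staging2) and any PARTITION BY columns beyond industry, total_laid_off, and percentage_laid_off are illustrative assumptions, not the exact schema:

```sql
-- Stage a copy of the data with a row number per duplicate group
-- (table name and most column names are assumptions).
CREATE TABLE layoffs_staging2 AS
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, location, industry, total_laid_off,
                        percentage_laid_off, `date`, stage, country
       ) AS row_num
FROM layoffs_staging;

-- Any row numbered above 1 duplicates an earlier row.
DELETE FROM layoffs_staging2
WHERE row_num > 1;

-- Drop the helper column once deduplication is complete.
ALTER TABLE layoffs_staging2
DROP COLUMN row_num;
```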
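The standardization updates followed this pattern; a sketch assuming the staging table above and a month/day/year source format for the text dates (the actual format string is an assumption):

```sql
-- Trim stray whitespace from text fields.
UPDATE layoffs_staging2
SET company = TRIM(company);

-- Consolidate variations of "Crypto" into a single label.
UPDATE layoffs_staging2
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';

-- Parse the text dates (assumed m/d/Y), then retype the column as DATE.
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

ALTER TABLE layoffs_staging2
MODIFY COLUMN `date` DATE;
```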
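Missing values were filled by cross-referencing other rows; a sketch of that self-join, assuming industry is the gap being filled and company is the cross-reference key:

```sql
-- Normalize empty strings to NULL so the join can target them.
UPDATE layoffs_staging2
SET industry = NULL
WHERE industry = '';

-- Copy a known industry from another row for the same company.
UPDATE layoffs_staging2 t1
JOIN layoffs_staging2 t2
    ON t1.company = t2.company
SET t1.industry = t2.industry
WHERE t1.industry IS NULL
  AND t2.industry IS NOT NULL;

-- Remove rows with no usable layoff figures (requiring both fields
-- to be NULL is an assumption).
DELETE FROM layoffs_staging2
WHERE total_laid_off IS NULL
  AND percentage_laid_off IS NULL;
```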
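And a few hypothetical spot checks of the kind used for validation:

```sql
-- Row count should equal 2,361 minus the rows deliberately removed.
SELECT COUNT(*) FROM layoffs_staging2;

-- No stray industry variants should remain.
SELECT DISTINCT industry FROM layoffs_staging2 ORDER BY industry;

-- No leading/trailing whitespace should survive the TRIM pass (expect no rows).
SELECT * FROM layoffs_staging2
WHERE company <> TRIM(company);
```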

Tools Used

  • Database: MySQL
  • File Formats: .csv and .json

Key SQL Techniques

  • ROW_NUMBER() with PARTITION BY for deduplication.
  • String functions and pattern matching (TRIM, LIKE) for standardization.
  • STR_TO_DATE for date conversion.
  • Joins and subqueries for filling missing values.

Results

Cleaned and standardized the 2,361-record dataset, leaving it ready for further analysis and visualization.