Strategies for Handling Missing Values in Data Cleaning

Introduction

Finding and handling missing values is the first step in the data-cleaning process, and we perform it with the help of Python libraries, methods, and functions. If we do not find any missing values while checking the dataset, we proceed to the next step of data cleaning; if we do find missing values, we choose a technique to handle them according to the dataset.

Types of Missing Values

Based on the relationship between the missing values and the other values present in the dataset, missing values are divided into three types.


MCAR

It stands for Missing Completely At Random. In this type, the fact that a value is missing has no relationship with any other data in the dataset, so the missing values do not bias the analysis. E.g.: If Adarsh is a college student but his class coordinator simply forgot to enter his details in the student file, the analysis of the student data is not systematically affected by Adarsh's missing record.

MAR

It stands for Missing At Random. In this type, there is a strong relationship between the missing values and the other observed values in the dataset, so we can estimate the missing values from those observed values. E.g.: In college, 75% attendance is mandatory to sit the exam, so if Adarsh's best friend Prabhat has no seat in the exam hall, we can infer that Prabhat's attendance is below 75%.

MNAR

It stands for Missing Not At Random. In this type, the reason a value is missing is related to the value itself and cannot be fully explained by the other observed data. E.g.: If the rule of Adarsh's college is that regular students get full marks in the practical, and Adarsh's practical mark is missing from the student marks dataset, we can only estimate it from his regularity.
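
To make the three types concrete, below is a minimal sketch that simulates each pattern on a small student table. The table, the column names, and the values are invented purely for illustration and are not part of any real dataset.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A small, made-up student table used only to illustrate the three patterns
students = pd.DataFrame({
    'name': ['Adarsh', 'Prabhat', 'Ravi', 'Neha', 'Pooja', 'Amit'],
    'attendance_pct': [82, 68, 91, 74, 88, 60],
    'practical_marks': [20.0, 18.0, 20.0, 17.0, 19.0, 15.0],
})

# MCAR: values go missing purely at random, unrelated to any data
mcar = students.copy()
mcar.loc[rng.choice(len(mcar), size=2, replace=False), 'practical_marks'] = np.nan

# MAR: missingness depends on another observed column
# (marks are missing whenever attendance is below 75%)
mar = students.copy()
mar.loc[mar['attendance_pct'] < 75, 'practical_marks'] = np.nan

# MNAR: missingness depends on the value itself
# (low practical marks tend not to be entered at all)
mnar = students.copy()
mnar.loc[students['practical_marks'] < 18, 'practical_marks'] = np.nan

print(mcar, mar, mnar, sep='\n\n')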

How to check missing values in a dataset?

We have many ways to check for null values in a dataset, but we need to import some libraries before using them.

Firstly we will import the pandas library.

import pandas as pd

Now we will load our dataset.

datasetname = pd.read_csv(r"path\of\dataset")
datasetname.head()  # it will return the top five rows of the dataset

Now the missing values will be denoted as NaN.

datasetname.isnull()       # datasetname.notnull() returns the opposite

It will return a table of boolean values, where True represents a missing value.

datasetname.isnull().sum() 

It will return the count of missing values for each column.

datasetname.isnull().sum().sum()
# It will return the total number of missing values in the dataset
datasetname.shape
# It will return the shape of the dataset, i.e. (No. of rows, No. of columns)
(datasetname.isnull().sum()/datasetname.shape[0])*100
# It will return the percentage of data missing in each column

We can also check the missing values in the dataset through a graph or plot after importing the Seaborn and Matplotlib libraries.

import seaborn as sns
import matplotlib.pyplot as plt 
sns.heatmap(datasetname.isnull())
plt.show()

It will return a plot in which missing values are shown in a lighter color and available values in a darker color (with the default colormap).

Techniques for handling missing values based on their behavior

Which technique we use depends on the characteristics of the column, such as whether it contains numeric values or categorical data.

  1. Categorical missing values: These are easy to handle because we replace the missing values with the most repeated categorical value in the column. For example, if the technology preferred by Lekhika is not recorded but two of the other three students prefer Python, we fill the missing value with Python (see the sketch after this list).
  2. Numerical missing values: Handling numeric missing values is a little trickier because it depends on which kind of numeric value is missing. Generally, we replace it with a constant value or with the mean or median of the column, and sometimes we try to predict it from the other data in the dataset.
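
As a minimal sketch of both ideas, the example below fills a categorical column with its most frequent value (the mode) and a numeric column with its mean. The table, the student names, and the values are assumptions invented for illustration, loosely following the Lekhika example above.

import numpy as np
import pandas as pd

# Invented example: Lekhika's preferred technology and one experience value are missing
students = pd.DataFrame({
    'Name': ['Aman', 'Hardik', 'Lekhika', 'Prabhat'],
    'Technology': ['Python', 'Java', np.nan, 'Python'],
    'Experience_Years': [2.0, np.nan, 1.0, 3.0],
})

# Categorical column: replace NaN with the most repeated value in the column
most_common_tech = students['Technology'].mode()[0]   # 'Python' in this table
students['Technology'] = students['Technology'].fillna(most_common_tech)

# Numeric column: replace NaN with the mean (the median works the same way)
students['Experience_Years'] = students['Experience_Years'].fillna(
    students['Experience_Years'].mean()
)

print(students)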

Example: To understand this in practice, we will take a dataset and walk through the process of finding and handling missing values during data cleaning.

Firstly, we will import the Python libraries.

import pandas as pd
import numpy as np

Now, we will create a dataset that contains several types of data issues, because we will discuss them in upcoming articles.

data = {
    'Name': ['John', 'Jane', 'Alice', 'Bob', 'Chris', np.nan] * 1000,
    'Age': [35, 28, 42, 50, 45, np.nan] * 1000,
    'Salary': [60000, 75000, 90000, 120000, 200000, 1500000] * 1000,
    'Department_ID': [101, 102, 103, 104, 105, 106] * 1000
}

Convert this data into a DataFrame.

df = pd.DataFrame(data)
df.head()


To find the missing values, we can use .isnull() or .notnull(), but that returns a boolean table, and we would still have to count all the missing values ourselves. To avoid this, we use .isnull().sum() or .notnull().sum().

df.isnull().sum()


If you want to know the percentage of missing values, we can compute it as follows.

null_percent = df.isnull().sum()/df.shape[0]*100
null_percent


Here we find that only two columns, Name and Age, have missing values. Firstly, we handle the Name column. Since Name holds categorical values and 1000 names are missing, filling them all with the most frequent name would not be a good approach, so we fill them with the value "Unknown" instead.

df['Name'] = df['Name'].fillna('Unknown')

Now, the Age column also has 1000 missing values. Since age is numeric, we can fill the missing entries with the mean or median of the available ages; here we use the median because it is less affected by outliers.

df['Age'] = df['Age'].fillna(df['Age'].median())

We have successfully handled all the missing values; to verify this, we can check again with:

df.isnull().sum()


Conclusion

Handling missing values is essential in data cleaning to ensure dataset integrity. We classify missing values as MCAR, MAR, or MNAR based on their relationship with the other data. Techniques include replacing missing categorical values with the most common category and numerical values with constants, means, medians, or predictions from other data. Overall, addressing missing values ensures reliable analysis and insights.

