Handling Missing Values in Machine Learning with Python

Handling missing values is a crucial step in preparing data for machine learning. This tutorial provides examples of how to manage missing values using Python, focusing on the Pandas library. We'll import the necessary libraries, read the data, and explore various methods to handle missing values.

Machine Learning Handling Missing Values

You can check the full code in the Jupyter Notebook

Importing Libraries

We begin by importing the necessary libraries for our data manipulation and analysis tasks.

import numpy as np
import pandas as pd

NumPy: A fundamental package for scientific computing in Python. It provides support for arrays, matrices, and numerous mathematical functions.
Pandas: A powerful data manipulation and analysis library that provides data structures and functions needed to work with structured data seamlessly.

References:

Reading the Data

We read the CSV file containing the NFL play-by-play data.

data = pd.read_csv("./NFLPlayByPlay2009-2017_v4.csv")

You can download the dataset from the Kaggle website

During the import, a warning indicates that some columns have mixed data types. This can be addressed by specifying the dtype option or setting low_memory=False.

Output:

/tmp/ipykernel_23803/1150844578.py:1: DtypeWarning: Columns (25,51) have mixed types. Specify dtype option on import or set low_memory=False.
  data = pd.read_csv("./NFLPlayByPlay2009-2017_v4.csv")

References:

Pandas read_csv Documentation

Inspecting the Data

To get an overview of the data format, we inspect the first few rows of the dataframe.

data.head()

Output:

         Date      GameID  Drive  qtr  down   time  TimeUnder  TimeSecs  PlayTimeDiff  SideofField  ...  yacEPA  Home_WP_pre  Away_WP_pre  Home_WP_post  Away_WP_post  Win_Prob       WPA  airWPA  yacWPA  Season
0  2009-09-10  2009091000      1    1   NaN  15:00         15     3600           0.0          TEN  ...     NaN     0.485675     0.514325     0.546433     0.453567  0.485675  0.060758     NaN     NaN    2009
1  2009-09-10  2009091000      1    1   1.0  14:53         15     3593           7.0          PIT  ...  1.146076     0.546433     0.453567     0.551088     0.448912  0.546433  0.004655 -0.032244  0.036899    2009
2  2009-09-10  2009091000      1    1   2.0  14:16         15     3556          37.0          PIT  ...     NaN     0.551088     0.448912     0.510793     0.489207  0.551088 -0.040295     NaN     NaN    2009
3  2009-09-10  2009091000      1    1   3.0  13:35         14     3515          41.0          PIT  ... -5.031425     0.510793     0.489207     0.461217     0.538783  0.510793 -0.049576  0.106663 -0.156239    2009
4  2009-09-10  2009091000      1    1   4.0  13:27         14     3507           8.0          PIT  ...     NaN     0.461217     0.538783     0.558929     0.441071  0.461217  0.097712     NaN     NaN    2009

Counting Missing Values

We calculate the number of missing values in each column.

missing_values_per_column = data.isnull().sum()
missing_values_per_column[0:10] # taking the first ten columns

Output:

Date                0
GameID              0
Drive               0
qtr                 0
down            61154
time              224
TimeUnder           0
TimeSecs          224
PlayTimeDiff      444
SideofField       528
dtype: int64

Total Missing Values

To understand the proportion of missing data, we calculate the total number of cells and the percentage of missing values.

total_cells = np.product(data.shape)
total_missing = missing_values_per_column.sum()

print('total_missing', total_missing)
print('total_cells', total_cells)
print('percentage missing', (total_missing / total_cells) * 100)

Output:

total_missing 11505187
total_cells 41584176
percentage missing 27.66722370547874

Counting Non-Missing Values

We can also count the non-missing values in each column.

data[:].count()

Output:

Date        407688
GameID      407688
Drive       407688
qtr         407688
down        346534
             ...
Win_Prob    382679
WPA         402147
airWPA      159187
yacWPA      158926
Season      407688
Length: 102, dtype: int64

Removing Rows with Missing Values

While not recommended, one way to handle missing values is to remove rows that contain them.

removed_rows_empty_data = data.dropna()
print(removed_rows_empty_data)

Output:

Empty DataFrame
Columns: [Date, GameID, Drive, qtr, down, time, TimeUnder, TimeSecs, PlayTimeDiff, SideofField, yrdln, yrdline100, ydstogo, ydsnet, GoalToGo, FirstDown, posteam, DefensiveTeam, desc, PlayAttempted, Yards.Gained, sp, Touchdown, ExPointResult, TwoPointConv, DefTwoPoint, Safety, Onsidekick, PuntResult, PlayType, Passer, Passer_ID, PassAttempt, PassOutcome, PassLength, AirYards, YardsAfterCatch, QBHit, PassLocation, InterceptionThrown, Interceptor, Rusher, Rusher_ID, RushAttempt, RunLocation, RunGap, Receiver, Receiver_ID, Reception, ReturnResult, Returner, BlockingPlayer, Tackler1, Tackler2, FieldGoalResult, FieldGoalDistance, Fumble, RecFumbTeam, RecFumbPlayer, Sack, Challenge.Replay, ChalReplayResult, Accepted.Penalty, PenalizedTeam, PenaltyType, PenalizedPlayer, Penalty.Yards, PosTeamScore, DefTeamScore, ScoreDiff, AbsScoreDiff, HomeTeam, AwayTeam, Timeout_Indicator, Timeout_Team, posteam_timeouts_pre, HomeTimeouts_Remaining_Pre, AwayTimeouts_Remaining_Pre, HomeTimeouts_Remaining_Post, AwayTimeouts_Remaining_Post, No_Score_Prob, Opp_Field_Goal_Prob, Opp_Safety_Prob, Opp_Touchdown_Prob, Field_Goal_Prob, Safety_Prob, Touchdown_Prob, ExPoint_Prob, TwoPoint_Prob, ExpPts, EPA, airEPA, yacEPA, Home_WP_pre, Away_WP_pre, Home_WP_post, Away_WP_post, Win_Prob, WPA, airWPA, yacWPA, ...]
Index: []
[0 rows x 102 columns]

Removing Columns with Missing Values

A more common approach is to remove columns that contain missing values.

removed_columns_empty_data = data.dropna(axis=1)
print(removed_columns_empty_data)

Output:

            Date      GameID  Drive  qtr  TimeUnder  ydstogo  ydsnet  PlayAttempted  Yards.Gained  sp  ...  AwayTeam  Timeout_Indicator  posteam_timeouts_pre HomeTimeouts_Remaining_Pre  AwayTimeouts_Remaining_Pre  HomeTimeouts_Remaining_Post  AwayTimeouts_Remaining_Post  ExPoint_Prob  TwoPoint_Prob  Season
0      2009-09-10  2009091000      1    1         15        0       0               1            39   0  ...       TEN                  0                        3                          3                          3                            3                            3           0.0            0.0    2009
1      2009-09-10  2009091000      1    1         15       10       5

 1             5   0  ...       TEN                  0                        3                          3                          3                            3                            3           0.0            0.0    2009
2      2009-09-10  2009091000      1    1         15        5       2               1            -3   0  ...       TEN                  0                        3                          3                          3                            3                            3           0.0            0.0    2009
3      2009-09-10  2009091000      1    1         14        8       2               1             0   0  ...       TEN                  0                        3                          3                          3                            3                            3           0.0            0.0    2009
4      2009-09-10  2009091000      1    1         14        8       2               1             0   0  ...       TEN                  0                        3                          3                          3                            3                            3           0.0            0.0    2009
...           ...         ...    ...  ...        ...      ...     ...             ...           ...  ..  ...       ...                ...                      ...                        ...                          ...                          ...                          ...           ...            ...     ...
407683 2017-12-31  2017123101     29    4          1        0      -4               1             0   0  ...       CIN                  1                        0                          3                          0                            2                            0           0.0            0.0    2017
407684 2017-12-31  2017123101     29    4          1       14      -4               1             0   0  ...       CIN                  0                        2                          2                          0                            2                            0           0.0            0.0    2017
407685 2017-12-31  2017123101     29    4          1       14       9               1            13   0  ...       CIN                  0                        2                          2                          0                            2                            0           0.0            0.0    2017
407686 2017-12-31  2017123101     30    4          1       10      -1               1            -1   0  ...       CIN                  0                        0                          2                          0                            2                            0           0.0            0.0    2017
407687 2017-12-31  2017123101     30    4          0        0      -1               1             0   0  ...       CIN                  0                        0                          2                          0                            2                            0           0.0            0.0    2017
[407688 rows x 37 columns]

We then calculate the impact of this operation by comparing the number of columns before and after.

print("original columns: %d \n" % data.shape[1])
print("cleaned columns: %d \n" % removed_columns_empty_data.shape[1])

Output:

original columns: 102
cleaned columns: 37

Subsetting the Data

To focus on a smaller portion of the dataset, we can create a subset.

subset_nfl_data = data.loc[:, 'EPA':'Season'].head()
subset_nfl_data

Output:

         EPA    airEPA    yacEPA  Home_WP_pre  Away_WP_pre  Home_WP_post  Away_WP_post  Win_Prob       WPA    airWPA    yacWPA  Season
0  2.014474       NaN       NaN     0.485675     0.514325     0.546433     0.453567  0.485675  0.060758       NaN       NaN    2009
1  0.077907 -1.068169  1.146076     0.546433     0.453567     0.551088     0.448912  0.546433  0.004655 -0.032244  0.036899    2009
2 -1.402760       NaN       NaN     0.551088     0.448912     0.510793     0.489207  0.551088 -0.040295       NaN       NaN    2009
3 -1.712583  3.318841 -5.031425     0.510793     0.489207     0.461217     0.538783  0.510793 -0.049576  0.106663 -0.156239    2009
4  2.097796       NaN       NaN     0.461217     0.538783     0.558929     0.441071  0.461217  0.097712       NaN       NaN    2009

Basic Filling of Missing Values

A straightforward method for handling missing values is to fill them with a specific value, such as zero.

filled_basic_data = data.fillna(0)
filled_basic_data.head()

Output:

         Date      GameID  Drive  qtr  down   time  TimeUnder  TimeSecs  PlayTimeDiff  SideofField  ...    yacEPA  Home_WP_pre  Away_WP_pre  Home_WP_post  Away_WP_post  Win_Prob       WPA    airWPA    yacWPA  Season
0  2009-09-10  2009091000      1    1   0.0  15:00         15     3600           0.0          TEN  ...  0.000000     0.485675     0.514325     0.546433     0.453567  0.485675  0.060758  0.000000  0.000000    2009
1  2009-09-10  2009091000      1    1   1.0  14:53         15     3593           7.0          PIT  ...  1.146076     0.546433     0.453567     0.551088     0.448912  0.546433  0.004655 -0.032244  0.036899    2009
2  2009-09-10  2009091000      1    1   2.0  14:16         15     3556          37.0          PIT  ...  0.000000     0.551088     0.448912     0.510793     0.489207  0.551088 -0.040295  0.000000  0.000000    2009
3  2009-09-10  2009091000      1    1   3.0  13:35         14     3515          41.0          PIT  ... -5.031425     0.510793     0.489207     0.461217     0.538783  0.510793 -0.049576  0.106663 -0.156239    2009
4  2009-09-10  2009091000      1    1   4.0  13:27         14     3507           8.0          PIT  ...  0.000000     0.461217     0.538783     0.558929     0.441071  0.461217  0.097712  0.000000  0.000000    2009

Column-Based Filling

Another approach is to fill missing values based on the next valid observation in the column.

column_based_fill = data.bfill(axis=0).fillna(0)
column_based_fill.head()

Output:

         Date      GameID  Drive  qtr  down   time  TimeUnder  TimeSecs  PlayTimeDiff  SideofField  ...    yacEPA  Home_WP_pre  Away_WP_pre  Home_WP_post  Away_WP_post  Win_Prob       WPA    airWPA    yacWPA  Season
0  2009-09-10  2009091000      1    1   1.0  15:00         15     3600           0.0          TEN  ...  1.146076     0.485675     0.514325     0.546433

 0.453567  0.485675  0.060758 -0.032244  0.036899    2009
1  2009-09-10  2009091000      1    1   1.0  14:53         15     3593           7.0          PIT  ...  1.146076     0.546433     0.453567     0.551088      0.448912  0.546433  0.004655 -0.032244  0.036899    2009
2  2009-09-10  2009091000      1    1   2.0  14:16         15     3556          37.0          PIT  ... -5.031425     0.551088     0.448912     0.510793      0.489207  0.551088 -0.040295  0.106663 -0.156239    2009
3  2009-09-10  2009091000      1    1   3.0  13:35         14     3515          41.0          PIT  ... -5.031425     0.510793     0.489207     0.461217      0.538783  0.510793 -0.049576  0.106663 -0.156239    2009
4  2009-09-10  2009091000      1    1   4.0  13:27         14     3507           8.0          PIT  ...  0.163935     0.461217     0.538783     0.558929      0.441071  0.461217  0.097712 -0.010456  0.006029    2009

References:

Pandas Handling Missing Data

These steps provide a comprehensive guide to identifying and handling missing values in a dataset, ensuring the data is ready for analysis and modeling. Each method has its pros and cons, and the choice of method depends on the specific context and requirements of your analysis.

Extra exercise from kaggle

This notebook is an exercise in the Data Cleaning course. You can reference the tutorial at this link.

In this exercise, you'll apply what you learned in the Handling missing values tutorial.

1) Take a first look at the data

Run the next code cell to load in the libraries and dataset you'll use to complete the exercise.

# modules we'll use
import pandas as pd
import numpy as np

# read in all our data
sf_permits = pd.read_csv("./Building_Permits.csv")

# set seed for reproducibility
np.random.seed(0)

/tmp/ipykernel_33/3534875831.py:6: DtypeWarning: Columns (22,32) have mixed types. Specify dtype option on import or set low_memory=False. sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")

You can download the dataset from the Kaggle website.

Use the code cell below to print the first five rows of the sf_permits DataFrame.

# TODO: Your code here!
sf_permits.head()

	Permit Number	Permit Type	Permit Type Definition	Permit Creation Date	Block	Lot	Street Number	Street Number Suffix	Street Name	Street Suffix	...	Existing Construction Type	Existing Construction Type Description	Proposed Construction Type	Proposed Construction Type Description	Site Permit	Supervisor District	Neighborhoods - Analysis Boundaries	Zipcode	Location	Record ID
0	201505065519	4	sign - erect	05/06/2015	0326	023	140	NaN	Ellis	St	...	3.0	constr type 3	NaN	NaN	NaN	3.0	Tenderloin	94102.0	(37.785719256680785, -122.40852313194863)	1380611233945
1	201604195146	4	sign - erect	04/19/2016	0306	007	440	NaN	Geary	St	...	3.0	constr type 3	NaN	NaN	NaN	3.0	Tenderloin	94102.0	(37.78733980600732, -122.41063199757738)	1420164406718
2	201605278609	3	additions alterations or repairs	05/27/2016	0595	203	1647	NaN	Pacific	Av	...	1.0	constr type 1	1.0	constr type 1	NaN	3.0	Russian Hill	94109.0	(37.7946573324287, -122.42232562979227)	1424856504716
3	201611072166	8	otc alterations permit	11/07/2016	0156	011	1230	NaN	Pacific	Av	...	5.0	wood frame (5)	5.0	wood frame (5)	NaN	3.0	Nob Hill	94109.0	(37.79595867909168, -122.41557405519474)	1443574295566
4	201611283529	6	demolitions	11/28/2016	0342	001	950	NaN	Market	St	...	3.0	constr type 3	NaN	NaN	NaN	6.0	Tenderloin	94102.0	(37.78315261897309, -122.40950883997789)	144548169992

5 rows × 43 columns

Correct:

The first five rows of the data does show that several columns have missing values. You can see this in the "Street Number Suffix", "Proposed Construction Type" and "Site Permit" columns, among others.

2) How many missing data points do we have?

What percentage of the values in the dataset are missing? Your answer should be a number between 0 and 100. (If 1/4 of the values in the dataset are missing, the answer is 25.)

# TODO: Your code here!
total_cells = np.product(sf_permits.shape)
missing_values_count = sf_permits.isnull().sum();

total_missing_cells = missing_values_count.sum();

# print("shape", sf_permits.shape)
print('total cells', total_cells)
# print('total missing per column', missing_values_count)
print('total missing cells', total_missing_cells)

percent_missing = (total_missing_cells / total_cells) * 100

print('percent missing', percent_missing)

total cells 8552700 total missing cells 2245941 percent missing 26.26002315058403

Correct

3) Figure out why the data is missing

Look at the columns "Street Number Suffix" and "Zipcode" from the San Francisco Building Permits dataset. Both of these contain missing values.

Which, if either, are missing because they don't exist?
Which, if either, are missing because they weren't recorded?

Once you have an answer, run the code cell below.

Correct:

If a value in the "Street Number Suffix" column is missing, it is likely because it does not exist. If a value in the "Zipcode" column is missing, it was not recorded.

4) Drop missing values: rows

If you removed all of the rows of sf_permits with missing values, how many rows are left?

Note: Do not change the value of sf_permits when checking this.

# TODO: Your code here!
total_rows = sf_permits.shape[0]
total_rows_after_drop= sf_permits.dropna().shape[0]

print(total_rows)
print(total_rows_after_drop) # no rows

198900 0

Correct:

There are no rows remaining in the dataset!

5) Drop missing values: columns

Now try removing all the columns with empty values.

Create a new DataFrame called sf_permits_with_na_dropped that has all of the columns with empty values removed.
How many columns were removed from the original sf_permits DataFrame? Use this number to set the value of the dropped_columns variable below.

sf_permits_with_na_dropped = sf_permits.dropna(axis=1)

dropped_columns = sf_permits.shape[1] - sf_permits_with_na_dropped.shape[1]

6) Fill in missing values automatically

Try replacing all the NaN's in the sf_permits data with the one that comes directly after it and then replacing any remaining NaN's with 0. Set the result to a new DataFrame sf_permits_with_na_imputed.

sf_permits_with_na_imputed = sf_permits.bfill(axis=0).fillna(0)

More practice

Check out this noteboook on handling missing values using scikit-learn's imputer.
Look back at the "Zipcode" column in the sf_permits dataset, which has some missing values. How would you go about figuring out what the actual zipcode of each address should be? (You might try using another dataset. You can search for datasets about San Fransisco on the Datasets listing.)