Chapter 7: Wrangling Data

Data wrangling is the process of transforming raw data into a usable form. In this chapter, we will go over data cleaning and filtering techniques to prepare data for use in our dashboards.

What you will learn

  • Common data exploration methods

  • How to clean data

  • How to filter data

  • Additional learning resources

7.1 Exploring Data

One of the most effective ways of exploring a data set is with the pandas library. Pandas is an open-source data analysis and manipulation library written in Python. Before you start, remember to run pip install pandas in the terminal if you haven't done so yet. Let's review a few common pandas functions for data exploration.

Shape

The shape attribute is typically used to learn the dimensions of the data: it returns a (rows, columns) tuple.

import pandas as pd

url = 'https://raw.githubusercontent.com/open-resources/dash_curriculum/main/tutorial/part2/ch7_files/temp_data.csv'
raw_data = pd.read_csv(url)

print(raw_data.head())
print(raw_data.shape)

(Image: shape output)

In the output (110, 3), 110 is the number of rows and 3 is the number of columns in the dataframe.
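Since shape is a tuple, you can also unpack it directly into row and column counts. Here is a minimal sketch using a small made-up dataframe (not the chapter's CSV):

```python
import pandas as pd

# A tiny, made-up dataframe standing in for the chapter's CSV
df = pd.DataFrame({'day': [1, 2, 3, 4],
                   'city': ['a', 'b', 'a', 'b'],
                   'temp': [18.0, 19.5, 21.0, 17.2]})

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)      # 4 3
```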

Info

The pandas info() method returns a summary of the dataframe, including:

  • column names

  • data types

  • count of non-null values per column

  • memory usage

import pandas as pd

url = 'https://raw.githubusercontent.com/open-resources/dash_curriculum/main/tutorial/part2/ch7_files/temp_data.csv'
raw_data = pd.read_csv(url)

print(raw_data.info())

(Image: info output)

You can see from the image above that the temp column has 107 non-null values, which means that 3 values are null. If you look carefully at the CSV file, you will see that 3 rows have no value in the temp column.
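You can also count missing values directly, without reading the info() printout, by summing the boolean mask that isna() produces. A sketch on a made-up frame with one missing temp value:

```python
import pandas as pd

# Made-up data: one missing temp value
df = pd.DataFrame({'temp': [18.0, float('nan'), 21.0, 17.2]})

missing = df.isna().sum()  # per-column count of null values
print(missing['temp'])     # 1
```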

Describe

The pandas describe() method returns summary statistics for the dataframe, such as:

  • min/max

  • unique values

  • count of non-null values in each column

import pandas as pd

url = 'https://raw.githubusercontent.com/open-resources/dash_curriculum/main/tutorial/part2/ch7_files/temp_data.csv'
raw_data = pd.read_csv(url)

print(raw_data.describe())

(Image: describe output)

7.2 Cleaning Data

Dropna

By exploring the dataframe in the section above, we saw that 3 of the temp values are null. Because there are only 3 such rows and temperature values don't vary significantly, let's simply remove the rows with missing values in the temp column. We'll use the dropna() method for this task.

import pandas as pd

url = 'https://raw.githubusercontent.com/open-resources/dash_curriculum/main/tutorial/part2/ch7_files/temp_data.csv'
raw_data = pd.read_csv(url)

print(raw_data.shape)

raw_data.dropna(axis=0, inplace=True)

print(raw_data.shape)

(Image: dropna output)

There are now 3 fewer rows after dropping the rows with null values.
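Note that dropna(axis=0) removes a row if any of its columns is null. If you only want to drop rows whose temp value is missing, the subset parameter restricts the check to that column. A sketch on made-up data:

```python
import pandas as pd

# Made-up data: one row is missing temp, another is missing city
df = pd.DataFrame({'city': ['a', 'b', None, 'd'],
                   'temp': [18.0, float('nan'), 21.0, 17.2]})

any_null = df.dropna(axis=0)            # drops both incomplete rows
temp_null = df.dropna(subset=['temp'])  # drops only the row missing temp
print(any_null.shape, temp_null.shape)  # (2, 2) (3, 2)
```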

Iterrows and Try-Except

There are several non-numeric values that we'd like to drop from the dataframe as well. First, we iterate over each row in the dataframe with iterrows(). Then we use Python's try-except logic to test whether the row's temp value can be converted to a float. If it cannot, we use drop() to remove that row:

import pandas as pd

url = 'https://raw.githubusercontent.com/open-resources/dash_curriculum/main/tutorial/part2/ch7_files/temp_data.csv'
raw_data = pd.read_csv(url)

raw_data.dropna(axis=0, inplace=True)

for index, row in raw_data.iterrows():
    try:
        float(row['temp'])  # can this row's temp value be parsed as a float?
    except ValueError:
        raw_data.drop(index, axis=0, inplace=True)

print(raw_data.head())

(Image: dataframe after dropping non-numeric rows)

Notice from the image above that rows 2 and 5 were dropped. This is because neither of them had a valid float value in the temp column.
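As a side note, pandas also offers a vectorized alternative to iterating row by row: pd.to_numeric with errors='coerce' turns unparseable values into NaN, and a follow-up dropna removes them. A sketch on made-up data:

```python
import pandas as pd

# Made-up data: two values cannot be parsed as floats
df = pd.DataFrame({'temp': ['18.0', 'n/a', '21.0', 'error', '17.2']})

df['temp'] = pd.to_numeric(df['temp'], errors='coerce')  # unparseable -> NaN
df.dropna(subset=['temp'], inplace=True)
print(df['temp'].tolist())  # [18.0, 21.0, 17.2]
```

A bonus of this approach is that the temp column ends up with a float dtype, so no later astype() conversion is needed.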

Reset Index

Given that we've dropped a few rows, the index of the dataframe is no longer consecutive. We'll use the reset_index method to rebuild the index, passing drop=True to discard the old one:

import pandas as pd

url = 'https://raw.githubusercontent.com/open-resources/dash_curriculum/main/tutorial/part2/ch7_files/temp_data.csv'
raw_data = pd.read_csv(url)

raw_data.dropna(axis=0, inplace=True)

for index, row in raw_data.iterrows():
    try:
        float(row['temp'])  # can this row's temp value be parsed as a float?
    except ValueError:
        raw_data.drop(index, axis=0, inplace=True)

print(raw_data.head())
raw_data.reset_index(drop=True, inplace=True)
print(raw_data.head())

(Image: reset_index output)

7.3 Filtering Data

In the next step of the data wrangling process, we will start filtering the data.

Filter by value & astype

Let's filter the data for temperatures over 18.5°C. Notice that we need the pandas astype() method to convert the temp column from its current string type to float before comparing.

import pandas as pd

url = 'https://raw.githubusercontent.com/open-resources/dash_curriculum/main/tutorial/part2/ch7_files/temp_data.csv'
raw_data = pd.read_csv(url)

raw_data.dropna(axis=0, inplace=True)

for index, row in raw_data.iterrows():
    try:
        float(row['temp'])  # can this row's temp value be parsed as a float?
    except ValueError:
        raw_data.drop(index, axis=0, inplace=True)

raw_data.reset_index(drop=True, inplace=True)

print(raw_data.describe())
fltr_df = raw_data[raw_data['temp'].astype(float) > 18.5]
print(fltr_df.describe())

(Image: filter_by_value output)

The result is a new dataframe (fltr_df) with 73 rows, where every temp value is greater than 18.5.

Multiple logical operators

You might have noticed that the temp column has one row with the value 300.2. This must be a mistake, because air temperatures on Earth do not reach that level. Therefore, we should remove this row so it doesn't contaminate the data. We can use logical operators, such as & (AND) and | (OR), to apply multiple conditions to the temp column.

import pandas as pd

url = 'https://raw.githubusercontent.com/open-resources/dash_curriculum/main/tutorial/part2/ch7_files/temp_data.csv'
raw_data = pd.read_csv(url)

raw_data.dropna(axis=0, inplace=True)

for index, row in raw_data.iterrows():
    try:
        float(row['temp'])  # can this row's temp value be parsed as a float?
    except ValueError:
        raw_data.drop(index, axis=0, inplace=True)

raw_data.reset_index(drop=True, inplace=True)

temps = raw_data['temp'].astype(float)  # convert once, then reuse
fltr_df = raw_data[(temps > 18.5) & (temps < 50)]
print(fltr_df.describe())

(Image: output with multiple logical operators)
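The same pair of conditions can also be expressed with the Series.between method, which reads a little more clearly. A sketch on made-up temperatures; note that between is inclusive of both bounds by default, so we pass inclusive='neither' (a string option available in pandas 1.3+) to match the strict > and < comparisons:

```python
import pandas as pd

# Made-up temperatures, including one impossible reading
temps = pd.Series([17.0, 19.2, 300.2, 22.5, 18.5])

mask = temps.between(18.5, 50, inclusive='neither')  # strict bounds, like > and <
print(temps[mask].tolist())  # [19.2, 22.5]
```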

iloc

The pandas iloc indexer is used to filter data by integer position (to filter by index label, use loc instead). For example, if we wanted to limit the dataframe to the first 20 rows, we could do the following:

import pandas as pd

url = 'https://raw.githubusercontent.com/open-resources/dash_curriculum/main/tutorial/part2/ch7_files/temp_data.csv'
raw_data = pd.read_csv(url)

raw_data.dropna(axis=0, inplace=True)

for index, row in raw_data.iterrows():
    try:
        float(row['temp'])  # can this row's temp value be parsed as a float?
    except ValueError:
        raw_data.drop(index, axis=0, inplace=True)

raw_data.reset_index(drop=True, inplace=True)

temps = raw_data['temp'].astype(float)  # convert once, then reuse
fltr_df = raw_data[(temps > 18.5) & (temps < 50)]

first_20 = fltr_df.iloc[:20]
print(first_20.shape)

(Image: shape of the first 20 rows)
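For the common case of taking the first n rows, iloc[:n] and head(n) are interchangeable; iloc is the more general tool, since it accepts any positional slice. A sketch on a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({'temp': range(100)})  # made-up dataframe with 100 rows

first_20 = df.iloc[:20]  # rows at positions 0 through 19
same_20 = df.head(20)    # equivalent for a leading slice
middle = df.iloc[40:45]  # iloc also takes arbitrary positional ranges
print(first_20.shape, middle.shape)  # (20, 1) (5, 1)
```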

7.4 Other Resources

The pandas features we chose to highlight in this chapter should be enough to help you start building Dash apps. Nonetheless, if you would like to explore the many other features pandas has to offer, the official pandas documentation and user guide are good places to start.

Summary

In this chapter we learned to explore, clean, and filter data to prepare it for use in a Dash app. In the next chapter, we'll dive into data visualization with the Plotly Express library.