Pandas#
In this lesson we will learn the basics of data manipulation using the Pandas library.
Set up#
[1]:
import numpy as np
import pandas as pd
[2]:
# Set seed for reproducibility
np.random.seed(seed=1234)
Load data#
We’re going to work with the Titanic dataset, which has data on the people who boarded the RMS Titanic in 1912 and whether or not they survived the voyage. It’s a very common and rich dataset, which makes it well suited to exploratory data analysis with Pandas.
Let’s load the data from the CSV file into a Pandas dataframe. The header=0 signifies that the first row (0th index) is a header row which contains the names of each column in our dataset.
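As a minimal sketch of this step, the snippet below reads CSV text with header=0. The location of the real Titanic file is not given here, so a couple of made-up rows are read from memory instead; the values and the csv_text name are purely illustrative.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Titanic CSV file: two made-up rows kept in
# memory so the sketch runs without a file path or URL.
csv_text = """name,sex,age,sibsp,parch,fare,survived
Jane Doe,female,29.0,0,0,211.5,1
John Roe,male,,1,2,7.25,0
"""

# header=0 tells pandas that the first row holds the column names.
df = pd.read_csv(io.StringIO(csv_text), header=0)

# Show the first rows of the dataframe.
print(df.head())
```

With a real file, the io.StringIO(...) argument would be replaced by the file's path or URL.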
These are the different features:
* class: class of travel
* name: full name of the passenger
* sex: gender
* age: numerical age
* sibsp: number of siblings/spouses aboard
* parch: number of parents/children aboard
* ticket: ticket number
* fare: cost of the ticket
* cabin: location of room
* embarked: port that the passenger embarked at (C - Cherbourg, S - Southampton, Q - Queenstown)
* survived: survival indicator (0 - died, 1 - survived)
Exploratory data analysis (EDA)#
Now that we’ve loaded our data, we’re ready to start exploring it to find interesting information.
[3]:
import matplotlib.pyplot as plt
We can use .describe() to extract some standard details about our numerical features.
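A small sketch of what .describe() returns, using a tiny made-up dataframe in place of the Titanic data (the values are illustrative only). By default, non-numeric columns such as sex are left out of the summary when numeric columns are present.

```python
import pandas as pd

# Tiny made-up dataframe standing in for the Titanic data.
df = pd.DataFrame(
    {
        "age": [29.0, 2.0, 30.0, 25.0],
        "fare": [211.5, 151.55, 151.55, 7.25],
        "sex": ["female", "female", "male", "male"],
    }
)

# Count, mean, std, min, quartiles and max for each numeric column.
summary = df.describe()
print(summary)
```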
We can also use .hist() to view the histogram of values for each feature.
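Here is a hedged sketch of plotting histograms, again with made-up values. The Agg backend line is only there so the snippet runs without a display; in a notebook it can be dropped and the figure renders inline. The bins and figsize values are arbitrary choices.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Tiny made-up dataframe standing in for the Titanic data.
df = pd.DataFrame(
    {
        "age": [29.0, 2.0, 30.0, 25.0, 48.0, 63.0],
        "fare": [211.5, 151.55, 151.55, 7.25, 26.55, 77.96],
    }
)

# One histogram per numeric column; returns an array of matplotlib Axes.
axes = df.hist(bins=5, figsize=(8, 3))
plt.tight_layout()
```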
Indexing#
We can use iloc to get rows or columns at particular positions in the dataframe.
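The sketch below shows a few common iloc patterns on a small made-up dataframe. Positions are zero-based, with rows first and columns second.

```python
import pandas as pd

# Tiny made-up dataframe standing in for the Titanic data.
df = pd.DataFrame(
    {
        "name": ["Jane", "John", "Ann"],
        "age": [29.0, 2.0, 30.0],
        "fare": [211.5, 7.25, 151.55],
    }
)

first_row = df.iloc[0]        # first row, returned as a Series
first_two_rows = df.iloc[:2]  # first two rows, returned as a DataFrame
age_column = df.iloc[:, 1]    # every row of the second column
single_value = df.iloc[0, 1]  # value at row 0, column 1

print(first_row)
print(single_value)
```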
Preprocessing#
After exploring, we can clean and preprocess our dataset.
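The specific cleaning steps are not spelled out here, so the following is only an illustrative sketch of a few common ones, run on made-up data: dropping rows with missing values, dropping a free-text column, and mapping a categorical column to integers. Which steps are appropriate depends on the task.

```python
import pandas as pd

# Tiny made-up dataframe with one missing age.
df = pd.DataFrame(
    {
        "name": ["Jane", "John", "Sam"],
        "sex": ["female", "male", "male"],
        "age": [29.0, None, 30.0],
        "fare": [211.5, 7.25, 151.55],
    }
)

# Drop rows that contain any NaN value, then renumber the index.
df = df.dropna().reset_index(drop=True)

# Drop a free-text column that will not be used as a feature.
df = df.drop(["name"], axis=1)

# Map categorical values to integers (this mapping is an arbitrary choice).
df["sex"] = df["sex"].map({"female": 0, "male": 1})

print(df)
```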
Be sure to check out our entire lesson focused on preprocessing in our MLOps course.
Feature engineering#
We’re now going to use feature engineering to create a column called family_size. We’ll first define a function called get_family_size that will determine the family size using the number of parents and siblings.
Once we define the function, we can use a lambda to apply it to each row, passing that row’s numbers of siblings and parents to determine its family size.
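A minimal sketch of this step on made-up rows. The body of get_family_size is an assumption: here family size is taken to be sibsp plus parch, without counting the passenger themselves.

```python
import pandas as pd

# Tiny made-up dataframe with the two columns the feature is built from.
df = pd.DataFrame({"sibsp": [0, 1, 1], "parch": [0, 2, 0]})


def get_family_size(sibsp, parch):
    """Assumed definition: siblings/spouses plus parents/children aboard."""
    return sibsp + parch


# axis=1 passes each row to the lambda, which forwards the two counts.
df["family_size"] = df.apply(
    lambda row: get_family_size(row["sibsp"], row["parch"]), axis=1
)

print(df)
```

For a plain sum like this, df["sibsp"] + df["parch"] gives the same column more quickly; applying a function row by row is more useful when the logic is harder to vectorize.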
Save data#
Finally, let’s save our preprocessed data into a new CSV file to use later.
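A short sketch of saving and reloading a dataframe. The file name processed_titanic.csv is hypothetical, and a temporary directory is used only so the snippet leaves no files behind; in practice the file would go somewhere permanent. Passing index=False keeps the row index from being written as an extra column.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny made-up dataframe standing in for the preprocessed data.
df = pd.DataFrame(
    {"age": [29.0, 30.0], "fare": [211.5, 7.25], "survived": [1, 0]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical output file name.
    out_path = Path(tmp_dir) / "processed_titanic.csv"

    # Write the dataframe without the row index.
    df.to_csv(out_path, index=False)

    # Read it back to check that the columns survived the round trip.
    reloaded = pd.read_csv(out_path, header=0)

print(reloaded.head())
```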