Pandas and manipulating data frames

Overview

Teaching: 20 min
Exercises: 10 min

Questions

How do I manage tabular data efficiently?

Objectives

Understand how Pandas can be used to store, manipulate and subset tabular data.

Inspecting Datasets

Pandas is a Python library for data manipulation and analysis, providing powerful data structures like DataFrame and Series along with a wide range of functions for tasks such as data cleaning, preparation, and exploration. It is widely used in data science and machine learning workflows for its ease of use and flexibility.

We can now explore the Iris dataset to gain insights into its structure and contents.

Inspecting a Dataset

To understand the structure of the Iris dataset, we can use various methods provided by Pandas:

print(iris_df.head())

print(iris_df.info())

print(iris_df.describe())

Understanding the contents and data types of a data set is important for accurate analysis.

Manipulating DataFrames

Pandas provides powerful functionalities to manipulate DataFrames. Here are some examples:

Adding and Removing Columns

Adding:

iris_df['sepal.ratio'] = iris_df['sepal.length'] / iris_df['sepal.width']

Removing:

iris_df.drop('variety', axis=1, inplace=True)

Adding and removing rows

Adding:

new_row = {'sepal.length': 5.1, 'sepal.width': 3.5, 'petal.length': 1.4, 'petal.width': 0.2,}
iris_df.loc[len(iris_df)] = new_row

Removing:

iris_df.drop(0, inplace=True)
iris_df.reset_index(drop=True, inplace=True)

Subsetting Data

Subsetting allows us to select specific rows or columns based on conditions:

iris_df = pd.read_csv("data/iris.csv") #reset the dataset
# Select rows where petal_length is greater than 5
subset_df = iris_df[iris_df['petal.length'] > 5]

# Select rows where species is 'setosa' and petal_length is less than 1.5
subset_df = iris_df[(iris_df['variety'] == 'Setosa') & (iris_df['petal.length'] < 1.5)]

# Select rows where sepal_length is greater than 5 and species is either 'setosa' or 'versicolor'
subset_df = iris_df[(iris_df['sepal.length'] > 5) & (iris_df['variety'].isin(['Setosa', 'Versicolor']))]

# Select rows where species is 'setosa' or 'versicolor' and petal_length is not equal to 1.5
subset_df = iris_df[(iris_df['variety'].isin(['Setosa', 'Versicolor'])) & (iris_df['petal.length'] != 1.5)]

Applying a function

To apply a function to a DataFrame column in Pandas, you can use the .apply() method.

# Define a custom function
def square_value(x):
    return x ** 2

iris_df['petal.length.squared'] = iris_df['petal.length'].apply(square_value)

Pandas is highly efficient for DataFrame manipulation due to its intuitive syntax and powerful functionalities. It offers a wide range of built-in functions for data selection, filtering, and transformation, making tasks like data cleaning and preprocessing streamlined.

Key Points

Pandas provides powerful data structures like Series and DataFrame for easy manipulation and analysis of structured data.

It offers extensive functionality for data cleaning, selection, transformation, and integration with other libraries like NumPy and Matplotlib.

With Pandas, you can efficiently handle data tasks such as handling missing values, merging datasets, and preparing data for machine learning models.

previous episode

Introduction to Python - theory sessions

next episode