Introduction to Pandas
Created by: Nika Gagua
Data Manipulation with Pandas I
• Introduction to Pandas
• Series
• DataFrame
• Reading CSV Files
• Basic Data Exploration
What is Pandas?
• Pandas is a Python library that provides fast, flexible, and expressive data structures like Series and DataFrame.
• It is widely used for data manipulation and analysis in Python.
• Pandas is essential for efficiently handling and analyzing structured data.
By providing versatile data structures such as Series and DataFrame, Pandas offers significant benefits.
How We Use Pandas:
• Data Cleaning and Preparation: Pandas allows you to handle messy, missing, or irregular data easily, using methods like dropna() (to remove missing data) or fillna() (to replace missing values).
• Data Transformation: You can reshape and manipulate data, such as filtering and selecting specific data points, pivoting, and transposing data.
• Data Exploration and Analysis: Pandas provides easy-to-use methods to explore datasets using functions like .head(), .info(), and .describe() to quickly understand the dataset structure, get summary statistics, and check for null values.
• Data Input and Output: It integrates with many file formats (CSV, Excel, JSON, SQL databases) for importing and exporting data.
• Efficient Data Computations: Vectorized operations allow fast mathematical computations and transformations across entire DataFrames and Series without writing for loops.
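As a minimal sketch of the cleaning and vectorized-computation points above, the example below uses a small made-up Series of temperature readings (the values and labels are illustrative assumptions, not data from this course).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing reading.
temps = pd.Series([21.5, np.nan, 19.0, 23.5],
                  index=["Mon", "Tue", "Wed", "Thu"])

dropped = temps.dropna()               # remove the missing entry
filled = temps.fillna(temps.mean())    # replace it with the mean of the others

# Vectorized arithmetic: convert every value to Fahrenheit without a loop.
fahrenheit = filled * 9 / 5 + 32

print(dropped)
print(filled)
print(fahrenheit)
```

Note that mean() skips missing values by default, so the fill value here is the mean of the three valid readings.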
Benefits We Get From Pandas:
• Improved Productivity: Pandas simplifies common data manipulation tasks with concise, readable syntax, saving time and effort.
• Efficiency: Built on top of NumPy, Pandas supports fast, memory-efficient computations, even with large datasets.
• Ease of Use: Pandas provides intuitive and flexible tools to work with data, making complex tasks simple to implement with minimal code.
• Powerful Data Manipulation: Pandas supports complex data manipulation tasks (e.g., merging, joining, group operations) that would otherwise require manual effort.
• Integration with Other Libraries: Since Pandas is a core part of the Python data ecosystem, it integrates well with other libraries like Matplotlib (for plotting), Seaborn (for statistical visualization), and scikit-learn (for machine learning).
What is a Series?
• A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).
• The axis labels are collectively called the index. A Pandas Series is comparable to a single column in an Excel sheet.
• Labels need not be unique but must be of a hashable type. The object supports both integer and label-based indexing and provides a host of methods for performing operations involving the index.
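The point that labels need not be unique can be illustrated with a short sketch. The subject names and scores below are made up for the example.

```python
import pandas as pd

# Hypothetical scores where the label "math" appears twice.
scores = pd.Series([90, 85, 78], index=["math", "art", "math"])

math_scores = scores.loc["math"]   # duplicated label: returns a Series with both values
art_score = scores.loc["art"]      # unique label: returns a single value

print(scores.index.is_unique)
print(math_scores)
print(art_score)
```

A lookup on a duplicated label returns every matching entry, so the result type depends on whether the label is unique.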
Python Pandas Series
In the real world, a Pandas Series is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file.
A Pandas Series can also be created from a list, a dictionary, a scalar value, etc.
Creating a Series from an array: To create a Series from an array, we import the numpy module and use its array() function.
import pandas as pd
import numpy as np
# simple array
>>> data = np.array(['p', 'a', 'n', 'd', 'a', 's'])
>>> ser = pd.Series(data)
>>> print(ser)
Output:
0    p
1    a
2    n
3    d
4    a
5    s
dtype: object
Creating a Series from a list:
To create a Series from a list, we first create the list and then pass it to pd.Series().
import pandas as pd
# a simple list (named letters so that it does not shadow the built-in list)
>>> letters = ['p', 'a', 'n', 'd', 'a', 's']
# create a Series from the list
>>> ser = pd.Series(letters)
>>> print(ser)
Output:
0    p
1    a
2    n
3    d
4    a
5    s
dtype: object
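The dictionary and scalar cases mentioned earlier can be sketched in the same way. The keys and values below are illustrative only.

```python
import pandas as pd

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From a scalar: the value is repeated for every label in the given index.
from_scalar = pd.Series(5, index=["x", "y", "z"])

print(from_dict)
print(from_scalar)
```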
Accessing Elements of a Series
There are two ways to access an element of a Series:
• Accessing an element by position
• Accessing an element by label (index)
Accessing an Element by Position: To access a Series element by position, refer to its position number, which starts at 0.
Use the index operator [ ] to access elements in a Series.
The position must be an integer. To access multiple elements from a Series, we use a slice operation.
import pandas as pd
import numpy as np
>>> data = np.array(['k', 'u', 'n', 'g', 'f', 'u', 'p', 'a', 'n', 'd', 'a'])
>>> ser = pd.Series(data)
# retrieve the first six elements
>>> print(ser[:6])
Output:
0    k
1    u
2    n
3    g
4    f
5    u
dtype: object
Accessing an Element Using a Label (index):
To access an element of a Series by label, we pass its index label to the index operator.
A Series is like a fixed-size dictionary in that you can get and set values by index label.
Accessing a single element using an index label:
import pandas as pd
import numpy as np
>>> data = np.array(['k', 'u', 'n', 'g', 'f', 'u', 'p', 'a', 'n', 'd', 'a'])
>>> ser = pd.Series(data, index=[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
# accessing an element using its index label
>>> print(ser[16])
Output: p
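The dictionary-like behavior can be sketched as follows, using a shortened version of the Series above. The variable names has_label and missing are chosen only for this illustration.

```python
import pandas as pd

ser = pd.Series(["k", "u", "n", "g"], index=[10, 11, 12, 13])

ser.loc[12] = "N"                     # set a value by its index label
has_label = 11 in ser                 # membership tests the labels, not the values
missing = ser.get(99, default="?")    # get() returns the default for an unknown label

print(ser)
print(has_label, missing)
```

Using .loc makes it explicit that 12 is a label rather than a position, which matters when the index itself consists of integers.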
Indexing and Selecting Data in Series
Indexing in pandas simply means selecting particular data from a Series.
Indexing could mean selecting all of the data or only some of it.
Indexing is also known as subset selection.
Indexing a Series using the indexing operator [ ]:
The term indexing operator refers to the square brackets following an object.
import pandas as pd
# making a data frame
>>> df = pd.read_csv("nba.csv")
>>> ser = pd.Series(df['Name'])
>>> data = ser.head(10)
>>> data
# Now we access elements of the Series using the index operator [ ].
>>> data[3:6]
The .loc and .iloc indexers also use square brackets to make selections.
Indexing a Series using .loc[ ]:
This indexer retrieves data by index label. A label slice includes both of its endpoints.
Indexing a Series using .iloc[ ]:
This indexer retrieves data by position.
To do that, we specify the integer positions of the data that we want.
The .iloc indexer is very similar to .loc but only uses integer locations to make its selections.
import pandas as pd
# making a data frame
>>> df = pd.read_csv("nba.csv")
>>> ser = pd.Series(df['Name'])
>>> data = ser.head(10)
>>> print(data)
# Now we access elements of the Series using .loc[ ].
# In this case, it uses labels 3 to 5 (inclusive)
>>> selected_data = data.loc[3:5]
>>> print(selected_data)
# Now we access elements of the Series using .iloc[ ].
# In this case, it selects positions 3 to 5 (the end position 6 is excluded)
>>> selected_data = data.iloc[3:6]
>>> print(selected_data)
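With the default 0-based index, labels and positions coincide, which can hide the difference between the two indexers. The sketch below uses a made-up Series whose labels start at 10 so that the difference is visible. It does not depend on nba.csv.

```python
import pandas as pd

# Hypothetical data: labels 10..14 sit at positions 0..4.
letters = pd.Series(["a", "b", "c", "d", "e"], index=[10, 11, 12, 13, 14])

by_label = letters.loc[11:13]     # labels 11, 12, 13 (end label included)
by_position = letters.iloc[1:3]   # positions 1 and 2 (end position excluded)

print(by_label)
print(by_position)
```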
Binary Operations on Series
We can perform binary operations on Series, such as addition, subtraction, multiplication, and division.
To perform a binary operation on Series, we can use methods like .add(), .sub(), .mul(), .div(), etc.
# importing pandas module
import pandas as pd
# creating a series data
>>> data = pd.Series([5, 2, 3, 7], index=['a', 'b', 'c', 'd'])
# creating a series data1
>>> data1 = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])
>>> print(data, "\n\n", data1)
Output:
a    5
b    2
c    3
d    7
dtype: int64

a    1
b    6
d    4
e    9
dtype: int64
Addition (add)
# addition operation
>>> result_add = data.add(data1)
>>> print(result_add)
Output:
a     6.0
b     8.0
c     NaN
d    11.0
e     NaN
dtype: float64
Multiplication (mul)
# multiplication operation
>>> result_mul = data.mul(data1)
>>> print(result_mul)
Output:
a     5.0
b    12.0
c     NaN
d    28.0
e     NaN
dtype: float64
Subtraction (sub)
# subtraction operation
>>> result_sub = data.sub(data1)
>>> print(result_sub)
Output:
a    4.0
b   -4.0
c    NaN
d    3.0
e    NaN
dtype: float64
Division (div)
# division operation
>>> result_div = data.div(data1)
>>> print(result_div)
Output:
a    5.000000
b    0.333333
c         NaN
d    1.750000
e         NaN
dtype: float64
Explanation of NaN Values
NaN (Not a Number) appears in the output when an index label is present in only one of the two Series.
Pandas aligns elements based on index labels during operations like add, sub, mul, or div.
If a label exists in one Series but is missing in the other, pandas returns NaN for that label.
How to Handle NaN:
• Fill Missing Values: Use the fill_value argument to substitute a default value (e.g., 0) for the missing side during the operation.
• Drop NaN Values: Use .dropna() to remove NaN values if they are not needed.
# Handling NaN by filling it with 0.
>>> result_add = data.add(data1, fill_value=0)
>>> print(result_add)
Output:
a     6.0
b     8.0
c     3.0
d    11.0
e     9.0
dtype: float64
# Using .dropna() to remove NaN values from the result of an operation
>>> cleaned_data = data.add(data1).dropna()
>>> print(cleaned_data)
Output:
a     6.0
b     8.0
d    11.0
dtype: float64
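A brief sketch of the alignment rule, using the same two Series: the + operator behaves like .add(), only labels present in both Series produce a value, and fill_value can be chosen to suit the operation (1 is a neutral choice for multiplication; that choice is an assumption made for this illustration).

```python
import pandas as pd

data = pd.Series([5, 2, 3, 7], index=["a", "b", "c", "d"])
data1 = pd.Series([1, 6, 4, 9], index=["a", "b", "d", "e"])

plus = data + data1                            # same result as data.add(data1)
shared = data.index.intersection(data1.index)  # labels present in both Series
product = data.mul(data1, fill_value=1)        # treat a missing side as 1

print(plus)
print(list(shared))
print(product)
```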
Conversion Operations on Series
In conversion operations we change the data type of a Series, convert a Series to a list, and so on.
Pandas provides methods such as .astype() and .tolist() for these conversions.
# Python program using astype
# to convert the data type of a Series
import pandas as pd
# reading the csv file
data = pd.read_csv("nba.csv")
# dropping rows with null values to avoid errors
data.dropna(inplace=True)
# storing dtypes before converting
before = data.dtypes
# converting dtypes using astype
data["Salary"] = data["Salary"].astype(int)
data["Number"] = data["Number"].astype(str)
# storing dtypes after converting
after = data.dtypes
# printing to compare
print("BEFORE CONVERSION\n", before, "\n")
print("AFTER CONVERSION\n", after, "\n")
# Python program converting
# a Series into a list
import pandas as pd
>>> data = pd.read_csv("nba.csv")
# removing null values to avoid errors
>>> data.dropna(inplace=True)
# storing the type before the operation
>>> dtype_before = type(data["Salary"])
# converting to a list
>>> salary_list = data["Salary"].tolist()
# storing the type after the operation
>>> dtype_after = type(salary_list)
# printing both types
>>> print("Data type before converting = {}\nData type after converting = {}".format(dtype_before, dtype_after))
# displaying the list
>>> salary_list
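The same conversions can be tried without nba.csv. The sketch below uses a small made-up "Salary" Series; the values are placeholders, not data from the file.

```python
import pandas as pd

# Hypothetical salary values stored as floats.
salary = pd.Series([7730337.0, 6796117.0, 1148640.0], name="Salary")

as_int = salary.astype(int)     # float -> integer
as_text = as_int.astype(str)    # integer -> string
as_list = as_int.tolist()       # Series -> plain Python list

print(salary.dtype, as_int.dtype)
print(type(as_list), as_list)
```

Converting a column that still contains NaN to int raises an error, which is why the examples above call dropna() first.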
Binary operation methods on Series:
FUNCTION: DESCRIPTION
• add(): Adds a Series or list-like object of the same length to the caller Series
• sub(): Subtracts a Series or list-like object of the same length from the caller Series
• mul(): Multiplies the caller Series by a Series or list-like object of the same length
• div(): Divides the caller Series by a Series or list-like object of the same length
• sum(): Returns the sum of the values for the requested axis
• prod(): Returns the product of the values for the requested axis
• mean(): Returns the mean of the values for the requested axis
• pow(): Raises each element of the caller Series to the power given by the corresponding element of the passed Series and returns the result
• abs(): Returns the absolute numeric value of each element in a Series/DataFrame
• cov(): Computes the covariance of two Series
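A few of the methods in this table, sketched on small made-up Series (the numbers are chosen only so that the results are easy to check by hand):

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4])
y = pd.Series([2, 4, 6, 8])

squares = x.pow(2)                        # element-wise power
magnitudes = pd.Series([-3, 1, -2]).abs() # absolute values
total = x.sum()
product = x.prod()
average = x.mean()
covariance = x.cov(y)                     # sample covariance of x and y

print(squares.tolist(), magnitudes.tolist())
print(total, product, average, covariance)
```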
Pandas Series methods and attributes:
FUNCTION: DESCRIPTION
• Series(): Constructor that creates a pandas Series. It accepts a variety of inputs
• combine_first(): Combines two Series into one, filling null values in the caller with values from the passed Series
• count(): Returns the number of non-NA/null observations in the Series
• size: Attribute that returns the number of elements in the underlying data
• name: Attribute that gets or sets the name of a Series object, i.e. of the column
• is_unique: Attribute that returns True if the values in the object are unique
• idxmax(): Returns the index label of the highest value in a Series
• idxmin(): Returns the index label of the lowest value in a Series
• sort_values(): Sorts the values of a Series in ascending or descending order
• sort_index(): Sorts a Series by its index instead of its values
• head(): Returns a specified number of rows from the beginning of a Series as a new Series
• tail(): Returns a specified number of rows from the end of a Series as a new Series
• le(): Compares every element of the caller Series with the passed Series. Returns True for every element that is less than or equal to the corresponding element of the passed Series
• ne(): Compares every element of the caller Series with the passed Series. Returns True for every element that is not equal to the corresponding element of the passed Series
• ge(): Compares every element of the caller Series with the passed Series. Returns True for every element that is greater than or equal to the corresponding element of the passed Series
• eq(): Compares every element of the caller Series with the passed Series. Returns True for every element that is equal to the corresponding element of the passed Series
• gt(): Compares every element of the caller Series with the passed Series. Returns True for every element that is greater than the corresponding element of the passed Series
• lt(): Compares every element of the caller Series with the passed Series. Returns True for every element that is less than the corresponding element of the passed Series
• clip(): Limits values to the passed lower and upper bounds
• clip_lower(): Clipped values below a passed minimum. Removed in pandas 1.0; use clip(lower=...) instead
• clip_upper(): Clipped values above a passed maximum. Removed in pandas 1.0; use clip(upper=...) instead
• astype(): Changes the data type of a Series
• tolist(): Converts a Series to a list
• get(): Extracts a value from a Series by label. This is an alternative to the traditional bracket syntax that returns a default value when the label is missing
• unique(): Returns the unique values in a particular column
• nunique(): Returns the count of unique values
• value_counts(): Counts the number of times each unique value occurs in a Series
• factorize(): Returns a numeric representation of an array by identifying distinct values
• map(): Maps each value of a Series to another value using a dictionary, a Series, or a function
• between(): Checks which values of a Series lie between the first and second argument
• apply(): Takes a Python function as an argument and applies it to every Series value. This method is helpful for executing custom operations that are not included in pandas or numpy
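Several of the listed methods, sketched on made-up grade and score data (the grade-to-points mapping and the bounds are illustrative assumptions):

```python
import pandas as pd

grades = pd.Series(["B", "A", "C", "A", "B", "A"])
counts = grades.value_counts()                  # occurrences of each unique value
points = grades.map({"A": 4, "B": 3, "C": 2})   # translate values via a dictionary

scores = pd.Series([55, 72, 91, 105])
capped = scores.clip(lower=60, upper=100)       # limit values to the range 60..100
in_range = scores.between(60, 100)              # True where 60 <= value <= 100
curved = scores.apply(lambda s: s + 5)          # apply a custom function to each value

print(counts)
print(points.tolist(), capped.tolist())
print(in_range.tolist(), curved.tolist())
```

For a simple addition like the last line, scores + 5 would be the more direct choice; apply() is shown only to illustrate passing a custom function.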
What is a DataFrame?
• A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
• A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.
• A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.
Advantages of creating and using Pandas DataFrames:
• Easy Data Manipulation: Sort, filter, and group data easily.
• Efficient with Large Data: Handles millions of rows/columns efficiently.
• Flexible Data Structure: Create from lists, dictionaries, CSV, Excel, etc.
• Built-in Functions: Powerful methods like .groupby(), .merge(), .apply().
• Missing Data Handling: Easily manage missing values with .fillna() or .dropna().
• Integration: Works seamlessly with libraries like NumPy and Matplotlib.
Creating a Pandas DataFrame
In the real world, a Pandas DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file.
A Pandas DataFrame can also be created from lists, a dictionary, a list of dictionaries, etc.
Here are some of the ways to create a DataFrame:
import pandas as pd
# Create a list of lists
>>> data = [
...     ['Alice', 24, 'New York'],
...     ['Bob', 27, 'Los Angeles'],
...     ['Charlie', 22, 'Chicago'],
...     ['David', 32, 'Houston'],
...     ['Eva', 29, 'Phoenix']
... ]
# Create a DataFrame from the list of lists
>>> df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
# Display the DataFrame
>>> print(df)
import pandas as pd
# Create a list of dictionaries
>>> data = [
...     {'Name': 'Alice', 'Age': 24, 'City': 'New York'},
...     {'Name': 'Bob', 'Age': 27, 'City': 'Los Angeles'},
...     {'Name': 'Charlie', 'Age': 22, 'City': 'Chicago'},
...     {'Name': 'David', 'Age': 32, 'City': 'Houston'},
...     {'Name': 'Eva', 'Age': 29, 'City': 'Phoenix'}
... ]
# Create a DataFrame from the list of dictionaries
>>> df = pd.DataFrame(data)
# Display the DataFrame
>>> print(df)
Column Selection:
To select a column in a Pandas DataFrame, we access it by its column name.
import pandas as pd
>>> data = {
...     'Name': ['Nika', 'Tornike', 'Salome', 'Ana'],
...     'Age': [27, 33, 24, 30],
...     'Address': ['Tbilisi', 'Kutaisi', 'Batumi', 'Rustavi'],
...     'Qualification': ['MBA', 'PhD', 'BSc', 'MSc']
... }
# Convert the dictionary into a DataFrame
>>> df = pd.DataFrame(data)
# Select two columns: 'Name' and 'Qualification'
>>> print(df[['Name', 'Qualification']])
Row Selection:
Pandas provides dedicated indexers to retrieve rows from a DataFrame.
The DataFrame.loc[] indexer retrieves rows by label.
Rows can also be selected by passing an integer location to the iloc[] indexer.
import pandas as pd
>>> data = {
...     'Name': ['Nika', 'Tornike', 'Salome', 'Ana'],
...     'Age': [27, 33, 24, 30],
...     'Address': ['Tbilisi', 'Kutaisi', 'Batumi', 'Rustavi'],
...     'Qualification': ['MBA', 'PhD', 'BSc', 'MSc']
... }
>>> df = pd.DataFrame(data)
# Set 'Name' as the index for better row selection
>>> df.set_index('Name', inplace=True)
# Retrieve rows by label using loc
>>> nika_row = df.loc['Nika']
>>> salome_row = df.loc['Salome']
>>> print(nika_row, "\n\n", salome_row)
# Retrieve rows by integer location using iloc
>>> first_row = df.iloc[0]
>>> third_row = df.iloc[2]
>>> print(first_row, "\n\n", third_row)
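Building on the same sample data, the sketch below shows a few further selection forms: a single column, several rows by label, a single cell, and a positional slice. The variable names are chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Nika", "Tornike", "Salome", "Ana"],
    "Age": [27, 33, 24, 30],
    "Address": ["Tbilisi", "Kutaisi", "Batumi", "Rustavi"],
}).set_index("Name")

ages = df["Age"]                      # one column -> Series
two_rows = df.loc[["Nika", "Ana"]]    # list of labels -> DataFrame
cell = df.loc["Salome", "Address"]    # row label and column label -> single value
first_two = df.iloc[:2]               # first two rows by position

print(ages)
print(two_rows)
print(cell)
print(first_two)
```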
Iterating over rows and columns
Iteration is a general term for taking each item of something, one after another.
A Pandas DataFrame consists of rows and columns, so it can be iterated over column by column, much like a dictionary, or row by row.
Iterating over columns:
To iterate over columns, we can use items() (called iteritems() before pandas 2.0, where it was removed) or loop over the list of column names.
import pandas as pd
data = {
    'name': ["Nika", "Tornike", "Salome", "Ana"],
    'degree': ["MBA", "PhD", "BSc", "MSc"],
    'score': [90, 85, 78, 92]
}
# Creating a DataFrame from the dictionary
df = pd.DataFrame(data)
# Iterating over DataFrame columns using items()
>>> for label, content in df.items():
...     print(f"Column: {label}")
...     print(content[0:2])  # Displaying first 2 rows of each column
# Creating a list of DataFrame columns
>>> columns = list(df)
# Iterating over columns and printing the third element of each column
>>> for col in columns:
...     print(f"The third element of the '{col}' column is: {df[col][2]}")
Iterating over rows:
To iterate over rows, we can use iterrows() or itertuples().
# Iterating over DataFrame rows using iterrows()
>>> for index, row in df.iterrows():
...     print(f"Row {index}: Name={row['name']}, Degree={row['degree']}, Score={row['score']}")
# Iterating over rows using itertuples()
>>> for row in df.itertuples():
...     print(f"Name: {row.name}, Degree: {row.degree}, Score: {row.score}")
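As a small sketch connecting iteration to the vectorized operations mentioned in the introduction, the example below computes the same total with a row loop and with a single column operation. The data mirrors the sample above.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Nika", "Tornike", "Salome", "Ana"],
    "score": [90, 85, 78, 92],
})

# Column-wise iteration: items() yields (column label, column Series) pairs.
column_labels = [label for label, _ in df.items()]

# Row-wise iteration with itertuples(): works, but is usually slower.
loop_total = 0
for row in df.itertuples(index=False):
    loop_total += row.score

# Vectorized alternative: one call over the whole column.
vector_total = df["score"].sum()

print(column_labels, loop_total, vector_total)
```

When a built-in column operation exists, it is usually preferable to a row loop.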
DataFrame methods and attributes:
FUNCTION: DESCRIPTION
• index: Attribute that returns the index (row labels) of the DataFrame
• insert(): Inserts a column into a DataFrame
• add(): Returns the addition of the DataFrame and other, element-wise (binary operator add)
• sub(): Returns the subtraction of the DataFrame and other, element-wise (binary operator sub)
• mul(): Returns the multiplication of the DataFrame and other, element-wise (binary operator mul)
• div(): Returns the floating division of the DataFrame and other, element-wise (binary operator truediv)
• unique(): Series method that extracts the unique values of a DataFrame column
• nunique(): Returns the count of unique values
• value_counts(): Counts the number of times each unique value occurs
• columns: Attribute that returns the column labels of the DataFrame
• axes: Attribute that returns a list representing the axes of the DataFrame
• isnull(): Creates a Boolean mask that can be used to extract rows with null values
• notnull(): Creates a Boolean mask that can be used to extract rows with non-null values
• between(): Series method used to extract rows where a column value falls within a predefined range
• isin(): Used to extract rows from a DataFrame where a column value exists in a predefined collection
• dtypes: Attribute that returns a Series with the data type of each column. The result's index is the original DataFrame's columns
• astype(): Converts the data types of a Series or DataFrame
• values: Attribute that returns a NumPy representation of the DataFrame, i.e. only the values are returned and the axes labels are removed
• sort_values(): Sorts a DataFrame in ascending or descending order of the passed column
• sort_index(): Sorts a DataFrame by its index labels instead of its values. This is useful, for example, when a DataFrame has been built from two or more DataFrames and its index is no longer in order
• loc[]: Indexer that retrieves rows based on index label
• iloc[]: Indexer that retrieves rows based on index position
• ix[]: Indexer that retrieved rows based on either index label or index position. It was removed in pandas 1.0; use .loc[] or .iloc[] instead
• rename(): Changes the names of the index labels or columns of a DataFrame
• columns: Assigning to this attribute is an alternative way to change the column names
• drop(): Deletes rows or columns from a DataFrame
• pop(): Removes a single column from a DataFrame and returns it as a Series
• sample(): Pulls out a random sample of rows or columns from a DataFrame
• nsmallest(): Pulls out the rows with the smallest values in a column
• nlargest(): Pulls out the rows with the largest values in a column
• shape: Attribute that returns a tuple representing the dimensionality of the DataFrame
• ndim: Attribute that returns an int representing the number of axes / array dimensions: 1 for a Series, 2 for a DataFrame
• dropna(): Drops rows or columns with null values in different ways
• fillna(): Replaces NaN values with a value of the user's choice
• rank(): Ranks the values in a Series or DataFrame in order
• query(): Alternative string-based syntax for extracting a subset from a DataFrame
• copy(): Creates an independent copy of a pandas object
• duplicated(): Creates a Boolean Series that can be used to extract rows that have duplicate values
• drop_duplicates(): Identifies duplicate rows and removes them directly
• set_index(): Sets the DataFrame index (row labels) using one or more existing columns
• reset_index(): Resets the index of a DataFrame to integers ranging from 0 to the length of the data minus 1
• where(): Checks a DataFrame against one or more conditions and returns the result accordingly. By default, entries that do not satisfy the condition are replaced with NaN
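A few of these methods, sketched on a small made-up DataFrame with one duplicated row (the names and ages are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Nika", "Tornike", "Salome", "Ana", "Ana"],
    "age": [27, 33, 24, 30, 30],
})

deduped = df.drop_duplicates()                           # remove the repeated row
oldest_first = (deduped.sort_values("age", ascending=False)
                       .reset_index(drop=True))          # sort, then renumber rows
over_25 = deduped.query("age > 25")                      # string-based filtering
renamed = deduped.rename(columns={"age": "years"})       # rename a column
top_two = deduped.nlargest(2, "age")                     # rows with the two largest ages

print(oldest_first)
print(over_25)
print(renamed.columns.tolist())
print(top_two)
```

Each call returns a new DataFrame and leaves df unchanged.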
import pandas as pd
>>> data = {
...     'name': ['Nika', 'Tornike', 'Salome', 'Ana'],
...     'age': [27, 24, 22, 32]
... }
index
# Returns the index (row labels) of the DataFrame.
>>> df = pd.DataFrame(data)
>>> print(df.index)
nunique()
# Counts the unique values in the DataFrame column.
>>> print(df['age'].nunique())
value_counts()
# Counts the occurrences of each unique value in the column.
>>> print(df['age'].value_counts())
insert()
# Inserts a new column in the DataFrame.
>>> df.insert(2, 'score', [90, 85, 78, 92])
>>> print(df)
isnull() and notnull()
# Creates Boolean masks marking null and non-null values.
>>> print(df.isnull())
>>> print(df.notnull())
add(), sub(), mul(), div()
# Performs element-wise addition, subtraction, multiplication, and division.
>>> df['score_plus_10'] = df['score'].add(10)
>>> print(df[['score', 'score_plus_10']])
astype()
# Converts the data type of a column.
>>> df['age'] = df['age'].astype(float)
>>> print(df.dtypes)
unique()
# Finds unique values in a DataFrame column.
>>> print(df['age'].unique())
Pandas Read CSV in Python
CSV (comma-separated values) files store tabular data as plain text.
To access data from a CSV file, we use the Pandas function read_csv(), which returns the data as a DataFrame.
import pandas as pd
# reading csv file
>>> df = pd.read_csv("people.csv")
>>> print(df.head())
# In this example, we take a CSV file that uses several special characters as separators to see how the sep parameter works.
# The regular expression below treats ':', ',', ' ', '|', and '_' as separators; regular-expression separators need engine='python'.
>>> data = pd.read_csv('people.csv',
...                    sep='[:, |_]',
...                    engine='python')
Using usecols in read_csv()
import pandas as pd
# Here, we load only 3 columns, i.e. ["First Name", "Job Title", "Email"], and use row 0 as the header, which is also the default.
>>> df = pd.read_csv("people.csv",
...                  header=0,
...                  usecols=["First Name", "Job Title", "Email"])
>>> print(df.head())
Using index_col in read_csv()
# Here, we use "Date of Birth" as the first index level and "Job Title" as the second; the index_col parameter turns the listed columns into the row index.
>>> df = pd.read_csv('people.csv',
...                  header=0,
...                  index_col=["Date of Birth", "Job Title"],
...                  usecols=["Date of Birth", "Job Title", "Email"])
>>> print(df.head())
Using nrows in read_csv()
import pandas as pd
# Here, we read only the first 3 rows using the nrows parameter.
>>> df = pd.read_csv('people.csv',
...                  header=0,
...                  index_col=["First Name", "Job Title"],
...                  usecols=["First Name", "Job Title", "Email"],
...                  nrows=3)
>>> print(df)
Using skiprows in read_csv()
# The skiprows parameter skips the listed rows of the CSV file; here, rows 1 and 5 (counting the header line as row 0) are skipped from the original dataset.
>>> df = pd.read_csv("people.csv", skiprows=[1, 5])
>>> print(df)
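The parameters above can be tried without a people.csv file by reading from an in-memory string. The sketch below assumes a tiny made-up CSV with three of the column names used above; the rows are placeholders.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for people.csv.
csv_text = """First Name,Job Title,Email
Ada,Engineer,ada@example.com
Ben,Analyst,ben@example.com
Cy,Designer,cy@example.com
Di,Manager,di@example.com
"""

# usecols + nrows: keep two columns and read only the first two data rows.
subset = pd.read_csv(io.StringIO(csv_text),
                     usecols=["First Name", "Email"], nrows=2)

# index_col: use a column as the row index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col="First Name")

# skiprows: skip file lines 1 and 3 (line 0 is the header).
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 3])

print(subset)
print(indexed)
print(skipped)
```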
Pandas DataFrame describe() Method
Pandas describe() is used to view basic statistical details such as percentiles, mean, and std of a DataFrame or a Series of numeric values.
When this method is applied to a Series of strings, it returns a different output (count, unique, top, and freq):
import pandas as pd
# Using the describe function in Pandas
>>> data = pd.read_csv('people.csv')
>>> data.describe()
# Removing rows with missing values before describing selected dtypes
>>> data.dropna(inplace=True)
# Define the percentiles list for the describe method
>>> percentiles = [.20, .40, .60, .80]
# List of dtypes to include in the describe method
>>> include = ['object', 'float', 'int']
# Calling the describe method with custom percentiles and data types
>>> desc = data.describe(percentiles=percentiles, include=include)
>>> desc
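A self-contained sketch of the numeric and string behavior of describe(), using a small made-up DataFrame instead of people.csv:

```python
import pandas as pd

# Hypothetical data: one numeric column and one string column.
df = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "city": ["Tbilisi", "Batumi", "Tbilisi", "Kutaisi"],
})

numeric = df.describe()                   # numeric columns only by default
everything = df.describe(include="all")   # numeric and string columns together
custom = df["age"].describe(percentiles=[0.2, 0.8])

print(numeric)
print(everything)
print(custom)
```

In the combined table, statistics that do not apply to a column (for example mean for "city") appear as NaN.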