Pandas Basics — 3

Devil’s Advocate
3 min read · Mar 3, 2023


Continuing from our earlier posts on the Python Pandas library, let’s look at a few more helpful functions for working with DataFrames.

fillna(): fills missing values in a DataFrame with a specified value.

import pandas as pd

data = {'A': [1, 2, None, 4],
        'B': [5, None, 7, 8]}
df = pd.DataFrame(data)

# Fill missing values with 0
df.fillna(0, inplace=True)
print(df)
# Output:
#      A    B
# 0  1.0  5.0
# 1  2.0  0.0
# 2  0.0  7.0
# 3  4.0  8.0

The inplace=True parameter tells pandas to modify the original DataFrame directly. In that case fillna() returns None rather than a new DataFrame, so its result should not be assigned back to df.
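For comparison, here is a small sketch of fillna() without inplace=True, where the original stays untouched and a new DataFrame comes back. It also shows passing a dict to use a different fill value per column. The variable names (filled, per_column) and the choice of B's mean as a fill value are just assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8]})

# Without inplace=True, fillna() returns a new DataFrame and leaves df as it was.
filled = df.fillna(0)

# A dict maps column names to fill values: 0 for A, the mean of B for B.
per_column = df.fillna({'A': 0, 'B': df['B'].mean()})

print(df.isna().sum().sum())      # missing values still present in the original
print(filled.isna().sum().sum())  # missing values left after filling
print(per_column)
```

Returning a new object makes it easier to keep the raw data around while you experiment with different fill strategies.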

groupby(): groups a DataFrame by one or more columns and allows you to perform aggregate functions on the groups.

Let’s use fillna() and groupby() together.

import pandas as pd

data = {'A': [1, 2, None, 1],
        'B': [5, None, 7, 8]}
df = pd.DataFrame(data)

# Replace missing values with 0, then sum column B within each value of A
df.fillna(0, inplace=True)
print(df.groupby('A').sum())
# Output:
#         B
# A
# 0.0   7.0
# 1.0  13.0
# 2.0   0.0
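Since groupby() accepts more than one column, here is a rough sketch of grouping by two columns at once. The data (team, year, points) is made up for illustration and is not part of the earlier examples.

```python
import pandas as pd

# Hypothetical sample data, invented for this sketch.
df = pd.DataFrame({'team': ['x', 'x', 'y', 'y', 'y'],
                   'year': [2022, 2023, 2022, 2022, 2023],
                   'points': [10, None, 5, 7, 3]})

# Treat missing points as 0 before aggregating.
df = df.fillna({'points': 0})

# Passing a list of columns creates one group per (team, year) combination.
totals = df.groupby(['team', 'year'])['points'].sum()
print(totals)
```

The result is indexed by both grouping columns, so a single total can be looked up with totals.loc[('y', 2022)].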

Common aggregation functions that can be used with groupby() include:

  • count(): Returns the count of non-null values in each group.
  • sum(): Returns the sum of values in each group.
  • mean(): Returns the mean value of each group.
  • median(): Returns the median value of each group.
  • min(): Returns the minimum value of each group.
  • max(): Returns the maximum value of each group.
  • std(): Returns the standard deviation of each group.
  • var(): Returns the variance of each group.

import pandas as pd

data = {'name': ['A', 'B', 'C', 'A', 'B', 'C'],
        'age': [25, 30, 35, 40, 45, 50],
        'salary': [5000, 6000, 7000, 8000, 9000, 10000]}
df = pd.DataFrame(data)

# Group by name and calculate each aggregate per group
print(df.groupby('name').sum())
print(df.groupby('name').min())
print(df.groupby('name').max())
print(df.groupby('name').mean())
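If you want several of these aggregates in one table instead of separate calls, agg() with named aggregation is one option. This is only a sketch on the same data; the output column names (people, mean_age, max_salary, total_salary) are my own picks.

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'age': [25, 30, 35, 40, 45, 50],
                   'salary': [5000, 6000, 7000, 8000, 9000, 10000]})

# Named aggregation: output_column=(source_column, aggregation_name)
summary = df.groupby('name').agg(
    people=('age', 'count'),
    mean_age=('age', 'mean'),
    max_salary=('salary', 'max'),
    total_salary=('salary', 'sum'),
)
print(summary)
```

Each keyword becomes a column in the result, which keeps the output readable when different columns need different aggregates.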

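join(): combines two DataFrames based on the index of each DataFrame.

By default it performs a left join on the indices, so every row of the calling DataFrame is kept. The how parameter also accepts 'right', 'outer', 'inner', and 'cross', and the on parameter matches a column of the calling DataFrame against the index of the other.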

import pandas as pd

data1 = {'value1': [1, 2, 3, 4],
         'value2': [5, 6, 7, 8]}
df1 = pd.DataFrame(data1, index=['A', 'B', 'C', 'D'])

data2 = {'value3': [9, 10, 11, 12],
         'value4': [13, 14, 15, 16]}
df2 = pd.DataFrame(data2, index=['B', 'D', 'E', 'F'])

# Join df1 and df2 based on their indices
joined = df1.join(df2)
print(joined)
# Output:
#    value1  value2  value3  value4
# A       1       5     NaN     NaN
# B       2       6     9.0    13.0
# C       3       7     NaN     NaN
# D       4       8    10.0    14.0
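To keep only the labels that appear in both indices, join() can be asked for an inner join instead of the default left join. A minimal sketch reusing the two DataFrames from above (the variable names left and inner are just for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'value1': [1, 2, 3, 4], 'value2': [5, 6, 7, 8]},
                   index=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame({'value3': [9, 10, 11, 12], 'value4': [13, 14, 15, 16]},
                   index=['B', 'D', 'E', 'F'])

# Default behaviour: left join, all four rows of df1 are kept.
left = df1.join(df2)

# Inner join: only index labels present in both df1 and df2 survive.
inner = df1.join(df2, how='inner')

print(len(left), len(inner))
print(inner)
```

With the inner join there are no unmatched rows, so no NaN values are introduced into value3 and value4.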

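merge(): merges DataFrames on one or more columns or on their indices. It supports several join types through the how parameter (inner, outer, left, right; inner is the default) and lets you customize the result, for example with suffixes for overlapping column names. In the example below, both DataFrames have a value column, so pandas appends the default suffixes _x and _y.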

import pandas as pd

data1 = {'key': ['A', 'B', 'C', 'D'],
         'value': [1, 2, 3, 4]}
df1 = pd.DataFrame(data1)

data2 = {'key': ['B', 'D', 'E', 'F'],
         'value': [5, 6, 7, 8]}
df2 = pd.DataFrame(data2)

# Merge df1 and df2 based on the 'key' column into a new DataFrame
merged = pd.merge(df1, df2, on='key')
print(merged)
# Output:
#   key  value_x  value_y
# 0   B        2        5
# 1   D        4        6
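The same two DataFrames can also be merged with a different join type. Here is a small sketch of an outer merge with custom suffixes, plus indicator=True to record where each row came from. The suffix names '_left' and '_right' are my own choice for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})

# how='outer' keeps keys from both sides; unmatched cells become NaN.
# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
outer = pd.merge(df1, df2, on='key', how='outer',
                 suffixes=('_left', '_right'), indicator=True)
print(outer)
```

The _merge column is handy for checking which keys failed to match before deciding how to handle the resulting missing values.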

