Pandas Cut Function Tutorial | pd.cut() Explained with Examples

Introduction

The pandas cut function segments continuous data into intervals, or bins, which makes it easier to analyze the data in discrete groups or to summarize it as a histogram. In this tutorial, we walk through the syntax of pd.cut() and demonstrate its main options with examples.

Syntax of pandas.cut()

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')

Parameters:

  • x:  This is the input array or series containing the continuous data that you want to bin.
  • bins:  Defines the intervals into which the data is divided. An integer requests that many equal-width bins over the range of x; a sequence of numbers gives the bin edges directly.
  • right:  A boolean parameter indicating whether the intervals should be right-closed (includes the right bin edge) or left-closed (excludes the right bin edge).
  • labels:  An optional array of labels to assign to the bins, with one label per bin. If not provided, each bin is labeled with its interval, such as (20, 40]. Passing labels=False returns integer bin indices instead.
  • retbins:  A boolean parameter that, if set to True, returns the bin edges in addition to the binned data.
  • precision:  The number of decimal places used when storing and displaying the bin labels.
  • include_lowest:  If True, the first interval is also closed on the left, so a value equal to the lowest bin edge falls into the first bin.
  • duplicates:  How to handle duplicate bin edges. Options are 'raise' (the default, which raises a ValueError) and 'drop' (which removes the repeated edges).
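
The examples in this tutorial all pass explicit bin edges, so here is a minimal sketch of the other form of bins described above: an integer bin count, combined with labels=False. The variable names are illustrative only.

```python
import pandas as pd

ages = [25, 30, 35, 40, 45, 50, 55, 60, 65]

# An integer for bins requests that many equal-width bins over the data range
equal_width = pd.cut(ages, bins=4)

# labels=False returns the integer index of each value's bin
bin_codes = pd.cut(ages, bins=4, labels=False)

print(equal_width.categories)
print(bin_codes)
```

With an integer bin count, pandas computes the edges from the minimum and maximum of the data, so the intervals change whenever the data changes.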

Examples of Pandas Cut Function

Example 1: Basic Use of Pandas Cut

In this example, we start with a list of ages as sample data. Our aim is to categorize these ages into three bins: 20-40, 40-60, and 60-80. We pass the data and the bin edges to pd.cut(), which assigns each age to its bin. Because intervals are right-closed by default, an age of exactly 40 falls in (20, 40] rather than (40, 60]. The output DataFrame shows the original data and the assigned bin for each age.

In[0]:

import pandas as pd

# Sample data
data = [25, 30, 35, 40, 45, 50, 55, 60, 65]

# Creating bins
bins = [20, 40, 60, 80]

# Using pd.cut() to bin the data
bins_result = pd.cut(data, bins)

# Create a DataFrame for representation
result_df = pd.DataFrame({'Data': data, 'Bin': bins_result})

print(result_df)

Out[0]:

   Data       Bin
0    25  (20, 40]
1    30  (20, 40]
2    35  (20, 40]
3    40  (20, 40]
4    45  (40, 60]
5    50  (40, 60]
6    55  (40, 60]
7    60  (40, 60]
8    65  (60, 80]
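
Since the introduction mentions histograms, the following sketch shows one possible way to count how many ages land in each bin. Wrapping the result in a Series and calling value_counts() is our choice here, not something the example above requires.

```python
import pandas as pd

data = [25, 30, 35, 40, 45, 50, 55, 60, 65]
bins = [20, 40, 60, 80]

# Wrap the binned result in a Series so the bins can be counted
binned = pd.Series(pd.cut(data, bins))

# Count the values per bin and list the bins in interval order
counts = binned.value_counts().sort_index()

print(counts)
```

These per-bin counts are the same numbers a histogram with these bin edges would plot.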

Example 2: Use of Custom Labels in pd.cut()

Here, we again have a list of ages, but this time we define the bins (20-40, 40-60, and 60-80) and also provide a custom label for each one: "Young", "Middle-aged", and "Senior". The list of labels must contain exactly one entry per bin. The pd.cut() function uses these labels in place of the interval notation, resulting in a DataFrame that includes the original data and the label for each age group.

In[1]:

import pandas as pd

# Sample data
data = [25, 30, 35, 40, 45, 50, 55, 60, 65]

# Creating bins and custom labels
bins = [20, 40, 60, 80]
labels = ['Young', 'Middle-aged', 'Senior']

# Using pd.cut() with custom labels
bins_result = pd.cut(data, bins, labels=labels)

# Create a DataFrame for representation
result_df = pd.DataFrame({'Data': data, 'Bin': bins_result})

print(result_df)

Out[1]:

   Data          Bin
0    25        Young
1    30        Young
2    35        Young
3    40        Young
4    45  Middle-aged
5    50  Middle-aged
6    55  Middle-aged
7    60  Middle-aged
8    65       Senior
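
The labeled result is an ordered categorical, so the labels can be compared by bin order rather than alphabetically. The sketch below assumes the ages are held in a Series, which the example above does not do, and uses illustrative variable names.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40, 45, 50, 55, 60, 65])
bins = [20, 40, 60, 80]
labels = ['Young', 'Middle-aged', 'Senior']

groups = pd.cut(ages, bins, labels=labels)

# The labels keep the order of the bins they name
print(groups.cat.ordered)

# Select every age whose group ranks above 'Young'
older = ages[groups > 'Young']
print(older.tolist())
```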

Example 3: Left-Closed Binning

In this example, we continue to work with ages and the same bin edges (20, 40, 60, and 80). However, we set the right parameter to False. This means that the intervals are now left-closed, and the right bin edge is excluded. As a result, the boundary values 40 and 60 move into the next bin: 40 falls in [40, 60) and 60 falls in [60, 80). The output DataFrame shows the data with left-closed intervals.

In[2]:

import pandas as pd

# Sample data
data = [25, 30, 35, 40, 45, 50, 55, 60, 65]

# Creating bins
bins = [20, 40, 60, 80]

# Using pd.cut() with left-closed intervals (right=False)
bins_result = pd.cut(data, bins, right=False)

# Create a DataFrame for representation
result_df = pd.DataFrame({'Data': data, 'Bin': bins_result})

print(result_df)

Out[2]:

   Data       Bin
0    25  [20, 40)
1    30  [20, 40)
2    35  [20, 40)
3    40  [40, 60)
4    45  [40, 60)
5    50  [40, 60)
6    55  [40, 60)
7    60  [60, 80)
8    65  [60, 80)
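
To isolate the effect of right, the following sketch bins only the two boundary values under both settings. The variable names are ours.

```python
import pandas as pd

edges = [20, 40, 60, 80]
boundary_values = [40, 60]

# Default right=True: a value equal to an edge belongs to the bin ending there
right_closed = pd.cut(boundary_values, edges)

# right=False: the same value belongs to the bin starting there
left_closed = pd.cut(boundary_values, edges, right=False)

print(list(right_closed))
print(list(left_closed))
```

Only values that sit exactly on a bin edge are affected by this parameter; every other value is assigned to the same range either way.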

Example 4: Including Lowest Value

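Here, we once again have ages and the same bins (20-40, 40-60, and 60-80). However, we set include_lowest to True. This closes the first interval on the left, so a value equal to the lowest edge (20) would be placed in the first bin instead of being left unassigned. Our sample data contains no age of exactly 20, so the assignments do not change; the visible difference is the first label, which pandas displays as (19.999, 40.0] to indicate that 20 is included.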

In[3]:

import pandas as pd

# Sample data
data = [25, 30, 35, 40, 45, 50, 55, 60, 65]

# Creating bins and including the lowest value
bins = [20, 40, 60, 80]

# Using pd.cut() with include_lowest=True
bins_result = pd.cut(data, bins, include_lowest=True)

# Create a DataFrame for representation
result_df = pd.DataFrame({'Data': data, 'Bin': bins_result})

print(result_df)

Out[3]:

   Data             Bin
0    25  (19.999, 40.0]
1    30  (19.999, 40.0]
2    35  (19.999, 40.0]
3    40  (19.999, 40.0]
4    45    (40.0, 60.0]
5    50    (40.0, 60.0]
6    55    (40.0, 60.0]
7    60    (40.0, 60.0]
8    65    (60.0, 80.0]
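
Because the sample data has no age equal to the lowest edge, the sketch below uses a small hypothetical list that does, and checks which values end up without a bin under each setting.

```python
import pandas as pd

ages = [20, 25, 40]  # hypothetical data that includes the lowest edge
bins = [20, 40, 60, 80]

# By default the first interval is (20, 40], so 20 itself is left out
default_result = pd.cut(ages, bins)

# include_lowest=True closes the first interval on the left
inclusive_result = pd.cut(ages, bins, include_lowest=True)

print(pd.isna(default_result).tolist())
print(pd.isna(inclusive_result).tolist())
```

A value that falls outside every interval is returned as NaN, which is why the check uses pd.isna().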

Example 5: Handling Duplicate Bins

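In this example, we work with ages and define bin edges that contain a duplicate value: 40 appears twice in the list. We set the duplicates parameter to 'drop', which removes the repeated edge and leaves two bins, 20-40 and 40-80. With the default setting, 'raise', the same edges would produce a ValueError. The DataFrame shows the data categorized into the remaining bins.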

In[4]:

import pandas as pd

# Sample data
data = [25, 30, 35, 40, 45, 50, 55, 60, 65]

# Creating bins with duplicate values
bins = [20, 40, 40, 80]

# Using pd.cut() with duplicate bins and 'drop' strategy
bins_result = pd.cut(data, bins, duplicates='drop')

# Create a DataFrame for representation
result_df = pd.DataFrame({'Data': data, 'Bin': bins_result})

print(result_df)

Out[4]:

   Data       Bin
0    25  (20, 40]
1    30  (20, 40]
2    35  (20, 40]
3    40  (20, 40]
4    45  (40, 80]
5    50  (40, 80]
6    55  (40, 80]
7    60  (40, 80]
8    65  (40, 80]
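
For contrast, here is a minimal sketch of the default behavior, duplicates='raise', on the same edges. The shortened data list and the try/except wrapper are only there to keep the demonstration compact.

```python
import pandas as pd

data = [25, 45, 65]
bins = [20, 40, 40, 80]

# With the default duplicates='raise', repeated edges are rejected
try:
    pd.cut(data, bins)
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")

# duplicates='drop' removes the repeated edge and bins the data normally
deduplicated = pd.cut(data, bins, duplicates='drop')
print(deduplicated.categories)
```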

Example 6: Returning Bin Edges

In this example, we again have ages and the same bins (20-40, 40-60, and 60-80). We set retbins to True, so pd.cut() returns a tuple: the categorized data and an array of the bin edges. We wrap each in a DataFrame for display, which lets us see both the assignments and the edges used for binning.

In[5]:

import pandas as pd

# Sample data
data = [25, 30, 35, 40, 45, 50, 55, 60, 65]

# Creating bins and returning bin edges
bins = [20, 40, 60, 80]

# Using pd.cut() with retbins=True
bins_result, bin_edges = pd.cut(data, bins, retbins=True)

# Create DataFrames for representation
result_df = pd.DataFrame({'Data': data, 'Bin': bins_result})

bin_edges_df = pd.DataFrame({'Bin Edges': bin_edges})
print("Binned Data:")
print(result_df)
print("\nBin Edges:")
print(bin_edges_df)

Out[5]:

Binned Data:
   Data       Bin
0    25  (20, 40]
1    30  (20, 40]
2    35  (20, 40]
3    40  (20, 40]
4    45  (40, 60]
5    50  (40, 60]
6    55  (40, 60]
7    60  (40, 60]
8    65  (60, 80]

Bin Edges:
   Bin Edges
0         20
1         40
2         60
3         80
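
Returning the edges is most useful when pandas computes them for us. As a sketch, the code below lets pd.cut() derive four equal-width bins from one list and then reuses the returned edges on a second, hypothetical list, so both are binned consistently.

```python
import pandas as pd

ages = [25, 30, 35, 40, 45, 50, 55, 60, 65]
new_ages = [28, 52, 64]  # hypothetical second batch of data

# Let pandas compute four equal-width bins and return the edges it used
binned, edges = pd.cut(ages, bins=4, retbins=True)

# Reuse those edges so the new data gets exactly the same bins
new_codes = pd.cut(new_ages, bins=edges, labels=False)

print(edges)
print(new_codes)
```

Values in the second list that fall outside the original range would come back as NaN, since the reused edges do not extend to cover them.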
