Introduction
The pandas cut function segments continuous data into intervals, or bins, which makes it easier to analyze the data as discrete groups or to visualize it as a histogram. In this tutorial, we look at the syntax of pd.cut() and walk through its main options with examples.
Syntax of pandas.cut()
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
Parameters:
- x: This is the input array or series containing the continuous data that you want to bin.
- bins: It defines the intervals into which data should be divided. You can specify the number of bins or provide an array of bin edges.
- right: A boolean parameter indicating whether the intervals should be right-closed (includes the right bin edge) or left-closed (excludes the right bin edge).
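- labels: An optional array of labels to assign to the bins; it must have the same length as the number of bins. If not provided, the bins are labeled with their interval notation, such as (20, 40]. If set to False, integer bin indicators are returned instead.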
- retbins: A boolean parameter that, if set to True, returns the bin edges in addition to the binned data.
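- precision: The precision at which the bin labels are stored and displayed (default 3).
- include_lowest: If True, the first interval is left-inclusive, so a value equal to the lowest bin edge falls into the first bin.
- duplicates: How to handle duplicate bin edges. Options are 'raise' (the default, which raises a ValueError) and 'drop' (which removes the repeated edges).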
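The bins parameter accepts a single integer as well as a list of edges. The following is a minimal sketch of the integer form, assuming a small list of ages (the variable names are illustrative only). pandas computes the equal-width edges itself, so the snippet prints them rather than assuming their values.

```python
import pandas as pd

# Illustrative sample data (hypothetical ages)
ages = [25, 30, 35, 40, 45, 50, 55, 60, 65]

# An integer asks for that many equal-width bins over the range of the data
binned = pd.cut(ages, bins=3)

# The edges are computed by pandas, so inspect them instead of guessing
print(binned.categories)
```

Because the edges depend on the minimum and maximum of the data, the same integer can produce different bins for different datasets.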
Examples of Pandas Cut Function
Example 1: Basic Use of Pandas Cut
In this example, we start with a list of ages as sample data. Our aim is to categorize these ages into three bins: 20-40, 40-60, and 60-80. We pass the data and the bin edges to pd.cut(), which assigns each age to its corresponding bin. Because the intervals are right-closed by default, an age of exactly 40 falls into (20, 40] rather than (40, 60]. The output DataFrame shows the original data and the assigned bin for each age.
In[0]:
import pandas as pd

# Sample data
data = [25, 30, 35, 40, 45, 50, 55, 60, 65]

# Creating bins
bins = [20, 40, 60, 80]

# Using pd.cut() to bin the data
bins_result = pd.cut(data, bins)

# Create a DataFrame for representation
result_df = pd.DataFrame({'Data': data, 'Bin': bins_result})
print(result_df)
Out[0]:
   Data       Bin
0    25  (20, 40]
1    30  (20, 40]
2    35  (20, 40]
3    40  (20, 40]
4    45  (40, 60]
5    50  (40, 60]
6    55  (40, 60]
7    60  (40, 60]
8    65  (60, 80]
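In practice the data often already lives in a DataFrame column. The sketch below reuses the ages and bin edges from Example 1, passes the column to pd.cut() directly, and then counts the rows per bin, which is the tabular equivalent of a histogram. The names df and counts are illustrative.

```python
import pandas as pd

# Same sample ages and bin edges as in Example 1
df = pd.DataFrame({'Data': [25, 30, 35, 40, 45, 50, 55, 60, 65]})
bins = [20, 40, 60, 80]

# Passing a Series returns a categorical Series aligned to the same index
df['Bin'] = pd.cut(df['Data'], bins)

# Count the rows per bin, keeping the bins in their natural order
counts = df['Bin'].value_counts(sort=False)
print(counts)
```

Using sort=False keeps the counts in bin order instead of sorting them by frequency.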
Example 2: Use of Custom Labels in pd.cut()
Here, we again have a list of ages, but this time, we not only define the bins (20-40, 40-60, and 60-80) but also provide custom labels for these bins: “Young,” “Middle-aged,” and “Senior.” The pd.cut() function uses these custom labels to categorize the ages, resulting in a DataFrame that includes the original data and the custom labels for each age group.
In[1]:
import pandas as pd

# Sample data
data = [25, 30, 35, 40, 45, 50, 55, 60, 65]

# Creating bins and custom labels
bins = [20, 40, 60, 80]
labels = ['Young', 'Middle-aged', 'Senior']

# Using pd.cut() with custom labels
bins_result = pd.cut(data, bins, labels=labels)

# Create a DataFrame for representation
result_df = pd.DataFrame({'Data': data, 'Bin': bins_result})
print(result_df)
Out[1]:
   Data          Bin
0    25        Young
1    30        Young
2    35        Young
3    40        Young
4    45  Middle-aged
5    50  Middle-aged
6    55  Middle-aged
7    60  Middle-aged
8    65       Senior
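The labels parameter can also be set to False when only the position of each bin is needed, for example as a numeric feature. A small sketch with the same data and bins follows; the variable name codes is illustrative.

```python
import pandas as pd

data = [25, 30, 35, 40, 45, 50, 55, 60, 65]
bins = [20, 40, 60, 80]

# labels=False returns the zero-based position of each value's bin
codes = pd.cut(data, bins, labels=False)
print(codes)
```

Here 0 corresponds to the 20-40 bin, 1 to the 40-60 bin, and 2 to the 60-80 bin.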
Example 3: Left-Closed Binning
In this example, we continue to work with ages and the same bin intervals (20-40, 40-60, and 60-80). However, we set the right parameter to False. This means that the intervals are now left-closed, and the right bin edge is excluded. The output DataFrame shows the data with left-closed intervals.
In[2]:
import pandas as pd

# Sample data
data = [25, 30, 35, 40, 45, 50, 55, 60, 65]

# Creating bins
bins = [20, 40, 60, 80]

# Using pd.cut() with left-closed intervals (right=False)
bins_result = pd.cut(data, bins, right=False)

# Create a DataFrame for representation
result_df = pd.DataFrame({'Data': data, 'Bin': bins_result})
print(result_df)
Out[2]:
   Data       Bin
0    25  [20, 40)
1    30  [20, 40)
2    35  [20, 40)
3    40  [40, 60)
4    45  [40, 60)
5    50  [40, 60)
6    55  [40, 60)
7    60  [60, 80)
8    65  [60, 80)
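The effect of the right parameter is easiest to see on values that sit exactly on a bin edge. The following sketch bins only 40 and 60 with both settings and uses labels=False so the bin positions can be compared directly. The variable names are illustrative.

```python
import pandas as pd

bins = [20, 40, 60, 80]
edge_values = [40, 60]  # values that sit exactly on interior bin edges

# labels=False gives bin positions, which makes the shift easy to see
right_closed = pd.cut(edge_values, bins, right=True, labels=False)
left_closed = pd.cut(edge_values, bins, right=False, labels=False)

print(list(right_closed))  # 40 and 60 land in the bins that end at them
print(list(left_closed))   # 40 and 60 land in the bins that start at them
```

Values that are not on an edge are assigned to the same bin under either setting.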
Example 4: Including Lowest Value
Here, we use the same ages and bin intervals (20-40, 40-60, and 60-80), but set include_lowest to True. This makes the first interval left-inclusive, so a value equal to the lowest edge (20) would fall into the first bin. The sample data contains no value of exactly 20, so the bin assignments are unchanged; the visible difference is that the first bin is now displayed as (19.999, 40.0].
In[3]:
import pandas as pd

# Sample data
data = [25, 30, 35, 40, 45, 50, 55, 60, 65]

# Creating bins
bins = [20, 40, 60, 80]

# Using pd.cut() with include_lowest=True
bins_result = pd.cut(data, bins, include_lowest=True)

# Create a DataFrame for representation
result_df = pd.DataFrame({'Data': data, 'Bin': bins_result})
print(result_df)
Out[3]:
   Data             Bin
0    25  (19.999, 40.0]
1    30  (19.999, 40.0]
2    35  (19.999, 40.0]
3    40  (19.999, 40.0]
4    45    (40.0, 60.0]
5    50    (40.0, 60.0]
6    55    (40.0, 60.0]
7    60    (40.0, 60.0]
8    65    (60.0, 80.0]
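To see include_lowest change an actual assignment, the data needs a value equal to the lowest bin edge. The sketch below uses a hypothetical two-element list containing 20 and checks which values come back as missing under each setting.

```python
import pandas as pd

bins = [20, 40, 60, 80]
values = [20, 25]  # 20 sits exactly on the lowest bin edge

without_lowest = pd.cut(values, bins)
with_lowest = pd.cut(values, bins, include_lowest=True)

# Without include_lowest, 20 is outside (20, 40] and comes back as NaN
print(pd.isna(without_lowest).tolist())

# With include_lowest=True, 20 is assigned to the first bin
print(pd.isna(with_lowest).tolist())
```

More generally, any value that falls outside all of the bins is returned as NaN rather than raising an error.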
Example 5: Handling Duplicate Bins
In this example, the bin edges contain a duplicate value: 40 appears twice in [20, 40, 40, 80]. We set the duplicates parameter to 'drop', which removes the repeated edge and leaves the bins 20-40 and 40-80. The DataFrame shows the data categorized into these non-duplicate bins.
In[4]:
import pandas as pd

# Sample data
data = [25, 30, 35, 40, 45, 50, 55, 60, 65]

# Creating bins with a duplicate edge
bins = [20, 40, 40, 80]

# Using pd.cut() with duplicate bins and the 'drop' strategy
bins_result = pd.cut(data, bins, duplicates='drop')

# Create a DataFrame for representation
result_df = pd.DataFrame({'Data': data, 'Bin': bins_result})
print(result_df)
Out[4]:
   Data       Bin
0    25  (20, 40]
1    30  (20, 40]
2    35  (20, 40]
3    40  (20, 40]
4    45  (40, 80]
5    50  (40, 80]
6    55  (40, 80]
7    60  (40, 80]
8    65  (40, 80]
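For comparison, the sketch below runs the same call without the duplicates argument. Since the default is 'raise', pandas is expected to reject the repeated edge with a ValueError; the exact wording of the message may vary between versions, so the snippet prints it rather than assuming it.

```python
import pandas as pd

data = [25, 30, 35, 40, 45, 50, 55, 60, 65]
bins = [20, 40, 40, 80]  # the edge 40 appears twice

try:
    # duplicates defaults to 'raise', so the repeated edge is rejected
    pd.cut(data, bins)
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")
```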
Example 6: Returning Bin Edges
In this example, we again have ages and the bin intervals (20-40, 40-60, and 60-80). We set retbins to True to return the bin edges in addition to the categorized data. The resulting DataFrames show both the categorized data and the bin edges, allowing us to see the intervals used for binning.
In[5]:
import pandas as pd

# Sample data
data = [25, 30, 35, 40, 45, 50, 55, 60, 65]

# Creating bins
bins = [20, 40, 60, 80]

# Using pd.cut() with retbins=True
bins_result, bin_edges = pd.cut(data, bins, retbins=True)

# Create DataFrames for representation
result_df = pd.DataFrame({'Data': data, 'Bin': bins_result})
bin_edges_df = pd.DataFrame({'Bin Edges': bin_edges})

print("Binned Data:")
print(result_df)
print("\nBin Edges:")
print(bin_edges_df)
Out[5]:
Binned Data:
   Data       Bin
0    25  (20, 40]
1    30  (20, 40]
2    35  (20, 40]
3    40  (20, 40]
4    45  (40, 60]
5    50  (40, 60]
6    55  (40, 60]
7    60  (40, 60]
8    65  (60, 80]

Bin Edges:
   Bin Edges
0         20
1         40
2         60
3         80
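The retbins option is arguably most useful when bins is an integer, because the edges are then computed by pandas rather than supplied by us. The sketch below retrieves those edges and reuses them on a second, hypothetical list (new_data) so that both datasets are binned consistently.

```python
import pandas as pd

data = [25, 30, 35, 40, 45, 50, 55, 60, 65]

# With an integer bins value the edges are computed by pandas,
# so retbins=True is how we find out what they are
binned, edges = pd.cut(data, bins=4, retbins=True)
print(edges)

# Reuse the same edges on new (hypothetical) values
new_data = [28, 47, 63]
new_binned = pd.cut(new_data, bins=edges)
print(new_binned)
```

Any new value that falls outside the range covered by the reused edges would be returned as NaN.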