Introduction
In machine learning and data science projects that use the pandas library, most of the work revolves around DataFrames. Beyond the default view of a DataFrame, we often need an alternative view of the data, a random subset of it, or a sorted version of it. In this tutorial, we will learn about the pandas functions crosstab(), sample(), and sort_values(), which present DataFrames in alternative ways to help extract insights.
Importing Pandas Library
We will commence this tutorial by importing the pandas library.
import pandas as pd
Pandas Crosstab : crosstab()
The crosstab() function computes a cross-tabulation of two or more factors. By default, it builds a frequency table that counts how often each combination of factors occurs.
Syntax
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, dropna=True)
index : array-like, Series, or list of arrays/Series – The values to group by in the rows.
columns : array-like, Series, or list of arrays/Series – The values to group by in the columns.
values : array-like, optional – The array of values to aggregate according to the factors. When values is passed, an aggregation function must also be passed through the aggfunc parameter.
rownames : sequence, optional – Names for the row levels. If passed, its length must match the number of row arrays passed.
colnames : sequence, optional – Names for the column levels. If passed, its length must match the number of column arrays passed.
dropna : bool, default True – When True, columns whose entries are all NaN are left out of the result.
The result of this function is a dataframe with the cross-tabulation of data.
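As a minimal sketch of how the values parameter changes the result: by default each cell holds a count, but when values is passed together with an aggregation function (the aggfunc parameter of crosstab), each cell holds the aggregated values instead. The fruit, size, and weight data below are made up purely for illustration.

```python
import pandas as pd

# Hypothetical data, invented for this sketch only.
fruit = pd.Series(["mango", "mango", "orange", "orange", "mango"], name="fruit")
size = pd.Series(["small", "large", "small", "small", "small"], name="size")
weight = pd.Series([100, 300, 120, 140, 110], name="weight")

# Default behaviour: count how often each (fruit, size) pair occurs.
counts = pd.crosstab(fruit, size)

# With values= and aggfunc=, the weights are summed inside each cell instead.
total_weight = pd.crosstab(fruit, size, values=weight, aggfunc="sum")

print(counts)
print(total_weight)
```

Here the mango/small cell of counts is 2, while the same cell of total_weight is 210, the sum of the two matching weights.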
Example 1: Simple example of pandas crosstab function
Here three arrays are built, and the pandas crosstab function is then used to tabulate them against each other.
import numpy as np
a = np.array(["mango", "mango", "mango", "mango", "orange", "orange",
"orange", "orange", "mango", "mango", "mango"], dtype=object)
a
array(['mango', 'mango', 'mango', 'mango', 'orange', 'orange', 'orange', 'orange', 'mango', 'mango', 'mango'], dtype=object)
b = np.array(["one", "one", "one", "two", "one", "one",
"one", "two", "two", "two", "one"], dtype=object)
b
array(['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', 'two', 'two', 'one'], dtype=object)
c = np.array(["watermelon", "watermelon", "strawberry", "watermelon", "watermelon", "strawberry",
"strawberry", "watermelon", "strawberry", "strawberry", "strawberry"],dtype=object)
c
array(['watermelon', 'watermelon', 'strawberry', 'watermelon', 'watermelon', 'strawberry', 'strawberry', 'watermelon', 'strawberry', 'strawberry', 'strawberry'], dtype=object)
In the call below, array a is mapped to the rows and arrays b and c are mapped to two levels of columns, so each cell counts how often that combination of values occurs.
pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
b | one | one | two | two |
---|---|---|---|---|
c | strawberry | watermelon | strawberry | watermelon |
a | ||||
mango | 2 | 2 | 2 | 1 |
orange | 2 | 1 | 0 | 1 |
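Row and column totals are often useful next to the counts. The following sketch reuses the a and b arrays from this example and passes margins=True, an optional crosstab parameter that this tutorial does not otherwise cover, to append an "All" row and column.

```python
import numpy as np
import pandas as pd

# Same a and b arrays as in the example above.
a = np.array(["mango", "mango", "mango", "mango", "orange", "orange",
              "orange", "orange", "mango", "mango", "mango"], dtype=object)
b = np.array(["one", "one", "one", "two", "one", "one",
              "one", "two", "two", "two", "one"], dtype=object)

# margins=True adds an "All" row and an "All" column holding the totals.
totals = pd.crosstab(a, b, rownames=["a"], colnames=["b"], margins=True)
print(totals)
```

The bottom-right "All" cell equals the number of observations, which is 11 for these arrays.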
Example 2: Using Categorical and NaN values with pandas crosstab
Here the pandas crosstab function is used with categorical data in which some of the declared categories never occur.
first = pd.Categorical(['p', 'q'], categories=['p', 'q', 'r'])
first
[p, q]
Categories (3, object): [p, q, r]
second = pd.Categorical(['x', 'y'], categories=['x', 'y', 'z'])
second
[x, y]
Categories (3, object): [x, y, z]
Since the dropna parameter of crosstab() defaults to True, the unused categories “r” and “z” are dropped from the output.
pd.crosstab(first, second)
col_0 | x | y |
---|---|---|
row_0 | ||
p | 1 | 0 |
q | 0 | 1 |
To include “r” and “z” in the output, pass dropna=False. The result then contains the two categories that were absent earlier, with all of their counts equal to zero.
pd.crosstab(first, second, dropna=False)
col_0 | x | y | z |
---|---|---|---|
row_0 | |||
p | 1 | 0 | 0 |
q | 0 | 1 | 0 |
r | 0 | 0 | 0 |
Pandas Sample : sample()
The pandas sample() function is used for returning a random sample of items from an axis of the object.
Syntax
pandas.DataFrame.sample(n, frac, replace, random_state, axis)
n : int, optional – The number of items to return from the axis. It cannot be used together with frac.
frac : float, optional – The fraction of axis items to return. It cannot be used together with n.
replace : bool, default False – Whether the same row may be sampled more than once.
random_state : int or numpy.random.RandomState, optional – A seed or random number generator. Passing the same value makes the sample reproducible.
axis : {0 or ‘index’, 1 or ‘columns’, None}, default None – The axis to sample from.
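The interaction between these parameters can be sketched as follows. The ten-row DataFrame and the seed value 42 are arbitrary choices made for this illustration; the point is only that a fixed random_state gives a repeatable sample, and that n and frac are alternatives rather than companions.

```python
import pandas as pd

# Hypothetical ten-row DataFrame used only for this sketch.
df = pd.DataFrame({"value": range(10)})

# The same random_state gives the same rows on every call.
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)
same = first.equals(second)

# With replace=False (the default), no row is drawn twice.
unique_rows = first.index.is_unique

# Passing both n and frac is rejected with a ValueError.
try:
    df.sample(n=3, frac=0.5)
    raised = False
except ValueError:
    raised = True

print(same, unique_rows, raised)
```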
Example 1: Simple example of pandas sample function
We will now look at some examples of the pandas sample function. In this first example, after creating a DataFrame, we take a sample of three values from the water_content column by specifying n=3.
df = pd.DataFrame({'seed_count': [16, 40, 0, 2],
'water_content': [20, 50, 10, 30],
'quantity': [10, 2, 1, 8]},
index=['orange', 'watermelon', 'pineapple', 'apple'])
df
| | seed_count | water_content | quantity |
---|---|---|---|
orange | 16 | 20 | 10 |
watermelon | 40 | 50 | 2 |
pineapple | 0 | 10 | 1 |
apple | 2 | 30 | 8 |
df['water_content'].sample(n=3, random_state=1)
apple 30
pineapple 10
orange 20
Name: water_content, dtype: int64
Example 2: Using random and replace parameters in pandas sample
In this second example, the frac, replace, and random_state parameters are provided. The frac parameter returns a fraction of the items on the axis, so frac=0.5 returns two of the four rows here, and replace=True allows the same row to be drawn more than once.
df.sample(frac=0.5, replace=True, random_state=1)
| | seed_count | water_content | quantity |
---|---|---|---|
watermelon | 40 | 50 | 2 |
apple | 2 | 30 | 8 |
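To see how frac determines the size of the result, the sketch below rebuilds the same DataFrame and compares three settings. The variable names half, shuffled, and oversampled are illustrative only. Note that a frac greater than 1 is only accepted when replace=True, since it asks for more rows than exist.

```python
import pandas as pd

# Same DataFrame as in the examples above.
df = pd.DataFrame({'seed_count': [16, 40, 0, 2],
                   'water_content': [20, 50, 10, 30],
                   'quantity': [10, 2, 1, 8]},
                  index=['orange', 'watermelon', 'pineapple', 'apple'])

# frac=0.5 of 4 rows gives 2 rows.
half = df.sample(frac=0.5, replace=True, random_state=1)

# frac=1 without replacement returns every row once, in random order.
shuffled = df.sample(frac=1, random_state=1)

# frac=2 with replacement draws 8 rows, so some rows must repeat.
oversampled = df.sample(frac=2, replace=True, random_state=1)

print(len(half), len(shuffled), len(oversampled))
```

Using frac=1 in this way is a common idiom for shuffling the rows of a DataFrame.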
Pandas Sort_Values : sort_values()
This pandas function sorts the values along either axis.
Syntax
pandas.DataFrame.sort_values(by, axis, ascending, inplace, kind, na_position, ignore_index)
by : str or list of str – The name or list of names to sort by.
axis : {0 or ‘index’, 1 or ‘columns’}, default 0 – The axis to be sorted.
ascending : bool or list of bool, default True – Whether to sort in ascending or descending order. A list of bools can be passed to set the order separately for each name in by.
inplace : bool, default False – If True, the sort is performed in place on the original object instead of returning a new one.
kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’ – The sorting algorithm to use.
na_position : {‘first’, ‘last’}, default ‘last’ – Places the NaN values either at the beginning or at the end.
ignore_index : bool, default False – If True, the original index is discarded and the result is labeled 0, 1, …, n - 1.
The function returns a DataFrame with sorted values, or None when inplace=True.
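Several of these parameters can be combined in one call. The sketch below uses a small made-up DataFrame (the group and score columns are hypothetical) and sorts by two columns with a different direction for each, while also resetting the index.

```python
import pandas as pd

# Hypothetical data, invented for this sketch only.
df = pd.DataFrame({'group': ['b', 'a', 'b', 'a'],
                   'score': [1, 3, 2, 4]})

# Sort by group ascending, then by score descending within each group.
# ignore_index=True relabels the result 0..n-1 instead of keeping old labels.
result = df.sort_values(by=['group', 'score'],
                        ascending=[True, False],
                        ignore_index=True)
print(result)
```

Because inplace is left at its default of False, df itself keeps its original row order and result is a new DataFrame.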
Example 1: Simple example of sort_values() function in pandas
After creating a DataFrame, in this example, we are performing the sorting on the column titled “col1”.
df = pd.DataFrame({'col1': ['P', 'Q', 'A', np.nan, 'R', 'C'],
'col2': [7, np.nan, 9, 2, 8, 5],
'col3': [2, 9, 7, 9, np.nan, 1],})
df
| | col1 | col2 | col3 |
---|---|---|---|
0 | P | 7.0 | 2.0 |
1 | Q | NaN | 9.0 |
2 | A | 9.0 | 7.0 |
3 | NaN | 2.0 | 9.0 |
4 | R | 8.0 | NaN |
5 | C | 5.0 | 1.0 |
As we can see in the output below, the letters in col1 are sorted in alphabetical order, and the row whose col1 is NaN is placed last.
df.sort_values(by=['col1'])
| | col1 | col2 | col3 |
---|---|---|---|
2 | A | 9.0 | 7.0 |
5 | C | 5.0 | 1.0 |
0 | P | 7.0 | 2.0 |
1 | Q | NaN | 9.0 |
4 | R | 8.0 | NaN |
3 | NaN | 2.0 | 9.0 |
Example 2: Using ascending parameter as false in pandas sort
In this example, the ascending parameter is set to False, so col1 is sorted in descending order. The NaN row still comes last because na_position defaults to ‘last’.
df.sort_values(by='col1', ascending=False)
| | col1 | col2 | col3 |
---|---|---|---|
4 | R | 8.0 | NaN |
1 | Q | NaN | 9.0 |
0 | P | 7.0 | 2.0 |
5 | C | 5.0 | 1.0 |
2 | A | 9.0 | 7.0 |
3 | NaN | 2.0 | 9.0 |
Example 3: Using na_position parameter in pandas sort
With the na_position parameter, we can move the NaN values to the start instead of the default position, which is last.
NOTE – The sort order and the na_position setting are determined only by the column named in the by parameter. NaN values in other columns do not affect the ordering, and each row moves as a whole.
df.sort_values(by='col1', ascending=False, na_position='first')
| | col1 | col2 | col3 |
---|---|---|---|
3 | NaN | 2.0 | 9.0 |
4 | R | 8.0 | NaN |
1 | Q | NaN | 9.0 |
0 | P | 7.0 | 2.0 |
5 | C | 5.0 | 1.0 |
2 | A | 9.0 | 7.0 |
For the same reason, when we sort by col2 instead, the row whose col2 is NaN appears at the start.
df.sort_values(by='col2', ascending=False, na_position='first')
| | col1 | col2 | col3 |
---|---|---|---|
1 | Q | NaN | 9.0 |
2 | A | 9.0 | 7.0 |
4 | R | 8.0 | NaN |
0 | P | 7.0 | 2.0 |
5 | C | 5.0 | 1.0 |
3 | NaN | 2.0 | 9.0 |
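One detail worth checking is that the position of NaN values is controlled by na_position alone and not by ascending. The sketch below rebuilds the same DataFrame and sorts by col2 in both directions while leaving na_position at its default; the names asc and desc are illustrative only.

```python
import numpy as np
import pandas as pd

# Same DataFrame as in the examples above.
df = pd.DataFrame({'col1': ['P', 'Q', 'A', np.nan, 'R', 'C'],
                   'col2': [7, np.nan, 9, 2, 8, 5],
                   'col3': [2, 9, 7, 9, np.nan, 1]})

# na_position is left at its default ('last') in both calls.
asc = df.sort_values(by='col2')
desc = df.sort_values(by='col2', ascending=False)

print(asc.index.tolist())   # [3, 5, 0, 4, 2, 1]
print(desc.index.tolist())  # [2, 4, 0, 5, 3, 1]
```

In both results, row 1 (the row with NaN in col2) stays at the end; only the order of the non-missing values is reversed.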
Conclusion
We will end this article here. In this tutorial, we discussed pandas functions that provide a different view or a subset of a DataFrame: crosstab(), sample(), and sort_values(). These functions help us view DataFrames differently when extracting information.
Reference – https://pandas.pydata.org/docs/
I am Palash Sharma, an undergraduate student who loves to explore and garner in-depth knowledge in the fields like Artificial Intelligence and Machine Learning. I am captivated by the wonders these fields have produced with their novel implementations. With this, I have a desire to share my knowledge with others in all my capacity.