Pandas Tutorial – crosstab(), sample() and sort_values()

Introduction

In machine learning and data science projects that use the Pandas library, most of our work involves handling dataframes. Apart from the default view of a dataframe, we often need an alternative view of the data, a subset of it, or a sorted version. In this tutorial, we will learn about the pandas functions crosstab(), sample(), and sort_values(), which help us represent dataframes in alternative ways to gain more insights.

Importing Pandas Library

We will commence this tutorial by importing the pandas library.

In [1]:
import pandas as pd

Pandas Crosstab : crosstab()

Using the crosstab() function, we can compute a cross-tabulation of two or more factors. By default, it builds a frequency table that counts how often each combination of values occurs.

Syntax

pandas.crosstab(index,columns,values=None,rownames,colnames,dropna)

index : array-like, Series, or list of arrays/Series – These values are used for grouping in the rows.

columns : array-like, Series, or list of arrays/Series – The values are used for grouping in the columns.

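values : array-like, optional – The array of values to aggregate according to the factors. It requires an aggregation function to be passed through the aggfunc parameter, which is not listed in the short syntax above.

rownames : sequence, optional – Names for the row levels of the result. If passed, it must contain one name per row array.

colnames : sequence, optional – Names for the column levels of the result. If passed, it must contain one name per column array.

dropna : bool, default True – When True, columns whose entries are all NaN are left out of the result.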

The result of this function is a dataframe with the cross-tabulation of data.
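The difference between plain counting and aggregating with values can be sketched as follows. This is a minimal illustration, not part of the original examples: the fruit, size and price Series are hypothetical data invented here, and "sum" is just one possible choice for aggfunc.

```python
import pandas as pd

# Hypothetical records, invented for this illustration only.
fruit = pd.Series(["mango", "mango", "orange", "orange", "mango"], name="fruit")
size = pd.Series(["small", "large", "small", "small", "small"], name="size")
price = pd.Series([10, 20, 5, 7, 12], name="price")

# Without values, crosstab counts how often each (fruit, size) pair occurs.
counts = pd.crosstab(fruit, size)

# With values and aggfunc, each cell aggregates the prices for that pair.
totals = pd.crosstab(fruit, size, values=price, aggfunc="sum")

print(counts)
print(totals)
```

Because the inputs are named Series, their names are used as the row and column labels, so rownames and colnames are not needed here.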

Example 1: Simple example of pandas crosstab function

Here three arrays are built, and the pandas crosstab function is then used to count how often each combination of their values occurs.

In [2]:
import numpy as np

a = np.array(["mango", "mango", "mango", "mango", "orange", "orange",
              "orange", "orange", "mango", "mango", "mango"], dtype=object)

a
Out[2]:
array(['mango', 'mango', 'mango', 'mango', 'orange', 'orange', 'orange',
       'orange', 'mango', 'mango', 'mango'], dtype=object)
In [3]:
b = np.array(["one", "one", "one", "two", "one", "one",
               "one", "two", "two", "two", "one"], dtype=object)

b
Out[3]:
array(['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', 'two',
       'two', 'one'], dtype=object)
In [4]:
c = np.array(["watermelon", "watermelon", "strawberry", "watermelon", "watermelon", "strawberry",
               "strawberry", "watermelon", "strawberry", "strawberry", "strawberry"],dtype=object)

c
Out[4]:
array(['watermelon', 'watermelon', 'strawberry', 'watermelon',
       'watermelon', 'strawberry', 'strawberry', 'watermelon',
       'strawberry', 'strawberry', 'strawberry'], dtype=object)

The first array is mapped to the rows and the other two arrays to the columns, so that the three arrays can be viewed in tabular format.

In [5]:
pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
Out[5]:
b             one                   two
c      strawberry watermelon strawberry watermelon
a
mango           2          2          2          1
orange          2          1          0          1
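Since two arrays are passed for the columns, the result has two column levels. The sketch below rebuilds the same table and shows one possible way to read values out of it; the variable name table is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

a = np.array(["mango", "mango", "mango", "mango", "orange", "orange",
              "orange", "orange", "mango", "mango", "mango"], dtype=object)
b = np.array(["one", "one", "one", "two", "one", "one",
              "one", "two", "two", "two", "one"], dtype=object)
c = np.array(["watermelon", "watermelon", "strawberry", "watermelon",
              "watermelon", "strawberry", "strawberry", "watermelon",
              "strawberry", "strawberry", "strawberry"], dtype=object)

table = pd.crosstab(a, [b, c], rownames=["a"], colnames=["b", "c"])

# A single cell is addressed with a (b, c) tuple for the column.
print(table.loc["mango", ("one", "strawberry")])  # 2

# Selecting only the outer level returns the sub-table where b is "one".
print(table["one"])

# Each row total equals the number of times that fruit appears in a.
print(table.sum(axis=1))
```

The row totals are 7 for mango and 4 for orange, which together account for all 11 elements of the arrays.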

Example 2: Using Categorical and NaN values with pandas crosstab

Here the pandas crosstab function is used with categorical and NaN values.

In [6]:
first = pd.Categorical(['p', 'q'], categories=['p', 'q', 'r'])
In [7]:
first
Out[7]:
[p, q]
Categories (3, object): [p, q, r]
In [8]:
second = pd.Categorical(['x', 'y'], categories=['x', 'y', 'z'])
In [9]:
second
Out[9]:
[x, y]
Categories (3, object): [x, y, z]

Since the dropna parameter of the crosstab() function defaults to True, the unused categories “r” and “z” are dropped from the output.

In [10]:
pd.crosstab(first, second)
Out[10]:
col_0  x  y
row_0
p      1  0
q      0  1

To get “r” and “z” in the output, the dropna parameter is set to False. The result then includes the two categories that were not present earlier.

In [11]:
pd.crosstab(first, second, dropna=False)
Out[11]:
col_0  x  y  z
row_0
p      1  0  0
q      0  1  0
r      0  0  0


Pandas Sample : sample()

The pandas sample() function is used for returning a random sample of items from an axis of the object.

Syntax

pandas.DataFrame.sample(n,frac,replace,random_state,axis)

n : int,optional – This value specifies the number of items to be returned from the axis of the object.

frac : float,optional – This value tells us the fraction of axis items to return.

replace : bool, default False – This parameter controls whether the same row may be sampled more than once.

random_state : int or numpy.random.RandomState, optional – A seed or random number generator. Passing the same value makes the sample reproducible.

axis : {0 or ‘index’, 1 or ‘columns’, None}, default None – This is the axis from where sample is taken.
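How n, frac and axis interact can be sketched with a small frame. The DataFrame below and its column names x and y are hypothetical, made up only to show the resulting shapes.

```python
import pandas as pd

# Hypothetical 10-row frame, used only to show the shapes of the samples.
df = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# n asks for an absolute number of rows (rows are the default axis).
three_rows = df.sample(n=3, random_state=0)

# frac asks for a proportion instead: 0.5 of 10 rows gives 5 rows.
half_rows = df.sample(frac=0.5, random_state=0)

# axis="columns" samples column labels rather than row labels.
one_column = df.sample(n=1, axis="columns", random_state=0)

print(three_rows.shape, half_rows.shape, one_column.shape)
# (3, 2) (5, 2) (10, 1)
```

Note that n and frac are alternatives: passing both in the same call raises a ValueError.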

Example 1: Simple example of pandas sample function

We will now look at some examples of the pandas sample function. In this first example, after creating a DataFrame, a sample is taken from the water_content column by specifying “n” as 3.

In [12]:
df = pd.DataFrame({'seed_count': [16, 40, 0, 2],
                   'water_content': [20, 50, 10, 30],
                   'quantity': [10, 2, 1, 8]},
                  index=['orange', 'watermelon', 'pineapple', 'apple'])
In [13]:
df
Out[13]:
            seed_count  water_content  quantity
orange              16             20        10
watermelon          40             50         2
pineapple            0             10         1
apple                2             30         8
In [14]:
df['water_content'].sample(n=3, random_state=1)
Out[14]:
apple        30
pineapple    10
orange       20
Name: water_content, dtype: int64

Example 2: Using random_state and replace parameters in pandas sample

In this second example, the frac, replace and random_state parameters are provided. The frac parameter returns a fraction of the items on the axis, here 0.5 of the 4 rows, i.e. 2 rows. With replace set to True the same row may be drawn more than once, and random_state makes the draw reproducible.

In [15]:
df.sample(frac=0.5, replace=True, random_state=1)
Out[15]:
            seed_count  water_content  quantity
watermelon          40             50         2
apple                2             30         8
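The effect of random_state and replace can be checked with a short sketch on the same DataFrame. The seed 42 and the variable names are arbitrary choices made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'seed_count': [16, 40, 0, 2],
                   'water_content': [20, 50, 10, 30],
                   'quantity': [10, 2, 1, 8]},
                  index=['orange', 'watermelon', 'pineapple', 'apple'])

# The same random_state gives the same sample on every call.
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)
print(first.equals(second))  # True

# With replace=True we may draw more rows than the frame contains,
# so at least one of the 4 labels has to repeat among the 6 draws.
oversample = df.sample(n=6, replace=True, random_state=42)
print(len(oversample))                    # 6
print(oversample.index.has_duplicates)    # True
```

Without replace=True, asking for more rows than the DataFrame holds raises a ValueError.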

Pandas Sort_Values : sort_values()

This pandas function is used to sort values along either axis.

Syntax

pandas.DataFrame.sort_values(by,axis,ascending,inplace,kind,na_position,ignore_index)

by : str or list of str – Here the name or list of names (column labels when axis is 0) to sort by is provided.

axis : {0 or ‘index’, 1 or ‘columns’}, default 0 – This is the axis where sorting should take place.

ascending : bool or list of bool, default True – This specifies the sort direction, which can be either ascending or descending. When a list is given, it must contain one entry per name in by.

inplace : bool, default False – If True, the sorting is performed in place and the function returns None.

kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’ – Here we can choose the kind of sorting which will be performed.

na_position : {‘first’, ‘last’}, default ‘last’ – This parameter places all NaN values either at the beginning or at the end.

ignore_index : bool, default False – If True, the original index is discarded and the result is labelled 0, 1, …, n – 1.

Finally, a DataFrame with sorted values is returned by this function.
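The by, ascending and ignore_index parameters can be combined, as in the sketch below. The DataFrame and its fruit and quantity columns are hypothetical values chosen only for this illustration.

```python
import pandas as pd

# Hypothetical data, invented for this illustration only.
df = pd.DataFrame({"fruit": ["mango", "orange", "mango", "orange"],
                   "quantity": [3, 5, 8, 1]})

# Sort by fruit (A to Z) and, within each fruit, by quantity (largest first).
# ignore_index=True relabels the rows 0, 1, 2, 3 after sorting.
result = df.sort_values(by=["fruit", "quantity"],
                        ascending=[True, False],
                        ignore_index=True)

print(result["fruit"].tolist())     # ['mango', 'mango', 'orange', 'orange']
print(result["quantity"].tolist())  # [8, 3, 5, 1]
print(result.index.tolist())        # [0, 1, 2, 3]
```

The original df is left unchanged because inplace keeps its default value of False.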

Example 1: Simple example of sort_values() function in pandas

After creating a DataFrame, in this example, we are performing the sorting on the column titled “col1”.

In [16]:
df = pd.DataFrame({'col1': ['P', 'Q', 'A', np.nan, 'R', 'C'],
                   'col2': [7, np.nan, 9, 2, 8, 5],
                   'col3': [2, 9, 7, 9, np.nan, 1]})
In [17]:
df
Out[17]:
  col1  col2  col3
0    P   7.0   2.0
1    Q   NaN   9.0
2    A   9.0   7.0
3  NaN   2.0   9.0
4    R   8.0   NaN
5    C   5.0   1.0

Sorting by “col1” arranges the letters in alphabetical order and places the row containing NaN last.

In [18]:
df.sort_values(by=['col1'])
Out[18]:
  col1  col2  col3
2    A   9.0   7.0
5    C   5.0   1.0
0    P   7.0   2.0
1    Q   NaN   9.0
4    R   8.0   NaN
3  NaN   2.0   9.0

Example 2: Using ascending parameter as false in pandas sort

In this example, the ascending parameter is set to False, so “col1” is sorted in reverse alphabetical order. The row containing NaN is still placed last.

In [19]:
df.sort_values(by='col1', ascending=False)
Out[19]:
  col1  col2  col3
4    R   8.0   NaN
1    Q   NaN   9.0
0    P   7.0   2.0
5    C   5.0   1.0
2    A   9.0   7.0
3  NaN   2.0   9.0

Example 3: Using na_position parameter in pandas sort

When we use the na_position parameter, we can move the NaN values to the start instead of the default position, i.e. last.

NOTE – This parameter, like the other options of sort_values(), acts only on the column specified in the by parameter. Each row is moved as a whole, so NaN values in the other columns simply stay with their rows.

In [20]:
df.sort_values(by='col1', ascending=False, na_position='first')
Out[20]:
  col1  col2  col3
3  NaN   2.0   9.0
4    R   8.0   NaN
1    Q   NaN   9.0
0    P   7.0   2.0
5    C   5.0   1.0
2    A   9.0   7.0

For the same reason, when we sort by “col2” instead, the row with NaN in “col2” is the one placed at the start.

In [21]:
df.sort_values(by='col2', ascending=False, na_position='first')
Out[21]:
  col1  col2  col3
1    Q   NaN   9.0
2    A   9.0   7.0
4    R   8.0   NaN
0    P   7.0   2.0
5    C   5.0   1.0
3  NaN   2.0   9.0
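The three sorts on “col1” can be compared side by side by looking only at the order of the index labels. This is a small sketch over the same DataFrame; the variable names asc, desc and nan_first are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['P', 'Q', 'A', np.nan, 'R', 'C'],
                   'col2': [7, np.nan, 9, 2, 8, 5],
                   'col3': [2, 9, 7, 9, np.nan, 1]})

asc = df.sort_values(by='col1')
desc = df.sort_values(by='col1', ascending=False)
nan_first = df.sort_values(by='col1', ascending=False, na_position='first')

# Row 3 holds the NaN in col1. It is last in both directions by default
# and moves to the front only when na_position='first' is passed.
print(asc.index.tolist())        # [2, 5, 0, 1, 4, 3]
print(desc.index.tolist())       # [4, 1, 0, 5, 2, 3]
print(nan_first.index.tolist())  # [3, 4, 1, 0, 5, 2]

# Rows move as a whole: row 3 keeps its col2 and col3 values.
print(nan_first.loc[3, 'col2'], nan_first.loc[3, 'col3'])  # 2.0 9.0
```

In other words, the ascending parameter orders only the non-missing values, while na_position alone decides where the NaN rows go.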

Conclusion

We will end this article here. In this tutorial, we discussed pandas functions that are useful for providing a different view or a subset of a DataFrame. The functions we learned are crosstab(), sample() and sort_values(). They help us view dataframes differently in order to extract information.

Reference – https://pandas.pydata.org/docs/

  • Palash Sharma

    I am Palash Sharma, an undergraduate student who loves to explore and garner in-depth knowledge in the fields like Artificial Intelligence and Machine Learning. I am captivated by the wonders these fields have produced with their novel implementations. With this, I have a desire to share my knowledge with others in all my capacity.
