Review: Pandas .loc vs. .iloc

When you use Python (3.6.2) for data analysis, the Pandas library (0.20.3) is typically used to navigate efficiently through your datasets. You select single values, slice the datasets by row or column or transfer a subset of data to a different variable.

When you do that, chances are very high that you will be using Pandas’ .loc[] and .iloc[] indexers*. In this blog post, rather than writing a straightforward tutorial, I reviewed what is already out there and summed it up. The links are listed at the end of this post, and I have tried to reference them clearly.

I will first address the difference between .iloc and .loc. Unfortunately, the Pandas documentation on .loc and .iloc can be a little unclear to the beginner, because it is without examples.

Thankfully, the people writing the documentation did a better job in their section on Indexing and Selecting Data. And if that is still unclear, the answers to these two Stack Overflow questions (SO #1 & SO #2, links 4 and 5 below) explain it very well.

There are already several wordy articles about this topic, which are linked at the end of this post, so I will try to sum it up through examples.

Let’s create a 4×4 DataFrame of random floats as shown below.

In[1]
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4,4), columns=['a', 'b', 'c', 'd'], index=['W', 'X', 'Y', 'Z'])

df

Out[2]
          a         b         c         d
W -0.626746 -0.692536  0.565077  0.056630
X -0.426528 -0.058166  0.338856  0.958092
Y  0.088790  0.038131  1.738721  0.069772
Z  0.542006 -1.592502  1.241519  0.323274

.loc works with labels – .loc[<row>, <column>] [4]

Basics

################
## Single Row ##
################

df.loc['W']
Out[5]: 
a   -0.626746
b   -0.692536
c    0.565077
d    0.056630
Name: W, dtype: float64

##################
## Slice of Row ##
##################

df.loc['W':'Y']
Out[6]: 
          a         b         c         d
W -0.626746 -0.692536  0.565077  0.056630
X -0.426528 -0.058166  0.338856  0.958092
Y  0.088790  0.038131  1.738721  0.069772

Please note that in the latter example the end label of the slice ('Y') is included in the result. This is in contrast to slicing Python lists and tuples, where the stop position is excluded.
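To make the contrast concrete, here is a minimal sketch that puts a label slice next to an ordinary list slice. The frame `demo` and its values are made up for illustration and are not the random DataFrame used in this post.

```python
import pandas as pd

# Hypothetical frame with fixed values (not the random df from this post).
demo = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]},
    index=["W", "X", "Y", "Z"],
)

# Label-based slice with .loc: both endpoints are included.
by_label = demo.loc["W":"Y"]

# Plain Python list slice: the stop position is excluded.
labels = ["W", "X", "Y", "Z"]
by_list_slice = labels[0:2]

print(list(by_label.index))  # ['W', 'X', 'Y']
print(by_list_slice)         # ['W', 'X']
```

The inclusive behaviour makes sense for labels: with a label index there is no general notion of "the label after 'Y'", so an exclusive stop would be awkward to express.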

###################################################
## Pass a list  of labels for a custom selection ##
###################################################

df.loc[['X', 'Z']]
Out[7]: 
          a         b         c         d
X -0.426528 -0.058166  0.338856  0.958092
Z  0.542006 -1.592502  1.241519  0.323274

Please note that you have to pass the labels as a list inside the original square brackets, hence the double brackets.
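The double brackets also change what you get back. As a small sketch (the frame `demo` is a made-up example, not the DataFrame from this post), a single label returns a Series, while a list containing that one label returns a DataFrame:

```python
import pandas as pd

# Hypothetical two-row frame, for illustration only.
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["X", "Z"])

row_as_series = demo.loc["X"]    # scalar label -> Series
row_as_frame = demo.loc[["X"]]   # list with one label -> DataFrame

print(type(row_as_series).__name__)  # Series
print(type(row_as_frame).__name__)   # DataFrame
print(row_as_frame.shape)            # (1, 2)
```

This is handy when later code expects a DataFrame even if only one row is selected.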

#####################################################
## All rows but only one column to obtain a series ##
#####################################################

In [8]: df.loc[:, 'a']
Out[8]: 
W   -0.626746
X   -0.426528
Y    0.088790
Z    0.542006
Name: a, dtype: float64

########################################################
## Slicing to obtain a specific part of the DataFrame ##
########################################################

In [9]: df.loc['X':'Z', 'a':'b']
Out[9]: 
          a         b
X -0.426528 -0.058166
Y  0.088790  0.038131
Z  0.542006 -1.592502

In [10]: df.loc[:, :'c']
Out[10]: 
          a         b         c
W -0.626746 -0.692536  0.565077
X -0.426528 -0.058166  0.338856
Y  0.088790  0.038131  1.738721
Z  0.542006 -1.592502  1.241519

Now, just to expand on this concept a little and to challenge yourself, what do you think will be the output for the three examples below? Each expression is followed by its output, so you can check your answer:

df.loc[:'Y']

 

          a         b         c         d
W -0.626746 -0.692536  0.565077  0.056630
X -0.426528 -0.058166  0.338856  0.958092
Y  0.088790  0.038131  1.738721  0.069772

 

 df.loc['X':]

 

          a         b         c         d
X -0.426528 -0.058166  0.338856  0.958092
Y  0.088790  0.038131  1.738721  0.069772
Z  0.542006 -1.592502  1.241519  0.323274

 

df.loc['W':, 'b':'c']

 

          b        c
W -0.692536 0.565077
X -0.058166 0.338856
Y  0.038131 1.738721
Z -1.592502 1.241519

Boolean Arrays

When you apply a comparison to a DataFrame or Series, you get back an object of Booleans with the same shape, e.g. to check whether values are larger than 0.

[IN]
df.loc['X']>0

[OUT]
a    False
b    False
c     True
d     True
Name: X, dtype: bool

This is helpful because you can pass the Boolean result into .loc to filter specific columns or rows.

In [18]: df.loc[:, df.loc['X']>0]
Out[18]: 
          c         d
W  0.565077  0.056630
X  0.338856  0.958092
Y  1.738721  0.069772
Z  1.241519  0.323274

In [21]: df.loc[df.loc[:, 'a']>0]
Out[21]: 
          a         b         c         d
Y  0.088790  0.038131  1.738721  0.069772
Z  0.542006 -1.592502  1.241519  0.323274

Read these as “in all rows of DataFrame df, show the columns in which the value in row ‘X’ is larger than 0” and “in all columns, show only the rows in which the value of column ‘a’ is larger than 0”, respectively.
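Because the random values in this post change on every run, here is a minimal sketch of the same two filters on a fixed frame. The frame `demo` and its values are assumptions for illustration. It also shows that a row mask and a column mask can be combined in one .loc call.

```python
import pandas as pd

# Hypothetical frame with fixed values so the result is reproducible.
demo = pd.DataFrame(
    {"a": [-1.0, -2.0, 3.0], "b": [-4.0, 5.0, -6.0], "c": [7.0, 8.0, 9.0]},
    index=["W", "X", "Y"],
)

col_mask = demo.loc["X"] > 0      # per column: is the value in row 'X' positive?
row_mask = demo.loc[:, "a"] > 0   # per row: is the value in column 'a' positive?

selected_cols = demo.loc[:, col_mask]  # all rows, filtered columns
selected_rows = demo.loc[row_mask]     # filtered rows, all columns
both = demo.loc[row_mask, col_mask]    # both filters in one call

print(list(selected_cols.columns))  # ['b', 'c']
print(list(selected_rows.index))    # ['Y']
print(both)
```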

Now see if you got the concept.

df.loc[df.loc[:, 'c']<0.5]

 

          a         b        c        d
X -0.426528 -0.058166 0.338856 0.958092

 

Callable Functions

.loc also accepts a callable. This works when the function takes one argument (the DataFrame being indexed) and returns something that .loc can use, e.g. an existing label, a list of labels or a Boolean array (example taken from the pandas documentation).

In [23]: df.loc[lambda x: ['X', 'Z']]
Out[23]: 
          a         b         c         d
X -0.426528 -0.058166  0.338856  0.958092
Z  0.542006 -1.592502  1.241519  0.323274

In [24]: df.loc[lambda df: df.a > 0, :]
Out[24]: 
          a         b         c         d
Y  0.088790  0.038131  1.738721  0.069772
Z  0.542006 -1.592502  1.241519  0.323274
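Callables become useful mainly in method chains, where the intermediate DataFrame has no name you could refer to. The following is a sketch under assumed data: the frame `demo` and the column name `total` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame, for illustration only.
demo = pd.DataFrame(
    {"a": [-1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]},
    index=["W", "X", "Y"],
)

result = (
    demo
    # Add a derived column; the lambda receives the frame being built.
    .assign(total=lambda d: d["a"] + d["b"])
    # Filter on the new column: the callable receives the intermediate
    # frame (which already has 'total') and returns a Boolean Series.
    .loc[lambda d: d["total"] > 10, ["a", "total"]]
)

print(result)
```

Without the callable you would have to store the intermediate frame in a variable first, because `demo` itself does not have a `total` column.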

.iloc works on the positions in the index (integers only) [4]

Basics

################
## Single Row ##
################

In [39]: df.iloc[2]
Out[39]: 
a    0.088790
b    0.038131
c    1.738721
d    0.069772
Name: Y, dtype: float64

###################
## Slice of Rows ##
###################


In [40]: df.iloc[1:3]
Out[40]: 
          a         b        c        d
X -0.426528 -0.058166 0.338856 0.958092
Y  0.088790  0.038131 1.738721 0.069772

Please note that for .iloc the last position is NOT included in the slice! This is in contrast to .loc, but matches the behaviour of Python lists and tuples.
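The difference is easiest to trip over when the index labels are themselves integers, because the same slice then means two different things. A minimal sketch, assuming a made-up frame `demo` whose index labels are 1, 2 and 3:

```python
import pandas as pd

# Hypothetical frame whose index labels are integers starting at 1.
demo = pd.DataFrame({"a": [100, 200, 300]}, index=[1, 2, 3])

by_label = demo.loc[1:2]      # labels 1 through 2, end label included
by_position = demo.iloc[1:2]  # position 1 only, stop position excluded

print(list(by_label.index))     # [1, 2]
print(list(by_position.index))  # [2]
```

So with an integer index, .loc[1:2] and .iloc[1:2] select different rows, which is a good reason to always be explicit about which of the two you mean.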

The next examples are similar to the section in .loc so they do not need much explanation.

####################################################
## Pass a list  of indexes for a custom selection ##
####################################################

In [41]: df.iloc[[1,3]]
Out[41]: 
          a         b        c        d
X -0.426528 -0.058166 0.338856 0.958092
Z  0.542006 -1.592502 1.241519 0.323274


#####################################################
## All rows but only one column to obtain a series ##
#####################################################

In [42]: df.iloc[:, 0]
Out[42]: 
W   -0.626746
X   -0.426528
Y    0.088790
Z    0.542006
Name: a, dtype: float64

########################################################
## Slicing to obtain a specific part of the DataFrame ##
########################################################

In [43]: df.iloc[2:4, 1:]
Out[43]: 
          b        c        d
Y  0.038131 1.738721 0.069772
Z -1.592502 1.241519 0.323274

Now, test yourself

df.iloc[:3]

 

          a         b         c         d
W -0.626746 -0.692536  0.565077  0.056630
X -0.426528 -0.058166  0.338856  0.958092
Y  0.088790  0.038131  1.738721  0.069772

 

 df.iloc[0:]

 

          a         b         c         d
W -0.626746 -0.692536  0.565077  0.056630
X -0.426528 -0.058166  0.338856  0.958092
Y  0.088790  0.038131  1.738721  0.069772
Z  0.542006 -1.592502  1.241519  0.323274

 

df.iloc[0:3, ::-1]

 

         d        c         b         a
W 0.056630 0.565077 -0.692536 -0.626746
X 0.958092 0.338856 -0.058166 -0.426528
Y 0.069772 1.738721  0.038131  0.088790

 

Boolean Arrays

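Just like for .loc, it is possible to pass Boolean arrays into .iloc. There is one caveat, however: the mask has to be a NumPy Boolean array, not a pandas Series or DataFrame of Booleans. The comparison below produces Booleans as expected, but passing such a Series straight into .iloc raises an error: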

In [48]: df.iloc[:]<0
Out[48]: 
       a      b      c      d
W   True   True  False  False
X   True   True  False  False
Y  False  False  False  False
Z  False   True  False  False


In [49]: df.iloc[:, df.iloc[1]<0]
Out[49]: ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

To get this to work, use .values to convert the Boolean Series to a NumPy array; then you can filter just as with .loc. Note that this only works for a one-dimensional mask (a Series of Booleans); a Boolean DataFrame converted this way will still give you an error.

So to get the above to work, the following needs to be done.

In [53]: df.iloc[:, (df.iloc[1]<0).values]
Out[53]: 
          a         b
W -0.626746 -0.692536
X -0.426528 -0.058166
Y  0.088790  0.038131
Z  0.542006 -1.592502
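On newer pandas releases the same conversion can be written with .to_numpy() instead of .values. This is a sketch that assumes pandas 0.24 or later (this post was written against 0.20.3, where .to_numpy() is not available); the frame `demo` is made up for illustration.

```python
import pandas as pd

# Hypothetical frame with fixed values, for illustration only.
demo = pd.DataFrame(
    {"a": [-1.0, -2.0], "b": [3.0, -4.0], "c": [5.0, 6.0]},
    index=["W", "X"],
)

# Boolean Series for row position 1 ('X'), converted to a plain NumPy array.
# Assumes pandas >= 0.24 for Series.to_numpy().
mask = (demo.iloc[1] < 0).to_numpy()

negative_cols = demo.iloc[:, mask]

print(mask.tolist())                # [True, True, False]
print(list(negative_cols.columns))  # ['a', 'b']
```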

Callable Functions

Here the same applies as for .loc. As long as the function takes one argument and returns a usable output (for .iloc, integer positions), it will work.

In [54]: df.iloc[lambda df: [1, 3]]
Out[54]: 
          a         b         c         d
X -0.426528 -0.058166  0.338856  0.958092
Z  0.542006 -1.592502  1.241519  0.323274
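As with .loc, a callable in .iloc is mostly useful at the end of a method chain. The sketch below picks the rows with the smallest and largest value of a column after sorting; the frame `demo` is a made-up example.

```python
import pandas as pd

# Hypothetical frame, for illustration only.
demo = pd.DataFrame({"a": [3.0, -1.0, 2.0]}, index=["W", "X", "Y"])

# Sort by 'a', then take the first and last position of the sorted frame.
# The callable receives the sorted frame, so len(d) refers to that frame.
extremes = demo.sort_values("a").iloc[lambda d: [0, len(d) - 1]]

print(list(extremes.index))  # ['X', 'W']
```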

Why does .loc exist?

While I was trawling through the Stack Overflow questions and answers for this particular topic, I came across a rather interesting question (link 7). Essentially, it asked why .loc is needed at all, since the following two statements give the same output and run at similar speeds.

df_user1 = df.loc[df.user_id=='5561']

df_user1_noloc = df[df.user_id=='5561']

I invite you all to read the accepted answer to the question, which is beautifully explained and gives great insights into the language. The general gist is that as with most things in coding in general and Python in particular, explicit is better than implicit.
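To see the equivalence for yourself, here is a minimal sketch with a made-up frame. The column `user_id` follows the question; the `clicks` column and all values are assumptions for illustration. It also shows one thing the explicit form adds: .loc can select columns in the same call.

```python
import pandas as pd

# Hypothetical data; only the column name 'user_id' comes from the question.
users = pd.DataFrame(
    {"user_id": ["5561", "1234", "5561"], "clicks": [3, 7, 5]}
)

with_loc = users.loc[users.user_id == "5561"]
without_loc = users[users.user_id == "5561"]

# For a plain Boolean row mask, both forms return the same rows.
print(with_loc.equals(without_loc))  # True

# .loc can additionally pick a column in the same indexing step.
clicks_only = users.loc[users.user_id == "5561", "clicks"]
print(clicks_only.tolist())  # [3, 5]
```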

But also, there is in fact a specific corner-case, where the columns are named True and False (for example in classifications). In this particular case, the two approaches yield different outputs. Observe the code and output below (example taken from the link above).

In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df
Out[229]: 
   False  True 
0      3      1
1      4      2
2      5      3

In [230]: df[[True]]
Out[230]: ValueError: Item wrong length 1 instead of 3.

In [231]: df.loc[[True]]
Out[231]: 
   False  True 
0      3      1

Summary

At the end of this section, I would like to give a little tip of the hat to a blog post on Shane Lynn’s blog that summed up the above in a nice little picture.

[Image: loc & iloc usage summary from Shane Lynn's Blog (link 8)]

I hope this blog post was helpful to you. Let me know what you think in the comment section below. Take care.

*Please note: There is also the .ix indexer, which does/did similar things, but it has been deprecated since Pandas 0.20 (link 9), so I have decided not to cover it.

Links
  1. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html
  2. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html
  3. https://pandas.pydata.org/pandas-docs/stable/indexing.html
  4. https://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation
  5. https://stackoverflow.com/questions/28757389/loc-vs-iloc-vs-ix-vs-at-vs-iat
  6. https://stackoverflow.com/questions/41491574/calling-iloc-with-a-boolean-array
  7. https://stackoverflow.com/questions/38886080/python-pandas-series-why-use-loc
  8. https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/
  9. https://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated