When you use Python (3.6.2) for data analysis, the Pandas library (0.20.3) is typically used to navigate efficiently through your datasets. You select single values, slice the datasets by row or column or transfer a subset of data to a different variable.

When you do that, chances are very high that you will be using Pandas’ `.loc[]` and `.iloc[]` methods*. Rather than writing a straightforward tutorial in this blog post, I have reviewed what is already out there and summed it up. I have listed the links at the end of this post and tried to reference them clearly in the text.

I will first address the difference between .iloc and .loc. Unfortunately, the Pandas documentation on .loc and .iloc can be a little unclear to the beginner because it comes without examples.

Thankfully, the people doing the documentation did a better job in their section on Indexing and Selecting Data. But even if that is still unclear, the answers to these 2 StackOverflow questions (SO #1 & SO #2) explain it very well.

There are already several wordy articles about this topic, which I linked above, so I will try to sum it up through examples.

Let’s create a 4×4 DataFrame of random floats as shown below.

    In [1]: import numpy as np
            import pandas as pd
            df = pd.DataFrame(np.random.randn(4, 4), columns=['a', 'b', 'c', 'd'], index=['W', 'X', 'Y', 'Z'])
            df
    Out[2]:
              a         b         c         d
    W -0.626746 -0.692536  0.565077  0.056630
    X -0.426528 -0.058166  0.338856  0.958092
    Y  0.088790  0.038131  1.738721  0.069772
    Z  0.542006 -1.592502  1.241519  0.323274
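Because `np.random.randn` draws new numbers on every run, your values will differ from the ones printed above. If you want to follow along with exactly these numbers, one option is to build the frame from literal values. This is only a minimal sketch; the values are copied from the output above.

```python
import pandas as pd

# Values copied from the output shown in this post, so that every
# example further down can be reproduced exactly.
df = pd.DataFrame(
    [[-0.626746, -0.692536, 0.565077, 0.056630],
     [-0.426528, -0.058166, 0.338856, 0.958092],
     [0.088790, 0.038131, 1.738721, 0.069772],
     [0.542006, -1.592502, 1.241519, 0.323274]],
    index=['W', 'X', 'Y', 'Z'],
    columns=['a', 'b', 'c', 'd'],
)

print(df)
```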

**.loc works with labels: .loc[<row>, <column>]**^{4}

#### Basics

    ################
    ## Single Row ##
    ################
    df.loc['W']
    Out[5]:
    a   -0.626746
    b   -0.692536
    c    0.565077
    d    0.056630
    Name: W, dtype: float64

    ###################
    ## Slice of Rows ##
    ###################
    df.loc['W':'Y']
    Out[6]:
              a         b         c         d
    W -0.626746 -0.692536  0.565077  0.056630
    X -0.426528 -0.058166  0.338856  0.958092
    Y  0.088790  0.038131  1.738721  0.069772

Please note that in the latter example the end label of the slice ('Y') is included in the result. This is in contrast to slicing Python lists and tuples, where the end position is excluded.
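To see the two behaviours side by side, here is a small sketch. The frame uses stand-in integer values (an assumption made purely for brevity), since only the labels matter for this comparison.

```python
import numpy as np
import pandas as pd

# Stand-in values 0..15; only the row and column labels matter here.
df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['W', 'X', 'Y', 'Z'],
                  columns=['a', 'b', 'c', 'd'])

labels = list(df.index)

# Plain Python slicing: the end position (2) is excluded.
list_slice = labels[0:2]

# Label-based slicing with .loc: the end label ('Y') is included.
loc_rows = list(df.loc['W':'Y'].index)

print(list_slice)  # ['W', 'X']
print(loc_rows)    # ['W', 'X', 'Y']
```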

    ##################################################
    ## Pass a list of labels for a custom selection ##
    ##################################################
    df.loc[['X', 'Z']]
    Out[7]:
              a         b         c         d
    X -0.426528 -0.058166  0.338856  0.958092
    Z  0.542006 -1.592502  1.241519  0.323274

Please note that the list of labels goes inside the indexer's own square brackets, which is why there are two pairs of brackets.

    #####################################################
    ## All rows but only one column to obtain a series ##
    #####################################################
    In [8]: df.loc[:, 'a']
    Out[8]:
    W   -0.626746
    X   -0.426528
    Y    0.088790
    Z    0.542006
    Name: a, dtype: float64

    ########################################################
    ## Slicing to obtain a specific part of the DataFrame ##
    ########################################################
    In [9]: df.loc['X':'Z', 'a':'b']
    Out[9]:
              a         b
    X -0.426528 -0.058166
    Y  0.088790  0.038131
    Z  0.542006 -1.592502

    In [10]: df.loc[:, :'c']
    Out[10]:
              a         b         c
    W -0.626746 -0.692536  0.565077
    X -0.426528 -0.058166  0.338856
    Y  0.088790  0.038131  1.738721
    Z  0.542006 -1.592502  1.241519

Now, just to expand on this concept a little and to challenge yourself, what do you think will be the output for the three examples below?

    df.loc[:'Y']

    df.loc['X':]

    df.loc['W':, 'b':'c']
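One way to check your answers is to look only at which row and column labels come back. The sketch below does that on a frame with stand-in integer values (an assumption for brevity; the labels are the same as in the post).

```python
import numpy as np
import pandas as pd

# Stand-in values 0..15; the labels match the DataFrame used in this post.
df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['W', 'X', 'Y', 'Z'],
                  columns=['a', 'b', 'c', 'd'])

first = df.loc[:'Y']            # from the first row up to and including 'Y'
second = df.loc['X':]           # from 'X' to the last row
third = df.loc['W':, 'b':'c']   # all rows, columns 'b' through 'c' inclusive

for name, result in [('first', first), ('second', second), ('third', third)]:
    print(name, list(result.index), list(result.columns))
```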

#### Boolean Arrays

When you apply a comparison to a DataFrame or Series, you get back an object of the same shape filled with Booleans, e.g. to check whether values are larger than 0.

    [IN]  df.loc['X'] > 0
    [OUT]
    a    False
    b    False
    c     True
    d     True
    Name: X, dtype: bool

This is helpful because you can pass the result into another .loc call to filter for specific columns or rows.

    In [18]: df.loc[:, df.loc['X']>0]
    Out[18]:
              c         d
    W  0.565077  0.056630
    X  0.338856  0.958092
    Y  1.738721  0.069772
    Z  1.241519  0.323274

    In [21]: df.loc[df.loc[:, 'a']>0]
    Out[21]:
              a         b         c         d
    Y  0.088790  0.038131  1.738721  0.069772
    Z  0.542006 -1.592502  1.241519  0.323274

Read these as *“in all rows of DataFrame df, show the columns in which the value in row ‘X’ is larger than 0”* and *“in all columns, show only the rows in which the value in column ‘a’ is larger than 0”*, respectively.
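The two filters can also be combined in a single .loc call by giving the masks names first. The following is just a sketch: the variable names `row_mask` and `col_mask` are my own, and the values are the ones printed in this post.

```python
import pandas as pd

# Values copied from the DataFrame shown in this post.
df = pd.DataFrame(
    [[-0.626746, -0.692536, 0.565077, 0.056630],
     [-0.426528, -0.058166, 0.338856, 0.958092],
     [0.088790, 0.038131, 1.738721, 0.069772],
     [0.542006, -1.592502, 1.241519, 0.323274]],
    index=['W', 'X', 'Y', 'Z'],
    columns=['a', 'b', 'c', 'd'],
)

row_mask = df.loc[:, 'a'] > 0   # one Boolean per row label
col_mask = df.loc['X'] > 0      # one Boolean per column label

# Rows where column 'a' is positive, columns where row 'X' is positive.
subset = df.loc[row_mask, col_mask]
print(subset)
```

Naming the masks keeps the selection readable once the conditions get longer than a single comparison.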

Now see if you have got the concept. What does the following return?

    df.loc[df.loc[:, 'c']<0.5]

#### Callable Functions

.loc also accepts a callable. This works when the function takes one argument (the DataFrame being indexed) and returns an output that can be used by .loc, e.g. a list of existing labels or a Boolean Series (examples taken from the pandas documentation).

    In [23]: df.loc[lambda x: ['X', 'Z']]
    Out[23]:
              a         b         c         d
    X -0.426528 -0.058166  0.338856  0.958092
    Z  0.542006 -1.592502  1.241519  0.323274

    In [24]: df.loc[lambda df: df.a > 0, :]
    Out[24]:
              a         b         c         d
    Y  0.088790  0.038131  1.738721  0.069772
    Z  0.542006 -1.592502  1.241519  0.323274
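Where a callable tends to pay off is in a method chain, because it receives the intermediate DataFrame rather than the original one. The sketch below is my own illustration, not from the sources: the `total` column is a hypothetical helper created with `assign`, so it only exists on the intermediate frame.

```python
import pandas as pd

# Values copied from the DataFrame shown in this post.
df = pd.DataFrame(
    [[-0.626746, -0.692536, 0.565077, 0.056630],
     [-0.426528, -0.058166, 0.338856, 0.958092],
     [0.088790, 0.038131, 1.738721, 0.069772],
     [0.542006, -1.592502, 1.241519, 0.323274]],
    index=['W', 'X', 'Y', 'Z'],
    columns=['a', 'b', 'c', 'd'],
)

result = (
    df
    # 'total' is a hypothetical helper column: the sum of each row.
    .assign(total=lambda d: d.sum(axis=1))
    # The callable sees the frame that already contains 'total'.
    .loc[lambda d: d['total'] > 0]
)

print(result)
```

Without the callable you would have to store the intermediate frame in a variable first, because `df` itself has no `total` column.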

**.iloc works on the *positions* in the index (integers only)**^{4}

#### Basics

    ################
    ## Single Row ##
    ################
    In [39]: df.iloc[2]
    Out[39]:
    a    0.088790
    b    0.038131
    c    1.738721
    d    0.069772
    Name: Y, dtype: float64

    ###################
    ## Slice of Rows ##
    ###################
    In [40]: df.iloc[1:3]
    Out[40]:
              a         b         c         d
    X -0.426528 -0.058166  0.338856  0.958092
    Y  0.088790  0.038131  1.738721  0.069772

Please note that for iloc the last position is NOT included in the slice! This is in contrast to .loc, but like the behaviour that can be observed in Python lists and tuples.

The next examples are similar to those in the .loc section, so they do not need much explanation.

    ###################################################
    ## Pass a list of indexes for a custom selection ##
    ###################################################
    In [41]: df.iloc[[1,3]]
    Out[41]:
              a         b         c         d
    X -0.426528 -0.058166  0.338856  0.958092
    Z  0.542006 -1.592502  1.241519  0.323274

    #####################################################
    ## All rows but only one column to obtain a series ##
    #####################################################
    In [42]: df.iloc[:, 0]
    Out[42]:
    W   -0.626746
    X   -0.426528
    Y    0.088790
    Z    0.542006
    Name: a, dtype: float64

    ########################################################
    ## Slicing to obtain a specific part of the DataFrame ##
    ########################################################
    In [43]: df.iloc[2:4, 1:]
    Out[43]:
              b         c         d
    Y  0.038131  1.738721  0.069772
    Z -1.592502  1.241519  0.323274

Now, test yourself:

    df.iloc[:3]

    df.iloc[0:]

    df.iloc[0:3, ::-1]
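As with the .loc quiz, you can verify your answers by looking at the labels that come back. This sketch again uses stand-in integer values (an assumption for brevity) with the same labels as in the post.

```python
import numpy as np
import pandas as pd

# Stand-in values 0..15; the labels match the DataFrame used in this post.
df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['W', 'X', 'Y', 'Z'],
                  columns=['a', 'b', 'c', 'd'])

first = df.iloc[:3]           # positions 0, 1, 2 (position 3 is excluded)
second = df.iloc[0:]          # every row
third = df.iloc[0:3, ::-1]    # first three rows, columns in reverse order

for name, result in [('first', first), ('second', second), ('third', third)]:
    print(name, list(result.index), list(result.columns))
```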

#### Boolean Arrays

Just like for .loc, it is possible to pass Boolean arrays into .iloc. There is one caveat, however: it has to be a NumPy Boolean array, not a pandas Series. This means that the second call below fails:

    In [48]: df.iloc[:]<0
    Out[48]:
           a      b      c      d
    W   True   True  False  False
    X   True   True  False  False
    Y  False  False  False  False
    Z  False   True  False  False

    In [49]: df.iloc[:, df.iloc[1]<0]
    Out[49]: ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

To get this to work, you need to use .values to convert the Boolean Series to a NumPy array; then you can do the same as with .loc. Having said that, the array has to be one-dimensional, i.e. it has to come from a Series of Booleans. A Boolean DataFrame will still give you an error.

So to get the above to work, the following needs to be done.

    In [53]: df.iloc[:, (df.iloc[1]<0).values]
    Out[53]:
              a         b
    W -0.626746 -0.692536
    X -0.426528 -0.058166
    Y  0.088790  0.038131
    Z  0.542006 -1.592502
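If you are on a newer pandas than the 0.20.3 used in this post, `Series.to_numpy()` (added in pandas 0.24) does the same conversion as `.values`. The following is a minimal sketch under that assumption, using the values printed in this post.

```python
import pandas as pd

# Values copied from the DataFrame shown in this post.
df = pd.DataFrame(
    [[-0.626746, -0.692536, 0.565077, 0.056630],
     [-0.426528, -0.058166, 0.338856, 0.958092],
     [0.088790, 0.038131, 1.738721, 0.069772],
     [0.542006, -1.592502, 1.241519, 0.323274]],
    index=['W', 'X', 'Y', 'Z'],
    columns=['a', 'b', 'c', 'd'],
)

# Boolean Series for the row at position 1 ('X'), converted to a
# plain one-dimensional NumPy array so that .iloc accepts it.
mask = (df.iloc[1] < 0).to_numpy()

negative_in_x = df.iloc[:, mask]
print(negative_in_x)
```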

#### Callable Functions

Here the same applies as for .loc. As long as the function takes one argument and returns a usable output, it will work.

    In [54]: df.iloc[lambda df: [1, 3]]
    Out[54]:
              a         b         c         d
    X -0.426528 -0.058166  0.338856  0.958092
    Z  0.542006 -1.592502  1.241519  0.323274

### Why does .loc exist?

While I was trawling through the StackOverflow questions and answers for this particular topic, I came across a rather interesting question. Essentially, it asked why .loc is needed at all, as the following two statements give the same output and run at similar speeds.

    df_user1 = df.loc[df.user_id=='5561']
    df_user1_noloc = df[df.user_id=='5561']
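If you would like to convince yourself that both forms return the same rows, here is a small self-contained check. The `users` frame and its `clicks` column are hypothetical stand-ins, since the original question does not show its data.

```python
import pandas as pd

# Hypothetical stand-in data; the original question does not show its DataFrame.
users = pd.DataFrame({
    'user_id': ['5561', '1234', '5561', '9876'],
    'clicks': [3, 7, 1, 4],
})

mask = users.user_id == '5561'

with_loc = users.loc[mask]   # explicit label-based Boolean selection
without_loc = users[mask]    # plain [] with a Boolean Series

print(with_loc.equals(without_loc))  # True
```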

I invite you all to read the accepted answer to the question, which is beautifully explained and gives great insights into the language. The general gist is that as with most things in coding in general and Python in particular, explicit is better than implicit.

But also, there is in fact a specific corner-case, where the columns are named True and False (for example in classifications). In this particular case, the two approaches yield different outputs. Observe the code and output below (example taken from the link above).

    In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df
    Out[229]:
       False  True
    0      3     1
    1      4     2
    2      5     3

    In [230]: df[[True]]
    Out[230]: ValueError: Item wrong length 1 instead of 3.

    In [231]: df.loc[[True]]
    Out[231]:
       False  True
    0      3     1

### Summary

At the end of this section, I would like to give a little tip of the hat to a blog post on Shane Lynn’s blog that summed up the above in a nice little picture.

I hope this blog post was helpful to you. Let me know what you think in the comment section below. Take care.

\* *Please note: There is also the .ix method that does/did similar things, but it has been deprecated since Pandas 0.20, so I have decided not to cover it.*

###### Links

- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html
- https://pandas.pydata.org/pandas-docs/stable/indexing.html
- https://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation
- https://stackoverflow.com/questions/28757389/loc-vs-iloc-vs-ix-vs-at-vs-iat
- https://stackoverflow.com/questions/41491574/calling-iloc-with-a-boolean-array
- https://stackoverflow.com/questions/38886080/python-pandas-series-why-use-loc
- https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/
- https://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated