Anmol Tomar


Don’t use loc/iloc with Loops In Python, Instead, Use This!

Run your loops at a 60X faster speed

Pic Credit: Unsplash

Recently, I was experimenting with loops in Python and realized that using ‘iloc’/‘loc’ inside a loop takes a lot of time to execute. The immediate next questions were: why does ‘loc’ take so much time, and what is the alternative?

In this blog, we will answer these questions by looking at some practical examples.

What is loc — if you don’t know already!

loc is a pandas label-based indexer (used with square brackets, not called like a function) that accesses values within a DataFrame using the row label and column name. It is used when you know which row and column you want to access.

Let’s understand loc using an example. We have the following pandas DataFrame named df (shown below), and we want to access the value in the 2nd row (index 1) of column ‘a’, which is 10.

DataFrame df (Image by Author)

We can access the value using the following code:

##df.loc[index, column_name]
df.loc[1,'a']
### Output: 10 

Similarly, iloc is used to access the value using the integer position of the row and the integer position of the column.

##df.iloc[row_position, column_position]
df.iloc[1,0]
### Output: 10

So, loc selects rows and columns by their labels (index labels and column names), while iloc selects them by their integer positions.
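The difference between labels and positions is easier to see when the index is not the default 0, 1, 2, ... A minimal sketch, assuming a small DataFrame named df_demo: column ‘a’ reuses the values shown in this post, while column ‘b’ and the string index labels are made up purely for illustration.

```python
import pandas as pd

# Hypothetical data: 'a' matches the outputs shown in this post;
# 'b' and the string index labels are invented for this sketch.
df_demo = pd.DataFrame(
    {"a": [26, 10, 22, 22], "b": [1, 2, 3, 4]},
    index=["w", "x", "y", "z"],
)

# loc works with labels: row label 'x', column name 'a'
by_label = df_demo.loc["x", "a"]

# iloc works with integer positions: row position 1, column position 0
by_position = df_demo.iloc[1, 0]

print(by_label, by_position)
```

Both lookups return the same cell (10); only the way the cell is addressed differs.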

What happens if you use loc/iloc with loops in Python?

Imagine, we want to add a new column ‘c’, which is equal to the sum of values of column ‘a’ and column ‘b’, to our DataFrame df.

Using the ‘for’ loop, we can iterate through our DataFrame and add a new column ‘c’ using the loc function as shown below:

import time
start = time.time()
# Iterating through the DataFrame df
for index, row in df.iterrows():
    df.loc[index,'c'] = row.a + row.b
end = time.time()
print(end - start)
### Time taken: 2414 seconds

The time taken to iterate and update values using loc is around 40 minutes, which is a lot.

Alternative: Using ‘at’ in place of ‘loc’

We can perform the same manipulation by replacing ‘loc’ with ‘at’ (or replacing ‘iloc’ with ‘iat’) as shown below.

import time
start = time.time()
# Iterating through DataFrame 
for index, row in df.iterrows():
    df.at[index,'c'] = row.a + row.b
end = time.time()
print(end - start)
### Time taken: 40 seconds

The code runs in about 40 seconds (~0.7 minutes), which is roughly 60 times faster than the version that uses loc.
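To try the comparison yourself, here is a minimal benchmark sketch. The DataFrame size (2,000 rows), the random data, and the helper name fill_with_loop are assumptions made for this sketch, not the setup behind the timings above, so the absolute numbers and the exact speed-up will differ on your machine and pandas version.

```python
import time

import numpy as np
import pandas as pd


def fill_with_loop(df, use_at):
    """Fill column 'c' row by row with df.at or df.loc; return (seconds, result)."""
    out = df.copy()
    out["c"] = 0  # pre-create the column so both variants only overwrite cells
    start = time.perf_counter()
    for index, row in out.iterrows():
        if use_at:
            out.at[index, "c"] = row["a"] + row["b"]
        else:
            out.loc[index, "c"] = row["a"] + row["b"]
    return time.perf_counter() - start, out


# Hypothetical benchmark data; the size is kept small so the loop finishes quickly.
n_rows = 2_000
rng = np.random.default_rng(0)
df_bench = pd.DataFrame(
    {"a": rng.integers(0, 100, size=n_rows), "b": rng.integers(0, 100, size=n_rows)}
)

loc_seconds, loc_result = fill_with_loop(df_bench, use_at=False)
at_seconds, at_result = fill_with_loop(df_bench, use_at=True)

print(f"loc: {loc_seconds:.3f}s, at: {at_seconds:.3f}s")
```

Both variants produce the same column ‘c’; only the time spent per assignment changes.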

‘loc’ vs ‘at’: why the difference in runtime?

  • ‘at’/ ‘iat’

at and iat are meant to access a scalar, that is, a single element in the DataFrame, as shown below:

df.at[2,'a']
### Output: 22
df.iat[2,0]
### Output: 22

If we try to access a Series (multiple elements) using at or iat, pandas raises an error, as shown below:

## This will give an error as we are trying to access multiple rows
df.at[:3,'a']
### Output: ValueError: At based indexing on an integer index can only have integer indexers

  • ‘loc’/ ‘iloc’

loc and iloc are meant to access multiple elements (a Series or a DataFrame) at the same time, typically to perform vectorized operations.

df.loc[:3,'a']
### Output
##0    26
##1    10
##2    22
##3    22
## iloc uses positions and excludes the end of a slice, so :4 selects rows 0 to 3
df.iloc[:4,0]
### Output
##0    26
##1    10
##2    22
##3    22

Because at is designed to access only a single scalar value, it is lightweight and its implementation is fast. loc, in contrast, is designed to access a Series or DataFrame, so it has to handle many more kinds of input and therefore takes more time on every call.
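A short sketch of the kinds of input each accessor handles, assuming a small hypothetical DataFrame df_demo (column ‘a’ reuses the values from this post, column ‘b’ is made up). at only ever returns one cell, while loc also accepts slices, lists of labels, and boolean masks, and returns a Series or DataFrame accordingly.

```python
import pandas as pd

# Hypothetical data: 'a' matches the outputs shown in this post, 'b' is invented.
df_demo = pd.DataFrame({"a": [26, 10, 22, 22], "b": [1, 2, 3, 4]})

scalar = df_demo.at[2, "a"]                    # single cell -> scalar
series = df_demo.loc[:3, "a"]                  # label slice (end label included) -> Series
frame = df_demo.loc[[0, 2], ["a", "b"]]        # lists of labels -> DataFrame
masked = df_demo.loc[df_demo["a"] > 20, "a"]   # boolean mask -> Series

print(type(series).__name__, type(frame).__name__)
```

Supporting all of these selection styles is convenient outside loops, but it is extra work that a plain single-cell lookup does not need.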

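The following blog post covers best practices for iterating through a pandas DataFrame, and I recommend skimming through it: Don’t use Apply in Python, follow these Best Practices! (https://towardsdatascience.com/dont-use-apply-in-python-there-are-better-alternatives-dc6364968f44)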

Conclusion

Using ‘loc’/‘iloc’ to read or write single values inside Python loops is not optimal and should be avoided. Instead, use ‘at’/‘iat’ for single-value access, as they are much faster than ‘loc’/‘iloc’ for that purpose.

Also, please keep in mind that ‘loc’/‘iloc’ work very well ‘outside’ loops in Python, when we apply vectorized operations.
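As a minimal sketch of what "vectorized" means here, the example below builds column ‘c’ without any Python-level loop and then uses loc once with a boolean mask to update several rows in a single call. The DataFrame df_demo, the values in column ‘b’, and the threshold of 20 are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: 'a' matches the outputs shown in this post, 'b' is invented.
df_demo = pd.DataFrame({"a": [26, 10, 22, 22], "b": [1, 2, 3, 4]})

# Whole-column arithmetic: no explicit loop over rows
df_demo["c"] = df_demo["a"] + df_demo["b"]

# loc called once with a boolean mask updates every matching row together
df_demo.loc[df_demo["a"] > 20, "c"] = 0

print(df_demo["c"].tolist())  # [0, 12, 0, 0]
```

When the whole operation can be written this way, it avoids the row-by-row loop altogether.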

Thank You!

I hope you found the story useful. You can get all my posts in your inbox by subscribing here: https://anmol3015.medium.com/subscribe. If you would like to experience Medium yourself, consider supporting me and thousands of other writers by signing up for a membership (https://anmol3015.medium.com/membership). It only costs $5 per month, it supports us writers greatly, and you get access to all the stories on Medium.
