Alex Mitchell



Abstract

egorical variables. If the number of categories is small compared to the total number of values, it is better to use the category data type instead of object. Depending on the data size, it can save a substantial amount of memory.</p><p id="4b45">The following code goes over the columns with the object data type. If the number of categories is less than 5 percent of the total number of values, the data type of the column is changed to category.</p><div id="0b82"><pre>cols = marketing<span class="hljs-selector-class">.select_dtypes</span>(include=<span class="hljs-string">'object'</span>)<span class="hljs-selector-class">.columns</span>
<span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> cols:
    ratio = <span class="hljs-built_in">len</span>(marketing<span class="hljs-selector-attr">[col]</span><span class="hljs-selector-class">.value_counts</span>()) / <span class="hljs-built_in">len</span>(marketing)
    <span class="hljs-keyword">if</span> ratio < <span class="hljs-number">0.05</span>:
        marketing<span class="hljs-selector-attr">[col]</span> = marketing<span class="hljs-selector-attr">[col]</span><span class="hljs-selector-class">.astype</span>(<span class="hljs-string">'category'</span>)</pre></div><p id="56aa">We have done three steps of data cleaning and manipulation. Depending on the task, more steps might be needed.</p><p id="7ed1">Let’s create a pipe that accomplishes all these tasks.</p><p id="af89">The pipe function takes functions as inputs. These functions need to take a dataframe as input and return a dataframe. 
Thus, we need to define functions for each task.</p><div id="4e5c"><pre>def drop_missing(df):
    thresh = len(df) * 0.6
    df.dropna(<span class="hljs-attribute">axis</span>=1, <span class="hljs-attribute">thresh</span>=thresh, <span class="hljs-attribute">inplace</span>=<span class="hljs-literal">True</span>)
    return df</pre></div><div id="3a6e"><pre>def remove_outliers(df, <span class="hljs-built_in">column_name</span>):
    low = np.quantile(df[<span class="hljs-built_in">column_name</span>], <span class="hljs-number">0.05</span>)
    high = np.quantile(df[<span class="hljs-built_in">column_name</span>], <span class="hljs-number">0.95</span>)
    <span class="hljs-keyword">return</span> df[df[<span class="hljs-built_in">column_name</span>].<span class="hljs-keyword">between</span>(low, high, inclusive=<span class="hljs-string">'both'</span>)]</pre></div><div id="dcd6"><pre>def to_category(<span class="hljs-built_in">df</span>):
    cols = df.select_dtypes(include=<span class="hljs-string">'object'</span>).columns
    <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> cols:
        ratio = len(<span class="hljs-built_in">df</span>[col].value_counts()) / len(<span class="hljs-built_in">df</span>)
        <span class="hljs-keyword">if</span> ratio < 0.05:
            <span class="hljs-built_in">df</span>[col] = <span class="hljs-built_in">df</span>[col].astype(<span class="hljs-string">'category'</span>)
    <span class="hljs-built_in">return</span> <span class="hljs-built_in">df</span></pre></div><p id="4e1f">You may ask what the point is if we still need to define functions; it does not seem to simplify the workflow. You are right for a one-off task, but we need to think more generally. Suppose you perform the same operations many times. In that case, creating a pipe makes the process easier and also provides cleaner code.</p><p id="f187">We have mentioned that the pipe function takes a function as input. 
If the function we pass to the pipe function takes additional arguments, we can pass them to the pipe function along with the function. This makes the pipe function even more flexible.</p><p id="4d8c">For instance, the remove_outliers function takes a column name as an argument. The function removes the outliers in that column.</p><p id="b1c0">We can now create our pipe.</p><div id="df66"><pre>marketing_cleaned = (<span class="hljs-name">marketing</span>.
    pipe(<span class="hljs-name">drop_missing</span>).
    pipe(<span class="hljs-name">remove_outliers</span>, 'Salary').
    pipe(<span class="hljs-name">to_category</span>))</pre></div><p id="10f5">It looks neat and clean. We can add as many steps as needed. The only criterion is that the functions in the pipe should take a dataframe as an argument and return a dataframe. Just like with the remove_outliers function, we can pass the arguments of the functions to the pipe function as additional arguments. This flexibility makes the pipes more useful.</p><p id="b9fe">One important thing to mention is that this pipe modifies the original dataframe: the pipe function does not make a copy, and drop_missing drops columns in place (inplace=True). We should avoid changing the original dataset if possible.</p><p id="2d9c">To overcome this issue, we can use a copy of the original dataframe in the pipe. Alternatively, we can add a step that makes a copy of the dataframe at the beginning of the pipe.</p><div id="95d3"><pre><span class="hljs-keyword">def</span> <span class="hljs-title function_">copy_df</span>(<span class="hljs-params">df</span>):
    <span class="hljs-keyword">return</span> df.copy()</pre></div><div id="f711"><pre>marketing_cleaned = (<span class="hljs-name">marketing</span>.
    pipe(<span class="hljs-name">copy_df</span>).
    pipe(<span class="hljs-name">drop_missing</span>).
    pipe(<span class="hljs-name">remove_outliers</span>, 'Salary').
    pipe(<span class="hljs-name">to_category</span>))</pre></div><p id="0425">Our pipeline is complete now. 
Let’s compare the original dataframe with the cleaned one to confirm it is working.</p><div id="b937"><pre>marketing.<span class="hljs-built_in">shape</span>
(<span class="hljs-number">1000</span>,<span class="hljs-number">10</span>)</pre></div><div id="e53b"><pre>marketing.dtypes
<span class="hljs-type">Age</span> <span class="hljs-keyword">object</span>
<span class="hljs-type">Gender</span> <span class="hljs-keyword">object</span>
<span class="hljs-type">OwnHome</span> <span class="hljs-keyword">object</span>
<span class="hljs-type">Married</span> <span class="hljs-keyword">object</span>
<span class="hljs-type">Location</span> <span class="hljs-keyword">object</span>
<span class="hljs-type">Salary</span> <span class="hljs-type">int64</span>
<span class="hljs-type">Children</span> <span class="hljs-type">int64</span>
<span class="hljs-type">History</span> <span class="hljs-keyword">object</span>
<span class="hljs-type">Catalogs</span> <span class="hljs-type">int64</span>
<span class="hljs-type">AmountSpent</span> <span class="hljs-type">int64</span></pre></div><div id="3641"><pre><span class="hljs-title">marketing_cleaned</span>.shape
(<span class="hljs-number">900</span>,<span class="hljs-number">10</span>)</pre></div><div id="e065"><pre>marketing_cleaned.dtypes
Age category
Gender category
OwnHome category
Married category
Location category
Salary <span class="hljs-built_in">int64</span>
Children <span class="hljs-built_in">int64</span>
History category
Catalogs <span class="hljs-built_in">int64</span>
AmountSpent <span class="hljs-built_in">int64</span></pre></div><p id="9b63">The pipeline is working as expected.</p><h2 id="8b85">Conclusion</h2><p id="2c89">Pipes provide a cleaner and more maintainable syntax for data analysis. 
Another advantage is that they automate the steps of data cleaning and manipulation.</p><p id="a29e">If you are doing the same operations over and over, you should definitely consider creating a pipeline.</p><p id="ae6e">Thank you for reading. Please let me know if you have any feedback.</p></article></body>
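
The pipe workflow described above can be sketched end to end as follows. This is a minimal illustration, not the article's exact code: the small synthetic dataframe and its values are made up for the example, and the three functions are variants that return new dataframes instead of using inplace=True, so the original data stays untouched without a separate copy step.

```python
import numpy as np
import pandas as pd


def drop_missing(df, min_fraction=0.6):
    """Keep only columns with at least min_fraction non-missing values."""
    thresh = int(len(df) * min_fraction)
    return df.dropna(axis=1, thresh=thresh)  # returns a new dataframe


def remove_outliers(df, column_name, low_q=0.05, high_q=0.95):
    """Keep rows whose value in column_name lies between the two quantiles."""
    low = df[column_name].quantile(low_q)
    high = df[column_name].quantile(high_q)
    return df[df[column_name].between(low, high)]


def to_category(df, max_ratio=0.05):
    """Convert object columns with few distinct values to the category dtype."""
    df = df.copy()  # avoid modifying the dataframe that was passed in
    for col in df.select_dtypes(include="object").columns:
        if df[col].nunique() / len(df) < max_ratio:
            df[col] = df[col].astype("category")
    return df


# Synthetic stand-in for the marketing data (hypothetical values).
marketing = pd.DataFrame({
    "Gender": pd.Series(["F", "M"] * 50, dtype=object),
    "Salary": np.arange(1, 101),
    "Notes": pd.Series(["x"] * 10 + [None] * 90, dtype=object),  # mostly missing
})

# Extra positional arguments after the function are forwarded by pipe.
marketing_cleaned = (
    marketing
    .pipe(drop_missing)
    .pipe(remove_outliers, "Salary")
    .pipe(to_category)
)

print(marketing.shape)          # (100, 3): the original is unchanged
print(marketing_cleaned.shape)  # (90, 2): "Notes" dropped, outlier rows removed
print(marketing_cleaned.dtypes)
```

Because each step returns a new dataframe, the same three functions can be reused on other dataframes in any order without side effects.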

My Secret Leveling-Up Weapon: Coffee Meetings

As I look back on the last 3 years of my Product career compared to the first few years, I see acceleration. While some of my growth has been due to skill development and working for a few great DC tech companies (shout out to Vistaprint Digital + Upside Travel!), there was something else behind this leveling-up.

Beginning about 2 years ago, I made a conscious decision that I wanted to get a coffee or a drink with someone (old or new) every week.

While this sounds simple, at first it was very difficult for me. My mornings are my most productive time; how could I possibly sacrifice 30–45 minutes of that time to simply talk with someone!?

However, I quickly recognized the power of this simple habit and over the past 2 years, it’s been my secret weapon.

As I look back at my calendar, I’ve met 112 times for coffee or drinks (!) over the last 2 years with people old and new.

5 Things I’ve Learned From My Coffees

1. You never know who will help you and you never know when

As organized as you try to be, you simply can’t plan for everything. You don’t know all of the opportunities that exist out there in the world and you won’t without talking to people!

Coffees will allow you to express your interests and goals for your personal life and your professional career. Often, others can help you achieve those faster.

2. It’s fun and rewarding helping others

For the past 9 months, I’ve been meeting monthly with the founders of Stratus, a financial planning platform for Millennials, by Millennials.

Their energy and excitement are infectious and it’s been immensely rewarding helping them with at least a few (hopefully decent) ideas on how they can continue to level-up themselves and their company.

3. Challenging your brain is healthy

As you set up your own coffees, make sure not to restrict yourself to people in your direct field or sub-industry.

I’ve found some of my most interesting conversations have been with people at companies that have no connection whatsoever with mine.

These conversations activate parts of your brain that your day-to-day conversations and job may not.

4. DC Tech/Tech is a small world

As you meet with more and more people for coffee, you’ll learn how connected the world is. I’m constantly surprised when new people I meet have worked with or met people I’ve built relationships with (although LinkedIn sometimes spoils the surprise…).

Because of this “graph” of connections you’ll build with coffees, each subsequent connection is likely to strengthen your network by more than the previous one did.

5. Your productivity will actually get a boost from others’ ideas

I saved this one for last because it completely refuted my initial worry that these coffees would be time poorly spent. Rather, the truth was the complete opposite: these coffees have increased my productivity.

Through the course of conversation, I’d share some of the hard problems I was working on and, often, I would get a completely new perspective from the person I was talking with.

They see my company and my challenges with different eyes, and I’ve been amazed by (and very thankful for) the ideas they’ve given me to attack these problems.

What’s Your Leveling-Up Secret Weapon?

Share it in the comments! Or share it with me on Twitter: @amitch5903
