AWS Glue Studio: Perform PySpark SQL Queries Without Knowing Spark

Far more people know SQL than know how to program in Python and use Spark for big data analytics. With AWS Glue Studio, you can build data pipelines that run on a distributed cluster without writing a single line of Spark code. This tutorial walks through creating a Glue job with the Glue Studio visual editor, no coding required. As an example, I will run a SQL query that identifies customers who have placed orders greater than $500.

Create a New Job

We first need to create a new job by selecting the “Visual with a blank canvas” option in Glue Studio.

Add Source Datasets

To start our workflow, we need to bring our source data into the canvas. I added one table containing my customer information and another table with my customer orders.

SQL Transform

Next, select the “SQL” transform from the Transform drop-down and connect it to your two datasets. For my use case, I need to identify customers by first and last name, so the two datasets have to be joined. The join itself is performed in the SQL query.

Add SQL Logic

First, we need to add SQL aliases to our input sources so they can be referenced in our SQL code. In the image below, you can see that I labeled mine “profile” and “orders”. Now we can add our SQL query to the SQL query box. In my use case, the query keeps only orders greater than $500, groups them by customer id, and counts the orders per customer.

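-- count orders over $500 per customer, largest counts first
-- columns are qualified with their table alias to avoid ambiguous references after the join
select profile.id, first(profile.first_name) as first_name, first(profile.last_name) as last_name, count(*) as total_orders
from orders
inner join profile on orders.customer_id = profile.id
where orders.total_amount > 500
group by profile.id
order by total_orders desc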

We can check that our SQL query is accurate by clicking the data preview button to see a sample of the results being returned. This helps us spot any errors in our logic.
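If you want to sanity-check the query logic outside Glue as well, one option is to run an equivalent query against a few hand-made rows. The sketch below uses Python's built-in sqlite3 module rather than Spark, so Spark's first() aggregate is swapped for min() (this assumes each customer id maps to exactly one first and last name). The sample rows and the two-column orders table are invented for illustration and are not my real data.

```python
import sqlite3

# In-memory database with two tiny tables named after the SQL aliases.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table profile (id integer, first_name text, last_name text);
    create table orders (customer_id integer, total_amount real);
    insert into profile values (1, 'Ada', 'Lovelace'), (2, 'Alan', 'Turing');
    insert into orders values
        (1, 750.0), (1, 520.0), (1, 40.0),
        (2, 900.0), (2, 100.0);
""")

# Same shape as the Glue query; min() stands in for Spark's first().
query = """
    select profile.id,
           min(profile.first_name) as first_name,
           min(profile.last_name)  as last_name,
           count(*)                as total_orders
    from orders
    inner join profile on orders.customer_id = profile.id
    where orders.total_amount > 500
    group by profile.id
    order by total_orders desc
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With these sample rows, customer 1 has two orders over $500 and customer 2 has one, so customer 1 is listed first.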

Write Results to a Target

If all you need to do is run a single SQL query, you are ready to write the results of your analysis. In my case, I decided to write them back to an S3 bucket and add the location to the AWS Glue Data Catalog so I could later query these results in Amazon Athena.

If you click on the Script tab, you can see the Spark code that was generated from the transforms and parameters you added.
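The generated script is not reproduced here, but the core idea behind the SQL transform can be sketched in a few lines: each input is registered as a temporary view under its SQL alias, and the query is then handed to Spark. The helper below is a simplified illustration, not the exact code Glue Studio writes. The function name run_sql_with_aliases is hypothetical, and it assumes `spark` is a PySpark SparkSession and each input is a PySpark DataFrame (Glue's own script works with its DynamicFrame type and converts as needed).

```python
from typing import Any, Mapping


def run_sql_with_aliases(spark: Any, query: str, inputs: Mapping[str, Any]) -> Any:
    """Register each input DataFrame under its SQL alias, then run the query.

    Hypothetical helper: `spark` is expected to be a pyspark.sql.SparkSession
    and every value in `inputs` a pyspark.sql.DataFrame.
    """
    for alias, dataframe in inputs.items():
        # Makes the DataFrame addressable by name (e.g. "orders") inside SQL.
        dataframe.createOrReplaceTempView(alias)
    # Spark plans the query and executes it across the cluster.
    return spark.sql(query)
```

In a Spark session, this could be called as run_sql_with_aliases(spark, sql_text, {"profile": profile_df, "orders": orders_df}), where profile_df and orders_df are placeholder names for the two source DataFrames.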

And that’s it! You have just created an AWS Glue job that can run a distributed Spark SQL query on your dataset. Simple as that.

I have also created a video tutorial explaining all the steps if you want to follow along there.

If this video is helpful, consider subscribing to my YouTube channel.

I hope you enjoyed reading this. If you’d like to support me as a writer, consider signing up to become a Medium member. It’s just $5 a month and you get unlimited access to Medium.
