avatarGanesh Chandrasekaran

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

5051

Abstract

es-1.readmedium.com/v2/resize:fit:800/1*P-vB1savb5LG1WSZ9ASa6w.png"><figcaption></figcaption></figure><blockquote id="b4f7"><p><b>Note: SQL view can be created using the data frame df2</b></p></blockquote><p id="7b03"><b>Method 2: Convert String to Date and also Rename the column using “withColumn”</b></p><p id="5bff">2 new columns “OrderDate” and “ShipDate” are created and old columns “Order Date” and “Ship Date” are dropped from the data frame.</p><div id="459f"><pre>## Renaming <span class="hljs-keyword">Order</span> <span class="hljs-type">Date</span> <span class="hljs-keyword">to</span> OrderDate <span class="hljs-keyword">and</span> Dropping the <span class="hljs-built_in">old</span> <span class="hljs-keyword">columns</span></pre></div><div id="a2a1"><pre>df2 = df
<span class="hljs-selector-class">.withColumn</span>(<span class="hljs-string">"OrderDate"</span>,<span class="hljs-built_in">to_date</span>(<span class="hljs-built_in">col</span>(<span class="hljs-string">"Order Date"</span>),<span class="hljs-string">"MM/dd/yyyy"</span>))
<span class="hljs-selector-class">.withColumn</span>(<span class="hljs-string">"ShipDate"</span>,<span class="hljs-built_in">to_date</span>(<span class="hljs-built_in">col</span>(<span class="hljs-string">"Ship Date"</span>),<span class="hljs-string">"MM/dd/yyyy"</span>))
<span class="hljs-selector-class">.drop</span>(<span class="hljs-built_in">col</span>(<span class="hljs-string">"Order Date"</span>))
<span class="hljs-selector-class">.drop</span>(<span class="hljs-built_in">col</span>(<span class="hljs-string">"Ship Date"</span>))</pre></div><div id="448b"><pre>df2.<span class="hljs-keyword">show</span>()</pre></div><figure id="01a0"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*8SnyiDJQohMP1k6z9fXTcA.png"><figcaption></figcaption></figure><figure id="cba4"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*26SbIS256a3dlHch_-PnEQ.png"><figcaption>Schema with new column names</figcaption></figure><blockquote id="0599"><p><b>Note: SQL view can be created using the data frame df2</b></p></blockquote><p id="770f"><b>Method 3: Instead of using “withColumn” in data frame, “select” is used with data frames.</b></p><div id="703f"><pre><span class="hljs-keyword">from</span> pyspark.<span class="hljs-keyword">sql</span>.<span class="hljs-keyword">functions</span> <span class="hljs-keyword">import</span> col</pre></div><div id="f0a1"><pre><span class="hljs-keyword">from</span> pyspark.<span class="hljs-keyword">sql</span>.<span class="hljs-keyword">functions</span> <span class="hljs-keyword">import</span> to_date</pre></div><div id="7c79"><pre><span class="hljs-attribute">df2</span> <span class="hljs-operator">=</span> df.select(<span class="hljs-string">"Region"</span></pre></div><div id="fe0c"><pre>,<span class="hljs-string">"Country"</span> ,<span class="hljs-built_in">col</span>(<span class="hljs-string">"Item Type"</span>)<span class="hljs-selector-class">.alias</span>(<span class="hljs-string">"ItemType"</span>) ,<span class="hljs-built_in">col</span>(<span class="hljs-string">"Sales Channel"</span>)<span class="hljs-selector-class">.alias</span>(<span class="hljs-string">"SalesChannel"</span>) ,<span class="hljs-built_in">col</span>(<span class="hljs-string">"Order Priority"</span>)<span class="hljs-selector-class">.alias</span>(<span class="hljs-string">"OrderPriority"</span>) ,<span class="hljs-built_in">to_date</span>(<span class="hljs-string">"Order Date"</span>,<span class="hljs-string">"mm/dd/yyyy"</span>)<span class="hljs-selector-class">.alias</span>(<span class="hljs-string">"OrderDate"</span>) ,<span class="hljs-built_in">col</span>(<span class="hljs-string">"Order ID"</span>)<span class="hljs-selector-class">.alias</span>(<span class="hljs-string">"OrderID"</span>) ,<span class="hljs-built_in">to_date</span>(<span class="hljs-string">"Ship Date"</span>,<span class="hljs-string">"mm/dd/yyyy"</span>)<span class="hljs-selector-class">.alias</span>(<span class="hljs-string">"ShipDate"</span>) ,<span class="hljs-built_in">col</span>(<span class="hljs-string">"Units Sold"</span>)<span class="hljs-selector-class">.alias</span>(<span class="hljs-string">"UnitsSold"</span>) ,<span class="hljs-built_in">col</span>(<span class="hljs-string">"Unit Price"</span>)<span class="hljs-selector-class">.alias</span>(<span class="hljs-string">"UnitPrice"</span>) ,<span class="hljs-built_in">col</span>(<span class="hljs-string">"Unit Cost"</span>)<span class="hljs-selector-class">.alias</span>(<span class="hljs-string">"UnitCost"</span>) ,<span class="hljs-built_in">col</span>(<span class="hljs-string">"Total Revenue"</span>)<span class="hljs-selector-class">.alias</span>(<span class="hljs-string">"TotalRevenue"</span>) ,<span class="hljs-built_in">col</span>(<span class="hljs-string">"Total Cost"</span>)<span class="hljs-selector-class">.alias</span>(<span class="hljs-string">"TotalCost"</span>) ,<span class="hljs-built_in">col</span>(<span class="hljs-string">"Total Profit"</span>)<span cl

Options

ass="hljs-selector-class">.alias</span>(<span class="hljs-string">"TotalProfit"</span>) )</pre></div><div id="3a95"><pre>df2<span class="hljs-selector-class">.printSchema</span>()</pre></div><p id="be69">In the above script, 2 things happen at the same time. Columns are renamed without spaces and the string is converted to date.</p><figure id="ede4"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*8V-JAuP1XygJOvlrUsOjwA.png"><figcaption></figcaption></figure><figure id="2cf8"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*K6Qu3OZwKbyWJgyr5gkD4g.png"><figcaption>Columns are renamed and data type is converted</figcaption></figure><blockquote id="b072"><p><b>Note: SQL view can be created using the data frame df2</b></p></blockquote><p id="57b9"><b>Method 4: Create a SQL View and convert the datatype.</b></p><p id="5b3e">There is an important step that needs to be done when using SQL. Enabling the Legacy Time Parser.</p><div id="e487"><pre><span class="hljs-attr">temp_table_name</span> = <span class="hljs-string">"sales_100"</span></pre></div><div id="6256"><pre>df.createOrReplaceTempView<span class="hljs-comment">(temp_table_name)</span></pre></div><div id="24f2"><pre>## Setting <span class="hljs-keyword">this</span> to LEGACY <span class="hljs-keyword">is</span> very important</pre></div><div id="86f4"><pre>spark.<span class="hljs-keyword">sql</span>("set spark.sql.legacy.timeParserPolicy=LEGACY")</pre></div><p id="ec26">As a follow-up step, write a select statement to convert the data and rename the columns using a column alias</p><div id="a645"><pre><span class="hljs-tag">%<span class="hljs-selector-tag">sql</span></span></pre></div><div id="c2bd"><pre><span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> new_sales_100 <span class="hljs-keyword">as</span></pre></div><div id="9a31"><pre>create table new_sales_100 as select Region as Region ,Country as Country ,Item Type as ItemType ,Sales Channel as SalesChannel ,Order Priority as OrderPriority ,to_date(Order Date,'MM/dd/yyyy') as OrderDate ,Order ID as OrderID ,to_date(Ship Date,'MM/dd/yyyy') as ShipDate ,Units Sold as UnitsSold ,Unit Price as UnitPrice ,Unit Cost as UnitCot ,Total Revenue as TotalRevenue ,Total Cost as TotalCost ,Total Profit as TotalProfit from sales_100</pre></div><p id="a119">View the Table description and data</p><div id="ae30"><pre><span class="hljs-tag">%<span class="hljs-selector-tag">sql</span></span></pre></div><div id="b58d"><pre><span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> new_sales_100</pre></div><figure id="47d0"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*PsaZm2RrQUeV6iX_flRW_A.png"><figcaption></figcaption></figure><div id="cb06"><pre><span class="hljs-tag">%<span class="hljs-selector-tag">sql</span></span></pre></div><div id="108a"><pre>describe <span class="hljs-keyword">new</span><span class="hljs-number">_</span>sales<span class="hljs-number">_100</span></pre></div><figure id="09fc"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*3KKDUfIDJwCqSbZdLe0jUA.png"><figcaption>The output of Describe table</figcaption></figure><p id="566a">Notebook version of the article is available here</p><p id="2e39"><a href="https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3171750688223597/4357850475329684/1405937485320911/latest.html">https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3171750688223597/4357850475329684/1405937485320911/latest.html</a></p><p id="ac65">If you like the articles please subscribe using the link below.</p><div id="80a9" class="link-block"> <a href="https://ganeshchandrasekaran.com/subscribe"> <div> <div> <h2>Get an email whenever Ganesh Chandrasekaran publishes.</h2> <div><h3>Edit description</h3></div> <div><p>ganeshchandrasekaran.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*CLts0Q3dSJ5YVyyK)"></div> </div> </div> </a> </div><p id="7cad">To read more interesting articles</p><div id="92a1" class="link-block"> <a href="https://ganeshchandrasekaran.com/membership"> <div> <div> <h2>Join Medium with my referral link - Ganesh Chandrasekaran</h2> <div><h3>As a Medium member, a portion of your membership fee goes to writers you read, and you get full access to every story…</h3></div> <div><p>ganeshchandrasekaran.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*dkW3qkq0_ix97zTT)"></div> </div> </div> </a> </div></article></body>

Convert String to Date in Spark using Azure Databricks

Date conversion may seem obvious but it is not. Read through the article to find out why.

The sample CSV used in this article can be downloaded from the link below.

https://raw.githubusercontent.com/gchandra10/awk_scripts_data_science/master/sales_100.csv

Photo by Shubham Dhage on Unsplash

Step 1: Upload the CSV sales_100.csv and Create a Dataframe using Python

# File location and type
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
file_location = "/FileStore/tables/sales_100.csv"
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format("csv") \
.option("inferSchema", "true") \
.option("header", "true") \
.option("sep", ",") \
.load(file_location)
# display(df)
df.printSchema()
Output of Schema
Data Output

The order date and ship date are expected to be of date datatype but spark infers them as a string.

Method 1: Convert String to Date using “withColumn”

## This method uses withColumn feature of DataFrame and converts the String data type to Date
from pyspark.sql.functions import col
from pyspark.sql.functions import to_date
df2 = df \
.withColumn("Order Date",to_date(col("Order Date"),"MM/dd/yyyy")) \
.withColumn("Ship Date",to_date(col("Ship Date"),"MM/dd/yyyy"))
df2.show()

As data frames are immutable new data frame df2 is created from existing data frame df.

The second point to be noted is, the date is in mm/dd/yyyy format, which needs to be specified in the to_date function. This helps to format it the right way.

The output of Order Date and Ship Date is in yyyy-mm-dd which is the perfect date format.

Note: SQL view can be created using the data frame df2

Method 2: Convert String to Date and also Rename the column using “withColumn”

2 new columns “OrderDate” and “ShipDate” are created and old columns “Order Date” and “Ship Date” are dropped from the data frame.

## Renaming Order Date to OrderDate and Dropping the old columns
df2 = df \
.withColumn("OrderDate",to_date(col("Order Date"),"MM/dd/yyyy")) \
.withColumn("ShipDate",to_date(col("Ship Date"),"MM/dd/yyyy")) \
.drop(col("Order Date")) \
.drop(col("Ship Date"))
df2.show()
Schema with new column names

Note: SQL view can be created using the data frame df2

Method 3: Instead of using “withColumn” in data frame, “select” is used with data frames.

from pyspark.sql.functions import col
from pyspark.sql.functions import to_date
df2 = df.select("Region"
,"Country"
,col("Item Type").alias("ItemType")
,col("Sales Channel").alias("SalesChannel")
,col("Order Priority").alias("OrderPriority")
,to_date("Order Date","mm/dd/yyyy").alias("OrderDate")
,col("Order ID").alias("OrderID")
,to_date("Ship Date","mm/dd/yyyy").alias("ShipDate")
,col("Units Sold").alias("UnitsSold")
,col("Unit Price").alias("UnitPrice")
,col("Unit Cost").alias("UnitCost")
,col("Total Revenue").alias("TotalRevenue")
,col("Total Cost").alias("TotalCost")
,col("Total Profit").alias("TotalProfit")
)
df2.printSchema()

In the above script, 2 things happen at the same time. Columns are renamed without spaces and the string is converted to date.

Columns are renamed and data type is converted

Note: SQL view can be created using the data frame df2

Method 4: Create a SQL View and convert the datatype.

There is an important step that needs to be done when using SQL. Enabling the Legacy Time Parser.

temp_table_name = "sales_100"
df.createOrReplaceTempView(temp_table_name)
## Setting this to LEGACY is very important
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

As a follow-up step, write a select statement to convert the data and rename the columns using a column alias

%sql
create table new_sales_100 as
create table new_sales_100 as
select 
  `Region` as Region
  ,`Country` as Country
  ,`Item Type` as ItemType
  ,`Sales Channel` as SalesChannel
  ,`Order Priority` as OrderPriority
  ,to_date(`Order Date`,'MM/dd/yyyy') as OrderDate
  ,`Order ID` as OrderID
  ,to_date(`Ship Date`,'MM/dd/yyyy') as ShipDate
  ,`Units Sold` as UnitsSold
  ,`Unit Price` as UnitPrice
  ,`Unit Cost` as UnitCot
  ,`Total Revenue` as TotalRevenue
  ,`Total Cost` as TotalCost
  ,`Total Profit` as TotalProfit
from `sales_100`

View the Table description and data

%sql
select * from new_sales_100
%sql
describe new_sales_100
The output of Describe table

Notebook version of the article is available here

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3171750688223597/4357850475329684/1405937485320911/latest.html

If you like the articles please subscribe using the link below.

To read more interesting articles

Databricks
Pyspark
Sql
Azure
Programming
Recommended from ReadMedium