avatarGanesh Chandrasekaran

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

2960

Abstract

js-string">"Monica"</span>, <span class="hljs-string">"Geller"</span>, <span class="hljs-string">"Spring95"</span>, <span class="hljs-string">"CS101"</span>, <span class="hljs-number">89</span>), (<span class="hljs-number">3</span>, <span class="hljs-string">"Rachel"</span>, <span class="hljs-string">"Green"</span>, <span class="hljs-string">"Spring95"</span>, <span class="hljs-string">"CS101"</span>, <span class="hljs-number">95</span>), (<span class="hljs-number">4</span>, <span class="hljs-string">"Joey"</span>, <span class="hljs-string">"Tribbiani"</span>, <span class="hljs-string">"Spring95"</span>, <span class="hljs-string">"CS101"</span>, <span class="hljs-number">83</span>), (<span class="hljs-number">5</span>, <span class="hljs-string">"Chandler"</span>, <span class="hljs-string">"Bing"</span>, <span class="hljs-string">"Spring95"</span>, <span class="hljs-string">"CS101"</span>, <span class="hljs-number">93</span>), (<span class="hljs-number">6</span>, <span class="hljs-string">"Phoebe"</span>, <span class="hljs-string">"Buffay"</span>, <span class="hljs-string">"Spring95"</span>, <span class="hljs-string">"CS101"</span>, <span class="hljs-number">89</span>), (<span class="hljs-number">1</span>, <span class="hljs-string">"Ross"</span>, <span class="hljs-string">"Geller"</span>, <span class="hljs-string">"Fall95"</span>, <span class="hljs-string">"Science"</span>, <span class="hljs-number">90</span>), (<span class="hljs-number">2</span>, <span class="hljs-string">"Monica"</span>, <span class="hljs-string">"Geller"</span>, <span class="hljs-string">"Fall95"</span>, <span class="hljs-string">"Science"</span>, <span class="hljs-number">99</span>), (<span class="hljs-number">3</span>, <span class="hljs-string">"Rachel"</span>, <span class="hljs-string">"Green"</span>, <span class="hljs-string">"Fall95"</span>, <span class="hljs-string">"Science"</span>, <span class="hljs-number">87</span>), (<span class="hljs-number">4</span>, <span class="hljs-string">"Joey"</span>, <span class="hljs-string">"Tribbiani"</span>, <span class="hljs-string">"Fall95"</span>, <span class="hljs-string">"Science"</span>, <span class="hljs-number">78</span>), (<span class="hljs-number">5</span>, <span class="hljs-string">"Chandler"</span>, <span class="hljs-string">"Bing"</span>, <span class="hljs-string">"Fall95"</span>, <span class="hljs-string">"Science"</span>, <span class="hljs-number">85</span>), (<span class="hljs-number">6</span>, <span class="hljs-string">"Phoebe"</span>, <span class="hljs-string">"Buffay"</span>, <span class="hljs-string">"Fall95"</span>, <span class="hljs-string">"Science"</span>, <span class="hljs-number">89</span>) ]

<span class="hljs-comment"># Schema for the above data</span> schema = StructType([ StructField(<span class="hljs-string">"id"</span>, IntegerType(), <span class="hljs-literal">True</span>), StructField(<span class="hljs-string">"fname"

Options

</span>, StringType(), <span class="hljs-literal">True</span>), StructField(<span class="hljs-string">"lname"</span>, StringType(), <span class="hljs-literal">True</span>), StructField(<span class="hljs-string">"semester"</span>, StringType(), <span class="hljs-literal">True</span>), StructField(<span class="hljs-string">"course"</span>, StringType(), <span class="hljs-literal">True</span>), StructField(<span class="hljs-string">"grade"</span>, IntegerType(), <span class="hljs-literal">True</span>) ])

<span class="hljs-comment"># Create the DataFrame</span> df = spark.createDataFrame(data, schema=schema) display(df)</pre></div><figure id="669f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*eULnG7XLTJZ_qblgITptjA.png"><figcaption></figcaption></figure><p id="868a">Now, it's time to pivot the semester and course so that analysts can use them for their analysis.</p><div id="aed3"><pre><span class="hljs-comment"># Create a new column combining semester and course for easier pivoting</span> df1 = (df.withColumn(<span class="hljs-string">"semester_course"</span>, concat_ws(<span class="hljs-string">"_"</span>, col(<span class="hljs-string">"semester"</span>), col(<span class="hljs-string">"course"</span>))) .withColumn(<span class="hljs-string">"name"</span>, concat_ws(<span class="hljs-string">" "</span>, col(<span class="hljs-string">"fname"</span>), col(<span class="hljs-string">"lname"</span>))) ) <span class="hljs-comment"># Pivot based on the new column</span> pivot_df = (df1.groupBy(<span class="hljs-string">"id"</span>, <span class="hljs-string">"name"</span>) .pivot(<span class="hljs-string">"semester_course"</span>) .agg(first(<span class="hljs-string">"grade"</span>)) .orderBy(<span class="hljs-string">"id"</span>) )

display(pivot_df)</pre></div><figure id="b9bd"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Bb-InOT--XumRZ4I8K8PJg.png"><figcaption></figcaption></figure><p id="98ce"><b>Advantages of Pivoting</b></p><ol><li>A smaller number of rows makes it faster to load and visualize. In the above example, 18 rows have been transformed to 6 rows.</li><li>With data in a pivoted format, comparing data side-by-side is much easier. BI tools can highlight the trends/outliers and patterns that might not be evident in a row-by-row approach.</li><li>By organizing data into a matrix-like structure, pivoting makes efficient use of space in visual representations.</li></ol><p id="c340">I tried to do a quick visualization using Databricks Charts,</p><figure id="0b06"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*rbPlH9P9RwOp4puNIs1Dgg.png"><figcaption>Visualization using Databricks Charts</figcaption></figure><p id="3fcc">If you like the article, remember to clap as many times as you can and leave feedback on how this helps in your use case(s).</p></article></body>

Multi-Column Pivots in PySpark

This example is implemented using PySpark, but the concept works with ANSI SQL, too.

Data Engineers and Analysts often need help with structuring data. Data Engineers typically prefer a row-oriented approach for its efficiency in storage and processing. On the other hand, Data Analysts, who extract insights and trends, favor a column-oriented layout that lends itself to easier analysis and visualization.

Img generated by Dall-E

Let’s create sample data with our F*R*I*E*N*D*S cast :)

from pyspark.sql.functions import concat_ws, col, first
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Sample data
data = [
    (1, "Ross", "Geller", "Fall94", "Mathematics", 93),
    (2, "Monica", "Geller", "Fall94", "Mathematics", 99),
    (3, "Rachel", "Green", "Fall94", "Mathematics", 87),
    (4, "Joey", "Tribbiani", "Fall94", "Mathematics", 73),
    (5, "Chandler", "Bing", "Fall94", "Mathematics", 82),
    (6, "Phoebe", "Buffay", "Fall94", "Mathematics", 87),
    (1, "Ross", "Geller", "Spring95", "CS101", 90),
    (2, "Monica", "Geller", "Spring95", "CS101", 89),
    (3, "Rachel", "Green", "Spring95", "CS101", 95),
    (4, "Joey", "Tribbiani", "Spring95", "CS101", 83),
    (5, "Chandler", "Bing", "Spring95", "CS101", 93),
    (6, "Phoebe", "Buffay", "Spring95", "CS101", 89),
    (1, "Ross", "Geller", "Fall95", "Science", 90),
    (2, "Monica", "Geller", "Fall95", "Science", 99),
    (3, "Rachel", "Green", "Fall95", "Science", 87),
    (4, "Joey", "Tribbiani", "Fall95", "Science", 78),
    (5, "Chandler", "Bing", "Fall95", "Science", 85),
    (6, "Phoebe", "Buffay", "Fall95", "Science", 89)
]

# Schema for the above data
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("fname", StringType(), True),
    StructField("lname", StringType(), True),
    StructField("semester", StringType(), True),
    StructField("course", StringType(), True),
    StructField("grade", IntegerType(), True)
])


# Create the DataFrame
df = spark.createDataFrame(data, schema=schema)
display(df)

Now, it's time to pivot the semester and course so that analysts can use them for their analysis.

# Create a new column combining semester and course for easier pivoting
df1 = (df.withColumn("semester_course", concat_ws("_", col("semester"), col("course")))
        .withColumn("name", concat_ws(" ", col("fname"), col("lname")))
      )
# Pivot based on the new column
pivot_df = (df1.groupBy("id", "name")
                .pivot("semester_course")
                .agg(first("grade"))
                .orderBy("id")
           )

display(pivot_df)

Advantages of Pivoting

  1. A smaller number of rows makes it faster to load and visualize. In the above example, 18 rows have been transformed to 6 rows.
  2. With data in a pivoted format, comparing data side-by-side is much easier. BI tools can highlight the trends/outliers and patterns that might not be evident in a row-by-row approach.
  3. By organizing data into a matrix-like structure, pivoting makes efficient use of space in visual representations.

I tried to do a quick visualization using Databricks Charts,

Visualization using Databricks Charts

If you like the article, remember to clap as many times as you can and leave feedback on how this helps in your use case(s).

Pyspark
Bi
Pivot
Databricks
Charts
Recommended from ReadMedium