This article discusses the use of columns and rows in Apache Spark, including how to reference, add, rename, select, change type, and delete columns, as well as how to create, append, sort, limit, filter, and get unique rows.
Abstract
The article provides an overview of working with columns and rows in Apache Spark. It explains how to reference a column using the col() and column() methods, as well as how to add a column with a constant value or based on an existing column. The article also covers renaming column names, selecting a column, changing a column's data type, and deleting a column. In addition, the article discusses how to create a row, create a new DataFrame from a row list, append rows to an existing DataFrame, sort rows, limit row number, filter rows, and get unique rows.
Apache Spark >> Column and Row
In this article, we will talk about how to work with Column and Row in Apache Spark.
※Based on Apache Spark 3.2.1
Prepare data
About Column
About Row
Conclusion
Prepare data
First, we need to prepare some sample data for the demonstration.
Reference a Column
We can use the col() and column() functions in pyspark.sql.functions to construct a Spark Column. Both return a pyspark.sql.Column instance.
import pyspark.sql.functions as F
F.col("first_name")
>>Result: Column<'first_name'>
F.column("first_name")
>>Result: Column<'first_name'>
We can also use bracket syntax to reference a column of a DataFrame as follows.
df["first_name"]
>>Result: Column<'first_name'>
Or we can reference a column as an attribute of the DataFrame object.
df.first_name
>>Result: Column<'first_name'>
Get the column name list of a DataFrame
We can get the names of all columns using the columns attribute of the DataFrame.
We can also add a boolean flag column that indicates whether a student's class is A (true) or not (false).
We can use the expr() method to evaluate whether two columns, or a column and a value, are equal.
Add a Column Based on Condition
We can generate new columns based on existing columns and certain conditions.
We can use the when() and otherwise() methods to define the different condition branches.
Create a DataFrame from the row list
We can create a Spark DataFrame from a list of rows.
First, we can use the createDataFrame() method of SparkSession to create the DataFrame.
The first parameter can be the list of rows or an RDD generated from the row list.
The schema can be obtained from the schema property of an existing DataFrame.
df2 = spark.createDataFrame([newRow], df.schema)
# OR
df2 = spark.createDataFrame(spark.sparkContext.parallelize([newRow]), df.schema)
df2.show()
Append rows to existing DataFrame
Because DataFrames are immutable, we need to create a new DataFrame from the rows to append and then concatenate it with the existing DataFrame.
First, let’s show all the rows of the existing DataFrame.
df.show(100)
Next, let’s create a new DataFrame with the same schema as the existing DataFrame.
We will find that the new records are appended at the end of the concatenated DataFrame.
# Concatenate the two DataFrames
df.union(newDF).show(100)
Sort rows
Sometimes we need to sort the rows in the DataFrame in a certain order to achieve our goals.
We can use the sort() method of DataFrame to sort by one column or multiple columns.
If you want to sort by multiple columns and specify the sort order for each of them, you can pass a list of columns and a list of ascending flags.
Limit row number
Sometimes, we need to restrict the number of rows. We can use the limit() method of DataFrame.
df.limit(10).show()
Filter Rows
Sometimes, we need to extract some rows based on some conditions.
We can use the filter() or where() method of DataFrame to filter rows.
df.filter("score > 75").show()
# OR
df.filter(F.col("score") > 75).show()
df.where("score > 75").show()
# OR
df.where(F.col("score") > 75).show()
We can also chain multiple filter functions to filter rows step by step based on multiple conditions.
df.filter("score > 75").filter(F.col("Subject") == "Statistics").show()
# OR
df.filter("score > 75").where(F.col("Subject") == "Statistics").show()
# OR
df.where("score > 75").filter(F.col("Subject") == "Statistics").show()
# OR
df.where("score > 75").where(F.col("Subject") == "Statistics").show()
Get unique rows
We can use the distinct() method of DataFrame to get the unique rows.
For example, let’s first show the id, first_name, and last_name columns without using the distinct() method.
For Column, we covered how to reference a column, add a column, get the list of column names, rename columns, select a column, change a column’s data type, and delete a column.
For Row, we covered how to create a row, create a new DataFrame from a row list, append rows to an existing DataFrame, sort rows, limit the number of rows, filter rows, and get unique rows.
If you don’t have a local Spark development environment, you can read the article below.
Apache Spark >> Create a Local Apache Spark Development Environment on Windows With Just Two Commands