Joining PySpark DataFrames with multiple conditions and null values
It is important to be able to join DataFrames on multiple conditions. By default, when one of the join columns is null, the comparison is not a match, so a left join returns nulls for every column from the right-hand table. However, if you have data at different resolutions, you may not want a null join key to mean "no match." I will use fabricated real estate and climate data to explain.
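To see why, recall that in Spark SQL comparing null to any value yields null rather than false, and a join only keeps rows where the condition evaluates to true. Here is a self-contained check (nothing assumed beyond a local Spark session):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# null == "CA" evaluates to null (not False), so a join condition
# containing it never counts as a match
spark.range(1).select(
    (lit(None).cast("string") == lit("CA")).alias("null_eq_value")
).show()  # null_eq_value shows as null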
Say you have a dataset with weather data, but the data is at the state level for some states and at the city level for others:
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

geo_data = [
    {"state": "CA", "city": "Los Angeles", "average_temp": 66},
    {"state": "HI", "city": None, "average_temp": 70},
    {"state": "CA", "city": "San Francisco", "average_temp": 60},
]
geo_df = spark.createDataFrame(Row(**x) for x in geo_data).cache()
geo_df.show()
+------------+-------------+-----+
|average_temp| city|state|
+------------+-------------+-----+
| 66| Los Angeles| CA|
| 70| null| HI|
| 60|San Francisco| CA|
+------------+-------------+-----+
You want to join this weather data onto a list of homes you are interested in buying. The homes table looks like the following:
house_data = [
    {"mls": 111, "state": "CA", "city": "Los Angeles", "price": 1000000},
    {"mls": 222, "state": "CA", "city": "San Francisco", "price": 1200000},
    {"mls": 333, "state": "HI", "city": "Waimea", "price": 7000000},
    {"mls": 444, "state": "HI", "city": "Hawaii Kai", "price": 8000000},
]
house_df = spark.createDataFrame(Row(**x) for x in house_data).cache()
house_df.show()
+-------------+---+-------+-----+
| city|mls| price|state|
+-------------+---+-------+-----+
| Los Angeles|111|1000000| CA|
|San Francisco|222|1200000| CA|
| Waimea|333|7000000| HI|
| Hawaii Kai|444|8000000| HI|
+-------------+---+-------+-----+
Now, what can you do when the resolution of the weather data does not match the resolution of the real estate data? The first thought may be to use the eqNullSafe() operator in PySpark. But that operator only treats nulls as equal when the value is null in both DataFrames; a null city on the weather side and "Waimea" on the house side is still not a match, as the check below shows.
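As a quick sanity check (a sketch reusing geo_df and house_df from above), a null-safe equality join still leaves the Hawaii houses without weather data:

# eqNullSafe() only treats null as equal to null, so the HI row
# (city is null) does not match "Waimea" or "Hawaii Kai" -- in a
# left join those houses come back with a null average_temp
(
    house_df
    .join(
        geo_df,
        [house_df.state == geo_df.state, house_df.city.eqNullSafe(geo_df.city)],
        "left"
    )
    .select("mls", house_df.state, house_df.city, "average_temp")
    .show()
)

Instead, you can use an "or" statement in your join conditions: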
join_conditions = [
    house_df.state == geo_df.state,
    # match city-level rows exactly, and let state-level rows
    # (city is null) match every house in that state
    (house_df.city == geo_df.city) | (geo_df.city.isNull()),
]
(
    house_df
    .join(
        geo_df,
        join_conditions,
        "left"
    )
    .select(
        "mls",
        "price",
        house_df.state,
        house_df.city,
        "average_temp"
    )
    .show()
)
+---+-------+-----+-------------+------------+
|mls| price|state| city|average_temp|
+---+-------+-----+-------------+------------+
|111|1000000| CA| Los Angeles| 66|
|222|1200000| CA|San Francisco| 60|
|333|7000000| HI| Waimea| 70|
|444|8000000| HI| Hawaii Kai| 70|
+---+-------+-----+-------------+------------+
There you go! Both California houses matched their city-level rows, and both Hawaii houses fell back to the state-level row. Now you have your weather data joined onto your house data so you can make a better assessment of where you would like to live next. One caveat: if a state ever has both a state-level row and city-level rows, a house in one of those cities will match twice, so keep the two resolutions mutually exclusive within a state.