Summary

Using randomSplit in PySpark for splitting data into train and test sets can lead to unshuffled data and significant issues in machine learning model training due to its sorting behavior, which is different from the expected random ordering like train_test_split in sklearn.

Abstract

When working with large-scale data for machine learning models, it's crucial to avoid using PySpark's randomSplit function for dividing datasets into training and testing portions. Contrary to the random order provided by sklearn's train_test_split, randomSplit sorts the partitions before splitting, which can result in non-random sequences in the data. This can be problematic because it may cause the training batches to be unrepresentative of the overall data distribution, particularly if the target variable is binary. For instance, a dataset with features x and binary target labels y might end up with training batches that contain only one class of y, leading to poor model performance. The issue arises because randomSplit does not randomize the order of rows as expected, potentially causing the first part of the split dataframe to contain only one type of instance.

Opinions

The author advises against using randomSplit in PySpark for splitting data due to its unintended sorting behavior, which can lead to silent problems in model training.
The expected behavior of data splitting should mimic sklearn's train_test_split, which assigns rows randomly to either the train or test set.
The sorting behavior of randomSplit is explicitly mentioned in the PySpark code comments, highlighting a known discrepancy from the expected random split.
The author emphasizes that the lack of proper shuffling can severely impact the distribution of the target variable in the training batches, potentially destroying the performance of machine learning models.
To mitigate this issue, the author suggests three potential solutions: reshuffling the dataframes post randomSplit, reordering columns before splitting, or abandoning randomSplit in favor of using filter to create two separate dataframes, accepting that this may not result in an exact split ratio but a good approximation for large datasets.

Why you should not use randomSplit in PySpark to split data into train and test.

In case you work with large scale data and want to prepare dataset for your Tensorflow/PyTorch model, don’t use randomSplit function to split data into train and test.

The problem

You have a pyspark dataframe and you would like to split it into two dataframes, train and test. Obviously, you would like to have both parts to be shuffled. The behavior you expect should be similar to train_test_split in sklearn, where it assigns rows in random order into train or test.

However, PySpark function randomSplit first sorts the partitions and then makes splits. It can be found in the comment to the code.

Example

In this example we create a dataframe with x column and then shuffle this column

We would expect that after randomSplit we will see also random order of this column. However, it’s not the case, and in fact the column x is in sorted order.

Needless to say that this could cause huge silent problems later on during the training because the training batches would include only instances of one type. Here is an example.

Why it’s a big problem?

Imagine your dataset has 2 columns x and y , where x for features and y is binary target labels that take 0 or 1, for your ML model to predict. Naturally, when you train a model your training batch should have approximately the same distribution of y as in the whole dataset. However, if we use randomSplit, training batch will have either y=0 or y=1.

In the example, below we have original dataframe.

Now we do random split and get the following:

In fact, the first 10K rows of the dataframe has no y=1:

If we look at all the rows of df_train we will see that the ratio of 0 and 1 is balanced (about 250K out of 500K):

So it means that dataframe df_train has the first rows with y=0 which are followed by y=1. This is highly unshuffled behavior, which can silently destroy ML model’s peformance.

Solution

There are 3 solutions to this problem.

We can just reshuffle dataframes after randomSplit.The problem is just the cost: for big datasets reshuffling datasets can be expensive.

2. Another is more subtle. We can reorder columns and that would impact how ordering is performed. However, it does not always solve the problem.

3. Get rid of randomSplit altogether. Instead, just use filter to get 2 different dataframes. This does not guarantee exact split_ratio, but for a big enough dataset you should get a good approximation.