Valeriy Manokhin, PhD, MBA, CQF

Summary

The web content discusses the comparison between Facebook Prophet and its successor, NeuralProphet, in the context of time series forecasting, highlighting the limitations of Facebook Prophet and the potential improvements offered by NeuralProphet.

Abstract

The article critically examines the transition from Facebook Prophet to NeuralProphet, a newer forecasting algorithm. It outlines the shortcomings of Facebook Prophet, noting its poor performance and the discontinuation of its development. The author then introduces NeuralProphet, which is claimed to be a superior model that retains the interpretability of Facebook Prophet while improving forecasting accuracy. The article scrutinizes NeuralProphet's performance through a series of experiments, comparing it to Facebook Prophet and assessing the impact of autoregressive (AR) terms and neural network components. The results indicate that NeuralProphet significantly outperforms Facebook Prophet when AR terms are included and that the addition of non-linear AR-Net components does not necessarily improve out-of-sample forecasting. The article concludes by suggesting that while NeuralProphet is an advancement over Facebook Prophet, it may not surpass other established forecasting methods like ARIMA.

Opinions

  • The author expresses skepticism about the original claims made by Facebook Prophet, considering them to be overstated and not supported by empirical evidence.
  • There is a clear opinion that Facebook Prophet's performance is subpar, especially when compared to other time series forecasting algorithms.
  • The author is critical of the lack of proper benchmarking in the original Facebook Prophet paper.
  • The introduction of NeuralProphet is met with cautious optimism, acknowledging its potential while also highlighting the need for rigorous testing against other algorithms.
  • The author points out that NeuralProphet's improvements are mainly due to the inclusion of AR terms, a feature already present in other models like ARIMA.
  • The article suggests that the complexity added by NeuralProphet's AR-Net components may not translate to better real-world forecasting performance.
  • The author emphasizes the importance of out-of-sample performance over in-sample fit when evaluating forecasting models.
  • There is an underlying sentiment that the forecasting community should be wary of adopting new algorithms without thorough comparison to existing, well-established methods.

Benchmarking Neural Prophet. Part I - Neural Prophet vs Facebook Prophet.

In 2020-2021 I wrote many LinkedIn posts explaining that Facebook Prophet is a non-performing forecasting algorithm that not only does not work across any reasonable set of time series datasets, but also underperforms most other forecasting algorithms.

2022 update: Meta has abandoned all the claims made by the original Facebook Prophet development team, including grotesque claims such as 'anyone can achieve forecasting performance on par with human experts by using Facebook Prophet.'

As I mentioned in my interview with Analytics India Magazine (see "Facebook Prophet Falls out of favour"), Facebook Prophet's credibility and popularity have taken a severe hit. I have previously pointed out that recent time series papers no longer use Facebook Prophet as a baseline, as it does not perform well on general forecasting tasks.

More importantly, as explained in several posts, such issues cannot be rectified, because they stem from pathological flaws inherent in Facebook Prophet's design itself.

The recent launch of 'NeuralProphet' was trumpeted by the new dev team with great fanfare: "We introduce NeuralProphet, a successor to Facebook Prophet, which set an industry standard for explainable, scalable, and user-friendly forecasting frameworks."

I did not realise that Facebook Prophet set any standards other than generally terrible forecasting performance, but let's not spoil the show.

The launch of 'NeuralProphet' caused a strange sense of déjà vu, reminiscent of when the original Facebook Prophet devs claimed that 'anyone can obtain excellent performance on par with human experts by using Facebook Prophet', whilst the paper about Facebook Prophet did not even benchmark it properly on any datasets beyond an internal Facebook dataset, or indeed against any other algorithms.

Fast forward 2+ years: many scientific papers, articles and social media posts have demonstrated that Facebook Prophet does not work well in general compared to other time series forecasting algorithms. It does not generalise to diverse datasets and does not even work well on the data it was expressly designed for: data with trend and seasonality.

Coming back to the new incarnation of the 'prophet': the NeuralProphet dev team recently posted a paper on arXiv claiming there is a need for hybrid solutions to bridge the gap between interpretable classical methods and scalable deep learning methods. However, this claim is not backed by scientific evidence. The results of the M5 competition demonstrated that data-driven machine learning methods outperformed both simple and hybrid methods.

No 'hybrid methods' appeared anywhere near the top of the M5 forecasting competition leaderboard, and the creators of the two winning 'hybrids' from M4 (Slawek Smyl and the team from Monash, who took the #1 and #2 places in the M4 forecasting competition) did not take any top spots in M5 either. Instead, the M5 competition was won by a variety of LightGBM-based methods, which have dominated Kaggle contests for a long time.

According to the claims from the NeuralProphet development team:

๐‘ถ๐’•๐’‰๐’†๐’“๐’˜๐’Š๐’”๐’†, ๐‘ต๐’†๐’–๐’“๐’‚๐’๐‘ท๐’“๐’๐’‘๐’‰๐’†๐’• ๐’“๐’†๐’•๐’‚๐’Š๐’๐’” ๐’•๐’‰๐’† ๐’…๐’†๐’”๐’Š๐’ˆ๐’ ๐’‘๐’‰๐’Š๐’๐’๐’”๐’๐’‘๐’‰๐’š ๐’๐’‡ ๐‘ท๐’“๐’๐’‘๐’‰๐’†๐’• ๐’‚๐’๐’… ๐’‘๐’“๐’๐’—๐’Š๐’…๐’†๐’” ๐’•๐’‰๐’† ๐’”๐’‚๐’Ž๐’† ๐’ƒ๐’‚๐’”๐’Š๐’„ ๐’Ž๐’๐’…๐’†๐’ ๐’„๐’๐’Ž๐’‘๐’๐’๐’†๐’๐’•๐’”. ๐‘ถ๐’–๐’“ ๐’“๐’†๐’”๐’–๐’๐’•๐’” ๐’…๐’†๐’Ž๐’๐’๐’”๐’•๐’“๐’‚๐’•๐’† ๐’•๐’‰๐’‚๐’• ๐‘ต๐’†๐’–๐’“๐’‚๐’๐‘ท๐’“๐’๐’‘๐’‰๐’†๐’• ๐’‘๐’“๐’๐’…๐’–๐’„๐’†๐’” ๐’Š๐’๐’•๐’†๐’“๐’‘๐’“๐’†๐’•๐’‚๐’ƒ๐’๐’† ๐’‡๐’๐’“๐’†๐’„๐’‚๐’”๐’• ๐’„๐’๐’Ž๐’‘๐’๐’๐’†๐’๐’•๐’” ๐’๐’‡ ๐’†๐’’๐’–๐’Š๐’—๐’‚๐’๐’†๐’๐’• ๐’๐’“ ๐’”๐’–๐’‘๐’†๐’“๐’Š๐’๐’“ ๐’’๐’–๐’‚๐’๐’Š๐’•๐’š ๐’•๐’ ๐‘ท๐’“๐’๐’‘๐’‰๐’†๐’• ๐’๐’ ๐’‚ ๐’”๐’†๐’• ๐’๐’‡ ๐’ˆ๐’†๐’๐’†๐’“๐’‚๐’•๐’†๐’… ๐’•๐’Š๐’Ž๐’† ๐’”๐’†๐’“๐’Š๐’†๐’”. ๐‘ต๐’†๐’–๐’“๐’‚๐’๐‘ท๐’“๐’๐’‘๐’‰๐’†๐’• ๐’๐’–๐’•๐’‘๐’†๐’“๐’‡๐’๐’“๐’Ž๐’” ๐‘ท๐’“๐’๐’‘๐’‰๐’†๐’• ๐’๐’ ๐’‚ ๐’…๐’Š๐’—๐’†๐’“๐’”๐’† ๐’„๐’๐’๐’๐’†๐’„๐’•๐’Š๐’๐’ ๐’๐’‡ ๐’“๐’†๐’‚๐’-๐’˜๐’๐’“๐’๐’… ๐’…๐’‚๐’•๐’‚๐’”๐’†๐’•๐’”. ๐‘ญ๐’๐’“ ๐’”๐’‰๐’๐’“๐’• ๐’•๐’ ๐’Ž๐’†๐’…๐’Š๐’–๐’Ž-๐’•๐’†๐’“๐’Ž ๐’‡๐’๐’“๐’†๐’„๐’‚๐’”๐’•๐’”, ๐‘ต๐’†๐’–๐’“๐’‚๐’๐‘ท๐’“๐’๐’‘๐’‰๐’†๐’• ๐’Š๐’Ž๐’‘๐’“๐’๐’—๐’†๐’” ๐’‡๐’๐’“๐’†๐’„๐’‚๐’”๐’• ๐’‚๐’„๐’„๐’–๐’“๐’‚๐’„๐’š ๐’ƒ๐’š 55 ๐’•๐’ 92 ๐’‘๐’†๐’“๐’„๐’†๐’๐’•.'

We already know that Facebook Prophet is a low-performance forecasting algorithm of terrible quality and is outperformed by many other algorithms, including on datasets where Facebook Prophet is supposed to work. So if anything, the new paper by the NeuralProphet dev team confirms that Facebook Prophet is terrible by pointing out that it is outperformed by 55 to 92 per cent by what is conceptually the same type of algorithm [we will talk about the differences later in this article].

But the NeuralProphet paper does not tell us anything about whether NeuralProphet is any good at all in comparison with the many other algorithms available, as it simply does not benchmark NeuralProphet against anything other than ... Facebook Prophet.

Why did the paper's authors not include any benchmarks, as is standard in almost any other paper introducing a new forecasting algorithm? After all, who needs another algorithm unless it is proven to perform well against at least a core set of already available methods?

So let's start the journey to see how good (if at all) 'NeuralProphet' really is.

In the first part of the series, we will take NeuralProphet out of the garage, open the bonnet (hood), and kick the tires.

To begin, we will use the same dataset that the NeuralProphet developer team has used: https://neuralprophet.com/html/energy_solar_pv.html

We first do the same experiments here to ensure reproducibility and compare apples with apples.
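For readers who want to follow along, here is a minimal data-loading sketch. The file name is a placeholder (an assumption, not from the tutorial); the only real requirement is that the dataframe ends up with the two columns NeuralProphet expects, 'ds' (timestamps) and 'y' (hourly solar PV output), as in the dataset linked above.

import pandas as pd

# Placeholder path (assumption): a local copy of the hourly solar PV series
# used in the NeuralProphet solar PV tutorial.
df = pd.read_csv("SF_hourly_solar_pv.csv")
df["ds"] = pd.to_datetime(df["ds"])   # NeuralProphet requires 'ds' and 'y' columns
df = df[["ds", "y"]]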

First Neural Prophet model: no AR terms included (this is broadly the same model as the original Facebook Prophet, which also has no AR terms, so a priori a flawed model, but let's see how it goes anyway); the setup is sketched below.
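A minimal sketch of this configuration, assuming df is the dataframe loaded above: with n_lags=0 (the default), NeuralProphet uses no autoregression and reduces to a Prophet-style trend-plus-seasonality decomposition.

from neuralprophet import NeuralProphet

# No AR terms: n_lags=0 (the default) gives a Prophet-like decomposition model.
m = NeuralProphet(n_lags=0)
df_train, df_test = m.split_df(df, valid_p=0.1)   # hold out the last 10% for testing
train_metrics = m.fit(df_train, freq="H")          # hourly data, as in the tutorial
test_metrics = m.test(df_test)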

As seen in the plot below, without AR (autoregressive) terms the fit, even in-sample, is quite poor. This is in line with current domain knowledge: without AR terms, Neural Prophet is essentially the original Facebook Prophet.

If anything, the new Neural Prophet model is yet another confirmation that Facebook Prophet did not manage to reproduce the data with seasonalities for which it was expressly designed.

Neural Prophet without AR terms fit on the training part (90%) of the dataset. This is equivalent to the original Facebook Prophet model type

Predictions on the test set for one week ahead and one day ahead are also terrible.

Neural prophet (no AR terms) predicting one week ahead

Let's magnify this to see what happens with predictions one day (24 hours) ahead: Neural Prophet without AR terms fails to capture the simple pattern. The RMSE for the training set is 118, and for the test set it is 143.

Neural prophet (no AR terms) predicting one day ahead
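For reference, the test RMSE can also be checked manually from the prediction frame; a sketch, assuming the frame returned by predict() contains the actuals in 'y' and the one-step-ahead predictions in 'yhat1' (NeuralProphet's default column names).

import numpy as np

# Manual RMSE on the test set; fit()/test() report the same kind of metric.
forecast = m.predict(df_test)
err = forecast["y"] - forecast["yhat1"]
rmse = float(np.sqrt(np.nanmean(err ** 2)))   # nanmean skips any warm-up rows
print(f"Test RMSE: {rmse:.1f}")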

So we can reach our first conclusion: without AR terms, Neural Prophet = Facebook Prophet = a generally useless forecasting model that is not fit for purpose.

Second Neural Prophet model: linear AR terms added.

We next fit Neural Prophet with AR terms included, using the same parameters as on the Neural Prophet website (n_lags = 3*24); see the sketch below.
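A sketch of this configuration, under the same assumptions about df as above; hyperparameters other than n_lags are left at their defaults here, which may differ slightly from the website exercise.

# Linear AR: n_lags=3*24 feeds the previous 72 hourly values into a linear
# autoregressive layer on top of the trend and seasonality components.
m_ar = NeuralProphet(n_lags=3 * 24)
df_train, df_test = m_ar.split_df(df, valid_p=0.1)
train_metrics = m_ar.fit(df_train, freq="H")
test_metrics = m_ar.test(df_test)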

The in-sample fit is now much better: using the AR terms, Neural Prophet can now capture the dynamics of the time series.

Neural prophet with linear AR terms fit on the training part (90%) of the dataset.
Neural prophet with linear AR terms predicting one week ahead
Neural prophet with linear AR terms predicting one day ahead

The RMSE for the training set is now 53, and for the test set it is around 31.

Much better than Neural Prophet without AR terms; however, just like Prophet, Neural Prophet seems unable to constrain radiation forecasts from going below zero...
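One pragmatic workaround, which is not part of the NeuralProphet exercise, is simply to clip the forecasts at zero in post-processing; a sketch, assuming the prediction frame uses NeuralProphet's default 'yhat1' column name.

# Solar radiation cannot be negative, so clip the point forecasts at zero.
forecast = m_ar.predict(df_test)
forecast["yhat1"] = forecast["yhat1"].clip(lower=0)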

Let's try one final architecture from the Neural Prophet website before we kick the tires and open the hood to check the engine.

Third (and last) Neural Prophet model: non-linear AR-Net.

One-step-ahead forecast with AR-Net, using a neural network with several hidden layers. We use an optimised learning rate of 0.003, as in the final version of the exercise on the Neural Prophet website.

m = NeuralProphet(
    growth="off",                 # no trend component
    yearly_seasonality=False,
    weekly_seasonality=False,
    daily_seasonality=False,
    n_lags=3 * 24,                # use the previous 72 hours as AR inputs
    num_hidden_layers=4,          # AR-Net: non-linear AR with 4 hidden layers
    d_hidden=16,                  # 16 units per hidden layer
    learning_rate=0.003,
)

We train this Neural Prophet model on the training set and evaluate it on the held-out test set.

# Reuse the AR-Net model `m` defined above rather than re-instantiating a
# default NeuralProphet(), which would discard the AR-Net configuration.
df_train, df_test = m.split_df(df, valid_p=0.1)   # hold out the last 10% as a test set
train_metrics = m.fit(df_train, freq="H")          # hourly data
test_metrics = m.test(df_test)
Neural prophet with non-linear AR terms (AR-Net with 4 hidden layers) fit on the training part (90%) of the dataset.
Neural prophet with non-linear AR terms (AR-Net with 4 hidden layers) predicting one week ahead
Neural prophet with non-linear AR terms (AR-Net with 4 hidden layers) predicting one day ahead
Neural prophet with non-linear AR terms (AR-Net with 4 hidden layers) trend plus AR components
Neural prophet with non-linear AR terms (AR-Net with 4 hidden layers) AR component lag relevance

I have to say the plots look much nicer now, but one should never let the in-sample pictures charm you. What matters is the performance out-of-sample.

The RMSE for the training set is now 39, and for the test set it is around 31.

We note that whilst the training error (RMSE) has come down from 53 to 39 due to the much higher model capacity (a 4-layer DNN instead of a linear function), the test error did not change. So, for this particular dataset, AR-Net provided no out-of-sample improvement in comparison with linear AR terms.

Conclusion 1: AR terms are crucial. Neural Prophet only adds value vis-à-vis Facebook Prophet when autoregressive terms are switched on.

But then again, other models also include AR terms, in particular ARIMA.

Conclusion 2: AR-Net does not seem to add much additional value once linear AR terms have been included. Whether this holds only for this dataset or is a more general result remains to be seen, but it is something to bear in mind: if it is a general observation, then Neural Prophet does not offer anything new beyond the ARIMA/SARIMA model family.

Let's plot the final Neural Prophet Model (AR-Net with 4 hidden layers) predictions on the test set.
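A sketch of how such a plot can be produced with NeuralProphet's built-in plotting, assuming m is the trained AR-Net model and df_test the held-out split from above.

# Predict on the held-out test set and plot predictions against actuals.
forecast = m.predict(df_test)
fig = m.plot(forecast)   # standard NeuralProphet forecast plot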

Neural Prophet Model (AR-Net with 4 hidden layers) predictions on the test set vs actual

So far, so good: the tires did not fall off when we kicked them, but the engine needed some oil and tuning. In our next article, "Benchmarking Neural Prophet. Part II - exploring electricity dataset", we take our Neural Prophet model for a ride and check what else is on the road, and whether it can stay in its lane or needs to move into the slow lane.

To be continued...

References:

  1. "Facebook Prophet Falls out of favour"
  2. "Benchmarking Neural Prophet. Part II โ€” exploring electricity dataset"
  3. "Benchmarking Facebook Prophet"