
Another Steep Curve? Predicting India’s COVID-19 Third Wave with LSTM

Complete project of predicting new COVID-19 cases in the next 90 days with LSTM (code included)

India is seeing a steep rise in COVID-19 cases again! Only 6,358 new cases were registered on 27 December 2021. But in just 14 days, on 10 January 2022, a whopping 168,063 new cases were registered. The curve of the third wave is very steep and that’s why it is a big concern right now.

So, I decided to use Long Short-Term Memory (LSTM), a recurrent neural network (RNN) architecture, to predict how the COVID-19 curve will look in the near future (the next 90 days).

Dataset:

The dataset comes from ‘COVID-19 India Datasets by DataMeet’. The data is collected, cleaned and organized by the community from different government websites, which are freely available to everyone in India.

GitHub repository: https://github.com/datameet/covid19

The dataset is released under a Creative Commons Attribution 4.0 International Public License. It was downloaded on 10 January 2022 and contains data up to the same date.

We are using the all_totals.json file in the data directory. The pre-processing step shows how I processed the data from this JSON file for use with the LSTM architecture.

Methodology

Let’s go through how I carried out the project. The Results section that follows discusses what the model predicted.

Import Libraries:

Let’s import the necessary libraries for the project.

import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM,Activation,Dense,Dropout
from tensorflow.keras.callbacks import EarlyStopping
%matplotlib inline

scaler = MinMaxScaler() # Scales the data to the 0-1 range

Data Pre-Processing:

From the dataset file all_totals.json, I created a dataframe that contains the new cases registered every day.

# Load the JSON file as a dictionary; the file is closed automatically
with open('all_totals.json') as f:
    data = json.load(f)

total_cases_list = []
new_cases_list = []
    
pre_total_cases = 0
for row in data["rows"]:
    if "total_confirmed_cases" in row["key"]:
        temp_list = []
        temp_list.append(row["key"][0][0:10]) # Appending the date (the time part is trimmed)
        temp_list.append(row["value"]) # Appending the value on that date
        total_cases_list.append(temp_list)
        
        temp_list_2 = []
        temp_list_2.append(row["key"][0][0:10])
        temp_list_2.append((row["value"] - pre_total_cases)) # Appending the new cases (difference from the previous total)
        new_cases_list.append(temp_list_2)
        
        pre_total_cases = row["value"]
    
df_total = pd.DataFrame(total_cases_list, columns = ["Date", "Total Cases"])
df_new = pd.DataFrame(new_cases_list, columns = ["Date", "New Cases"])

The dataframe named df_new contains the new cases registered on each day (some dates are missing), starting from 2020-01-30.
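As an aside, the same daily figures can be derived from a cumulative series with the pandas diff method. The sketch below uses made-up totals (not values from the dataset) and a hypothetical helper named daily_new_cases; it only illustrates the subtraction that the loop above performs.

```python
import pandas as pd

def daily_new_cases(totals: pd.Series) -> pd.Series:
    """Turn cumulative totals into per-day new cases.

    The first day has no previous total, so it is compared against 0,
    mirroring pre_total_cases = 0 in the loop above.
    """
    return totals.diff().fillna(totals.iloc[0]).astype(int)

# Made-up cumulative totals, for illustration only.
toy_totals = pd.Series(
    [1, 1, 3, 6],
    index=["2020-01-30", "2020-01-31", "2020-02-01", "2020-02-02"],
    name="Total Cases",
)
toy_new = daily_new_cases(toy_totals)
print(toy_new.tolist())  # [1, 0, 2, 3]
```

Either way, when a date is missing from the data, the difference for the next available date covers more than one day.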

Let’s make the ‘Date’ field the index.

df_total = df_total.set_index("Date")
df_new = df_new.set_index("Date")

Let’s delete some rows with faulty values from both dataframes:

# Deleting the rows for four dates with faulty values

df_total.drop('2021-09-16', inplace=True, axis=0)
df_total.drop('2021-09-17', inplace=True, axis=0)
df_total.drop('2021-07-21', inplace=True, axis=0)
df_total.drop('2021-07-22', inplace=True, axis=0)

df_new.drop('2021-09-16', inplace=True, axis=0)
df_new.drop('2021-09-17', inplace=True, axis=0)
df_new.drop('2021-07-21', inplace=True, axis=0)
df_new.drop('2021-07-22', inplace=True, axis=0)
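The article does not say how the faulty rows were identified. One possible check, offered here only as an assumption, is to look for negative daily counts, since a cumulative total should never go down. The helper name and the numbers below are hypothetical.

```python
import pandas as pd

def dates_with_negative_new_cases(new_cases: pd.Series) -> list:
    """Return index labels where the daily count is below zero, which can
    only happen if the cumulative total went down."""
    return new_cases.index[new_cases < 0].tolist()

# Made-up daily counts with one obviously broken day.
toy_new_cases = pd.Series(
    [5, 7, -100, 110, 6],
    index=["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04", "2021-01-05"],
)
suspect_dates = dates_with_negative_new_cases(toy_new_cases)
print(suspect_dates)  # ['2021-01-03']
```

Note that the day after a dip is usually inflated as well (110 in the toy data), which may be why the rows above are dropped in pairs of consecutive dates.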

This is how the dataframe looks now:

Let’s see the curve of total cases of COVID-19 registered in India:

Total Cases Registered in India. Image Source: Author.

And this is the curve of new cases of COVID-19 registered in India:

New cases registered in India. Image Source: Author.

Now we will build the training dataset and training labels. The training dataset will be time-series data. We chose 90 days as the window size of the time series data.

day = 90 # Number of days (window size)

Let’s create the training dataset and training labels now.

k = 0

array = []
array_temp = []
train_data = []
train_labels = []

for i in range(len(df_new)):
    array_temp.append(df_new.iloc[i]['New Cases'])

array_temp = np.array(array_temp).reshape(-1,1)
array_temp = scaler.fit_transform(array_temp)
array_temp = array_temp.tolist()

for i in array_temp:
    array.append(i[0])

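for i in range(len(array)):
    try:
        train_data.append(array[k:day+k]) # Creating inner lists with 'day' days of data
        train_labels.append([array[day+k]]) # Raises IndexError once no next-day value is left
        k+=1
    except IndexError:
        break # The last window was already appended without a label; it is trimmed later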

length = max(map(len, train_data))
train_data=np.array([xi+[None]*(length-len(xi)) for xi in train_data]).astype('float32')

length = max(map(len, train_labels))
train_labels = np.array([xi+[None]*(length-len(xi)) for xi in train_labels]).astype('float32')

We used the MinMaxScaler to bring our data into the 0–1 range. Before scaling, we reshaped the data into a single column with n rows (n = number of elements in the array), which is the 2D shape the scaler expects. After scaling, we converted the array back to a list.
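For the default 0–1 range, min-max scaling computes (x - min) / (max - min), and the inverse multiplies back by (max - min) and adds min. The sketch below reproduces that arithmetic with plain NumPy on made-up numbers; the function names are hypothetical and this is not the scikit-learn implementation.

```python
import numpy as np

def min_max_scale(values: np.ndarray):
    """Scale values to [0, 1] and return the (min, max) needed to undo it."""
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo), (lo, hi)

def min_max_unscale(scaled: np.ndarray, bounds):
    """Invert min_max_scale using the stored (min, max)."""
    lo, hi = bounds
    return scaled * (hi - lo) + lo

toy_cases = np.array([0.0, 50.0, 100.0, 200.0])  # made-up daily counts
toy_scaled, toy_bounds = min_max_scale(toy_cases)
print(toy_scaled)  # [0.   0.25 0.5  1.  ]
toy_restored = min_max_unscale(toy_scaled, toy_bounds)
print(toy_restored)  # [  0.  50. 100. 200.]
```

This is also why the same fitted scaler is needed later: inverse_transform relies on the minimum and maximum learned from the original data.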

Then, we created the training data. Each window of 90 consecutive points forms one training example, and the 91st point is its label. The windows are collected in a list of lists, and the labels in a second list. Finally, both lists are converted into NumPy arrays.
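The windowing described above can also be written without an explicit loop. The following is a minimal sketch, assuming NumPy 1.20 or newer (for sliding_window_view); make_windows is a hypothetical helper and the series is made up. It is an alternative formulation, not the code used in this project.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(series, window):
    """Return (inputs, labels): each input row holds `window` consecutive
    values and its label is the value that follows."""
    series = np.asarray(series, dtype="float32")
    # All windows of length window + 1; the last column is the label.
    full = sliding_window_view(series, window + 1)
    return full[:, :-1].copy(), full[:, -1:].copy()

toy_series = [10, 20, 30, 40, 50, 60]  # made-up values
toy_x, toy_y = make_windows(toy_series, window=3)
print(toy_x.shape, toy_y.shape)  # (3, 3) (3, 1)
print(toy_x[0], toy_y[0])        # [10. 20. 30.] [40.]
```

Because every window is cut together with its label, this version never produces a trailing window that has no label.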

The shapes of the training data and the labels are:

print(train_data.shape)
print(train_labels.shape)

Shape:

(644, 90) (643, 1)

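The training data has one more row than the labels because the last window has no following day to serve as its label. Let’s drop that extra row and then expand the dimensions of the training dataset for the LSTM architecture.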

train_data = train_data[:len(train_labels)]
train_data = np.expand_dims(train_data,1)
train_data.shape

Shape: (643, 1, 90)
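Keras recurrent layers take 3D input of the form (samples, time steps, features). The sketch below, using a zero-filled placeholder array, shows what the shape above means: each sample is a single time step carrying 90 features. The second layout is shown only for comparison and is not used in this project; it would require input_shape=(day, 1) in the model.

```python
import numpy as np

# Placeholder with the same shape as the trimmed train_data.
toy_windows = np.zeros((643, 90), dtype="float32")

# Layout used in this article: 1 time step carrying 90 features.
one_step = np.expand_dims(toy_windows, 1)
print(one_step.shape)  # (643, 1, 90)

# Alternative layout (not used here): 90 time steps with 1 feature each.
ninety_steps = np.expand_dims(toy_windows, -1)
print(ninety_steps.shape)  # (643, 90, 1)
```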

Let’s see how one row of the training data looks now.

train_data[1]

One row of the training dataset. Image Source: Author.

Let’s see how one row of the label dataset looks:

train_labels[1]

Output: array([0.0027741], dtype=float32)

Architecture:

Let’s build the LSTM architecture now.

model = Sequential()

model.add(LSTM(250,input_shape=(1,day)))
model.add(Dropout(0.5))

model.add(Dense(250,activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(day,activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(1,activation='relu'))

model.compile(loss='mean_squared_error',optimizer='adam')

model.summary()

LSTM architecture summary. Image Source: Author.
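As a rough cross-check of the summary, the number of trainable parameters can be worked out by hand. The sketch below assumes the default Keras LSTM and Dense layers with biases enabled; the helper names are hypothetical, and the figures are computed from the standard formulas rather than copied from the summary image.

```python
def lstm_params(units, input_dim):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    """One weight per input-output pair plus one bias per output."""
    return units * input_dim + units

day = 90  # window size, as in the article
counts = {
    "lstm": lstm_params(250, day),        # 341000
    "dense_250": dense_params(250, 250),  # 62750
    "dense_day": dense_params(day, 250),  # 22590
    "dense_1": dense_params(1, day),      # 91
}
total_params = sum(counts.values())
print(counts)
print(total_params)  # 426431
```

Dropout layers add no parameters, so they do not appear in the count.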

Training:

Let’s fit the data to the model now.

E = 1000
callback = EarlyStopping(monitor='loss', mode='min', patience=20)
H = model.fit(train_data,train_labels,epochs=E, verbose=0, callbacks=[callback])
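To make the role of patience concrete, here is a simplified pure-Python sketch of patience-based early stopping on a made-up loss curve. It is an illustration of the idea, not the Keras implementation, which has further options such as min_delta and restore_best_weights; stopping_epoch is a hypothetical name.

```python
def stopping_epoch(losses, patience):
    """Return the 0-based epoch at which training would stop, or None if
    the losses run out first."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            # New best loss: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: stop once this has happened `patience` times in a row.
            wait += 1
            if wait >= patience:
                return epoch
    return None

toy_losses = [0.9, 0.5, 0.4, 0.41, 0.42, 0.39, 0.40, 0.40, 0.40]
print(stopping_epoch(toy_losses, patience=3))  # 8
```

With patience=20 and monitor='loss' as above, training ends once the training loss has failed to improve for 20 consecutive epochs, or after 1000 epochs, whichever comes first.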

After the training step is done, let’s plot the loss value.

loss = H.history['loss']
epochs = range(0,len(loss))
plt.figure(figsize=(15,5))
plt.plot(epochs,loss)

Plot of loss value for each epoch. Image Source: Author.

Let’s plot the predictions given by LSTM along with the real values of newly registered COVID-19 cases each day to see how accurate the predictions are.

preds = scaler.inverse_transform(model.predict(train_data))
plt.figure(figsize=(30,10))
plt.plot(preds,label='our predictions')
plt.plot(scaler.inverse_transform(train_labels),label='real values')
plt.legend()

Predicted values along with real values (newly registered COVID-19 cases each day). Image Source: Author.

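From this graph, we can see that the predicted values follow the real values closely. Keep in mind that these are predictions on the same data the model was trained on, so the graph shows how well the model fits that data rather than how well it predicts unseen days.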
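A visual comparison can be complemented with a number. The sketch below computes two common error measures, mean absolute error (MAE) and root mean squared error (RMSE), on made-up values; the helper functions are hypothetical and no such figures were reported for this project.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in cases per day."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root mean squared error: like MAE, but penalizes large misses more."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

toy_real = np.array([100.0, 200.0, 300.0, 400.0])  # made-up real values
toy_pred = np.array([110.0, 190.0, 330.0, 400.0])  # made-up predictions
print(mae(toy_real, toy_pred))   # 12.5
print(rmse(toy_real, toy_pred))  # square root of 275, about 16.58
```

With the variables from this article, the call would look like mae(scaler.inverse_transform(train_labels), preds).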

Future Predictions:

Let’s try to predict this curve in the near future. We will try to predict how many new cases of COVID-19 will be registered in India over the next 90 days (starting from 11 January 2022).

days_to_predict = 90
seed = array[-day:]
#seed

Here we are creating a seed. The seed is simply the last 90 days of data in the array. We use it to predict the value for the next day. Once that value is predicted, we append it to the seed so that it becomes part of the input for the next prediction.

Let’s take an example. The seed first contains the last 90 days of data (the number of new COVID-19 cases registered each day), ending on 10 January 2022. Using it, we predict the value for 11 January 2022. Then we use the predicted value for 11 January 2022 as a data point in the seed to predict the value for 12 January 2022, and so on.
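The rolling-seed idea described above can be sketched independently of the LSTM. In the example below, the trained model is replaced by a stand-in predictor (the mean of the window), which is purely an assumption for illustration; forecast and predict_next are hypothetical names.

```python
import numpy as np

def forecast(seed, steps, window, predict_next):
    """Roll a window forward: predict one value, append it, repeat."""
    history = list(seed)
    for _ in range(steps):
        # Only the most recent `window` values are fed to the predictor.
        next_value = predict_next(np.array(history[-window:]))
        history.append(float(next_value))
    # Everything after the original seed is a prediction.
    return history[len(seed):]

# Stand-in predictor: the mean of the window (not the LSTM).
toy_forecast = forecast(seed=[1.0, 2.0, 3.0], steps=2, window=3,
                        predict_next=lambda w: w.mean())
print(toy_forecast)  # first value 2.0, second about 2.33
```

The second prediction already depends on the first one, so any error made early on is carried into every later step. This is worth remembering when reading a 90-day forecast produced this way.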

Let’s use this model to predict the number of cases for the next 90 days:

for _ in range(days_to_predict):
    current_days = seed[-day:]
    current_days = np.squeeze(current_days)
    current_days = np.expand_dims(current_days,0)
    current_days = np.expand_dims(current_days,0)
    pred = model.predict(current_days)
    seed = np.append(seed,pred)

Let’s plot the predicted values:

upcoming_days_prediction = scaler.inverse_transform(seed[-days_to_predict:].reshape(-1,1))
plt.figure(figsize=(30,10))
plt.plot(range(0,days_to_predict),upcoming_days_prediction)

Let’s append the predicted values to the values in our dataset and plot the complete graph.

# Adding real values and predicted values together
arr_without_pred = scaler.inverse_transform(train_labels)
arr_pred = scaler.inverse_transform(seed[-days_to_predict:].reshape(-1,1))
arr_with_pred = np.concatenate((arr_without_pred, arr_pred))

plt.figure(figsize=(30,10))
plt.plot(arr_with_pred)

Results:

Let’s see the prediction of newly registered COVID-19 cases for the upcoming 90 days (starting from 11 January 2022).

Prediction of newly registered COVID-19 cases for the upcoming 90 days. Image Source: Author.

Let’s see the complete graph of newly registered COVID-19 cases each day, along with the predicted values for the upcoming 90 days (starting from 11 January 2022).

Newly registered COVID-19 cases each day, along with the predicted values for the upcoming 90 days. Image Source: Author.

So, how many total cases does the model predict for the upcoming 90 days?

total_new_cases = 0
for i in upcoming_days_prediction:
    total_new_cases += i[0]
    #print(i)
print(total_new_cases)

The answer is: 23,417,088

This model predicts about 23.4 million new COVID-19 cases in India in the upcoming 90 days (starting from 11 January 2022).
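To read the forecast by calendar date, the predictions can be paired with a date index. The sketch below assumes one prediction per calendar day starting on 11 January 2022 and uses placeholder values (all ones) instead of the real predictions; the column name is hypothetical.

```python
import numpy as np
import pandas as pd

days_to_predict = 90
# Placeholder predictions; in the article this would be
# upcoming_days_prediction[:, 0].
toy_predictions = np.ones(days_to_predict)

forecast_dates = pd.date_range(start="2022-01-11", periods=days_to_predict, freq="D")
forecast_df = pd.DataFrame({"Predicted New Cases": toy_predictions}, index=forecast_dates)

print(forecast_df.index[0].date(), forecast_df.index[-1].date())  # 2022-01-11 2022-04-10
print(forecast_df["Predicted New Cases"].sum())  # 90.0
```

The sum over the column gives the same total as the loop above, and the index shows that the 90-day horizon ends on 10 April 2022.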

Conclusion

This project simply showcases the use of the LSTM architecture for predicting time-series data. In this case, we used the COVID-19 data from India for our study. The model does not consider transmissibility or other factors when making its predictions.

Since the transmissibility of the Omicron variant is much higher than that of the Delta variant of COVID-19 (the Delta variant was the dominant variant during the second wave of the pandemic in India), I personally think that we will see a much steeper and higher curve in the near future.

But I am no health worker or doctor. So take everything I say with a grain of salt.

Repository: https://github.com/SamratDuttaOfficial/Covid_India_LSTM

Make sure to give the GitHub repository a star.

Samrat Dutta:

GitHub: https://github.com/SamratDuttaOfficial

LinkedIn: https://www.linkedin.com/in/SamratDuttaOfficial [Hire Me]

Wisest Friends (Machine Learning) Discord: https://discord.gg/7Bx6PGVy

Buy me a coffee: https://www.buymeacoffee.com/SamratDutta