avatarSiméon Ferez

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

7545

Abstract

like 2 discrete variables with a precise distribution. We can also see that in Loan. In Amount we have 3 outliers and extreme values. It will be interesting to see if we want to keep these values in our model prediction, or if we decided to clean the data for a more homogenous dataset to train our model. This visualization is confirmed in the boxplot from figure 2.</p><h2 id="5014">DataEnergy.csv</h2><figure id="58f8"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*VPctNPmTfZT-FVdTzlHLkA.png"><figcaption>Figure 4: Pairplot DataEnergy.csv</figcaption></figure><figure id="0529"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*9EskpGSSBy5bPkJvKlVSOg.png"><figcaption>Figure 5: Boxplot DataEnergy.csv</figcaption></figure><figure id="f5d0"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*h_AI4fK-SgXA7mOlsO21AQ.png"><figcaption>Figure 6: Correlation matrix DataEnergy.csv</figcaption></figure><p id="4963">Similarly, from the pairplot of Figure 4, we can already just with this visualization of the data that PE is inversely correlated with AT and V. This is confirmed by the correlation matrix of Figure 6. This will probably mean that the coefficient of AT and V will be bigger than the coefficients from AP and RH.</p><p id="64c8">From the Pairplot Figure 4 and the Boxplot Figure 5, we can see that the dataset is clean and doesn’t have outliers’ values or extrema that can affect the model. In this case, we will not need to process and clean the data before using them.</p><p id="acba">We can also use for both datasets the correlation matrix to also see the correlation between two predictors X. It is also important to not train the model with too similar predictors because it could lead to issues in the prediction of the model. Because the variable will be taken as too similar, and the model will not be able to tell which one affects the output.</p><p id="c82b">The selection of the training is primordial because it will directly be linked to the underfitting or overfitting problems of our model. If some parameters were in type that cannot be read and understood by the model — such as string values — we will need to transform these parameters to numerical values. This is not the case in our datasets.</p><h1 id="7969">Results</h1><p id="7fdf">We will analyze two different datasets containing different predictor parameters: dataEnergy.csv and DataLoans.csv. During our analysis it’s important to pay attention to the feature correlation — be aware of the correlation between the predictor’s parameter — the data choice — we shouldn’t assume that all columns of the dataframe may be used, sometimes the model will work better without certain columns — the physical interpretation — It might be interesting to create new association and columns if necessary to yield to a better result.</p><p id="2b04">Finally, for the validation of our model, we will test our result on an “unfamiliar” dataset, it’s why we are splitting the data between a train and test dataset. The testing may avoid some issues such as: — <b>Overfitting</b>: often comes when we are not using a test dataset to validate the model. The model will predict very good results while training, but the accuracy with a new set of data may be bad. — <b>Underfitting</b>: The model doesn’t understand the change in the predictor and fails to predict the correct output, it is essentially the contrary of overfitting.</p><h2 id="c754">DataEnergy.csv</h2><h2 id="9ed4">Effect of predictors parameters</h2><p id="dc6d">In our first scenario, we will analyze the effect of training our model with different parameters, we will create several multiple regressions with different predictor parameters. We will start by training our model with the most correlated predictor (AT) and then adding the other predictors (V, AP, RH) one after the other.</p><figure id="9c18"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*KEigTH2zqCyB_rxZtaJjYA.png"><figcaption>Figure 7: DataEnergy model trained with [“AT”]</figcaption></figure><figure id="fc6f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*d2T-DJyRrFzexY_G8tBxPQ.png"><figcaption>Figure 8: DataEnergy model trained with [“AT”,”V”]</figcaption></figure><figure id="7491"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*xMpUqUX1hXj9SRNJRmFpUQ.png"><figcaption>Figure 9: DataEnergy model trained with [“AT”,”V”,”AP”]</figcaption></figure><figure id="139f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*BZw6CJIVW1U-R2wH7zKkyg.png"><figcaption>Figure 10: DataEnergy model trained with [“AT”,”V”,”AP”,”RH”]</figcaption></figure><blockquote id="2d0f"><p>First model: EP = θ_0+ θ_1AT R2 = 89,95 Second model: EP = θ_0+ θ_1AT+ θ_2V R2 = 91,44 Third model: EP = θ_0+ θ_1AT+ θ_2V + θ_3AP R2 = 91,32 Fourth model: EP = θ_0+ θ_1AT+ θ_3AP+ θ_4*RH R2 = 92,76</p></blockquote><figure id="81f7"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*793Di2GD2_jrcNajkaHfkA.png"><figcaption>Figure 11: DataEnergy Effect of parameters</figcaption></figure><p id="7a04">In this first experience, we have seen that just by using the most correlated predictor, AT we can already have a good accuracy and an R2 of 89,95 which is correct. But adding the others predictor even if at first sight they seem uncorrelated to y — like with the distribution of EP or RH seen in figure 4 — these coefficients will improve our model and raise R2 until 92,76 which starts to be a good and accurate model. We also see that the prediction of the model has less noise and has a better fit when adding more parameters — As shown in Figures 7,8,9,10 on the left graph. However, when training a model, we need to be careful to not add too many or unnecessary parameters that will lead to overfitting issues in our model.</p><h2 id="4327">Effect of learning rate</h2><p id="4c80">Our second scenario will focus on the effect of the learning rate along models, we will try different learning rates across the same model and analyze their effect on the model, and the values will start from 0.1 to 0.0001. We will finish the loop when the change in thetas converges around 10e-6 to minimize the unnecessary computational time.</p><figure id="5d25"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*CMvGtkw1Z9EwtkgTQbeFAg.png"><figcaption>Figure 12: DataEnergy Effect of learning rate</figcaption></figure><figure id="0b7a"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*R3J8mZlPYOwm67YsGSbRSg.png"><figcaption>Figure 13: DataEnergy Iteration before convergence</figcaption></figure><p id="4d9c">With a bigger learning rate, the value of thetas will change quicker and tend to converge faster too. These lead to less computational time but slightly less precise results in the end because the step is larger than with a small learning rate. It’s important to adapt the learning rate to each specific problem and the type of model you want to build. Depending on the security factor and the precision you expect at the end — depending also on the field of application (Motorsport, Space exploration, medical application).</p><h2 id="ce97">Effect of train/test ratio</h2><p id="a72d">In this third scenario, we will test our model with different train/test ratios, from 0,4 to 0,95.</p><figure id="e283"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*9fkZpib6M01_lcwJvLJbLw.png"><figcaption>Figure 14: DataEnergy, R2 depending on train/test ratio</figcaption></

Options

figure><p id="b9f3">As shown in Figure 14, in our case of the model, the train/test ratio doesn’t affect the R2 and accuracy of the model. It doesn’t mean that the testing dataset is not important, the testing and in other applications, we can also have a validation dataset primordial to avoid and detect overfitting problems in our model.</p><h2 id="b8f3">Effect of iterations</h2><p id="0f13">In this fourth scenario, we are testing the model with different number max of iterations, from 10 to 1000 by keeping all the other parameters unchanged — we are using a learning rate of 0,01.</p><figure id="c934"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*GoDKJH8-PV5-SqJ_OjzKYw.png"><figcaption>Figure 16: Evolution of thetas with 10 iterations</figcaption></figure><figure id="f219"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*ORT3pYC_XW-It-llDXMPAA.png"><figcaption>Figure 15: Evolution of thetas with 1000 iterations</figcaption></figure><figure id="6ff8"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*yKE0OgbD8wW0fiPtTdzbGg.png"><figcaption>Figure 17: R2 value for the number of iterations</figcaption></figure><p id="8df7">With a few iterations — 10,50- the thetas values don’t have time to converge, so our model is not fully trained and that leads to bad and not accurate results. Whereas, more iteration helps the model to have thetas that converge to their real value and improve the accuracy of our model. Is it essential to note that more iterations lead to more computational time. A combination of a high number of iterations mixed with a small learning rate improves the accuracy of the model but improves the computational time by a lot.</p><h2 id="88e8">DataLoans.csv</h2><p id="8ef4">For DataLoans we will use the Loan.Amount and FICO.Score predictors parameters.</p><h2 id="5c21">Cleaning the data</h2><p id="687b">As seen in our analysis part the DataLoans.csv dataset has huge outliers and extreme value in the Loan.Amount column. In this scenario, we will train our model one time with the raw value of DataLoans and one time with the cleaned value — where we have removed the 3 extreme values of Loan.Amount — to assess if these values had a positive or negative impact on the model.</p><figure id="82db"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*qZRTay6D_AX6RQfQwTj_ug.png"><figcaption></figcaption></figure><figure id="5a84"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*b1G_wXKZkjFZodruk3RNmg.png"><figcaption>Figure 18: DataLoans model without cleaning the data</figcaption></figure><figure id="3794"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*srNWO1lRDlTLweHGK1BhZw.png"><figcaption></figcaption></figure><figure id="da10"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*7znNo7v9N1KGp4cMf0NOVA.png"><figcaption>Figure 19: DataLoans model with cleaning the data</figcaption></figure><p id="cb7c">Finally, we can conclude that these extreme values slightly affect the model, we have a small increase of R2 from 0.0004 to 0.004. But in both cases the models don’t have a high R2 value, it might be an underfitting problem, where the model doesn’t find the correlation between the predictors and the y value.</p><h2 id="d548">Create custom predictors parameters</h2><p id="3aa5">In this last part, we will try to create additional custom predictor parameters to increase the accuracy of our model. We will create three additional X:

  • Loan.Amount * Loan.Amount
  • FICO.Score * FICO.Score
  • Loan.Amount * FICO.Score</p><figure id="c84c"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*ejVjqn1koow6sWsNDbmwjA.png"><figcaption>Figure 20: Effect of custom parameters</figcaption></figure><p id="16b2">Finally, even with additional custom parameter R2 doesn’t have an increase and stay around 0. Our model lack accuracy and suffers from underfitting problems.</p><h1 id="2684">Conclusion</h1><p id="9b7c">In conclusion, we managed to obtain several Multilinear Regressions by different methods of computation that are coherent with each other with respect to both the data set. We also found that more specific calculations for complex relationships are computed through multiple linear regression as it is often better and using a gradient descent algorithm in order to minimize the cost function and update the parameters of the learning model as it is computationally faster.</p><p id="a16d">While working we shouldn’t assume all parameters must be used at the same time and it’s important to take the time to understand and choose the right parameters, and predictors to increase the precision of our model. Underfitting and overfitting problems are always important to have in mind and be careful when training our model, that’s also why the testing and sometimes validation set is important</p><p id="d9e3">We will in the future investigate several other machine learning algorithms which might be an interesting alternative way for a more complex model that required more parameters and accuracy.</p><p id="b486">Find more of the code <a href="https://github.com/sferez/Gradient_Descent">here</a>.</p><h1 id="f81b">References</h1><p id="b0ad">[1] Evangelos C. Alexopoulos (2010), Multivariate Regression Analysis, Introduction to Multivariate Regression Analysis pdf [2] O’Reilly, Multiple Regression Analysis 4th Edition, Chapter 3. [3] Bargiela, Nakashima, Pedrycz (2005), Iterative gradient descent approach to multiple regression, Nottingham Trend University. [4] Dr. Stuart Barnes (2022), Advanced Java and Advance Python, Cranfield University. [5] Multiple Regression (2017), CenterStat, youtube video, <a href="https://www.youtube.com/watch?v=Qyw2_3cVpa0&amp;list=PLQGe6zcSJT0V4xC1NDyQePkyxUj8LWLnD&amp;index=5">https://www.youtube.com/watch?v=Qyw2_3cVpa0&amp;list=PLQGe6zcSJT0V4xC1NDyQePkyxUj8LWLnD&amp;index=5</a> [6] Multiple Linear Regression (2018), Kaggle, <a href="https://www.kaggle.com/code/rakend/multiple-linear-regression-with-gradient-descent">https://www.kaggle.com/code/rakend/multiple-linear-regression-with-gradient-descent</a> [7] Gradient Descent Algorithm (2020), youtube video, <a href="https://www.youtube.com/watch?v=xrPZbHrxrWo">https://www.youtube.com/watch?v=xrPZbHrxrWo</a></p><h1 id="8e64">Learn more</h1><h2 id="4744">How to Develop an Arbitrage Betting Bot Using Python</h2><figure id="b0fe"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*FOsdA14bD5h1qPo-.png"><figcaption></figcaption></figure><h2 id="78a5">How to Set up and Use Binance API with Python</h2><figure id="7965"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*VVjGvpSZ_AhcfUwA.png"><figcaption></figcaption></figure><p id="e0d5">© All rights reserved, December 2022, Siméon FEREZ</p><p id="3bbb"><i>More content at <a href="https://plainenglish.io/"><b>PlainEnglish.io</b></a>.</i></p><p id="e5de"><i>Sign up for our <a href="http://newsletter.plainenglish.io/"><b>free weekly newsletter</b></a>. Follow us on <a href="https://twitter.com/inPlainEngHQ"><b>Twitter</b></a></i>, <a href="https://www.linkedin.com/company/inplainenglish/"><b><i>LinkedIn</i></b></a><i>, <a href="https://www.youtube.com/channel/UCtipWUghju290NWcn8jhyAw"><b>YouTube</b></a>, and <a href="https://discord.gg/GtDtUAvyhW"><b>Discord</b></a><b>.</b></i></p><p id="6028"><b><i>Interested in scaling your software startup</i></b><i>? Check out <a href="https://circuit.ooo?utm=publication-post-cta"><b>Circuit</b></a>.</i></p></article></body>

Multiple Linear Regression, Gradient Descent /w Python

Multiple linear regression is a technique that uses several independent variables in order to predict the outcome of a dependent variable. It is recommended to use it more than simple linear regression as it provides more specific calculation and captures the complex relationship followed by planning, controlling, and predicting the relations which helps us get more information to estimate the dependent variable and also enables us to fit curves and lines. In our day-to-day life, linear regression and especially multiple linear regression are used in financial forecasting, software cost and effort prediction, geographic image processing, and health care.

If you haven’t yet read my first article about linear regression, I invite you to read it now as this article will extend my first one and will allow you to have a deeper understanding of the following results. Read the article here.

During this project, the major aims are to:

  • Analyze the datasets and select the best predictors.
  • Implement the gradient descent algorithm.
  • Test our implementation for a varying number of iteration steps, learning rate, train/test ratio to estimate their effect on the model
  • Evaluation for each scenario of the goodness of fit and having a better understanding of the behaviors of multiple linear regression

Literature review

Gradient Descent is an optimization algorithm used for minimizing the cost function in various machine learning algorithms that are used to update the parameters of the learning model. The basic difference between simple linear regression (SLR) and Gradient Descent is with respect to the iteration process. It finds the linear model parameters iteratively. The cost function implies the error or how far the predicted line is from the actual points we were given [1].

Although there is an alternative method to minimize the cost function known as Closed Form Solution. Since Gradient descent is a computationally faster option to find the solution it is used to minimize cost function by repetitively moving in the direction of the steepest descent.

Methodology

The gradient descent algorithm aims to minimize the cost function. For instance, if we have ’n’ number of variables, we have to set the hypothesis (x) which is a function of θ and x. [2]

Equation 1

We will use and transform equation 1 into matrices to be able to compute it using Python.

Each data point is now a(n+1) dimensional vector. Here, xθ = 1 and we initialize the variables from 0. We will set up the hypotheses and initialize the parameters in order to find out the cost function.

Cost function: The generic formula for cost function is [3]:

Equation 2
Equation 3

(x(i)) is the hypothesis, y(i) = corresponding value, and m = Total number of training samples.

For gradient descent we need to update the θ value by using:

This has to be repeated in a loop until we find the best result, or we get a stopping point using the number of iterations. Hence, the cost function when we do the partial derivation from equation [3] it simplifies into [4]:

Equation 4

For computing the multi-linear regression, python is being used with several libraries such as numpy , pandas , seaborn , matplotlib, and sklearn . The code program is organized between several sub-functions, each doing a specific task — reading the csv file and plotting graphs. We have created the main class called MLR_gradient() which will allow us to create for each simulation and model an instance of this class. The code is built in that manner to allow easy maintenance and efficient modification if needed.

The program is being run inside a jupyter notebook, allowing us to experiment with a small part of the code without rerunning the whole project.

After initializing the variable for the different gradient descent, we split the data between a train and test set, followed by normalizing the data. Finally, we apply the gradient algorithm to find the thetas parameters. In our program we will be using matrixes — it’s where the numpy package is important — which is much faster than python native loop.

Analysis

Analyzing the correlation of the dataset to estimate and have a first understanding of the predictor parameters we might want to use and how they will affect our model.

DataLoans.csv

Figure 1: Pairplot DataLoans.csv
Figure 2: BoxPlot DataLoans.csv
Figure 3: Correlation matrix DataLoans.csv

In order to analyze the DataFrame, three alternative methods of have been introduced to find out the shape and correlation of the data. A pairplot is used to understand the best set of features to explain the relationship between variables or to form the most separated clusters. The heatmap correlation matrix is a method to display the correlation among our variables. It is represented as a color palette choice for building an effective map. The lighter shade is always used to present the higher values, whereas the lower values are darker.

From figure 1 we can directly see visually that Interest. Rate doesn’t have a correlated predictor parameter, which is also translated from the correlation matrix of Figure 3. The distribution seems random, and we cannot predict which parameter will be more correlated and works best for our model prediction.

From figure 1 we can already see the type of data that we will deal with, for example, Montly.Income seems like 2 discrete variables with a precise distribution. We can also see that in Loan. In Amount we have 3 outliers and extreme values. It will be interesting to see if we want to keep these values in our model prediction, or if we decided to clean the data for a more homogenous dataset to train our model. This visualization is confirmed in the boxplot from figure 2.

DataEnergy.csv

Figure 4: Pairplot DataEnergy.csv
Figure 5: Boxplot DataEnergy.csv
Figure 6: Correlation matrix DataEnergy.csv

Similarly, from the pairplot of Figure 4, we can already just with this visualization of the data that PE is inversely correlated with AT and V. This is confirmed by the correlation matrix of Figure 6. This will probably mean that the coefficient of AT and V will be bigger than the coefficients from AP and RH.

From the Pairplot Figure 4 and the Boxplot Figure 5, we can see that the dataset is clean and doesn’t have outliers’ values or extrema that can affect the model. In this case, we will not need to process and clean the data before using them.

We can also use for both datasets the correlation matrix to also see the correlation between two predictors X. It is also important to not train the model with too similar predictors because it could lead to issues in the prediction of the model. Because the variable will be taken as too similar, and the model will not be able to tell which one affects the output.

The selection of the training is primordial because it will directly be linked to the underfitting or overfitting problems of our model. If some parameters were in type that cannot be read and understood by the model — such as string values — we will need to transform these parameters to numerical values. This is not the case in our datasets.

Results

We will analyze two different datasets containing different predictor parameters: dataEnergy.csv and DataLoans.csv. During our analysis it’s important to pay attention to the feature correlation — be aware of the correlation between the predictor’s parameter — the data choice — we shouldn’t assume that all columns of the dataframe may be used, sometimes the model will work better without certain columns — the physical interpretation — It might be interesting to create new association and columns if necessary to yield to a better result.

Finally, for the validation of our model, we will test our result on an “unfamiliar” dataset, it’s why we are splitting the data between a train and test dataset. The testing may avoid some issues such as: — Overfitting: often comes when we are not using a test dataset to validate the model. The model will predict very good results while training, but the accuracy with a new set of data may be bad. — Underfitting: The model doesn’t understand the change in the predictor and fails to predict the correct output, it is essentially the contrary of overfitting.

DataEnergy.csv

Effect of predictors parameters

In our first scenario, we will analyze the effect of training our model with different parameters, we will create several multiple regressions with different predictor parameters. We will start by training our model with the most correlated predictor (AT) and then adding the other predictors (V, AP, RH) one after the other.

Figure 7: DataEnergy model trained with [“AT”]
Figure 8: DataEnergy model trained with [“AT”,”V”]
Figure 9: DataEnergy model trained with [“AT”,”V”,”AP”]
Figure 10: DataEnergy model trained with [“AT”,”V”,”AP”,”RH”]

First model: EP = θ_0+ θ_1*AT R2 = 89,95 Second model: EP = θ_0+ θ_1*AT+ θ_2*V R2 = 91,44 Third model: EP = θ_0+ θ_1*AT+ θ_2*V + θ_3*AP R2 = 91,32 Fourth model: EP = θ_0+ θ_1*AT+ θ_3*AP+ θ_4*RH R2 = 92,76

Figure 11: DataEnergy Effect of parameters

In this first experience, we have seen that just by using the most correlated predictor, AT we can already have a good accuracy and an R2 of 89,95 which is correct. But adding the others predictor even if at first sight they seem uncorrelated to y — like with the distribution of EP or RH seen in figure 4 — these coefficients will improve our model and raise R2 until 92,76 which starts to be a good and accurate model. We also see that the prediction of the model has less noise and has a better fit when adding more parameters — As shown in Figures 7,8,9,10 on the left graph. However, when training a model, we need to be careful to not add too many or unnecessary parameters that will lead to overfitting issues in our model.

Effect of learning rate

Our second scenario will focus on the effect of the learning rate along models, we will try different learning rates across the same model and analyze their effect on the model, and the values will start from 0.1 to 0.0001. We will finish the loop when the change in thetas converges around 10e-6 to minimize the unnecessary computational time.

Figure 12: DataEnergy Effect of learning rate
Figure 13: DataEnergy Iteration before convergence

With a bigger learning rate, the value of thetas will change quicker and tend to converge faster too. These lead to less computational time but slightly less precise results in the end because the step is larger than with a small learning rate. It’s important to adapt the learning rate to each specific problem and the type of model you want to build. Depending on the security factor and the precision you expect at the end — depending also on the field of application (Motorsport, Space exploration, medical application).

Effect of train/test ratio

In this third scenario, we will test our model with different train/test ratios, from 0,4 to 0,95.

Figure 14: DataEnergy, R2 depending on train/test ratio

As shown in Figure 14, in our case of the model, the train/test ratio doesn’t affect the R2 and accuracy of the model. It doesn’t mean that the testing dataset is not important, the testing and in other applications, we can also have a validation dataset primordial to avoid and detect overfitting problems in our model.

Effect of iterations

In this fourth scenario, we are testing the model with different number max of iterations, from 10 to 1000 by keeping all the other parameters unchanged — we are using a learning rate of 0,01.

Figure 16: Evolution of thetas with 10 iterations
Figure 15: Evolution of thetas with 1000 iterations
Figure 17: R2 value for the number of iterations

With a few iterations — 10,50- the thetas values don’t have time to converge, so our model is not fully trained and that leads to bad and not accurate results. Whereas, more iteration helps the model to have thetas that converge to their real value and improve the accuracy of our model. Is it essential to note that more iterations lead to more computational time. A combination of a high number of iterations mixed with a small learning rate improves the accuracy of the model but improves the computational time by a lot.

DataLoans.csv

For DataLoans we will use the Loan.Amount and FICO.Score predictors parameters.

Cleaning the data

As seen in our analysis part the DataLoans.csv dataset has huge outliers and extreme value in the Loan.Amount column. In this scenario, we will train our model one time with the raw value of DataLoans and one time with the cleaned value — where we have removed the 3 extreme values of Loan.Amount — to assess if these values had a positive or negative impact on the model.

Figure 18: DataLoans model without cleaning the data
Figure 19: DataLoans model with cleaning the data

Finally, we can conclude that these extreme values slightly affect the model, we have a small increase of R2 from 0.0004 to 0.004. But in both cases the models don’t have a high R2 value, it might be an underfitting problem, where the model doesn’t find the correlation between the predictors and the y value.

Create custom predictors parameters

In this last part, we will try to create additional custom predictor parameters to increase the accuracy of our model. We will create three additional X: - Loan.Amount * Loan.Amount - FICO.Score * FICO.Score - Loan.Amount * FICO.Score

Figure 20: Effect of custom parameters

Finally, even with additional custom parameter R2 doesn’t have an increase and stay around 0. Our model lack accuracy and suffers from underfitting problems.

Conclusion

In conclusion, we managed to obtain several Multilinear Regressions by different methods of computation that are coherent with each other with respect to both the data set. We also found that more specific calculations for complex relationships are computed through multiple linear regression as it is often better and using a gradient descent algorithm in order to minimize the cost function and update the parameters of the learning model as it is computationally faster.

While working we shouldn’t assume all parameters must be used at the same time and it’s important to take the time to understand and choose the right parameters, and predictors to increase the precision of our model. Underfitting and overfitting problems are always important to have in mind and be careful when training our model, that’s also why the testing and sometimes validation set is important

We will in the future investigate several other machine learning algorithms which might be an interesting alternative way for a more complex model that required more parameters and accuracy.

Find more of the code here.

References

[1] Evangelos C. Alexopoulos (2010), Multivariate Regression Analysis, Introduction to Multivariate Regression Analysis pdf [2] O’Reilly, Multiple Regression Analysis 4th Edition, Chapter 3. [3] Bargiela, Nakashima, Pedrycz (2005), Iterative gradient descent approach to multiple regression, Nottingham Trend University. [4] Dr. Stuart Barnes (2022), Advanced Java and Advance Python, Cranfield University. [5] Multiple Regression (2017), CenterStat, youtube video, https://www.youtube.com/watch?v=Qyw2_3cVpa0&list=PLQGe6zcSJT0V4xC1NDyQePkyxUj8LWLnD&index=5 [6] Multiple Linear Regression (2018), Kaggle, https://www.kaggle.com/code/rakend/multiple-linear-regression-with-gradient-descent [7] Gradient Descent Algorithm (2020), youtube video, https://www.youtube.com/watch?v=xrPZbHrxrWo

Learn more

How to Develop an Arbitrage Betting Bot Using Python

How to Set up and Use Binance API with Python

© All rights reserved, December 2022, Siméon FEREZ

More content at PlainEnglish.io.

Sign up for our free weekly newsletter. Follow us on Twitter, LinkedIn, YouTube, and Discord.

Interested in scaling your software startup? Check out Circuit.

Python
Data Science
Linear Regression
Programming
Technology
Recommended from ReadMedium