Free AI web copilot to create summaries, insights and extended knowledge, download it at here

11692

Abstract

</a>, ‘I’ stands for infant abalones. It represents that the abalones are in the early stage of development and have not yet reached sexual maturity. Let’s check its frequency distribution and consider excluding rows corresponding to infant abalones in cases where there are only a few records of them.<div id="3fc0"><pre># get the distribution of abalones by Sex fig, ax = plt.subplots(figsize=(5,3)) ax.bar(X['Sex'].value_counts().index, X['Sex'].value_counts()) ax.set_xticklabels(['Male', 'Infant', 'Female']) ax.set_ylabel('Count') ax.set_title('Sex Distribution of Abalones')</pre></div><figure id="328a"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*ZNROj7552aO6sXDI3BV1Pw.png"><figcaption>Sex distribution of abalones</figcaption></figure>It shows that there is a substantial number of male, infant and female abalones, with a relatively even distribution among the three categories. We will keep the infant abalones’ records and one-hot encode the <code>Sex</code> column.<div id="a68e"><pre># one-hot encode Sex column abalone_sex = pd.get_dummies(X['Sex']) abalone_sex = abalone_sex.drop(columns='I') abalone_sex = abalone_sex.astype(int) abalone_sex.columns = ['IsFemale', 'IsMale'] X = pd.concat([abalone_sex, X], axis=1)

# add intercept column X['Intercept'] = 1 X = X[['Intercept'] + [col for col in X.columns if col not in ['Intercept', 'Sex']]] X.head()

## | Intercept | IsFemale | IsMale | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | ## -------|-----------|----------|--------|--------|----------|--------|--------------|-----------------|----------------|--------------| ## 0 | 1 | 0 | 1 | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.1500 | ## 1 | 1 | 0 | 1 | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.0700 | ## 2 | 1 | 1 | 0 | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.2100 | ## 3 | 1 | 0 | 1 | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.1550 | ## 4 | 1 | 0 | 0 | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.0550 |</pre></div>We also added the intercept column which has all entries equal to 1. Our dataset now is ready. Let’s see how to perform Bayesian linear regression.<h2 id="9e65">5.2. Estimate posterior distribution of θ and σ² with Gibbs</h2>I created a class <code>BayesianLinearRegression</code> that contains several useful methods for fitting Bayesian linear model and diagnosing the MCMC samples.<div id="40be"><pre>class BayesianLinearRegression:

<span class="hljs-keyword">def</span> <span class="hljs-title function_">__init__</span>(<span class="hljs-params">self, nIter=<span class="hljs-number">2</span>**<span class="hljs-number">12</span>, pBurn=<span class="hljs-literal">None</span></span>):
    self.nIter = nIter
    self.pBurn = pBurn <span class="hljs-keyword">if</span> pBurn <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">else</span> <span class="hljs-number">0.5</span>
    
<span class="hljs-keyword">def</span> <span class="hljs-title function_">fit</span>(<span class="hljs-params">self, y, X, theta0=<span class="hljs-literal">None</span>, Sigma=<span class="hljs-literal">None</span>, alpha=<span class="hljs-literal">None</span>, beta=<span class="hljs-literal">None</span></span>):
    
    self.colnames = X.columns
    
    <span class="hljs-keyword">if</span> <span class="hljs-built_in">isinstance</span>(y, pd.Series):
        y = y.to_numpy()            
    <span class="hljs-keyword">if</span> <span class="hljs-built_in">isinstance</span>(X, pd.DataFrame):
        X = X.to_numpy()
    
    self.n, self.p = X.shape        
    self.theta0 = theta0 <span class="hljs-keyword">if</span> theta0 <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">else</span> np.zeros(self.p)
    self.Sigma = Sigma <span class="hljs-keyword">if</span> Sigma <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">else</span> np.diag(np.ones(self.p))
    self.alpha = alpha <span class="hljs-keyword">if</span> alpha <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">else</span> <span class="hljs-number">1</span>
    self.beta = beta <span class="hljs-keyword">if</span> beta <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">else</span> <span class="hljs-number">1</span>        
    theta = np.empty((self.nIter+<span class="hljs-number">1</span>, self.p))
    sigma2 = np.empty(self.nIter+<span class="hljs-number">1</span>)        
    theta[<span class="hljs-number">0</span>] = np.random.normal(size=self.p)
    sigma2[<span class="hljs-number">0</span>] = np.random.exponential(size=<span class="hljs-number">1</span>)
    
    <span class="hljs-keyword">for</span> iIter <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-number">1</span>, self.nIter+<span class="hljs-number">1</span>, <span class="hljs-number">1</span>):
        <span class="hljs-comment"># update theta</span>
        inv_Sigma = inv(self.Sigma)
        inv_comp1 = inv(np.matmul(X.T, X)/sigma2[iIter-<span class="hljs-number">1</span>] + inv_Sigma)
        comp2 = np.matmul(X.T, y)/sigma2[iIter-<span class="hljs-number">1</span>] + np.matmul(inv_Sigma, self.theta0)
        theta[iIter] = np.random.multivariate_normal(np.matmul(inv_comp1, comp2), inv_comp1, <span class="hljs-number">1</span>)         
        <span class="hljs-comment"># update sigma2</span>
        e = y - np.matmul(X, theta[iIter])
        alpha_n = self.alpha + self.n/<span class="hljs-number">2</span>
        beta_n = self.beta + np.matmul(e.T, e)/<span class="hljs-number">2</span>
        sigma2[iIter] = <span class="hljs-number">1</span>/np.random.gamma(alpha_n, <span class="hljs-number">1</span>/beta_n, <span class="hljs-number">1</span>)
    
    self.thetaAll = pd.DataFrame(theta, columns=self.colnames)
    self.sigma2All = pd.DataFrame(sigma2, columns=[<span class="hljs-string">'sigma2'</span>])
    self.thetaBurn, self.theta = self._burn(self.thetaAll)
    self.sigma2Burn, self.sigma2 = self._burn(self.sigma2All)
    
<span class="hljs-keyword">def</span> <span class="hljs-title function_">_burn</span>(<span class="hljs-params">self, theta</span>):
    self.nBurn = <span class="hljs-built_in">int</span>(self.nIter * self.pBurn)
    <span class="hljs-keyword">return</span> theta.iloc[:self.nBurn,:], theta.iloc[self.nBurn:,:]
    
<span class="hljs-keyword">def</span> <span class="hljs-title function_">get_credible_interval</span>(<span class="hljs-params">self</span>):
    index_labels = [<span class="hljs-string">'2.5%'</span>, <span class="hljs-string">'50%'</span>, <span class="hljs-string">'97.5%'</span>]
    theta_ci = np.quantile(self.theta, [<span class="hljs-number">0.025</span>, <span class="hljs-number">0.5</span>, <span class="hljs-number">0.975</span>], axis=<span class="hljs-number">0</span>)
    sigma2_ci = np.quantile(self.sigma2, [<span class="hljs-number">0.025</span>, <span class="hljs-number">0.5</span>, <span class="hljs-number">0.975</span>]).reshape((<span class="hljs-number">3</span>,<span class="hljs-number">1</span>))
    self.theta_ci = pd.DataFrame(theta_ci, columns=self.colnames, index=index_labels)
    self.sigma2_ci = pd.DataFrame(sigma2_ci, columns=[<span class="hljs-string">'sigma2'</span>], index=index_labels)
    
<span class="hljs-keyword">def</span> <span class="hljs-title function_">plot_posterior</span>(<span class="hljs-params">self, variables=<span class="hljs-string">'all'</span></span>):
    
    self.get_credible_interval()
    
    <span class="hljs-keyword">if</span> variables==<span class="hljs-string">'all'</span>:
        n_vars = self.p
        variables = self.colnames

    n_vars = <span class="hljs-built_in">len</span>(variables) + <span class="hljs-number">1</span>
    n_fcols = <span class="hljs-built_in">int</span>(np.ceil(np.sqrt(n_vars)))
    n_frows = <span class="hljs-built_in">int</span>(np.ceil(n_vars/n_fcols))
    fig, axs = plt.subplots(n_frows, n_fcols, figsize=(<span class="hljs-number">4</span>*n_fcols, <span class="hljs-number">4</span>*n_frows), squeeze=<span class="hljs-literal">False</span>)        
    self.sigma2.plot(kind=<span class="hljs-string">'kde'</span>, ax=axs[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>], label=<span class="hljs-string">r'$f(\sigma^2\mid\mathrm{y},\mathrm{X})$'</span>)
    axs[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].axvline(self.sigma2_ci.loc[<span class="hljs-string">'50%'</span>, <span class="hljs-string">'sigma2'</span>], color=<span class="hljs-string">'red'</span>, linestyle=<span class="hljs-string">'dashed'</span>, linewidth=<span class="hljs-number">2</span>, label=<span class="hljs-string">r'$\widehat{\sigma}^2$'</span>)
    axs[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].set_title(<span class="hljs-string">r'$\sigma^2$'</span>)
    axs[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].legend(loc=<span class="hljs-string">'upper right'</span>)
    
    i, j = <span class="hljs-number">0</span>, <span class="hljs-number">1</span>
    <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> variables:
        self.theta[col].plot(kind=<span class="hljs-string">'kde'</span>, ax=axs[i,j], label=<span class="hljs-string">r'$f(\theta\mid\mathrm{y},\mathrm{X})$'</span>)

Options

  axs[i,j].axvline(self.theta_ci.loc[<span class="hljs-string">'50%'</span>, col], color=<span class="hljs-string">'red'</span>, linestyle=<span class="hljs-string">'dashed'</span>, linewidth=<span class="hljs-number">2</span>, label=<span class="hljs-string">r'$\widehat{\theta}$'</span>)
        axs[i,j].set_title(col)
        axs[i,j].legend(loc=<span class="hljs-string">'upper right'</span>)
        j += <span class="hljs-number">1</span>
        <span class="hljs-keyword">if</span> j % n_fcols == <span class="hljs-number">0</span>:
            j = <span class="hljs-number">0</span>
            i += <span class="hljs-number">1</span>

<span class="hljs-keyword">def</span> <span class="hljs-title function_">trace_plot</span>(<span class="hljs-params">self, first_n=<span class="hljs-number">200</span></span>):
    
    fig, axs = plt.subplots(<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, squeeze=<span class="hljs-literal">False</span>, figsize=(<span class="hljs-number">5</span>*<span class="hljs-number">2</span>, <span class="hljs-number">3</span>)) 
 
    <span class="hljs-comment"># First n=200 samples</span>
    thetaFirst200 = self.thetaBurn.iloc[<span class="hljs-number">0</span>:first_n,:]
    sigma2First200 = self.sigma2Burn.iloc[<span class="hljs-number">0</span>:first_n,:]
    axs[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].plot(sigma2First200.index, sigma2First200[<span class="hljs-string">'sigma2'</span>], label=<span class="hljs-string">r'$\sigma^2$'</span>, linestyle=<span class="hljs-string">'-'</span>)
    <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> thetaFirst200.columns:
        axs[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].plot(thetaFirst200.index, thetaFirst200[col], label=col, linestyle=<span class="hljs-string">'-'</span>)
    axs[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].set_xlabel(<span class="hljs-string">'Index'</span>)
    axs[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].set_ylabel(<span class="hljs-string">'Value'</span>)
    axs[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].set_title(<span class="hljs-string">'Trace plot of '</span> + <span class="hljs-string">r'$\theta_{1:%s}$'</span> % first_n)
    axs[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].grid(<span class="hljs-literal">True</span>)
    
    <span class="hljs-comment"># Latter part of the Markov-chain</span>
    axs[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>].plot(self.sigma2.index, self.sigma2[<span class="hljs-string">'sigma2'</span>], label=<span class="hljs-string">r'$\sigma^2$'</span>, linestyle=<span class="hljs-string">'-'</span>)
    <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> self.theta.columns:
        axs[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>].plot(self.theta.index, self.theta[col], label=col, linestyle=<span class="hljs-string">'-'</span>)
    axs[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>].set_xlabel(<span class="hljs-string">'Index'</span>)
    axs[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>].set_ylabel(<span class="hljs-string">'Value'</span>)
    axs[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>].set_title(<span class="hljs-string">'Trace plot of $\\theta_{{{}:{}}}$'</span>.<span class="hljs-built_in">format</span>(self.theta.index[<span class="hljs-number">0</span>], self.theta.index[-<span class="hljs-number">1</span>]))
    axs[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>].legend(bbox_to_anchor=(<span class="hljs-number">1</span>, <span class="hljs-number">0.5</span>), loc=<span class="hljs-string">'center left'</span>)
    axs[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>].grid(<span class="hljs-literal">True</span>)
    
<span class="hljs-keyword">def</span> <span class="hljs-title function_">acf_plot</span>(<span class="hljs-params">self</span>):

    n_vars = self.p + <span class="hljs-number">1</span>
    n_fcols = <span class="hljs-built_in">int</span>(np.ceil(np.sqrt(n_vars)))
    n_frows = <span class="hljs-built_in">int</span>(np.ceil(n_vars/n_fcols))
    fig, axs = plt.subplots(n_frows, n_fcols, figsize=(<span class="hljs-number">4</span>*n_fcols, <span class="hljs-number">4</span>*n_frows), squeeze=<span class="hljs-literal">False</span>)
    plot_acf(self.sigma2All[<span class="hljs-string">'sigma2'</span>], lags=<span class="hljs-number">20</span>, ax=axs[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>])
    axs[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].set_title(<span class="hljs-string">r'$\sigma^2$'</span>)
    axs[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].set_ylim([-<span class="hljs-number">0.1</span>, <span class="hljs-number">1.1</span>])
    
    i, j = <span class="hljs-number">0</span>, <span class="hljs-number">1</span>
    <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> self.thetaAll.columns:
        plot_acf(self.thetaAll[col], lags=<span class="hljs-number">20</span>, ax=axs[i,j])
        axs[i,j].set_title(col)
        axs[i,j].set_ylim([-<span class="hljs-number">0.1</span>, <span class="hljs-number">1.1</span>])
        j += <span class="hljs-number">1</span>
        <span class="hljs-keyword">if</span> j % n_fcols == <span class="hljs-number">0</span>:
            j = <span class="hljs-number">0</span>
            i += <span class="hljs-number">1</span></pre></div><p id="8dd9">Methods of <code>BayesianLinearRegression</code> :</p><ol><li><code>fit</code>: fit Bayesian linear model to the dataset</li><li><code>_burn</code>: burn-in Markov chain</li><li><code>get_credible_interval</code>: return 95% credible interval of <b><i>θ</i></b> and <i>σ</i>²</li><li><code>plot_posterior</code>: plot the kernel density estimator (KDE) for <b><i>θ</i></b> and <i>σ</i>²</li><li><code>trace_plot</code>: get the trace plot of first 200 Markov chain and the trace plot of the used samples</li><li><code>acf_plot</code>: plot the autocorrelation (ACF) for <b><i>θ</i></b> and <i>σ</i>²</li></ol><div id="fbdd"><pre>y = y[<span class="hljs-string">'Rings'</span>]

nIter = 2**14 pBurn = 0.5 model = BayesianLinearRegression(nIter, pBurn) model.fit(y, X)</pre></div>To get the Bayes estimator under L₁-loss and 95% credible interval for θ and σ²:<div id="0b06"><pre>model.get_credible_interval() print(model.theta_ci)

print(model.sigma2_ci)

## | sigma2 | ## ---------|----------| ## 2.5% | 4.7825 | ## 50% | 4.9989 | ## 97.5% | 5.2170 |</pre></div>For the classical linear regression, we can use <code>statsmodels</code> to get the OLS estimator.<div id="b40b"><pre>ols_result = sm.OLS(y, X).fit() print(ols_result.params)

## Intercept 3.069765 ## IsFemale 0.824876 ## IsMale 0.882592 ## Length -0.458335 ## Diameter 11.075103 ## Height 10.761537 ## Whole_weight 8.975445 ## Shucked_weight -19.786867 ## Viscera_weight -10.581827 ## Shell_weight 8.741806 ## dtype: float64</pre></div>It is clear that there are some differences between these two methods. Only the Bayes estimator and OLS estimator for <code>Intercept</code>, <code>IsFemale</code> and <code>IsMale</code> are consistent. For other variables, both methods give a slightly different results.<h2 id="9e8e">5.3. Diagnose Markov chain</h2>Use <code>plot_posterior</code> to check the shape for each θ and σ².<figure id="74b1"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*kgKDKkDUQf28MyriwonCXQ.png"><figcaption>plot of KDE for each θ and σ²; red dotted line represents the posterior median</figcaption></figure>The above graph shows the density estimate for each θ and σ². The blue curve is the KDE and the red dotted vertical line is the posterior median. We can see that the shape for each θ and σ² are also Gaussian.Use <code>trace_plot</code> to check the value of θ and σ² for each iteration.<figure id="8cfe"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*n9G6IQS7aB8mRXin0-yIhA.png"><figcaption>trace plot of first 200 Markov chain and the used samples</figcaption></figure>This is the trace plot of the Markov chain. We can see that the Markov chain converges to the target distribution very quickly, reaching convergence after only a few iterations.Use <code>acf_plot</code> to check the autocorrelation.<figure id="bb70"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*D20Lz-dy3ENSG_n_ckLekQ.png"><figcaption></figcaption></figure><h1 id="f4ea">6. Conclusion</h1>This article covers most of the essential knowledge to guide you how to write your own Bayesian linear regression from scratch in Python. It’s always good to code from scratch when you start learning a new model, as it ensures that you can completely understand the underlying concepts. In the future, I will introduce more Bayesian regression related topics such as BIC and LASSO. I will also explain the Gibbs samplers in details.If you’d like to stay connected, please follow me on Medium 📖.If you found this article useful, please don’t forget to give it a clap 👏.Feel free to share your thoughts in the comments — I love hearing from you!</article></body>

Bayesian Linear Regression from Scratch in Python: A Comprehensive Guide

Learn how to implement linear regression in Bayesian framework

(This article presume you have a basic knowledge on linear regression, Bayesian inference and MCMC. If you wanna know more about MCMC, please check my previous post shown below.)

The Must-Know Bayesian Inference Algorithm: Metropolis-Hastings

Learn what Metropolis-Hastings algorithm is.

medium.com

1. Introduction

Everyone who is interested in any data-related area must familiar with linear regression. It is the most fundamental and basic regression method that must be covered in every statistics and machine learning courses. Many advanced models are also developed based on linear regression such as generalised linear model (GLM) and polynomial regression.

Traditionally, ordinary least square (OLS) is used to estimate the linear regression coefficients. We can also use maximum likelihood estimates (MLE) when the distribution of the error terms is known. We refer this type of regression as ‘frequentist’ linear regression.

Bayesian linear regression, as its name suggests, is simply the linear regression but implemented under a Bayesian framework, which is a powerful way to integrate our prior knowledge and update beliefs based on observed data. The Bayesian approach gives the estimate of the posterior distribution for the regression coefficients. It provides a range of values for the coefficients, capturing uncertainty more comprehensively than a point estimate.

In this article, I will first have a quick recap on the concept of linear regression, then explain the Bayesian linear regression along with the derivation of its posterior distribution. Lastly, I will demonstrate how to use Gibbs sampler to implement Bayesian linear regression from scratch with Python.

2. Classical linear regression

Linear regression is a statistical method to model the relationship between a dependent variable and a series of independent variables by fitting a linear model to the observed data, where the best-fitting line can be obtained using OLS method. Suppose we have a dependent variable y ∈ℝⁿ and a data matrix X ∈ℝⁿᵖ, the linear model can be written as follow:

where θ is the regression coefficient and ε is the error term. OLS method estimates the regression coefficients by minimising the sum of squared error:

OLS minimise the loss function to obtain the estimate of θ

We can use calculus to obtain θ that minimise the loss function. The gradient of ℒ with respect to θ is

Setting ∇ℒ = 0 to obtain the OLS estimate of θ:

OLS estimator of linear regression coefficient

If we make the Gaussian assumption on ε, we can estimate θ using MLE, and it will yield the same result as OLS does. Furthermore, with the Gaussian assumption, we can derive the distribution of θ ‘hat’. This allows users to conduct hypothesis tests on the regression coefficients and access the uncertainty of the estimator. The distribution of θ ‘hat’ is

Note that the Gaussian assumption is not necessary to obtain the OLS estimate, but it facilitates statistical inference and hypothesis testing, and simplifies the mathematical derivation of certain properties of the estimators.

3. Bayesian Linear Regression

In Bayesian linear regression, we don’t treat the regression coefficient as a fixed parameter anymore. Instead, we regard it as a random variable, and our goal is to update its distribution based on the data (y, X). With the Gaussian assumption, the linear model can be written as:

multiple linear model in matrix notation

where θ and σ² is the model parameters. Just like in other Bayesian problems, we have the freedom to pick priors based on what we know. If our knowledge about θ and σ is a bit fuzzy, we can opt for an informative (or weakly-informative) prior. One handy choice is the conjugate prior, which yield a posterior distribution that mirrors the prior one.

conjugate priors for bayesian linear model

θ₀, Σ, α and β are called hyper-parameters. To estimate the joint posterior with Gibbs sampler, we compute the conditional posterior f(θ | σ², y, X) and f(σ² | θ, y, X) respectively. The conditional posterior of θ is

The posterior distribution of σ² is

Do they look tedious? Rewrite the posterior:

4. Gibbs sampler

The Gibbs sampler is one of the MCMC methods used to draw samples from a multivariate distribution. It is particularly useful when direct sampling from a joint distribution is challenging, but sampling from its conditional distributions is straightforward.

In our case, f(θ | σ², y, X) follows a multivariate normal distribution, and f(σ² | θ, y, X) follows an inverse-gamma distribution. These distributions are well-known, and direct sampling with them is really simple. Therefore, we will use the Gibbs sampler to estimate the posterior.

The algorithm of the Gibbs sampler is:

Initialise θ₀ and σ₀
For each iteration i, do

Sample θᵢ ~ f( . | σᵢ₋₁)
Sample σᵢ ~ f( . | θᵢ)

Notice that we use the most-updated values of θ and σ for sampling. We sample σᵢ from f(σ² | θᵢ, y, X) but not f(σ² | θᵢ₋₁, y, X) because we already have the value of θᵢ.

5. Coding Part

In this article, I will use abalone dataset to demonstrate how to perform Bayesian linear regression from scratch in Python (only some basic libraries like numpy, pandas and matplotlib will be used). I will also use statsmodels to compare the results between the frequentist approach and the bayesian approach.

5.1. Import libraries and clean abalone dataset

First we need to import the required libraries and load the dataset. The abalone dataset can be obtained at UCI Machine Learning Repository.

import numpy as np
from numpy.linalg import inv

import pandas as pd
import matplotlib.pyplot as plt

from ucimlrepo import fetch_ucirepo

# fetch dataset 
abalone = fetch_ucirepo(id=1) 
  
# data (as pandas dataframes) 
X = abalone.data.features 
y = abalone.data.targets 

# show first 5 rows
X.head()

##        | Sex | Length | Diameter | Height | Whole_weight | Shucked_weight  | Viscera_weight | Shell_weight |
## -------|-----|--------|----------|--------|--------------|-----------------|----------------|--------------|
##    0   |  M  | 0.455  |  0.365   | 0.095  |    0.5140    |     0.2245      |     0.1010     |    0.1500    |
##    1   |  M  | 0.350  |  0.265   | 0.090  |    0.2255    |     0.0995      |     0.0485     |    0.0700    |
##    2   |  F  | 0.530  |  0.420   | 0.135  |    0.6770    |     0.2565      |     0.1415     |    0.2100    |
##    3   |  M  | 0.440  |  0.365   | 0.125  |    0.5160    |     0.2155      |     0.1140     |    0.1550    |
##    4   |  I  | 0.330  |  0.255   | 0.080  |    0.2050    |     0.0895      |     0.0395     |    0.0550    |

You may notice that there is an ‘I’ in the fifth row of Sex. According to the abalone’s variables table, ‘I’ stands for infant abalones. It represents that the abalones are in the early stage of development and have not yet reached sexual maturity. Let’s check its frequency distribution and consider excluding rows corresponding to infant abalones in cases where there are only a few records of them.

# get the distribution of abalones by Sex
fig, ax = plt.subplots(figsize=(5,3))
ax.bar(X['Sex'].value_counts().index, X['Sex'].value_counts())
ax.set_xticklabels(['Male', 'Infant', 'Female'])
ax.set_ylabel('Count')
ax.set_title('Sex Distribution of Abalones')

It shows that there is a substantial number of male, infant and female abalones, with a relatively even distribution among the three categories. We will keep the infant abalones’ records and one-hot encode the Sex column.

# one-hot encode Sex column
abalone_sex = pd.get_dummies(X['Sex'])
abalone_sex = abalone_sex.drop(columns='I')
abalone_sex = abalone_sex.astype(int)
abalone_sex.columns = ['IsFemale', 'IsMale']
X = pd.concat([abalone_sex, X], axis=1)

# add intercept column
X['Intercept'] = 1
X = X[['Intercept'] + [col for col in X.columns if col not in ['Intercept', 'Sex']]]
X.head()

##        | Intercept | IsFemale | IsMale | Length | Diameter | Height | Whole_weight | Shucked_weight  | Viscera_weight | Shell_weight |
## -------|-----------|----------|--------|--------|----------|--------|--------------|-----------------|----------------|--------------|
##    0   |     1     |    0     |   1    | 0.455  |  0.365   | 0.095  |    0.5140    |     0.2245      |     0.1010     |    0.1500    |
##    1   |     1     |    0     |   1    | 0.350  |  0.265   | 0.090  |    0.2255    |     0.0995      |     0.0485     |    0.0700    |
##    2   |     1     |    1     |   0    | 0.530  |  0.420   | 0.135  |    0.6770    |     0.2565      |     0.1415     |    0.2100    |
##    3   |     1     |    0     |   1    | 0.440  |  0.365   | 0.125  |    0.5160    |     0.2155      |     0.1140     |    0.1550    |
##    4   |     1     |    0     |   0    | 0.330  |  0.255   | 0.080  |    0.2050    |     0.0895      |     0.0395     |    0.0550    |

We also added the intercept column which has all entries equal to 1. Our dataset now is ready. Let’s see how to perform Bayesian linear regression.

5.2. Estimate posterior distribution of θ and σ² with Gibbs

I created a class BayesianLinearRegression that contains several useful methods for fitting Bayesian linear model and diagnosing the MCMC samples.

class BayesianLinearRegression:
    
    def __init__(self, nIter=2**12, pBurn=None):
        self.nIter = nIter
        self.pBurn = pBurn if pBurn is not None else 0.5
        
    def fit(self, y, X, theta0=None, Sigma=None, alpha=None, beta=None):
        
        self.colnames = X.columns
        
        if isinstance(y, pd.Series):
            y = y.to_numpy()            
        if isinstance(X, pd.DataFrame):
            X = X.to_numpy()
        
        self.n, self.p = X.shape        
        self.theta0 = theta0 if theta0 is not None else np.zeros(self.p)
        self.Sigma = Sigma if Sigma is not None else np.diag(np.ones(self.p))
        self.alpha = alpha if alpha is not None else 1
        self.beta = beta if beta is not None else 1        
        theta = np.empty((self.nIter+1, self.p))
        sigma2 = np.empty(self.nIter+1)        
        theta[0] = np.random.normal(size=self.p)
        sigma2[0] = np.random.exponential(size=1)
        
        for iIter in range(1, self.nIter+1, 1):
            # update theta
            inv_Sigma = inv(self.Sigma)
            inv_comp1 = inv(np.matmul(X.T, X)/sigma2[iIter-1] + inv_Sigma)
            comp2 = np.matmul(X.T, y)/sigma2[iIter-1] + np.matmul(inv_Sigma, self.theta0)
            theta[iIter] = np.random.multivariate_normal(np.matmul(inv_comp1, comp2), inv_comp1, 1)         
            # update sigma2
            e = y - np.matmul(X, theta[iIter])
            alpha_n = self.alpha + self.n/2
            beta_n = self.beta + np.matmul(e.T, e)/2
            sigma2[iIter] = 1/np.random.gamma(alpha_n, 1/beta_n, 1)
        
        self.thetaAll = pd.DataFrame(theta, columns=self.colnames)
        self.sigma2All = pd.DataFrame(sigma2, columns=['sigma2'])
        self.thetaBurn, self.theta = self._burn(self.thetaAll)
        self.sigma2Burn, self.sigma2 = self._burn(self.sigma2All)
        
    def _burn(self, theta):
        self.nBurn = int(self.nIter * self.pBurn)
        return theta.iloc[:self.nBurn,:], theta.iloc[self.nBurn:,:]
        
    def get_credible_interval(self):
        index_labels = ['2.5%', '50%', '97.5%']
        theta_ci = np.quantile(self.theta, [0.025, 0.5, 0.975], axis=0)
        sigma2_ci = np.quantile(self.sigma2, [0.025, 0.5, 0.975]).reshape((3,1))
        self.theta_ci = pd.DataFrame(theta_ci, columns=self.colnames, index=index_labels)
        self.sigma2_ci = pd.DataFrame(sigma2_ci, columns=['sigma2'], index=index_labels)
        
    def plot_posterior(self, variables='all'):
        
        self.get_credible_interval()
        
        if variables=='all':
            n_vars = self.p
            variables = self.colnames

        n_vars = len(variables) + 1
        n_fcols = int(np.ceil(np.sqrt(n_vars)))
        n_frows = int(np.ceil(n_vars/n_fcols))
        fig, axs = plt.subplots(n_frows, n_fcols, figsize=(4*n_fcols, 4*n_frows), squeeze=False)        
        self.sigma2.plot(kind='kde', ax=axs[0,0], label=r'$f(\sigma^2\mid\mathrm{y},\mathrm{X})$')
        axs[0,0].axvline(self.sigma2_ci.loc['50%', 'sigma2'], color='red', linestyle='dashed', linewidth=2, label=r'$\widehat{\sigma}^2$')
        axs[0,0].set_title(r'$\sigma^2$')
        axs[0,0].legend(loc='upper right')
        
        i, j = 0, 1
        for col in variables:
            self.theta[col].plot(kind='kde', ax=axs[i,j], label=r'$f(\theta\mid\mathrm{y},\mathrm{X})$')
            axs[i,j].axvline(self.theta_ci.loc['50%', col], color='red', linestyle='dashed', linewidth=2, label=r'$\widehat{\theta}$')
            axs[i,j].set_title(col)
            axs[i,j].legend(loc='upper right')
            j += 1
            if j % n_fcols == 0:
                j = 0
                i += 1
    
    def trace_plot(self, first_n=200):
        
        fig, axs = plt.subplots(1, 2, squeeze=False, figsize=(5*2, 3)) 
     
        # First n=200 samples
        thetaFirst200 = self.thetaBurn.iloc[0:first_n,:]
        sigma2First200 = self.sigma2Burn.iloc[0:first_n,:]
        axs[0,0].plot(sigma2First200.index, sigma2First200['sigma2'], label=r'$\sigma^2$', linestyle='-')
        for col in thetaFirst200.columns:
            axs[0,0].plot(thetaFirst200.index, thetaFirst200[col], label=col, linestyle='-')
        axs[0,0].set_xlabel('Index')
        axs[0,0].set_ylabel('Value')
        axs[0,0].set_title('Trace plot of ' + r'$\theta_{1:%s}$' % first_n)
        axs[0,0].grid(True)
        
        # Latter part of the Markov-chain
        axs[0,1].plot(self.sigma2.index, self.sigma2['sigma2'], label=r'$\sigma^2$', linestyle='-')
        for col in self.theta.columns:
            axs[0,1].plot(self.theta.index, self.theta[col], label=col, linestyle='-')
        axs[0,1].set_xlabel('Index')
        axs[0,1].set_ylabel('Value')
        axs[0,1].set_title('Trace plot of $\\theta_{{{}:{}}}$'.format(self.theta.index[0], self.theta.index[-1]))
        axs[0,1].legend(bbox_to_anchor=(1, 0.5), loc='center left')
        axs[0,1].grid(True)
        
    def acf_plot(self):

        n_vars = self.p + 1
        n_fcols = int(np.ceil(np.sqrt(n_vars)))
        n_frows = int(np.ceil(n_vars/n_fcols))
        fig, axs = plt.subplots(n_frows, n_fcols, figsize=(4*n_fcols, 4*n_frows), squeeze=False)
        plot_acf(self.sigma2All['sigma2'], lags=20, ax=axs[0,0])
        axs[0,0].set_title(r'$\sigma^2$')
        axs[0,0].set_ylim([-0.1, 1.1])
        
        i, j = 0, 1
        for col in self.thetaAll.columns:
            plot_acf(self.thetaAll[col], lags=20, ax=axs[i,j])
            axs[i,j].set_title(col)
            axs[i,j].set_ylim([-0.1, 1.1])
            j += 1
            if j % n_fcols == 0:
                j = 0
                i += 1

Methods of BayesianLinearRegression :

fit: fit Bayesian linear model to the dataset
_burn: burn-in Markov chain
get_credible_interval: return 95% credible interval of θ and σ²
plot_posterior: plot the kernel density estimator (KDE) for θ and σ²
trace_plot: get the trace plot of first 200 Markov chain and the trace plot of the used samples
acf_plot: plot the autocorrelation (ACF) for θ and σ²

y = y['Rings']
nIter = 2**14
pBurn = 0.5
model = BayesianLinearRegression(nIter, pBurn)
model.fit(y, X)

To get the Bayes estimator under L₁-loss and 95% credible interval for θ and σ²:

model.get_credible_interval()
print(model.theta_ci)

##          | Intercept | IsFemale | IsMale | Length | Diameter | Height | Whole_weight | Shucked_weight  | Viscera_weight  | Shell_weight |
## ---------|-----------|----------|--------|--------|----------|--------|--------------|-----------------|-----------------|--------------|
##   2.5%   | 3.2915    | 0.8010   | 0.7991 | 2.2293 | 3.3068   | 2.8143 | 3.9159       | -14.3075        | -4.6137         | 8.7121       |
##   50%    | 3.7318    | 0.9989   | 0.9850 | 3.6405 | 4.9242   | 4.4861 | 4.7185       | -13.2287        | -3.1100         | 10.0295      |
##   97.5%  | 4.1679    | 1.2009   | 1.1711 | 5.0367 | 6.5046   | 6.1138 | 5.5016       | -12.1086        | -1.5382         | 11.3662      |

print(model.sigma2_ci)

##          | sigma2   |
## ---------|----------|
##   2.5%   | 4.7825   |
##   50%    | 4.9989   |
##   97.5%  | 5.2170   |

For the classical linear regression, we can use statsmodels to get the OLS estimator.

ols_result = sm.OLS(y, X).fit()
print(ols_result.params)

## Intercept          3.069765
## IsFemale           0.824876
## IsMale             0.882592
## Length            -0.458335
## Diameter          11.075103
## Height            10.761537
## Whole_weight       8.975445
## Shucked_weight   -19.786867
## Viscera_weight   -10.581827
## Shell_weight       8.741806
## dtype: float64

It is clear that there are some differences between these two methods. Only the Bayes estimator and OLS estimator for Intercept, IsFemale and IsMale are consistent. For other variables, both methods give a slightly different results.

5.3. Diagnose Markov chain

Use plot_posterior to check the shape for each θ and σ².

plot of KDE for each *θ and σ*²; red dotted line represents the posterior median

The above graph shows the density estimate for each θ and σ². The blue curve is the KDE and the red dotted vertical line is the posterior median. We can see that the shape for each θ and σ² are also Gaussian.

Use trace_plot to check the value of θ and σ² for each iteration.

trace plot of first 200 Markov chain and the used samples

This is the trace plot of the Markov chain. We can see that the Markov chain converges to the target distribution very quickly, reaching convergence after only a few iterations.

Use acf_plot to check the autocorrelation.

6. Conclusion

This article covers most of the essential knowledge to guide you how to write your own Bayesian linear regression from scratch in Python. It’s always good to code from scratch when you start learning a new model, as it ensures that you can completely understand the underlying concepts. In the future, I will introduce more Bayesian regression related topics such as BIC and LASSO. I will also explain the Gibbs samplers in details.

If you’d like to stay connected, please follow me on Medium 📖.

If you found this article useful, please don’t forget to give it a clap 👏.

Feel free to share your thoughts in the comments — I love hearing from you!