avatarNaina Chaturvedi

Summary

The website content provides a comprehensive overview of the "30 days of Data Engineering Series," covering various machine learning algorithms, performance metrics, and practical code implementations, culminating in a final day focused on system design case studies and neural networks.

Abstract

The "30 days of Data Engineering Series" concludes with a detailed exploration of machine learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines, K nearest neighbors, K means clustering, hierarchical clustering, and neural networks. The content emphasizes the importance of understanding performance metrics to evaluate the effectiveness of machine learning models. It also includes complete code implementations for these algorithms using the Python programming language and the scikit-learn library. Additionally, the series touches on system design case studies, providing insights into the design of complex systems such as Instagram and Netflix. The final day encapsulates the series by discussing neural networks and their applications in data engineering and machine learning. The author encourages readers to engage with the material through comments, subscriptions, and by following the provided resources, including a YouTube channel and a newsletter for further learning and updates in the field.

Opinions

  • The author believes in the practical application of machine learning algorithms and provides code examples to solidify understanding.
  • There is a strong emphasis on the importance of clustering and classification techniques in data engineering and machine learning.
  • The series is designed to be sequential, with each day building upon the previous, suggesting a structured learning approach.
  • The author values the community aspect of learning, encouraging interaction and feedback through comments and subscriptions.
  • The inclusion of system design case studies indicates the author's opinion that system design is a critical skill for data engineers and machine learning professionals.
  • By providing a newsletter and a YouTube channel, the author shows a commitment to continuous learning and community engagement beyond the written series.

Day 30 of 30 days of Data Engineering Series with Projects

Pic credits : IBM

Welcome back peeps to Day 30 of Data Engineering Series with Projects!

In this we will cover —

Machine Learning Algorithms

Performance metrics

Linear Regression

Logistic Regression

Decision Trees

Random Forest

Support Vector Machines

K Nearest Neighbors

K means Clustering

Hierarchical Clustering

Neural Networks

Pre-requisite to Day 30 is to complete Day 1–29( link below):

Day 1 : What’s Data Engineering, Why Data Engineering, Data Engineers — ML Engineers — Data Scientists, Purpose and Scope

Day 2 : Complete Python for Data Engineering — Part 1

Day 3 : Complete Advanced Python for Data Engineering — Part 2

Day 4: Techniques to write efficient and Optimized Code

Day 5 : SQL

Day 6 : Advanced SQL

Day 7 : BigQuery and SQL vs NOSQL databases

Day 8 : Advanced Functions

Day 9 : Query Optimizations

Day 10 : MySQL and PostgreSQL

Day 11: Shell scripting and Linux “touch” command

Day 12 : Map Reduce, Data Warehouse, Data Lakes

Day 13: Pandas, Pandas, Data Cleaning and processing, Outlier Detection, Noisy Data, Missing Data, Pandas Functions, Aggregate Functions, Joins

Day 14 : Numpy

Day 15 : Advanced Pandas Techniques

Day 16 : Data Pre-processing, Handling missing values, Data Cleaning, Mean/mode/median Imputation, Hot Deck Imputation, Rescale Data, Binarize Data, Regression Imputation, Stochastic regression imputation, Feature Scaling

Day 17 : Data Augmentation, Read and Process Large Datasets

Day 18 : Data Visualization basics, Data Visualization Projects, Data Visualization using Plotly and Bokeh, Data Profiling, Summary Functions, Indexing, Grouping, Linear Regression, Multi Linear Regression, Polynomial Regression, Regression, Support Vector Regression, Decision Tree Regression, Random Forest Regression, Feature Engineering, GroupBy Features, Categorical and Numerical Features, Missing Value Analysis, Fill the missing Values, Unique Value Analysis, Univariate Analysis, Bivariate Analysis, Multivariate Analysis, Correlation Analysis, Spearman’s ρ, Pearson’s r, Kendall’s τ, Cramér’s V (φc), Phik (φk)

Day 19 : MySQL and PostgreSQL

Day 20 : ETL ( Extract, Tranform and Load) basics, Why ETL is important?, How ETL works, ETL Tools

Day 21 : Structured Data, Semi Structured Data, Unstructured Data, Data Warehouse, Data Mart, Data Lake

Day 22 :Big Data, Types of Big Data, Big data tools, SQL and NoSQL Databases, Hadoop, Hadoop HDFS, Hadoop Yarn

Day 23: Batch Processing, Stream Processing, Apache Spark, Apache Spark Commands, Apache Kafka, How Apache Kafka works

Day 24 : Hive, Zookeper, Pig, Cassandra, Sqoop

Day 25: Docker, Docker vs Virtual Machines, Most important Docker commands, Kubernetes, Snowflake

Day 26 : Data Pipelines, Transformation, Processing, Workflow, Monitoring, Airflow, DAG

Day 27 : Power BI, Which chart to use and When?, Power BI — Data Analysis Expressions, Joins, Data Profiling

Day 28 : REST API, Postman, Data API

Day 29 : Data Engineering on cloud, AWS, AWS Services, Google Cloud Platform, GCP services

Day 30 : Machine Learning Algorithms, Linear Regression, Logistic Regression, Decision Trees, Random Forest, Support Vector Machines, K Nearest Neighbors, K means Clustering, Hierarchical Clustering, Neural Networks

All the projects, data structures, algorithms, system design, Data Science and ML , Data Engineering, MLOps and Deep Learning videos will be published on our youtube channel ( just launched).

Subscribe today!

Tech Newsletter —

If you are interested, you can join my newsletter through which I send tech interview tips, techniques, patterns, hacks — Software Development, ML, Data Science, Startups and Technology projects to more than 30K readers. You can subscribe to Ignito:

System Design Case Studies — In Depth

Design Instagram

Design Netflix

Design Reddit

Design Amazon

Design Messenger App

Design Twitter

Design URL Shortener

Design Dropbox

Design Youtube

Design API Rate Limiter

Design Web Crawler

Design Amazon Prime Video

Design Facebook’s Newsfeed

Design Yelp

Design Uber

Design Tinder

Design Tiktok

Design Whatsapp

Most Popular System Design Questions

Mega Compilation : Solved System Design Case studies

Let’s get started!

1.Machine Learning Algorithms: These are mathematical models that are used to analyze and make predictions from data. Some popular machine learning algorithms include:

  • Linear Regression: A type of supervised learning algorithm that is used for predicting continuous values.
  • Logistic Regression: A type of supervised learning algorithm that is used for predicting binary outcomes.
  • Decision Trees: A type of supervised learning algorithm that is used for both classification and regression problems.
  • Random Forest: A type of ensemble learning algorithm that combines multiple decision trees to improve the accuracy of predictions.
  • Support Vector Machines (SVMs): A type of supervised learning algorithm that is used for both classification and regression problems.
  • K Nearest Neighbors (KNN): A type of supervised learning algorithm that is used for classification and regression problems.
  • K-means Clustering: A type of unsupervised learning algorithm that is used for grouping similar data points together.
  • Hierarchical Clustering: A type of unsupervised learning algorithm that is used to group similar data points into a hierarchical structure.
  • Neural Networks: A type of supervised or unsupervised learning algorithm that is inspired by the structure and function of the human brain.

2. Performance Metrics: These are used to evaluate the performance of a machine learning model. The choice of performance metric depends on the type of problem being solved, but some common metrics include:

  • Mean Absolute Error (MAE) and Mean Squared Error (MSE) for regression problems.
  • Accuracy, Precision, Recall, F1 Score, AUC-ROC Curve for classification problems.
  • Silhouette Score, Davies-Bouldin Index for clustering problems.

3. Linear Regression: Linear regression is a supervised learning algorithm used for predicting continuous values. It assumes a linear relationship between the input variables (also known as independent variables) and the output variable (also known as the dependent variable). Linear regression can be used for both simple and multiple regression.

4. Logistic Regression: Logistic regression is a supervised learning algorithm used for predicting binary outcomes. It models the probability of the default class using a logistic function. It is used for binary classification problem.

5. Decision Trees: Decision trees are a type of supervised learning algorithm that is used for both classification and regression problems. It uses a tree-like model of decisions and their possible consequences. Each internal node of the tree represents a feature(or attribute), each branch represents a decision and each leaf node represents the outcome.

6. Random Forest: Random Forest is an ensemble learning algorithm which combines multiple decision trees to improve the accuracy of predictions. It uses bagging technique and random subspace method to make decision trees more robust.

7. Support Vector Machines (SVMs): Support Vector Machines are a type of supervised learning algorithm that is used for both classification and regression problems. It tries to find a hyperplane which maximizes the margin between the different classes.

8. K Nearest Neighbors (KNN): K-nearest neighbors is a type of supervised learning algorithm that is used for classification and regression problems. It finds the k-nearest data points to a given data point, and then uses the majority class among the k-nearest points to classify the given data point.

9. K-means Clustering: K-means is a type of unsupervised learning algorithm that is used for grouping similar data points together. It works by randomly initializing k centroids and then iteratively moving the centroids to the center of the data points in their cluster, while reassigning the data points to the closest centroid.

Complete Code Implementation —

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_squared_error

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression
linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train)
linear_regression_predictions = linear_regression.predict(X_test)

# Logistic Regression
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
logistic_regression_predictions = logistic_regression.predict(X_test)

# Decision Trees
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
decision_tree_predictions = decision_tree.predict(X_test)

# Random Forest
random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)
random_forest_predictions = random_forest.predict(X_test)

# Support Vector Machines
svm = SVC()
svm.fit(X_train, y_train)
svm_predictions = svm.predict(X_test)

# K Nearest Neighbors
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_predictions = knn.predict(X_test)

# K-means Clustering
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
kmeans_predictions = kmeans.predict(X)

# Hierarchical Clustering
hierarchical_clustering = AgglomerativeClustering(n_clusters=3)
hierarchical_clustering.fit(X)
hierarchical_clustering_predictions = hierarchical_clustering.labels_

# Performance Metrics
linear_regression_mse = mean_squared_error(y_test, linear_regression_predictions)
logistic_regression_accuracy = accuracy_score(y_test, logistic_regression_predictions)
decision_tree_f1_score = f1_score(y_test, decision_tree_predictions, average='weighted')
random_forest_precision = precision_score(y_test, random_forest_predictions, average='weighted')
svm_recall = recall_score(y_test, svm_predictions, average='weighted')
knn_accuracy = accuracy_score(y_test, knn_predictions)
kmeans_silhouette_score = silhouette_score(X, kmeans_predictions)
hierarchical_clustering_completeness_score = completeness_score(y, hierarchical_clustering_predictions)

# Print the performance metrics
print(f"Linear Regression MSE: {linear_regression_mse}")
print(f"Logistic Regression Accuracy: {logistic_regression_accuracy}")
print(f"Decision Tree F1 Score: {decision_tree_f1_score}")
print(f"Random Forest Precision: {random_forest_precision}")
print(f"SVM Recall: {svm_recall}")
print(f"KNN Accuracy: {knn_accuracy}")
print(f"K-means Silhouette Score: {kmeans_silhouette_score}")
print(f"Hierarchical Clustering Completeness Score: {hierarchical_clustering_completeness_score}")

Snippet —

Machine Learning Algorithms

In this post we will cover the most commonly used and popular machine learning algorithms along with the code.

1. Linear Regression

It’s a technique to estimate the relationship between two quantitative variables. It is used when you want to establish:

  1. Strength of the relationship — How strong the relationship is between two variables
  2. The value of the dependent variable at a certain value of the independent variable.

where,

y is the predicted value of the dependent variable for any given value of the independent variable which is X.

B0 is the intercept and B1 is the regression coefficient

x is the independent variable

e is the error of the estimate

It works on the assumption that the relationship between the independent and dependent variable is linear: the line of best fit through the data points is a straight line as shown in the diagram.

In order to work with linear regression —

Import the libraries and dataset

Check for null values as well as missing data

Spilt the dataset into train and test set

Fit the dataset into the model sklearn.linear_model

Predict the result y_pred

Visualize the results using matplotlib.pyplot

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(I_Train_data, D_Train_data)
preds = regressor.predict(X_test)

Visualize the training and test set

Training set results —

import matplotlib.pyplot as plt
# Visualizing the Training set results
plt.scatter(X_train, y_train, color = 'green')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.xlabel('Independent Variable set')
plt.ylabel('Dependent Variable set')
plt.show()

Test set results —

# Visualizing the Test set results
plt.scatter(X_test, y_test, color = 'green')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.xlabel('Independent Variable set')
plt.ylabel('Dependent Variable set')
plt.show()

Linear regression uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

Measure the distance of the observed y-values from the predicted y-values at each value of x;

Square each of these distances and calculate the mean of each of the squared distances.

2. Multi Linear Regression

It’s used to estimate the relationship between two or more independent variables and one dependent variable. It is used when you want to establish:

  1. Strength of the relationship — How strong the relationship is between two or more independent variables and one dependent variable
  2. The value of the dependent variable at a certain value of the independent variable.

B0 = the y-intercept

B1X1= the regression coefficient (b1) of the first independent variable (x1)

BnXn = the regression coefficient of the last independent variable

y = the predicted value of the dependent variable

e = model error

Pic credits : jmp

Assumptions of Linear Regression

Linearity

Independence of errors

Homoscedasticity

Multivariate normality

Lack of multi collinearity

In order to find best fit line, it calculates 3 parameters —

  • The t-stat of the model
  • The associated p-value
  • Regression coefficients
# Fitting multiple lineaar regression to the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting the test set results
y_pred = regressor.predict(X_test)
X = np.append(arr = np.ones((20, 1)).astype(int), values = X, axis = 1)
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
    
X_opt = X[:, [0, 1, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
X_opt = X[:, [0, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
X_opt = X[:, [0, 3, 5]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
X_opt = X[:, [0, 3]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

3. Polynomial Regression

In polynomial Regression, the relationship between the independent variable x and dependent variable y is established as the nth degree polynomial providing the best approximation of the relationship between dependent and independent variables.

Pic Credits : numpyninja

It fits a nonlinear relationship between the value of x and conditional mean of y. For cases where data points are arranged in a non-linear fashion, we need the Polynomial Regression model.

# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree = 3)
X_poly = poly.fit_transform(X)
poly.fit(X_poly, y)
l = LinearRegression()
l.fit(X_poly, y)

4. Logistic Regression

In logistic regression, we establish the relationship between the dependent variable and one or more independent variables by estimating probabilities using an equation ( logistic regression).

Pic credits : ResearchGate

Define the Logistic Sigmoid Function 𝜎(𝑧)

def logistic_function(x):
    return 1/(1+np.exp(-x))
logistic_function(0)

Output —

0.5

Compute the Cost Function 𝐽(𝜃) and Gradient

def compute_cost(theta,x,y):
    m=len(y)
    y_pred =logistic_function(np.dot(x,theta))
    error = (y * np.log(y_pred)) +(1-y) * np.log(1-y_pred)
    cost = -1/m * sum(error)
    gradient = 1/m * np.dot(x.transpose(),(y_pred - y))
    return cost[0],gradient

Cost and Gradient

mean_scores = np.mean(scores,axis=0)
std_scores = np.std(scores, axis=0)
scores = ( scores - mean_scores)/std_scores
rows = scores.shape[0]
cols = scores.shape[1]
X=np.append(np.ones((rows,1)),scores,axis=1)
y=results.reshape(rows,1)
theta_init = np.zeros((cols+1,1))
cost,gradient = compute_cost(theta_init,X,y)
print('Cost : ',cost)
print('Gradient : ',gradient)

Output —

Cost 0.693147180559946
Gradient 
[[-0.1       ]
 [-0.28122914]
 [-0.25098615]]

Implement Gradient Descent

def gradient_descent(x,y,theta,alpha,iterations):
    costs = []
    for i in range(iterations):
        cost,gradient = compute_cost(theta,x,y)
        theta -= (alpha * gradient)
        costs.append(cost)
    return theta,costs
theta, costs = gradient_descent(X,y,theta_init,1,200)
print('Theta: ',theta)
print('Resulting Cost: ',costs[-1])

Output —

Theta 
[[1.50850586]
 [3.5468762 ]
 [3.29383709]]
Resulting Cost 0.2048938203512014

Plotting the Convergence of 𝐽(𝜃)

plt.plot(costs)
plt.xlabel('Iterations')
plt.ylabel('$J(\Theta)$')

5. Decision Trees

Decision Tree are widely used in both in classification and regression problems. These are basically predictive models that use binary rules to calculate an output/target value. Each tree has branches, nodes and leaves where the root node represents the entire population or sample.

Pic credits : ResearchGate

Here is the great resource to study how does it work.

from sklearn.tree import DecisionTreeRegressor 
regressor = DecisionTreeRegressor(random_state = 0) 
regressor.fit(X, y)
y_pred = regressor.predict(6.5)

6. Random Forest

It’s a supervised machine learning algorithm that is constructed from decision tree algorithms ( it predicts the outcome by taking the average or mean of the output from the different trees) and Is used to solve both regression and classification problems. It mainly used ensemble learning, a technique in which many classifiers are combined together to provide solutions to complex problems. It’s very efficient as it reduces the overfitting of datasets, provides an effective way of handling missing data, runs efficiently on large databases, achieves extremely high accuracies, increases precision and scales really well when new features are added to the dataset..

Pic credits : ResearchGate

It works well with both categorical and numerical input variables, which in-turn minimizes the time spent on one-hot encoding or labeling data.

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=10, random_state=0)    regressor.fit(X, y)    
y_pred = regressor.predict(4.2)

7. Support Vector Machines

Support Vector Machine is a supervised machine learning algorithm which is used for both classification or regression problems. In this technique we start by plotting each data item as point in n-dimensional space with the value of a particular coordinate as being the value of the feature/independent variable. Then, we find the hyper-plane that differentiates the two classes by a boundary as shown in the image below.

Pic credits : ResearchGate

Model Fitting and Predictions

svmclassifier = SVC(kernel = 'linear' , random_state = 0)
svmclassifier.fit (X_train, Y_train)
Y_pred = svmclassifier.predict(X_test)

8. K Nearest Neighbors

KNN is a classification algorithm in which the output is to predict a class membership. In order to classify an unlabeled object, the ditance of the object to the labeled object is computed, and its k nearesr neighbors are identified.

Pic credits : IBM

The distance measure — Euclidean distance is calculated as square root of the sum of the squared differences between a new point and an existing point.

knn = kNN()
knn.fit(X_train, y_train)

9. K means Clustering

Clustering is a technique of dividing the population or data points, grouping them into different clusters on the basis of similarity and dissimilarity between them. It helps in determining the intrinsic group among the unlabeled data points.

K-means clustering method is used and can be summarized as —

i. Divide into number of cluster K

ii. Find the centroid of the current partition

iii. Calculate the distance each points to Centroids

iv. Group based on minimum distance

v. After re-grouping/re-allotting the points, find the new centroid of the new cluster

Pic credits : Pinterest
# Kmeans ( 4 Clusters )
km=KMeans(n_clusters=4,random_state=0,init="k-means++")
km.fit(c_df_final)
clusters=km.predict(c_df_final)
c_df['cluster_no'] = clusters#plotplt.figure(figsize=(12,9))
plt.scatter(df['Income'],df['Total_purchase'],c=clusters, cmap='icefire')
plt.xlabel('Income')
plt.ylabel('Total Purchase')
plt.grid(False)
plt.show()

Output —

10. Hierarchical Clustering

In simple terms, the algorithm builds a hierarchy of clusters starting from the all the data points assigned to their own cluster.

The algorithm exits when there’s only single cluster is left.

There are two types of hierarchical clustering — Agglomerative hierarchical clustering and Dendrogram.

# Hierarchical Clustering ( Dendogram)
X=df.iloc[:, [3,4]].values
plt.figure(figsize=(20,10))
dendrogram=ch.dendrogram(ch.linkage(X,method = 'ward'))
plt.title('Hierarchical Clustering (Dendrogram plot)')
plt.show()

Output —

11. Neural Network

Neural Network in simple terms is an interconnected group of nodes which take input along with information from other nodes, develop output without programmed rules.

Pic credits : Pinterest

Build a Neural Network Model from Scratch

model = tf.keras.models.Sequential([
    # The input shape here is the desired size of the image 300x300 ( 3 bytes color )
    # This is the first convolution
    tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(300, 300, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    
    # The second convolution
    tf.keras.layers.Conv2D(32,(3,3),activation = 'relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    
    # The third convolution
     tf.keras.layers.Conv2D(64,(3,3),activation = 'relu'),
     tf.keras.layers.MaxPooling2D(2,2),
    
    # The fourth convolution
     tf.keras.layers.Conv2D(64,(3,3),activation = 'relu'),
     tf.keras.layers.MaxPooling2D(2,2),
    
    # The fifth convolution
     tf.keras.layers.Conv2D(64,(3,3),activation = 'relu'),
     tf.keras.layers.MaxPooling2D(2,2),
    
    # The sixth convolution
     tf.keras.layers.Conv2D(64,(3,3),activation = 'relu'),
     tf.keras.layers.MaxPooling2D(2,2),
    
    
    # Flatten the results to feed into a DNN
    tf.keras.layers.Flatten(),
    
    # 512 neuron hidden layer
    tf.keras.layers.Dense(512, activation='relu'),
    
    # We have only 1 output neuron
    # and it will contain a value from 0-1 where 0 for 1 class ('horses') and 1 for the other ('humans')
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()
model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(lr=0.001),
              metrics=['accuracy'])

Output —

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 298, 298, 16)      448       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 149, 149, 16)      0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 147, 147, 32)      4640      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 73, 73, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 71, 71, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 35, 35, 64)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 33, 33, 64)        36928     
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 16, 16, 64)        0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 14, 14, 64)        36928     
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 7, 7, 64)          0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 5, 5, 64)          36928     
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 2, 2, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 256)               0         
_________________________________________________________________
dense (Dense)                (None, 512)               131584    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 513       
=================================================================
Total params: 266,465
Trainable params: 266,465
Non-trainable params: 0
_________________________________________________________________

Project Code —

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neural_network import MLPRegressor

# Load the Boston Housing dataset
boston = load_boston()
X = boston.data
y = boston.target

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression
linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train)
linear_regression_predictions = linear_regression.predict(X_test)

# Logistic Regression
logistic_regression = LogisticRegression()  # Note: Logistic regression is not suitable for regression tasks, it is used for classification tasks
logistic_regression.fit(X_train, y_train)
logistic_regression_predictions = logistic_regression.predict(X_test)

# Decision Trees
decision_tree = DecisionTreeRegressor()
decision_tree.fit(X_train, y_train)
decision_tree_predictions = decision_tree.predict(X_test)

# Random Forest
random_forest = RandomForestRegressor()
random_forest.fit(X_train, y_train)
random_forest_predictions = random_forest.predict(X_test)

# Support Vector Machines
svm = SVR()
svm.fit(X_train, y_train)
svm_predictions = svm.predict(X_test)

# K Nearest Neighbors
knn = KNeighborsRegressor()
knn.fit(X_train, y_train)
knn_predictions = knn.predict(X_test)

# K-means Clustering
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
kmeans_predictions = kmeans.predict(X)

# Hierarchical Clustering
hierarchical_clustering = AgglomerativeClustering(n_clusters=3)
hierarchical_clustering.fit(X)
hierarchical_clustering_predictions = hierarchical_clustering.labels_

# Neural Networks
neural_network = MLPRegressor()
neural_network.fit(X_train, y_train)
neural_network_predictions = neural_network.predict(X_test)

# Performance Metrics
linear_regression_mse = mean_squared_error(y_test, linear_regression_predictions)
linear_regression_r2 = r2_score(y_test, linear_regression_predictions)
decision_tree_mse = mean_squared_error(y_test, decision_tree_predictions)
decision_tree_r2 = r2_score(y_test, decision_tree_predictions)
random_forest_mse = mean_squared_error(y_test, random_forest_predictions)
random_forest_r2 = r2_score(y_test, random_forest_predictions)
svm_mse = mean_squared_error(y_test, svm_predictions)
svm_r2 = r2_score(y_test, svm_predictions)
knn_mse = mean_squared_error(y_test, knn_predictions)
knn_r2 = r2_score(y_test, knn_predictions)
neural_network_mse = mean_squared_error(y_test, neural_network_predictions)
neural_network_r2 = r2_score(y_test, neural_network_predictions)

# Print the performance metrics
print("Linear Regression")
print(f"MSE: {linear_regression_mse}")
print(f"R2 Score: {linear_regression_r2}")
print()
print("Decision Tree")
print(f"MSE: {decision_tree_mse}")
print(f"R2 Score: {decision_tree_r2}")
print()
print("Random Forest")
print(f"MSE: {random_forest_mse}")
print(f"R2 Score: {random_forest_r2})

Snippet —

A project video covering all the Machine Learning Algorithms coming soon ( subscribe today) —

That’s it for 30 days of Data Engineering Series.

Data Engineering projects …coming soon!

Start here Day 1 of Data Engineering :

Let me know if you have questions in the comment section below. Subscribe/ Follow, Like/Clap as it would encourage me to write more in my free time

Stay Tuned!!

Read more —

All the Complete System Design Series Parts —

1. System design basics

2. Horizontal and vertical scaling

3. Load balancing and Message queues

4. High level design and low level design, Consistent Hashing, Monolithic and Microservices architecture

5. Caching, Indexing, Proxies

6. Networking, How Browsers work, Content Network Delivery ( CDN)

7. Database Sharding, CAP Theorem, Database schema Design

8. Concurrency, API, Components + OOP + Abstraction

9. Estimation and Planning, Performance

10. Map Reduce, Patterns and Microservices

11. SQL vs NoSQL and Cloud

12. Most Popular System Design Questions

Github —

For Python Projects —

For complete 60 days of Data Science and ML : Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Follow for more updates. Stay tuned and keep coding!

For other projects, tune to —

Build Machine Learning Pipelines( With Code)

Recurrent Neural Network with Keras

Clustering Geolocation Data in Python using DBSCAN and K-Means

Facial Expression Recognition using Keras

Hyperparameter Tuning with Keras Tuner

Custom Layers in Keras

Data Science
Machine Learning
Tech
Programming
Artificial Intelligence
Recommended from ReadMedium