Day 25 of 30 days of Data Engineering Series with Projects

Welcome back peeps to Day 25 of Data Engineering Series with Projects!

In this we will cover —

Docker

Docker vs Virtual Machines

Most important Docker commands

Kubernetes

Snowflake

Pre-requisite to Day 25 is to complete Day 1–24( link below):

Day 1 : What’s Data Engineering, Why Data Engineering, Data Engineers — ML Engineers — Data Scientists, Purpose and Scope

Day 2 : Complete Python for Data Engineering — Part 1

Day 3 : Complete Advanced Python for Data Engineering — Part 2

Day 4: Techniques to write efficient and Optimized Code

Day 5 : SQL

Day 6 : Advanced SQL

Day 7 : BigQuery and SQL vs NOSQL databases

Day 8 : Advanced Functions

Day 9 : Query Optimizations

Day 10 : MySQL and PostgreSQL

Day 11: Shell scripting and Linux “touch” command

Day 12 : Map Reduce, Data Warehouse, Data Lakes

Day 13: Pandas, Pandas, Data Cleaning and processing, Outlier Detection, Noisy Data, Missing Data, Pandas Functions, Aggregate Functions, Joins

Day 14 : Numpy

Day 15 : Advanced Pandas Techniques

Day 16 : Data Pre-processing, Handling missing values, Data Cleaning, Mean/mode/median Imputation, Hot Deck Imputation, Rescale Data, Binarize Data, Regression Imputation, Stochastic regression imputation, Feature Scaling

Day 17 : Data Augmentation, Read and Process Large Datasets

Day 18 : Data Visualization basics, Data Visualization Projects, Data Visualization using Plotly and Bokeh, Data Profiling, Summary Functions, Indexing, Grouping, Linear Regression, Multi Linear Regression, Polynomial Regression, Regression, Support Vector Regression, Decision Tree Regression, Random Forest Regression, Feature Engineering, GroupBy Features, Categorical and Numerical Features, Missing Value Analysis, Fill the missing Values, Unique Value Analysis, Univariate Analysis, Bivariate Analysis, Multivariate Analysis, Correlation Analysis, Spearman’s ρ, Pearson’s r, Kendall’s τ, Cramér’s V (φc), Phik (φk)

Day 19 : MySQL and PostgreSQL

Day 20 : ETL ( Extract, Tranform and Load) basics, Why ETL is important?, How ETL works, ETL Tools

Day 21 : Structured Data, Semi Structured Data, Unstructured Data, Data Warehouse, Data Mart, Data Lake

Day 22 :Big Data, Types of Big Data, Big data tools, SQL and NoSQL Databases, Hadoop, Hadoop HDFS, Hadoop Yarn

Day 23: Batch Processing, Stream Processing, Apache Spark, Apache Spark Commands, Apache Kafka, How Apache Kafka works

Day 24 : Hive, Zookeper, Pig, Cassandra, Sqoop

Day 25: Docker, Docker vs Virtual Machines, Most important Docker commands, Kubernetes, Snowflake

Projects Videos —

Subscribe today!

Ignito

Excited to share that we have launched our Youtube channel — Ignito to cover all the projects and coding exercise for …

www.youtube.com

Tech Newsletter —

If you are interested, you can join my newsletter through which I send tech interview tips, techniques, patterns, hacks — Software Development, ML, Data Science, Startups and Technology projects to more than 30K readers. You can subscribe to Ignito:

Ignito

Data Science, ML, AI and more… Click to read Ignito, by Naina Chaturvedi, a Substack publication with hundreds of…

naina0405.substack.com

System Design Case Studies — In Depth

Design Instagram

Design Netflix

Design Reddit

Design Amazon

Design Messenger App

Design Twitter

Design URL Shortener

Design Dropbox

Design Youtube

Design API Rate Limiter

Design Web Crawler

Design Amazon Prime Video

Design Facebook’s Newsfeed

Design Yelp

Design Uber

Design Tinder

Design Tiktok

Design Whatsapp

Mega Compilation : Solved System Design Case studies

Let’s get started!

Docker is a platform for developing, shipping, and running applications in containers. Containers are lightweight, portable, and self-sufficient environments that include all the dependencies required to run an application. This allows developers to easily move applications between different environments, such as development, staging, and production.

Docker vs Virtual Machines: Virtual machines (VMs) are software that emulate a physical machine, providing a separate operating system and hardware environment for each application. Docker containers, on the other hand, share the host machine’s operating system kernel and only include the libraries and dependencies needed by the application. This means that Docker containers are more lightweight and efficient than VMs, and can run on a single machine alongside other containers.

Some important Docker commands:

docker run - run a command in a new container
docker start - start one or more stopped containers
docker stop - stop one or more running containers
docker ps - list all running containers
docker images - list all images
docker pull - download an image from a registry
docker push - upload an image to a registry
docker build - build an image from a Dockerfile

Implementation —

# Run a command in a new container
docker run -it ubuntu:latest /bin/bash

# Start one or more stopped containers
docker start container_name

# Stop one or more running containers
docker stop container_name

# List all running containers
docker ps

# List all images
docker images

# Download an image from a registry
docker pull image_name

# Upload an image to a registry
docker push image_name

# Build an image from a Dockerfile
docker build -t image_name /path/to/dockerfile

Kubernetes is an open-source container orchestration system for automating the deployment, scaling, and management of containerized applications. It provides a way to manage multiple containers across multiple machines, and can automatically scale resources as needed.

Implementation —

# Create a deployment
kubectl create deployment myapp --image=myimage:tag

# Scale the deployment
kubectl scale deployment myapp --replicas=3

# Expose the deployment as a service
kubectl expose deployment myapp --port=8080 --target-port=80 --type=LoadBalancer

# List all pods
kubectl get pods

# List all services
kubectl get services

# Describe a pod
kubectl describe pod pod_name

# Describe a service
kubectl describe service service_name

# Delete a deployment
kubectl delete deployment myapp

# Delete a service
kubectl delete service myapp

# Apply changes to a deployment
kubectl apply -f deployment.yaml

# Execute a command in a pod
kubectl exec pod_name -- command

# Port-forward to a pod
kubectl port-forward pod_name local_port:pod_port

# Get logs of a pod
kubectl logs pod_name

# Get the status of all resources in a namespace
kubectl get all -n namespace_name

Snowflake is a cloud-based data warehouse that allows users to store, process and analyze data using SQL. It is a fully relational ANSI SQL data warehouse as a service, it’s unique architecture allows for unlimited concurrent users, and separate compute and storage resources.

Implementation —

-- Create a database
CREATE DATABASE mydatabase;

-- Use a database
USE DATABASE mydatabase;

-- Create a schema
CREATE SCHEMA myschema;

-- Use a schema
USE SCHEMA myschema;

-- Create a table
CREATE TABLE mytable (
  column1 VARCHAR,
  column2 INT,
  column3 DATE
);

-- Insert data into a table
INSERT INTO mytable (column1, column2, column3)
VALUES ('value1', 10, '2023-05-21'),
       ('value2', 20, '2023-05-22');

-- Select data from a table
SELECT * FROM mytable;

-- Update data in a table
UPDATE mytable
SET column1 = 'new_value'
WHERE column2 = 10;

-- Delete data from a table
DELETE FROM mytable
WHERE column2 = 20;

-- Create a view
CREATE VIEW myview AS
SELECT column1, column2
FROM mytable
WHERE column3 = '2023-05-21';

-- Grant privileges to a role
GRANT SELECT, INSERT, UPDATE, DELETE ON mytable TO myrole;

-- Revoke privileges from a role
REVOKE INSERT, UPDATE ON mytable FROM myrole;

-- Show grants for a table
SHOW GRANTS ON mytable;

-- Describe a table
DESCRIBE mytable;

-- Show query history
SHOW QUERY_HISTORY;

-- Show warehouse status
SHOW WAREHOUSES;

-- Show table sizes
SHOW TABLES LIKE 'my%';

Docker

Docker is an open platform for programmers and sysadmins of distributed applications to build, ship and run any app anywhere you want.

It’s a container platform to build, configure and distribute docker containers and has its own clustering solution.

Docker file allows users to build custom docket images and can be distributed. It allows images to be distributed and scaled.

It’s highly scalable and does auto load balancing of traffic between containers in the cluster.

Docket most important commands —

docker build -t <imgname> : Build an Image from a Dockerfile

docker images : To list local images

docker rmi : To delete an image

docker login -u <usrname> : To login into docket

docker push <uname>/<imgname> : To publish an image to the Docket Hub

docker search <imgname> : To search hub for an image

docker pull <imgname> : To pull an image from the docker hub

docker — help : To get help from the docker documentation

docker run — name <cntrname> <imagename> : To create and run a container from the image with a custome name

docker start|stop <cntrname> : To stop or start a container

docker logs -f <cntrname> : To fetch the logs of the container

docker ps : To list currently running containers

docker ps — all : To list all the docker containers

docker container stats : To get information of the resource usage stats

docker commit <cntrimage> : To commit a new docker image

docker wait <cntrname> : To wait until the container terminates and returns exit code

docker export <cntrname> : To export the content of the container

docker stack services stack_name : To see all the running services

To compose docker —

docker — compose start

docker — compose stop

docker — compose pause

docker — compose unpause

docker — compose ps

docker — compose up

docker — compose down

Implementation —

# Build an Image from a Dockerfile
docker build -t <imgname> .

# List local images
docker images

# Delete an image
docker rmi <imgname>

# Login to Docker Hub
docker login -u <usrname>

# Publish an image to the Docker Hub
docker push <uname>/<imgname>

# Search Docker Hub for an image
docker search <imgname>

# Pull an image from the Docker Hub
docker pull <imgname>

# Get help from the Docker documentation
docker --help

# Create and run a container from the image with a custom name
docker run --name <cntrname> <imagename>

# Start or stop a container
docker start|stop <cntrname>

# Fetch the logs of the container
docker logs -f <cntrname>

# List currently running containers
docker ps

# List all Docker containers
docker ps -a

# Get resource usage stats of containers
docker container stats

# Commit a new Docker image
docker commit <cntrname>

# Wait until the container terminates and returns exit code
docker wait <cntrname>

# Export the content of the container
docker export <cntrname>

# See all running services in a Docker stack
docker stack services stack_name

# Compose Docker commands
docker-compose start
docker-compose stop
docker-compose pause
docker-compose unpause
docker-compose ps
docker-compose up
docker-compose down

Snippet —

Kubernetes

Kubernetes is an environment for managing the cluster of docker containers. It’s highly scalable and can do auto scaling.

It can also deploy rolling updates and auto rollbacks. It has in built tools for monitoring and logging and shares storage with other containers belonging to the same Pod.

It can work with any compartment framework that fits with OCI guidelines.

Why Kubernetes —

It optimizes the cost
Automated provisioning
Gives productivity boost
Software portability
Automated Rollback
Auto Scaling
Load Balancing
Storage Orchestration
Batch execution

Some important Kubernetes commands:

kubectl get - retrieve information about resources in a cluster
kubectl describe - show detailed information about a resource
kubectl create - create a resource from a file or from stdin
kubectl replace - replace a resource by filename or stdin
kubectl delete - delete resources by filenames, stdin, resources and names, or label selector
kubectl apply - apply a configuration to a resource by filename or stdin
kubectl exec - execute a command in a container
kubectl logs - print the logs for a container in a pod
kubectl scale - set a new size for a Deployment, ReplicaSet, or ReplicationController
kubectl rolling-update - roll updates of pods in a deployment
kubectl port-forward - forward one or more local ports to a pod
kubectl top - show metrics for a given resource or pod
kubectl cluster-info - display addresses of the master and services
kubectl config - Modify kubeconfig files using subcommands like "kubectl config view"
kubectl auth - Inspect authorization.

Implementation —

# Retrieve information about resources in a cluster
kubectl get <resource>

# Show detailed information about a resource
kubectl describe <resource> <name>

# Create a resource from a file or from stdin
kubectl create -f <filename>

# Replace a resource by filename or stdin
kubectl replace -f <filename>

# Delete resources by filenames, stdin, resources and names, or label selector
kubectl delete <resource> <name>

# Apply a configuration to a resource by filename or stdin
kubectl apply -f <filename>

# Execute a command in a container
kubectl exec <pod> -- <command>

# Print the logs for a container in a pod
kubectl logs <pod>

# Set a new size for a Deployment, ReplicaSet, or ReplicationController
kubectl scale <resource> <name> --replicas=<replicas>

# Roll updates of pods in a deployment
kubectl rolling-update <deployment> --image=<new_image>

# Forward one or more local ports to a pod
kubectl port-forward <pod> <local_port>:<pod_port>

# Show metrics for a given resource or pod
kubectl top <resource> <name>

# Display addresses of the master and services
kubectl cluster-info

# Modify kubeconfig files using subcommands
kubectl config view

# Inspect authorization
kubectl auth <subcommand>

Snippet —

Snowflake

Snowflake is an analytics data warehouse which is much faster, easier than traditional data warehouses and enables data storage and processing.

It used SQL database engine along with unique architecture designed for the clouds.

It’s architecture consists of three layers —

Db storage — To store and reorganize the data in a compressed format.

Query processing — To process queries using warehouses that are virtual and consists of multiple compute nodes

Cloud services — To manage different activities and processes across snowflake like authentication, access control, query optimization etc.

Data Files can be loaded to Snowflake through local environment as well as Amazon S3, GC Storage etc.

Some important Snowflake commands include:

SELECT — retrieves data from one or more tables in the database
INSERT — adds new rows of data to a table
UPDATE — modifies existing data in a table
DELETE — deletes data from a table
CREATE TABLE — creates a new table in the database
ALTER TABLE — modifies an existing table’s structure
DROP TABLE — deletes a table from the database
USE ROLE — changes the active role for the current session
GRANT — grants permissions to a user or role on a specific object
REVOKE — revokes previously granted permissions from a user or role on a specific object

It is worth noting that Snowflake also support DDL and DCL commands like:

CREATE DATABASE
DROP DATABASE
CREATE SCHEMA
DROP SCHEMA
CREATE VIEW
DROP VIEW
CREATE STAGE
DROP STAGE
CREATE PIPE
DROP PIPE
CREATE INTEGRATION
DROP INTEGRATION
CREATE TASK
DROP TASK
CREATE FILE FORMAT
DROP FILE FORMAT
CREATE WAREHOUSE
START/STOP/RESUME/SUSPEND WAREHOUSE
CREATE USER
DROP USER
SET/ALTER/RESET PASSWORD
GRANT/REVOKE ROLE
GRANT/REVOKE PRIVILEGE
GRANT/REVOKE ALL PRIVILEGE

Implementation —

-- Retrieves data from one or more tables in the database
SELECT column1, column2 FROM table_name;

-- Adds new rows of data to a table
INSERT INTO table_name (column1, column2) VALUES (value1, value2);

-- Modifies existing data in a table
UPDATE table_name SET column1 = new_value WHERE condition;

-- Deletes data from a table
DELETE FROM table_name WHERE condition;

-- Creates a new table in the database
CREATE TABLE table_name (column1 datatype, column2 datatype);

-- Modifies an existing table's structure
ALTER TABLE table_name ADD COLUMN new_column datatype;

-- Deletes a table from the database
DROP TABLE table_name;

-- Changes the active role for the current session
USE ROLE role_name;

-- Grants permissions to a user or role on a specific object
GRANT privilege_name ON object_name TO user_or_role;

-- Revokes previously granted permissions from a user or role on a specific object
REVOKE privilege_name ON object_name FROM user_or_role;

-- Additional DDL and DCL commands
CREATE DATABASE database_name;
DROP DATABASE database_name;
CREATE SCHEMA schema_name;
DROP SCHEMA schema_name;
CREATE VIEW view_name AS SELECT ...;
DROP VIEW view_name;
CREATE STAGE stage_name;
DROP STAGE stage_name;
CREATE PIPE pipe_name AS COPY INTO table_name FROM ...;
DROP PIPE pipe_name;
CREATE INTEGRATION integration_name TYPE <type> ...;
DROP INTEGRATION integration_name;
CREATE TASK task_name ...;
DROP TASK task_name;
CREATE FILE FORMAT file_format_name ...;
DROP FILE FORMAT file_format_name;
CREATE WAREHOUSE warehouse_name ...;
START/STOP/RESUME/SUSPEND WAREHOUSE warehouse_name;
CREATE USER user_name ...;
DROP USER user_name;
SET/ALTER/RESET PASSWORD user_name ...;
GRANT/REVOKE ROLE role_name TO/FROM user_name;
GRANT/REVOKE PRIVILEGE privilege_name ON object_name TO/FROM user_or_role;
GRANT/REVOKE ALL PRIVILEGE ON object_name TO/FROM user_or_role;

Snippet —

Project Code —

# Dockerfile (used to build the Docker image)
FROM python:3.9
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "app.py"]

# Kubernetes Deployment file (deploying the Docker image)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app-image
        ports:
        - containerPort: 8080

# Kubernetes Service file (exposing the application within the cluster)
apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  selector:
    app: my-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080

# Snowflake SQL script (performing data operations in Snowflake)
CREATE TABLE my_table (
  id INT,
  name VARCHAR,
  age INT
);

INSERT INTO my_table (id, name, age) VALUES (1, 'John', 30);
INSERT INTO my_table (id, name, age) VALUES (2, 'Jane', 25);

SELECT * FROM my_table;

Snippet —

A project video Docker, Kubernetes and Snowflake covering coming soon ( subscribe today) —

Ignito

Excited to share that we have launched our Youtube channel — Ignito to cover all the projects and coding exercise for …

www.youtube.com

That’s it for now.

Find Day 26 Below :

Day 26 of 30 days of Data Engineering Series with Projects

Welcome back peeps to Day 26 of Data Engineering Series with Projects!

medium.com

Let me know if you have questions in the comment section below. Subscribe/ Follow, Like/Clap as it would encourage me to write more in my free time

Stay Tuned!!

All the Complete System Design Series Parts —

1. System design basics

2. Horizontal and vertical scaling

3. Load balancing and Message queues

4. High level design and low level design, Consistent Hashing, Monolithic and Microservices architecture

5. Caching, Indexing, Proxies

6. Networking, How Browsers work, Content Network Delivery ( CDN)

7. Database Sharding, CAP Theorem, Database schema Design

8. Concurrency, API, Components + OOP + Abstraction

9. Estimation and Planning, Performance

10. Map Reduce, Patterns and Microservices

11. SQL vs NoSQL and Cloud

12. Most Popular System Design Questions

Github —

Complete-System-Design/README.md at main · Coder-World04/Complete-System-Design

This repository contains everything you need to become proficient in System Design Topics you should know in System…

github.com

For Python Projects —

Complete Python And Projects — Mega Compilation

Everything that you need to know in Python with Projects…

medium.com

Analyzing Video using Python, OpenCV and NumPy

With Code Implementation…

medium.datadriveninvestor.com

For complete 60 days of Data Science and ML : Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Connect the ML dots…

medium.com

Follow for more updates. Stay tuned and keep coding!

For other projects, tune to —

Build Machine Learning Pipelines( With Code)

Build Machine Learning Pipelines( With Code) — Part 1

Complete implementation…

medium.datadriveninvestor.com

Recurrent Neural Network with Keras

Recurrent Neural Network with Keras

Project Implementation and cheatsheet…

medium.datadriveninvestor.com

Clustering Geolocation Data in Python using DBSCAN and K-Means

Clustering Geolocation Data in Python using DBSCAN and K-Means

Project Implementation…

medium.datadriveninvestor.com

Facial Expression Recognition using Keras

Facial Expression Recognition using Keras

Project Implementation…

medium.datadriveninvestor.com

Hyperparameter Tuning with Keras Tuner

Hyperparameter Tuning with Keras Tuner

Project Implementation….

medium.datadriveninvestor.com

Custom Layers in Keras

Custom Layers in Keras

Code implementation …

medium.datadriveninvestor.com

Day 25 of 30 days of Data Engineering Series with Projects

Docker

Docker vs Virtual Machines

Most important Docker commands

Kubernetes

Snowflake

Ignito

Excited to share that we have launched our Youtube channel — Ignito to cover all the projects and coding exercise for …

Tech Newsletter —

Ignito

Data Science, ML, AI and more… Click to read Ignito, by Naina Chaturvedi, a Substack publication with hundreds of…

System Design Case Studies — In Depth

Design Instagram

Design Netflix

Design Reddit

Design Amazon

Design Messenger App

Design Twitter

Design URL Shortener

Design Dropbox

Design Youtube

Design API Rate Limiter

Design Web Crawler

Design Amazon Prime Video

Design Facebook’s Newsfeed

Design Yelp

Design Uber

Design Tinder

Design Tiktok

Design Whatsapp

Most Popular System Design Questions

Mega Compilation : Solved System Design Case studies

Let’s get started!

Docker

Kubernetes

Snowflake

A project video Docker, Kubernetes and Snowflake covering coming soon ( subscribe today) —

Ignito

Excited to share that we have launched our Youtube channel — Ignito to cover all the projects and coding exercise for …

That’s it for now.

Find Day 26 Below :

Day 26 of 30 days of Data Engineering Series with Projects

Welcome back peeps to Day 26 of Data Engineering Series with Projects!

Read more —

All the Complete System Design Series Parts —

Github —

Complete-System-Design/README.md at main · Coder-World04/Complete-System-Design

This repository contains everything you need to become proficient in System Design Topics you should know in System…

For Python Projects —

Complete Python And Projects — Mega Compilation

Everything that you need to know in Python with Projects…

Analyzing Video using Python, OpenCV and NumPy

With Code Implementation…

Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Connect the ML dots…

For other projects, tune to —

Build Machine Learning Pipelines( With Code) — Part 1

Complete implementation…

Recurrent Neural Network with Keras

Project Implementation and cheatsheet…

Clustering Geolocation Data in Python using DBSCAN and K-Means

Project Implementation…

Facial Expression Recognition using Keras

Project Implementation…

Hyperparameter Tuning with Keras Tuner

Project Implementation….

Custom Layers in Keras

Code implementation …