Day 7 of 30 days of Data Engineering Series with Projects

Welcome back peeps to Day 7 of Data Engineering Series with Projects!

Day 1 : What’s Data Engineering, Why Data Engineering, Data Engineers — ML Engineers — Data Scientists, Purpose and Scope

Day 2 : Complete Python for Data Engineering — Part 1

Day 3 : Complete Advanced Python for Data Engineering — Part 2

Day 4: Techniques to write efficient and Optimized Code

Day 5 : SQL

Day 6 : Advanced SQL

Day 7 : BigQuery and SQL vs NOSQL databases

Projects Videos —

Subscribe today!

Ignito

Excited to share that we have launched our Youtube channel — Ignito to cover all the projects and coding exercise for …

www.youtube.com

Tech Newsletter —

If you are interested, you can join my newsletter through which I send tech interview tips, techniques, patterns, hacks — Software Development, ML, Data Science, Startups and Technology projects to more than 30K readers. You can subscribe to Ignito:

Ignito

Data Science, ML, AI and more… Click to read Ignito, by Naina Chaturvedi, a Substack publication with hundreds of…

naina0405.substack.com

System Design Case Studies — In Depth

Design Instagram

Design Netflix

Design Reddit

Design Amazon

Design Messenger App

Design Twitter

Design URL Shortener

Design Dropbox

Design Youtube

Design API Rate Limiter

Design Web Crawler

Design Amazon Prime Video

Design Facebook’s Newsfeed

Design Yelp

Design Uber

Design Tinder

Design Tiktok

Design Whatsapp

Mega Compilation : Solved System Design Case studies

Pre-requisite to Day 7 is to complete Day 1–6( link below):

Day 1 of 30 days of Data Engineering can be found below —

Day 1 of 30 days of Data Engineering

With examples and projects…

medium.com

Day 2 of 30 days of Data Engineering can be found below —

Day 2 of 30 days of Data Engineering

With examples and projects…

medium.com

Day 3 of 30 days of Data Engineering can be found below —

Day 3 of 30 days of Data Engineering Series

With examples and projects …

medium.com

This is Day 7 of 30 days of Data Engineering Series where we will be covering —

Big query

SQL vs NO SQL Databases

Let’s get started!

BigQuery basics

BigQuery is a Serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility and is integrated on Google Cloud Platform. If you are a developer who works with huge amounts of data, you can use Google Cloud Platform to query petabytes of data at fast speeds. In this post you will learn how to use different clauses of Standard SQL to analyze your data on Google Cloud Platform.

To get started signup : https://cloud.google.com/

SELECT, FROM and WHERE clauses in Bigquery

In order to retrieve or pull all the columns and rows we use * with the SELECT statement. To pull the selective columns we define the column names with the SELECT Statement. FROM statement is used to specify the table from which you would like to retrieve the data. WHERE is used to filter the rows that meet certain condition or criteria.

Format to use SELECT, FROM and Where :

SELECT column_names

FROM table_name

WHERE condition

WHERE clause is used with SELECT, UPDATE and DELETE statement

Date and Extract Clauses in Bigquery

Alias is used to give a shorthand name for a table or column. Whenever you create complex queries, Alias can help you to improve the readability of your query.

Format to use Alias

SELECT column_name AS Alias-name

FROM table alias_name

WHERE condition

BigQuery has rich DATETIME Functions.

In order to know the current date and time, use the CURRENT_DATETIME function which will return the current time as a DATETIME object.

Format —

CURRENT_DATETIME([timezone])

Since the DATETIME format is in the form of DATETIME(year, month, day, hour, minute, second). In order to Extract the year or month or day or hour or minute, second etc you can use the EXTRACT clause which returns a value that corresponds to the specified part from a supplied expression_datetime.

Format —

EXTRACT(part FROM expression_datetime)

DAY: Returns values in the range [1,7] with Sunday as the first day of the week.
YEAR returns the year.
WEEK: Returns the week number of the date in the range [0, 53]. Weeks begin with Sunday, and dates prior to the first Sunday of the year are in week 0.
MONTH
QUARTER: Returns values in the range [1,4].

Code Implementation —

Common Expression Table

In order to simplify the complex queries, make them more readable and query against them, Common Table Expressions is used. Common Table Expressions is nothing but a query whose result set you can reference in a later section of your query i.e it returns a temporary table that you return within your query. It uses the WITH clause.

Format —

WITH CTE_expression_name AS (

SELECT column_name

FROM table

WHERE condition)

SELECT column_name

FROM table

INNER JOIN CTE_expression_name on condition

Implement JOINS and UNION clauses

JOIN combines data from multiple tables. There are four types of Joins that you can use.

INNER JOIN: Select records that have matching values in both tables.
CROSS JOIN : Select records in the first table multiplied by the records in the second table.
FULL JOIN: Selects all records that match either left or right table records.
LEFT JOIN: Select records from the first (left-most) table with matching right table records.
RIGHT JOIN: Select records from the second (right-most) table with matching left table records.

Format —

SELECT column_names

FROM table1 JOIN table2

ON column_name1 = column_name2

WHERE condition

UNION is used to combine the result-set of two or more SELECT statements. ALL or DISTINCT keyword can be used with a UNION clause.

100 days : Your Data Science and ML Degree — Part 2

Everything you need to know…

medium.com

SELECT column_names1

FROM table1

UNION

SELECT column_names2

FROM table2

Cluster Analysis using Python — Part 1

With implementation/code

medium.datadriveninvestor.com

Query —

Inner Join

/*

SELECT column_names
FROM table1 INNER JOIN table2
ON column_name1 = column_name2
WHERE condition

SELECT end_station_name
FROM `bigquery-public-data.london_bicycles.cycle_hire`
INNER JOIN `bigquery-public-data.london_bicycles.cycle_stations`
ON end_station_id = id
where end_station_name like '%A%'
LIMIT 20;

-------------------------------

Left Join

/*SELECT column_names
FROM table1 Left JOIN table2
ON column_name1 = column_name2
WHERE condition
*/

SELECT end_station_name

FROM `bigquery-public-data.london_bicycles.cycle_hire`

LEFT JOIN `bigquery-public-data.london_bicycles.cycle_stations`

ON end_station_id = id

where end_station_name like 'C%'

LIMIT 20;

-----------------

Right Join

/*SELECT column_names
FROM table1 Right JOIN table2
ON column_name1 = column_name2
WHERE condition
*/

SELECT end_station_name

FROM `bigquery-public-data.london_bicycles.cycle_hire`

RIGHT JOIN `bigquery-public-data.london_bicycles.cycle_stations`

ON end_station_id = id

where end_station_name like 'C%'

LIMIT 20;

-------------

UNION

/*SELECT column_names1
FROM table1
UNION
SELECT column_names2
FROM table2
*/

SELECT end_station_name as T_name

FROM `bigquery-public-data.london_bicycles.cycle_hire`

UNION DISTINCT

SELECT terminal_name

FROM `bigquery-public-data.london_bicycles.cycle_stations`

LIMIT 20;

Code Implementation —

UNNEST Clause

UNNEST is used to flatten the arrays. UNNEST takes an array and in turn returns a table with a single row for each element in the array. It destroys the order of array elements.

Format —

SELECT column_name, COUNT(column_name) as counts

FROM table,

UNNEST(array) as alias_name

WHERE condition

GROUP BY column_name

ORDER BY column_name [ASC or DESC]

Here are some common commands used in BigQuery:

SELECT — used to query data from one or more tables
FROM — specifies the table(s) from which to retrieve data
WHERE — used to filter the query results based on certain conditions
GROUP BY — used to group query results by one or more columns
HAVING — used to filter the query results after they have been grouped
ORDER BY — used to sort the query results in ascending or descending order
LIMIT — used to limit the number of query results returned
JOIN — used to combine rows from two or more tables based on a related column between them
UNION — used to combine the results of two or more SELECT statements
AS — used to assign an alias to a column or table in the query results

Code Implementation -

SQL vs NoSQL Databases

SQL

As you design large systems ( or even smaller ones), you need to decide the inflow-processing and outflow of data coming- and getting processed in the system.

Data is generally organized in tables as rows and columns where columns represents attributes and rows represent records and keys have logical relationships. The SQL db schema always shows relational, tabular data following the ACID properties.

There are two types of databases to consider — SQL and NoSQL databases.

SQL databases have predefined schema and the data is organized/displayed in the form of tables. These databases use SQL ( Structured Query Language) to define, manipulate, update the data.

Relational databases like MS SQL Server, PostgreSQL, Sybase, MySQL Database, Oracle, etc. use SQL.

NoSQL

NoSQL databases on the other side, have no predefined schema which adds to more flexibility to use the formats that best suits the data — Work with graphs, column-oriented data, key-value and documents etc. They are generally preferred for hierarchical data, graphs ( e.g. social network) and to work with large data.

Some examples — Wide-column use Cassandra and HBase, Graph use Neo4j, Document use MongoDB and CouchDB, Key-value use Redis and DynamoDB,

A good comparison —

One of the important question that you might be asked, when to use which db?

When use SQL databases?

When you want to —

1. Scale Vertically — increase the processing power of your hardware

2. Work with predefined schema

3. Process queries and joins against structured data

4. Optimize the storage

5. Data is small

When to use NoSQL databases?

When you want to —

1. Scale horizontally

2. Work with graphs, column-oriented data, key-value and documents etc

3. Use multiple languages to query

4. Work with dynamic schema that has no predefined schema

5. Large Data

Code Implementation -

That’s it for now. Day 8 Coming soon!

Let me know if you have questions in the comment section below. Subscribe/ Follow, Like/Clap as it would encourage me to write more in my free time

Stay Tuned!!

All the Complete System Design Series Parts —

1. System design basics

2. Horizontal and vertical scaling

3. Load balancing and Message queues

4. High level design and low level design, Consistent Hashing, Monolithic and Microservices architecture

5. Caching, Indexing, Proxies

6. Networking, How Browsers work, Content Network Delivery ( CDN)

7. Database Sharding, CAP Theorem, Database schema Design

8. Concurrency, API, Components + OOP + Abstraction

9. Estimation and Planning, Performance

10. Map Reduce, Patterns and Microservices

11. SQL vs NoSQL and Cloud

12. Most Popular System Design Questions

Github —

Complete-System-Design/README.md at main · Coder-World04/Complete-System-Design

This repository contains everything you need to become proficient in System Design Topics you should know in System…

github.com

Keep learning and coding ;)

Day 5 coming soon!

For Python Projects —

Complete Python And Projects — Mega Compilation

Everything that you need to know in Python with Projects…

medium.com

Analyzing Video using Python, OpenCV and NumPy

With Code Implementation…

medium.datadriveninvestor.com

For complete 60 days of Data Science and ML : Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Connect the ML dots…

medium.com

Follow for more updates. Stay tuned and keep coding! Disclosure: Some of the links are affiliates.

For other projects, tune to —

Build Machine Learning Pipelines( With Code)

Build Machine Learning Pipelines( With Code) — Part 1

Complete implementation…

medium.datadriveninvestor.com

Recurrent Neural Network with Keras

Recurrent Neural Network with Keras

Project Implementation and cheatsheet…

medium.datadriveninvestor.com

Clustering Geolocation Data in Python using DBSCAN and K-Means

Clustering Geolocation Data in Python using DBSCAN and K-Means

Project Implementation…

medium.datadriveninvestor.com

Facial Expression Recognition using Keras

Facial Expression Recognition using Keras

Project Implementation…

medium.datadriveninvestor.com

Hyperparameter Tuning with Keras Tuner

Hyperparameter Tuning with Keras Tuner

Project Implementation….

medium.datadriveninvestor.com

Custom Layers in Keras

Custom Layers in Keras

Code implementation …

medium.datadriveninvestor.com

Day 7 of 30 days of Data Engineering Series with Projects

Ignito

Excited to share that we have launched our Youtube channel — Ignito to cover all the projects and coding exercise for …

Tech Newsletter —

Ignito

Data Science, ML, AI and more… Click to read Ignito, by Naina Chaturvedi, a Substack publication with hundreds of…

System Design Case Studies — In Depth

Design Instagram

Design Netflix

Design Reddit

Design Amazon

Design Messenger App

Design Twitter

Design URL Shortener

Design Dropbox

Design Youtube

Design API Rate Limiter

Design Web Crawler

Design Amazon Prime Video

Design Facebook’s Newsfeed

Design Yelp

Design Uber

Design Tinder

Design Tiktok

Design Whatsapp

Most Popular System Design Questions

Mega Compilation : Solved System Design Case studies

Day 1 of 30 days of Data Engineering

With examples and projects…

Day 2 of 30 days of Data Engineering

With examples and projects…

Day 3 of 30 days of Data Engineering Series

With examples and projects …

Big query

SQL vs NO SQL Databases

BigQuery basics

SELECT, FROM and WHERE clauses in Bigquery

Date and Extract Clauses in Bigquery

Common Expression Table

Implement JOINS and UNION clauses

100 days : Your Data Science and ML Degree — Part 2

Everything you need to know…

Cluster Analysis using Python — Part 1

With implementation/code

UNNEST Clause

SQL vs NoSQL Databases

SQL

NoSQL

That’s it for now. Day 8 Coming soon!

Read more —

All the Complete System Design Series Parts —

Github —

Complete-System-Design/README.md at main · Coder-World04/Complete-System-Design

This repository contains everything you need to become proficient in System Design Topics you should know in System…

For Python Projects —

Complete Python And Projects — Mega Compilation

Everything that you need to know in Python with Projects…

Analyzing Video using Python, OpenCV and NumPy

With Code Implementation…

Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Connect the ML dots…

For other projects, tune to —

Build Machine Learning Pipelines( With Code) — Part 1

Complete implementation…

Recurrent Neural Network with Keras

Project Implementation and cheatsheet…

Clustering Geolocation Data in Python using DBSCAN and K-Means

Project Implementation…

Facial Expression Recognition using Keras

Project Implementation…

Hyperparameter Tuning with Keras Tuner

Project Implementation….

Custom Layers in Keras

Code implementation …