Naina Chaturvedi

Day 3 of 30 days of Data Engineering Series

With examples and projects …

Pic credits : alphalogic

Welcome back peeps to Day 3 of Data Engineering!

What’s covered in 30 days of Data Engineering with projects Series till now —

Day 1 : What’s Data Engineering, Why Data Engineering, Data Engineers — ML Engineers — Data Scientists, Purpose and Scope

Day 2 : Complete Python for Data Engineering — Part 1

Day 3 : Complete Advanced Python for Data Engineering — Part 2

Day 4: Techniques to write efficient and Optimized Code

Day 5 : SQL

Day 6 : Advanced SQL

Day 7 : BigQuery and SQL vs NOSQL databases

Day 8 : Advanced Functions

Day 9 : Query Optimizations

Day 10 : MySQL and PostgreSQL

Day 11: Shell scripting and Linux “touch” command

Day 12 : Map Reduce, Data Warehouse, Data Lakes

Day 13: Pandas, Data Cleaning and processing, Outlier Detection, Noisy Data, Missing Data, Pandas Functions, Aggregate Functions, Joins

Day 14 : Numpy

Day 15 : Advanced Pandas Techniques

Day 16 : Data Pre-processing, Handling missing values, Data Cleaning, Mean/mode/median Imputation, Hot Deck Imputation, Rescale Data, Binarize Data, Regression Imputation, Stochastic regression imputation, Feature Scaling

Day 17 : Data Augmentation, Read and Process Large Datasets

Day 18 : Data Visualization basics, Data Visualization Projects, Data Visualization using Plotly and Bokeh, Data Profiling, Summary Functions, Indexing, Grouping, Linear Regression, Multi Linear Regression, Polynomial Regression, Regression, Support Vector Regression, Decision Tree Regression, Random Forest Regression, Feature Engineering, GroupBy Features, Categorical and Numerical Features, Missing Value Analysis, Fill the missing Values, Unique Value Analysis, Univariate Analysis, Bivariate Analysis, Multivariate Analysis, Correlation Analysis, Spearman’s ρ, Pearson’s r, Kendall’s τ, Cramér’s V (φc), Phik (φk)

Day 19 : MySQL and PostgreSQL

This is Day 3 of 30 days of Data Engineering Series where we will be covering —

Complete Advanced Python for Data Engineering — Part 2

Our whole syllabi for 30 days of Data Engineering —

I’ll be covering only the most important topics in Data Engineering, with projects (listed below) —

1. Data Engineering

What’s Data Engineering

Why Data Engineering

Data Engineers — ML Engineers — Data Scientists

Purpose and Scope

2. Python for Data Engineering

Basic Python with Project

Advanced Python with Project

Techniques to write efficient and optimized code

3. Scripting and Automation

Shell Scripting

CRON

ETL

4. Relational Databases and SQL

RDBMS

Data Modeling

Basic SQL

Advanced SQL

Big Query

5. NoSQL Databases and Map Reduce

Unstructured Data

Advanced ETL

Map-Reduce

Data Warehouses

Data API

6. Data Analysis

Pandas

Numpy

Web Scraping

Data Visualization

7. Data Processing Techniques

Batch Processing : Apache Spark

Stream Processing — Spark Streaming

Build Data Pipelines

Target Databases

Machine learning Algorithms

8. Big Data

Big data basics

HDFS in detail

Hadoop Yarn

Sqoop Hadoop

Hive

Pig

Hbase

9. WorkFlows

Introduction to Airflow

Airflow hands on project

10. Infrastructure

Docker

Kubernetes

Business Intelligence

11. Cloud Computing

AWS

Google Cloud Platform

12. Research Papers — Data Engineering

Some of the best data engineering research papers I have read over the years, to help you get up to speed with industry standards and what’s next in this field.

Let’s get started with Day 3 —

We will cover the Python topics below in detail, with hands-on coding exercises —

1. Data types, strings, operators, and Chaining Comparison Operators with Logical Operators

2. Python Lists and Dictionaries, Sets, Tuples

3. Loops, Break and Continue Statements

4. Object-Oriented Programming — Class and attributes

5. Python strings in detail

6. Python F-String

7. Map, Classes, Functions and Arguments

8. First Class functions, Private Variables, Global and Non Local Variables, __import__ function

9. Magic Functions, Tuple Unpacking

10. Static Variables and Methods in Python

11. Lambda Functions, Magic methods

12. Inheritance and Polymorphism, Errors and Exception Handling

13. User-defined functions, Python garbage collection, debugger in Python

14. Iterators, Generators, and Decorators, Memoization using Decorators

15. Ordered and Defaultdict, Coroutine

16. Regular expression, Magic methods, Closures

17. ChainMap

18. Python Itertools

19. Advanced python constructs

20. Comprehensions, Named Tuple, Type hinting in Python

21. How to write efficient Code in Python

22. Efficient Code and Optimization techniques for Python

Open up a Colab or Jupyter notebook and start coding.

Let’s dive in!

Magic Methods in Python

In Python, magic methods are the special methods whose names start and end with double underscores.

  • Magic methods are not meant to be invoked directly by you; Python invokes them internally when a certain action is performed on an object
  • Examples of magic methods are: __new__, __repr__, __init__, __add__, __len__, __del__ etc. The __init__ method, used for initialization, is invoked automatically when an instance is created
  • Use the dir() function to list the magic methods a class defines or inherits
  • The advantage of using Python’s magic methods is that they provide a simple way to make objects behave like built-in types
  • Magic methods can be used to emulate the behavior of built-in types in user-defined objects. Therefore, whenever you want to control how a user-defined object responds to an operator or a built-in function such as len() or print(), use magic methods.

Example :

v = 4
v.__add__(2)   # equivalent to v + 2, returns 6

Implementation —

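# __del__ method
from os.path import join, expanduser
class FileObject:
    def __init__(self, file_path='~', file_name='test.txt'):
        # expanduser turns '~' into the user's home directory
        self.file = open(join(expanduser(file_path), file_name), 'rt')
    def __del__(self):
        # called when the object is about to be destroyed
        self.file.close()
        del self.file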

Implementation —

# __repr__ method
class String:
    def __init__(self, string):
        self.string = string
    def __repr__(self):
        return 'Object: {}'.format(self.string)
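To make the idea of "behaving like a built-in type" concrete, here is a minimal sketch. The Batch class and its records attribute are hypothetical names invented for this illustration; only the magic method names themselves come from Python.

```python
class Batch:
    """A hypothetical container of records, used only for illustration."""

    def __init__(self, records):
        self.records = list(records)

    def __len__(self):
        # Called by len(batch)
        return len(self.records)

    def __add__(self, other):
        # Called by batch_a + batch_b
        return Batch(self.records + other.records)

    def __eq__(self, other):
        # Called by batch_a == batch_b
        return isinstance(other, Batch) and self.records == other.records

    def __repr__(self):
        # Called by repr(batch); print() falls back to it when __str__ is absent
        return f"Batch({self.records!r})"


first = Batch([1, 2])
second = Batch([3])
combined = first + second

print(len(combined))                 # 3
print(combined)                      # Batch([1, 2, 3])
print(combined == Batch([1, 2, 3]))  # True
```

Notice that the calling code never names a magic method; it only uses +, len(), == and print(), and Python dispatches to the matching method.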

Inheritance and Polymorphism in Python

  • In Python, inheritance and polymorphism are very powerful and important concepts
  • Using inheritance, a child class can use (inherit) all the data fields and methods available in the parent class
  • On top of that, you can add your own methods and data fields
  • Python allows multiple inheritance, i.e. a class can inherit from more than one parent class
  • Inheritance provides a way to write better organized code and to re-use code
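The points above, including multiple inheritance, can be sketched in a few lines. All class names here (Record, JsonMixin, Table) are made up for the example and are not from any library.

```python
import json


class JsonMixin:
    """Hypothetical mixin that adds JSON export to any class with __dict__."""

    def to_json(self):
        return json.dumps(self.__dict__, sort_keys=True)


class Record:
    def __init__(self, name, rows):
        self.name = name
        self.rows = rows

    def describe(self):
        return f"{self.name}: {self.rows} rows"


class Table(Record, JsonMixin):
    # Inherits describe() from Record and to_json() from JsonMixin,
    # then adds a method of its own.
    def is_empty(self):
        return self.rows == 0


table = Table("orders", 120)
print(table.describe())   # orders: 120 rows
print(table.to_json())    # {"name": "orders", "rows": 120}
print(table.is_empty())   # False

# The method resolution order shows where Python looks for attributes
print([cls.__name__ for cls in Table.__mro__])
# ['Table', 'Record', 'JsonMixin', 'object']
```

When two parents define the same method, the class listed first in the class statement wins, which is exactly what the method resolution order printed above encodes.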

One of the best articles I have read on class inheritance is by Erdem Isbilen.

Syntax —

class ParentClass:
    pass   # body of parent class

class DerivedClass(ParentClass):
    pass   # body of derived class

  • In Python, polymorphism allows us to define methods in the child class with the same name as methods defined in the parent class, so the same call behaves differently depending on the object’s class

Example —

class X:
    def sample(self):
        print("sample() method from class X")

class Y(X):
    def sample(self):
        print("sample() method from class Y")

Implementation —

# Inheritance
class Vehicle:
    def __init__(self, name, color):
        self.__name = name
        self.__color = color
    def getColor(self):
        return self.__color
    def setColor(self, color):
        self.__color = color
    def get_Name(self):
        return self.__name
class Bike(Vehicle):
    def __init__(self, name, color, model):
        super().__init__(name, color)       # call parent class
        self.__model = model
    def get_details(self):
        return (self.get_Name() + " " + self.__model + " in " +
                self.getColor() + " color")
b_obj = Bike("Cziar", "red", "TK720")
print(b_obj.get_details())
print(b_obj.get_Name())

Output —

Cziar TK720 in red color
Cziar

Implementation —

# Polymorphism
from math import pi
class Shape:
    def __init__(self, name):
        self.name = name
    def area(self):
        pass
class Square(Shape):
    def __init__(self, length):
        super().__init__("Square")
        self.length = length
    def area(self):
        return self.length**2
class Circle(Shape):
    def __init__(self, radius):
        super().__init__("Circle")
        self.radius = radius
    def area(self):
        return pi*self.radius**2
a = Square(6)
b = Circle(10)
print(a.area())
print(b.area())

Output —

36
314.1592653589793
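The practical payoff of polymorphism is that calling code can treat different classes uniformly. The sketch below assumes a hypothetical Parser base class with two made-up subclasses; it is an illustration of the idea, not a recommended parsing approach.

```python
import json


class Parser:
    """Hypothetical base class; subclasses override parse()."""

    def parse(self, text):
        raise NotImplementedError


class CsvLineParser(Parser):
    def parse(self, text):
        # Split one comma-separated line into fields
        return text.strip().split(",")


class JsonArrayParser(Parser):
    def parse(self, text):
        # Decode a JSON array into a Python list
        return json.loads(text)


def parse_all(jobs):
    # The caller relies only on parse(); each subclass supplies its own version
    return [parser.parse(text) for parser, text in jobs]


rows = parse_all([
    (CsvLineParser(), "a,b,c\n"),
    (JsonArrayParser(), '["x", "y"]'),
])
print(rows)   # [['a', 'b', 'c'], ['x', 'y']]
```

Adding a new format then means adding one subclass; parse_all() does not need to change.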

Errors and Exception Handling in Python

In Python, an error can be a syntax error or an exception.

A syntax error occurs when the parser detects an incorrect statement, before any code runs.

  • An exception is raised when syntactically correct Python code fails while it is running, which interrupts the normal flow of the program
  • Python comes with many built-in exceptions, and you can also create user-defined exceptions

Some of Python’s built-in exceptions —

IndexError : When a sequence index is out of range

ImportError : When an imported module is not found

KeyError : When a key is not found in a dictionary

NameError : When a variable is not defined

MemoryError : When a program runs out of memory

TypeError : When a function or operation is applied to an object of an inappropriate type

AssertionError : When an assert statement fails

AttributeError : When an attribute reference or assignment fails

Try and Except in Python

In Python, exceptions can be handled using a try statement.

  • The block of code that can raise an exception is placed inside the try clause. The code that handles the exception is written in the except clause
  • If no exception occurs, the except block is skipped and the program continues normally
  • A try clause can have any number of except clauses to handle different exceptions, but at most one of them is executed when an exception occurs
  • We can also raise exceptions ourselves using the raise keyword
  • The try statement can have an optional finally clause, which executes regardless of the outcome of the try and except blocks

Example :

try:
    print(a)   # a is not defined, so this raises NameError
except NameError:
    print("Something went wrong")
finally:
    print("Exit")

Implementation —

# try, except, finally
try:
    print(1 / 0)
except ZeroDivisionError:
    print("Error occurred")
finally:
    print("Exit")

Output —

Error occurred
Exit
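A slightly fuller sketch shows several except clauses on one try, together with finally. The safe_divide function and its log list are hypothetical, and the optional else clause used here is an addition that the list above does not mention; it runs only when no exception was raised.

```python
def safe_divide(numerator, denominator):
    """Return (result, log); result is None when the division is invalid."""
    log = []
    try:
        result = numerator / denominator
    except ZeroDivisionError:
        log.append("division by zero")
        result = None
    except TypeError:
        log.append("non-numeric input")
        result = None
    else:
        # Runs only if the try block raised nothing
        log.append("ok")
    finally:
        # Runs in every case
        log.append("done")
    return result, log


print(safe_divide(10, 2))    # (5.0, ['ok', 'done'])
print(safe_divide(1, 0))     # (None, ['division by zero', 'done'])
print(safe_divide("1", 2))   # (None, ['non-numeric input', 'done'])
```

Naming the exception types, rather than using a bare except, keeps unrelated errors such as a typo in a variable name from being silently swallowed.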

User-defined Exceptions

In Python, you can create your own error type by defining a new exception class.

  • Exceptions need to be derived from the Exception class, either directly or indirectly
  • User-defined error conditions can be signaled by raising an exception explicitly, by using an assert statement, or by defining custom exception classes
  • Use an assert statement to enforce constraints in the program. When the condition given in the assert statement is not met, the program raises an AssertionError
  • You can raise an existing exception by using the raise keyword followed by the name of the exception
  • To create a custom exception class with its own error message, derive it from Exception (or one of its subclasses) and pass the message to the base class
  • When creating a module that can raise several distinct errors, a common practice is to create a base class for the exceptions defined by that module, and subclass it to create specific exception classes for different error conditions. These are called hierarchical custom exceptions

Example —

class ClassName(Exception):
    pass

Implementation —

class Error(Exception):
    pass
class TooSmallValueError(Error):
    pass
number = 100
while True:
    try:
        num = int(input("Enter a number: "))
        if num < number:
            raise TooSmallValueError
        break
    except TooSmallValueError:
        print("Value too small")

Output —

Enter a number: 40
Value too small
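The hierarchical pattern and custom error messages can be sketched without user input. PipelineError, the two subclasses and check_range are hypothetical names chosen for this example.

```python
class PipelineError(Exception):
    """Base class for all errors raised by this hypothetical module."""


class ValueTooSmallError(PipelineError):
    def __init__(self, value, minimum):
        self.value = value
        self.minimum = minimum
        # Pass the message to Exception so str(error) returns it
        super().__init__(f"{value} is below the minimum of {minimum}")


class ValueTooLargeError(PipelineError):
    def __init__(self, value, maximum):
        self.value = value
        self.maximum = maximum
        super().__init__(f"{value} is above the maximum of {maximum}")


def check_range(value, minimum=0, maximum=100):
    if value < minimum:
        raise ValueTooSmallError(value, minimum)
    if value > maximum:
        raise ValueTooLargeError(value, maximum)
    return value


messages = []
for candidate in (50, -5, 500):
    try:
        check_range(candidate)
    except PipelineError as error:
        # One handler covers every subclass of the base class
        messages.append(f"{type(error).__name__}: {error}")

print(messages)
# ['ValueTooSmallError: -5 is below the minimum of 0',
#  'ValueTooLargeError: 500 is above the maximum of 100']
```

Callers that care about one specific condition can still catch ValueTooSmallError alone, while callers that do not can catch the base class.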

Garbage Collection in Python

In Python, garbage collection is the memory management feature that reclaims memory a running program has allocated but no longer needs, so that it can be reused.

  • Garbage collection works automatically, which spares you from managing memory by hand and helps prevent wasted memory
  • You can force a collection by calling the collect() function of the gc module
  • In CPython, an object is destroyed as soon as no reference to it is left, and its __del__() method is executed at that point. Objects that only refer to each other (reference cycles) are reclaimed later by the cyclic garbage collector

Example :

import gc

gc.collect()

Implementation —

# manual garbage collection
import sys, gc
def test():
    items = [18, 19, 20, 34, 78]
    items.append(items)   # the list refers to itself, creating a reference cycle
def main():
    print("Garbage Creation")
    for i in range(5):
        test()
    print("Collecting..")
    n = gc.collect()
    print("Unreachable objects collected by GC:", n)
    print("Uncollectable garbage list:", gc.garbage)
if __name__ == "__main__":
    main()
    sys.exit()

Output (the exact count depends on the environment) —

Garbage Creation
Collecting..
Unreachable objects collected by GC: 33
Uncollectable garbage list: []
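The difference between an object that is freed the moment its last reference disappears and a reference cycle that needs the collector can be observed with weak references. This is a sketch that assumes CPython, where reference counting frees objects immediately; the Node class is hypothetical.

```python
import gc
import weakref


class Node:
    """Minimal class used only to observe when objects are reclaimed."""

    def __init__(self, name):
        self.name = name
        self.partner = None


# Case 1: no cycle. In CPython, dropping the last reference frees the
# object at once, without any help from the cyclic collector.
single = Node("single")
single_ref = weakref.ref(single)
del single
freed_without_gc = single_ref() is None

# Case 2: a reference cycle. Each object keeps the other alive, so only
# the cyclic collector can reclaim them.
gc.disable()                      # keep automatic collection out of the demo
left, right = Node("left"), Node("right")
left.partner, right.partner = right, left
left_ref = weakref.ref(left)
del left, right
alive_before_collect = left_ref() is not None
gc.collect()
freed_after_collect = left_ref() is None
gc.enable()

print(freed_without_gc, alive_before_collect, freed_after_collect)
# True True True (on CPython)
```

A weak reference does not keep its target alive, which is why calling it returns None once the object has been reclaimed.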

Python Debugger

Debugging is the process of locating and fixing errors in a program. In Python, pdb, which is part of the standard library, is used to debug code.

  • The pdb module internally makes use of the bdb and cmd modules
  • It supports setting breakpoints and single stepping at the source line level, inspection of stack frames, source code listing, and more

Syntax —

import pdb

pdb.set_trace()

  • Since Python 3.7 there is also a built-in function, breakpoint(), which by default does the same thing as pdb.set_trace()

Implementation —

import pdb

def multiply(a, b):
    answer = a * b
    return answer

pdb.set_trace()   # execution pauses here and the (Pdb) prompt opens
a = int(input("Enter first number : "))
b = int(input("Enter second number : "))
product = multiply(a, b)
print(product)

Decorators in Python

In Python, a decorator is any callable Python object that is used to modify a function or a class. It takes a function, adds some functionality, and returns it.

  • Decorators are a powerful tool because they let programmers modify or control the behavior of a function or class without changing its source.
  • A decorator receives a function as an argument; the original function is then called inside a wrapper function.
  • A decorator is applied with the @ syntax on the line just before the definition of the function you want to decorate.

There are two different kinds of decorators in Python:

Function decorators

Class decorators

  • When multiple decorators are stacked on a single function, they are applied from the bottom up: the decorator closest to the function is applied first
  • A decorator can be re-used by applying it to as many functions as needed

Implementation —

#Decorators
def test_decorator(func):
    def function_wrapper(x):
        print("Before calling " + func.__name__)
        res = func(x)
        print(res)
        print("After calling " + func.__name__)
    return function_wrapper
@test_decorator
def sqr(n):
    return n**2
sqr(20)

Output —

Before calling sqr
400
After calling sqr

Implementation —

# Multiple Decorators
def lowercase_decorator(function):
    def wrapper():
        result = function()
        return result.lower()
    return wrapper
def split_string(function):
    def wrapper():
        result = function()
        return result.split()
    return wrapper
@split_string          # applied second, to the lowercased result
@lowercase_decorator   # applied first, closest to the function
def test_func():
    return 'MOTHER OF DRAGONS'
print(test_func())

Output —

['mother', 'of', 'dragons']
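The two kinds of decorators mentioned earlier can be sketched side by side. Here "class decorator" is taken to mean a decorator applied to a class; log_calls, add_label, square and Customer are hypothetical names, while functools.wraps is the standard-library helper that preserves the wrapped function's name and docstring.

```python
import functools


def log_calls(func):
    """Function decorator: records each call's arguments on the wrapper."""
    @functools.wraps(func)          # keep func.__name__ and func.__doc__
    def wrapper(*args, **kwargs):
        wrapper.history.append((args, kwargs))
        return func(*args, **kwargs)
    wrapper.history = []
    return wrapper


def add_label(cls):
    """Class decorator: receives a class, adds an attribute, returns it."""
    cls.label = cls.__name__.lower()
    return cls


@log_calls
def square(n):
    return n ** 2


@add_label
class Customer:
    pass


print(square(4))          # 16
print(square.__name__)    # square
print(square.history)     # [((4,), {})]
print(Customer.label)     # customer
```

Accepting *args and **kwargs in the wrapper lets the same decorator be re-used on functions with any signature, and returning the result keeps the decorated function usable in expressions.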

Memoization using Decorators

In Python, memoization is a technique which allows you to optimize a Python function by caching its output based on the parameters you supply to it.

  • Once you memoize a function, it will only compute its output once for each set of parameters you call it with. Every call after the first will be quickly retrieved from a cache.
  • If you want to speed up the parts in your program that are expensive, memoization can be a great technique to use.

One of the best articles I have read about decorators is by Hensle Joseph.

There are four approaches to memoization —

Using global

Using objects

Using default parameter

Using a Callable Class

Implementation —

#fibonacci series using Memoization using decorators
def memoization_func(t):
    dict_one = {}
    def h(z):
        if z not in dict_one:            
            dict_one[z] = t(z)
        return dict_one[z]
    return h
    
@memoization_func
def fib(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n-1) + fib(n-2)
print(fib(20))

Output —

6765
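The standard library already ships a memoizing decorator, functools.lru_cache, so a hand-written cache is not always needed. The sketch below repeats the Fibonacci example with it; this is an alternative to the approaches listed above, not the author's implementation.

```python
from functools import lru_cache


@lru_cache(maxsize=None)   # unbounded cache keyed by the call arguments
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


print(fib(20))                 # 6765

# cache_info() reports how often the cache was used
info = fib.cache_info()
print(info.misses, info.hits)  # 21 18
```

Each value from fib(0) to fib(20) is computed exactly once (21 misses); every other recursive call is answered from the cache. Arguments must be hashable for lru_cache to work.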

Defaultdict

In Python, a dictionary is a container that holds key-value pairs. Keys must be unique and hashable, which in practice usually means immutable objects.

  • If you try to read a key that does not exist in a regular dictionary, Python raises a KeyError and stops your code. The defaultdict type, a dictionary-like class from the collections module, addresses this
  • If you access a missing key with square brackets, defaultdict automatically creates the key and generates a default value for it
  • A defaultdict created with a default factory does not raise a KeyError on such lookups
  • Any key that does not exist gets the value returned by the default factory
  • Hence, whenever you need a dictionary in which each key’s value should start from a default, use a defaultdict

Syntax —

from collections import defaultdict

demo = defaultdict(int)

Implementation —

from collections import defaultdict 
     
default_dict_var = defaultdict(list) 
  
for i in range(10): 
    default_dict_var[i].append(i) 
  
print(default_dict_var)

Output —

defaultdict(<class 'list'>, {0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5], 6: [6], 7: [7], 8: [8], 9: [9]})
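Two common uses, counting and grouping, can be sketched with a small made-up word list. The last lines show a caveat worth knowing: only square-bracket access triggers the default factory.

```python
from collections import defaultdict

words = ["spark", "sql", "hive", "hadoop", "spark", "sql", "spark"]

# Counting: int() returns 0, so every new key starts at zero
counts = defaultdict(int)
for word in words:
    counts[word] += 1

# Grouping: list() returns [], so every new key starts as an empty list
by_letter = defaultdict(list)
for word in sorted(set(words)):
    by_letter[word[0]].append(word)

print(dict(counts))      # {'spark': 3, 'sql': 2, 'hive': 1, 'hadoop': 1}
print(dict(by_letter))   # {'h': ['hadoop', 'hive'], 's': ['spark', 'sql']}

# Caveat: .get() and the "in" operator do not call the default factory
print(counts.get("pig"))   # None
print("pig" in counts)     # False
```

Without defaultdict, each loop would need an explicit "if key not in dictionary" check before updating the value.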

OrderedDict

In Python, OrderedDict is one of the container datatypes in the collections module and a subclass of dict. It maintains the order in which keys are inserted, and that order is used when iterating. Deleting a key and inserting it again moves it to the end. Since Python 3.7 the built-in dict also preserves insertion order, but OrderedDict still offers extra order-related methods.

  • It’s a dictionary subclass that remembers the order in which its contents are added
  • When the value of an existing key is changed or overwritten, the key keeps its original position
  • popitem() removes items in LIFO order by default; pass last=False to remove them in FIFO order
  • The reversed() function can be used with OrderedDict to iterate over elements in reverse order
  • OrderedDict has a move_to_end() method to efficiently reposition an element to either end

Example —

from collections import OrderedDict
my_dict = {'Sunday': 0, 'Monday': 1, 'Tuesday': 2}
# creating ordered dict
ordered_dict = OrderedDict(my_dict)
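The order-related methods are easiest to follow by watching the key order change step by step. This sketch uses only documented OrderedDict methods; the variable names are illustrative.

```python
from collections import OrderedDict

days = OrderedDict([("Sunday", 0), ("Monday", 1), ("Tuesday", 2)])

days["Monday"] = 10                      # overwriting keeps the position
after_update = list(days)                # ['Sunday', 'Monday', 'Tuesday']

days.move_to_end("Sunday")               # move to the right end
after_move = list(days)                  # ['Monday', 'Tuesday', 'Sunday']

days.move_to_end("Sunday", last=False)   # move back to the left end
newest = days.popitem()                  # LIFO by default: ('Tuesday', 2)
oldest = days.popitem(last=False)        # FIFO with last=False: ('Sunday', 0)

remaining = list(reversed(days))         # ['Monday']

print(after_update, after_move, newest, oldest, remaining)
```

The combination of move_to_end() and popitem(last=False) is what makes OrderedDict a convenient base for small least-recently-used caches.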

Generators in Python

In Python, generator functions look like regular functions, with one difference: they use the yield keyword instead of return. A generator function is a function that returns an iterator. A generator expression is an expression that also returns an iterator.

  • Generator objects are consumed either by calling next() on the generator object or by using it in a “for in” loop.
  • A return statement terminates a function entirely, but a yield statement pauses the function, saving all its state, and later continues from there on successive calls.
  • Generator expressions can be passed directly as function arguments. Just like list comprehensions, generator expressions let you create a generator object in a single line of code.
  • The major difference between a list comprehension and a generator expression is that a list comprehension produces the entire list at once, while a generator expression produces one item at a time (lazy evaluation). For this reason, a generator expression is much more memory efficient than a list comprehension

Example —

def generator():
    yield "x"
    yield "y"

for i in generator():
    print(i)

Implementation —

def test_sequence():
    num = 0
    while num < 10:
        yield num
        num += 1
for i in test_sequence():
    print(i, end=",")

Output —

0,1,2,3,4,5,6,7,8,9,

Implementation —

# Python generator with Loop
#Reverse a string
def reverse_str(test_str):
    length = len(test_str)
    for i in range(length - 1, -1, -1):
        yield test_str[i]
for char in reverse_str("Trojan"):
    print(char,end =" ")

Output —

n a j o r T

Implementation —

# Generator Expression
# Initialize the list
test_list = [1, 3, 6, 10]
# list comprehension
list_comprehension = [x**3 for x in test_list]
# generator expression
test_generator = (x**3 for x in test_list)
print(list_comprehension)
print(type(test_generator))
print(tuple(test_generator))

Output —

[1, 27, 216, 1000]
<class 'generator'>
(1, 27, 216, 1000)
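The memory claim and the use of next() can be checked directly. In this sketch, sys.getsizeof measures only the container itself (not the integers it refers to), so the numbers are a rough indication rather than a full memory profile.

```python
import sys

n = 100_000
squares_list = [x * x for x in range(n)]   # builds all values up front
squares_gen = (x * x for x in range(n))    # builds values on demand

list_bytes = sys.getsizeof(squares_list)
gen_bytes = sys.getsizeof(squares_gen)
print(list_bytes > gen_bytes)   # True

# Generator expression passed directly as a function argument
total = sum(x * x for x in range(10))
print(total)                    # 285

# Manual iteration with next(); a default avoids StopIteration at the end
letters = (c for c in "ab")
first_letter = next(letters)
second_letter = next(letters)
exhausted = next(letters, "done")
print(first_letter, second_letter, exhausted)   # a b done
```

A generator can be consumed only once; after it is exhausted, create a new one if the values are needed again.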

Coroutine in Python

  • Coroutines are computer program components that generalize subroutines for non-preemptive multitasking, by allowing execution to be suspended and resumed
  • Because coroutines can pause and resume their execution context, they’re well suited to concurrent processing
  • A coroutine is a special type of function that yields control back to the caller without ending its context, which is kept in an idle state until it is resumed
  • In a coroutine, yield can also appear on the right-hand side of an = operator, which means the coroutine accepts a value at that point; the caller supplies it with send()

Example —

def func():
    print("My first Coroutine")
    while True:
        var = (yield)
        print(var)

coroutine = func()
next(coroutine)

Implementation —

def func():
    print("My first Coroutine")
    while True:
        var = (yield)
        print(var)
coroutine = func()
next(coroutine)

Output —

My first Coroutine
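The example above primes the coroutine but never sends it anything. A small sketch of the send() side follows; running_average is a hypothetical generator-based coroutine that receives numbers and hands back the mean of everything seen so far.

```python
def running_average():
    """Generator-based coroutine: receives numbers, yields the mean so far."""
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average   # pause here until send() delivers a value
        total += value
        count += 1
        average = total / count


averager = running_average()
next(averager)                  # prime: run up to the first yield

results = [averager.send(number) for number in (10, 20, 30)]
print(results)                  # [10.0, 15.0, 20.0]

averager.close()                # stop the coroutine cleanly
```

Each send() resumes the function at the yield, delivers the value, runs until the next yield, and returns what that yield produces. The local variables total and count survive between calls, which is the "maintained context" described above.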

That’s it for now!

Day 4 : Coming Soon :)

Projects Videos —

All the projects, data structures, SQL, algorithms, system design, Data Science and ML, Data Analytics, Data Engineering, Implemented Data Science and ML projects, Implemented Data Engineering Projects, Implemented Deep Learning Projects, Implemented Machine Learning Ops Projects, Implemented Time Series Analysis and Forecasting Projects, Implemented Applied Machine Learning Projects, Implemented Tensorflow and Keras Projects, Implemented PyTorch Projects, Implemented Scikit Learn Projects, Implemented Big Data Projects, Implemented Cloud Machine Learning Projects, Implemented Neural Networks Projects, Implemented OpenCV Projects, Complete ML Research Papers Summarized, Implemented Data Analytics projects, Implemented Data Visualization Projects, Implemented Data Mining Projects, Implemented Natural Language Processing Projects, MLOps and Deep Learning, Applied Machine Learning with Projects Series, PyTorch with Projects Series, Tensorflow and Keras with Projects Series, Scikit Learn Series with Projects, Time Series Analysis and Forecasting with Projects Series, and ML System Design Case Studies Series videos will be published on our YouTube channel (just launched).

Subscribe today!

Follow for more updates. Stay tuned !

Keep learning and coding :)

Complete System Design Series Parts —

1. System design basics

2. Horizontal and vertical scaling

3. Load balancing and Message queues

4. High level design and low level design, Consistent Hashing, Monolithic and Microservices architecture

5. Caching, Indexing, Proxies

6. Networking, How Browsers work, Content Delivery Network (CDN)

7. Database Sharding, CAP Theorem, Database schema Design

8. Concurrency, API, Components + OOP + Abstraction

9. Estimation and Planning, Performance

10. Map Reduce, Patterns and Microservices

11. SQL vs NoSQL and Cloud

12. Most Popular System Design Questions

Github —

Highly Recommended Data Science and Machine Learning Courses that you MUST take ( with certificate) —

Complete Data Scientist

Complete Data Analyst

Complete Data Engineering

Complete Machine Learning Engineer

Complete Deep Learning

Complete Natural Language Processing

Complete Self Driving Car Engineer

For Python Projects —

For other projects, tune to —

Build Machine Learning Pipelines( With Code)

Recurrent Neural Network with Keras

Clustering Geolocation Data in Python using DBSCAN and K-Means

Facial Expression Recognition using Keras

Hyperparameter Tuning with Keras Tuner

Custom Layers in Keras
