Free AI web copilot to create summaries, insights and extended knowledge, download it at here

21707

Abstract

data mart</span> select_data_query = <span class="hljs-string">'SELECT * FROM Sales'</span> cursor.execute(select_data_query) data = cursor.fetchall() <span class="hljs-keyword">for</span> row <span class="hljs-keyword">in</span> data: <span class="hljs-built_in">print</span>(row)

<span class="hljs-comment"># Perform aggregations and calculations</span> aggregate_query = <span class="hljs-string">''' SELECT product, SUM(amount) AS total_sales FROM Sales GROUP BY product '''</span> cursor.execute(aggregate_query) result = cursor.fetchall() <span class="hljs-keyword">for</span> row <span class="hljs-keyword">in</span> result: <span class="hljs-built_in">print</span>(row)

<span class="hljs-comment"># Apply filters and conditions</span> filtered_query = <span class="hljs-string">"SELECT * FROM Sales WHERE date >= '2023-01-02'"</span> cursor.execute(filtered_query) filtered_data = cursor.fetchall() <span class="hljs-keyword">for</span> row <span class="hljs-keyword">in</span> filtered_data: <span class="hljs-built_in">print</span>(row)

<span class="hljs-comment"># Close the connection to the data mart</span> conn.close()</pre></div><ul><li><b><i>A Data Lake</i></b><i> is a large, centralized repository of raw, unstructured data that is stored in its native format. It is designed to store all types of data, structured and unstructured, and is optimized for big data processing and analytics. It allows data scientists to store, process, and analyze data in a single place, and it is often used in conjunction with a data warehouse or data mart.</i></li></ul><div id="c163"><pre><span class="hljs-comment"># Import necessary libraries</span> <span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession

<span class="hljs-comment"># Create a Spark session</span> spark = SparkSession.builder
.appName(<span class="hljs-string">"DataLakeExample"</span>)
.config(<span class="hljs-string">"spark.some.config.option"</span>, <span class="hljs-string">"some-value"</span>)
.getOrCreate()

<span class="hljs-comment"># Read data from a file in the data lake</span> data = spark.read.<span class="hljs-built_in">format</span>(<span class="hljs-string">"csv"</span>).option(<span class="hljs-string">"header"</span>, <span class="hljs-string">"true"</span>).load(<span class="hljs-string">"s3://datalake/input/file.csv"</span>)

<span class="hljs-comment"># Perform transformations and data manipulations</span> transformed_data = data.select(<span class="hljs-string">"col1"</span>, <span class="hljs-string">"col2"</span>).<span class="hljs-built_in">filter</span>(<span class="hljs-string">"col3 > 0"</span>).groupBy(<span class="hljs-string">"col1"</span>).<span class="hljs-built_in">sum</span>(<span class="hljs-string">"col2"</span>)

<span class="hljs-comment"># Write the transformed data back to the data lake</span> transformed_data.write.<span class="hljs-built_in">format</span>(<span class="hljs-string">"parquet"</span>).mode(<span class="hljs-string">"overwrite"</span>).save(<span class="hljs-string">"s3://datalake/output/transformed_data.parquet"</span>)

<span class="hljs-comment"># Query data from the data lake</span> queried_data = spark.sql(<span class="hljs-string">"SELECT * FROM parquet.s3://datalake/output/transformed_data.parquet"</span>)

<span class="hljs-comment"># Perform data analysis and exploration</span> analysis_result = queried_data.describe()

<span class="hljs-comment"># Export analysis result to a file in the data lake</span> analysis_result.write.<span class="hljs-built_in">format</span>(<span class="hljs-string">"csv"</span>).mode(<span class="hljs-string">"overwrite"</span>).save(<span class="hljs-string">"s3://datalake/output/analysis_result.csv"</span>)

<span class="hljs-comment"># Create external tables for querying data</span> spark.sql(<span class="hljs-string">"CREATE EXTERNAL TABLE sales USING parquet LOCATION 's3://datalake/sales_data/'"</span>)

<span class="hljs-comment"># Query data using SQL on the external table</span> result = spark.sql(<span class="hljs-string">"SELECT * FROM sales WHERE date >= '2022-01-01'"</span>)

<span class="hljs-comment"># Export the result to a file in the data lake</span> result.write.<span class="hljs-built_in">format</span>(<span class="hljs-string">"csv"</span>).mode(<span class="hljs-string">"overwrite"</span>).save(<span class="hljs-string">"s3://datalake/output/query_result.csv"</span>)

<span class="hljs-comment"># Stop the Spark session</span> spark.stop()</pre></div><p id="1fef"><b>Complete Implementation ( for all)—</b></p><div id="4dd1"><pre><span class="hljs-comment"># Structured Data</span>

<span class="hljs-comment"># Read structured data from CSV file</span> <span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd data = pd.read_csv(<span class="hljs-string">'structured_data.csv'</span>)

<span class="hljs-comment"># Perform data analysis and manipulation</span> data.head() data.describe() data.groupby(<span class="hljs-string">'category'</span>).mean()

<span class="hljs-comment"># Semi-Structured Data</span>

<span class="hljs-comment"># Read semi-structured data from JSON file</span> <span class="hljs-keyword">import</span> json <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(<span class="hljs-string">'semi_structured_data.json'</span>) <span class="hljs-keyword">as</span> f: data = json.load(f)

<span class="hljs-comment"># Access data elements</span> data[<span class="hljs-string">'key'</span>] data[<span class="hljs-string">'nested'</span>][<span class="hljs-string">'value'</span>]

<span class="hljs-comment"># Unstructured Data</span>

<span class="hljs-comment"># Read unstructured data from text file</span> <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(<span class="hljs-string">'unstructured_data.txt'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f: data = f.read()

<span class="hljs-comment"># Perform text processing</span> words = data.split() unique_words = <span class="hljs-built_in">set</span>(words) word_counts = {word: words.count(word) <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> unique_words}

<span class="hljs-comment"># Data Warehouse</span>

<span class="hljs-comment"># Connect to a data warehouse</span> <span class="hljs-keyword">import</span> pyodbc conn = pyodbc.connect(<span class="hljs-string">'Driver={SQL Server};Server=localhost;Database=DataWarehouse;Trusted_Connection=yes;'</span>)

<span class="hljs-comment"># Execute SQL queries</span> cursor = conn.cursor() cursor.execute(<span class="hljs-string">'SELECT * FROM fact_table'</span>) data = cursor.fetchall()

<span class="hljs-comment"># Connect to a data mart</span> <span class="hljs-keyword">import</span> psycopg2 conn = psycopg2.connect(host=<span class="hljs-string">"localhost"</span>, port=<span class="hljs-string">"5432"</span>, database=<span class="hljs-string">"DataMart"</span>, user=<span class="hljs-string">"username"</span>, password=<span class="hljs-string">"password"</span>)

<span class="hljs-comment"># Execute SQL queries</span> cursor = conn.cursor() cursor.execute(<span class="hljs-string">'SELECT * FROM dimension_table'</span>) data = cursor.fetchall()

<span class="hljs-comment"># Access data in a data lake using Hadoop File System (HDFS)</span> <span class="hljs-keyword">from</span> pyarrow <span class="hljs-keyword">import</span> hdfs hdfs_client = hdfs.connect(host=<span class="hljs-string">'localhost'</span>, port=<span class="hljs-number">8020</span>) data = hdfs_client.cat(<span class="hljs-string">'/data_lake/file.parquet'</span>)

<span class="hljs-comment"># Perform data processing</span> <span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd df = pd.read_parquet(data) df.head() df.groupby(<span class="hljs-string">'category'</span>).<span class="hljs-built_in">sum</span>()</pre></div><p id="e151">Snippet —</p><figure id="2e6e"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*pWHfPRq5bj-66KNMaACG2Q.png"><figcaption></figcaption></figure><h1 id="ba18">Structured Data</h1><p id="71ad">Structured Data is the data which is highly organized, factual and quantitative in nature.</p><p id="fc64">It has a clear data model and can be displayed in rows, columns and relational database.</p><p id="c946">It consists of numbers, dates, strings and floats and requires less storage.</p><p id="5150">Advantage of structured data is that it’s easier to manage/maintain and all the legacy data can be stored in a well formatted way.</p><p id="e910">It resides in the relational databases and data warehouses.</p><p id="cae7"><b><i>Examples of Structured data —</i></b></p><p id="a92b"><i>Numerical data in excel files/google sheets</i></p><p id="a662"><i>Ratings on e-commerce website</i></p><p id="65a7"><i>Relational Databases data</i></p><figure id="d03e"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*6sMBjnuOcF3i1w8R.png"><figcaption>Pic credits : Coding ninja</figcaption></figure><h2 id="66f6">How to deal with Structured Data —</h2><div id="22d9"><pre>import pandas as pd

<span class="hljs-comment"># Read structured data from a file (e.g., CSV, Excel)</span> data = pd.read_csv(<span class="hljs-string">"data.csv"</span>)

<span class="hljs-comment"># Display the structure and summary of the data</span> print(<span class="hljs-string">"Data structure:"</span>) print(data.head()) print(<span class="hljs-string">"\nData summary:"</span>) print(data.describe())

<span class="hljs-comment"># Select specific columns</span> selected_columns = [<span class="hljs-string">"column1"</span>, <span class="hljs-string">"column2"</span>, <span class="hljs-string">"column3"</span>] selected_data = data[selected_columns]

<span class="hljs-comment"># Filter data based on conditions</span> filtered_data = data[data[<span class="hljs-string">"column1"</span>] > 10]

<span class="hljs-comment"># Sort data by a column</span> sorted_data = data.sort_values(<span class="hljs-string">"column1"</span>)

<span class="hljs-comment"># Group data and calculate aggregates</span> grouped_data = data.groupby(<span class="hljs-string">"column2"</span>).agg({<span class="hljs-string">"column1"</span>: <span class="hljs-string">"sum"</span>, <span class="hljs-string">"column3"</span>: <span class="hljs-string">"mean"</span>})

<span class="hljs-comment"># Perform data transformations</span> transformed_data = data.copy() transformed_data[<span class="hljs-string">"new_column"</span>] = transformed_data[<span class="hljs-string">"column1"</span>] + transformed_data[<span class="hljs-string">"column2"</span>]

<span class="hljs-comment"># Perform data analysis</span> mean_value = data[<span class="hljs-string">"column1"</span>].mean() max_value = data[<span class="hljs-string">"column2"</span>].max()

<span class="hljs-comment"># Export data to a new file (e.g., CSV, Excel)</span> transformed_data.to_csv(<span class="hljs-string">"new_data.csv"</span>, index=False)

<span class="hljs-comment"># Load data from a database</span> import sqlite3 conn = sqlite3.connect(<span class="hljs-string">"database.db"</span>) db_data = pd.read_sql_query(<span class="hljs-string">"SELECT * FROM table"</span>, conn)

<span class="hljs-comment"># Write data to a database table</span> transformed_data.to_sql(<span class="hljs-string">"new_table"</span>, conn, if_exists=<span class="hljs-string">"replace"</span>)

<span class="hljs-comment"># Close the database connection</span> conn.close()</pre></div><p id="1594">Snippet —</p><figure id="3528"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*cPwuKx2gCMoHlujZREO-FA.png"><figcaption></figcaption></figure><h1 id="12a3">Semi Structured Data</h1><p id="ec71">Semi Structured data lacks fixed schema and is loosely organized data which is categorized using meta tags or markers. These are in the form of data files which follow a semi pattern.</p><p id="7237">Semi structured</p><p id="c3b1"><b><i>Examples of Semi Structured data —</i></b></p><p id="362e"><i>Posts with tags</i></p><p id="44e8"><i>Tweets with tags</i></p><p id="4d11"><i>Emails</i></p><p id="8c11"><i>XML, HTML, JSON Files</i></p><figure id="c61d"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*gEP6Ptu2vvlab23tU8ZYWg.png"><figcaption>Pic credits : Researchgate</figcaption></figure><h2 id="ac88">How to deal with Semi-Structured Data —</h2><div id="10fa"><pre><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd <span class="hljs-keyword">import</span> json <span class="hljs-keyword">import</span> xml.etree.ElementTree <span class="hljs-keyword">as</span> ET

<span class="hljs-comment"># Parse JSON data</span> <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(<span class="hljs-string">"data.json"</span>) <span class="hljs-keyword">as</span> json_file: json_data = json.load(json_file)

<span class="hljs-comment"># Extract information from JSON</span> value = json_data[<span class="hljs-string">"key"</span>] nested_value = json_data[<span class="hljs-string">"nested"</span>][<span class="hljs-string">"key"</span>]

<span class="hljs-comment"># Parse XML data</span> tree = ET.parse(<span class="hljs-string">"data.xml"</span>) root = tree.getroot()

<span class="hljs-comment"># Extract information from XML</span> value = root.find(<span class="hljs-string">"element/tag"</span>).text

<span class="hljs-comment"># Parse HTML data (using BeautifulSoup library)</span> <span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup

<span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(<span class="hljs-string">"data.html"</span>) <span class="hljs-keyword">as</span> html_file: html_data = html_file.read()

soup = BeautifulSoup(html_data, <span class="hljs-string">"html.parser"</span>)

<span class="hljs-comment"># Extract information from HTML</span> value = soup.find(<span class="hljs-string">"tag"</span>).text

<span class="hljs-comment"># Export data to a new format (e.g., JSON, XML, CSV)</span> <span class="hljs-keyword">def</span> <span class="hljs-title function_">export_data</span>(<span class="hljs-params">data, <span class="hljs-built_in">format</span></span>): <span class="hljs-keyword">if</span> <span class="hljs-built_in">format</span> == <span class="hljs-string">"json"</span>: <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(<span class="hljs-string">"exported_data.json"</span>, <span class="hljs-string">"w"</span>) <span class="hljs-keyword">as</span> json_file: json.dump(data, json_file) <span class="hljs-keyword">elif</span> <span class="hljs-built_in">format</span> == <span class="hljs-string">"xml"</span>: root = ET.Element(<span class="hljs-string">"root"</span>) <span class="hljs-keyword">for</span> item <span class="hljs-keyword">in</span> data: element = ET.SubElement(root, <span class="hljs-string">"item"</span>) element.text = <span class="hljs-built_in">str</span>(item) tree = ET.ElementTree(root) tree.write(<span class="hljs-string">"exported_data.xml"</span>) <span class="hljs-keyword">elif</span> <span class="hljs-built_in">format</span> == <span class="hljs-string">"csv"</span>: df = pd.DataFrame(data) df.to_csv(<span class="hljs-string">"exported_data.csv"</span>, index=<span class="hljs-literal">False</span>) <span class="hljs-keyword">else</span>: <span class="hljs-built_in">print</span>(<span class="hljs-string">"Unsupported export format"</span>)

<span class="hljs-comment"># Load data from a database or API (depending on the data source)</span> <span class="hljs-keyword">def</span> <span class="hljs-title function_">load_data</span>(<span class="hljs-params">source</span>): <span class="hljs-keyword">if</span> source == <span class="hljs-string">"database"</span>: <span class="hljs-comment"># Load data from a database</span> <span class="hljs-comment"># ...</span> <span class="hljs-keyword">return</span> data_from_database <span class="hljs-keyword">elif</span> source == <span class="hljs-string">"api"</span>: <span class="hljs-comment"># Load data from an API</span> <span class="hljs-comment"># ...</span> <span class="hljs-keyword">return</span> data_from_api <span class="hljs-keyword">else</span>: <span class="hljs-built_in">print</span>(<span class="hljs-string">"Unsupported data source"</span>) <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

<span class="hljs-comment"># Write data to a database or API (depending on the data destination)</span> <span class="hljs-keyword">def</span> <span class="hljs-title function_">write_data</span>(<span class="hljs-params">data, destination</span>): <span class="hljs-keyword">if</span> destination == <span class="hljs-string">"database"</span>: <span class="hljs-comment"># Write data to a database</span> <span class="hljs-comment"># ...</span> <span class="hljs-built_in">print</span>(<span class="hljs-string">"Data written to the database"</span>) <span class="hljs-keyword">elif</span> destination == <span class="hljs-string">"api"</span>: <span class="hljs-comment"># Write data to an API</span> <span class="hljs-comment"># ...</span> <span class="hljs-built_in">print</span>(<span class="hljs-string">"Data written to the API"</span>) <span class="hljs-keyword">else</span>: <span class="hljs-built_in">print</span>(<span class="hljs-string">"Unsupported data destination"</span>)

<span class="hljs-comment"># Example usage</span> data = [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>]

<span class="hljs-comment"># Perform data transformations and analysis</span> perform_data_analysis(data)

<span class="hljs-comment"># Export data to a new format (e.g., JSON, XML, CSV)</span> export_data(data, <span class="hljs-string">"json"</span>) export_data(data, <span class="hljs-string">"xml"</span>) export_data(data, <span class="hljs-string">"csv"</span>)

<span class="hljs-comment"># Load data from a database or API</span> loaded_data = load_data(<span class="hljs-string">"database"</span>) loaded_data = load_data(<span class="hljs-string">"api"</span>)

<span class="hljs-comment"># Write data to a database or API</span> write_data(data, <span class="hljs-string">"database"</span>) write_data(data, <span class="hljs-string">"api"</span>)</pre></div><p id="79da">Snippet —</p><figure id="af78"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*9Xhuvg9WPQt5-By7_lwkhQ.png"><figcaption></figcaption></figure><h1 id="08ea">Unstructured Data</h1><p id="a107">Unstructured data doesn’t have an inherent structure and stores in different types of formats and files. It doesn’t have predefined data models and very difficult to search the data. It’s qualitative in nature and the schema creation of read.</p><p id="b576">It cannot be displayed in rows, columns or relational database formats. It requires more storage and more difficult to manage as well as maintain.</p><p id="4b99">It resides on NOSQL Databases and Data lakes and Data warehouses.</p><p id="8488"><b><i>Examples of Unstructured Data —</i></b></p><p id="762d"><i>Surveys, transcripts</i></p><p id="a16b"><i>pdfs, images, videos etc</i></p><p id="5d5e"><i>Emails</i></p><p id="8c8a"><i>Audio Files</i></p><h2 id="644c">How to handle Unstructured Data —</h2><div id="c198"><pre><span class="hljs-keyword">import</span> nltk <span class="hljs-keyword">from</span> nltk.tokenize <span class="hljs-keyword">import</span> word_tokenize <span class="hljs-keyword">from</span> nltk.corpus <span class="hljs-keyword">import</span> stopwords <span class="hljs-keyword">from</span> nltk.stem <span class="hljs-keyword">import</span> WordNetLemmatizer

<span class="hljs-comment"># Perform text preprocessing</span> <span class="hljs-keyword">def</span> <span class="hljs-title function_">preprocess_text</span>(<span class="hljs-params">text</span>): <span class="hljs-comment"># Tokenize the text into individual words</span> tokens = word_tokenize(text)

<span class="hljs-comment"># Remove stopwords</span>
stop_words = <span class="hljs-built_in">set</span>(stopwords.words(<span class="hljs-string">"english"</span>))
filtered_tokens = [token <span class="hljs-keyword">for</span> token <span class="hljs-keyword">in</span> tokens <span class="hljs-keyword">if</span> token.lower() <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> stop_words]

<span class="hljs-comment"># Lemmatize the words</span>
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) <span class="hljs-keyword">for</span> token <span class="hljs-keyword">in</span> filtered_tokens]

<span class="hljs-comment"># Return the preprocessed text as a list of tokens</span>
<span class="hljs-keyword">return</span> lemmatized_tokens

<span class="hljs-comment"># Perform sentiment analysis</span> <span class="hljs-

Options

keyword">def</span> <span class="hljs-title function_">analyze_sentiment</span>(<span class="hljs-params">text</span>): <span class="hljs-comment"># Perform sentiment analysis using a pre-trained model or rule-based approach</span> <span class="hljs-comment"># ...</span> sentiment_score = <span class="hljs-number">0.5</span> <span class="hljs-comment"># Placeholder value</span>

<span class="hljs-keyword">return</span> sentiment_score

<span class="hljs-comment"># Extract key information using named entity recognition (NER)</span> <span class="hljs-keyword">def</span> <span class="hljs-title function_">extract_entities</span>(<span class="hljs-params">text</span>): <span class="hljs-comment"># Perform named entity recognition using a pre-trained model or rule-based approach</span> <span class="hljs-comment"># ...</span> entities = [] <span class="hljs-comment"># Placeholder value</span>

<span class="hljs-keyword">return</span> entities

<span class="hljs-comment"># Example usage</span> unstructured_text = <span class="hljs-string">"This is an example sentence. It contains multiple words and punctuation marks."</span>

<span class="hljs-comment"># Preprocess the text</span> preprocessed_text = preprocess_text(unstructured_text) <span class="hljs-built_in">print</span>(<span class="hljs-string">"Preprocessed text:"</span>, preprocessed_text)

<span class="hljs-comment"># Perform sentiment analysis</span> sentiment_score = analyze_sentiment(unstructured_text) <span class="hljs-built_in">print</span>(<span class="hljs-string">"Sentiment score:"</span>, sentiment_score)

<span class="hljs-comment"># Extract entities using named entity recognition (NER)</span> entities = extract_entities(unstructured_text) <span class="hljs-built_in">print</span>(<span class="hljs-string">"Entities:"</span>, entities)</pre></div><p id="0ac7">Snippet —</p><figure id="acdf"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*uqxjB6-dlititNS6k6J2Sw.png"><figcaption></figcaption></figure><h1 id="945a">Data Warehousing</h1><p id="9d8a">It summarizes the data and stores historical and up to date present information from various data sources. The data is structured and processed, non volatile and time variant.</p><p id="74d1">It’s very expensive for large data volumes and is less agile with fixed configuration.</p><figure id="5de7"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*fIXbga9HkixoEge3.png"><figcaption>Pic credits : BoldBI</figcaption></figure><h1 id="43d6">Data Mart</h1><p id="bc51">Data Mart is the condensed summarized data which is like focussed data from different organizations/departments. It’s highly focussed and requires high level of prior processing.</p><h1 id="5ffc">Data Lake</h1><p id="9d73">Data lake contains the data which is raw, structured/semi-structured/unstructured. It’s designed for the low cost storage and is highly agile that you can configure as and when required. It’s used by the data scientists in its native format that makes it very flexible to use to analyze and build models from various data systems/sources.</p><figure id="1438"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*lli9nA3q3a2zgifv.jpg"><figcaption>Pic credits : Zuar</figcaption></figure><p id="bfba">It’s used for machine learning, discovery and deep analysis.</p><p id="382c"><b>Complete Code — Data Warehouse, Data Mart and Data Lake:</b></p><div id="9669"><pre>Author : Naina Chaturvedi

<span class="hljs-comment"># Connect to the data warehouse</span> <span class="hljs-keyword">import</span> pyodbc conn = pyodbc.connect(<span class="hljs-string">'Driver={SQL Server};Server=localhost;Database=DataWarehouse;Trusted_Connection=yes;'</span>)

<span class="hljs-comment"># Execute SQL queries</span>

<span class="hljs-comment"># Create a table in the data warehouse</span> create_table_query = <span class="hljs-string">''' CREATE TABLE Sales ( id INT, date DATE, amount DECIMAL(10,2), product VARCHAR(100), customer_id INT ) '''</span> cursor = conn.cursor() cursor.execute(create_table_query)

<span class="hljs-comment"># Insert data into the data warehouse</span> insert_data_query = <span class="hljs-string">''' INSERT INTO Sales (id, date, amount, product, customer_id) VALUES (1, '2023-01-01', 100.50, 'Product A', 1), (2, '2023-01-02', 200.75, 'Product B', 2), (3, '2023-01-03', 150.25, 'Product C', 1) '''</span> cursor.execute(insert_data_query) conn.commit()

<span class="hljs-comment"># Query data from the data warehouse</span> select_data_query = <span class="hljs-string">'SELECT * FROM Sales'</span> cursor.execute(select_data_query) data = cursor.fetchall() <span class="hljs-keyword">for</span> row <span class="hljs-keyword">in</span> data: <span class="hljs-built_in">print</span>(row)

<span class="hljs-comment"># Update data in the data warehouse</span> update_query = <span class="hljs-string">"UPDATE Sales SET amount = amount * 1.1 WHERE product = 'Product A'"</span> cursor.execute(update_query) conn.commit()

<span class="hljs-comment"># Delete data from the data warehouse</span> delete_query = <span class="hljs-string">"DELETE FROM Sales WHERE customer_id = 2"</span> cursor.execute(delete_query) conn.commit()

<span class="hljs-comment"># Close the connection to the data warehouse</span> conn.close()

<span class="hljs-comment"># Connect to the data mart</span> <span class="hljs-keyword">import</span> pyodbc conn = pyodbc.connect(<span class="hljs-string">'Driver={SQL Server};Server=localhost;Database=DataMart;Trusted_Connection=yes;'</span>)

<span class="hljs-comment"># Execute SQL queries</span>

<span class="hljs-comment"># Create a table in the data mart</span> create_table_query = <span class="hljs-string">''' CREATE TABLE Sales ( id INT, date DATE, amount DECIMAL(10,2), product VARCHAR(100), customer_id INT ) '''</span> cursor = conn.cursor() cursor.execute(create_table_query)

<span class="hljs-comment"># Insert data into the data mart</span> insert_data_query = <span class="hljs-string">''' INSERT INTO Sales (id, date, amount, product, customer_id) VALUES (1, '2023-01-01', 100.50, 'Product A', 1), (2, '2023-01-02', 200.75, 'Product B', 2), (3, '2023-01-03', 150.25, 'Product C', 1) '''</span> cursor.execute(insert_data_query) conn.commit()

<span class="hljs-comment"># Query data from the data mart</span> select_data_query = <span class="hljs-string">'SELECT * FROM Sales'</span> cursor.execute(select_data_query) data = cursor.fetchall() <span class="hljs-keyword">for</span> row <span class="hljs-keyword">in</span> data: <span class="hljs-built_in">print</span>(row)

<span class="hljs-comment"># Close the connection to the data mart</span> conn.close()

<span class="hljs-comment"># Import necessary libraries</span> <span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession

<span class="hljs-comment"># Query data from the data lake</span> queried_data = spark.sql(<span class="hljs-string">"SELECT * FROM parquet.s3://datalake/output/transformed_data.parquet"</span>)

<span class="hljs-comment"># Perform data analysis and exploration</span> analysis_result = queried_data.describe()

<span class="hljs-comment"># Query data using SQL on the external table</span> result = spark.sql(<span class="hljs-string">"SELECT * FROM sales WHERE date >= '2022-01-01'"</span>)

<span class="hljs-comment"># Stop the Spark session</span> spark.stop()</pre></div><p id="f3e1">Snippet —</p><figure id="8a24"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Hiw_9P7fewe2Gy6hoZMq7A.png"><figcaption></figcaption></figure><h1 id="9c93">That’s it for now.</h1><h1 id="7777">Find Day 22 Below:</h1><div id="dfe1" class="link-block"> <a href="https://readmedium.com/day-22-of-30-days-of-data-engineering-series-with-projects-6af0daa939a3"> <div> <div> <h2>Day 22 of 30 days of Data Engineering Series with Projects</h2> <div><h3>Welcome back peeps to Day 22 of Data Engineering Series with Projects!</h3></div> <div><p>medium.com=</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*PQgPhxWQyOFFaBQ1.JPG)"></div> </div> </div> </a> </div><p id="525f"><b><i>Let me know if you have questions in the comment section below. Subscribe/ Follow, Like/Clap as it would encourage me to write more in my free time</i></b></p><p id="0960"><b><i>Stay Tuned!!</i></b></p><h2 id="0bcd">Read more —</h2><h2 id="8842">All the Complete System Design Series Parts —</h2><blockquote id="f3fb"><p><a href="https://readmedium.com/complete-system-design-series-part-1-45bf9c8654bc"><b><i>1. System design basics</i></b></a></p></blockquote><blockquote id="c535"><p><a href="https://readmedium.com/complete-system-design-series-part-2-922f45f2faaf"><b><i>2. Horizontal and vertical scaling</i></b></a></p></blockquote><blockquote id="18a1"><p><a href="https://readmedium.com/part-3-complete-system-design-series-e1362baa8a4c"><b><i>3. Load balancing and Message queues</i></b></a></p></blockquote><blockquote id="4d43"><p><a href="https://readmedium.com/part-4-complete-system-design-series-138bc9fbcfc0"><b><i>4. High level design and low level design, Consistent Hashing, Monolithic and Microservices architecture</i></b></a></p></blockquote><blockquote id="d211"><p><a href="https://readmedium.com/part-5-complete-system-design-series-4b9b04f23608"><b><i>5. Caching, Indexing, Proxies</i></b></a></p></blockquote><blockquote id="10ec"><p><a href="https://readmedium.com/part-6-complete-system-design-series-59a2d8bbf1ed"><b><i>6. Networking, How Browsers work, Content Network Delivery ( CDN)</i></b></a></p></blockquote><blockquote id="2fb1"><p><a href="https://readmedium.com/part-7-complete-system-design-series-1bef528923d6"><b><i>7. Database Sharding, CAP Theorem, Database schema Design</i></b></a></p></blockquote><blockquote id="982a"><p><a href="https://readmedium.com/part-8-complete-system-design-series-57bc88433c8e"><b><i>8. Concurrency, API, Components + OOP + Abstraction</i></b></a></p></blockquote><blockquote id="f09e"><p><a href="https://readmedium.com/part-9-complete-system-design-series-df975c85ec51"><b><i>9. Estimation and Planning, Performance</i></b></a></p></blockquote><blockquote id="9128"><p><b><i>10. <a href="https://readmedium.com/part-10-complete-system-design-series-523b4dd978bf?sk=741f92929c8639a2e4cf218521e8cc4a">Map Reduce, Patterns and Microservices</a></i></b></p></blockquote><blockquote id="f879"><p><b><i>11. <a href="https://naina0412.medium.com/part-11-complete-system-design-series-9c8efbc0237a?sk=5bddf2adc78ea4947ae88ab21c94af1c">SQL vs NoSQL and Cloud</a></i></b></p></blockquote><blockquote id="bdf5"><p><a href="https://readmedium.com/most-popular-system-design-questions-mega-compilation-45218129fe26"><b><i>12. Most Popular System Design Questions</i></b></a></p></blockquote><h1 id="a23a">Github —</h1><div id="b414" class="link-block"> <a href="https://github.com/Coder-World04/Complete-System-Design/blob/main/README.md"> <div> <div> <h2>Complete-System-Design/README.md at main · Coder-World04/Complete-System-Design</h2> <div><h3>This repository contains everything you need to become proficient in System Design Topics you should know in System…</h3></div> <div><p>github.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/)"></div> </div> </div> </a> </div><h1 id="cbd3">For Python Projects —</h1><div id="22a4" class="link-block"> <a href="https://readmedium.com/complete-python-and-projects-mega-compilation-7ec8f7adfe71"> <div> <div> <h2>Complete Python And Projects — Mega Compilation</h2> <div><h3>Everything that you need to know in Python with Projects…</h3></div> <div><p>medium.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*NnCSMN6etFjjw4Jn.jpg)"></div> </div> </div> </a> </div><div id="471c" class="link-block"> <a href="https://medium.datadriveninvestor.com/analyzing-video-using-python-opencv-and-numpy-5471cab200c4"> <div> <div> <h2>Analyzing Video using Python, OpenCV and NumPy</h2> <div><h3>With Code Implementation…</h3></div> <div><p>medium.datadriveninvestor.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*PYNCDW3IXI2BcT5f.jpg)"></div> </div> </div> </a> </div><p id="f199"><b><i>For complete 60 days of Data Science and ML : Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML</i></b></p><div id="9d77" class="link-block"> <a href="https://readmedium.com/day-1-day-60-quick-recap-of-60-days-of-data-science-and-ml-6fc021643d1"> <div> <div> <h2>Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML</h2> <div><h3>Connect the ML dots…</h3></div> <div><p>medium.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*ZfJ1yKIzPLGABAI_.png)"></div> </div> </div> </a> </div><p id="1253"><b><i>Follow for more updates. Stay tuned and keep coding!</i></b></p><h1 id="21c3">For other projects, tune to —</h1><p id="b31f"><b>Build Machine Learning Pipelines( With Code)</b></p><div id="5b37" class="link-block"> <a href="https://medium.datadriveninvestor.com/build-machine-learning-pipelines-with-code-part-1-bd3ed7152124"> <div> <div> <h2>Build Machine Learning Pipelines( With Code) — Part 1</h2> <div><h3>Complete implementation…</h3></div> <div><p>medium.datadriveninvestor.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*KdToBD8RDMBH4jXM.png)"></div> </div> </div> </a> </div><p id="946c"><b>Recurrent Neural Network with Keras</b></p><div id="f317" class="link-block"> <a href="https://medium.datadriveninvestor.com/recurrent-neural-network-with-keras-b5b5f6fe5187"> <div> <div> <h2>Recurrent Neural Network with Keras</h2> <div><h3>Project Implementation and cheatsheet…</h3></div> <div><p>medium.datadriveninvestor.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*xs3Dya3qQBx6IU7C.png)"></div> </div> </div> </a> </div><p id="ec53"><b>Clustering Geolocation Data in Python using DBSCAN and K-Means</b></p><div id="2b3e" class="link-block"> <a href="https://medium.datadriveninvestor.com/clustering-geolocation-data-in-python-using-dbscan-and-k-means-3705d9f44522"> <div> <div> <h2>Clustering Geolocation Data in Python using DBSCAN and K-Means</h2> <div><h3>Project Implementation…</h3></div> <div><p>medium.datadriveninvestor.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*0uPCZnohdaPCO4NN.png)"></div> </div> </div> </a> </div><p id="a29c"><b>Facial Expression Recognition using Keras</b></p><div id="ccaa" class="link-block"> <a href="https://medium.datadriveninvestor.com/facial-expression-recognition-using-keras-cbdd661a0a54"> <div> <div> <h2>Facial Expression Recognition using Keras</h2> <div><h3>Project Implementation…</h3></div> <div><p>medium.datadriveninvestor.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*CGch7hzdjg1fpgKy.jpg)"></div> </div> </div> </a> </div><p id="0db7"><b>Hyperparameter Tuning with Keras Tuner</b></p><div id="6dff" class="link-block"> <a href="https://medium.datadriveninvestor.com/hyperparameter-tuning-with-keras-tuner-3a609d3fd85b"> <div> <div> <h2>Hyperparameter Tuning with Keras Tuner</h2> <div><h3>Project Implementation….</h3></div> <div><p>medium.datadriveninvestor.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*jlaEz8AZaptNWHEr.png)"></div> </div> </div> </a> </div><p id="fed8"><b>Custom Layers in Keras</b></p><div id="e4fd" class="link-block"> <a href="https://medium.datadriveninvestor.com/custom-layers-in-keras-de5f793217aa"> <div> <div> <h2>Custom Layers in Keras</h2> <div><h3>Code implementation …</h3></div> <div><p>medium.datadriveninvestor.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*1IH67KJadqeqeO01.png)"></div> </div> </div> </a> </div></article></body>

Day 21 of 30 days of Data Engineering Series with Projects

Welcome back peeps to Day 21 of Data Engineering Series with Projects!

In this we will cover —

Structured Data

Semi Structured Data

Unstructured Data

Data Warehouse

Data Mart

Data Lake

Pre-requisite to Day 21 is to complete Day 1–20( link below):

Day 1 : What’s Data Engineering, Why Data Engineering, Data Engineers — ML Engineers — Data Scientists, Purpose and Scope

Day 2 : Complete Python for Data Engineering — Part 1

Day 3 : Complete Advanced Python for Data Engineering — Part 2

Day 4: Techniques to write efficient and Optimized Code

Day 5 : SQL

Day 6 : Advanced SQL

Day 7 : BigQuery and SQL vs NOSQL databases

Day 8 : Advanced Functions

Day 9 : Query Optimizations

Day 10 : MySQL and PostgreSQL

Day 11: Shell scripting and Linux “touch” command

Day 12 : Map Reduce, Data Warehouse, Data Lakes

Day 13: Pandas, Pandas, Data Cleaning and processing, Outlier Detection, Noisy Data, Missing Data, Pandas Functions, Aggregate Functions, Joins

Day 14 : Numpy

Day 15 : Advanced Pandas Techniques

Day 16 : Data Pre-processing, Handling missing values, Data Cleaning, Mean/mode/median Imputation, Hot Deck Imputation, Rescale Data, Binarize Data, Regression Imputation, Stochastic regression imputation, Feature Scaling

Day 17 : Data Augmentation, Read and Process Large Datasets

Day 18 : Data Visualization basics, Data Visualization Projects, Data Visualization using Plotly and Bokeh, Data Profiling, Summary Functions, Indexing, Grouping, Linear Regression, Multi Linear Regression, Polynomial Regression, Regression, Support Vector Regression, Decision Tree Regression, Random Forest Regression, Feature Engineering, GroupBy Features, Categorical and Numerical Features, Missing Value Analysis, Fill the missing Values, Unique Value Analysis, Univariate Analysis, Bivariate Analysis, Multivariate Analysis, Correlation Analysis, Spearman’s ρ, Pearson’s r, Kendall’s τ, Cramér’s V (φc), Phik (φk)

Day 19 : MySQL and PostgreSQL

Day 20 : ETL ( Extract, Tranform and Load) basics, Why ETL is important?, How ETL works, ETL Tools

Day 21 : Structured Data, Semi Structured Data, Unstructured Data, Data Warehouse, Data Mart, Data Lake

Projects Videos —

Subscribe today!

Ignito

Excited to share that we have launched our Youtube channel — Ignito to cover all the projects and coding exercise for …

www.youtube.com

Tech Newsletter —

If you are interested, you can join my newsletter through which I send tech interview tips, techniques, patterns, hacks — Software Development, ML, Data Science, Startups and Technology projects to more than 30K readers. You can subscribe to Ignito:

Ignito

Data Science, ML, AI and more… Click to read Ignito, by Naina Chaturvedi, a Substack publication with hundreds of…

naina0405.substack.com

System Design Case Studies — In Depth

Design Instagram

Design Netflix

Design Reddit

Design Amazon

Design Messenger App

Design Twitter

Design URL Shortener

Design Dropbox

Design Youtube

Design API Rate Limiter

Design Web Crawler

Design Amazon Prime Video

Design Facebook’s Newsfeed

Design Yelp

Design Uber

Design Tinder

Design Tiktok

Design Whatsapp

Mega Compilation : Solved System Design Case studies

Let’s get started!

Structured data refers to data that is organized in a specific format, such as tables in a relational database. It is typically easy to analyze and query because it follows a predefined schema. Examples include data from spreadsheets, tables in a relational database, and data from structured forms.
Semi-structured data refers to data that has some level of organization, but does not follow a strict schema. It may contain a mix of structured and unstructured elements. Examples include data from JSON, XML, and log files.
Unstructured data refers to data that does not have a specific format or organization. It is often unorganized and difficult to analyze, but can still provide valuable insights. Examples include data from social media posts, email messages, and images.
A Data Warehouse is a large, centralized repository of data that is specifically designed to support business intelligence (BI) activities. It is used to store historical data from various sources, and is optimized for reporting and analysis.

# Connect to the data warehouse
import pyodbc
conn = pyodbc.connect('Driver={SQL Server};Server=localhost;Database=DataWarehouse;Trusted_Connection=yes;')

# Execute SQL queries

# Create a table in the data warehouse
create_table_query = '''
    CREATE TABLE Sales (
        id INT,
        date DATE,
        amount DECIMAL(10,2),
        product VARCHAR(100),
        customer_id INT
    )
'''
cursor = conn.cursor()
cursor.execute(create_table_query)

# Insert data into the data warehouse
insert_data_query = '''
    INSERT INTO Sales (id, date, amount, product, customer_id)
    VALUES (1, '2023-01-01', 100.50, 'Product A', 1),
           (2, '2023-01-02', 200.75, 'Product B', 2),
           (3, '2023-01-03', 150.25, 'Product C', 1)
'''
cursor.execute(insert_data_query)
conn.commit()

# Query data from the data warehouse
select_data_query = 'SELECT * FROM Sales'
cursor.execute(select_data_query)
data = cursor.fetchall()
for row in data:
    print(row)

# Perform aggregations and calculations
aggregate_query = '''
    SELECT product, SUM(amount) AS total_sales
    FROM Sales
    GROUP BY product
'''
cursor.execute(aggregate_query)
result = cursor.fetchall()
for row in result:
    print(row)

# Update data in the data warehouse
update_query = "UPDATE Sales SET amount = amount * 1.1 WHERE product = 'Product A'"
cursor.execute(update_query)
conn.commit()

# Delete data from the data warehouse
delete_query = "DELETE FROM Sales WHERE customer_id = 2"
cursor.execute(delete_query)
conn.commit()

# Close the connection to the data warehouse
conn.close()

A Data Mart is a subset of a data warehouse that is focused on a specific business area, such as sales or finance. It is designed to support the specific reporting and analysis needs of a particular department or business unit.

# Connect to the data mart
import pyodbc
conn = pyodbc.connect('Driver={SQL Server};Server=localhost;Database=DataMart;Trusted_Connection=yes;')

# Execute SQL queries

# Create a table in the data mart
create_table_query = '''
    CREATE TABLE Sales (
        id INT,
        date DATE,
        amount DECIMAL(10,2),
        product VARCHAR(100),
        customer_id INT
    )
'''
cursor = conn.cursor()
cursor.execute(create_table_query)

# Insert data into the data mart
insert_data_query = '''
    INSERT INTO Sales (id, date, amount, product, customer_id)
    VALUES (1, '2023-01-01', 100.50, 'Product A', 1),
           (2, '2023-01-02', 200.75, 'Product B', 2),
           (3, '2023-01-03', 150.25, 'Product C', 1)
'''
cursor.execute(insert_data_query)
conn.commit()

# Query data from the data mart
select_data_query = 'SELECT * FROM Sales'
cursor.execute(select_data_query)
data = cursor.fetchall()
for row in data:
    print(row)

# Perform aggregations and calculations
aggregate_query = '''
    SELECT product, SUM(amount) AS total_sales
    FROM Sales
    GROUP BY product
'''
cursor.execute(aggregate_query)
result = cursor.fetchall()
for row in result:
    print(row)

# Apply filters and conditions
filtered_query = "SELECT * FROM Sales WHERE date >= '2023-01-02'"
cursor.execute(filtered_query)
filtered_data = cursor.fetchall()
for row in filtered_data:
    print(row)

# Close the connection to the data mart
conn.close()

A Data Lake is a large, centralized repository of raw, unstructured data that is stored in its native format. It is designed to store all types of data, structured and unstructured, and is optimized for big data processing and analytics. It allows data scientists to store, process, and analyze data in a single place, and it is often used in conjunction with a data warehouse or data mart.

# Import necessary libraries
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("DataLakeExample") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Read data from a file in the data lake
data = spark.read.format("csv").option("header", "true").load("s3://datalake/input/file.csv")

# Perform transformations and data manipulations
transformed_data = data.select("col1", "col2").filter("col3 > 0").groupBy("col1").sum("col2")

# Write the transformed data back to the data lake
transformed_data.write.format("parquet").mode("overwrite").save("s3://datalake/output/transformed_data.parquet")

# Query data from the data lake
queried_data = spark.sql("SELECT * FROM parquet.`s3://datalake/output/transformed_data.parquet`")

# Perform data analysis and exploration
analysis_result = queried_data.describe()

# Export analysis result to a file in the data lake
analysis_result.write.format("csv").mode("overwrite").save("s3://datalake/output/analysis_result.csv")

# Create external tables for querying data
spark.sql("CREATE EXTERNAL TABLE sales USING parquet LOCATION 's3://datalake/sales_data/'")

# Query data using SQL on the external table
result = spark.sql("SELECT * FROM sales WHERE date >= '2022-01-01'")

# Export the result to a file in the data lake
result.write.format("csv").mode("overwrite").save("s3://datalake/output/query_result.csv")

# Stop the Spark session
spark.stop()

Complete Implementation ( for all)—

# Structured Data

# Read structured data from CSV file
import pandas as pd
data = pd.read_csv('structured_data.csv')

# Perform data analysis and manipulation
data.head()
data.describe()
data.groupby('category').mean()

# Semi-Structured Data

# Read semi-structured data from JSON file
import json
with open('semi_structured_data.json') as f:
    data = json.load(f)

# Access data elements
data['key']
data['nested']['value']

# Unstructured Data

# Read unstructured data from text file
with open('unstructured_data.txt', 'r') as f:
    data = f.read()

# Perform text processing
words = data.split()
unique_words = set(words)
word_counts = {word: words.count(word) for word in unique_words}

# Data Warehouse

# Connect to a data warehouse
import pyodbc
conn = pyodbc.connect('Driver={SQL Server};Server=localhost;Database=DataWarehouse;Trusted_Connection=yes;')

# Execute SQL queries
cursor = conn.cursor()
cursor.execute('SELECT * FROM fact_table')
data = cursor.fetchall()

# Data Mart

# Connect to a data mart
import psycopg2
conn = psycopg2.connect(host="localhost", port="5432", database="DataMart", user="username", password="password")

# Execute SQL queries
cursor = conn.cursor()
cursor.execute('SELECT * FROM dimension_table')
data = cursor.fetchall()

# Data Lake

# Access data in a data lake using Hadoop File System (HDFS)
from pyarrow import hdfs
hdfs_client = hdfs.connect(host='localhost', port=8020)
data = hdfs_client.cat('/data_lake/file.parquet')

# Perform data processing
import pandas as pd
df = pd.read_parquet(data)
df.head()
df.groupby('category').sum()

Snippet —

Structured Data

Structured Data is the data which is highly organized, factual and quantitative in nature.

It has a clear data model and can be displayed in rows, columns and relational database.

It consists of numbers, dates, strings and floats and requires less storage.

Advantage of structured data is that it’s easier to manage/maintain and all the legacy data can be stored in a well formatted way.

It resides in the relational databases and data warehouses.

Examples of Structured data —

Numerical data in excel files/google sheets

Ratings on e-commerce website

Relational Databases data

How to deal with Structured Data —

import pandas as pd

# Read structured data from a file (e.g., CSV, Excel)
data = pd.read_csv("data.csv")

# Display the structure and summary of the data
print("Data structure:")
print(data.head())
print("\nData summary:")
print(data.describe())

# Select specific columns
selected_columns = ["column1", "column2", "column3"]
selected_data = data[selected_columns]

# Filter data based on conditions
filtered_data = data[data["column1"] > 10]

# Sort data by a column
sorted_data = data.sort_values("column1")

# Group data and calculate aggregates
grouped_data = data.groupby("column2").agg({"column1": "sum", "column3": "mean"})

# Perform data transformations
transformed_data = data.copy()
transformed_data["new_column"] = transformed_data["column1"] + transformed_data["column2"]

# Perform data analysis
mean_value = data["column1"].mean()
max_value = data["column2"].max()

# Export data to a new file (e.g., CSV, Excel)
transformed_data.to_csv("new_data.csv", index=False)

# Load data from a database
import sqlite3
conn = sqlite3.connect("database.db")
db_data = pd.read_sql_query("SELECT * FROM table", conn)

# Write data to a database table
transformed_data.to_sql("new_table", conn, if_exists="replace")

# Close the database connection
conn.close()

Snippet —

Semi Structured Data

Semi Structured data lacks fixed schema and is loosely organized data which is categorized using meta tags or markers. These are in the form of data files which follow a semi pattern.

Semi structured

Examples of Semi Structured data —

Posts with tags

Tweets with tags

Emails

XML, HTML, JSON Files

How to deal with Semi-Structured Data —

import pandas as pd
import json
import xml.etree.ElementTree as ET

# Parse JSON data
with open("data.json") as json_file:
    json_data = json.load(json_file)

# Extract information from JSON
value = json_data["key"]
nested_value = json_data["nested"]["key"]

# Parse XML data
tree = ET.parse("data.xml")
root = tree.getroot()

# Extract information from XML
value = root.find("element/tag").text

# Parse HTML data (using BeautifulSoup library)
from bs4 import BeautifulSoup

with open("data.html") as html_file:
    html_data = html_file.read()

soup = BeautifulSoup(html_data, "html.parser")

# Extract information from HTML
value = soup.find("tag").text


# Export data to a new format (e.g., JSON, XML, CSV)
def export_data(data, format):
    if format == "json":
        with open("exported_data.json", "w") as json_file:
            json.dump(data, json_file)
    elif format == "xml":
        root = ET.Element("root")
        for item in data:
            element = ET.SubElement(root, "item")
            element.text = str(item)
        tree = ET.ElementTree(root)
        tree.write("exported_data.xml")
    elif format == "csv":
        df = pd.DataFrame(data)
        df.to_csv("exported_data.csv", index=False)
    else:
        print("Unsupported export format")

# Load data from a database or API (depending on the data source)
def load_data(source):
    if source == "database":
        # Load data from a database
        # ...
        return data_from_database
    elif source == "api":
        # Load data from an API
        # ...
        return data_from_api
    else:
        print("Unsupported data source")
        return None

# Write data to a database or API (depending on the data destination)
def write_data(data, destination):
    if destination == "database":
        # Write data to a database
        # ...
        print("Data written to the database")
    elif destination == "api":
        # Write data to an API
        # ...
        print("Data written to the API")
    else:
        print("Unsupported data destination")

# Example usage
data = [1, 2, 3, 4, 5]

# Perform data transformations and analysis
perform_data_analysis(data)

# Export data to a new format (e.g., JSON, XML, CSV)
export_data(data, "json")
export_data(data, "xml")
export_data(data, "csv")

# Load data from a database or API
loaded_data = load_data("database")
loaded_data = load_data("api")

# Write data to a database or API
write_data(data, "database")
write_data(data, "api")

Snippet —

Unstructured Data

Unstructured data doesn’t have an inherent structure and stores in different types of formats and files. It doesn’t have predefined data models and very difficult to search the data. It’s qualitative in nature and the schema creation of read.

It cannot be displayed in rows, columns or relational database formats. It requires more storage and more difficult to manage as well as maintain.

It resides on NOSQL Databases and Data lakes and Data warehouses.

Examples of Unstructured Data —

Surveys, transcripts

pdfs, images, videos etc

Emails

Audio Files

How to handle Unstructured Data —

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Perform text preprocessing
def preprocess_text(text):
    # Tokenize the text into individual words
    tokens = word_tokenize(text)
    
    # Remove stopwords
    stop_words = set(stopwords.words("english"))
    filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
    
    # Lemmatize the words
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    
    # Return the preprocessed text as a list of tokens
    return lemmatized_tokens

# Perform sentiment analysis
def analyze_sentiment(text):
    # Perform sentiment analysis using a pre-trained model or rule-based approach
    # ...
    sentiment_score = 0.5  # Placeholder value
    
    return sentiment_score

# Extract key information using named entity recognition (NER)
def extract_entities(text):
    # Perform named entity recognition using a pre-trained model or rule-based approach
    # ...
    entities = []  # Placeholder value
    
    return entities

# Example usage
unstructured_text = "This is an example sentence. It contains multiple words and punctuation marks."

# Preprocess the text
preprocessed_text = preprocess_text(unstructured_text)
print("Preprocessed text:", preprocessed_text)

# Perform sentiment analysis
sentiment_score = analyze_sentiment(unstructured_text)
print("Sentiment score:", sentiment_score)

# Extract entities using named entity recognition (NER)
entities = extract_entities(unstructured_text)
print("Entities:", entities)

Snippet —

Data Warehousing

It summarizes the data and stores historical and up to date present information from various data sources. The data is structured and processed, non volatile and time variant.

It’s very expensive for large data volumes and is less agile with fixed configuration.

Data Mart

Data Mart is the condensed summarized data which is like focussed data from different organizations/departments. It’s highly focussed and requires high level of prior processing.

Data Lake

Data lake contains the data which is raw, structured/semi-structured/unstructured. It’s designed for the low cost storage and is highly agile that you can configure as and when required. It’s used by the data scientists in its native format that makes it very flexible to use to analyze and build models from various data systems/sources.

It’s used for machine learning, discovery and deep analysis.

Complete Code — Data Warehouse, Data Mart and Data Lake:

Author : Naina Chaturvedi

# Connect to the data warehouse
import pyodbc
conn = pyodbc.connect('Driver={SQL Server};Server=localhost;Database=DataWarehouse;Trusted_Connection=yes;')

# Execute SQL queries

# Create a table in the data warehouse
create_table_query = '''
    CREATE TABLE Sales (
        id INT,
        date DATE,
        amount DECIMAL(10,2),
        product VARCHAR(100),
        customer_id INT
    )
'''
cursor = conn.cursor()
cursor.execute(create_table_query)

# Insert data into the data warehouse
insert_data_query = '''
    INSERT INTO Sales (id, date, amount, product, customer_id)
    VALUES (1, '2023-01-01', 100.50, 'Product A', 1),
           (2, '2023-01-02', 200.75, 'Product B', 2),
           (3, '2023-01-03', 150.25, 'Product C', 1)
'''
cursor.execute(insert_data_query)
conn.commit()

# Query data from the data warehouse
select_data_query = 'SELECT * FROM Sales'
cursor.execute(select_data_query)
data = cursor.fetchall()
for row in data:
    print(row)

# Perform aggregations and calculations
aggregate_query = '''
    SELECT product, SUM(amount) AS total_sales
    FROM Sales
    GROUP BY product
'''
cursor.execute(aggregate_query)
result = cursor.fetchall()
for row in result:
    print(row)

# Update data in the data warehouse
update_query = "UPDATE Sales SET amount = amount * 1.1 WHERE product = 'Product A'"
cursor.execute(update_query)
conn.commit()

# Delete data from the data warehouse
delete_query = "DELETE FROM Sales WHERE customer_id = 2"
cursor.execute(delete_query)
conn.commit()

# Close the connection to the data warehouse
conn.close()

      
# Connect to the data mart
import pyodbc
conn = pyodbc.connect('Driver={SQL Server};Server=localhost;Database=DataMart;Trusted_Connection=yes;')

# Execute SQL queries

# Create a table in the data mart
create_table_query = '''
    CREATE TABLE Sales (
        id INT,
        date DATE,
        amount DECIMAL(10,2),
        product VARCHAR(100),
        customer_id INT
    )
'''
cursor = conn.cursor()
cursor.execute(create_table_query)

# Insert data into the data mart
insert_data_query = '''
    INSERT INTO Sales (id, date, amount, product, customer_id)
    VALUES (1, '2023-01-01', 100.50, 'Product A', 1),
           (2, '2023-01-02', 200.75, 'Product B', 2),
           (3, '2023-01-03', 150.25, 'Product C', 1)
'''
cursor.execute(insert_data_query)
conn.commit()

# Query data from the data mart
select_data_query = 'SELECT * FROM Sales'
cursor.execute(select_data_query)
data = cursor.fetchall()
for row in data:
    print(row)

# Perform aggregations and calculations
aggregate_query = '''
    SELECT product, SUM(amount) AS total_sales
    FROM Sales
    GROUP BY product
'''
cursor.execute(aggregate_query)
result = cursor.fetchall()
for row in result:
    print(row)

# Apply filters and conditions
filtered_query = "SELECT * FROM Sales WHERE date >= '2023-01-02'"
cursor.execute(filtered_query)
filtered_data = cursor.fetchall()
for row in filtered_data:
    print(row)

# Close the connection to the data mart
conn.close()

      
      
# Import necessary libraries
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("DataLakeExample") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Read data from a file in the data lake
data = spark.read.format("csv").option("header", "true").load("s3://datalake/input/file.csv")

# Perform transformations and data manipulations
transformed_data = data.select("col1", "col2").filter("col3 > 0").groupBy("col1").sum("col2")

# Write the transformed data back to the data lake
transformed_data.write.format("parquet").mode("overwrite").save("s3://datalake/output/transformed_data.parquet")

# Query data from the data lake
queried_data = spark.sql("SELECT * FROM parquet.`s3://datalake/output/transformed_data.parquet`")

# Perform data analysis and exploration
analysis_result = queried_data.describe()

# Export analysis result to a file in the data lake
analysis_result.write.format("csv").mode("overwrite").save("s3://datalake/output/analysis_result.csv")

# Create external tables for querying data
spark.sql("CREATE EXTERNAL TABLE sales USING parquet LOCATION 's3://datalake/sales_data/'")

# Query data using SQL on the external table
result = spark.sql("SELECT * FROM sales WHERE date >= '2022-01-01'")

# Export the result to a file in the data lake
result.write.format("csv").mode("overwrite").save("s3://datalake/output/query_result.csv")

# Stop the Spark session
spark.stop()

Snippet —

That’s it for now.

Find Day 22 Below:

Day 22 of 30 days of Data Engineering Series with Projects

Welcome back peeps to Day 22 of Data Engineering Series with Projects!

medium.com=

Let me know if you have questions in the comment section below. Subscribe/ Follow, Like/Clap as it would encourage me to write more in my free time

Stay Tuned!!

All the Complete System Design Series Parts —

1. System design basics

2. Horizontal and vertical scaling

3. Load balancing and Message queues

4. High level design and low level design, Consistent Hashing, Monolithic and Microservices architecture

5. Caching, Indexing, Proxies

6. Networking, How Browsers work, Content Network Delivery ( CDN)

7. Database Sharding, CAP Theorem, Database schema Design

8. Concurrency, API, Components + OOP + Abstraction

9. Estimation and Planning, Performance

10. Map Reduce, Patterns and Microservices

11. SQL vs NoSQL and Cloud

12. Most Popular System Design Questions

Github —

Complete-System-Design/README.md at main · Coder-World04/Complete-System-Design

This repository contains everything you need to become proficient in System Design Topics you should know in System…

github.com

For Python Projects —

Complete Python And Projects — Mega Compilation

Everything that you need to know in Python with Projects…

medium.com

Analyzing Video using Python, OpenCV and NumPy

With Code Implementation…

medium.datadriveninvestor.com

For complete 60 days of Data Science and ML : Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Connect the ML dots…

medium.com

Follow for more updates. Stay tuned and keep coding!

For other projects, tune to —

Build Machine Learning Pipelines( With Code)

Build Machine Learning Pipelines( With Code) — Part 1

Complete implementation…

medium.datadriveninvestor.com

Recurrent Neural Network with Keras

Recurrent Neural Network with Keras

Project Implementation and cheatsheet…

medium.datadriveninvestor.com

Clustering Geolocation Data in Python using DBSCAN and K-Means

Clustering Geolocation Data in Python using DBSCAN and K-Means

Project Implementation…

medium.datadriveninvestor.com

Facial Expression Recognition using Keras

Facial Expression Recognition using Keras

Project Implementation…

medium.datadriveninvestor.com

Hyperparameter Tuning with Keras Tuner

Hyperparameter Tuning with Keras Tuner

Project Implementation….

medium.datadriveninvestor.com

Custom Layers in Keras

Custom Layers in Keras

Code implementation …

medium.datadriveninvestor.com