avatarBenjamin Obi Tayo Ph.D.

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

5619

Abstract

Z'</span>: <span class="hljs-string">'Arizona'</span>, <span class="hljs-string">'GU'</span>: <span class="hljs-string">'Guam'</span>, <span class="hljs-string">'MS'</span>: <span class="hljs-string">'Mississippi'</span>,<span class="hljs-string">'PR'</span>: <span class="hljs-string">'Puerto Rico'</span>, <span class="hljs-string">'NC'</span>: <span class="hljs-string">'North Carolina'</span>, <span class="hljs-string">'TX'</span>: <span class="hljs-string">'Texas'</span>, <span class="hljs-string">'SD'</span>: <span class="hljs-string">'South Dakota'</span>, <span class="hljs-string">'MP'</span>: <span class="hljs-string">'Northern Mariana Islands'</span>,<span class="hljs-string">'IA'</span>: <span class="hljs-string">'Iowa'</span>, <span class="hljs-string">'MO'</span>: <span class="hljs-string">'Missouri'</span>, <span class="hljs-string">'CT'</span>: <span class="hljs-string">'Connecticut'</span>, <span class="hljs-string">'WV'</span>: <span class="hljs-string">'West Virginia'</span>, <span class="hljs-string">'SC'</span>: <span class="hljs-string">'South Carolina'</span>, <span class="hljs-string">'LA'</span>: <span class="hljs-string">'Louisiana'</span>, <span class="hljs-string">'KS'</span>: <span class="hljs-string">'Kansas'</span>, <span class="hljs-string">'NY'</span>: <span class="hljs-string">'New York'</span>, <span class="hljs-string">'NE'</span>: <span class="hljs-string">'Nebraska'</span>, <span class="hljs-string">'OK'</span>: <span class="hljs-string">'Oklahoma'</span>, <span class="hljs-string">'FL'</span>: <span class="hljs-string">'Florida'</span>, <span class="hljs-string">'CA'</span>: <span class="hljs-string">'California'</span>, <span class="hljs-string">'CO'</span>: <span class="hljs-string">'Colorado'</span>, <span class="hljs-string">'PA'</span>: <span class="hljs-string">'Pennsylvania'</span>, <span class="hljs-string">'DE'</span>: <span class="hljs-string">'Delaware'</span>, <span class="hljs-string">'NM'</span>: <span class="hljs-string">'New Mexico'</span>, <span class="hljs-string">'RI'</span>: <span class="hljs-string">'Rhode Island'</span>, <span class="hljs-string">'MN'</span>: <span class="hljs-string">'Minnesota'</span>, <span class="hljs-string">'VI'</span>: <span class="hljs-string">'Virgin Islands'</span>,<span class="hljs-string">'NH'</span>: <span class="hljs-string">'New Hampshire'</span>, <span class="hljs-string">'MA'</span>: <span class="hljs-string">'Massachusetts'</span>, <span class="hljs-string">'GA'</span>: <span class="hljs-string">'Georgia'</span>, <span class="hljs-string">'ND'</span>: <span class="hljs-string">'North Dakota'</span>, <span class="hljs-string">'VA'</span>: <span class="hljs-string">'Virginia'</span>}</pre></div><h1 id="ef29">Order States Alphabetically</h1><div id="6ee6"><pre>order_states=<span class="hljs-built_in">np</span>.<span class="hljs-built_in">sort</span>(pd.Series(states).<span class="hljs-built_in">values</span>) order_states;</pre></div><h1 id="bdee">Data Cleaning, Preparation, and Organization</h1><div id="800b"><pre>NewState=<span class="hljs-selector-attr">[]</span> <span class="hljs-keyword">for</span> <span class="hljs-selector-tag">i</span> <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-number">0</span>,<span class="hljs-built_in">len</span>(order_states)): <span class="hljs-keyword">for</span> j <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-number">0</span>,<span class="hljs-built_in">len</span>(df.data)): <span class="hljs-keyword">if</span> (order_states<span class="hljs-selector-attr">[i]</span> +<span class="hljs-string">'['</span>+<span class="hljs-string">'edit'</span>+<span class="hljs-string">']'</span>==df<span class="hljs-selector-class">.data</span><span class="hljs-selector-attr">[j]</span>): NewState=np<span class="hljs-selector-class">.append</span>(NewState,order_states<span class="hljs-selector-attr">[i]</span>)</pre></div><div id="d537"><pre>StateRegion=<span class="hljs-selector-attr">[]</span> <span class="hljs-keyword">for</span> <span class="hljs-selector-tag">i</span> <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-number">0</span>,<span class="hljs-built_in">len</span>(NewState)): <span class="hljs-keyword">for</span> j <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-number">0</span>,<span class="hljs-built_in">len</span>(df.data)): <span class="hljs-keyword">if</span> (NewState<span class="hljs-selector-attr">[i]</span> +<span class="hljs-string">'['</span>+<span class="hljs-string">'edit'</span>+<span class="hljs-string">']'</span>==df<span class="hljs-selector-class">.data</span><span class="hljs-selector-attr">[j]</span>): n = j + <span class="hljs-number">1</span> StateRegion=np<span class="hljs-selector-class">.append</span>(StateRegion,n)</pre></div><div id="1a41"><pre>StateName=<span class="hljs-selector-attr">[]</span> CityName=<span class="hljs-selector-attr">[]</span> <span class="hljs-keyword">for</span> <span class="hljs-selector-tag">i</span> <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-number">0</span>,<span class="hljs-built_in">len</span>(NewState)-<span class="hljs-number">1</span>): <span class="hljs-keyword">for</span> j <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="

Options

hljs-number">0</span>,<span class="hljs-built_in">len</span>(df.data)): <span class="hljs-keyword">if</span> (NewState<span class="hljs-selector-attr">[i]</span> +<span class="hljs-string">'['</span>+<span class="hljs-string">'edit'</span>+<span class="hljs-string">']'</span>==df<span class="hljs-selector-class">.data</span><span class="hljs-selector-attr">[j]</span>): n = j + <span class="hljs-number">1</span> while (NewState<span class="hljs-selector-attr">[i+1]</span> +<span class="hljs-string">'['</span>+<span class="hljs-string">'edit'</span>+<span class="hljs-string">']'</span>!=df<span class="hljs-selector-class">.data</span><span class="hljs-selector-attr">[n]</span>): StateName=np<span class="hljs-selector-class">.append</span>(StateName,NewState<span class="hljs-selector-attr">[i]</span>) CityName=np<span class="hljs-selector-class">.append</span>(CityName,df<span class="hljs-selector-class">.data</span><span class="hljs-selector-attr">[n]</span>) n = n + <span class="hljs-number">1</span></pre></div><div id="ae66"><pre>CTYNAME=<span class="hljs-selector-attr">[]</span> <span class="hljs-keyword">for</span> <span class="hljs-selector-tag">i</span> <span class="hljs-keyword">in</span> CityName: CTYNAME=np<span class="hljs-selector-class">.append</span>(CTYNAME, <span class="hljs-selector-tag">i</span><span class="hljs-selector-class">.split</span>(<span class="hljs-string">'('</span>)<span class="hljs-selector-attr">[0]</span>)</pre></div><div id="21fd"><pre>city_new=<span class="hljs-selector-attr">[]</span> <span class="hljs-keyword">for</span> <span class="hljs-selector-tag">i</span> <span class="hljs-keyword">in</span> CTYNAME: <span class="hljs-keyword">if</span> <span class="hljs-selector-tag">i</span><span class="hljs-selector-attr">[-1]</span>==<span class="hljs-string">' '</span>: city_new = np<span class="hljs-selector-class">.append</span>(city_new,<span class="hljs-selector-tag">i</span><span class="hljs-selector-attr">[:-1]</span>) <span class="hljs-keyword">else</span>: city_new=np<span class="hljs-selector-class">.append</span>(city_new,i)</pre></div><h1 id="c45c">Structuring the Cleaned Dataset</h1><div id="784a"><pre><span class="hljs-attr">df_final</span>=pd.DataFrame([])</pre></div><div id="8cd3"><pre>StateName=np<span class="hljs-selector-class">.append</span>(StateName,<span class="hljs-string">'Wyoming'</span>) city_new=np<span class="hljs-selector-class">.append</span>(city_new, <span class="hljs-string">'Laramie'</span>) df_final<span class="hljs-selector-attr">[<span class="hljs-string">'State'</span>]</span>=StateName df_final<span class="hljs-selector-attr">[<span class="hljs-string">'RegionName'</span>]</span>=city_new</pre></div><div id="c10c"><pre>df_final.set_index([<span class="hljs-string">'State'</span>,<span class="hljs-string">'RegionName'</span>],<span class="hljs-attribute">inplace</span>=<span class="hljs-literal">True</span>)</pre></div><div id="e10c"><pre><span class="hljs-attribute">df_final</span>.head(<span class="hljs-number">20</span>)</pre></div><figure id="1189"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*JgZrsHO4PHoUXn9jScOdNA.png"><figcaption><b>Table 2: Clean and structured dataset showing states and their college towns.</b></figcaption></figure><p id="d233">Comparing <b>Table 1</b> to <b>Table 2</b>, we see that the dataset in <b>Table 2</b> represents a clean and structured dataset that can now be used for further analysis.</p><h1 id="9f2c">Data Analysis Using Wrangled Dataset</h1><p id="94d1"><b>a) University Towns in Kansas and Delaware</b></p><div id="e7f2"><pre>df_final<span class="hljs-selector-class">.loc</span><span class="hljs-selector-attr">[<span class="hljs-string">'Kansas'</span>]</span></pre></div><figure id="261e"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*2HJBtzZACsD998UeVKmByQ.png"><figcaption><b>Table 3: College towns in the state of Kansas.</b></figcaption></figure><div id="4a97"><pre>df_final<span class="hljs-selector-class">.loc</span><span class="hljs-selector-attr">[<span class="hljs-string">'Delaware'</span>]</span></pre></div><figure id="ea45"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*2Pm8AMZgRmKdwtEr12wkrA.png"><figcaption><b>Table 4: College towns in the state of Delaware.</b></figcaption></figure><p id="a7f1"><b>b) Number of University Towns in a given state</b></p><p id="e340">This can be obtained using this code:</p><div id="4c6b"><pre>df_final<span class="hljs-selector-class">.reset_index</span>()<span class="hljs-selector-class">.groupby</span>(<span class="hljs-string">'State'</span>)<span class="hljs-selector-class">.count</span>()</pre></div><figure id="bfda"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*UEY5P-ZbZbQ0RzaBovkijQ.png"><figcaption><b>Table 5: Sample output for a number of college towns in each state.</b></figcaption></figure><p id="252b">In summary, we’ve shown how data wrangling can be used to convert an uncleaned, unstructured dataset into a tidied form that is ready for further analysis. The process of data wrangling is a critical step for any data scientist. Knowing how to wrangle and clean data will enable you to derive critical insights from your data that would otherwise be hidden.</p><p id="8d6b">The dataset and code for this article can be downloaded from this repository: <a href="https://github.com/bot13956/unstructured_data_university_towns">https://github.com/bot13956/unstructured_data_university_towns</a>.</p></article></body>

Data Wrangling Tutorial | Towards AI

Tutorial on Data Wrangling: College Towns Dataset

Data wrangling is the process of converting data from its raw form to the tidy form ready for analysis. Data wrangling is an important step in data preprocessing and includes several processes like data importing, data cleaning, data structuring, string processing, HTML parsing, handling dates and times, handling missing data, and text mining.

The process of data wrangling is a critical step for any data scientist. Very rarely is data easily accessible in a data science project for analysis. It’s more likely for the data to be in a file, a database, or extracted from documents such as web pages, tweets, or PDFs. Knowing how to wrangle and clean data will enable you to derive critical insights from your data that would otherwise be hidden.

This tutorial will demonstrate the data wrangling process. You’ll learn the following:

  1. Importation of unstructured data
  2. Cleaning and organizing unstructured data using string processing techniques
  3. Converting unstructured data into structured data
  4. Performing analysis of structured data

The dataset and code for this article can be downloaded from this repository: https://github.com/bot13956/unstructured_data_university_towns.

Project Objective: This code returns a DataFrame of college towns and the states they belong to, from the unstructured university_towns.txt dataset. The format of the final cleaned DataFrame is in the form:

DataFrame( [ [“Michigan”, “Ann Arbor”], [“Michigan”, “Yipsilanti”] ], columns=[“State”, “RegionName”] )

which shows the state, and name of college towns in the given state.

Import Necessary Libraries

import pandas as pd
import numpy as np

Import Unstructured, Unclean Dataset

df=pd.read_table("university_towns.txt",header=None)
df.columns=["data"]
df.head(10)
Table 1: Unstructured, unclean dataset of college towns and the state where they belong.

As we can see from Table 1, the dataset is very unstructured, unclean and requires a lot of tidying before any further analysis can be done.

Let’s define a dictionary that can be used to map states to their respective two-letter acronym:

states = {'OH': 'Ohio', 'KY': 'Kentucky', 'AS': 'American Samoa', 'NV': 'Nevada', 'WY': 'Wyoming', 
          'NA': 'National', 'AL': 'Alabama', 'MD': 'Maryland', 'AK': 'Alaska', 'UT': 'Utah', 
          'OR': 'Oregon','MT': 'Montana', 'IL': 'Illinois', 'TN': 'Tennessee', 'DC': 'District of Columbia', 
          'VT': 'Vermont','ID': 'Idaho', 'AR': 'Arkansas', 'ME': 'Maine', 'WA': 'Washington', 'HI': 'Hawaii',
          'WI': 'Wisconsin','MI': 'Michigan', 'IN': 'Indiana', 'NJ': 'New Jersey', 'AZ': 'Arizona', 
          'GU': 'Guam', 'MS': 'Mississippi','PR': 'Puerto Rico', 'NC': 'North Carolina', 'TX': 'Texas', 
          'SD': 'South Dakota', 'MP': 'Northern Mariana Islands','IA': 'Iowa', 'MO': 'Missouri', 'CT': 'Connecticut',
          'WV': 'West Virginia', 'SC': 'South Carolina', 'LA': 'Louisiana',  'KS': 'Kansas', 'NY': 'New York', 
          'NE': 'Nebraska', 'OK': 'Oklahoma', 'FL': 'Florida', 'CA': 'California', 'CO': 'Colorado', 
          'PA': 'Pennsylvania', 'DE': 'Delaware', 'NM': 'New Mexico', 'RI': 'Rhode Island', 'MN': 'Minnesota',
          'VI': 'Virgin Islands','NH': 'New Hampshire', 'MA': 'Massachusetts', 'GA': 'Georgia',
          'ND': 'North Dakota', 'VA': 'Virginia'}

Order States Alphabetically

order_states=np.sort(pd.Series(states).values)
order_states;

Data Cleaning, Preparation, and Organization

NewState=[]
for i in range(0,len(order_states)):
    for j in range(0,len(df.data)):
        if (order_states[i] +'['+'edit'+']'==df.data[j]):
            NewState=np.append(NewState,order_states[i])
StateRegion=[]
for i in range(0,len(NewState)):
    for j in range(0,len(df.data)):
        if (NewState[i] +'['+'edit'+']'==df.data[j]):
            n = j + 1
    StateRegion=np.append(StateRegion,n)
StateName=[]
CityName=[]
for i in range(0,len(NewState)-1):
    for j in range(0,len(df.data)):
        if (NewState[i] +'['+'edit'+']'==df.data[j]):
            n = j + 1
    while (NewState[i+1] +'['+'edit'+']'!=df.data[n]):
          StateName=np.append(StateName,NewState[i])
          CityName=np.append(CityName,df.data[n])
          n = n + 1
CTYNAME=[]
for i in CityName:
    CTYNAME=np.append(CTYNAME, i.split('(')[0])
city_new=[]
for i in CTYNAME:
    if i[-1]==' ':
        city_new = np.append(city_new,i[:-1])
    else:
        city_new=np.append(city_new,i)

Structuring the Cleaned Dataset

df_final=pd.DataFrame([])
StateName=np.append(StateName,'Wyoming')
city_new=np.append(city_new, 'Laramie')
df_final['State']=StateName
df_final['RegionName']=city_new
df_final.set_index(['State','RegionName'],inplace=True)
df_final.head(20)
Table 2: Clean and structured dataset showing states and their college towns.

Comparing Table 1 to Table 2, we see that the dataset in Table 2 represents a clean and structured dataset that can now be used for further analysis.

Data Analysis Using Wrangled Dataset

a) University Towns in Kansas and Delaware

df_final.loc['Kansas']
Table 3: College towns in the state of Kansas.
df_final.loc['Delaware']
Table 4: College towns in the state of Delaware.

b) Number of University Towns in a given state

This can be obtained using this code:

df_final.reset_index().groupby('State').count()
Table 5: Sample output for a number of college towns in each state.

In summary, we’ve shown how data wrangling can be used to convert an uncleaned, unstructured dataset into a tidied form that is ready for further analysis. The process of data wrangling is a critical step for any data scientist. Knowing how to wrangle and clean data will enable you to derive critical insights from your data that would otherwise be hidden.

The dataset and code for this article can be downloaded from this repository: https://github.com/bot13956/unstructured_data_university_towns.

Data Wrangling
Data Analysis
Data Preparation
Data Science
Data Cleaning
Recommended from ReadMedium