Naina Chaturvedi

Day 5: 30 days of Natural Language Processing Series with Projects

SpaCy with a project — Part 1

Welcome back, peeps. Hope all's well at your end. Work is getting hectic every day, so the Day 5 post is a bit delayed. Anyway, in the last few posts we learned the basics of SpaCy (go through those posts before starting this project).

In this post we are going to build a project in which we will implement —

  1. Tokenization
  2. POS tagging
  3. Chunking
  4. Named Entity Recognition (NER)

For the basics of NLP and the prerequisites, read the post below (complete it before starting this project):

Let’s dive in!

  • Tokenization is the process of breaking a string of text into individual units called tokens, such as words, numbers, and punctuation marks.
  • POS (part-of-speech) tagging is the process of labeling each token with its part of speech, such as noun, verb, or adjective.
  • Chunking is the process of grouping adjacent tokens into short phrases (“chunks”), such as the noun phrase “a new bag”.
  • Named Entity Recognition (NER) is the process of identifying and classifying named entities, such as people, organizations, and locations, in a text.
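To make the first definition concrete, here is a minimal sketch of tokenization using only a regular expression. It is a toy illustration, not SpaCy's tokenizer: the function name `simple_tokenize` and the pattern are assumptions made for this example, and SpaCy applies language-specific rules that handle contractions, URLs and many other cases this pattern ignores.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (toy example)."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("My cats will not touch the food.")
print(tokens)
# ['My', 'cats', 'will', 'not', 'touch', 'the', 'food', '.']
```

Note that the final period becomes its own token, which is also how the SpaCy parse later treats punctuation.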

Import Necessary Libraries

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy
import explacy
from spacy import displacy
%matplotlib inline

Load Dataset

df=pd.read_csv('../Path to file/Data.csv')
df.shape

Output —

(568454, 10)
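Since the project only uses the review text and the rating, it may be enough to load just those two columns. The sketch below shows the `usecols` parameter of `pd.read_csv` on a small in-memory CSV; the two rows are made up for illustration, and `io.StringIO` stands in for the real file path.

```python
import io
import pandas as pd

# Two made-up rows standing in for the real file; only the column names
# follow the dataset used in this post.
csv_text = (
    "Id,ProductId,Score,Text\n"
    "1,P001,5,Great taste\n"
    "2,P002,2,Too salty\n"
)

# usecols keeps only the listed columns while parsing.
sample_df = pd.read_csv(io.StringIO(csv_text), usecols=["Text", "Score"])
print(sample_df.shape)  # (2, 2)
```

On a file with more than half a million rows, skipping the unused columns can noticeably reduce memory use.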

Analyze Data and Visualization

# Get all the information about the data
df.info()

Output —

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      568454 non-null  int64 
 1   ProductId               568454 non-null  object
 2   UserId                  568454 non-null  object
 3   ProfileName             568438 non-null  object
 4   HelpfulnessNumerator    568454 non-null  int64 
 5   HelpfulnessDenominator  568454 non-null  int64 
 6   Score                   568454 non-null  int64 
 7   Time                    568454 non-null  int64 
 8   Summary                 568427 non-null  object
 9   Text                    568454 non-null  object
dtypes: int64(5), object(5)
memory usage: 43.4+ MB

Get the Statistics —

# Get the statistics
df.describe()

Find duplicates —

# Find duplicates
df.duplicated().sum()

Output —

0

Find null values —

# Find if null values exists
df.isna().sum()

Output —

Id                         0
ProductId                  0
UserId                     0
ProfileName               16
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64

Get Categorical Columns in the Dataset —

# Get categorical Features
categorical_features = [feature for feature in df.columns if df[feature].dtypes == 'O']
print('Categorical Columns in the Dataset: ',categorical_features)

Output —

Categorical Columns in the Dataset:  ['ProductId', 'UserId', 'ProfileName', 'Summary', 'Text']

Extract Text and Score Columns —

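# Keep only the Text and Score columns. Neither has missing values
# (see the null counts above), so dropna() is a safeguard and removes no rows.
new_df = df[['Text','Score']].dropna()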

Visualization —

plt.figure(figsize=(8,6),dpi=100)
score_index=df['Score'].value_counts().index
score_values = df['Score'].value_counts().values
sns.barplot(x=score_index,y=score_values,palette ='mako',edgecolor='black',linewidth=0.8)
plt.xlabel('Score Ratings')
plt.ylabel('Count')
plt.title('Score Ratings count')
plt.xticks(rotation =90)

plt.show()

Output: a bar chart of the number of reviews for each score rating.

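Separate the scores into two classes:

# Scores above 3 become 1 (positive); scores of 3 or below become 0 (negative).
# A single vectorized assignment avoids chained indexing, which pandas may
# apply to a temporary copy instead of new_df.
new_df['Score'] = (new_df['Score'] > 3).astype(int)
new_df.Score.value_counts()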

Output —

1    443777
0    124677
Name: Score, dtype: int64
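
As a small illustration of this labeling step, the sketch below applies the same threshold to a handful of made-up scores. The toy DataFrame, its values, and the `Label` column name are assumptions for demonstration; only the rule (scores above 3 map to 1, everything else to 0) comes from the post.

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the Text and Score columns
toy_df = pd.DataFrame({
    "Text": ["great", "okay", "bad", "loved it", "meh"],
    "Score": [5, 3, 1, 4, 2],
})

# A vectorized comparison yields booleans; astype(int) turns them into 0/1
toy_df["Label"] = (toy_df["Score"] > 3).astype(int)

# Count how many reviews fall into each class, ordered by label (0, then 1)
label_counts = toy_df["Label"].value_counts().sort_index()
print(toy_df)
print(label_counts)
```

Writing the result to a separate column keeps the original rating available, which can be handy when checking the threshold later.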

Visualize again!

plt.figure(figsize=(6,4),dpi=100)
score_index=new_df['Score'].value_counts().index
score_values = new_df['Score'].value_counts().values
sns.barplot(x=score_index,y=score_values,palette ='mako',edgecolor='black',linewidth=0.8)
plt.xlabel('Score Ratings')
plt.ylabel('Count')
plt.title('Score Ratings count')
plt.xticks(rotation =90)
plt.show()

Output: a bar chart of the number of reviews in each of the two classes (0 and 1).

Split Train and Validation data —

# Prepare train dataset
positive_train_df=new_df[new_df.Score==1][:50000]
negative_train_df=new_df[new_df.Score==0][:50000]
# Prepare Validation dataset
positive_val_df=new_df[new_df.Score==1][50000:70000]
negative_val_df=new_df[new_df.Score==0][50000:70000]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
val_df=pd.concat([positive_val_df, negative_val_df])
val_df.shape

Output —

(40000, 2)
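
Slicing each class separately is what keeps the split balanced and keeps training and validation rows apart. The sketch below replays the idea on a tiny made-up DataFrame; the data, the slice sizes, and the `train_df` name are assumptions chosen so the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical labeled data: 6 positive and 6 negative reviews
toy_df = pd.DataFrame({
    "Text": [f"review {i}" for i in range(12)],
    "Score": [1, 0] * 6,
})

positive = toy_df[toy_df.Score == 1]
negative = toy_df[toy_df.Score == 0]

# First 3 rows of each class for training, the next 2 of each for validation
train_df = pd.concat([positive[:3], negative[:3]])
val_df = pd.concat([positive[3:5], negative[3:5]])

# The index labels show that the two sets share no rows
overlap = set(train_df.index) & set(val_df.index)
print(train_df.shape, val_df.shape, overlap)
```

Because the rows are taken in file order, shuffling first (for example with `toy_df.sample(frac=1, random_state=0)`) may be worth considering if the file is sorted in some meaningful way.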

SpaCy

# load the model
tn = spacy.load('en_core_web_sm')
r=new_df.Text[12]
r

Output —

"My cats have been happily eating Felidae Platinum for more than two years. I just got a new bag and the shape of the food is different. They tried the new food when I first put it in their bowls and now the bowls sit full and the kitties will not touch the food. I've noticed similar reviews related to formula changes in the past. Unfortunately, I now need to find a new food that my cats will eat."

Tokenization —

# tokenize the review
tr = tn(r)
tr

Output —

My cats have been happily eating Felidae Platinum for more than two years. I just got a new bag and the shape of the food is different. They tried the new food when I first put it in their bowls and now the bowls sit full and the kitties will not touch the food. I've noticed similar reviews related to formula changes in the past. Unfortunately, I now need to find a new food that my cats will eat.
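
The output looks identical to the input because displaying a spaCy Doc shows its original text; the individual tokens only become visible when iterating over it, for example with `[token.text for token in tr]`. The sketch below imitates that behaviour with a deliberately simple regex tokenizer from the standard library. It is a toy stand-in, not spaCy's tokenizer, and the pattern and the `toy_tokenize` name are assumptions.

```python
import re

# Toy pattern: a run of word characters, a clitic such as 've, or a single
# punctuation mark, each followed by optional trailing whitespace
TOKEN_PATTERN = re.compile(r"(\w+|'\w+|[^\w\s])(\s*)")

def toy_tokenize(text):
    """Return (token, trailing_whitespace) pairs for a piece of text."""
    return [(m.group(1), m.group(2)) for m in TOKEN_PATTERN.finditer(text)]

sentence = "I've noticed similar reviews related to formula changes in the past."
pairs = toy_tokenize(sentence)
tokens = [token for token, _ in pairs]
print(tokens)

# Keeping the trailing whitespace makes the tokenization reversible
rebuilt = "".join(token + space for token, space in pairs)
print(rebuilt == sentence)
```

Storing the whitespace next to each token is the design choice that lets the original text be rebuilt exactly, which is why a tokenized document can still print as plain text.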

Explacy —

explacy.print_parse_info(tn, 'Newyork is a beautiful city')

Output —

Dep tree Token     Dep type Lemma     Part of Sp
──────── ───────── ──────── ───────── ──────────
    ┌─►  Newyork   nsubj    Newyork   PROPN     
┌───┴──  is        ROOT     be        AUX       
│  ┌──►  a         det      a         DET       
│  │┌─►  beautiful amod     beautiful ADJ       
└─►└┴──  city      attr     city      NOUN

Use explacy for the review —

explacy.print_parse_info(tn,new_df.Text[12])

Output —

Dep tree          Token         Dep type Lemma         Part of Sp
───────────────── ───────────── ──────── ───────────── ──────────
              ┌─► My            poss     my            PRON      
           ┌─►└── cats          nsubj    cat           NOUN      
           │┌───► have          aux      have          AUX       
           ││┌──► been          aux      be            AUX       
           │││┌─► happily       advmod   happily       ADV       
   ┌┬──────┼┴┴┴── eating        ROOT     eat           VERB      
   ││      │  ┌─► Felidae       compound Felidae       PROPN     
   ││      └─►└── Platinum      dobj     Platinum      PROPN     
   │└─►┌───────── for           prep     for           ADP       
   │   │     ┌──► more          amod     more          ADJ       
   │   │     │┌─► than          quantmod than          SCONJ     
   │   │  ┌─►└┴── two           nummod   two           NUM       
   │   └─►└────── years         pobj     year          NOUN      
   └────────────► .             punct    .             PUNCT     
             ┌──► I             nsubj    I             PRON      
             │┌─► just          advmod   just          ADV       
┌┬┬──────────┴┴── got           ROOT     get           VERB      
│││          ┌──► a             det      a             DET       
│││          │┌─► new           amod     new           ADJ       
││└─►┌───────┴┼── bag           dobj     bag           NOUN      
││   │        └─► and           cc       and           CCONJ     
││   │        ┌─► the           det      the           DET       
││   └─►┌─────┴── shape         conj     shape         NOUN      
││      └─►┌───── of            prep     of            ADP       
││         │  ┌─► the           det      the           DET       
││         └─►└── food          pobj     food          NOUN      
│└───────────►┌── is            ccomp    be            AUX       
│             └─► different     acomp    different     ADJ       
└───────────────► .             punct    .             PUNCT     
              ┌─► They          nsubj    they          PRON      
 ┌────────┬───┴── tried         ROOT     try           VERB      
 │        │  ┌──► the           det      the           DET       
 │        │  │┌─► new           amod     new           ADJ       
 │        └─►└┴── food          dobj     food          NOUN      
 │          ┌───► when          advmod   when          ADV       
 │          │┌──► I             nsubj    I             PRON      
 │          ││┌─► first         advmod   first         ADV       
 └─►┌──┬┬───┴┴┼── put           advcl    put           VERB      
    │  ││     └─► it            dobj     it            PRON      
    │  │└─►┌───── in            prep     in            ADP       
    │  │   │  ┌─► their         poss     their         PRON      
    │  │   └─►└── bowls         pobj     bowl          NOUN      
    │  └────────► and           cc       and           CCONJ     
    │     ┌─────► now           advmod   now           ADV       
    │     │   ┌─► the           det      the           DET       
    │     │┌─►└── bowls         nsubj    bowl          NOUN      
    └─►┌──┴┴─┬┬── sit           conj     sit           VERB      
       │     │└─► full          acomp    full          ADJ       
       │     └──► and           cc       and           CCONJ     
       │      ┌─► the           det      the           DET       
       │   ┌─►└── kitties       nsubj    kitty         NOUN      
       │   │ ┌──► will          aux      will          AUX       
       │   │ │┌─► not           neg      not           PART      
       └─►┌┼─┴┴── touch         conj     touch         VERB      
          ││  ┌─► the           det      the           DET       
          │└─►└── food          dobj     food          NOUN      
          └─────► .             punct    .             PUNCT     
             ┌──► I             nsubj    I             PRON      
             │┌─► 've           aux      've           AUX       
┌┬───────────┴┴── noticed       ROOT     notice        VERB      
││            ┌─► similar       amod     similar       ADJ       
│└─►┌─────────┴── reviews       dobj     review        NOUN      
│   └─►┌┬──────── related       acl      relate        VERB      
│      │└─►┌───── to            prep     to            ADP       
│      │   │  ┌─► formula       compound formula       NOUN      
│      │   └─►└── changes       pobj     change        NOUN      
│      └──►┌───── in            prep     in            ADP       
│          │  ┌─► the           det      the           DET       
│          └─►└── past          pobj     past          NOUN      
└───────────────► .             punct    .             PUNCT     
           ┌────► Unfortunately advmod   unfortunately ADV       
           │┌───► ,             punct    ,             PUNCT     
           ││┌──► I             nsubj    I             PRON      
           │││┌─► now           advmod   now           ADV       
┌┬─────────┴┴┴┴── need          ROOT     need          VERB      
││            ┌─► to            aux      to            PART      
│└─►┌─────────┴── find          xcomp    find          VERB      
│   │        ┌──► a             det      a             DET       
│   │        │┌─► new           amod     new           ADJ       
│   └─►┌─────┴┴── food          dobj     food          NOUN      
│      │  ┌─────► that          dobj     that          DET       
│      │  │   ┌─► my            poss     my            PRON      
│      │  │┌─►└── cats          nsubj    cat           NOUN      
│      │  ││  ┌─► will          aux      will          AUX       
│      └─►└┴──┴── eat           relcl    eat           VERB      
└───────────────► .             punct    .             PUNCT

Part of Speech Tagging —

# Collect one record per token, then build the DataFrame in a single step
tt = pd.DataFrame([
    {
        'text': token.text,
        'tag': token.tag_,              # fine-grained tag
        'dep': token.dep_,              # dependency relation to the head
        'shape': token.shape_,
        'is_alpha': token.is_alpha,
        'is_stop': token.is_stop,
        'is_punctuation': token.is_punct,
        'lemma': token.lemma_,
        'pos': token.pos_,              # coarse part of speech
    }
    for token in tr
])
tt[:20]

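Output: a table with one row per token (first 20 shown) and the columns text, tag, dep, shape, is_alpha, is_stop, is_punctuation, lemma, and pos.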
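
One quick way to read a token table like this is to tally the part-of-speech tags. The sketch below does so for the first sentence of the review. The (token, tag) pairs are copied by hand from the explacy parse shown earlier rather than computed, and the variable names are illustrative.

```python
from collections import Counter

# (token, coarse POS) pairs copied from the explacy parse of the first sentence
tagged = [
    ("My", "PRON"), ("cats", "NOUN"), ("have", "AUX"), ("been", "AUX"),
    ("happily", "ADV"), ("eating", "VERB"), ("Felidae", "PROPN"),
    ("Platinum", "PROPN"), ("for", "ADP"), ("more", "ADJ"),
    ("than", "SCONJ"), ("two", "NUM"), ("years", "NOUN"), (".", "PUNCT"),
]

# Frequency of each part of speech in the sentence
pos_counts = Counter(tag for _, tag in tagged)
print(pos_counts)

# Content words usually carry most of a review's meaning
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}
content_words = [token for token, tag in tagged if tag in CONTENT_TAGS]
print(content_words)
```

With the real DataFrame, `tt['pos'].value_counts()` would give the same kind of tally.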

Named Entity Recognition (NER):

# NER: highlight the entities that the model found in the review
displacy.render(tr, style='ent', jupyter=True)

Output —

My cats have been happily eating Felidae Platinum [PERSON] for more than two years [DATE]. I just got a new bag and the shape of the food is different. They tried the new food when I first [ORDINAL] put it in their bowls and now the bowls sit full and the kitties will not touch the food. I’ve noticed similar reviews related to formula changes in the past. Unfortunately, I now need to find a new food that my cats will eat.

Each label in brackets follows the entity it was assigned to. Note that the small model tags the cat food “Felidae Platinum” as PERSON, a reminder that statistical NER can misclassify product names.
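
Behind a rendering like this, each entity is just a character span plus a label; in spaCy these are available on `doc.ents` as `ent.start_char`, `ent.end_char`, and `ent.label_`. The standard-library sketch below shows how such spans can be turned into an annotated string. The helper names are hypothetical and the spans are written by hand: in particular, where the DATE span starts is an assumption, since the rendered output only shows where it ends.

```python
def find_span(text, phrase, label):
    """Build a (start, end, label) character span for the first match of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)

def annotate(text, spans):
    """Insert ' [LABEL]' right after the end of each span."""
    pieces, cursor = [], 0
    for _start, end, label in sorted(spans):
        pieces.append(text[cursor:end])   # text up to the end of the entity
        pieces.append(f" [{label}]")      # followed by its label
        cursor = end
    pieces.append(text[cursor:])          # whatever remains after the last span
    return "".join(pieces)

text = "My cats have been happily eating Felidae Platinum for more than two years."

# Hand-written spans for illustration; a real pipeline would supply the offsets
spans = [
    find_span(text, "Felidae Platinum", "PERSON"),
    find_span(text, "two years", "DATE"),  # assumed start of the DATE span
]

annotated = annotate(text, spans)
print(annotated)
```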

Chunking —

# Chunking: collect one record per noun chunk
nc = pd.DataFrame([
    {
        'text': chunk.text,
        'root.text': chunk.root.text,            # head word of the chunk
        'root.dep_': chunk.root.dep_,            # how the chunk attaches to the sentence
        'root.head.text': chunk.root.head.text,  # word the chunk depends on
        'root': chunk.root,                      # the root Token object itself
    }
    for chunk in tr.noun_chunks
])
nc[:20]
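
spaCy derives `noun_chunks` from the dependency parse. To show the underlying idea of grouping tokens into phrases, the sketch below uses a much cruder rule that looks only at part-of-speech tags: a run of determiners, adjectives, and numbers that ends in a noun forms a chunk, and a pronoun forms a chunk on its own. The function name and the rule are assumptions, the tags are copied by hand from the explacy parse of the second sentence, and the rule would split possessives such as "My cats", which spaCy keeps together.

```python
MODIFIERS = {"DET", "ADJ", "NUM"}
HEADS = {"NOUN", "PROPN"}

def toy_noun_chunks(tagged):
    """Group (token, tag) pairs into noun chunks using POS tags only."""
    chunks, run = [], []

    def close_run():
        # Keep the run only if it ends in a noun; dangling modifiers are dropped
        if run and run[-1][1] in HEADS:
            chunks.append(" ".join(token for token, _ in run))
        run.clear()

    for token, tag in tagged:
        if tag in HEADS:
            run.append((token, tag))
        elif tag in MODIFIERS:
            if run and run[-1][1] in HEADS:
                close_run()  # a modifier after a noun starts a new chunk
            run.append((token, tag))
        else:
            close_run()
            if tag == "PRON":
                chunks.append(token)  # a pronoun forms a chunk on its own
    close_run()
    return chunks

# Tags copied from the explacy parse of the second sentence of the review
tagged = [
    ("I", "PRON"), ("just", "ADV"), ("got", "VERB"), ("a", "DET"),
    ("new", "ADJ"), ("bag", "NOUN"), ("and", "CCONJ"), ("the", "DET"),
    ("shape", "NOUN"), ("of", "ADP"), ("the", "DET"), ("food", "NOUN"),
    ("is", "AUX"), ("different", "ADJ"), (".", "PUNCT"),
]

chunks = toy_noun_chunks(tagged)
print(chunks)
```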

Day 6 : Part 2 : Coming soon

Follow for more updates, stay tuned, and of course let me end this post with a quote by Steve Jobs ;)

“Your time is limited, so don’t waste it living someone else’s life.”

For other projects, check out:

Build Machine Learning Pipelines (With Code)

Recurrent Neural Network with Keras

Clustering Geolocation Data in Python using DBSCAN and K-Means

Facial Expression Recognition using Keras

Hyperparameter Tuning with Keras Tuner

Custom Layers in Keras
