Naina Chaturvedi

Day 4: 30 days of Natural Language Processing Series with Projects

SpaCy part 2…

Pic credits : ResearchGate

Welcome back peeps. This is the second part on spaCy, where we will cover some of its basic and advanced concepts.

Let’s dive into the second part of spaCy:

Linguistic annotations

Linguistic annotation in natural language processing (NLP) refers to the process of adding structured information to text in order to capture its linguistic features, such as syntax, semantics, and pragmatics.

This information is typically added in the form of tags, labels, or other metadata that can be used to analyze the text and perform specific NLP tasks. Some common types of linguistic annotations in NLP include:

  1. Part-of-Speech (POS) Tagging: The process of identifying the grammatical role of each word in a sentence (e.g. noun, verb, adjective).
  2. Syntactic Parsing: The process of analyzing the grammatical structure of a sentence to determine the relationships between words.
  3. Semantic Role Labelling (SRL): The process of identifying the semantic role each phrase plays relative to a predicate (e.g. who did what to whom).
  4. Coreference resolution: The process of identifying and linking mentions of the same entity across a text.
  5. Named-entity recognition (NER): The process of identifying and classifying named entities, such as persons, organizations, and locations.
  6. Chunking: The process of grouping adjacent words into “chunks”, such as noun phrases.

In spaCy, these annotations are available as attributes on each token, giving a detailed peek into a text’s grammatical structure.

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Neywork is a city of dreams. It has a population of 20.1 million")
for t in doc:
    print(t.text, t.pos_, t.dep_)

Output —

Neywork PROPN nsubj
is AUX ROOT
a DET det
city NOUN attr
of ADP prep
dreams NOUN pobj
. PUNCT punct
It PRON nsubj
has VERB ROOT
a DET det
population NOUN dobj
of ADP prep
20.1 NUM compound
million NUM pobj
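
To see why these annotations are handy, here is a small sketch that treats the printed triples as plain Python data and filters them. It does not call spaCy at all: the annotations list is copied by hand from the output above, and the variable names are made up for this illustration.

```python
from collections import defaultdict

# (text, pos, dep) triples copied by hand from the output above
annotations = [
    ("Neywork", "PROPN", "nsubj"), ("is", "AUX", "ROOT"), ("a", "DET", "det"),
    ("city", "NOUN", "attr"), ("of", "ADP", "prep"), ("dreams", "NOUN", "pobj"),
    (".", "PUNCT", "punct"), ("It", "PRON", "nsubj"), ("has", "VERB", "ROOT"),
    ("a", "DET", "det"), ("population", "NOUN", "dobj"), ("of", "ADP", "prep"),
    ("20.1", "NUM", "compound"), ("million", "NUM", "pobj"),
]

# Group the token texts by their part-of-speech tag
by_pos = defaultdict(list)
for text, pos, dep in annotations:
    by_pos[pos].append(text)

nouns = by_pos["NOUN"]
# Every parsed sentence has one ROOT, so the roots tell us there are two sentences
roots = [text for text, pos, dep in annotations if dep == "ROOT"]
# Tokens acting as the subject of their sentence
subjects = [text for text, pos, dep in annotations if dep == "nsubj"]

print(nouns)
print(roots)
print(subjects)
```

The same filtering works directly on a Doc by reading t.pos_ and t.dep_ inside the loop instead of unpacking tuples.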

Part of Speech tagging

It’s an NLP technique that automatically assigns a POS tag to every word of a sentence. It is used in tasks like language understanding, information extraction, and feature engineering.

Pic credits : Research Gate

Part-of-speech (POS) tagging is a natural language processing (NLP) task that involves identifying the grammatical role of each word in a sentence, such as a noun, verb, adjective, or adverb. POS tagging is an important step in many NLP tasks, such as syntactic parsing, semantic role labeling, and text generation.

There are different methods for POS tagging. Classic approaches are based on statistical sequence models, specifically Hidden Markov Models (HMM) and Conditional Random Fields (CRF), while more recent taggers, including the ones shipped with spaCy, rely on neural networks.
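
To make the HMM idea a bit more concrete, here is a minimal sketch of Viterbi decoding, the usual way to pick the most likely tag sequence from an HMM. The three tags and all the probabilities are invented toy numbers, not values estimated from a real corpus. A real tagger would also use log probabilities and smoothing.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a non-empty list of words."""
    # best[i][tag] = (probability of the best path ending in tag at position i,
    #                 previous tag on that path)
    best = [{tag: (start_p[tag] * emit_p[tag].get(words[0], 0.0), None)
             for tag in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            # Pick the previous tag that gives the highest path probability
            prev = max(tags, key=lambda p: best[-1][p][0] * trans_p[p][tag])
            score = best[-1][prev][0] * trans_p[prev][tag] * emit_p[tag].get(word, 0.0)
            column[tag] = (score, prev)
        best.append(column)

    # Backtrack from the best final tag to recover the whole path
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


# Toy model (all numbers are assumptions made up for this example)
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9, "dog": 0.05, "barks": 0.05},
    "NOUN": {"the": 0.05, "dog": 0.8, "barks": 0.15},
    "VERB": {"the": 0.05, "dog": 0.1, "barks": 0.85},
}

tagged = viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p)
print(tagged)
```

The transition table is what lets the model prefer a noun after a determiner, even for a word it has rarely seen as a noun.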

The basic process of POS tagging is as follows:

  1. Tokenization: The first step is to break the text down into individual words, which will be the basic units for POS tagging.
  2. Feature extraction: The next step is to extract relevant features from each word that can be used to predict its POS tag. These features can include the word itself, its prefix and suffix, its capitalization, and its context (e.g. the words that come before and after it).
  3. Training: The next step is to train a statistical model on a large annotated corpus of text, where each word has been manually labeled with its POS tag. The model learns to predict the POS tag of a word based on the features extracted in step 2.
  4. Tagging: Once the model is trained, it can be used to predict the POS tag of new words. This is done by applying the model to each word in the new text, and using the model’s prediction as the POS tag.
  5. Evaluation: The final step is to evaluate the performance of the model on a separate set of text that has been manually annotated with POS tags. This measures how well the model predicts the correct POS tag for each word.

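
The five steps above can be sketched end to end with a deliberately simple model. This is a most-frequent-tag baseline, not an HMM or CRF: it remembers the tag seen most often for each word and falls back to the overall most common tag for unknown words. The tiny corpora and all function names are hypothetical, written only for this illustration.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labelled data: sentences as lists of (word, tag) pairs
train_sents = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("birds", "NOUN"), ("eat", "VERB"), ("seeds", "NOUN")],
]
test_sents = [
    [("the", "DET"), ("cat", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("bird", "NOUN"), ("sings", "VERB")],
]

def tokenize(text):
    """Step 1: split on whitespace (real tokenizers are more careful)."""
    return text.split()

def features(words, i):
    """Step 2: features for the word at position i."""
    word = words[i]
    return {
        "word": word.lower(),
        "prefix": word[:1],
        "suffix": word[-3:],
        "is_capitalized": word[:1].isupper(),
        "prev": words[i - 1].lower() if i > 0 else "<START>",
        "next": words[i + 1].lower() if i < len(words) - 1 else "<END>",
    }

def train(sents):
    """Step 3: count tags per word; this baseline only uses the 'word' feature."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in sents:
        words = [w for w, _ in sent]
        for i, (_, gold_tag) in enumerate(sent):
            counts[features(words, i)["word"]][gold_tag] += 1
            all_tags[gold_tag] += 1
    model = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]
    return model, default_tag

def tag_words(words, model, default_tag):
    """Step 4: predict one tag per word."""
    return [model.get(features(words, i)["word"], default_tag)
            for i in range(len(words))]

def accuracy(sents, model, default_tag):
    """Step 5: share of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in sents:
        words = [w for w, _ in sent]
        gold = [t for _, t in sent]
        predicted = tag_words(words, model, default_tag)
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return correct / total

model, default_tag = train(train_sents)
predicted = tag_words(tokenize("The dog sleeps"), model, default_tag)
score = accuracy(test_sents, model, default_tag)
print(predicted, score)
```

The unseen verb "sings" gets the fallback tag and is counted as an error, which is exactly the kind of mistake that context and suffix features help a stronger model avoid.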
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Neywork is the city of dreams. It has a population of 20.1 million")
for t in doc:
    print(t.text, t.lemma_, t.pos_, t.tag_, t.dep_,
            t.shape_, t.is_alpha, t.is_stop)

Output —

TEXT LEMMA POS TAG DEP SHAPE ALPHA STOP
-------------------------------------------
Neywork Neywork PROPN NNP nsubj Xxxxx True False
is be AUX VBZ ROOT xx True True
the the DET DT det xxx True True
city city NOUN NN attr xxxx True False
of of ADP IN prep xx True True
dreams dream NOUN NNS pobj xxxx True False
. . PUNCT . punct . False False
It it PRON PRP nsubj Xx True True
has have VERB VBZ ROOT xxx True True
a a DET DT det x True True
population population NOUN NN dobj xxxx True False
of of ADP IN prep xx True True
20.1 20.1 NUM CD compound dd.d False False
million million NUM CD pobj xxxx True False

Named Entities

Named entities are real-world objects that have a name, such as persons, locations, and organizations. In spaCy they are available through the ents property of a Doc (doc.ents).

Pic credits : tds
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("NewYork is the city of dreams. It has a population of 20.1 million")
for e in doc.ents:
    print(e.text, e.start_char, e.end_char, e.label_)

Output —

NewYork 0 7 ORG
20.1 million 54 66 CARDINAL
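
The start_char and end_char values are character offsets into the original text, so slicing the text with them should return the entity string. Here is a minimal sketch with plain Python strings (no spaCy needed), assuming the sentence reads "It has a population", which is the wording the offsets above correspond to. The helper char_span is hypothetical.

```python
text = "NewYork is the city of dreams. It has a population of 20.1 million"

# (entity text, label) pairs copied from the output above.
entities = [("NewYork", "ORG"), ("20.1 million", "CARDINAL")]


def char_span(text, entity_text):
    """Return (start_char, end_char) of the first occurrence of entity_text."""
    start = text.find(entity_text)
    if start == -1:
        raise ValueError(f"{entity_text!r} not found in text")
    return start, start + len(entity_text)


for entity_text, label in entities:
    start, end = char_span(text, entity_text)
    # Slicing the text with the offsets recovers the entity string.
    assert text[start:end] == entity_text
    print(entity_text, start, end, label)
```
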

Word Vectors

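To preserve semantic information, word2vec represents each word as a dense vector of real numbers, typically with 32 or more dimensions, instead of a single number. Words with similar meanings end up with similar vectors. Note that the example loads en_core_web_md, because the small en_core_web_sm pipeline does not ship with word vectors.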

Pic credits : ResearchGate
import spacy
nlp = spacy.load("en_core_web_md")
tokens = nlp("python java php newyork data")
for t in tokens:
    print(t.text, t.has_vector, t.vector_norm, t.is_oov)

Output —

python True 7.2741637 False
java True 7.489749 False
php True 8.073938 False
newyork True 6.6223097 False
data True 7.1505103 False

Similarity

Similarity is an NLP technique used to compare words, text spans, and documents and to measure how similar they are to each other.

Pic credits : peltariaon
import spacy
nlp = spacy.load("en_core_web_md")  
doc1 = nlp("Books are great")
doc2 = nlp("Wild is a great book by Cheryl Strayed")
print(doc1, "<->", doc2, doc1.similarity(doc2))

Output —

Books are great <-> Wild is a great book by Cheryl Strayed 0.7321118470519549
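
By default, spaCy estimates similarity as the cosine similarity of averaged word vectors. The idea can be sketched in plain Python. The three-dimensional vectors below are made-up toy values (real pipelines use hundreds of dimensions), and the helpers average and cosine are hypothetical names.

```python
import math

# Hypothetical 3-dimensional toy vectors, invented for illustration only.
VECTORS = {
    "books": [0.9, 0.1, 0.0],
    "book": [0.8, 0.2, 0.1],
    "great": [0.1, 0.9, 0.2],
    "wild": [0.0, 0.3, 0.9],
}


def average(words):
    """Average the vectors of all known words (zero vector if none are known)."""
    dims = len(next(iter(VECTORS.values())))
    known = [VECTORS[w] for w in words if w in VECTORS]
    if not known:
        return [0.0] * dims
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]


def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = average(["books", "great"])
doc2 = average(["book", "great"])
doc3 = average(["wild"])
print("doc1 vs doc2:", cosine(doc1, doc2))
print("doc1 vs doc3:", cosine(doc1, doc3))
```

In this toy setup the first pair scores higher because the two texts share "great" and use near-identical vectors for "books" and "book".
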

Vocabs, Lexemes and Matcher

In natural language processing (NLP), a vocab is a collection of words and their corresponding numerical IDs, which represent the words in a computational model. A lexeme is the base form of a word. In spaCy, a Lexeme is an entry in the vocab that holds context-independent information about a word type, such as its text, shape, and flags like is_alpha.

  • The Matcher in spaCy is a rule-based utility class that matches sequences of tokens in a document based on their text, POS tags, or other token attributes. It can find specific words, phrases, or patterns in a text, which is useful for tasks such as rule-based entity extraction.
  • For example, the Matcher can identify specific named entities in a text by matching patterns of tokens based on their text, POS tags, and other attributes. The patterns are written by hand and added to the Matcher, which then applies them to new text. No training is involved.
  • In summary: a vocab is a collection of words and their numerical IDs, a lexeme is a context-independent entry for a word, and the Matcher finds sequences of tokens that fit hand-written patterns.

spaCy stores shared data in the vocabulary and encodes all strings as hash values. Each lexeme exposes, among others, the following attributes:

  • Orth: It’s the hash value of the lexeme.
  • Shape: It’s the abstract word shape of the lexeme.
  • Prefix: By default, the first letter of the word string.
  • Suffix: By default, the last three letters of the word string.
Pic credits : ResearchGate
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I love books")
print(doc.vocab.strings["books"])  
print(doc.vocab.strings[17837313582142403287])

Output —

17837313582142403287
books
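
The two-way lookup above (string to hash, hash back to string) can be pictured with a toy string store. This sketch is an assumption-laden illustration: the class ToyStringStore is hypothetical, and it uses SHA-256 from the standard library instead of spaCy's own hash function, so the numbers it produces will not match the ones printed above.

```python
import hashlib


class ToyStringStore:
    """Toy two-way mapping between strings and 64-bit integer hashes."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_string(text):
        # Take the first 8 bytes of a SHA-256 digest as a 64-bit integer.
        digest = hashlib.sha256(text.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, text):
        """Store the string and return its hash."""
        key = self.hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        # A string is hashed; an integer is resolved back to its stored string.
        if isinstance(key, str):
            return self.hash_string(key)
        return self._by_hash[key]


store = ToyStringStore()
books_hash = store.add("books")
print(store["books"])     # the integer hash
print(store[books_hash])  # the original string
```

The hash can always be computed from the string, but going from hash to string only works for strings that were stored earlier.
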
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I love books")
for word in doc:
    # Look up the context-independent lexeme for each token
    l = doc.vocab[word.text]
    print(l.text, l.orth, l.shape_, l.prefix_, l.suffix_,
          l.is_alpha, l.is_digit, l.is_title, l.lang_)

Output —

I 4690420944186131903 X I I True False True en
love 3702023516439754181 xxxx l ove True False False en
books 17837313582142403287 xxxx b oks True False False en
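
The shape, prefix, and suffix columns can be approximated in plain Python. This is a simplified sketch, not spaCy's implementation: the helper names word_shape and lexeme_features are hypothetical, and the rule of keeping at most four repeats of the same shape character is an assumption based on outputs such as "population" mapping to "xxxx".

```python
def word_shape(text):
    """Map letters to x/X and digits to d, keeping at most 4 repeats in a row."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            shape_char = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            shape_char = "d"
        else:
            shape_char = ch  # punctuation and symbols are kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


def lexeme_features(text):
    """Return (shape, prefix, suffix) using the defaults described above."""
    return word_shape(text), text[:1], text[-3:]


for word in ["I", "love", "books", "Neywork", "20.1"]:
    print(word, *lexeme_features(word))
```
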

Spans from Matcher —

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
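# LOWER compares against the lowercase form of each token, so the pattern values must be lowercase
matcher.add("PERSON", [[{"LOWER": "steve"}, {"LOWER": "jobs"}]])
doc = nlp("Steve Jobs was one of the founders of Apple")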
m = matcher(doc)
for match_id, start, end in m:
    span = Span(doc, start, end, label=match_id)
    print(span.text, span.label_)
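
Conceptually, a pattern like this one is a sliding-window comparison over the tokens. Here is a rough sketch in plain Python, assuming whitespace tokenization and a pattern that only checks lowercase text. The function find_matches is a hypothetical helper, and the real Matcher supports many more attributes and operators.

```python
def find_matches(tokens, pattern):
    """Return (start, end) token indices where the lowercase pattern matches."""
    matches = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = [tok.lower() for tok in tokens[start:start + width]]
        if window == pattern:
            matches.append((start, start + width))
    return matches


tokens = "Steve Jobs was one of the founders of Apple".split()
for start, end in find_matches(tokens, ["steve", "jobs"]):
    # As with spaCy spans, end is exclusive.
    print(" ".join(tokens[start:end]), start, end)
```
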

Day 5: Coming soon!

Follow for more updates, stay tuned, and of course let me end this post with a quote by Steve Jobs ;)

“Your time is limited, so don’t waste it living someone else’s life.”

For other projects, tune to —

Build Machine Learning Pipelines( With Code)

Recurrent Neural Network with Keras

Clustering Geolocation Data in Python using DBSCAN and K-Means

Facial Expression Recognition using Keras

Hyperparameter Tuning with Keras Tuner

Custom Layers in Keras
