avatarSatyam Kumar

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

5329

Abstract

93de">The dataset contains features including URL of web pages, boilerplate text, categories from alchemy API, other features about the webpage. This competition also provides raw HTML files of web pages, that will help to learn to work with HTML documents. The target class is binary, whether the webpage is evergreen or not.</p><div id="bb26" class="link-block"> <a href="https://www.kaggle.com/c/stumbleupon/overview"> <div> <div> <h2>StumbleUpon Evergreen Classification Challenge</h2> <div><h3>Build a classifier to categorize webpages as evergreen or non-evergreen</h3></div> <div><p>www.kaggle.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*bh6FgB8UAo9rom7D)"></div> </div> </div> </a> </div><h1 id="0158">Avito Duplicate Ads Detection:</h1><figure id="4102"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*gthLuKplozABDtr5UCsIjA.png"><figcaption>(<a href="https://www.kaggle.com/c/avito-duplicate-ads-detection">Source</a>), Challenge Logo from Kaggle</figcaption></figure><p id="440a"><a href="https://www.avito.ru/">Avito</a> is a Russian classified advertisements website with sections devoted to general goods for sale, jobs, real estate, personals, cars for sale, and services. The objective of this challenge is to develop a model that can automatically spot duplicate ads. It will help Avito to make their platform easier for buyers and make their next purchase with an honest seller.</p><figure id="09c2"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*yKYtZMJvPkLPmJZ_a_8ZVA.png"><figcaption>(<a href="https://www.kaggle.com/c/avito-duplicate-ads-detection/data">Source</a>), Schema of Data</figcaption></figure><p id="5d3d">In this data, you have given information about two items, and you need to predict whether they are duplicates of each other or not. The features of the items include text title, description, images, and other information. You can combine concepts of NLP and CNN techniques to develop a model for this case study.</p><div id="ca04" class="link-block"> <a href="https://www.kaggle.com/c/avito-duplicate-ads-detection/overview"> <div> <div> <h2>Avito Duplicate Ads Detection</h2> <div><h3>Can you detect duplicitous duplicate ads?</h3></div> <div><p>www.kaggle.com?</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*mdhcd6omOwLYRa17)"></div> </div> </div> </a> </div><h1 id="bcdc">Quora Question Pair Similarity:</h1><p id="5c94"><a href="https://www.quora.com/">Quora</a> is a platform to ask questions and connect with people who contribute unique insights and quality answers. In this challenge, Quora wants Kagglers to develop a model that can classify whether two given questions are duplicates or not. This model can help Quora to develop a platform with an improved experience for Quora writers, seekers, and readers.</p><p id="44eb">The dataset of this challenge has features including the two questions and their unique ids. The target class label is ‘is_duplicate’, which denotes whether the two questions are duplicates or not. You can employ NLP techniques to generate features for the two texts and train a model out of it to classify whether questions are duplicates or not.</p><div id="ae5f" class="link-block"> <a href="https://www.kaggle.com/c/quora-question-pairs/overview"> <div> <div> <h2>Quora Question Pairs</h2> <div><h3>Can you identify question pairs that have the same intent?</h3></div> <div><p>www.kaggle.com?</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*_NV-sICIu_ugVLHC)"></div> </div> </div> </a> </div><h1 id="f7fe">Quora Insincere Questions Classification:</h1><p id="b2d0">Social Media and the Internet are full of toxicity. Quora as a platform wants to be free with toxic comments and questions. In this competition, Quora asked Kagglers to build a model that can detect toxic content to improve online conversations.</p><p id="76e1">Dataset for this competition has features including question id, question text, and binary target class ‘insincere’ that has a value of <code>1</code>, otherwise <code>0</code> . According to Quora, an Insincere statement is that makes any statement rather than looking for any answer, or has had a non-neutral tone, Is disparaging or inflammatory, Isn’t grounded in reality, or uses sexual content.</p><div id="72a8" class="link-block"> <a href="https://www.kaggle.com/c/quora-insincere-questions-classification/overview"> <div> <div> <h2>Quora Insincere Questions Classification</h2> <div><h3>Detect toxic content to improve online conversations</h3></div> <div><p>www.kaggle.com</p></div> </d

Options

iv> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*R1N69UizfbPnGEhu)"></div> </div> </div> </a> </div><h1 id="c022">Toxic Comment Classification Challenge:</h1><p id="a8de">This challenge is organized by Jigsaw to identify and classify toxic online comments. You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The text comments in the data are in the English language.</p><p id="0751">The target class label is the type of toxicity: toxic, severe toxic, obscene, threat, insult, identity hate.</p><p id="a058">You are expected to develop a model that predicts the probability of each type of toxicity for each comment.</p><div id="c295" class="link-block"> <a href="https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview"> <div> <div> <h2>Toxic Comment Classification Challenge</h2> <div><h3>Identify and classify toxic online comments</h3></div> <div><p>www.kaggle.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*Nh5YTcrEQOJ30p8e)"></div> </div> </div> </a> </div><h1 id="dd89">Jigsaw Multilingual Toxic Comment Classification</h1><p id="112d">This challenge is organized by Jigsaw and is very similar to the last challenge. In the last challenge, the text is in the English language, but in this challenge, Jigsaw gives a better problem statement, where your test data will be Wikipedia talk page comments in several different languages.</p><div id="2703" class="link-block"> <a href="https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification"> <div> <div> <h2>Jigsaw Multilingual Toxic Comment Classification</h2> <div><h3>Use TPUs to identify toxicity comments across multiple languages</h3></div> <div><p>www.kaggle.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*iFDsNadeim98BcEu)"></div> </div> </div> </a> </div><h1 id="6c25">Keyword Extraction — StackOverflow Tag Prediction:</h1><p id="1baa">This challenge was organized by Facebook to recruit data science top performers. The objective of the challenge is to identify keywords and tags from millions of text questions. This case study is a multi-label classification problem, where multiple tags need to predict for a given text.</p><p id="58c7">The features of the data include unique id, question title, body, and the target class is ‘tags’. One needs to predict multiple tags for a given text.</p><div id="dfb7" class="link-block"> <a href="https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data"> <div> <div> <h2>Facebook Recruiting III - Keyword Extraction</h2> <div><h3>Identify keywords and tags from millions of text questions</h3></div> <div><p>www.kaggle.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*OX_xuhVeEw4CVT4f)"></div> </div> </div> </a> </div><h1 id="8cee">Two Sigma Connect: Rental Listing Inquiries:</h1><p id="c65a">This competition was organized by <a href="https://www.twosigma.com/">Two Sigma</a>, and the dataset for this competition is from an apartment listing website — <a href="https://www.renthop.com/">renthop.com</a>. The objective of this competition is to predict how much interest will a new rental listing on RentHop receive.</p><p id="9036">The prediction will be based on the listing content like text description, photos, number of bedrooms, price, etc, and you need to predict the target class label — <code><b>interest label </b></code>. The target class has 3 categories ‘high’, ‘medium’, ‘low’, and hence it is a multiclass classification challenge.</p><div id="6aa1" class="link-block"> <a href="https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/overview"> <div> <div> <h2>Two Sigma Connect: Rental Listing Inquiries</h2> <div><h3>How much interest will a new rental listing on RentHop receive?</h3></div> <div><p>www.kaggle.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*D6oNFlWOTGye51gl)"></div> </div> </div> </a> </div><h1 id="ed6d">Conclusion:</h1><p id="a320">In this article, we have discussed 10 top NLP projects hosted on Kaggle. Implementing these case study requires good knowledge of the NLP, and one can learn a lot from it. An aspiring data scientists should try these NLP projects rather than working on basic projects such as Sentiment Analysis.</p><p id="9d29" type="7">Thank You for Reading</p></article></body>

Top 10 NLP Projects on Kaggle to Strengthen Your Portfolio

Ever wish to work on an NLP project, this article is for you

Photo by Clark Tibbs on Unsplash

Natural Language Processing (NLP) is a subfield of Artificial Intelligence involving the interactions between a computer and human language, in particular how to program or train an AI model to understand and implement the natural human language.

There are various topics or challenges in NLP tasks such as Tokenization, Lemmitazation, Text Encoding techniques, LSTM, RNN, etc. Combinations of these concepts can be used for NLP scenarios such as Question Answering, Text Classification and Summarization, Sentiment Analysis, Language Recognition and Translation, OCR, and many more.

Did you ever wish you could make an NLP project of your own? Well, this article covers best NLP cometitions.

This article will cover the problem definition and dataset of the top 10 NLP projects, that covers most of the NLP topics. One can learn a lot from working on these past NLP competitions.

Tweet Sentiment Extraction:

Sentiment Analysis is one of the famous case studies every data scientists have tried in their career. Tweet Sentiment Extraction is a competition organized by Kaggle, objective of the competition is to predict the word or phrase from the tweet that exemplifies the provided sentiment. This case study is similar to the Question-Answer system.

(Image by Author), Sample Data from Tweet Sentiment Extraction Challenge

Sentiment Analysis objective is to predict only the sentiment of the text, instead, this case study is a step ahead, that predicts the phrase or words in the text that results in the given sentiment. Here, the target class label is the column selected_text , and given a sentiment column and a text column, you need to extract the target column.

Predict Closed Questions on StackOverflow:

Stack Overflow is the largest, most trusted online question and answers community for professional and enthusiast programmers to learn, share​ ​their programming ​knowledge, and build their careers. This competition was organized on Kaggle to predict whether new questions asked on Stack Overflow is closed or not.

Thousands of questions are asked by developers on the StackOverflow platform, so of them being constructive and good questions, and some of the questions being waste or lacked information. This competition was to predict whether a new question or post will be closed or not, and also predict the reason to close the post.

The dataset has several features including the text of question title, the text of question body, tags, Post Creation Date, Owner User Id, Owner Creation Date, and many more. The target class label is columns ‘OpenStatus’, and also includes additional information about Post Id, Post Closed Date, if the post is closed.

StumbleUpon Evergreen Classification Challenge:

StumbleUpon was a discovery and advertisement engine that pushed relevant, high-quality web pages and media recommendations to its users. For this challenge, you have been given web pages from the StumbleUpon site, and the task was to build a classifier to categorize webpages as evergreen or non-evergreen.

The dataset contains features including URL of web pages, boilerplate text, categories from alchemy API, other features about the webpage. This competition also provides raw HTML files of web pages, that will help to learn to work with HTML documents. The target class is binary, whether the webpage is evergreen or not.

Avito Duplicate Ads Detection:

(Source), Challenge Logo from Kaggle

Avito is a Russian classified advertisements website with sections devoted to general goods for sale, jobs, real estate, personals, cars for sale, and services. The objective of this challenge is to develop a model that can automatically spot duplicate ads. It will help Avito to make their platform easier for buyers and make their next purchase with an honest seller.

(Source), Schema of Data

In this data, you have given information about two items, and you need to predict whether they are duplicates of each other or not. The features of the items include text title, description, images, and other information. You can combine concepts of NLP and CNN techniques to develop a model for this case study.

Quora Question Pair Similarity:

Quora is a platform to ask questions and connect with people who contribute unique insights and quality answers. In this challenge, Quora wants Kagglers to develop a model that can classify whether two given questions are duplicates or not. This model can help Quora to develop a platform with an improved experience for Quora writers, seekers, and readers.

The dataset of this challenge has features including the two questions and their unique ids. The target class label is ‘is_duplicate’, which denotes whether the two questions are duplicates or not. You can employ NLP techniques to generate features for the two texts and train a model out of it to classify whether questions are duplicates or not.

Quora Insincere Questions Classification:

Social Media and the Internet are full of toxicity. Quora as a platform wants to be free with toxic comments and questions. In this competition, Quora asked Kagglers to build a model that can detect toxic content to improve online conversations.

Dataset for this competition has features including question id, question text, and binary target class ‘insincere’ that has a value of 1, otherwise 0 . According to Quora, an Insincere statement is that makes any statement rather than looking for any answer, or has had a non-neutral tone, Is disparaging or inflammatory, Isn’t grounded in reality, or uses sexual content.

Toxic Comment Classification Challenge:

This challenge is organized by Jigsaw to identify and classify toxic online comments. You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The text comments in the data are in the English language.

The target class label is the type of toxicity: toxic, severe toxic, obscene, threat, insult, identity hate.

You are expected to develop a model that predicts the probability of each type of toxicity for each comment.

Jigsaw Multilingual Toxic Comment Classification

This challenge is organized by Jigsaw and is very similar to the last challenge. In the last challenge, the text is in the English language, but in this challenge, Jigsaw gives a better problem statement, where your test data will be Wikipedia talk page comments in several different languages.

Keyword Extraction — StackOverflow Tag Prediction:

This challenge was organized by Facebook to recruit data science top performers. The objective of the challenge is to identify keywords and tags from millions of text questions. This case study is a multi-label classification problem, where multiple tags need to predict for a given text.

The features of the data include unique id, question title, body, and the target class is ‘tags’. One needs to predict multiple tags for a given text.

Two Sigma Connect: Rental Listing Inquiries:

This competition was organized by Two Sigma, and the dataset for this competition is from an apartment listing website — renthop.com. The objective of this competition is to predict how much interest will a new rental listing on RentHop receive.

The prediction will be based on the listing content like text description, photos, number of bedrooms, price, etc, and you need to predict the target class label — interest label . The target class has 3 categories ‘high’, ‘medium’, ‘low’, and hence it is a multiclass classification challenge.

Conclusion:

In this article, we have discussed 10 top NLP projects hosted on Kaggle. Implementing these case study requires good knowledge of the NLP, and one can learn a lot from it. An aspiring data scientists should try these NLP projects rather than working on basic projects such as Sentiment Analysis.

Thank You for Reading

Artificial Intelligence
Machine Learning
Data Science
Education
NLP
Recommended from ReadMedium