Character-level Convolutional Networks for Text Classification

Summary

The website content discusses the evolution of text classification models, with a focus on the effectiveness of character-level convolutional neural networks (ConvNets) for handling large-scale datasets and capturing text structure directly from raw character inputs.

Abstract

The article delves into the progression of text classification techniques, starting from simple bag of words models to more intricate methods like character-level ConvNets. It highlights the benefits of using raw character streams as input, which allows models to learn about word structure, misspellings, and emoticons, and to be language-agnostic. The paper "Character-level Convolutional Networks for Text Classification" by Xiang Zhang, Junbo Zhao, and Yann LeCun is examined for its comparison of various models across datasets ranging from 120,000 to 3,600,000 samples. The character-level ConvNet, consisting of 6 convolutional layers and 3 fully-connected layers, is shown to outperform traditional models on larger datasets and is particularly effective for less curated texts, such as Amazon reviews. The article also notes that the choice of alphabet, such as not distinguishing between upper and lower case characters, can improve the ConvNet's performance.

Opinions

The trend in text classification is moving towards models that can learn from raw data, capturing more information about text structure.
Character-level ConvNets are seen as a logical next step in text classification, as they can learn directly from a stream of characters.
Traditional models like bag of words and n-grams are still effective for smaller datasets, but their performance plateaus with increased data size.
The character-level ConvNet's ability to handle different languages and texts with non-standard elements like misspellings and emoticons is considered an additional benefit.
The performance of character-level ConvNets is particularly notable in datasets with less curated content, suggesting robustness in handling varied and informal language use.
The article suggests that simplifying the alphabet by ignoring case can lead to better performance of the ConvNet, indicating that a smaller input alphabet might be beneficial.

Character-level Convolutional Networks for Text Classification

Text classification is one of the common natural language understanding problems. Over the last few decades, machine learning researchers have moved from the simplest “bag of words” model to more sophisticated models for text classification.

The bag of words model uses only information about which words appear in the text. Adding TFIDF to the bag of words helps to track the relevance of each word to the document. A bag of n-grams captures partial information about the structure of the text. Recurrent neural networks, like LSTM, can capture dependencies between words even if they are far from each other. An LSTM learns the structure of sentences from the raw data, but we still have to provide a list of words. The Word2vec algorithm adds knowledge about word similarity, which helps a lot. Convolutional neural networks can also help to process word-based datasets.
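To make the first three representations concrete, here is a minimal sketch in plain Python. The whitespace tokenizer, the function names, and the particular TFIDF formula (term frequency times the log of inverse document frequency) are assumptions chosen for illustration, not the exact variants used in the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def bag_of_ngrams(tokens, n=1):
    # Count contiguous windows of n tokens; n=1 gives the plain bag of words.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tfidf(docs_tokens):
    # One common textbook variant: (count / doc length) * log(N / document frequency).
    n_docs = len(docs_tokens)
    doc_freq = Counter(term for tokens in docs_tokens for term in set(tokens))
    weights = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the movie was good", "the movie was bad"]
tokens = [tokenize(doc) for doc in docs]
unigrams = bag_of_ngrams(tokens[0], n=1)  # word presence only, order is lost
bigrams = bag_of_ngrams(tokens[0], n=2)   # keeps some local word order
weights = tfidf(tokens)                   # words shared by every document get weight 0
```

In this toy corpus the words "the", "movie" and "was" occur in both documents, so their TFIDF weight is zero, while "good" and "bad" keep a positive weight.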

The trend is to learn from raw data and to give machine learning models access to more information about text structure. A logical next step would be to feed a stream of characters to the model and let it learn all about the words. What can be cruder than a stream of characters? An additional benefit is that the model can learn misspellings and emoticons. Also, the same model can be used for different languages, even those where segmentation into words is not possible.
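Feeding a stream of characters to a model usually means turning each character into a one-hot vector over a fixed alphabet. The sketch below shows one way to do that in plain Python. The alphabet, the `max_length` value, the lowercasing step, and the choice to encode unknown characters and padding as all-zero vectors are assumptions for illustration.

```python
import string

# Hypothetical alphabet: lowercase letters, digits, punctuation and the space character.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(text, max_length=16):
    # Return max_length one-hot rows, one per character position.
    rows = []
    for ch in text.lower()[:max_length]:
        row = [0] * len(ALPHABET)
        index = CHAR_TO_INDEX.get(ch)
        if index is not None:
            row[index] = 1  # known character: set exactly one position
        rows.append(row)    # unknown character: row stays all zeros
    while len(rows) < max_length:
        rows.append([0] * len(ALPHABET))  # pad short texts with all-zero rows
    return rows


# Misspellings and emoticons need no special handling: they are just characters.
encoded = encode("gr8 :)", max_length=8)
```

No tokenizer or vocabulary of words is involved, which is why the same encoding can be applied to text where word segmentation is hard.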

The article “Character-level Convolutional Networks for Text Classification” (Xiang Zhang, Junbo Zhao, Yann LeCun) explores the use of character-level ConvNets for text classification. The authors compare the performance of several different models on several large-scale datasets.

The datasets contain from 120,000 to 3,600,000 samples in the training set and from 2 to 14 classes. The smallest dataset is AG’s News: news articles divided into 4 classes, with 30,000 articles for each class in the training set. The largest is Amazon Review Polarity: 2 polarity segments with 1,800,000 reviews in each of them.
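The two extremes of the size range follow directly from the per-class counts quoted above, as this small check shows. The dictionary layout and key names are only illustrative.

```python
# Per-class training counts as quoted in the text; the structure is illustrative.
datasets = {
    "AG's News": {"classes": 4, "train_per_class": 30_000},
    "Amazon Review Polarity": {"classes": 2, "train_per_class": 1_800_000},
}

# Total training samples = number of classes * samples per class.
train_sizes = {
    name: info["classes"] * info["train_per_class"]
    for name, info in datasets.items()
}
```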

The character-level ConvNet was compared with traditional and deep learning models: bag-of-words and its TFIDF variant, bag-of-ngrams and its TFIDF variant, bag-of-means on word embeddings, a word-based ConvNet, and a word-based LSTM.

The character-level ConvNet contains 6 convolutional layers and 3 fully-connected layers.
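To get a feel for how 6 convolutional layers shrink a character sequence before the 3 fully-connected layers see it, the sketch below tracks only the sequence length through the stack. The kernel sizes, the pooling positions, and the input length of 1014 characters are not stated in this article; they are given here as I recall them from the paper's architecture table and should be treated as assumptions.

```python
# (kernel_size, pool_size) for each convolutional layer; None means no pooling.
# These values are assumptions recalled from the paper, not taken from this article.
CONV_LAYERS = [(7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3)]
FULLY_CONNECTED_LAYERS = 3


def output_length(input_length, layers=CONV_LAYERS):
    # Track how many positions remain after each layer of the stack.
    length = input_length
    for kernel_size, pool_size in layers:
        length = length - kernel_size + 1  # stride-1 convolution without padding
        if pool_size is not None:
            length //= pool_size           # non-overlapping max-pooling
    return length


final_length = output_length(1014)
```

Under these assumptions a 1014-character input is reduced to 34 positions, which are then flattened and passed to the fully-connected layers.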

The results are quite interesting. N-gram and n-gram TFIDF models are the best for smaller datasets, up to several hundred thousand samples. But when the dataset size grows to several million samples, the character-level ConvNet performs better.

The ConvNet tends to work better on texts that are less curated. For example, it works better on the Amazon reviews dataset. Amazon reviews are raw user input, whereas users may be more careful when writing on Yahoo! Answers.

The choice of alphabet matters. The ConvNet works better when it does not distinguish between upper and lower case characters.
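The effect of the alphabet choice on the model's input can be illustrated with a short sketch. Both alphabets below are hypothetical and restricted to letters for brevity, and the helper name `to_indices` is made up for this example.

```python
import string

# Two hypothetical alphabets: one ignores case, one distinguishes it.
LOWER_ONLY = string.ascii_lowercase
CASED = string.ascii_lowercase + string.ascii_uppercase


def to_indices(text, alphabet, fold_case):
    # Map each character to its position in the alphabet; -1 marks unknown characters.
    if fold_case:
        text = text.lower()
    lookup = {ch: i for i, ch in enumerate(alphabet)}
    return [lookup.get(ch, -1) for ch in text]


# With case folding, "Great" and "great" produce the same input sequence.
folded_a = to_indices("Great", LOWER_ONLY, fold_case=True)
folded_b = to_indices("great", LOWER_ONLY, fold_case=True)

# With a case-sensitive alphabet, the model sees two different sequences.
cased_a = to_indices("Great", CASED, fold_case=False)
cased_b = to_indices("great", CASED, fold_case=False)
```

The case-sensitive alphabet is twice as large for letters, and the model has to learn on its own that "G" and "g" usually mean the same thing.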