The web content discusses the use of Decision Trees and Random Forests for classification and regression tasks, highlighting their speed, interpretability, and performance in comparison to other machine learning algorithms like Logistic Regression and Neural Networks.
Abstract
The article "Decision Trees and Random Forests for Classification and Regression pt.1" provides an overview of Decision Trees and Random Forests as powerful and interpretable machine learning tools. It emphasizes that Decision Trees are faster to train compared to simple neural networks and offer comparable performance. The interpretability of Decision Trees makes them suitable for variable selection and feature engineering. The article also demonstrates the use of Decision Trees with the scikit-learn package and visualizes a learned tree using Graphviz. It compares the classification performance of Decision Trees to logistic regression and neural networks using the Receiver Operating Characteristic (ROC) and the area under the ROC curve (AUROC), showing that Decision Trees can outperform logistic regression and are more interpretable than neural networks, despite the latter's slightly better performance. The article concludes by recommending Decision Trees as essential tools for data scientists and machine learning engineers, with a teaser for the next article on Random Forests and Bootstrap Aggregation for improved predictive power and robustness.
Opinions
The author suggests that Decision Trees are a preferable alternative to Logistic Regression and Neural Networks for classification and regression tasks, especially when interpretability is crucial.
Decision Trees are praised for their ease of use, speed of training, and ability to handle smaller datasets effectively.
The article conveys that Decision Trees can be used for feature selection by identifying the most important variables based on their location in the tree.
The author is of the opinion that the performance of Decision Trees, as measured by the AUROC, is competitive with other classifiers, and they can even outperform logistic regression.
While acknowledging that neural networks can achieve slightly better performance, the author maintains that the interpretability and training speed of Decision Trees make them a valuable tool.
The author expresses enthusiasm for Random Forests, hinting at their superiority in handling larger, more complex datasets, which will be discussed in the subsequent article.
Decision Trees and Random Forests for Classification and Regression pt.1
A light through a random forest.
Highlights:
Want to use something more interpretable, something that trains faster and performs pretty much as well as good old Logistic Regression or even Neural Networks? You should consider Decision Trees for classification and regression. Part 2 on Random Forests here.
Much faster to train versus simple neural networks for comparable performance (The time complexity of decision trees is a function of [number of features, number of rows in dataset], whereas for neural networks it is a function of [number of features, number of rows in dataset, number of hidden layers, number of nodes in each hidden layer])
Easily interpretable, suitable for variable selection
Fairly robust on smaller datasets
Decision Trees and Decision Tree Learning are simple to understand
Decision Trees and their extension Random Forests are robust and easy-to-interpret machine learning algorithms for Classification and Regression tasks. Decision Trees and Decision Tree Learning together comprise a simple and fast way of learning a function that maps data x to outputs y, where x can be a mix of categorical and numeric variables and y can be categorical for classification, or numeric for regression. Methods such as SVMs, Logistic Regression and Deep Neural Nets pretty much do the same thing. However, despite their power on larger and more complex datasets, they are much harder to interpret, and neural nets can take many iterations and hyperparameter adjustments before a good result is had. As well, one of the biggest advantages of using Decision Trees and Random Forests is the ease with which we can see which features or variables contribute to the classification or regression, and their relative importance based on their depth in the tree.
We’ll look at decision trees in this article and compare their classification performance using information derived from the Receiver Operating Characteristic (ROC) against logistic regression and a simple neural net.
Decision Trees:
A Decision Tree is a tree (and a type of directed, acyclic graph) in which the nodes represent decisions (a square box), random transitions (a circular box) or terminal nodes, and the edges or branches are binary (yes/no, true/false), representing possible paths from one node to another. The specific type of decision tree used for machine learning contains no random transitions. To use a decision tree for classification or regression, one grabs a row of data or a set of features, starts at the root, and then follows the branches through each subsequent decision node until a terminal node is reached. The process is very intuitive and easy to interpret, which allows trained decision trees to be used for variable selection or, more generally, feature engineering. To illustrate this, suppose you wanted to buy a new car to drive up a random dirt road into a random forest. You have a dataset of different cars with three features: Car Drive Type (Categorical), Displacement (Numeric) and Clearance (Numeric). An example of a learned decision tree to help you make your decision is below:
I like to go off-roading when I’m not making machines learn.
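To make the root-to-leaf walk concrete, here is a minimal sketch of a hand-written tree for the car example. The feature names follow the text, but the tree structure, the thresholds and the "buy" / "do not buy" outcomes are all invented for illustration and are not taken from the figure.

```python
# Hypothetical tree for the car example. Every threshold and outcome below is
# an assumption made up for illustration; it is not the tree from the figure.
car_tree = {
    "feature": "drive_type", "equals": "4WD",          # categorical test
    "yes": {
        "feature": "clearance_cm", "at_most": 20.0,    # numeric test
        "yes": {"leaf": "do not buy"},
        "no": {"leaf": "buy"},
    },
    "no": {"leaf": "do not buy"},
}


def predict(node, row):
    """Walk from the root to a terminal node, answering one yes/no test per step."""
    while "leaf" not in node:
        if "equals" in node:
            answer = row[node["feature"]] == node["equals"]
        else:
            answer = row[node["feature"]] <= node["at_most"]
        node = node["yes"] if answer else node["no"]
    return node["leaf"]


truck = {"drive_type": "4WD", "displacement_l": 3.5, "clearance_cm": 24.0}
sedan = {"drive_type": "FWD", "displacement_l": 2.0, "clearance_cm": 14.0}

truck_decision = predict(car_tree, truck)
sedan_decision = predict(car_tree, sedan)
print(truck_decision, "|", sedan_decision)
```

Note that displacement never appears in this toy tree: a feature that is never tested on any path has no influence on the prediction, which is the idea behind using trees for variable selection.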
The root or topmost node of the tree (and there is only one root) is the decision node that splits the dataset on the variable or feature that gives the best value of the splitting metric, evaluated over the subsets of the dataset that result from the split. The decision tree learns by recursively splitting the dataset from the root onwards (in a greedy, node-by-node manner) according to the splitting metric at each decision node. A node becomes a terminal node when no further split improves the splitting metric, for example when every sample that reaches it belongs to the same class, or when a stopping rule such as a maximum depth is hit. Popular splitting criteria include minimizing the Gini Impurity (used by CART) or maximizing the Information Gain (used by ID3, C4.5).
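The two splitting metrics and one greedy split can be sketched in a few lines of plain Python. This is an illustrative sketch for a single numeric feature, not the implementation used by any library; the function names and the toy data are assumptions.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini Impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits; Information Gain is the drop in entropy after a split."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def weighted_impurity(left, right, impurity):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


def best_threshold(values, labels, impurity=gini):
    """Greedy search over midpoints between sorted distinct values of one feature."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, impurity(labels))  # no split is the baseline to beat
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        score = weighted_impurity(left, right, impurity)
        if score < best[1]:
            best = (t, score)
    return best


# Toy data (hypothetical): two well-separated groups.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]

threshold, split_gini = best_threshold(values, labels)
left = [y for x, y in zip(values, labels) if x <= threshold]
right = [y for x, y in zip(values, labels) if x > threshold]
info_gain = entropy(labels) - weighted_impurity(left, right, entropy)
print(threshold, split_gini, info_gain)
```

A full learner would repeat this search over every feature at every node and recurse into the two children until the nodes are pure or a stopping rule applies.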
Example:
Now that we have seen how decision tree training works, let’s use the scikit-learn package (scikit-learn contains many nice data processing, dimensionality reduction, clustering and shallow machine learning tools) and implement a simple decision tree for classification on the Wine Dataset (13 features/variables with 3 classes), and then visualize the learned tree with Graphviz.
No need to waste money on wine tasting courses.
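A minimal sketch of this workflow follows. The 20% test split matches the comparison later in the article, while the `random_state` values, the stratified split and the variable names are assumptions, so the resulting tree may differ in detail from the one in the figure.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Wine data set: 178 rows, 13 numeric features, 3 classes.
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target,
    test_size=0.2,          # 20% held out for testing
    random_state=0,         # assumption: fixed seed for repeatability
    stratify=wine.target,   # assumption: keep class proportions in both splits
)

# CART-style tree using the Gini Impurity as the splitting metric.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print("test accuracy:", test_accuracy)

# Export the learned tree as Graphviz DOT source (returned as a string
# because out_file is None).
dot_source = export_graphviz(
    clf,
    out_file=None,
    feature_names=wine.feature_names,
    class_names=wine.target_names,
    filled=True,
    rounded=True,
)
print(dot_source[:200])
```

Saving `dot_source` to a file such as `wine_tree.dot` and running `dot -Tpng wine_tree.dot -o wine_tree.png` (with Graphviz installed) renders the tree as an image.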
Right away, from the learned decision tree we can see that the feature proline (proline content in the wine) is tested at the root node, so it is the first feature used to separate all three wine classes. The Gini Impurity of 0.658 shown at the root describes how mixed the three classes are before any split is made; it is not a score for proline itself. The quantity 1 - 0.658 = 0.342 is the chance of labeling a sample correctly by guessing at random in proportion to the class frequencies, so 34.2% is a no-information baseline rather than an upper limit on what a proline-only model could achieve. Then, from the root we see that the classes split off further with the od280/od315_of_dilute_wines feature and the flavanoids feature. We can also see that a majority of the class_1 wine (81.7%) have an alcohol content ≤ 13.175 and a flavanoids content ≤ 0.795. Also, recall that there are 13 features in the original dataset, but the decision tree picked only a subset of 7 features for the classification.
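The root impurity can be checked by hand from the class counts alone. The sketch below uses the class counts of the full scikit-learn Wine data set (59, 71 and 48 samples); that the root of the figure's tree saw this same class mix is an assumption.

```python
# Class counts of the full Wine data set (assumed to match the root node).
counts = {"class_0": 59, "class_1": 71, "class_2": 48}
total = sum(counts.values())
proportions = [c / total for c in counts.values()]

# Gini Impurity of the unsplit data: how mixed the classes are at the root.
gini_root = 1.0 - sum(p ** 2 for p in proportions)

# 1 - Gini: expected accuracy when guessing labels at random in proportion
# to the class frequencies (a no-information baseline).
random_guess_accuracy = 1.0 - gini_root

# Always predicting the largest class does slightly better than that.
majority_baseline = max(proportions)

print(round(gini_root, 3), round(random_guess_accuracy, 3), round(majority_baseline, 3))
```

Any useful split, on proline or on any other feature, should push accuracy well above both baselines.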
We can use this information to select which features/variables in a general dataset are important (in cases where there may be non-useful, redundant or noisy features) for a more advanced model such as a deep neural net. We’ll see how to do this with more robust random forests in part 2. The learned decision tree can be used to predict data using a simple function call on a row of input data. A regression tree for predicting numerical output values from input features can be created very easily as well: check out this scikit-learn tutorial.
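As a rough sketch of both ideas, the snippet below lists the features a fitted tree actually tests, ranks them by scikit-learn's impurity-based importance, and predicts the class of one row. It fits on the full data set with a fixed seed purely for illustration, so the selected features need not match the 7 in the figure.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
clf = DecisionTreeClassifier(random_state=0).fit(wine.data, wine.target)

# tree_.feature holds the feature index tested at each node; terminal nodes
# carry a negative placeholder, so keep only the non-negative entries.
used_features = sorted({wine.feature_names[i] for i in clf.tree_.feature if i >= 0})
print("features used by the tree:", used_features)

# Impurity-based importances sum to 1; unused features score 0.
ranked = sorted(zip(clf.feature_importances_, wine.feature_names), reverse=True)
for importance, name in ranked:
    if importance > 0:
        print(f"{name}: {importance:.3f}")

# Predicting is a single call on a 2-D array with one row per sample.
row = wine.data[:1]
predicted_class = clf.predict(row)[0]
print("predicted:", wine.target_names[predicted_class])
```

The names in `used_features` are the candidates to carry forward into a larger model; features with zero importance were never needed by this particular tree.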
Performance:
The Receiver Operating Characteristic (ROC) is a plot that can be used to determine the performance and robustness of a binary or multi-class classifier. The x-axis is the false positive rate (FPR) and the y-axis is the true positive rate (TPR). The ROC plot gives information about the true positive/negative rate and the false positive/negative rate, as well as something called the C-statistic or area under the ROC curve (AUROC), for each class predicted by the classifier (there is one ROC for each class predicted by the classifier). The AUROC is defined as the probability that a randomly selected positive sample will have a higher prediction value than a randomly selected negative sample. A quote from this article on the subject:
“Assuming that one is not interested in a specific trade-off between true positive rate and false positive rate (that is, a particular point on the ROC curve), the AUC [AUROC] is useful in that it aggregates performance across the entire range of trade-offs. Interpretation of the AUC is easy: the higher the AUC, the better, with 0.50 indicating random performance and 1.00 denoting perfect performance.”
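The probabilistic definition of the AUROC can be computed directly by comparing every positive sample with every negative sample. The sketch below does exactly that in plain Python; the function name and the toy labels and scores are assumptions, and ties are counted as half a win, which is the usual convention.

```python
def auroc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical scores): 3 of the 4 pairs are ranked correctly.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auroc = auroc_by_pairs(y_true, scores)
print(auroc)

# A scorer that ranks every positive above every negative reaches 1.0.
perfect = auroc_by_pairs([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
print(perfect)
```

For a multi-class problem the same calculation is repeated once per class, treating that class as positive and all other classes as negative.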
The AUROCs for different classifiers can be compared against each other. Alternatives to this metric include using the scikit-learn confusion matrix calculator for the prediction results and using the resultant matrix to derive basic positive/negative accuracy, F1-scores, etc.
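As a small sketch of that alternative, the snippet below builds a confusion matrix and derives accuracy and per-class F1-scores with scikit-learn. The true and predicted labels are made-up values for illustration, not results from the wine classifier.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical labels for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

accuracy = accuracy_score(y_true, y_pred)            # 4 of 6 correct
f1_per_class = f1_score(y_true, y_pred, average=None)  # one F1-score per class
print(accuracy, f1_per_class)
```

Unlike the AUROC, these numbers depend on the hard class predictions, so they describe a single operating point rather than the whole range of trade-offs.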
The above output is the AUROC for each class predicted by the decision tree. To compare, in scikit-learn we re-run the same dataset with a 20% test set on a logistic regression model and a shallow MLP neural net model. The logistic regression model’s performance (with all default parameters) is below:
And for the shallow MLP net: hidden layers = 2, nodes per layer = 25, optimizer = adam, activation = logistic, iterations = 50000:
{0:1.0, 1:1.0, 2:1.0, 'micro':1.0}
We can see that the decision tree outperforms logistic regression, and although the neural net beats the decision tree, the tree is still much faster to train and has the advantage of interpretability.
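A comparison along these lines can be sketched as follows. The MLP settings are the ones listed above and the logistic regression uses default parameters; the seeds, the stratified split and the helper name `per_class_auroc` are assumptions, so the numbers will not necessarily match those reported here.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=0, stratify=wine.target
)
# One 0/1 column per class, so each class gets its own one-vs-rest ROC.
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])


def per_class_auroc(model):
    """Fit the model and return the AUROC per class plus the micro-average."""
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)
    result = {k: roc_auc_score(y_test_bin[:, k], proba[:, k]) for k in range(3)}
    # Micro-average: pool every (sample, class) pair into one binary problem.
    result["micro"] = roc_auc_score(y_test_bin.ravel(), proba.ravel())
    return result


models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    # Defaults may emit a convergence warning on unscaled data.
    "logistic regression": LogisticRegression(),
    "shallow MLP": MLPClassifier(
        hidden_layer_sizes=(25, 25), activation="logistic",
        solver="adam", max_iter=50000, random_state=0,
    ),
}

results = {name: per_class_auroc(model) for name, model in models.items()}
for name, scores in results.items():
    print(name, scores)
```

Because the test set holds only 36 samples, the ranking of the three models can change with the seed, so it is worth repeating the run over several splits before drawing conclusions.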
What’s Next:
Decision Trees should always be in the toolkit of the adept Data Scientist and Machine Learning Engineer. For a more thorough user guide/manual of how to use decision trees in scikit-learn, refer to http://scikit-learn.org/stable/modules/tree.html#decision-trees.
However, despite the ease of use and apparent power of decision trees, their reliance on a greedy strategy for learning may cause the tree to split on the wrong features at some nodes or cause the tree to overfit. Stay tuned: in the next article I will demonstrate ensembles of decision trees, the so-called Random Forests, together with Bootstrap Aggregation, which when used together drastically increase predictive power and robustness for larger and more complicated datasets. As well, we’ll see how we can use random forests for robust Variable Selection.