Quant Investment & Machine Learning Interview Questions (1)

Below is a series of questions I found on websites where others shared their recent interview experiences. I hope they are helpful to anyone preparing for interviews :)
▌Source:Point72
Question
Describe the motivation behind random forests. What are two ways in which they improve upon individual decision trees?
Answer
Random forests are used because individual decision trees are prone to overfitting. A random forest trains many decision trees and aggregates their predictions (averaging for regression, majority vote for classification), so it can be used for either task. There are a few main ways in which random forests allow for stronger out-of-sample prediction than individual decision trees.
* As in other ensemble models, using a large set of trees, each trained on a bootstrap resample of the data (bootstrap aggregation, or bagging), yields more consistent results. More specifically, and in contrast to a single decision tree, each tree sees different training data, which improves the bias-variance trade-off (particularly by reducing variance).
* Considering only a random subset of m < p features at each split helps to de-correlate the decision trees, so that very important features do not always appear at the first splits of every tree (which would happen with standalone trees because splits are chosen greedily by information gain).
* They’re fairly easy to implement and fast to run.
* They can produce very interpretable feature-importance values, thereby improving model understandability and feature selection.
The first two bullet points are the main ways random forests improve upon single decision trees.
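To see why de-correlating the trees matters, here is a minimal simulation sketch. It relies on a standard result: the average of B identically distributed predictions, each with variance sigma^2 and pairwise correlation rho, has variance rho * sigma^2 + (1 - rho) * sigma^2 / B. The "trees" below are a toy assumption (each prediction is a shared Gaussian error plus an independent one), not real decision trees, and the function names are made up for illustration.

```python
import random
import statistics


def simulate_ensemble_variance(n_trees, rho, n_trials=5000, seed=0):
    """Empirical variance of the average of n_trees unit-variance
    'tree predictions' whose pairwise correlation is rho."""
    rng = random.Random(seed)
    averages = []
    for _ in range(n_trials):
        shared = rng.gauss(0, 1)  # error component common to every tree
        preds = [
            rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0, 1)
            for _ in range(n_trees)
        ]
        averages.append(sum(preds) / n_trees)
    return statistics.pvariance(averages)


def theoretical_variance(n_trees, rho, sigma2=1.0):
    """rho * sigma^2 + (1 - rho) * sigma^2 / B, with B = n_trees."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees


N_TREES = 25
results = {}
for rho in (0.9, 0.5, 0.1):
    results[rho] = (
        simulate_ensemble_variance(N_TREES, rho),
        theoretical_variance(N_TREES, rho),
    )
    print(f"rho={rho}: simulated={results[rho][0]:.3f}, "
          f"formula={results[rho][1]:.3f}")
```

Adding more trees only shrinks the second term, so highly correlated trees leave most of the variance in place. That is the motivation for restricting each split to m < p features.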
▌Source:Two Sigma
Question
Say you were running a linear regression for a dataset but you accidentally duplicated every data point. What happens to your beta coefficient?
Answer
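The coefficient remains unchanged. The least-squares estimate is β = (XᵀX)⁻¹Xᵀy. Stacking the dataset on top of itself doubles both XᵀX and Xᵀy, so the two factors of 2 cancel and we see that β is the same as before.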
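A quick numerical check of this claim, as a sketch for simple (one-predictor) least squares. The data values and the helper name `ols_slope_intercept` are invented for illustration.

```python
def ols_slope_intercept(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up sample data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

original = ols_slope_intercept(xs, ys)
# Duplicate every data point by concatenating the lists with themselves.
duplicated = ols_slope_intercept(xs + xs, ys + ys)

print("original  :", original)
print("duplicated:", duplicated)
```

Duplication doubles both the covariance sum and the variance sum, so their ratio (the slope) is the same up to floating-point rounding.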
▌Source:Facebook
Question
When performing K-means clustering, how do you choose k?
Answer
The elbow method is the most well-known method for choosing k in k-means clustering. The intuition behind this technique is that the first few clusters will explain a lot of the variation in the data, but past a certain point, the amount of information added by each extra cluster diminishes. Looking at a graph of explained variation (on the y-axis) versus the number of clusters (k), there should be a sharp bend (the "elbow") at some level of k. For example, in the graph that follows, we see a drop off at approximately k = 6.
Note that the variation is quantified by the within-cluster sum of squared errors, so lower values mean that more of the variation is explained. To calculate this error metric, we sum, over every cluster, the squared Euclidean distances from each point to its cluster centroid. A caveat to keep in mind: the assumption of a sharp drop may not necessarily hold. The y-axis may simply keep decreasing slowly (i.e., there is no significant drop).
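The calculation can be sketched as follows, using only the standard library. Everything here is an illustrative assumption: the data are three synthetic, well-separated blobs (so the elbow should appear at k = 3, not the k = 6 of the graph discussed above), and `kmeans_wcss` is a bare-bones Lloyd's algorithm with a deterministic farthest-point initialization, not a production k-means.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_wcss(points, k, n_iter=50):
    """Run a simple k-means and return the within-cluster sum of squares."""
    # Farthest-point initialization: start from the first point, then keep
    # adding the point that is farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]  # three synthetic blobs
points = [
    (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
    for cx, cy in centers
    for _ in range(40)
]

wcss = {k: kmeans_wcss(points, k) for k in range(1, 7)}
for k, value in wcss.items():
    print(f"k={k}: WCSS={value:.1f}")
```

Plotting `wcss` against k would show the steep fall up to k = 3 and a nearly flat curve afterwards. On real data the bend is often much less clear, which is the caveat mentioned above.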
Another popular approach to determining k in k-means clustering is the silhouette method, which measures how similar each point is to its own cluster compared to other clusters. Concretely, for each point it looks at:
s = (x - y) / max(x, y)
where x is the mean distance to the examples of the nearest other cluster, and y is the mean distance to the other examples in the same cluster. The coefficient varies between -1 and 1 for any given point. A value of 1 implies that the point is in the “right” cluster, while a score of -1 implies that it would fit better in a neighboring cluster. By plotting the average score on the y-axis versus k, we can get an idea of the optimal number of clusters based on this metric. Note that the silhouette score is more computationally intensive to calculate for all points than the error metric used in the elbow method.
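As a minimal sketch, the coefficient can be computed directly from its definition, s = (x - y) / max(x, y), with x and y as described above. The tiny dataset and both labelings are made up for illustration, and the code assumes at least two clusters and that not all points coincide. Singleton clusters get a score of 0, which is a common convention.

```python
def euclid(a, b):
    """Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


def silhouette_scores(points, labels):
    """Per-point silhouette coefficients for a given cluster labeling."""
    cluster_ids = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster (excluding p).
        mean_dist = {}
        for c in cluster_ids:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(euclid(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:  # p is alone in its cluster
            scores.append(0.0)
            continue
        y = mean_dist[own]                                      # same cluster
        x = min(d for c, d in mean_dist.items() if c != own)    # nearest other
        scores.append((x - y) / max(x, y))
    return scores


# Two tight, well-separated groups of made-up points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the obvious grouping
bad_labels = [0, 1, 0, 1, 0, 1]   # deliberately scrambled

good = silhouette_scores(points, good_labels)
bad = silhouette_scores(points, bad_labels)
mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
print(f"mean silhouette, good labels: {mean_good:.2f}")
print(f"mean silhouette, bad labels:  {mean_bad:.2f}")
```

The nested loops compare every pair of points, which illustrates why this metric is more expensive than the within-cluster sum of squares.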
Taking a step back, while both the elbow and silhouette methods serve their purpose, sometimes it helps to lean on your business intuition when choosing the number of clusters. For example, if you are clustering patients or customer groups, stakeholders and subject matter experts should have a hunch concerning how many groups they expect to see in the data. Additionally, you can visualize the features for the different groups and assess whether they are indeed behaving similarly. There is no perfect method for picking k, because if there were, it would be a supervised problem and not an unsupervised one.
▌Source:AQR
Question
Say that you are running a multiple linear regression and that you have reason to believe that several of the predictors are correlated. How will the results of the regression be affected if several are indeed correlated? How would you deal with this problem?
Answer
There will be two primary problems when running a regression if several of the predictor variables are correlated. The first is that the coefficient estimates and signs will vary dramatically, depending on which particular variables you include in the model. Certain coefficients may even have confidence intervals that include 0 (meaning it is difficult to tell whether an increase in that X value is associated with an increase or a decrease in Y), and hence results will not be statistically significant. The second is that the resulting p-values will be misleading. For instance, an important variable might have a high p-value and so be deemed statistically insignificant even though it is actually important. It is as if the effect of the correlated features were “split” between them, leading to uncertainty about which features are actually relevant to the model.
You can deal with this problem by either removing or combining the correlated predictors. To effectively remove one of the predictors, it is best to understand the cause of the correlation (i.e., did you include extraneous predictors such as X and 2X, or are there latent variables underlying one or more of the ones you have included that affect both?). To combine predictors, it is possible to include interaction terms (the product of the two that are correlated). Additionally, you could (1) center the data and (2) try to obtain a larger sample, thereby giving you narrower confidence intervals. Lastly, you can apply regularization methods (such as ridge regression).
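The instability, and the stabilizing effect of ridge regression, can be illustrated with a small simulation. This is a sketch under stated assumptions: two nearly identical synthetic predictors, no intercept (the data are generated with mean zero), true coefficients (1, 1), and an arbitrary penalty of lam = 1.0. The helper `fit_two_predictors` is hypothetical and solves the 2x2 system (XᵀX + lam·I)β = Xᵀy in closed form.

```python
import random
import statistics


def fit_two_predictors(x1, x2, y, lam=0.0):
    """Solve (X'X + lam*I) beta = X'y for two predictors, no intercept.
    lam = 0 gives ordinary least squares; lam > 0 gives ridge regression."""
    a = sum(v * v for v in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + lam
    g1 = sum(u * v for u, v in zip(x1, y))
    g2 = sum(u * v for u, v in zip(x2, y))
    det = a * d - b * b
    # Inverse of [[a, b], [b, d]] applied to (g1, g2).
    return (d * g1 - b * g2) / det, (a * g2 - b * g1) / det


rng = random.Random(0)
ols_b1, ridge_b1, ols_sum = [], [], []
for _ in range(200):  # 200 independent synthetic datasets
    x1 = [rng.gauss(0, 1) for _ in range(50)]
    x2 = [v + 0.05 * rng.gauss(0, 1) for v in x1]          # nearly a copy of x1
    y = [u + v + rng.gauss(0, 1) for u, v in zip(x1, x2)]  # true betas: (1, 1)
    b1, b2 = fit_two_predictors(x1, x2, y)
    ols_b1.append(b1)
    ols_sum.append(b1 + b2)
    ridge_b1.append(fit_two_predictors(x1, x2, y, lam=1.0)[0])

ols_sd = statistics.stdev(ols_b1)
ridge_sd = statistics.stdev(ridge_b1)
print(f"spread of OLS   beta_1 across datasets: {ols_sd:.2f}")
print(f"spread of ridge beta_1 across datasets: {ridge_sd:.2f}")
print(f"mean of OLS beta_1 + beta_2: {statistics.mean(ols_sum):.2f}")
```

The individual OLS coefficients swing widely from dataset to dataset even though their sum is estimated well, which is the "split" effect described above. Ridge trades a little bias for a much smaller spread.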




