Summary

The content provides an insider's perspective on the technical and non-technical requirements for a Data Scientist role at Amazon, detailing the necessary skills and knowledge areas, with links to resources for further learning.

Abstract

The article, written by a former Product Manager turned Data Scientist at Amazon, offers a comprehensive guide to the skills and knowledge required for a Data Scientist position within the company. It emphasizes the importance of both technical and non-technical abilities, with a focus on data retrieval, analysis, visualization, statistical knowledge, machine learning, and specialized domain expertise, such as Natural Language Processing. The author breaks down the technical requirements into five key categories: data retrieval (often through SQL), data analysis and visualization (using tools like Python, R, Pandas, and Matplotlib), statistical and probabilistic concepts, machine learning algorithms, and specialized science knowledge pertinent to the specific role. The article serves as a roadmap for aspiring Data Scientists, providing links to detailed posts and learning materials created by the author to aid in the preparation for each aspect of the role.

Opinions

The author believes that non-technical and soft skills are crucial for differentiating oneself in the Data Scientist role, in addition to meeting the minimum technical requirements.
The article suggests that a Data Scientist at Amazon should be proficient in retrieving data independently, even though support may be available from other teams like Business Intelligence Engineers or Business Analysts.
The author stresses the foundational importance of statistics and probability in data science, despite the prevalence of machine learning.
There is an opinion that specialized knowledge in a particular data science realm, such as NLP or time series analysis, is essential depending on the team one is interviewing for or working with.
The author advocates for the use of Python over R for data manipulation and analysis, based on personal preference and experience.
The article implies that a general familiarity with various machine learning algorithms is necessary for both interviewing and performing daily tasks as a Data Scientist.
The author values the role of sampling in data science, referring to it as an "unsung hero."
The piece encourages the use of pre-trained models in NLP, suggesting they are beneficial and time-saving for practitioners.
The author endorses the use of Hyperopt for hyperparameter optimization, indicating its effectiveness in distributed hyperparameter optimization.

Data Scientist Role Requirements — An Amazon Interviewer’s Perspective

Photo by Maranda Vandergriff on Unsplash

I joined Amazon as a Product Manager, and although I enjoyed the data-driven mindset at Amazon, I soon realized that my interests were more aligned with the depth that only science provided. I therefore started talking to various teams, scientists and hiring managers to learn how I could make such a transition. I plan to share what I learned in a series of posts for anyone who wants to learn more about Amazon science roles, make a similar transition or start a career in data science. The transition took me about a year, and my hope is that you can leverage my learnings to make it in less time and with more certainty.

Overall Requirements

This post walks through the requirements I identified for a Data Scientist role at Amazon. I will break down each of the steps, and follow-up stories in this series will go into the details of what I did for each step, what I studied and how it worked out for me. I will also share the cheat sheets and other learning materials that I created along the way for anyone who might want to do the same.

Requirements can be broken down into (1) technical and (2) non-technical skills. Most of my posts will focus on the technical skills, but do not discount the importance of the non-technical ones. When I interview candidates or meet them for a coffee chat or an informational session, many of them already meet the minimum technical requirements for the role. Once you have the technical expertise, the way to differentiate yourself is through non-technical and soft skills. Let’s get started!

Technical Requirements

When I look at a Data Scientist role, I break down the technical requirements into the following five categories. For each one, I explain in separate posts what I studied to prepare, and I link those posts here as I publish them. For now, let’s look at the steps:

1. Data Retrieval

As the name of the role suggests, a “typical day in the life” of a Data Scientist involves working with a lot of data. Before doing that, however, Data Scientists are expected to be able to retrieve their own data from existing databases, usually by writing SQL queries. In some larger teams and companies, dedicated teams support Data Scientists by retrieving the data so that the Data Scientists can focus on diving deep into it, but that does not mean Data Scientists do not need to know how to get the data themselves. At Amazon, for example, Business Intelligence Engineers (BIEs) or Business Analysts (BAs) sometimes help with data retrieval, but that is not always the case. In fact, in my experience this is one of the most common parts of Data Scientist interviews across teams.
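To make the idea of retrieving your own data concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables, columns and rows (customers, orders and so on) are hypothetical and exist only for illustration. A real team would query its own warehouse with its own schema, so treat this as one possible shape of a retrieval query, not a description of any Amazon system.

```python
import sqlite3

# Hypothetical schema and rows, created in memory purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        order_date TEXT,
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'US'), (2, 'CA'), (3, 'US');
    INSERT INTO orders VALUES
        (10, 1, '2023-01-05', 40.0),
        (11, 2, '2023-01-07', 15.5),
        (12, 3, '2023-02-11', 99.9),
        (13, 1, '2023-02-20', 20.0);
""")


def fetch_orders(conn, country, start_date):
    """Return (order_id, order_date, amount) rows for one country on or after start_date."""
    query = """
        SELECT o.order_id, o.order_date, o.amount
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        WHERE c.country = ? AND o.order_date >= ?
        ORDER BY o.order_date
    """
    # Parameters are bound with placeholders instead of string formatting.
    return conn.execute(query, (country, start_date)).fetchall()


us_orders = fetch_orders(conn, "US", "2023-02-01")
print(us_orders)
```

The join, the filter and the ordering all happen in the database, so only the rows you actually need come back to Python.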

Data Retrieval — Links

For data retrieval, I have made a detailed post with examples, practice questions and my cheat sheet, which is linked below:

1.1. Data Retrieval with SQL

2. Data Analysis and Visualization

Now that you know how to use SQL to query the data you need, it is time to start playing around with the data, which is called data manipulation. I group data manipulation into two categories, and both are expected from a Data Scientist. The first category is basic or preliminary analysis and manipulation that can be implemented within the SQL query when you retrieve the data (think of filtering, aggregation and window functions in SQL as a few examples). The second category is more advanced data manipulation. There are various acceptable ways to do this, but the most common ones I have encountered among Data Scientists are Python and R. I personally use Python, so my posts will mostly use Python and libraries such as Pandas and NumPy, but if you use R or something similar, that is generally acceptable as well. I will also have separate posts dedicated to this section and will update this post with relevant links as those posts become available.
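The two categories can be sketched side by side. In the example below, the first step filters and aggregates inside the SQL query, and the second step does further manipulation in Python. The sales table and its values are hypothetical. In practice the second step would usually be done with Pandas, but this sketch sticks to the standard library so that it runs without extra installs.

```python
import sqlite3

# Hypothetical sales data, created in memory purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, month TEXT, revenue REAL);
    INSERT INTO sales VALUES
        ('east', '2023-01', 100.0), ('east', '2023-02', 140.0),
        ('west', '2023-01', 80.0),  ('west', '2023-02', 60.0),
        ('north', '2023-01', 5.0);
""")

# Category 1: filtering and aggregation done inside the SQL query.
rows = conn.execute("""
    SELECT region, SUM(revenue) AS total_revenue, COUNT(*) AS n_months
    FROM sales
    WHERE revenue > 10
    GROUP BY region
    ORDER BY region
""").fetchall()

# Category 2: further manipulation in Python, here each region's share of
# the total revenue that survived the filter.
grand_total = sum(total for _, total, _ in rows)
shares = {region: round(total / grand_total, 3) for region, total, _ in rows}

print(rows)
print(shares)
```

Which step a given calculation belongs in is a judgment call. Pushing simple filters and aggregations into the query usually keeps the amount of data transferred small.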

Data Analysis and Visualization — Links

I will keep adding new links here as I make new posts and will do my best to include practice notebooks in each post to help with learning.

2.1. Python: Top Programming Language for Data Science

2.2. Python Advanced Functions

2.3. Data Manipulation and Analysis with Pandas

2.4. Data Visualization in Python Using Matplotlib and Pandas

3. General Science Knowledge — Statistics and Probability

To answer questions with data, Data Scientists are expected to be familiar with various statistical and probabilistic concepts (for example sampling and hypothesis tests). With the prevalence of Machine Learning these days, some candidates discount the importance of this category, but in my experience it is foundational knowledge that is required for the job. I will also have separate posts dedicated to this section and will update this post with relevant links as those posts become available.
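As a small illustration of the two examples mentioned above, the sketch below computes a sample size for estimating a proportion and runs a two-proportion z-test. It uses only the Python standard library. The function names and the input numbers are made up for illustration, and both calculations rely on the usual normal approximation, which is an assumption that should be checked for small samples.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_for_proportion(margin_of_error, confidence=0.95, p=0.5):
    """Sample size needed to estimate a proportion within +/- margin_of_error.

    Uses n = z^2 * p * (1 - p) / e^2, with p = 0.5 as the most conservative guess.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z**2 * p * (1 - p) / margin_of_error**2)


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions (pooled variance)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical inputs: a 5% margin of error, and two groups of 1,000 users
# with 120 and 150 conversions respectively.
n_required = sample_size_for_proportion(0.05)
z_stat, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print(n_required, round(z_stat, 3), round(p_value, 4))
```

The first function answers "how many observations do I need?", and the second answers "is the difference I observed larger than chance alone would explain?".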

Statistics and Probability — Links

3.1. Central Limit Theorem, Confidence Level/Interval, Sampling and Sample Size Calculation

3.2. Univariate Analysis

3.3. Multivariate Analysis

3.4. Sampling

3.5. Correlation

4. General Science Knowledge — Machine Learning

And finally, the part that I am sure everyone has been waiting for: Machine Learning! You do not need me to tell you how machine learning has become part of the everyday life of Data Scientists across the world, so let’s skip the speech about the importance of this topic. Depending on the area of data science that you interview for, you will end up using very specific machine learning algorithms, but a general familiarity with various machine learning algorithms and concepts is essential to pass the interviews and, more importantly, to perform the day-to-day tasks of a Data Scientist. I will also have separate posts dedicated to this section and will update this post with relevant links as those posts become available.
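As one example of the general familiarity described above, here is a minimal sketch of simple linear regression fitted with the closed-form least-squares solution. It is written in plain Python so that every step is visible. The function name and the data points are hypothetical, and in day-to-day work you would normally reach for a library implementation instead.

```python
from statistics import fmean


def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data that roughly follows y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]
slope, intercept = fit_simple_linear_regression(xs, ys)
prediction = slope * 6 + intercept  # predicted y for a new point x = 6
print(slope, intercept, prediction)
```

Being able to explain where the slope and intercept come from is the kind of understanding that carries over to more complex algorithms.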

4.1. Regression vs. Classification in Machine Learning

4.2. Linear Regression

4.3. Logistic Regression

4.4. Classification — Logistic Regression vs. K-Nearest Neighbors vs. Naive Bayes

4.5. Decision Trees Regression and Classification — Intro & Implementation

4.6. Hierarchical and K-Means Clustering — Intro & Implementation

4.7. AutoML — Intro & Implementation

4.8. XGBoost — Intro & Implementation

5. Specialized Science Knowledge

This is where Data Scientists start to specialize. Depending on the level of the role that you interview for, you are expected to have a different depth of knowledge in that particular data science realm. Let me clarify with an example. My team at Amazon focuses on Natural Language Processing (NLP) outcomes, which means that my day-to-day work involves textual data. As a result, Data Scientists on my team are expected to have a deeper knowledge of the machine learning algorithms and concepts that are relevant to the NLP space (e.g. embeddings and tokenization). On the other hand, a Data Scientist who works on forecasting future demand for a particular product is expected to have a deeper knowledge of time series or other concepts that are more relevant to that role. I will also have separate posts dedicated to this section and will update this post with relevant links as those posts become available.
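To give a feel for the NLP concepts mentioned above, the sketch below tokenizes two short texts and compares them with cosine similarity over word counts. This is a deliberately simplified stand-in. Word-count vectors are not learned embeddings, and the regular-expression tokenizer is an assumption made for illustration, not the tokenizer of any particular library. Production NLP work typically relies on pre-trained tokenizers and dense embeddings instead.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the word-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # A Counter returns 0 for missing tokens, so only shared tokens contribute.
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = sqrt(sum(count * count for count in counts_a.values()))
    norm_b = sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical review snippets.
review_a = tokenize("The delivery was fast")
review_b = tokenize("Fast delivery, great price!")
similarity = cosine_similarity(review_a, review_b)
print(review_b, similarity)
```

The same pattern of turning text into vectors and then comparing the vectors also applies when the vectors come from a trained embedding model.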

Specialized Science Knowledge — Links

5.1. Natural Language Processing (NLP)

5.1.1. Language Modeling — Sentiment Analysis, Machine Translation and Named-Entity Recognition

5.1.2. Pre-Trained Models (a.k.a. Large Language Models) in NLP

5.1.3. Sentiment Analysis (Deep Dive)

5.1.4. Grammatical Error Correction with Machine Learning

5.1.5. Topic Modeling

5.1.6. Hugging Face Intro with NLP Tasks Implementation

5.1.7. Hugging Face Agents and Tools for NLP Tasks

5.2. Hyperparameter Optimization

5.2.1. Grid Search, Random Search and Bayesian Optimization

5.2.2. Bayesian Optimization (Deep Dive)

5.2.3. Hyperopt: Distributed Hyperparameter Optimization
