Data Scientist Role Requirements — An Amazon Interviewer’s Perspective

I joined Amazon as a Product Manager, and although I enjoyed Amazon’s data-driven mindset, I soon realized that my interests were more aligned with the depth that only science provided. I therefore started talking to various teams, scientists and hiring managers to learn how I could make such a transition. My plan is to share what I learned in a series of posts for anyone who wants to learn more about Amazon science roles, make a similar transition, or start a career in data science. The transition took me about a year, and my hope is that you can leverage my learnings to make it in less time and with more certainty.
Overall Requirements
This post walks through the requirements I identified for a Data Scientist role at Amazon. I will break down each of the steps further, and follow-up posts in this series will go into the details of what I did for each step, what I studied and how it worked out for me. I will also try to share cheat sheets and other learning materials that I created along the way for anyone who might be interested in doing the same.
Requirements can be broken down into (1) technical and (2) non-technical skills. Most of my posts will focus on the technical skills, but do not discount the importance of the non-technical ones. When I interview candidates, or meet them for a coffee chat or an informational session, many of them meet the minimum technical requirements for the role. Once you have the technical expertise, the way to differentiate yourself is through non-technical and soft skills. Let’s get started!
Technical Requirements
When I look at a Data Scientist role, I break down the technical requirements into the following five categories. In upcoming posts, I will explain what I studied to prepare for each category and will link those posts here as I publish them. For now, let’s look at the categories:
1. Data Retrieval
As the name of the role suggests, a “typical day in the life” of a Data Scientist involves working with a lot of data. Before doing that, Data Scientists are expected to be able to retrieve their own data from existing databases, usually with SQL queries. In some larger teams and companies, supporting teams retrieve the data so that Data Scientists can focus on diving deep into it, but that does not mean Data Scientists do not need to know how to get the data themselves. At Amazon, for example, Business Intelligence Engineers (BIEs) or Business Analysts (BAs) sometimes help with data retrieval, but that is not always the case. In fact, in my experience this is one of the most common parts of Data Scientist interviews across teams.
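To make "retrieving your own data" a bit more concrete, here is a minimal sketch of the kind of query I have in mind. It uses Python's built-in sqlite3 module with an in-memory database; the customers and orders tables, their columns and the values in them are all hypothetical and only stand in for whatever database your team actually uses.

```python
import sqlite3

# Hypothetical in-memory database standing in for a real data warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'US'), (2, 'CA'), (3, 'US');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 3, 15.5), (13, 1, 60.0);
""")

# Join the two tables and keep only orders from one country above a minimum amount.
# The "?" placeholders let the driver bind the values safely.
query = """
    SELECT o.order_id, c.country, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    WHERE c.country = ? AND o.amount >= ?
    ORDER BY o.order_id
"""
rows = conn.execute(query, ("US", 20.0)).fetchall()
conn.close()

for row in rows:
    print(row)
```

The same pattern (connect, run a parameterized query, fetch the rows) carries over to other databases, although the connection library and the SQL dialect will differ.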
Data Retrieval — Links
For data retrieval, I have made a detailed post with examples, practice questions and my cheat sheet, which is linked below:
2. Data Analysis and Visualization
Now that you know how to use SQL to query the data you need from the previous step, it is time to start working with the data, which is called data manipulation. I group data manipulation into two categories, and both are expected from a Data Scientist. The first category is basic or preliminary analysis and manipulation that can be implemented within the SQL query when you retrieve the data (think of filtering, aggregation and window functions in SQL as a few examples). The second category is more advanced data manipulation. There are various acceptable ways to do this, but the most common ones I have encountered among Data Scientists are Python and R. I personally use Python, so my posts will mostly use Python and libraries such as Pandas and NumPy, but if you use R or something similar, that is generally acceptable as well. I will also have separate posts dedicated to this section and will update this post with relevant links as those posts become available.
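As a rough illustration of the first category, the sketch below does filtering, aggregation and a window function inside a single SQL query. It again uses Python's built-in sqlite3 module, and it assumes the bundled SQLite is version 3.25 or newer, which is what window functions require. The sales table and its values are made up for the example.

```python
import sqlite3

# Hypothetical sales data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, revenue REAL);
    INSERT INTO sales VALUES
        ('East', 'A', 100.0), ('East', 'B', 250.0), ('East', 'A', 50.0),
        ('West', 'A', 300.0), ('West', 'B', 120.0), ('West', 'C', 10.0);
""")

query = """
    WITH totals AS (
        -- Filtering (WHERE) and aggregation (SUM ... GROUP BY)
        SELECT region, product, SUM(revenue) AS total_revenue
        FROM sales
        WHERE revenue >= 20
        GROUP BY region, product
    )
    -- Window function: rank products by total revenue within each region
    SELECT region, product, total_revenue,
           RANK() OVER (PARTITION BY region ORDER BY total_revenue DESC) AS rank_in_region
    FROM totals
    ORDER BY region, rank_in_region
"""
ranked = conn.execute(query).fetchall()
conn.close()

for row in ranked:
    print(row)
```

Anything beyond this level of manipulation is usually easier to do after the data is loaded into Python or R, which is the second category.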
Data Analysis and Visualization — Links
I will keep adding new links here as I make new posts and will do my best to include practice notebooks in each post to help with learning.
2.1. Python: Top Programming Language for Data Science
2.2. Python Advanced Functions
2.3. Data Manipulation and Analysis with Pandas
2.4. Data Visualization in Python Using Matplotlib and Pandas
3. General Science Knowledge — Statistics and Probability
To analyze and manipulate data in a way that actually answers questions, Data Scientists are expected to be familiar with various statistical and probabilistic concepts (examples include sampling, hypothesis tests, etc.). With the prevalence of Machine Learning these days, some candidates discount the importance of this category, but in my experience this is foundational knowledge that is required for the job. I will also have separate posts dedicated to this section and will update this post with relevant links as those posts become available.
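To give a flavor of what these concepts look like in practice, here is a small sketch of a two-sample hypothesis test and a confidence interval for a mean, written with Python's standard library only. The data is simulated, the function names are my own, and both functions rely on the large-sample normal approximation, so treat this as an illustration and not as a replacement for a proper statistics library.

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(0)
# Hypothetical data: a simulated metric for a control and a treatment group.
control = [random.gauss(10.0, 2.0) for _ in range(500)]
treatment = [random.gauss(10.5, 2.0) for _ in range(500)]


def two_sample_z_test(a, b):
    """Two-sided z-test for a difference in means (large-sample approximation)."""
    standard_error = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(b) - mean(a)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of a sample."""
    z_critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z_critical * stdev(sample) / math.sqrt(len(sample))
    center = mean(sample)
    return center - half_width, center + half_width


z, p_value = two_sample_z_test(control, treatment)
lower, upper = mean_confidence_interval(control)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print(f"95% CI for the control mean: ({lower:.2f}, {upper:.2f})")
```

A small p-value suggests the difference between the two groups is unlikely to be due to sampling noise alone, and the confidence interval gives a range of plausible values for the control mean.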
Statistics and Probability — Links
3.1. Central Limit Theorem, Confidence Level/Interval, Sampling and Sample Size Calculation
3.2. Univariate Analysis
3.4. Sampling
3.5. Correlation
4. General Science Knowledge — Machine Learning
And finally, the part that I am sure everyone has been waiting for: Machine Learning! You do not need me to tell you how machine learning has become part of the everyday life of Data Scientists across the world, so let’s skip the speech about the importance of this topic. Depending on the area of data science that you interview for, you will end up using very specific machine learning algorithms, but a general familiarity with various machine learning algorithms and concepts is essential to pass the interviews and, more importantly, to perform the day-to-day tasks of a Data Scientist. I will also have separate posts dedicated to this section and will update this post with relevant links as those posts become available.
4.1. Regression vs. Classification in Machine Learning
4.2. Linear Regression
4.3. Logistic Regression
4.4. Classification — Logistic Regression vs. K-Nearest Neighbors vs. Naive Bayes
4.5. Decision Trees Regression and Classification — Intro & Implementation
4.6. Hierarchical and K-Means Clustering — Intro & Implementation
4.7. AutoML — Intro & Implementation
4.8. XGBoost — Intro & Implementation
5. Specialized Science Knowledge
This is where Data Scientists start to specialize. Depending on the level of the role that you interview for, you are expected to have a different depth of knowledge in that particular area of data science. Let me clarify with an example. My team at Amazon focuses on Natural Language Processing (NLP) outcomes. That means my day-to-day work involves textual data, and as a result Data Scientists on my team are expected to have a deeper knowledge of machine learning algorithms and concepts that are relevant to the NLP space (e.g. embeddings, tokenization, etc.). On the other hand, a Data Scientist who works on forecasting future demand for a particular product is expected to have a deeper knowledge of time series or other concepts that are more relevant to that role. I will also have separate posts dedicated to this section and will update this post with relevant links as those posts become available.
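If tokenization and embeddings are new to you, the toy sketch below may help with the intuition. It splits text into tokens with a naive regular expression and represents each text as a vector of word counts, then compares two texts with cosine similarity. Please note the assumptions: real NLP systems use learned tokenizers and dense, learned embeddings, so the count vectors here are only a simple stand-in, and the function names and sample sentences are my own.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a naive regex tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(tokens):
    """Represent a token list as a sparse count vector (token -> count)."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical customer reviews.
review_1 = bag_of_words(tokenize("The package arrived late!"))
review_2 = bag_of_words(tokenize("My package arrived very late."))
review_3 = bag_of_words(tokenize("Great sound quality"))

print(cosine_similarity(review_1, review_2))  # shares several tokens with review_1
print(cosine_similarity(review_1, review_3))  # shares no tokens with review_1
```

The idea carries over to learned embeddings: texts become vectors, and similar texts end up with vectors that point in similar directions.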
Specialized Science Knowledge — Links
5.1. Natural Language Processing (NLP)
5.1.1. Language Modeling — Sentiment Analysis, Machine Translation and Named-Entity Recognition
5.1.2. Pre-Trained Models (a.k.a. Large Language Models) in NLP
5.1.3. Sentiment Analysis (Deep Dive)
5.1.4. Grammatical Error Correction with Machine Learning
5.1.5. Topic Modeling
5.1.6. Hugging Face Intro with NLP Tasks Implementation
5.1.7. Hugging Face Agents and Tools for NLP Tasks
5.2. Hyperparameter Optimization
5.2.1. Grid Search, Random Search and Bayesian Optimization
5.2.2. Bayesian Optimization (Deep Dive)
5.2.3. Hyperopt: Distributed Hyperparameter Optimization
Thanks for Reading!
If you found this post helpful, please follow me on Medium and subscribe to receive my latest posts!





