Soner Yıldırım

Summary

The article outlines the three most utilized tools by the author in their first year as a data scientist: Pandas, SQL, and Tableau.

Abstract

In the realm of data science, the author emphasizes the importance of mastering a select few tools rather than spreading oneself too thin across numerous options. The article delves into the author's personal experience, highlighting Pandas, SQL, and Tableau as the cornerstone tools that significantly contributed to their success in the first year of their data science career. Pandas is praised for its efficiency in data manipulation and analysis within Python, while SQL is recognized as an indispensable skill for managing and analyzing data in relational databases. Tableau stands out for its ability to create insightful visual analytics and dashboards, which are crucial for communicating results to stakeholders. The author suggests that focusing on these three tools can greatly enhance a data scientist's job prospects.

Opinions

  • The author believes that a rich selection of software tools in data science can be a disadvantage if not used wisely.
  • They advocate for focusing on a small subset of tools rather than learning more than necessary.
  • Pandas is considered the most frequently used Python library due to its effectiveness in handling tabular data and its intuitive syntax.
  • SQL is deemed a must-have skill for data professionals, as it is not only for querying but also for data analysis within relational databases.
  • The author has found SQL to be an everyday tool in their data science work and expects this to continue.
  • Tableau is highly regarded for its ease in creating dashboards, which are essential for explaining solutions and results to customers, especially in SaaS or consulting companies.
  • The author notes that Tableau is the market leader in visual analytics platforms and is the most in-demand tool in job postings.
  • They predict that the combination of Pandas, SQL, and Tableau will remain essential tools for data scientists in the foreseeable future.

3 Tools that I Used the Most in My First Year as a Data Scientist

I think it will be the same for the upcoming years.

Photo by Sixteen Miles Out on Unsplash

The data science ecosystem has a ton of software tools and packages, which is a good thing because such tools expedite and simplify our workflows.

I’m sure we all are glad to have these tools in our lives. However, having a rich selection of them might be a disadvantage if not used wisely.

Based on what I have experienced and observed in the last 3 years, I can say that we tend to learn more than necessary. Instead of distributing your time and energy among a high number of tools, I recommend focusing on a small subset of them.

This is why I wanted to write this article and explain the 3 tools that I used the most in my first year as a data scientist. Improving your skills with these tools can substantially increase your chances of landing a job.

Let’s start with the obvious one.

Pandas

Python dominates the field of data science and so do Python libraries. Pandas is a data analysis and manipulation library for Python. Considering that a substantial amount of time in a project is spent on cleaning and preprocessing the raw data, Pandas might be the most frequently used Python library.

Pandas is a highly efficient library to work with tabular data. I don’t recall a problem for which Pandas could not provide a solution.

Another advantage of Pandas is its clean syntax. Like most Python libraries, it is intuitive and easy to read.

Pandas makes it quite easy to perform the most frequently done operations on tabular data, which are as follows:

  • Reading data from an external file (e.g. CSV or parquet)
  • Checking the size of data
  • Changing the data types if necessary (e.g. numbers should not be stored as strings)
  • Finding and handling missing values
  • Filtering based on a condition or a set of conditions
  • Exploratory data analysis
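
The operations listed above can be sketched in a few lines. This is a minimal illustration rather than code from my projects: the column names and the inline CSV text are made up for the example, and an in-memory string stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical data; in practice this would be a path to a CSV or parquet file.
csv_text = """product,price,quantity
apple,1.20,10
banana,0.50,
cherry,3.00,4
"""

# Read data from an external source (here, an in-memory CSV).
df = pd.read_csv(io.StringIO(csv_text))

# Check the size of the data as (rows, columns).
n_rows, n_cols = df.shape

# Find and handle missing values: count them per column, then fill with 0.
missing_per_column = df.isna().sum()
df["quantity"] = df["quantity"].fillna(0)

# Change the data type where needed (the gap made pandas read quantity as float).
df["quantity"] = df["quantity"].astype("int64")

# Filter based on a set of conditions.
filtered = df[(df["price"] > 1.0) & (df["quantity"] > 0)]

# A first step of exploratory data analysis: summary statistics.
summary = df.describe()
```

Filling missing quantities with 0 is only one possible choice here; dropping the rows or using a different fill value may suit other datasets better.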

Although Pandas has numerous functions and methods, there is a small subset of them that you will use the most. I have covered the 8 Pandas functions I used the most in a separate article.

SQL

SQL is used for managing data stored in relational databases. A relational database consists of several tables that are related by means of shared columns.

Most companies store their data in relational databases, so SQL is definitely a must-have skill for data scientists, engineers, and analysts. I used SQL almost every day in my first year as a data scientist, and I think I will keep using it just as frequently.

Although SQL stands for Structured Query Language, it is capable of doing much more than just querying a database.

SQL is also a data analysis tool. It is capable of doing most of what Pandas can do. If the data is stored in a relational database, it is more practical to do the analysis using SQL instead of exporting all the data and then using another tool for analysis.
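
As a rough illustration of doing the analysis where the data lives, the sketch below runs an aggregation inside the database and pulls back only the result. It assumes an in-memory SQLite database (via Python's built-in sqlite3 module) as a stand-in for a company database, and the sales table and its columns are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a real relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("north", 50.0), ("south", 80.0)],
)

# Aggregate inside the database, similar in spirit to a Pandas groupby.
query = """
    SELECT store, SUM(amount) AS total_amount, COUNT(*) AS n_sales
    FROM sales
    GROUP BY store
    ORDER BY store
"""
totals = conn.execute(query).fetchall()
conn.close()
```

Only the aggregated rows leave the database, which is the practical advantage over exporting the full table first.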

You can also automate routine operations by writing SQL scripts as stored procedures. Here is an example of what a stored procedure can do:

  1. Read data from a few tables and filter if necessary
  2. Transform and reformat if necessary
  3. Combine data from multiple tables based on the requirements
  4. Write the transformed data into a new table
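
The four steps above can be sketched as follows. SQLite has no stored procedures, so this sketch wraps the SQL script in a Python function as a stand-in; in MySQL, PostgreSQL, or SQL Server the same statements would live inside a procedure defined in that system's own syntax. The table names, column names, and the build_order_summary function are all hypothetical.

```python
import sqlite3


def build_order_summary(conn):
    """Stand-in for a stored procedure: run the four steps as one routine."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS order_summary;
        -- Step 4: write the result into a new table.
        CREATE TABLE order_summary AS
        SELECT c.name AS customer,
               UPPER(c.country) AS country,      -- Step 2: reformat
               SUM(o.amount) AS total_amount     -- Step 2: transform
        FROM orders AS o                         -- Step 1: read
        JOIN customers AS c                      -- Step 3: combine tables
          ON c.id = o.customer_id
        WHERE o.status = 'paid'                  -- Step 1: filter
        GROUP BY c.name, c.country;
        """
    )


# Hypothetical source tables for the example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL, status TEXT
    );
    INSERT INTO customers VALUES (1, 'Ada', 'uk'), (2, 'Lin', 'us');
    INSERT INTO orders VALUES
        (1, 1, 20.0, 'paid'), (2, 1, 5.0, 'cancelled'),
        (3, 2, 7.5, 'paid'), (4, 1, 10.0, 'paid');
    """
)

build_order_summary(conn)
rows = conn.execute(
    "SELECT customer, country, total_amount FROM order_summary ORDER BY customer"
).fetchall()
conn.close()
```

Because the routine drops and rebuilds the summary table, it can be rerun on a schedule without manual cleanup.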

There are many relational database management systems such as MySQL, PostgreSQL, and SQL Server. Although they mostly use the same SQL syntax, there are some minor differences. For instance, MySQL uses the LIMIT keyword to limit the number of rows returned, whereas SQL Server uses the TOP keyword.
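
To make the difference concrete, here is a small sketch with a hypothetical products table. The LIMIT form is executed against an in-memory SQLite database, which accepts LIMIT just as MySQL does. The TOP form is shown only as a string for comparison, since SQLite does not support it.

```python
import sqlite3

# MySQL-style query: LIMIT goes at the end of the statement.
limit_query = "SELECT name FROM products ORDER BY price DESC LIMIT 2"

# SQL Server-style query with the same intent: TOP goes right after SELECT.
# Shown for comparison only; it is not executed here.
top_query = "SELECT TOP 2 name FROM products ORDER BY price DESC"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("pen", 1.5), ("laptop", 900.0), ("desk", 150.0)],
)

# The two most expensive products, returned as plain names.
most_expensive = [row[0] for row in conn.execute(limit_query)]
conn.close()
```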

Tableau

Tableau is a visual analytics platform. It makes it easy to create informative dashboards that can be used for understanding the data, evaluating results, and delivering results to customers.

If you work at a SaaS or consulting company, dashboards are of crucial importance. This is how you explain your solution and results to the customers. You cannot just send them CSV files with plain numbers.

A lot of companies use Tableau for creating business intelligence dashboards as well. You can combine data from a variety of sources and make a summary of how your company is doing.

Tableau is among the market leaders in this domain. Another popular tool is Power BI. There are also some open source alternatives such as Grafana. They all aim to do the same thing, but Tableau seems to be the leading player.

In my first year as a data scientist, I created dashboards and updated existing ones based on customer demand. Whether you work as a data scientist or a data analyst, you will most probably use Tableau or a similar tool. From what I observe in the community and in job postings, Tableau is the most in-demand one.

Tableau supports many different types of data visualizations. It can also connect directly to your data source so that you do not have to export the data manually.

Pandas, SQL, and Tableau are the 3 tools that I used the most in my first year as a data scientist. I think this trio will keep its place in the upcoming years.


Thank you for reading. Please let me know if you have any feedback.
