Getting Started With Pandas: How to Import and Persist Data from Different Sources
In this Python session, we will walk you through the process of importing various types of data into Pandas, as well as how to get the required libraries installed.

Setting up your Python environment
Before we dive into the importing process, it’s essential to have a properly set up Python environment. We recommend using Anaconda or Miniconda for managing your Python environments and packages. We assume you already have Python installed; if not, install it first.
Installing necessary libraries
To work with Pandas and the various data formats, you’ll need to install the following libraries:
## Pandas
pip install pandas
## openpyxl (for Excel files)
pip install openpyxl
## sqlalchemy (for SQL databases)
pip install sqlalchemy
## pyreadstat (needed for SPSS files; the Stata and SAS readers are built into Pandas)
pip install pyreadstat
## rpy2 (for R data)
pip install rpy2
Once these libraries are installed, you can import data from various sources.
Importing CSV files
To import CSV files into a Pandas DataFrame, use the read_csv() function. Here’s an example:
import pandas as pd
csv_file = 'your_csv_file.csv'
df = pd.read_csv(csv_file)
print(df.head())
CSV is a near-universal format for tabular data and is widely used to exchange data between applications and platforms. Reading these files straight into a Pandas DataFrame widens the range of data sources you can draw on and shortens the pre-processing stage of a project. The read_csv() function also accepts many parameters, such as the delimiter, the columns to load, and the column data types, giving you finer control over how the data is imported.
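As a rough sketch of those parameters, the snippet below reads a small in-memory CSV and uses sep, usecols, dtype, and parse_dates to control parsing. The semicolon-separated sample data and its column names are made up for illustration; with a real file you would pass its path instead of the StringIO object.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a file on disk.
raw = io.StringIO(
    "id;name;signup_date;score\n"
    "1;Ana;2023-01-15;9.5\n"
    "2;Ben;2023-02-20;7.0\n"
    "3;Cho;2023-03-05;8.25\n"
)

df = pd.read_csv(
    raw,
    sep=";",                                 # non-default delimiter
    usecols=["id", "signup_date", "score"],  # load only these columns
    dtype={"id": "int32"},                   # force a compact integer type
    parse_dates=["signup_date"],             # parse this column as dates
)
print(df.dtypes)
```

Restricting columns and fixing dtypes at read time tends to matter most on large files, where it can noticeably reduce memory use.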
Importing Excel files
To import Excel files into a Pandas DataFrame, use the read_excel() function. Here’s an example:
import pandas as pd
excel_file = 'your_excel_file.xlsx'
df = pd.read_excel(excel_file, engine='openpyxl')
print(df.head())
Excel is one of the most widely used spreadsheet formats in business and academia. Importing data directly from Excel means you do not have to convert workbooks to another format first, which saves time and reduces the risk of data loss or corruption. The read_excel() function provides a variety of parameters that let you tailor the import to different layouts, which is particularly useful for complex or irregularly structured workbooks.
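To illustrate a few of those parameters, here is a minimal sketch that first writes a small two-sheet workbook to a temporary folder (so the example is self-contained) and then reads it back with sheet_name and usecols. The sheet names, column names, and values are hypothetical, and the example assumes openpyxl is installed as described above.

```python
import os
import tempfile

import pandas as pd

# Build a small two-sheet workbook so there is something to read.
# Sheet names and columns are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "example.xlsx")
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    pd.DataFrame({"region": ["N", "S"], "sales": [100, 250]}).to_excel(
        writer, sheet_name="Sales", index=False
    )
    pd.DataFrame({"region": ["N", "S"], "staff": [4, 7]}).to_excel(
        writer, sheet_name="Staff", index=False
    )

# Read one sheet by name and keep only a selected column.
sales = pd.read_excel(path, sheet_name="Sales", usecols=["sales"], engine="openpyxl")

# sheet_name=None returns a dict mapping each sheet name to a DataFrame.
all_sheets = pd.read_excel(path, sheet_name=None, engine="openpyxl")

print(sales)
print(list(all_sheets))
```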
Importing data from SQL databases
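To import data from SQL databases into a Pandas DataFrame, use the read_sql() function. First, create a connection to your database using the sqlalchemy.create_engine() function. Note that SQLAlchemy also needs a driver for your database; for the PostgreSQL URL below, that is typically psycopg2. Here’s an example: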
import pandas as pd
from sqlalchemy import create_engine
database_url = 'postgresql://username:password@localhost/dbname'
engine = create_engine(database_url)
query = 'SELECT * FROM your_table'
df = pd.read_sql(query, engine)
print(df.head())
Working with SQL databases is a core task in data analysis, since relational databases hold a large share of the world’s data. The read_sql() function loads the result of a query directly into a Pandas DataFrame, so you do not need a separate export and conversion step. Because it works with SQLAlchemy, the same code can target different SQL dialects, often by changing little more than the connection URL, which helps when you deal with large or complex databases.
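If you want to try read_sql() without a database server, the sketch below uses Python’s built-in sqlite3 module with an in-memory database as a stand-in; Pandas accepts a sqlite3 connection directly. The orders table and its contents are invented for the example. It also passes the filter value through params instead of formatting it into the SQL string; note that the placeholder style (here ?) depends on the database driver.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for a real server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 120.0), (2, "globex", 75.5), (3, "acme", 30.0)],
)
conn.commit()

# Pass values through `params` rather than building the SQL string by hand.
query = "SELECT id, amount FROM orders WHERE customer = ? ORDER BY id"
df = pd.read_sql(query, conn, params=("acme",))
print(df)

conn.close()
```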
Importing SPSS, Stata, and SAS data
To import SPSS, Stata, and SAS data into a Pandas DataFrame, use the respective read_spss(), read_stata(), and read_sas() functions. Here are some examples:
Importing SPSS data
import pandas as pd
spss_file = 'your_spss_file.sav'
df = pd.read_spss(spss_file)
print(df.head())
SPSS is statistical software widely used in social science research and other fields that rely on statistical analysis. Its data usually comes in a proprietary format (.sav), which Pandas can read with the read_spss() function (this relies on the pyreadstat package installed earlier). This lets you work across platforms without altering the data and makes it straightforward to combine SPSS data with data from other sources.
Importing Stata data
import pandas as pd
stata_file = 'your_stata_file.dta'
df = pd.read_stata(stata_file)
print(df.head())
Stata is data management and statistical software often used in fields like economics, epidemiology, and political science. Stata data is usually stored in the proprietary .dta format, which can be read into a Pandas DataFrame with the read_stata() function. You can then explore, manipulate, and analyze the data in Python alongside your other sources.
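Since Pandas can also write .dta files, a quick round trip is an easy way to experiment with read_stata() without having Stata itself. The sketch below writes a tiny, made-up dataset to a temporary file and then loads only two of its variables through the columns argument.

```python
import os
import tempfile

import pandas as pd

# Write a small hypothetical dataset to a temporary .dta file.
path = os.path.join(tempfile.mkdtemp(), "example.dta")
pd.DataFrame(
    {"year": [2020, 2021, 2022], "gdp": [1.5, 2.0, 2.75], "pop": [10, 11, 12]}
).to_stata(path, write_index=False)

# Load only the variables that are needed.
subset = pd.read_stata(path, columns=["year", "gdp"])
print(subset)
```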
Importing SAS data
import pandas as pd
sas_file = 'your_sas_file.sas7bdat'
df = pd.read_sas(sas_file)
print(df.head())
SAS is an integrated software suite that is used extensively for advanced analytics, business intelligence, and data management. SAS data typically resides in the proprietary .sas7bdat format. The read_sas() function imports this data into a Pandas DataFrame, ready for exploratory analysis, so you are not confined to a single software environment. Being familiar with all three import functions makes it easier to handle whatever format a dataset arrives in.
Importing R data using the rpy2 library
To import R data into a Pandas DataFrame, use the rpy2 library. First, install it with pip install rpy2; rpy2 also requires a working R installation on the same machine. Here’s an example:
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
r_file = 'your_r_file.RData'
# Load the .RData file into the embedded R session
ro.r['load'](r_file)
r_df = ro.r['your_dataframe_name']
# pandas2ri.ri2py() was renamed in rpy2 3.x; this conversion targets rpy2 3.x,
# and the exact call may differ between rpy2 releases
with localconverter(ro.default_converter + pandas2ri.converter):
    df = ro.conversion.rpy2py(r_df)
print(df.head())
R is a popular language among statisticians and data miners for statistical computing and data analysis. You may have data stored in R’s native format (.RData) that you would like to analyze with Pandas. The rpy2 library, through its pandas2ri converter, serves as a bridge between R and Python, allowing you to turn an R data frame into a Pandas DataFrame.
Knowing how to do this conversion lets you work across both languages. You can bring existing R-based data or statistical models into your Python environment, which is especially helpful on multi-language projects or when migrating workflows from R to Python.
Persisting tabular data
To save your Pandas DataFrame to various formats, use the respective to_csv(), to_excel(), and to_sql() functions. Here are some examples:
Saving as a CSV
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
csv_file = 'output_csv_file.csv'
df.to_csv(csv_file, index=False)
The to_csv() function is the basic tool for persisting a DataFrame. Data scientists often work with large amounts of data and need to store and retrieve it readily. Because CSV files are supported by a wide range of software and programming languages, they make your data portable.
A CSV file can be saved, shared, reviewed, or edited outside your code, which supports collaboration and transparency. It is especially useful when you need to pause your work and resume later without re-running the entire process to regenerate the data. The to_csv() function also lets you specify options such as whether to include the index in the output and which delimiter to use.
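As a small illustration of those options, the sketch below writes a DataFrame with a semicolon delimiter and an explicit marker for missing values, then reads it back. The data is made up, and "NA" is just one possible choice of marker. Calling to_csv() without a path returns the CSV text as a string, which is handy for inspecting the output.

```python
import io

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4.5, None, 6.0]})

# With no path argument, to_csv() returns the CSV text instead of writing a file.
text = df.to_csv(index=False, sep=";", na_rep="NA")
print(text)

# Read it back using the same delimiter and missing-value marker.
restored = pd.read_csv(io.StringIO(text), sep=";", na_values=["NA"])
print(restored.equals(df))
```

Keeping the write and read options in sync (delimiter, missing-value marker, index handling) is what makes the round trip lossless here.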
Saving as an Excel file
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
excel_file = 'output_excel_file.xlsx'
df.to_excel(excel_file, index=False, engine='openpyxl')
Excel files are widely used for storing and handling data in business and scientific research, and they are easy for most people to read. The to_excel() function exports your DataFrame to an Excel file that people without a programming background can open, which makes it a convenient way to share analysis results.
Excel also offers formulas, pivot tables, charts, and other features that a plain text file does not, and recipients can apply them to the exported data. The function lets you specify parameters such as the Excel engine to use and whether to include the DataFrame’s index in the output, giving you more control over the result.
This is particularly valuable if you frequently work with non-programming stakeholders or use Excel as part of your own analysis toolset.
Saving to an SQL database
import pandas as pd
from sqlalchemy import create_engine
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
database_url = 'postgresql://username:password@localhost/dbname'
engine = create_engine(database_url)
table_name = 'your_table_name'
df.to_sql(table_name, engine, index=False, if_exists='replace')
SQL databases handle large volumes of data efficiently and support complex queries, which makes them a core component of many enterprise and big data applications. The to_sql() function exports your DataFrame to a SQL database, which helps when a dataset is too large to process comfortably in memory.
SQL databases also offer strong data integrity and security guarantees, which are essential in data-sensitive industries and institutions. The to_sql() function lets you choose what happens if the table already exists (the if_exists options are ‘replace’, ‘append’, and ‘fail’) and whether to include the DataFrame’s index in the output.
SQL databases are ubiquitous and language-agnostic, so data stored in them is accessible from almost any tool. Writing results back with to_sql() also lets you integrate your analysis with your organization’s existing database infrastructure.
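The three if_exists options are easiest to understand side by side. The following sketch uses an in-memory SQLite database through Python’s built-in sqlite3 module as a stand-in for a real server; the table name demo is arbitrary.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

df.to_sql("demo", conn, index=False, if_exists="replace")  # creates the table
df.to_sql("demo", conn, index=False, if_exists="append")   # adds 3 more rows
rows_after_append = pd.read_sql("SELECT COUNT(*) AS n FROM demo", conn)["n"].iloc[0]

df.to_sql("demo", conn, index=False, if_exists="replace")  # drops and recreates
rows_after_replace = pd.read_sql("SELECT COUNT(*) AS n FROM demo", conn)["n"].iloc[0]

# The default, if_exists="fail", raises ValueError when the table already exists.
try:
    df.to_sql("demo", conn, index=False)
    failed = False
except ValueError:
    failed = True

print(rows_after_append, rows_after_replace, failed)
conn.close()
```

Because ‘replace’ drops the existing table before writing, ‘append’ is usually the safer choice when the table holds data you did not create yourself.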
Pandas provides powerful tools for importing and persisting data from various sources. With these examples, you can start working with different data formats and further explore the potential of Pandas in your data analysis and manipulation tasks.
