Summary
The web content provides a step-by-step guide for installing Apache Spark on Windows for use with PySpark, including prerequisites, downloading and setting up Spark, and configuring environment variables.
Abstract
The provided web content is a tutorial for installing Apache Spark on a Windows operating system to enable the use of PySpark. It begins with a video walkthrough and outlines prerequisites such as installing Gnu on Windows (GOW) and Anaconda. The guide details the process of downloading Spark from the official Apache Spark website, moving and unzipping the file, and downloading the necessary winutils.exe. It also instructs readers to make sure Java 7 or higher is installed, to set environment variables for Spark and Hadoop, and to configure PySpark to launch in a Jupyter Notebook. The tutorial concludes with instructions to verify the installation by running pyspark --master local[2] and invites users to ask questions or view an example notebook.
The video above walks through installing Spark on Windows following the set of instructions below. If you have any questions, you can leave a comment here or on YouTube (please subscribe if you can).
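Prerequisites: Anaconda and GOW. If you already have Anaconda and GOW installed, skip to step 4.

1. Download and install Gnu on Windows (GOW), which provides the Linux-style commands (mv, gzip, tar, curl) used in the steps below.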
2. Download and install Anaconda. If you need help, please see this tutorial.
3. Close and open a new command line (CMD).
4. Go to the Apache Spark website (link)

a) Choose a Spark release
b) Choose a package type
c) Choose a download type: (Direct Download)
d) Download Spark. If you download a version other than spark-2.1.0-bin-hadoop2.7, adjust the file and folder names in the remaining commands to match the file you downloaded.
5. Move the file to where you want to unzip it.
mkdir C:\opt\spark
mv C:\Users\mgalarny\Downloads\spark-2.1.0-bin-hadoop2.7.tgz C:\opt\spark\spark-2.1.0-bin-hadoop2.7.tgz
6. Unzip the file. Run the commands below from C:\opt\spark.
gzip -d spark-2.1.0-bin-hadoop2.7.tgz
tar xvf spark-2.1.0-bin-hadoop2.7.tar
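If gzip and tar are not available, the same unpacking can be sketched with Python's standard tarfile module, which ships with Anaconda. This is a minimal sketch rather than part of the original walkthrough; the function name extract_spark is made up for illustration, and the paths in the usage note are the ones from step 5.

```python
import tarfile
from pathlib import Path


def extract_spark(archive, dest):
    """Unpack a .tgz archive into dest and return its top-level entry names.

    Hypothetical helper: it does in one pass what `gzip -d` followed by
    `tar xvf` do in two.
    """
    archive = Path(archive)
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:gz" reads a gzip-compressed tar file directly.
    with tarfile.open(archive, "r:gz") as tar:
        top_level = sorted({m.name.split("/")[0] for m in tar.getmembers()})
        tar.extractall(dest)
    return top_level
```

For the file used in this guide, the call would look like extract_spark(r"C:\opt\spark\spark-2.1.0-bin-hadoop2.7.tgz", r"C:\opt\spark"). Unlike gzip -d, this leaves the original .tgz file in place.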
7. Download winutils.exe into your spark-2.1.0-bin-hadoop2.7\bin directory. Run the command below from inside that directory.
curl -k -L -o winutils.exe https://github.com/steveloughran/winutils/blob/master/hadoop-2.6.0/bin/winutils.exe?raw=true
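As an alternative to curl, the download can be sketched with Python's standard library. This is only an illustration: the function name download_winutils and its opener parameter are assumptions introduced here, and the URL is the one from the curl command above. Note that urlopen follows redirects like curl -L, but it verifies TLS certificates, which curl -k does not.

```python
import shutil
import urllib.request
from pathlib import Path

# Same URL as in the curl command of step 7.
WINUTILS_URL = (
    "https://github.com/steveloughran/winutils/blob/master/"
    "hadoop-2.6.0/bin/winutils.exe?raw=true"
)


def download_winutils(spark_home, url=WINUTILS_URL, opener=urllib.request.urlopen):
    """Save winutils.exe into <spark_home>/bin and return the target path.

    Hypothetical helper; `opener` exists only so the network call can be
    replaced in tests.
    """
    target = Path(spark_home) / "bin" / "winutils.exe"
    target.parent.mkdir(parents=True, exist_ok=True)
    with opener(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

With the folder from step 6, the call would be download_winutils(r"C:\opt\spark\spark-2.1.0-bin-hadoop2.7").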
8. Make sure you have Java 7+ installed on your machine.
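One way to check this is to run java -version and read the version string. The sketch below automates that check; the function names are made up for illustration, and it assumes the usual output shape of java -version, such as java version "1.8.0_121" for older releases or openjdk version "11.0.2" for newer ones.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Releases up to Java 8 report themselves as 1.x (for example 1.7, 1.8).
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major():
    """Run `java -version` and return the major version, or None if unavailable."""
    try:
        result = subprocess.run(
            ["java", "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except FileNotFoundError:
        return None  # java is not on the PATH
    # java -version normally writes to stderr, so both streams are checked.
    return parse_java_major(result.stderr + result.stdout)
```

A result of 7 or higher from installed_java_major() satisfies this step; None suggests that Java is missing or not on your PATH.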
9. Next, we will edit our environment variables so we can open a Spark notebook from any directory.
setx SPARK_HOME C:\opt\spark\spark-2.1.0-bin-hadoop2.7
setx HADOOP_HOME C:\opt\spark\spark-2.1.0-bin-hadoop2.7
setx PYSPARK_DRIVER_PYTHON ipython
setx PYSPARK_DRIVER_PYTHON_OPTS notebook
Add ;C:\opt\spark\spark-2.1.0-bin-hadoop2.7\bin to your path.
Notes on the setx command: https://ss64.com/nt/set.html
See the video if you want to update your path manually.
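Because setx only affects terminals opened afterwards, it can help to confirm the settings from a fresh session. The following is a minimal sketch of such a check, not part of the original tutorial; the function name check_spark_env and the wording of its messages are assumptions, while the variable names come from the setx commands above.

```python
import os
from pathlib import Path

# Variables set with setx in step 9.
REQUIRED_VARS = (
    "SPARK_HOME",
    "HADOOP_HOME",
    "PYSPARK_DRIVER_PYTHON",
    "PYSPARK_DRIVER_PYTHON_OPTS",
)


def _norm(path):
    # Normalise separators and case so Windows paths compare reliably.
    return os.path.normcase(os.path.normpath(path))


def check_spark_env(env=None):
    """Return a list of problems with the Spark setup; an empty list means OK."""
    env = os.environ if env is None else env
    problems = ["%s is not set" % name for name in REQUIRED_VARS if not env.get(name)]

    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home and not (Path(hadoop_home) / "bin" / "winutils.exe").is_file():
        problems.append("winutils.exe is missing from HADOOP_HOME/bin")

    spark_home = env.get("SPARK_HOME")
    if spark_home:
        spark_bin = _norm(os.path.join(spark_home, "bin"))
        entries = [_norm(p) for p in env.get("PATH", "").split(os.pathsep) if p]
        if spark_bin not in entries:
            problems.append("SPARK_HOME/bin is not on PATH")
    return problems
```

Running print(check_spark_env()) in a new terminal's Python session should print an empty list once steps 7 and 9 are complete.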
10. Close your terminal and open a new one. Type the command below.
pyspark --master local[2]
Notes: The PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables are used to launch the PySpark shell in Jupyter Notebook. The --master parameter is used for setting the master node address. Here we launch Spark locally on 2 cores for local testing.
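To confirm that Spark actually runs jobs, a small smoke test can help. The sketch below is an illustration and not the notebook from the video; the names master_url and smoke_test are made up, and it assumes that pyspark can be imported in the Python session where it runs. It starts its own local session and stops it afterwards, so it suits a plain Python session rather than a notebook that already has a Spark context.

```python
def master_url(cores):
    """Build the master address for running Spark locally on `cores` cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return "local[%d]" % cores


def smoke_test(cores=2):
    """Sum the numbers 0..99 on a local Spark session and return the result."""
    # Imported lazily so master_url() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(master_url(cores))
        .appName("install-check")
        .getOrCreate()
    )
    try:
        return spark.sparkContext.parallelize(range(100)).sum()
    finally:
        spark.stop()
```

If the installation works, smoke_test() returns 4950. In a notebook started with the pyspark command, a context named sc is typically already defined, so sc.parallelize(range(100)).sum() gives the same check without creating a new session.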
Done! Please let me know if you have any questions here or through Twitter. You can view the ipython notebook used in the video to test PySpark here!