Josep Ferrer



Web scraping in 2023 — Breaking it down to basics

And how to collect data online in 5 simple steps

Self-made image. A magnifying glass hovering over a website, representing web scraping.

The amount of data we produce every day is truly mind-blowing. There are 2.5 quintillion bytes of data created each day at our current pace. This is why today’s biggest database is open and free to everyone — and it is called the Internet.

So, can you imagine what you could do with all this data?

But I know that right now you must be wondering how to actually get this data.

The answer is quite straightforward: you can use web scraping! :D

Web scraping is an incredible technique that enables you to extract useful information from websites. Whether you’re conducting research, working in marketing, or involved in e-commerce, web scraping is an invaluable tool that can help you achieve your goals. The possibilities are endless!

In this article, I’ll break down the basics of web scraping, so you can get started with this technique in 2023.

Let’s dive in and discover what web scraping is all about!👇🏻

So, first things first…

What is Web Scraping?

Web scraping is used to collect data from websites. Put simply, it is a clever technique that allows you to automate the process of extracting information from websites.

Self-made image. After scraping the web, we get data and can store it in CSV files or a database.

Instead of spending hours manually copying and pasting information from web pages, web scraping software tools can do the work for you, quickly and efficiently. The best part is that you can collect various types of data, including text, images, and video content.
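To make the idea concrete, here is a minimal sketch of the "web page in, CSV out" flow. It uses only Python's standard library, and everything about the page is an assumption made for illustration: the SAMPLE_HTML fragment, the "product", "name" and "price" class names, and the ProductParser helper are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this HTML.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Laptop</span><span class="price">999.00</span></li>
  <li class="product"><span class="name">Mouse</span><span class="price">19.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> tags."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one dict per product
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.rows.append({})  # a new product starts here
        elif tag == "span" and css_class in ("name", "price") and self.rows:
            self._field = css_class

    def handle_data(self, data):
        if self._field and data.strip():
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def html_to_csv(html):
    """Parse the product list and return it as CSV text."""
    parser = ProductParser()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(parser.rows)
    return buffer.getvalue()


print(html_to_csv(SAMPLE_HTML))
```

In a real project the HTML would come from a live website and the CSV text would be written to a file or a database, but the shape of the task stays the same.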

⚠️ Legal disclaimer: It is important to keep in mind that web scraping can raise legal and ethical concerns if it involves collecting sensitive or copyrighted information without proper authorization. If you are accessing websites, you should always consider their terms of service.
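One practical habit that goes along with reading the terms of service is checking a site's robots.txt file, which lists the paths the site asks automated clients not to visit. It does not replace the terms of service, and the rules below are a made-up example, but Python's standard urllib.robotparser module makes the check easy to sketch. The is_allowed helper and the "my-scraper" user agent are hypothetical names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site serves its own at /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/products"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data"))
```

With these sample rules, the first URL is allowed and the second one is not.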

Now you might be wondering… what do I have to do?

You just need to follow some simple steps:

#1. Identify Your Data Source

To start web scraping, you must first determine which website or websites you want to extract data from. Even more important is deciding what specific information you want to extract.

Let’s make it simple with some examples:

1. E-commerce websites

Let’s say you run an online store and you want to keep tabs on your competitors’ prices and stock levels. With web scraping, you can automatically gather this information from their websites, saving you time and giving you a competitive edge.

Self-made image. Screenshot of the Amazon website.

Plus, you can also use web scraping to collect customer reviews, analyze product trends, and even track shipping times!

2. Social media platforms

Social media is a treasure trove of data, and web scraping can help you tap into it. For instance, you can scrape Twitter to gather data on hashtags, mentions, and trending topics, or scrape LinkedIn to collect information on job postings, industry trends, and more.

Self-made image. Screenshot of the Twitter website.

With web scraping, you can analyze user behavior, identify influencers, and improve your social media strategy.

3. Real estate websites

If you’re in the market for a new home, you know how time-consuming it can be to browse through endless property listings, right? That’s where web scraping comes in!

Self-made image. Screenshot of the Zillow website.

By scraping real estate websites, you can quickly gather information on available properties, rental rates, and more. So, whether you’re buying, selling, or investing in real estate, web scraping can be a powerful tool to help you save time and make smarter decisions.

#2. Understand the HTML Structure

Before building your web scraper, it’s important to understand the structure of the website’s HTML.

So… what’s HTML?

HTML stands for Hypertext Markup Language and is the language used to create web pages. Understanding the structure of the HTML will help you navigate the website and identify the specific data you want to extract.

But what’s even more important: understanding how the website is structured will allow you to extract any data you want to store.
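A quick way to get a feel for this structure is to print the tags of a page as an indented outline. The sketch below does that with Python's standard html.parser module. The SAMPLE_PAGE markup and the OutlineParser class are hypothetical, and the parser assumes every tag is properly closed (void tags such as <br> would throw off the indentation).

```python
from html.parser import HTMLParser

# Hypothetical page; the tags, ids and classes are made up for illustration.
SAMPLE_PAGE = """
<html>
  <body>
    <h1 id="title">Job offers</h1>
    <div class="job">
      <a href="/jobs/1">Data Scientist</a>
    </div>
  </body>
</html>
"""


class OutlineParser(HTMLParser):
    """Records every opening tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        described = "".join(f' {name}="{value}"' for name, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{described}>")
        self.depth += 1  # children of this tag are one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # assumes well-formed HTML with matching closing tags


def outline(html):
    """Return the page's tags as a list of indented lines."""
    parser = OutlineParser()
    parser.feed(html)
    return parser.lines


print("\n".join(outline(SAMPLE_PAGE)))
```

The outline shows where the data lives: here the job title sits in an <a> tag inside a <div class="job">, and that is exactly the kind of path a scraper follows.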

I strongly recommend Eugenia Anello’s article "Understanding the HTML Basics for Web Scraping" (betterprogramming.pub).

She explains the structure of an HTML page really nicely, and you can pick up the basics of HTML from her!

#3. Choose Your Web Scraping Tool

After you have a good understanding of the HTML structure, it’s time to select a web scraping tool. There are various tools available, both free and paid, that can help you extract data from websites. Some popular web scraping tools include the Python libraries BeautifulSoup, Scrapy, and Selenium.

Each tool has its own set of strengths and weaknesses, so be sure to choose the one that best suits your needs. I usually use — and strongly recommend — Selenium and BeautifulSoup.

To develop such projects, there are two main required libraries:

  • Selenium is used for automating web applications. It allows you to open a browser and perform tasks as a human being would, such as clicking buttons and searching for specific information on websites.
  • BeautifulSoup is a Python library for pulling data out of HTML and XML files.
Self-made image.
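As a rough illustration of what "pulling data out of HTML" looks like with BeautifulSoup, here is a small sketch. The SAMPLE_HTML markup, the "job" class name and the extract_jobs helper are assumptions for the example, not part of any real website. It requires the beautifulsoup4 package.

```python
# Hypothetical markup; a real page will use its own tags and class names.
SAMPLE_HTML = """
<div class="job"><a href="/jobs/1">Data Scientist</a></div>
<div class="job"><a href="/jobs/2">Data Engineer</a></div>
"""


def extract_jobs(html):
    """Return a list of {"title", "link"} dicts found in the given HTML."""
    # Imported here so this file can still be loaded without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.find_all("div", class_="job"):
        anchor = card.find("a")
        if anchor is None:
            continue  # skip cards without a link
        jobs.append({"title": anchor.get_text(strip=True), "link": anchor.get("href")})
    return jobs
```

Calling extract_jobs(SAMPLE_HTML) gives one dictionary per job card, ready to be written to a CSV file. In the combined setup, Selenium would supply the HTML and BeautifulSoup would do the extraction.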

Additionally, we need a driver to interact with our browser. To set up your environment, you first need to install all required libraries in your Python environment.

  1. Install Selenium: Run the following command in your command prompt or terminal: pip install selenium
  2. Install Beautiful Soup: Run the following command in your command prompt or terminal: pip install beautifulsoup4
  3. Download the driver. Selenium needs a driver to control the browser. Check your Google Chrome version and download the matching ChromeDriver from https://chromedriver.chromium.org/downloads. Unzip the driver and place it in a path you will remember; you will need this path later on! ;) Note that Selenium 4.6 and later ship with Selenium Manager, which can download a matching driver automatically, so this manual step may be unnecessary on recent versions.

⚠️ As I am a Google Chrome regular user, I use it as my default browser. But any other browser can be used as well.

#4. Build Your Web Scraper

Once you’ve chosen your web scraping tool, it’s time to build your web scraper. This involves writing code that instructs your web scraping tool how to navigate the website and extract the desired information.

To do so, we need a driver to simulate a user browsing the website and a library to pull data out of the website. This can be a complex process, but there are plenty of resources available online to help you get started. I recommend my tutorials "How to build a scraping tool for Linkedin in 7 minutes" and "How to build a Scraping Tool for Indeed in 9 minutes".

Another useful tutorial for creating a web scraper from scratch is "Web Scraping With Python: Beginner to Advanced."
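To show how the two pieces fit together, here is a minimal sketch, not a finished scraper. The fetch_page_source function drives headless Chrome through Selenium, while extract_titles pulls the text of every <h2> tag. To keep the example short and dependency-light, the extraction step uses the standard html.parser module instead of BeautifulSoup. The function names and the choice of <h2> as the target tag are assumptions for illustration.

```python
from html.parser import HTMLParser


def fetch_page_source(url, wait_seconds=10):
    """Open url in headless Chrome and return the rendered HTML."""
    # Imported here so this file can still be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_page_load_timeout(wait_seconds)
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


class HeadingParser(HTMLParser):
    """Collects the text inside every <h2> tag."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._inside = True
            self.titles.append("")

    def handle_data(self, data):
        if self._inside:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "h2":
            self._inside = False


def extract_titles(html):
    """Return the stripped text of all <h2> headings in html."""
    parser = HeadingParser()
    parser.feed(html)
    return [title.strip() for title in parser.titles]


def scrape(url, fetch=fetch_page_source):
    """Fetch a page and extract its headings; fetch can be swapped for testing."""
    return extract_titles(fetch(url))
```

Passing the fetch function as a parameter is a small design choice that lets you try the extraction logic on saved HTML without opening a browser, for example scrape("https://example.com", fetch=lambda url: "<h2>Hello</h2>").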

#5. Run Your Web Scraper

After building your web scraper, it’s time to run it and extract your data. Depending on the complexity of your project, this may take some time.

Once it is done, you can easily analyze your output data using tools like Python or R to gain valuable insights!
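As a small taste of that analysis step, the sketch below summarizes a price column from scraped output using only Python's standard library. The SAMPLE_CSV data, its column names and the summarize_prices helper are made up for the example.

```python
import csv
import io
from statistics import mean

# Hypothetical scraper output; a real run would read this from a CSV file.
SAMPLE_CSV = """name,price
Laptop,999.00
Mouse,19.50
Keyboard,49.50
"""


def summarize_prices(csv_text):
    """Return count, min, max and mean of the 'price' column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    prices = [float(row["price"]) for row in rows]
    return {
        "count": len(prices),
        "min": min(prices),
        "max": max(prices),
        "mean": mean(prices),
    }


print(summarize_prices(SAMPLE_CSV))
```

For larger datasets a library such as pandas is usually more convenient, but the idea is the same: load the scraped rows, convert the columns to the right types, and compute the numbers you care about.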

Main Conclusions

Web scraping is an excellent tool for businesses and individuals alike. To get started with web scraping, you first need to identify your data source, understand the structure of the website’s HTML, choose your web scraping tool, build your web scraper, and finally run it to extract your data.

With these steps in mind, you can start mastering web scraping and unlock valuable insights from websites.

Feel free to ask me any further questions! :D

Data always has a better idea — trust it.

