Mahyar Aboutalebi, Ph.D. 🎓

How Many Cloud-Free Sentinel-2 Images Have Been Captured Near Your Location?

In this post, we delve deeper into the query and metadata of Sentinel-2 images to count the clear and cloudy images captured around a specific location in the last year, all accomplished without the need to download the actual images!

Table of Contents

  1. 🌟 Introduction
  2. 🔍 Making Queries and Tricks
  3. 🕵️‍♂️ Histogram of Cloud Cover
  4. 📅 Time Series of Cloud Cover in 2023
  5. 📊 Generating a Summary Table of Cloudy Days
  6. 📈 Plotting Cloudy and Clear Days
  7. 📉 Visualizing All Three Plots (optional)

🌟 Introduction

In my previous post, which focused on downloading Sentinel-2 images using Python, we explored how to query Sentinel-2 products, browse their filenames, and use an image ID to download and visualize a specific scene. In this article, our objective is to dig deeper into the query table and extract the cloud cover values for all images captured in 2023 within a designated area of interest (AOI). We will categorize these images from clear to cloudy and build a comprehensive report on the number of images captured for our location, including a breakdown of clear and cloudy scenes. We will also present the results through visualizations: a histogram, a time series, and a pie chart. If you haven’t had the chance to read my previous post on downloading Sentinel-2 images using Python, you can find it at this link:

🔍 Making Queries and Tricks

In this step, I will guide you through querying the Copernicus Data Space Ecosystem catalogue and share a few tricks for obtaining a complete list of images along with their full attributes. To initiate a query, you simply need to specify the following parameters: the satellite name, the product level (either top-of-atmosphere or surface reflectance), the start and end dates of the image acquisition, and the coordinates of your area of interest (AOI).

For this article, the choice of product level is not important, as the cloud cover information is available in both the top-of-atmosphere and surface reflectance products. Here, I used the S2MSI2A product type, which corresponds to surface reflectance (Level-2A).

Our goal is to obtain a comprehensive summary of all Sentinel-2 images captured around your specified location within the last year. Therefore, the start date will be January 1, 2023, and the end date will be December 31, 2023.

import requests
import pandas as pd
import numpy as np

url_dataspace = "https://catalogue.dataspace.copernicus.eu/odata/v1"

# Filtering parameters
satellite = "SENTINEL-2"
level = "S2MSI2A"      # Level-2A (surface reflectance) product type
cloud_cover = 100      # upper bound on cloud cover (%) used in the query filter
# aoi_point = "POINT(-120.9970 37.6393)"
aoi_polygon = "POLYGON ((-121.0616 37.6391, -120.966 37.6391, -120.966 37.6987, -121.0616 37.6987, -121.0616 37.6391))"

start_date = "2023-01-01"
end_date = "2023-12-31"
start_date_full = start_date + "T00:00:00.000Z"
end_date_full = end_date + "T23:59:59.999Z"  # end of day, so the full year is covered

Next, you can execute the following code to initiate a query and retrieve a list of available images in the database for the specified configurations:

query = f"{url_dataspace}/Products?$filter=Collection/Name eq '{satellite}' and Attributes/OData.CSC.StringAttribute/any(att:att/Name eq 'productType' and att/OData.CSC.StringAttribute/Value eq '{level}') and Attributes/OData.CSC.DoubleAttribute/any(att:att/Name eq 'cloudCover' and att/OData.CSC.DoubleAttribute/Value lt {cloud_cover}) and OData.CSC.Intersects(area=geography'SRID=4326;{aoi_point}') and ContentDate/Start gt {start_date_full} and ContentDate/Start lt {end_date_full}"
response = requests.get(query).json()
result = pd.DataFrame.from_dict(response["value"])
result.to_csv('result.csv', index=False)

The key trick here is that this request returns only the first 20 products by default. To obtain the complete list, you need to append “&$top=1000” to the query; the “1000” requests up to the first 1000 products, a safe upper bound since Sentinel-2 typically captures fewer than 365 images per year over a given location, given its revisit time of roughly 2 to 5 days depending on latitude and swath overlap. Another important trick is that the default query output does not include the cloud cover of each image. To obtain the full list of attributes, you should also append “&$expand=Attributes”. Therefore, the final version of the query will be:

query = f"{url_dataspace}/Products?$filter=Collection/Name eq '{satellite}' and Attributes/OData.CSC.StringAttribute/any(att:att/Name eq 'productType' and att/OData.CSC.StringAttribute/Value eq '{level}') and Attributes/OData.CSC.DoubleAttribute/any(att:att/Name eq 'cloudCover' and att/OData.CSC.DoubleAttribute/Value lt {cloud_cover}) and OData.CSC.Intersects(area=geography'SRID=4326;{aoi_point}') and ContentDate/Start gt {start_date_full} and ContentDate/Start lt {end_date_full}&$top=1000&$expand=Attributes"
response = requests.get(query).json()
result = pd.DataFrame.from_dict(response["value"])
result.to_csv('result.csv', index=False)
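
Before moving on, it is worth confirming that the “&$top=1000” trick actually returned more than the default 20 records. A quick sanity check (my addition, not part of the original post):

# Verify how many products the query returned for 2023
print(f"Number of products returned: {len(result)}")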

If you open result.csv in Excel, you’ll see an extensive dictionary of information in the Attributes column, most of which is beyond the scope of this article. Our focus is solely on extracting the cloud cover for each row. To extract the cloud cover values from that column, execute the following code:

# Extract the cloudCover value by searching each row's list of attribute dictionaries
result['cloudCover'] = result['Attributes'].apply(
    lambda atts: next((a['Value'] for a in atts if a.get('Name') == 'cloudCover'), None)
)
result.to_csv('result_cloud_cover.csv', index=False)

Now that we have all the necessary information in the dataframe, we are ready to proceed with the analysis.
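
One detail the remaining snippets rely on is a sorted_result DataFrame with a datetime Date column, which is not built explicitly above. A minimal sketch, assuming the ContentDate column returned by the OData API holds dictionaries with a 'Start' key:

# Build the `sorted_result` DataFrame used by the plots below
result['cloudCover'] = pd.to_numeric(result['cloudCover'])  # ensure a numeric dtype
result['Date'] = pd.to_datetime(result['ContentDate'].apply(lambda d: d['Start']))
sorted_result = result.sort_values('Date').reset_index(drop=True)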

🕵️‍♂️ Histogram of Cloud Cover

An informative plot to start with is a histogram of the cloud cover values extracted for the entire year. The following lines generate a histogram of the cloud cover distribution for 2023 at my location:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# create a histogram of cloud cover data
fig, ax = plt.subplots(figsize=(8, 6))
sns.histplot(sorted_result['cloudCover'], bins=10, ax=ax, color='orange')

# set axis labels and title
ax.set_xlabel('Cloud Cover')
ax.set_ylabel('Count')
ax.set_title('Histogram of Cloud Cover')

# set x-axis style
ax.tick_params(axis='x', which='major', labelsize=12, length=6, width=2, color='gray')
ax.tick_params(axis='x', which='minor', labelsize=10, length=4, width=1, color='lightgray')
ax.spines['bottom'].set_linewidth(2)
ax.spines['bottom'].set_color('gray')

# set y-axis style
ax.tick_params(axis='y', which='major', labelsize=12, length=6, width=2, color='gray')
ax.tick_params(axis='y', which='minor', labelsize=10, length=4, width=1, color='lightgray')
ax.spines['left'].set_linewidth(2)
ax.spines['left'].set_color('gray')

# add grid lines
ax.grid(axis='y', linestyle='--', alpha=0.7)

# show the plot
plt.show()

and the outcome will be:

This histogram shows that the majority of images captured around my location in 2023 have less than 10% cloud cover (more than 80 images!). The other bins contain a consistently small number of images until above 90% cloud cover, where a noticeable peak shows that more than 20 images were almost fully cloudy.
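
If you want the exact number behind this observation, you can count the images below 10% cloud cover directly, which roughly corresponds to the first bin of the histogram (my addition, not in the original post):

# Count images with less than 10% cloud cover
num_clear = (sorted_result['cloudCover'] < 10).sum()
print(f"Images with less than 10% cloud cover: {num_clear}")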

📅 Time Series of Cloud Cover in 2023

With the image count in each cloud cover range in hand, it is interesting to examine the time series of cloud cover throughout the year. This view shows which months offer the best chances for clear images and which have less favorable conditions:

# create a time series plot of the cloud cover data against the dates
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(sorted_result['Date'], sorted_result['cloudCover'], color='orange', linewidth=2)

# set axis labels and title
ax.set_xlabel('Date', fontsize=14)
ax.set_ylabel('Cloud Cover', fontsize=14)
ax.set_title('Time Series of Cloud Cover', fontsize=16)

# set x-axis style
ax.tick_params(axis='x', which='major', labelsize=12, length=6, width=2, color='gray')
ax.tick_params(axis='x', which='minor', labelsize=10, length=4, width=1, color='lightgray')
ax.spines['bottom'].set_linewidth(2)
ax.spines['bottom'].set_color('gray')

# set y-axis style
ax.tick_params(axis='y', which='major', labelsize=12, length=6, width=2, color='gray')
ax.tick_params(axis='y', which='minor', labelsize=10, length=4, width=1, color='lightgray')
ax.spines['left'].set_linewidth(2)
ax.spines['left'].set_color('gray')

# add grid lines
ax.grid(axis='both', linestyle='--', alpha=0.7)

# show the plot
plt.show()

The time series will be:

As expected, from January to the end of March, the likelihood of obtaining clear images is very low, with most images in that period showing more than 40% cloud cover. The period from mid-April to October stands out as the optimal window for clear images, with cloud cover generally remaining below 40%. After October, however, the time series shows many peaks, indicating more frequent cloudy conditions until the end of the year.
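
To put numbers behind this seasonal pattern, you can compute the average cloud cover per calendar month (my addition, not in the original post):

# Average cloud cover (%) for each calendar month in 2023
monthly_mean = sorted_result.groupby(sorted_result['Date'].dt.month)['cloudCover'].mean().round(1)
print(monthly_mean)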

📊 Generating a Summary Table of Cloudy Days

The third step of the analysis is to create a summary table that shows the exact number of images captured in each month, with the cloud cover percentages grouped into five categories: less than 20%, 20% to 40%, 40% to 60%, 60% to 80%, and above 80%. We then calculate the ratio of the number of images in each class to the total number of images:

# create a new column to store the cloud cover range
sorted_result['cloud_range'] = pd.cut(sorted_result['cloudCover'], bins=[0, 20, 40, 60, 80, 100], labels=['0-20', '20-40', '40-60', '60-80', '80-100'])

# group by month and cloud range, count the number of days
table = sorted_result.groupby([sorted_result['Date'].dt.month, 'cloud_range']).size().unstack(fill_value=0)

# rename the index to month names (this assumes every month of 2023 has at least one image)
table.index = pd.date_range(start='2023-01-01', end='2023-12-31', freq='MS').strftime('%B')

# add a new row that shows the sum of each column
table.loc['Total'] = table.sum()

# add a new row that shows the percentage of each column
table.loc['Percentage (%)'] = table.loc['Total'].apply(lambda x: round(x / table.loc['Total'].sum() * 100))

# display the table
print(table)

The table will be:

This table illustrates the distribution of cloud cover ranges for the images captured in each month. Consistent with the histogram and time series analysis, the majority of images captured in July, August, and September were clear, whereas most images taken in December, January, February, and March were cloudy. Overall, 61% of the images had less than 20% cloud cover, indicating almost cloud-free conditions, while 26% had more than 80% cloud cover, indicating heavily clouded scenes in 2023 for the specified location.

📈 Plotting Cloudy and Clear Days

Since we now have the table, let’s use a pie chart to visually illustrate the exact values reported in the last row, specifically focusing on the number of images in each cloud cover range:

# create a pie chart of the percentage values
fig, ax = plt.subplots(figsize=(8, 8))

# create a list of colors for each section of the pie chart
colors = ['#ffa600', '#ff6361', '#bc5090', '#58508d', '#008080']

# plot the pie chart
ax.pie(table.loc['Percentage (%)'], labels=table.columns, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90, textprops={'fontsize': 14, 'fontweight': 'bold'})

# add a title
ax.set_title('Percentage of Cloud Cover Ranges', fontsize=16, fontweight='bold')

# show the plot
plt.show()

The plot will be:

As mentioned earlier, approximately 62% of the images were nearly cloud-free and usable, while around 20% were almost entirely obscured by clouds. The remaining categories each account for a comparable share, between 6% and 9%.

📉 Visualizing All Three Plots (optional)

This step is completely optional; it shows how to combine the three previous plots into a simple dashboard-style figure. The following lines display the plots side by side:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



# create a figure with three subplots
fig, axs = plt.subplots(1, 3, figsize=(18, 6))

# plot the time series on the first subplot
axs[0].plot(sorted_result['Date'], sorted_result['cloudCover'], color='blue', linewidth=2)
axs[0].set_xlabel('Date', fontsize=14)
axs[0].set_ylabel('Cloud Cover', fontsize=14)
axs[0].set_title('Time Series of Cloud Cover', fontsize=16, fontweight='bold')
axs[0].tick_params(axis='x', which='major', labelsize=10, length=6, width=2, color='gray')
axs[0].tick_params(axis='x', which='minor', labelsize=10, length=4, width=1, color='lightgray')
axs[0].spines['bottom'].set_linewidth(2)
axs[0].spines['bottom'].set_color('gray')
axs[0].tick_params(axis='y', which='major', labelsize=12, length=6, width=2, color='gray')
axs[0].tick_params(axis='y', which='minor', labelsize=10, length=4, width=1, color='lightgray')
axs[0].spines['left'].set_linewidth(2)
axs[0].spines['left'].set_color('gray')
axs[0].grid(axis='both', linestyle='--', alpha=0.7)

# plot the histogram on the second subplot
sns.histplot(sorted_result['cloudCover'], bins=10, ax=axs[1], color='orange')
axs[1].set_xlabel('Cloud Cover', fontsize=14)
axs[1].set_ylabel('Count', fontsize=14)
axs[1].set_title('Histogram of Cloud Cover', fontsize=16, fontweight='bold')
axs[1].tick_params(axis='x', which='major', labelsize=12, length=6, width=2, color='gray')
axs[1].tick_params(axis='x', which='minor', labelsize=10, length=4, width=1, color='lightgray')
axs[1].spines['bottom'].set_linewidth(2)
axs[1].spines['bottom'].set_color('gray')
axs[1].tick_params(axis='y', which='major', labelsize=12, length=6, width=2, color='gray')
axs[1].tick_params(axis='y', which='minor', labelsize=10, length=4, width=1, color='lightgray')
axs[1].spines['left'].set_linewidth(2)
axs[1].spines['left'].set_color('gray')
axs[1].grid(axis='y', linestyle='--', alpha=0.7)

# plot the pie chart on the third subplot
colors = ['#ffa600', '#ff6361', '#bc5090', '#58508d', '#008080']
axs[2].pie(table.loc['Percentage (%)'], labels=table.columns, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90, textprops={'fontsize': 14, 'fontweight': 'bold'})
axs[2].set_title('Percentage of Cloud Cover Ranges', fontsize=16, fontweight='bold')

# adjust the layout and spacing of the subplots
fig.tight_layout(pad=3)

# show the plot
plt.show()

The final plot will be:

Here are the relevant posts available through these links:
