avatarAsjad Naqvi

Summary

The provided content is a comprehensive guide on creating maps in Stata, focusing on visualizing the spatial distribution of the COVID-19 Policy Stringency Index, and includes instructions on preparing data, using shapefiles, and customizing maps.

Abstract

The guide begins by introducing the history of mapping capabilities in Stata, from the introduction of the tmap command in 2004 to the integration of spshape2dta and grmaps in Stata 15. It details the process of creating a map for the COVID-19 Policy Stringency Index, starting with setting up a folder structure for organizing files, downloading necessary map files, and preparing the COVID-19 data from the Oxford COVID-19 Government Response Tracker. The tutorial covers understanding shapefiles, converting them for use in Stata, and preparing the data for mapping. It explains how to merge datasets, handle map projections, and generate customized maps with automated code using locals. The guide also addresses the importance of accurate data merging based on country names, customizing color schemes, and adding informative titles and notes to the maps. The author emphasizes the use of relative paths, the management of no data regions, and the benefits of automation in code. The final section includes instructions for installing necessary Stata packages, generating label files for country names, and creating maps with labels and custom color schemes. The author, an economist with extensive Stata experience, invites readers to engage with the content, ask questions, and collaborate on future projects.

Opinions

  • The author values the importance of accurate and clean data, suggesting manual cleaning of country names for better merging of datasets.
  • There is an emphasis on the use of open-source tools and datasets, such as those provided by ESRI and the Oxford COVID-19 Government Response Tracker.
  • The guide suggests that the choice of map projection can significantly affect the representation of geographical data, opting for Google's Web Mercator projection for its readability.
  • The author demonstrates a preference for customization in data visualization, including the use of custom color schemes and automated code for efficiency and consistency.
  • The guide reflects the author's commitment to sharing knowledge and fostering a community of practice through invitations to connect on various platforms and participate in discussions on a dedicated Discord server.

Maps in Stata I

Maps are a powerful visualization tool for displaying the spatial distribution of data. In Stata, the ability to make maps was first introduced in 2004 by Maurizio Pisati, as the tmap command, which was upgraded in 2014 to the shp2dta and spmap commands. Stata 15 formally integrated both of these commands as spshape2dta and grmaps respectively in 2017 as part of its support for spatial analysis.

In this guide, will be learn how to create the following COVID-19 Policy Stringency Index map:

In order to follow this guide, a basic knowledge of Stata is assumed including a familiarity with the Stata interface, dofiles, and code syntax. The guide here will introduce shapefiles, map projections, merging datasets from different sources, followed by the commands to generate customized maps, customized color schemes, and the use of locals for code automation.

Understanding shapefiles

Spatial data comes in different formats. The most common version of map data are called shapefiles that was developed by ESRI, the makers of ArcGIS, which is also the standard industrial software for mapping and spatial analysis. ESRI has also developed the now well-known dashboard used by John Hopkin’s University (JHU) and several other countries to showcase COVID-19 maps and trends. Given the rise and popularity of maps, shapefiles are usually easily accessible for most countries including different layers of administrative boundaries. Shapefiles are basically a set of multiple files, of which, the following three core files are required to generate a map:

  • .shp: contains the coordinates of shapes which can be polygons, lines, or points
  • .dbf: contains the attributes of the shapes
  • .shx: has the index of spatial objects

Several other optional files are usually bundled with the above three, but an important one here is the projections file:

  • .prj: contains the projection or spatial reference system of the shape

While we won’t go too much into the details of projection systems, it is still essential to understand the core concept behind them. When we see maps, they are usually drawn on a flat two-dimensional surface to represent shape of the earth, a three-dimensional object. This 3-D to 2-D mapping can be done in multiple ways. Different projections are needed to calculate the correct distances or areas. But since we are just plotting data on a map, we will use Google’s Web Mercator projection, which distorts both distance and area, for making countries easier to read. Some countries also release their own projection systems to make maps as accurately as possible. An easy to relate to example given here, shows how a face can be drawn on a flat surface using different projections.

Task 1: Get the folders in order

In line with the previous guides, we use the following folder structure to keep track of the files:

Those, who have read and followed earlier guides, can see that here we add an additional folder, called GIS, to keep all the spatial files. If you are reading this guide for the first time, then just create a root folder, and within it, the following six subfolders shown above. The description of each folder is given in the figure.

Task 2: Get the map files

The generic map files for the world can be downloaded from different places on the internet. For this guide, we will use the ArcGIS 2020 world country boundaries file given at the link here: https://hub.arcgis.com/datasets/UIA::uia-world-countries-boundaries

(Note: Maps get updated regularly so please free to use a latest layer at the time of using this guide.)

The page looks like the image below. Since there are different versions of the files, make sure that the date of the file is somewhere in 2020. If not, then one can scroll down the page and select more recent versions of the same dataset.

On this page, just click on the Download drop-down menu and then click on the Shapefile, which will download a zipped file. Extract the contents of the zipped file in the GIS folder:

Here we get five files from the compressed zip file. In general, shapefiles can be directly easily viewed in softwares like ArcGIS (needs a license), QGIS (free) and GeoDa (free) just by opening the .shp file.

Task 3: Prepare the data

For the maps, we will use the Oxford COVID-19 Government Response Tracker (OxCGRT), that we also used in Guide 3. Guide 3, also included a detailed description on how to extract data from a Github repository. For the sake of completeness, the code required to set up the data set is repeated below:

clear
*** import the data from Github
import delim using "https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv", clear
*** new data has been added for USA and UK for regions so we drop the regions:
drop if regionname!=""
*** keep only the variables we need
keep date countryname stringencyindexfordisplay
ren stringencyindexfordisplay Index  
ren countryname country
*** fix the date
tostring date, gen(date2)    // string the date variable
gen year  = substr(date2,1,4)   // extract the date components
gen month = substr(date2,5,2)
gen day   = substr(date2,7,2)
destring year month day, replace
drop date date2
gen date = mdy(month,day,year)
format date %tdDD-Mon-CCyy
drop month day year
order country date
sort country date
drop if date < 21915    // 21915 = 1st January
** save the files
compress
save ./master/COVID_policies.dta, replace

In the code above, we keep the overall policy stringency index, country name, and the date variables. The date variable is fixed in the middle part of the code above and, in the last step, we save the cleaned data in the master subfolder with the name COVID_policies.dta.

Task 4: Using Shapefiles in Stata

In order to use the shapefiles, we need to install the following Stata packages:

*** install the core packages
ssc install spmap    // for the maps package
ssc install shp2dta  // shapefiles to dta. For versions < v15.
ssc install geo2xy   // for fixing the coordinate system

For Stata 15 or higher, the custom spmap can also be replaced with the Stata’s official grmap command. Since grmap is a wrapper for spmap, I will stick to the original command. Both can be used interchangeably after minor tweaks to the syntax. An official Stata guide on importing JHU COVID-19 data, and setting it up using the grmap is given here.

shp2dta is only necessary if you are using Stata 14 or earlier versions. shp2dta should work with most of the shapefiles but there is a chance it might not read all shapefiles correctly and give an error. For Stata 15 or higher, the built-in command spshape2dta can be used. geo2xy is used for changing the map projections in Stata.

In the first step, we read the shapefile in Stata and covert it to Stata format:

********* get the shapefiles in order in order
cd GIS    // switch to the GIS folder
spshape2dta "World_Countries__Generalized_.shp", replace saving(world)

This will show the following screen output:

Here we can see that two datasets are created: world.dta, which contains the attributes of the shapes (or countries in our case), and world_shp.dta which contains the outlines of the shapes.

For Stata 14 and earlier versions, the following shp2dta can be used:

*** For Stata 14 or earlier versions ONLY:
shp2dta using World_Countries__Generalized_.shp, database(world) coordinates(world_shp) genid(_ID) gencentroids(c) replace

We can also explore the world.dta and the world_shp.dta files:

*** explore the outline file
use world_shp, clear
scatter _Y _X, msize(tiny) msymbol(point)

In the file above, each countries are defined by the variable _ID and are stored as a series of points. This is a standard structure of map files (spatial JSON files also have a similar concept of connecting dots to form a shape). From the command above, we get the following scatter:

In the map above, Antartica sticks out prominently at the bottom but has no known reported cases. Therefore, we can drop it to make some space for other countries. In order to drop regions or countries, we need to get their names. Since names exists in the attributes data file world.dta, we can merge and drop as follows:

*** merge with attributes file to get the names
merge m:1 _ID using world 
drop if COUNTRY=="Antarctica" // drop the polar regions
drop rec_header- _merge    // drop all additional variables
scatter _Y _X, msize(tiny) msymbol(point)

This makes the map a bit more readable. In order to improve the map further, we change the default map projection to Google’s Web Mercator projection:

geo2xy _Y _X, proj (web_mercator) replace
scatter _Y _X, msize(tiny) msymbol(point)

Which gives us:

Here you can see that the scale of the axes has changed and the countries are squeezed in the middle and stretched a bit more on the top and the bottom. This (sort of misleading) projection makes European countries a bit more prominent by making them larger and also makes their data easier to read. We can now save the modified world_shp.dta boundary file with the updated information:

sort _ID
save, replace

As an additional step, we also generate a file which contains the names of the countries for labeling. This step is strictly not necessary for this guide, but this information is useful in general. Since names and center points of country shapes exist in the attribute world.dta file, we use the following code to generate a labels file:

*** generate a file for labels
use world, clear  
 ren COUNTRY country
 drop if country=="Antarctica"
 keep _CX _CY country
 geo2xy _CY _CX, proj(web_mercator) replace  // fix the projection
 compress
 save world_label.dta, replace

Here we just keep the X and Y coordinates of the center points and the country names. We also need to change the projection of the coordinates so they can be correctly overlaid with the boundaries.

Note: for labels we can also just use the original world.dta file, but the advantage of the additional file is that it allows us to custom label some selected group of countries and regions. This will be used and discussed in later guides.

We are now ready to make some proper maps. Here we start by loading the attribute file and map it using the spmap command:

use world, clear
 ren COUNTRY country             
 drop if country=="Antarctica"
*** the first map
spmap using world_shp, id(_ID)

which gives us the following figure:

Note the use of relative paths here. Since we are in the GIS directory (we typed cd GIS earlier to process shapefiles), the two dots .. imply that we first go up one directory, and then go back down in the graphs folder where we save the map1.png file.

We can also generate the same map above with country labels:

** second map with labels
spmap using world_shp, ///
 id(_ID) ocolor(gs4 ..) osize(vvthin ..)  ///
  label(data("world_label.dta") x(_CX) y(_CY) label(country) size(*0.45 ..) length(25))

The graph above is a bit messy with the country name clusters but it is to just showcase how this can be done.

A random note: The north-most label shows the Svelbard region, which belongs to Norway (also known as Spitzbergen in German). Svelbard lies just below the Arctic circle and it is one of the coldest and the north-most inhabited regions in the world. It also has no known COVID-19 cases.

Task 5: Putting it all together

We can now merge the world.dta attribute dataset with the COVID19 policy database we saved earlier. The two datasets can be combined on the country name variable.

Note: Merging on names, in general, is sub-optimal since names and spellings can vary. Ideally, the datasets should contain unique identifiers for each region. For example, countries like France have territories all over the world. Sometimes, these territories are labeled by their actual names and sometimes just as France. Similar issues exist with small island territories, regions with contested boundaries etc. While typically, such regions tend to get dropped (or even overlooked), accuracy in data management requires these need be correctly identified across multiple datasets.

Since we are merging datasets from different databases, and names vary across datasets, not all names will merge perfectly. Here, I will just show some examples of basic (and manual) name cleaning we can do before the merge (there are other countries that need to be cleaned up as well):

*** clean some country names
replace country="Cote d'Ivoire" if country=="Côte d'Ivoire"
replace country="Kyrgyz Republic" if country=="Kyrgyzstan"
replace country="Slovak Republic" if country=="Slovakia"
replace country="Democratic Republic of Congo" if country=="Congo DRC"
replace country="Russia" if country=="Russian Federation"
replace country="Palestine" if country=="Palestinian Territory"

We can now do a one-to-many 1:m merge since we are combining countries in world.dta with countries plus dates in COVID19_policies.dta:

merge 1:m country using ../master/COVID_policies
tab country if _m==1  // show the unmerged countries in world.dta
tab country if _m==2 // show the unmerged countries in COVID data
*** see exactly how many countries merged
egen tag = tag(country)  // tag=1 for each country observation
tab _m if tag==1         // 178 countries merge perfectly.
drop if _m==2   // drop the unmerged countries from the policy data

We can now make the first map of policy stringency:

***
summ date
spmap Index using world_shp if date==`r(max)', ///
 id(_ID) cln(11) fcolor(Heat)

This is the default Stata map layout. It contains eleven endogenously generated cut-offs cln(11) and use the Heat color scheme fcolor(Heat). See help spmap to show the other color schemes available.

We can refine the above map further:

summ date 
spmap Index using world_shp if date==`r(max)', ///
 id(_ID) ocolor(black ..) osize(vvthin ..)  ///
  clmethod(custom) clbreaks(0(10)100)  fcolor(Heat)  ///
  legend(pos(7) size(*1)) legstyle(2)  /// 
   title("COVID-19 Policy Stringency Index", size(medium)) ///
   note("Data source: Oxford COVID-19 Government Response Tracker. Mercator projection used. Antartica dropped from maps." , size(tiny))

Here we make the outline osize(vvthin ..) very very thin. We also defined a custom legend cut-off clmethod(custom) clbreaks(0(10)100). We also modify the legend position, size and marker separators legstyle(2) and add a title and notes.

This already looks neater but we can some more useful additional information and tone-down the colors a bit.

In the last step, we can add a custom color using the colorpalette package (discussed at length in Guide 2 and Guide 3), and the date when the map was created:

*** get the color scheme packages if not installed
ssc install palettes, replace    // for color palettes
ssc install colrspace, replace   // for expanding the color base
*** generate a local for today's date
local date: display %tdd_m_y date(c(current_date), "DMY")

*** generate a local for the w3 color scheme
colorpalette w3 light-blue, n(10)  nograph
local colors `r(p)'
summ date 
spmap Index using world_shp if date==`r(max)', ///
 id(_ID) ///
 clmethod(custom) clbreaks(0(10)100)    ///
 ocolor(black ..) fcolor("`colors'") osize(0.04 ..)  ///
 ndocolor(black ..) ndfcolor(gs10 ..) ndsize(0.04 ..) ndlabel("Missing") ///
  legend(pos(7) size(*0.8)) legstyle(2) /// 
   title("COVID-19 Policy Stringency Index (`date')", size(medium)) ///
   note("Data source: Oxford COVID-19 Government Response Tracker. Mercator projection used. Antartica dropped from maps." , size(tiny))

Here we customize the line width using the osize(0.04 ..) option. We also add information for how to deal with no data (abbreviated nd). The no data colors follow the same logic ndocolor() is outline color, ndfcolor() is fill color, ndsize() is line width and ndlabel() modifies the legend name. The local date is used to add date information to the title.

From the code above, we get the following final map:

The locals help us with the automation process of the code. Automation in codes is essential to expedite tasks and avoid human errors. These points are discussed at length in Guide 1 and Guide 2 and will be used throughout the various Stata guides. Here we use the w3 CSS amber color scheme pre-defined in colorpalette and colrspace packages which uses very soft gradients.

Hope you find this guide useful! Try making a map using different indicators and other color schemes available in the colorpalette package.

About the author

I am an economist by profession and I have been using Stata since 2003. I am currently based in Vienna, Austria where I work at the Vienna University of Economics and Business (WU) and the International Institute for Applied Systems Analysis (IIASA). You can see my profile, research, and projects on GitHub or my website. You can connect with me via Medium, Twitter, LinkedIn, or simply via email: [email protected]. If you have questions regarding the Guide or Stata in general post them on The Code Block Discord server. We can also collaborate via UpWork.

The Stata Guide releases awesome new content regularly. Subscribe, Clap, and/or Follow the guide if you like the content!

Stata
Maps
Covid-19
Color Scheme
Recommended from ReadMedium