Lab Exercise #2: Correlation, Regression and Modeling of Population and Population Density with nighttime imagery provided by the DMSP OLS satellite.

Background

In this lab you will perform analyses and use data associated with my dissertation, Census From Heaven: estimation of human population parameters using Nighttime Satellite Imagery. The datasets cover the world, the United States, and the Los Angeles basin.

In the world directory you have two coverages and one grid: a point coverage of world cities with a population attribute for about 2,000 of them, a country boundaries coverage with various aggregate national statistics as attributes, and a low-gain DMSP OLS composite image of the earth at night. All of these are in the dreaded Goode’s Interrupted Homolosine projection. (This is a strange projection not explicitly supported by ArcView or Arc/INFO, but you can read and analyze the data anyway.)

The unitedstates directory contains four grids and one coverage. Usalazea is a coverage of the U.S. counties. Usatnight is a low-gain composite DMSP OLS image of the conterminous U.S. Usnightbinary is a binary (1 = lit, 0 = dark) image of the same conterminous U.S. Uspopden is a grid of population density for the conterminous U.S. derived from 1990 census block-group polygons. Dmspmxdn is another nighttime image in which the value of each pixel is the highest light intensity level recorded over the hundreds of orbits used to make the image. All of the U.S. datasets are in the Lambert_Azimuth projection (do a ‘describe’ for the parameters). All of these grids have pixels, or ‘cells’, that are 1 km² in size.

The LosAngeles directory has a subdirectory with TIGER files of LA streets and LA block-group polygons from the 1990 U.S. census. The grids included are: ladmsplg, a non-radiance-calibrated low-gain DMSP image of the LA basin; lajobdensity, a measure of the employment-based population density of Los Angeles County; lalandscan, Oak Ridge National Laboratory’s representation of the LA area’s population density based on a fairly sophisticated model and data; lanightradcal, a radiance-calibrated measure of light intensity at night over the LA basin; and lapopdensity, which is clipped from the uspopden dataset in the unitedstates directory.

The following few paragraphs provide a modicum of background information for the nighttime satellite imagery you will be using.

Technical history of the DMSP OLS platform and related imagery

The Defense Meteorological Satellite Program’s Operational Linescan System (DMSP OLS) platform was designed as a meteorological satellite for the United States Air Force. The DMSP system consists of two sun-synchronous polar-orbiting satellites at an average altitude of 865 kilometers above the earth. The swath width of a typical DMSP image is 3,000 kilometers. One satellite observes the earth at dawn and dusk; the second observes the earth at approximately noon and midnight. The sensor on the system has two bands: a panchromatic visible-near infrared (VNIR) band and a thermal infrared band. Nighttime imagery provided by the DMSP OLS has been available since the early 1970s. The DMSP sensors are more than four orders of magnitude more sensitive to visible-near-infrared radiances than traditional satellite sensors optimized for daytime observation. However, the high sensitivity of the sensor was not implemented to see city lights; it was implemented to see lunar radiation reflected from the earth’s clouds at night. The variability of lunar intensity over the lunar cycle is one of the reasons why the satellite system’s sensors were designed with such a large sensitivity range. As you will see later, this proved to be very fortuitous for studies of intra-urban population density.

The digital archive of the DMSP OLS data is housed at the National Geophysical Data Center (NGDC) in Boulder, Colorado (the NGDC is part of NOAA). Algorithms developed by Elvidge et al. have produced a 1 km² resolution dataset of the city lights of the continental United States. Elvidge and company developed algorithms to identify spatio-temporally stable VNIR emission sources using images from hundreds of orbits of the DMSP OLS platform. The resulting hyper-temporal dataset is cloud-free because the infrared band of the system was used to screen out cloud-impacted data. Later, a global version was prepared. At the time of this research there were two versions of these data available.

These earlier datasets are referred to as the ‘high-gain’ data product. Two kinds of high-gain data are used in this research. In the first kind, the value of a pixel is the percentage of cloud-free orbits sampling that pixel in which light was observed. For example, if observed light levels were 20, 60, 0, and 40 for four observations from four orbits, the value in the pixel would be 75, because light was observed on 75% of the orbits. The second kind of high-gain data simply uses the maximum light level observed in a given pixel over those cloud-free orbits (in the previous example the value would be 60). Only the maximum light level data can be interpreted as a measure of observed light intensity. The primary drawback of these data is the saturation of pixels in urban areas. Both high-gain datasets are virtually binary, with mostly black or dark pixels valued at 0 and brightly lit ‘urban’ pixels with a value of 63 (the DMSP sensor has a six-bit quantization). These data lend themselves to aggregate estimation of urban ‘cluster’ populations but are not good at estimating population density variation within the urban clusters. This drawback was identified and resulted in a special request to the Air Force regarding the use of the DMSP OLS platform.
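The two high-gain pixel values in the example above can be reproduced with a few lines of Python. This is only an illustrative sketch of the arithmetic: the function names percent_lit and max_light are made up for this handout and are not part of any DMSP processing software.

```python
def percent_lit(observations):
    """Percentage of cloud-free observations in which any light was detected."""
    lit_count = sum(1 for value in observations if value > 0)
    return 100.0 * lit_count / len(observations)


def max_light(observations):
    """Highest light level seen across the cloud-free observations."""
    return max(observations)


# The four observations from the example in the text.
observations = [20, 60, 0, 40]

frequency_value = percent_lit(observations)  # first kind of high-gain value
maximum_value = max_light(observations)      # second kind of high-gain value
print(frequency_value, maximum_value)
```

Note that the first value says nothing about how bright the pixel was, only how often it was lit, which is why only the maximum-value product can be read as an intensity.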

As mentioned before, the DMSP OLS platform was designed to observe reflected lunar radiation at night (primarily reflected from clouds). During the days just before and after a new moon there is very little lunar radiation striking the earth. Consequently, the sensor has its gain set to its maximum possible value. The NGDC requested that the Air Force turn down the gain on several orbits near the new moon. This request was honored by the Air Force and resulted in what is now referred to as the ‘low-gain’ data product. Turning down the gain produced dramatic results with respect to the saturation of the ‘urban’ pixels. The low-gain data show dramatic variation of light intensity within the urban areas, and they can easily be calibrated to at-sensor radiances. Hyper-temporal datasets similar to the previous data were made using the low-gain orbits. This research utilizes both the ‘high-gain’ and ‘low-gain’ DMSP data.

Outline of your tasks/objectives

            First you will look at the global datasets and try to build a model that estimates the population of the cities based solely on manipulations of the nighttime image. Then you will try to improve that model by incorporating aggregate national statistics. Second, you will apply a similar model to the U.S. data, note the differences in performance, and provide an explanation for them. Then you will look at the U.S. data to develop a disaggregate population density model based in some way on the spatial nature of the data or the intensity of the light. Third, you will look at the concept of ‘ambient’ population density in the Los Angeles area using both residence-based and employment-based population density. Finally, you will select some city in the U.S., build a population density model for it using the nighttime imagery, map the residuals in your model, and produce a publication-quality one-page figure characterizing the model and its performance for your particular city.

 

Analysis of World data

            Most of you have probably seen the nighttime imagery from the DMSP OLS. (If you haven’t, you will soon.) Explore the global image of nighttime lights with country boundaries, city points, etc. It may seem obvious that the lights represent population. However, building a mathematical model derived from the lights to represent population is not as easy as it looks.

Task #1: Using Arc/INFO and/or ArcView, create a polygon coverage that has every contiguous blob of light in the nighttime image. To do this you will first need to create a binary grid from the ‘earthatnight’ grid using the CON function in GRID. Next, run a REGIONGROUP command in GRID on the binary grid. Finally, run a GRIDPOLY command so that you have a coverage of polygons that represent all the urban clusters in the world as identified by the nighttime imagery. Each of these polygons has a unique ID.
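To see what the first two steps of Task #1 do conceptually, here is a small pure-Python sketch on a toy grid. It is not the Arc/INFO workflow itself: binarize plays the role of the CON step and label_clusters plays the role of the REGIONGROUP step, and both names, the threshold of 0, and the choice of eight-neighbor connectivity are assumptions made for illustration.

```python
from collections import deque


def binarize(grid, threshold=0):
    """Return a 1/0 grid: 1 where the light value exceeds the threshold."""
    return [[1 if value > threshold else 0 for value in row] for row in grid]


def label_clusters(binary, eight_connected=True):
    """Give every contiguous group of lit cells its own integer ID."""
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    if eight_connected:
        offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)]
    else:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] == 1 and labels[r][c] == 0:
                # Start a new cluster and flood-fill outward from this cell.
                next_id += 1
                labels[r][c] = next_id
                queue = deque([(r, c)])
                while queue:
                    cr, cc = queue.popleft()
                    for dr, dc in offsets:
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and binary[nr][nc] == 1
                                and labels[nr][nc] == 0):
                            labels[nr][nc] = next_id
                            queue.append((nr, nc))
    return labels, next_id


def cluster_areas(labels):
    """Count cells per cluster ID (each cell is 1 km2 in the lab grids)."""
    areas = {}
    for row in labels:
        for cluster_id in row:
            if cluster_id:
                areas[cluster_id] = areas.get(cluster_id, 0) + 1
    return areas


# Toy 'nighttime image': 0 is dark, anything else is lit.
night = [
    [0, 5, 7, 0, 0],
    [0, 9, 0, 0, 3],
    [0, 0, 0, 0, 4],
    [2, 0, 0, 0, 0],
]
labels, cluster_count = label_clusters(binarize(night))
areas = cluster_areas(labels)
print(cluster_count, areas)
```

The cell counts per cluster are the same numbers you would read from the value attribute table of the region-grouped grid, which is where the AREA attribute in Task #2 can come from.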

Task #2: Right now the polygon coverage you have does not have any attributes other than an ID. You will need it to have the following attributes: AREA, COUNTRY (preferably a FIPS code), GDP/capita of the country, the number of cities with known population that fall inside the polygon, the names of these cities as one long string, and the sum of the population of these cities. (Sometimes, due to conurbation, more than one city falls into a polygon.) There are various ways to create this table. You may want to join the VAT from the grid that resulted from the REGIONGROUP command to get the area value. You will have to perform some kind of intersection to get the point attributes of the cities linked to your urban cluster polygon coverage. You will then have to use the country coverage to get attributes like GDP/capita, etc. We’ll futz around with these problems in the second half of class.

Task #3: Output this table into a txt or dbf file that can be read into MS Excel. Export it from Excel as a tab-delimited text file and read it into the JMP statistics package. Answer the following questions:

 

1) Run a simple linear regression between area and total population for each of the cities in the table. Describe the results and problems.

 

2) Transform your area and total population values to Ln(Area) and Ln(total population) and run the regression again. Describe the results and problems.

 

3) Color code your points so that points representing cities in countries with a GDP/capita less than $1,000 / year are red, $1,001-$5,000/year are blue, and over $5,000/year are green. Run the same regression on Ln(area) vs. Ln(total Population).  Describe the results.

 

4) Create a new column in JMP called IncomeClass (or something similar) based on the red, blue, green code above. Run separate regressions on Ln(area) vs. Ln(total population) for each class. How do these regressions on subsets of the data compare to the regression on all the data? Describe and explain the difference.

5) Assume you only had the regression knowledge from the previous exercises and you had to estimate the populations of Mumbai, India; London, England; and Cali, Colombia from only their areal extent as measured in the nighttime satellite image. What would your estimates and 95% confidence intervals be for those cities if you used the global parameters? What would your estimates and 95% confidence intervals be for those cities if you used the regression parameters derived after sub-setting the data to ‘rich’, ‘mid-income’ and ‘poor’? Check the “actual” population figures for those cities and see if they landed inside your 95% confidence intervals.

6) How could you use these regression models, a nighttime image of the world, and a % urban figure for every nation of the world to estimate the total global population?
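As a rough illustration of the arithmetic behind questions 2 and 6, the sketch below fits Ln(population) against Ln(area) by ordinary least squares and then scales up the predicted urban population of one nation by its % urban figure. Everything here is an assumption made for illustration: the function names are hypothetical, the numbers are made-up toy values (not lab data), and dividing by the urban fraction is only one possible way to approach question 6. You will do the real regressions in JMP. Also keep in mind that exponentiating a log-scale prediction gives something closer to a median than a mean.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_log_log(areas, populations):
    """Fit Ln(population) = intercept + slope * Ln(area)."""
    log_areas = [math.log(a) for a in areas]
    log_pops = [math.log(p) for p in populations]
    return fit_line(log_areas, log_pops)


def predict_population(area, slope, intercept):
    """Back-transform the log-scale prediction to a population."""
    return math.exp(intercept + slope * math.log(area))


def estimate_national_total(cluster_areas, slope, intercept, urban_fraction):
    """Sum predicted cluster populations, then scale by the urban fraction."""
    urban_population = sum(
        predict_population(a, slope, intercept) for a in cluster_areas
    )
    return urban_population / urban_fraction


# Made-up clusters that follow population = 1000 * area ** 1.2 exactly.
toy_areas = [10.0, 50.0, 200.0, 1000.0]                  # km2
toy_populations = [1000.0 * a ** 1.2 for a in toy_areas]

slope, intercept = fit_log_log(toy_areas, toy_populations)

# A hypothetical nation with two lit clusters and 50% of its people urban.
national_total = estimate_national_total([10.0, 200.0], slope, intercept, 0.5)
print(slope, math.exp(intercept), national_total)
```

Summing such national totals over every country would give one crude global estimate; the sub-setting by income class in question 4 suggests using different slope and intercept values for different groups of countries.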

 

Analysis of United States Data

Task #1: Generate a table similar to the one you made for the world using the population density and nighttime imagery grids for the United States. This table will have many more records because you have a known population for every cluster identified by the nighttime imagery over the United States, whereas with the world analysis you only had population data for a limited number of U.S. cities. This table should have at least the following fields: “ClusterID”, “Cluster Area”, “Cluster Population”, “LnArea”, “LnPop”.

 

7) Produce a histogram of the population density data.

8) Produce a histogram of the usatnight data.

9) By looking at these histograms, comment on the likelihood that a model derived from nighttime light emissions could predict population density.

10) Produce a correlogram of the pop density data (use the CORRELATION function several times).

11) Produce a correlogram of the usatnight data.

12) What does the correlogram of the popdensity data suggest about the effectiveness of a perfect model if it is mis-registered by one pixel?

13) Run a FOCALMEAN on the pop density data (use a 5 x 5 and an 11 x 11 filter). Now generate two new correlograms for the ‘smoothed’ data. How does FOCALMEAN work, and how does it change the data?

 

14) Perform the same Ln(area) vs. Ln(population) regression on these data. Describe the results and how they differ from the results for the U.S. in the world-level analysis. Which regression parameters do you think are more accurate? Does this U.S.-level study weaken or strengthen the argument that the areal extent of a city as measured by the DMSP OLS satellite is a good predictor of that city’s population?

 

15) Use the CORRELATION command in GRID to get a simple correlation coefficient between the U.S. population density grid and the usatnight grid. Record this R or R². Comment on its value and provide an explanation.

 

16) Model Building: We have just evaluated a simple model to predict population density from the nighttime image. The model was a simple linear model that suggests that light intensity as measured by the DMSP OLS is directly proportional to population density. Our R² was not 1.0, so our model was not perfect. Build a better model, describe/define it, justify it, and evaluate it. Describe how it compares to the simple model described above.
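The correlogram and smoothing questions above (10 through 13) can be pictured with a small pure-Python sketch. It is a conceptual stand-in, not the GRID commands: lagged_correlation correlates a grid with a copy of itself shifted east by a number of cells, and focal_mean replaces each cell with the mean of a square window around it. The function names, the east-only shift, the toy grid, and the edge handling (averaging only the cells that exist near the border) are all assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(grid, lag):
    """Correlate each cell with the cell `lag` columns to its east."""
    here, shifted = [], []
    for row in grid:
        for c in range(len(row) - lag):
            here.append(row[c])
            shifted.append(row[c + lag])
    return pearson(here, shifted)


def correlogram(grid, max_lag):
    """Correlation at lags 0, 1, ..., max_lag (in cells)."""
    return [lagged_correlation(grid, lag) for lag in range(max_lag + 1)]


def focal_mean(grid, size=3):
    """Mean of the size x size window centered on each cell."""
    half = size // 2
    rows, cols = len(grid), len(grid[0])
    smoothed = []
    for r in range(rows):
        smoothed_row = []
        for c in range(cols):
            window = [
                grid[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            smoothed_row.append(sum(window) / len(window))
        smoothed.append(smoothed_row)
    return smoothed


# Toy 'population density' grid with a single peak.
density = [
    [1, 2, 4, 2, 1, 0],
    [2, 5, 9, 5, 2, 1],
    [1, 2, 4, 2, 1, 0],
]
raw_correlogram = correlogram(density, 2)
smoothed_correlogram = correlogram(focal_mean(density, 3), 2)
print(raw_correlogram)
print(smoothed_correlogram)
```

The lag-1 value is the quantity question 12 is getting at: it is the correlation you would obtain from a model that reproduces the grid perfectly but is shifted by one pixel. Comparing the raw and smoothed lists shows how the moving-window mean changes the spatial autocorrelation.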

 

Analysis of the Los Angeles Data

            If you look at these data for a while you will find that no model that uses nighttime lights alone will ever predict population density perfectly. However, this may be because the census data are only a particular kind of representation of population density: residential population density. They do not account for employment-based population density. Suppose you are trying to characterize ‘ambient’ population density, that is, a temporally averaged measure of population density that accounts for human mobility, employment, entertainment, etc. Without implanting GPS receivers in the heads of a large population and monitoring their spatial behavior, this ‘ambient’ population density is very difficult to measure. A crude approximation might be the average of residence-based population density and employment-based population density.

17) Apply your model of population density prediction from before to the Los Angeles urban cluster.

            a) What is the correlation of your model to residence based pop den?

            b) What is the correlation of your model to employment based pop den?

            c) What is the correlation of your model to the average of these two?

 

18) What do your results above suggest regarding your model’s ability to predict ‘ambient’ population density?

 

19) Visualization of Error: Subtract your model from each of the following: Residence-based pop den, Employment based pop den, and average of Residence and employment based pop den. Which errors look most random? Which ‘map of residuals’ has the smallest mean absolute deviation? Produce correlograms for each of these residual maps. What do these correlograms suggest about these three images?
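The bookkeeping for questions 17 through 19 can be sketched in a few lines of Python. The grids below are tiny made-up examples, the function names are hypothetical, and ‘mean absolute deviation’ is read here as the mean of the absolute residual values, which is an assumption about what the question intends. The residual is taken as observed minus model, following the wording of question 19.

```python
def ambient_density(residence, employment):
    """Crude 'ambient' density: cell-by-cell average of the two grids."""
    return [
        [(res + emp) / 2 for res, emp in zip(res_row, emp_row)]
        for res_row, emp_row in zip(residence, employment)
    ]


def residual_map(observed, model):
    """Observed minus modeled density, cell by cell."""
    return [
        [obs - mod for obs, mod in zip(obs_row, mod_row)]
        for obs_row, mod_row in zip(observed, model)
    ]


def mean_absolute_residual(residuals):
    """Mean of the absolute values of all residual cells."""
    values = [abs(value) for row in residuals for value in row]
    return sum(values) / len(values)


# Made-up 2 x 2 grids (persons per km2), for illustration only.
residence = [[100, 400], [50, 0]]
employment = [[300, 200], [150, 100]]
model = [[200, 250], [100, 60]]

ambient = ambient_density(residence, employment)

errors = {
    "residence": mean_absolute_residual(residual_map(residence, model)),
    "employment": mean_absolute_residual(residual_map(employment, model)),
    "ambient": mean_absolute_residual(residual_map(ambient, model)),
}
print(errors)
```

In this toy case the model sits closest to the averaged grid, but that is a property of the invented numbers; whether the same holds for Los Angeles is what the questions ask you to find out.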

 

Study your own city

Pick some city in the U.S. dataset other than Los Angeles. Apply your best population density model to this city and evaluate the model. Get a map of the city off the web. Generate a map of your errors or residuals. Produce a one-page publication-quality figure showing this city, your model, the errors, the nighttime lights image, and a map of the city. Explain the model and its errors as briefly as possible.