Lab Exercise #2: By Jennifer Doe

 Due May 6, 2003

 

World Data:

 

Question 1: Run a simple linear regression between Area_1 (from earthlights poly coverage) and PopMetro from the worldcities.  Describe the results and problems.

 

The R2 is 0.052868 meaning that the simple linear regression model used here only accounts for 5.29% of the variation in the data.  Because F = 141.7794 the probability is <0.0001 that this is occurring by chance.  Some of the world cities have the same earthlights polygon associated with them because they are very close together.  In general there are data gaps where one piece of data is present but not both pieces required for the analysis.  There are also many more small cities than large ones so there is a big blob of data points very close to the origin.  In the large cities the population does not seem to respond linearly to the area the light occupies.

 

 

 

 

Question 2:  Run a simple regression on the Ln(Area_1) and Ln(PopMetro).  Describe the results and problems.

 

The regression line generated by the Observations available (253 observations) is on the Bivariate Fit of Ln(PopMetro by Ln(PopArea) printout.  The R2 is 0.363861 meaning that the new model using the log values now accounts for 36.4% of the variation in the data.  The F = 143.5677 and the probability is <0.0001.  There are still some points where the Population is significantly larger than the Model would predict.  Istanbul in Turkey, Mexico City in Mexico, and Karachi in Pakistan all show up higher in population than this model would predict (i.e. more people per area of light). There are also places where there is more light area than people the model predicts.  Some examples of this are Columbus in the United States, Valletta in Malta and Saint Joh in Canada.

 

Question 3:  Color-code your points with red, blue or green for GDP/capita. Run the same regression on Ln(PopArea) vs Ln(PopMetro).  Describe the results.

 

For reasons unclear to me, at this point some of the GDP data didn’t attach correctly to the table.  The most annoying of these omissions was that no GDP data was attached for the United States.  To correct this problem, I did a subset of data containing only the records with Ln(Area) and Ln(PopMetro) and did an export to Excel where I sorted based on country then I  remove all the other countries with no GDP attached except the U.S.  I could add the U.S. to the IncomeClass 3 manually. This problem was biasing my results on the Low end by adding all the “missing GDP countries” data to the low GDP group.  My original curves for the “biased data” are attached at the end, while the corrected curve is shown below.  This change did lower the number of total observations and change the number in each group.

 

The green dots for the highest GDP/capita countries fall mainly below the line especially if the area (light polygon is small).  The model is overestimating the population of these polygons.  The red dots tend to be above the line and in areas where the area (light polygon is large).  The model underestimates the population of these polygons.  The explanation for this might be that the richest countries have more lights on at night…they have money to pay for the energy for them.

 

The R2 = 0.3365

The Probability = <0.0001

This line is very similar to the one above.

 

 

Question 4:  Run separate regressions on the three Income Classes based on the GDP/capita color codes.  Describe and explain the difference.

 

Based on my improved data the lines all look pretty similar in slope but the intercepts increase with decreasing income.

 

The complete printout is attached

Lower Income: observations = 23

The R2  = 0.7825

The Probability = <0.0001

Medium Income: observations = 58

The R2 = 0.4444

The Probability = <0.0001

High Income: observations = 123

The R2 = 0.6197

The Probability = <0.0001

This does improve the R2 at each income level from the models above.

 

 

 

 

 

 

 

 

 

Question 5: Assume you only had the regression knowledge from the previous exercises and you had to estimate the populations of Bombay, India, Cali, Colombia, Paris, France from only their aerial extent as measured in the nighttime satellite image.

a)      What would your estimates and 95% confidence intervals be for those cities if you used the global parameters?

 

The Equation for the Ln(PopMetro) by Ln(PopArea) is

Ln(PopMetro) = 1.9176 + 0.4880 Ln(PopArea)

Lower 95% Ln(PopMetro) = 1.1395 + 0.4880 Ln(PopArea)

Upper 95% Ln(PopMetro) = 2.6958 + 0.4880 Ln(PopArea)

 

For Bombay Ln(PopArea) = 8.6201 PopMetro = 1,331,094 predicted

Lower95% = 221,875 Upper95% = 7,987,462 Actual from PopMetro = 14,482,636 OUT

For Cali Ln(PopArea) = 8.2355 PopMetro = 864,020 predicted

Lower 95% = 144,020 Upper 95% = 5,184,704 Actual from PopMetro = 1,832,321 OK

For Paris Ln(PopArea) = 9.1186 PopMetro = 2,330,649 predicted

Lower 95% = 388,487 Upper 95% =13,985,461 Actual from PopMetro = 9,591,757 OK

 

b)      What would your estimates and 95% confidence intervals be for those cities if you used the regression parameters derived after sub-setting the data to ‘rich’, ‘mid-income’ and ‘poor’?

 

The Equation for the “Poor” Ln(PopMetro) by Ln(PopArea) is

Ln(PopMetro) = 0.4190+ 0.7554 Ln(PopArea)

Lower 95% Ln(PopMetro) = - 1.0219 + 0.7554 Ln(PopArea)

Upper 95% Ln(PopMetro) = 1.8599 + 0.7554 Ln(PopArea)

For Bombay Ln(PopArea) = 8.6201 PopMetro = 8,523,609

Lower95% = 308,832 Upper95% = 235,246,000 Actual PopMetro = 14,482,636 OK

 

The Equation for the “Mid-Income” Ln(PopMetro) by Ln(PopArea) is

Ln(PopMetro) = 1.5406 + 0.5536 Ln(PopArea)

Lower 95% Ln(PopMetro) = 0.2314 + 0.5536 Ln(PopArea)

Upper 95% Ln(PopMetro) = 2.8497+ 0.5536 Ln(PopArea)

For Cali Ln(PopArea) = 8.2355  PopMetro = 1,258,267

Lower 95% = 61,741 Upper 95% = 25,637,330 Actual from PopMetro = 1,832,321 OK

 

The Equation for the “Rich” Ln(PopMetro) by Ln(PopArea) is

Ln(PopMetro) = 0.3476+ 0.6542 Ln(PopArea)

Lower 95% Ln(PopMetro) = - 0.4231 + 0.6542 Ln(PopArea)

Upper 95% Ln(PopMetro) = 1.1182+ 0.6542 Ln(PopArea)

 

For Paris Ln(PopArea) = 9.1186 PopMetro = 2,055,834

Lower 95% = 348,568 Upper 95% = 12,122,386 Actual from PopMetro = 9,591,757 OK

 

Question 6: How could you use these regression models, a nighttime image of the world, and a % urban figure for every nation of the world to estimate the total global population?

 

Do the same as this lab; create a (0,1) grid at specific light intensity and use region group in arc to group the connected areas.  Convert the grid to a polygon coverage from the light pixel groups that contains area.  Use the regression line.  Estimate each urban area and add up the total for the country.  Use the % urban as a divisor (Urban total population)/% urban to get the total population.

 

Or if your lazy like I am, I would simply add up all the light polygons in each county first, fit the total areas into a new regression equation created similarly to the lab but with total urban area instead of area for each city and get the total urban from the line then divide as above.  It’s easier to be spatially accurate when joining big polygons than little.  There would probably be less data loss and if you were looking for a country’s population it would be easier to calculate (just one result total urban).  Maybe then the GDP/capita would work better to separate the groups with.

 

U.S. Data

 

Question 7: and Question 8:

The Histogram of Population Density from uspopden grid and the Histogram from usatnight grid are NOT very enlightening as almost all the pixels are very low numbers.  Even selecting only those pixels with values over 1000 from the uspopden grid, the histogram for the data is highly skewed to low-end values.  The grid has 6455 different values so this removed the lowest 15% of the values.

 

 

The usatnight grid has fewer grid values than the uspopden grid and the one shown below is selecting the pixels with values over 20.  This grid has 195 different values so I removed the lowest 10%.

 

 

Question 9:  Comment on the likelihood that a model derived from nighttime light emission could predict population density.

The data has a similar distribution, very skewed to the low end.  Based on the fact that the usatnight grid has 195 values and the uspopden grid has 6455 values, the usatnight grid has only 3% of the sensitivity of the uspopden variation (195/6455).  It would take a 33-value count change in uspopden to make a 1 value count change in usatnight (assuming a linear proportional relationship).  It might be possible to create a model using usatnight data but the accuracy will be about 1% of the uspopden information.

 

Question 10: Correlogram of uspopden

 

         

 

Question 11:  Correlogram of usatnight

 

         

 

Question 12:  What does the Correlogram of the popdensity data suggest about the effectiveness of a perfect model it is miss-registered by one pixel?

 

The steep drop-off in the values suggests that within a miss-register of five pixels the ability of the model to provide useful estimates of the population drops to about 50% from 100%

 

Question 13: Generate two new correlograms for the ‘smoothed’ data.  How does focalmean work, and how does it change the data?

 

Focalmean: for each cell location on an input grid, finds the mean of the values within a specified neighborhood and sends it to the corresponding cell location on the output grid.  It is a form of smoothing the data by averaging the cell values around it (5X5, of 11x11 in our choices), and creating a new grid of those smoothed values.  The larger the neighborhood the smoother the data becomes.

 

          

 

         

 

The correlograms for these two new grids drop-off more slowly, the model would work better because the changes in the grid are less extreme.

 

Question 14: Perform the same Ln(area) vs. Ln(Pop) regression on the United States data as you did for the world data

a)      How is the regression equation for the U.S. different from the world-level analysis?  Describe these results and the differences from the U.S. results in the world-level analysis

b)       Which regression parameters do you think are more accurate? 

c)      Does this U.S. level study weaken or strengthen the argument that the aerial extent of a city as measured by the DMSP OLS Satellite is a good predictor of that city’s population?

 

There are many more observations for this set of data (6369 observations) than for the World data (41 observations)

 

With data at the same resolution for the U.S. data, it is insured that for each light group a comparable group could be found in the pop density grid.  The world data, because the method was attempting to convert grid to polygon and then spatially join the two sets of data, any projection variations or other line up issues limited the data available.  Many popygons didn’t have MetroPop data.  See the below regression lines.

 

      

 

I think the U.S. data is more accurate but for most of the countries of the world a grid of population density at the same resolution as the DMSP OLS Satellite data is unavailable at present.

 

It strengthens the view that a model for population based on the DMSP OLS Satellite data would be an alternative to the population data that is available in some countries.  In a way, it is a form of calibration, taking known populations and creating the linear relationship against our detector the DMSP OLS Satellite.  It should then be possible to use the detector data with the curve to find values for unknown populations of other cities.  One useful additional effort (assuming the goal is to Predict populations for cities in other countries, where the population is unknown) would be to check one or two reference city in that country where the population data is good vs. the curve results to determine if the result is within the 95% confidence interval (or even to create a correction factor if the error is a constant).  This would then provide a QC on the data being generated.  Theoretically, curves for the three GDP categories could be created from three country’s data (one for each GDP) where the population density data is available and extrapolated to the other countries where the population data is limited.  The accuracy of the predicted populations could then be verified using a very limited number of cities (with better data) in the country being studied. 

 

Question 15: Comment of the R correlation for the uspopden vs. usatnight grids and provide an explanation.

 

The correlation is 0.4919, which isn’t great!  I should point out that the uspopden11 data correlates a bit better at 0.5591.  So smoothing the data does improve the correlation.

 

Question 16: Model Building

 

I tried a few refinements to the simple model.  I used the slice command in Grid to reclassify the data in the unpopden grid since smoothing the data seemed to help above.  I used the equal interval slice and tried 200 and 100 for interval slices (reclassifying the data).  Unfortunately, none of these reclassification efforts produced the desired effect of improving the correlation between the new uspopden grid and the usatnight grid.  Most of these efforts resulted in correlations lower than the original grid produced.  Perhaps using 10 or 20 might have worked better…it would appear from the above exercise that the relationship is not a truly directly proportional one.

 

You had mentioned in class that weighting the data with the area improved the correlation and that is what is shown above.  It is pretty obvious that the number of small cities was affecting the slope and the intercept of the line.  The R2 value is 0.9504 for the weighted line.  Quite an improvement in correlation.

 

 

 

 

 

 

 

 

 

 

 

Los Angeles Data

 

Question 17: Apply your model of population density prediction from before to the Los Angeles urban cluster.

a)      What is the correlation of your model to residence based population density?

 

 Based on the Grid Correlation command in Arc the correlation R is 0.5262

Based on Jump the R2 is 0.4646

And the Probability is 0.0000

 

 

 

 

 

 

 

 

b)         What is the correlation of your model to employment based pop. density?

 

 Based on the Grid Correlation command in Arc the correlation R is 0.4428

Based on Jump the R2 is 0.3514

And the Probability is 0.0000

 

 

 

 

c)      What is the correlation of your model to the average of these two?

 

 Based on the Grid Correlation command in Arc the correlation R is 0.5704

Based on Jump the R2 is 0.5614

And the Probability is 0.0000

 

 

 

 

 

 

 

 

I have no explanation for why the R in Arc when squared isn’t equal to the R2 in Jump.

 

Question 18:  What do your results above suggest regarding your model’s ability to predict ‘ambient’ population density?

 

The R and R2 values for the Average Population are higher than either of the other records of population itself as it relates to the light data available.  Of course this means that the night light data is better at estimating the ambient population than the population density of the city or the job population density.

 

Question 19: Visualization of Error

a)      Which errors look most random?

 

See the attached maps.  The LA Population Density seems to have large areas of Red in the center indicating our estimate is too low in a lot of the center of the study area.  The Estimate of LA Job Population seems to be the one with the fewest extremes and could therefore be considered the most random.  Based on the Histograms, it is surely the most normally distributed.  The Ambient Population is what it was designed to be, the average of the two.

 

b)      Which ‘map of residuals’ has the smallest mean?

 

This is the Estimate of LA Population Density error by Std. Dev.  Notice that the mean for this data vs. the estimate is negative 329.437.  The Standard Deviation for this data is 1247.13.  Our estimate is too low.

 

 

 

 

 

This is the Estimate of LA Job Population error by Std. Dev.  The mean here is a positive 328.978.  The Standard Deviation for this data is 1179.889.  Our estimate is too high.

 

 

 

 

 

This is the Ambient LA Population error by Std. Dev.  I suppose that it should not be surprising that since we used the Average data to create the line we then created our estimated population from that the mean should be 0.  The Standard Deviation for this data set is 890.649.  By far this is the narrowest Std. Dev. Range.

 

c)      Produce correlograms for each of these residual maps.

 

        

 

        

        

 

a)      What do these correlograms suggest about these three images?

 

All three sets of the correlation correlograms for the grids using different population estimates drop off fairly quickly.  The Estimate of the Average Population drops off a bit less quickly and might be the best choice for a model as it appears to be slightly less affected by 1-unit changes in the grid.




thkeljklksdjlk

Population Estimation for the city of Lafayette-West Lafayette

Tippecanoe County, Indiana

By Janice L Vaughn

 

 

This study area for the Lafayette-West Lafayette metropolitan area includes the complete Indiana county of Tippecanoe.  West Lafayette is home to Purdue University. The Wabash River splits the cities. 

This MapQuest™ map provides an overview of the metropolitan area. The stable population and the business community reside on the east side of the Wabash while the university occupies most of West Lafayette.

 

 

 

 

 

Using the USPOPDEN grid and the USATNIGHT grid, I generated two grids for the area data, TIPPYPOP and TIPPYLIGHTS. The regression equation generated from these two grids is shown below. 

The R2 for the data is 0.3214, so the relationship shown accounts for 32% of the variability in the data.  The regression line is shown below the curve and the probability = < 0.0001.  The original data with the created population estimate grid is shown on page two with the error grid.  The model estimates the population to the southwest lower than it is.

 

An explanation might be that the students are not counted in the census data.   Shown in the classification window below are the mean, Standard Deviation, and the Histogram of the data.  There is a bit of a skew to the high end of the curve.  The map of the error of the population estimate shows very clearly that the residuals are not random spatially.  The under-estimates are to the southwest and the over-estimates are east.