Lab Exercise #1: By Joe Student

 Due April 21, 2003

Spatial Analysis of a survey on attitudes about Population Growth in Santa Barbara County

 

Assessing ‘Representativeness’ of the survey respondents (1-6)

 

1)       Were the home locations of the respondents to this survey spatially random?

 

No, they do not appear to be random.  To quantify the non-random distribution statistically, the X2 test compares observed survey return locations to a random expected return distribution based on percent area in each census tract.  The calculation for this would be very similar to the technique used in 2) except the LAND_KM for each tract would be used to get the % of the Total Land Area for each tract and this would be used to calculate the Expected Survey Returns.  The actual Counts for each tract would be done as in 2) and the X2 could then be calculated.
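The calculation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the four tracts, their LAND_KM values and their return counts are made up, and the helper names `expected_counts` and `chi_square` are hypothetical, not part of the lab materials.

```python
def expected_counts(weights, total_observed):
    """Spread total_observed across categories in proportion to weights."""
    total_weight = sum(weights)
    return [total_observed * w / total_weight for w in weights]


def chi_square(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data for four tracts (not the survey's real values).
land_km = [10.0, 20.0, 30.0, 40.0]   # LAND_KM per tract
returns = [40, 30, 20, 10]           # observed survey returns per tract

expected = expected_counts(land_km, sum(returns))
x2 = chi_square(returns, expected)
df = len(returns) - 1
print(f"X2 = {x2:.2f} with df = {df}")
```

Swapping the land areas for tract populations would give the expected returns for the population-based version of the same test.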

2)       Were the home locations of the respondents to this survey random with respect to population density? (You do have to do the analysis for this one)

 

To use the Chi-Square table to compare an observed distribution to a theoretical distribution, the degrees of freedom for the test are determined by df = categories – 1 = 82(tracts) – 1 = 81

For an upper-tail test where α = .05 (95% confidence level) and df = 81

The table does not list df = 81; the nearest choices are 80 or 100, so reading the .05 upper-tail column at df = 100 (harder to reject null)

Critical X2 = 124.34

The Chi-squared table for the observed distribution and the X2 calculation is attached.

The resulting X2 = 281.

Because the calculated X2 is far larger than the critical value, we reject the null hypothesis that the respondents are random with respect to population density.

       

3)       What level of spatial aggregation (tracts or block-groups) is more appropriate to answering question #2. Explain.

 

The sizes of the blocks are much smaller than the size of the tracts.

The Chi-Square test's “Rule of Thumb” for categories is:

If df > 1, no cell frequency should be < 1 (i.e. = 0) and no more than 20% of the cells should have frequencies < 5.

Using the tracts, there are tracts (i.e. categories) where no survey respondents were observed.  So even with the larger tracts, the rule of thumb is not being met.  With the smaller blocks, the failure to meet the assumptions would be even greater, so tracts are the more appropriate (though still imperfect) level of aggregation.
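A small check of the rule of thumb might look like the sketch below. The function name `meets_rule_of_thumb` and both lists of counts are hypothetical. The rule is usually stated for expected frequencies, so the function simply tests whichever frequencies it is handed.

```python
def meets_rule_of_thumb(frequencies):
    """Return True when no cell is below 1 and at most 20% of cells are below 5."""
    cells = len(frequencies)
    below_one = sum(1 for f in frequencies if f < 1)
    below_five = sum(1 for f in frequencies if f < 5)
    return below_one == 0 and below_five <= 0.20 * cells


# Hypothetical cell frequencies, not the survey's counts.
tract_counts = [12, 9, 7, 6, 15, 8, 5, 11, 4, 10]   # one of ten cells is below 5
block_counts = [3, 0, 2, 6, 1, 0, 4, 7, 2, 1]       # finer units, sparser cells

print(meets_rule_of_thumb(tract_counts))
print(meets_rule_of_thumb(block_counts))
```

Splitting the same respondents over more, smaller units can only push more cells below the thresholds, which is the point made about blocks versus tracts.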

 

4)       How would you test the following?:

a) Is the age distribution of the respondents to this survey significantly different than the age distribution of the population of Santa Barbara County?

 

Census tract data provides data for age in ranges.  Using the Adults in the county (i.e. people 18 and Over) it is possible to calculate the Expected number of survey respondents in each age range from the % of adults of that age in the county.  This gives an Expected count for each age range (as defined in the census data).  The Observed ages from the survey are in years, so they would need to be reclassified into the ranges of the census tract data.  A X2 test could then be performed comparing the Expected to the Observed.
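The reclassification step could be done along these lines. This is a sketch only: the range boundaries in `LOWER_BOUNDS` are assumed for illustration and may not match the census ranges, and the ages are invented.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical lower bounds of adult age ranges; the real census ranges may differ.
LOWER_BOUNDS = [18, 25, 35, 45, 55, 65]
LABELS = ["18-24", "25-34", "35-44", "45-54", "55-64", "65+"]


def age_range(age):
    """Map an age in years to the label of the range that contains it."""
    if age < LOWER_BOUNDS[0]:
        raise ValueError("survey respondents are assumed to be adults")
    return LABELS[bisect_right(LOWER_BOUNDS, age) - 1]


def observed_by_range(ages):
    """Count respondents per range, keeping ranges with zero respondents."""
    counts = Counter(age_range(a) for a in ages)
    return [counts.get(label, 0) for label in LABELS]


ages = [19, 24, 25, 37, 48, 48, 52, 61, 70, 83]   # made-up survey ages
print(dict(zip(LABELS, observed_by_range(ages))))
```

The resulting Observed counts line up category by category with the Expected counts derived from the census percentages.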

 

b) Is the income of the respondents to this survey significantly different than the population of Santa Barbara County?

 

Census tract data provides data for income in ranges.  The survey respondents also provided their income in ranges.  Unfortunately the ranges do not match up completely.  It would be possible to create ranges that could be used to do a X2 test but the result would be extremely generalized.  Again the Expected would be based on the % for each income range from the census tract data and the Observed would be from the survey answers.

 

c) Are the political party, ethnicity, and gender distribution of the respondents to this survey significantly different from the population of Santa Barbara County?

 

Ethnicity and Gender are provided in the Census tract data, but they are not broken down by Number of Adults and Number of Children.  It would be possible to get a distribution for each of these attributes based on the total population, but it would not be possible to restrict it to Adults, who are the target of the survey.  Expected would be based on the % of each classification in the total population (i.e. the different Ethnicities for the first and Male or Female for the second).  Observed would be from the survey answers.  A X2 test could then be performed comparing the Expected to the Observed.

 

Political party is not provided in the Census tract data, but the respondents were not sampled from the whole population of the census tracts.  They were sampled from the voter registration records.  So using the data from the voter registration records, the % of each party can be used to calculate the Expected values.  A X2 test could then be performed comparing the Expected from the Voter registration records to the Observed survey answers.

 

5)       What kinds of problems do you run into when trying to answer the questions posed in #4?

 

The data types are not the same.  For example, the Age data in the Census is in Categories (ordinal ranges) but the Survey data is interval/ratio.

The Categories are not the same.  For example, the Income and Ethnicity Categories in the Census do not match those in the Survey.

The data available for the population is incomplete (i.e. the voter registration records are only those residents that are registered) and may not accurately reflect the population (even though it does reflect the subset that was sampled from).

 

6)       Are the respondents to this survey age and income independent? Would you expect them to be?  Is this a parametric or non-parametric test?

 

1-Way ANOVA is used to compare a Nominal Independent variable (Income) that has 2 or more categories to a Dependent variable collected at the Interval level (Age).  This is a parametric test.   The specific results are on the One-way Analysis of Age By Income printout. 
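For reference, the F Ratio that a 1-Way ANOVA reports can be computed by hand as below. This is a minimal sketch with invented ages; `one_way_anova` is a hypothetical helper, not the routine the statistics package uses, and it does not look up the probability for the F Ratio.

```python
def one_way_anova(groups):
    """Return (F ratio, df between, df within) for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: values around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical ages grouped by income level (not the survey data).
ages_by_income = [
    [22, 25, 31, 38],    # lowest income level
    [41, 45, 50, 52],
    [44, 49, 53, 58],
]
f_ratio, df_b, df_w = one_way_anova(ages_by_income)
print(f"F = {f_ratio:.2f} on ({df_b}, {df_w}) df")
```

A large F Ratio means the group means differ by more than the scatter within the groups would suggest.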

 

The survey doesn’t appear to be age independent.  The mean respondent age is 48.6 years, with a Standard Error in each income category of < 2.2, so each category mean is known to within about +/- 4.4 years (2 Standard Errors) at roughly 95% confidence.  This interval describes the precision of the means rather than the spread of individual ages, but it does show that the typical person who filled out the survey and mailed it back was middle-aged.

Based on the Number of respondents in each income category, the reported incomes of the respondents appear to be roughly normally distributed (with respect to the levels provided) around level 3 ($20,000 to $50,000), with 300 respondents in this level.

Overall Middle Income and Middle Aged people responded, and I wouldn’t have expected much different.  The age range is people who have been around and wish to “contribute” and/or “pass on knowledge” and are still young enough to have energy to do it.  Based on the income level, these adults are not struggling to get by nor are they so wealthy they are on continual holiday.

 

One specific outcome of the One-way Analysis of Age by Income is that the Mean Age of the lowest income level is statistically different from the other levels, and the probability of this occurring by chance is < .0001.  For the people with income less than $10,000, the Mean Age (37.8533) differs significantly from that of ANY of the other income levels.  In the Tukey-Kramer comparisons, where positive values mark pairs of means that are significantly different, the lowest level shows a positive value against every other income level.  This may just be a case where the respondents who make less are also young people starting out who have lower paying or part-time jobs.

 

Simple Demographic Comparisons (7-11)

7)       Based on the responses to the question: ‘Abortion should remain legal as defined in “Roe v. Wade”?’; are Democrats significantly more ‘Pro-Choice’ than Republicans?

 

The specific results are on the One-way Analysis of P3_9 By PolPrty printout.

All PolPrty Means are < 3.0, where 3.0 is Neutral and 1.0 and 2.0 are Strongly Agree and Agree.  Therefore all Political Parties agreed Abortion should remain legal as defined in “Roe v. Wade”.  Based on the Tukey-Kramer comparisons, where positive values show pairs of means that are significantly different, there is a significant difference between the Mean for the Democrats and the Mean for the Republicans.

The Democrat Mean was 1.61194 and the Mean for the Republicans was 2.32781.  A 1-Way ANOVA test of P3_9 by PolPrty produces an F Ratio of 16.8315.  The Probability of getting these results by chance on this 1-Way ANOVA test was < .0001.  Since lower scores mean stronger agreement, Democrats are significantly more ‘Pro-Choice’ than Republicans.

 

8)       Along a similar vein, are Women more ‘Pro-Choice’ than Men? (According to this survey)

 

The specific results are on the One-way Analysis of P3_9 By Sex$ printout.

Both Sex$ Means are < 3.0, where 3.0 is Neutral and 1.0 and 2.0 are Strongly Agree and Agree.  Therefore both Sexes agreed Abortion should remain legal as defined in “Roe v. Wade”.  The F Ratio for this 1-Way ANOVA is 6.4824, which gives a Probability of 0.0110.  At a 95% confidence level there is therefore a significant difference between the Mean for the Males and the Mean for the Females in the answer to question P3_9.

 

9)       In a separate survey I found that Women were more ‘Pro-Choice’ than men and that Catholic women were significantly ‘More, more Pro-Choice’ than Catholic men. Is this true of the respondents to this survey?  How did you test that?  If you did find the gap between Catholic men and women significantly greater than the gap between men and women in general what would a statistician call such a phenomenon? If it were true, how would you explain it?

 

The specific results are on the One-way Analysis of P3_9 By PolPrty  (subset) printout.

For the subset of Catholic respondents, a 1-Way ANOVA test of P3_9 by PolPrty produces an F Ratio of 2.9280 and a Probability of 0.0346 of getting such an F value by chance.  The difference based on Political Party for Catholics is less extreme than the difference for the overall respondents.

 

The specific results are on the One-way Analysis of P3_9 By Sex$  (subset) printout.

For the subset of Catholic respondents, a 1-Way ANOVA test of P3_9 By Sex$ produces an F Ratio of 7.9974 and a Probability of 0.0051 of getting such an F value by chance.  The difference based on Sex for Catholics is more extreme than the difference for the overall respondents.  It appears that the statement “Women were more ‘Pro-Choice’ than men and that Catholic women were significantly ‘More, more Pro-Choice’ than Catholic men” is true of the respondents to this survey as well.  This phenomenon is called interaction and means that two variables enhance or diminish each other’s effects.  Combining the Catholic Religion and Sex therefore accentuates the difference due to Sex.  It may be related to Catholic teachings, doctrine or experiences specific to Catholics.
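One way to make the idea of interaction concrete is as a difference of differences in group means. The sketch below uses invented scores and a hypothetical helper `gender_gap`. It only computes the size of the contrast; deciding whether that contrast is significant would normally take a two-way ANOVA with a Sex by Religion interaction term, which is not shown here.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def gender_gap(men, women):
    """Mean score of men minus mean score of women (lower score = more agreement)."""
    return mean(men) - mean(women)


# Hypothetical P3_9 scores (1 = Strongly Agree, 3 = Neutral); not the survey data.
catholic_men, catholic_women = [3, 2, 3, 4], [1, 2, 1, 2]
other_men, other_women = [2, 2, 3, 1], [2, 1, 2, 2]

gap_catholic = gender_gap(catholic_men, catholic_women)
gap_other = gender_gap(other_men, other_women)

# A non-zero difference between the two gaps is what "interaction" refers to.
interaction_contrast = gap_catholic - gap_other
print(gap_catholic, gap_other, interaction_contrast)
```

If the effect of Sex were the same regardless of Religion, the two gaps would match and the contrast would be close to zero.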

 

10)    Are republicans different than non-republicans on the responses to any of the questions about immigration?

 

The specific results are on the One-way Analysis of P4B_1 through 5 By Republican printout.

Yes, all the questions yielded significant differences and probabilities of < 0.0001 that these results would have occurred by chance.

Question 1 F Ratio = 43.1681

Question 2 F Ratio = 14.7869

Question 3 F Ratio = 30.6039

Question 4 F Ratio = 39.0164

Question 5 F Ratio = 27.4545

 

The two questions yielding the highest F Ratio were:

The U.S. should deport all illegal aliens. This question had a Mean of 1.61486 for Republicans and 2.17112 for Non-Republicans.

Federal law should be changed so that citizenship is not automatically granted to children born in the U.S. of non-citizen parents.  This question had a Mean of 1.59934 for Republicans and 2.15526 for Non-Republicans.

 

11)    Is there any relationship between ‘Religiosity’ and responses to the question: ‘The earth has a finite supply of natural resources such as water, arable land, etc. which imposes a limit on the number of people which can sustainably live on it.’

 

The specific results are on the Bivariate Fit of P1_15 By RelgAct printout.

As Religious involvement (religious activity) goes from 1 to 5 (Minimal to Extensive), agreement in finite supplies of natural resources goes from Agree to Neutral.  As religious involvement increases there is less belief in the finite supply of natural resources.
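The line behind a bivariate fit is an ordinary least-squares line, which can be sketched as follows. The five data points are invented to mimic the direction of the trend described above, and `linear_fit` is a hypothetical helper rather than the package's own routine.

```python
def linear_fit(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical responses: RelgAct (1 = Minimal ... 5 = Extensive) against
# P1_15 (lower = stronger agreement). Not the survey data.
relg_act = [1, 2, 3, 4, 5]
p1_15 = [1.8, 2.1, 2.3, 2.8, 3.0]

slope, intercept = linear_fit(relg_act, p1_15)
print(f"P1_15 = {intercept:.2f} + {slope:.2f} * RelgAct")
```

A positive slope here corresponds to responses drifting from Agree toward Neutral as religious activity rises.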

 

Factor Analysis (12-14)

Factor analysis is a data reduction technique that allows you to ‘compress’ your analysis.  It is a means of ‘capturing’ the co-variance between questions and ‘reducing’ a many-question survey to a few factors.

12)    For Factors 1-5 list the questions with a factor contribution score of 0.40 or more and study the questions that contributed to each factor. As a result of this study provide a name for each of the first five factors.

 

Sorting the data in descending order for each Factor yields this list of questions with factor contribution scores of 0.40 or more (in absolute value).  Below each list I give the text of the questions scoring above .5000, or the top 6.  The Factor name is at the top.
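The sort and cut-off can be sketched as below, assuming the scores for one factor are held in a dictionary keyed by question. Three of the scores are taken from the Factor 1 list in this report, while "P9_99" is a made-up question added only to show a score being dropped; `top_loadings` is a hypothetical helper.

```python
def top_loadings(loadings, threshold=0.40):
    """Return (question, loading) pairs with |loading| >= threshold, largest first."""
    kept = [(q, v) for q, v in loadings.items() if abs(v) >= threshold]
    return sorted(kept, key=lambda pair: abs(pair[1]), reverse=True)


# A few Factor 1 scores from this report, plus one made-up low score
# ("P9_99" is hypothetical) to show the cut-off working.
factor_1 = {
    "P3_10": -0.47772,
    "P2_10": 0.828251,
    "P9_99": 0.12,
    "P3_1": 0.479687,
    "P2_11": 0.818446,
}

for question, loading in top_loadings(factor_1):
    print(f"{question:6s} {loading: .6f}")
```

Sorting on the absolute value keeps strongly negative scores, such as the one for P3_10, in their place in the ranking.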

 

Name Factor 1: Government Intervention or “Laws will Fix It”

Question    Factor 1
P2_10       0.828251
P2_11       0.818446
P2_12       0.769
P2_9        0.72183
P3_5        0.695323
P2_14       0.665964
P3_7        0.631293
P1_16       0.554414
P3_4        0.540022
P3_13       0.512226
P3_1        0.479687
P3_10       -0.47772
P1_13       0.446905
P3_9        0.433724
P1_5        0.426083

P2_10 Imposing restrictions on CFC emissions that cause depletion of ozone in the stratosphere was a necessary and appropriate Government action.

P2_11 The potential consequences of global warming justify the spending of money to reduce the emission of greenhouse gases (CO2 & CH4).

P2_12 To protect the environment for future generations, present economic and behavioral sacrifices are justified.

P2_9 Human activities are the major cause of environmental degradation.  Governments of the world must formulate policy to minimize the degradation.

P3_5 The Govt. should insure that various types of contraceptives are available at affordable prices for all members of our society.

P2_14 Efforts, including funding, should be made to enhance the opportunity for women, worldwide to achieve improved educational, economic and political status.

P3_7 To reduce teen pregnancy, sex education should be mandatory in the schools.

P1_16 Policies regarding environmental degradation must also address the high per capita levels of resource consumption that are common in the industrialized nations such as the U.S.

P3_4 Govt. sponsored educational programs can be an effective means to achieve reduction of family size by voluntary cooperation.

P3_13 The U.S. should lead the way in addressing global population control because it is one of the few nations wealthy enough to provide any significant funding.

 

Name Factor 2: The Indigent Tax Burden or “Why Benefit the Undeserving Poor”

Question    Factor 2
P4B_1       0.822995
P4B_2       0.786595
P4B_4       0.76423
P4B_5       0.71261
P4B_3       0.625665
P3_2        0.539417

P4B_1 The U.S. should deport all illegal aliens.

P4B_2 The U.S. should issue a counterfeit-proof National Identification card so that only U.S. citizens receive benefits that are restricted to U.S. citizens only.

P4B_4 Federal law should be changed so that citizenship is not automatically granted to children born in the U.S. of non-citizen parents.

P4B_5 The U.S. should tighten up border security

P4B_3 Immigration policies, laws, and law enforcement are federal responsibilities; individual States should be reimbursed for costs resulting from lack of enforcement of these laws by the federal govt.

P3_2 Welfare support to unwed mothers acts as an incentive to produce more children.

Name Factor 3: People are Good or “Keep Government out of the Bedroom”

Question    Factor 3
P1_4        0.693664
P1_14       0.688308
P3_12       0.559302
P2_1        0.554813
P3_9        -0.49573
P2_6        0.494416
P2_5        0.476661
P2_7        0.458089
P3_10       0.457654
P1_2        0.439916
P1_10       -0.42691

P1_4 Population growth is good because it increases the supply of our most valuable resource: People.

P1_14 A growing population is necessary for a growing economy.

P3_12 Countries that allow or condone abortion should be denied any kind of foreign aid.

P2_1 Attempts at curbing population growth are usually the racist schemes of the people in power.

P3_9 Abortion should remain legal as defined in Roe vs. Wade.

P2_6 Human ingenuity has provided improved agricultural yields, better energy utilization and other technological innovations.  This ingenuity can be counted upon to avert the need for population control.

 

 

 

 

 

Name Factor 4: Population Resource Degradation or “People Claustrophobia”

Question    Factor 4
P1_6        0.743261
P1_7        0.735964
P1_8        0.71135
P1_13       0.64239
P1_3        0.638499
P1_9        0.629217
P1_5        0.567222
P1_15       0.525338
P1_10       0.494813
P1_11       0.494605
P1_12       0.482948
P2_6        -0.46944
P2_3        0.430306
P2_2        0.406022
P1_1        0.401505

P1_6 The Growing population causes increasing traffic congestion.

P1_7 Population growth increases competition for natural resources such as land, oil, and water.

P1_8 International violence is aggravated by issues such as immigration and competition for natural resources that are directly related to the growing human population.

P1_13 Increasing human population threatens the diversity and survival of many plant & animal species.

P1_3 Population growth is a cause of increased pollution.

P1_9 The growing population contributes to inter-racial conflict.

P1_5 Population growth is a cause of deforestation in the U.S. and worldwide.

P1_15 The earth has finite limits of land, air, and water, which impose a ceiling on the number of people that can live on it.

 

 

 

 

 

 

 

Name Factor 5: Limiting Reproduction or “Policies for those who Won’t Help Themselves”

Question    Factor 5
P3_6        0.669543
P2_2        0.667867
P2_3        0.627428
P3_3        0.604894
P3_8        0.545589
P3_14       0.538953
P3_11       0.501134
P1_10       0.474396
P2_4        -0.46921
P1_11       0.448927
P3_13       0.407074

P3_6 The govt. should provide economic incentives for seekers of public assistance to be temporarily or permanently sterilized.

P2_2 The U.S. should have an explicit and well-publicized National Population Policy.

P2_3 The U.S. should have an explicit and well-publicized International Population Policy.

P3_3 Incentive strategies such as tax laws favoring small families and penalizing large families are appropriate actions for govt. to use.

P3_8 As a condition of public assistance, child abusers and drug addicts must accept implanting a contraceptive such as NORPLANT.

P3_14 Coercive population control policies such as China’s are justified because they are in the best interest of the Chinese despite the fact that they do limit individual rights.

P3_11 The U.S. Tax laws should limit deductions for dependent children to a maximum of two.

 

 

13)    Do all the statistical tests necessary to fill out the table below. Put an asterisk (*) in the cells that indicate any significant differences on factor scores between demographic variables. For each asterisk provide a detailed description of the nature of the significant differences and some guess as to an explanation for the differences. Your ‘guess’ is referred to as ‘theory’ in academia. If you are really fired up about this exercise find references to support your theory.

 

 

Significant Factor Score Differences (*)

Factor      Sex        Pol. Party   Religion   Religiosity   Income     Education   Race/Ethnicity
Factor 1    * 0.0029   * <0.0001    * 0.0020   -             -          -           -
Factor 2    -          * <0.0001    -          * 0.0004      -          * 0.0491    -
Factor 3    -          * 0.0012     * 0.0019   * <0.0001     -          * 0.0004    * 0.0167
Factor 4    -          -            -          -             * 0.0394   -           -
Factor 5    * 0.0448   -            * 0.0421   * 0.0001      -          -           -

 

For all the Factor Means, the Negative values are associated with Strongly Agree and the Positive values are associated with Neutral.

 

Factor 1 “Laws will Fix It”

As shown above, 3 demographic variables show up with differences where the probability of this occurring by chance is less than 5%.  These are Sex, Political Party, and Religion. 

 

Demographic Variable   Group Name                     Factor Mean for Group
Political Party        Republican                     0.39594
Sex                    Male                           0.12927
Religion               The assortment of Christians   0.00329, 0.18578, -0.00571, 0.09941
Sex                    Female                         -0.21434
Political Party        Democrat                       -0.48187
Religion               Jewish, Agnostic, Atheist      -0.52614, -0.59407, -0.35203

 

From this ranking of the Factor means it is clear that within Political Party the Factor Means have the widest gap.  That Jewish people and Democrats come out very similarly is interesting.  I have heard it said that Jewish people tend to be Democrats…I don’t know if that is true but for this factor they have similar results.  These almost are questions defining the differences in Political Party doctrines.  It would appear from this that Jewish and Democrat people believe laws will fix things and Republicans aren’t so sure.

 

 

Factor 2  “Why Benefit the Undeserving Poor”

        As shown above, 3 demographic variables show up with differences where the probability of this occurring by chance is less than 5%.   These are Polprty, Religiosity and Education.

 

Demographic Variable   Group Name                 Factor Mean for Group
Religious Activity     Extensive                  0.82405
Political Party        Other & Democrat           0.45025, 0.29172
College                Masters Degree             0.32600
Political Party        Republican & Independent   -0.23491, -0.13994
Religious Activity     Average                    -0.39607
College                No College                 -0.44929

       

        From this ranking of the Factor means, it is clear that within Religious Activity the Factor Means have the widest gap.  “Religious Do-Gooders” and “Bleeding Heart Liberals” have the highest Mean values while Average Church-goers and the Non-College Educated have the lowest.  “Religious Do-Gooders” and “Bleeding Heart Liberals” want everybody to be helped/aided.  I guess “Bleeding Heart Liberalism” isn’t taught in the High Schools and the Average Church-goers gave at the church.

 

Factor 3 “Keep Government out of the Bedroom”

        As shown above 5 different demographic variables show up with differences where the probability of this occurring by chance is less than 5%.  These are Polprty, Religion, Religiosity, Education and Race/Ethnicity.

 

Demographic Variable   Group Name                             Factor Mean for Group
College                PhD, MD, JD                            0.48313  (20)
Religion               Non-Denominational, Jewish, Agnostic   0.30418, 0.42428, 0.55411  (14, 25, 13)
Polparty               Independent, Other                     0.34933, 0.46708  (38, 12)
Religious Activity     Extensive                              0.23370  (143)
Race                   White                                  0.0540  (266)
Polprty                Democrat, Republican                   0.14169, -0.20776  (111, 141)
Religion               Other, Christian                       -0.40483, -0.28825  (37, 53)
College                No College                             -0.71959  (24)
Religious Activity     Minimal                                -0.74057  (15)
Race                   Chicano/Mexican, Latino/Hispanic       -0.9285, -0.3597  (9, 11)

 

The highest Means for this Factor were among the Highly Educated, Agnostic and Other Political Parties.  They are most likely to be Neutral.  The lowest Means were among the No College, No Church, Chicano or Latino Races.  They are most likely to strongly agree.  Notice that for both extremes these Means are only a few people in number but they answered the questions correlated with Factor 3 very differently than the majority in the demographic.   It is possible that if the sample size for these groups had been larger, the Means for these groups would not have been so extreme.  It may also mean that of the minority groups for each demographic the people with the most extreme views chose to respond.

 

 

Factor 4 “People Claustrophobia”

        As shown above, only 1 demographic variable shows up with differences where the probability of this occurring by chance is less than 5%.  This variable was Income.

 

Demographic Variable   Group Name            Factor Mean for Group
Income                 $50,000 to $100,000   0.32273
Income                 All Other Groups      -0.05538, -0.16188, -0.11549, -0.04135

 

        The odd Income level is $50,000 to $100,000.  This group is more likely to be neutral about the environmental concerns.  Maybe they just bought a place in the country and now “Own” their environment.

 

 

Factor 5 “Policies for those who Won’t Help Themselves”

        As shown above, 3 demographic variables show up with differences where the probability of this occurring by chance is less than 5%.  These are Sex, Religion and Religiosity.

 

Demographic Variable   Group Name         Factor Mean for Group
Religious Activity     Minimal, Some      0.52066, 0.49975
Religion               Christian          0.32034
Sex                    Female             0.14458
Sex                    Male               -0.08722
Religious Activity     Extensive          -0.19910
Religion               Jewish, Agnostic   -0.34476, -0.44550

 

        The positive extreme Means on this Factor are found in the Minimal and Some Religious Activity groups and in the people defining themselves as Christian.  The negative extreme Means are among the Agnostic and Jewish religions and the Extensive Religious Activity group.  The negative Means may be related to these groups being willing to take reproductive rights away from others and having more concern for the good of the group/society as a whole.  On the other hand, those that are inactive Christians may be more concerned about reproductive “due process” or simply not want government to decide these things.

 

14)    Did filling out the table and answering the questions of #13 make you appreciate factor analysis?

 

Yes, Factor Analysis and Principal Components are useful for:

 

1)       Pattern Identification,

a.        It will find Questions that are related to each other.

b.        It will identify separate Independent variables that are unrelated to each other

2)       Data Reduction

a.        It reduced a large data set to more manageable proportions

b.        Components capture most of the information in a smaller set of variables

c.        Lowers the number of individual statistical tests that need to be analysed

3)       Data Transformation

a.        It changes the data to meet the requirements of independence in variables

b.        Provides scores to use the new variables

c.        No collinearity

4)       Selection of Surrogate Questions

a.        A shorter survey could be composed from the questions with the highest loading on the Factors

b.        These would be questions that are the most independent and provide the most information

5)       Evaluate the original Survey’s Structure based on the below questions

a.        Were there a lot of questions asking the same thing?

b.        Were all the correlated questions grouped together?

       

Spatial Analysis: Where’s the Geography?

15) Test for any significant differences/variation for all of the factor scores (1-5) and the population density and percent non-white of the respondents’ home locations. If you find any significant differences, provide an explanation.

 

1)       Join geocoding with survey data.  What I did here in ArcGIS:

a)       Join surveylocneg483.shp to sbcoblockgroupdemog.shp using a spatial join to create a new Join.shp file with the points having all the data from the block they were in (join polygon to points)

b)        The clean JMP file with factors was exported to Excel, where I set up the Intersection field, saved it as a .csv (comma delimited), renamed it .txt (ArcGIS likes those), renamed any field names ArcGIS didn’t like and then used the table join to join this table to the Join.shp by the Intersection fields from each table

c)        Exporting the data creates a new Export.shp with all three combined

Hope this is kind of what you had in mind

 

2)       Analyze data in JMP with respect to spatial location

a)       Create new field from block data PopDens (Persons/Land_KM)

b)       Create new field from block data Non-White (Black+Amind+Asian+O_Ethnic+Hispanic)

c)       Create new field from block data White/Non_White

d)       Analyze the Factors with Line fit

Hope this is what you had in mind

 

Sorry, but neither population density nor percent non-white appeared to me to be statistically significant with regard to any of the Factors.  Possibly the above procedure was in error.
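Steps 2a to 2c above amount to deriving a few fields from each block record, roughly as in this sketch. The record values are invented, `add_derived_fields` and the `PctNonWhite` field name are hypothetical, and the percentage assumes the ethnic counts are mutually exclusive parts of Persons.

```python
NON_WHITE_FIELDS = ("Black", "Amind", "Asian", "O_Ethnic", "Hispanic")


def add_derived_fields(block):
    """Return a copy of a block record with PopDens, NonWhite and PctNonWhite added."""
    non_white = sum(block[name] for name in NON_WHITE_FIELDS)
    derived = dict(block)
    derived["PopDens"] = block["Persons"] / block["Land_KM"]     # persons per sq km
    derived["NonWhite"] = non_white
    derived["PctNonWhite"] = 100.0 * non_white / block["Persons"]
    return derived


# Hypothetical block record; the numbers are not from the census file.
block = {
    "Persons": 2000, "Land_KM": 4.0, "White": 1200,
    "Black": 100, "Amind": 20, "Asian": 180, "O_Ethnic": 100, "Hispanic": 400,
}
result = add_derived_fields(block)
print(result["PopDens"], result["NonWhite"], result["PctNonWhite"])
```

Each factor score can then be fitted against PopDens and PctNonWhite in turn.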

 

15)    Another test you could do is to test for increased variance in response based on a geographic attribute.  What kind of statistical test would you use to look for that and if it proved significant, what would the explanation be?

 

An F-test works well to compare differences in variance.

F = Larger variance over Smaller variance

In the F table, read the df for the smaller variance (df = N – 1) down the side and the df for the larger variance (df = N – 1) across the top.

If the difference in variance was significant it would mean that something other than % Hispanic is affecting their answers to the questions.
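The ratio and the two df values can be computed as in this sketch. The two samples are invented factor scores for two hypothetical groups of respondents, and `variance_ratio` is a hypothetical helper; the resulting F still has to be compared with the table value.

```python
from statistics import variance


def variance_ratio(sample_a, sample_b):
    """Return (F, df numerator, df denominator) with the larger variance on top."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1


# Hypothetical factor scores for respondents in two kinds of block group.
high_density = [-1.2, 0.9, 1.4, -0.8, 0.3, -1.5]
low_density = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2]

f_ratio, df_top, df_side = variance_ratio(high_density, low_density)
print(f"F = {f_ratio:.2f}; read df {df_top} across the top and df {df_side} down the side")
```

Putting the larger variance on top guarantees F is at least 1, so only the upper tail of the table is needed.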

 

General Questions

16)   Describe 10 specific problems related to this little research project. Things to consider: Sampling frame was registered voters whereas census data was total population, Non-response Bias, etc.

       

 

Sampling issues:

A) Registered Voters are people who have taken the time to actually register to vote.  They may have characteristics that are different from the overall population.  They may tend to be more stable, owners of homes, older.  Sampling from Registered Voters might be considered a form of “Opportunity Sample”.  Sort of grabbing from a population that is easy to grab from but may not represent the population as a whole.

       

        B) Since the respondents “responded” they had to have some motivation because action was required on their part.  This would be a form of “Response Bias” resulting in people who were highly motivated to answer the survey’s questions.  This was after all a very long survey.

 

        C) The advanced vocabulary used in the questions probably increased “Response Bias” toward the more highly educated.  Words like formulate, degradation, stratosphere, etc. might have meant some people didn’t even understand the questions being asked.  Less highbrow wording might have led to increased response by some groups.

 

        D) Because the sample wasn’t “Stratified” (i.e. the population divided into homogeneous groups and randomly sampled from each group), some groups are represented by very few individuals (think Race or Age here).  It is hard to draw meaningful statistical results from such small sample sizes.

 

        E) One question that is lacking, or maybe the answer is implied somewhere, that I would have liked to see included is whether the person is a Native Citizen or a Naturalized Citizen.

 

Spatial issues:

        F) To correlate this data with spatial tract or block data it needs to be located in space.  The methodology for collection of this information was not 100% successful.  The Cross Street data could only be geocoded 2/3 of the time into an actual location.  Perhaps voter precincts would have been a better way to get this information; at least it would have been information you could have provided to the responder yourself, so it would have been “correct” to the precinct level.

 

        G) Even with the geocoded data the spatial join to the tract and block polygon data was difficult.  Many of the points were on the line between the tracts (about 1/5), which also meant there were problems with the blocks as well.  The default is to associate the point with the lower number tract or block.  Points on the line between a small tract and a large tract (in area) frequently were associated with the small tract, which may or may not be correct.  This would tend to bias the spatial distribution even further in the direction of high population density blocks.

 

        H) The data is spatially biased but even the cause is unclear due to the above 2 problems.

 

Data Comparison issues:

        I) Lack of standardized categories between the Census Data and the Survey Data.  I am thinking here of the Ethnicity/Race and Income categories specifically.  Especially in the $20,000 to $50,000 and the $50,000 to $100,000 Income categories on the survey, much information may have been lost due to the wide ranges.

 

        J) The Factors only captured about half the variability in the survey.  There would still be quite a few things that would have to be learned only from 1 specific question.

 

17) Are these problems significant enough to invalidate any or all of the findings from an analysis of this data?

 

For specific groups there is enough data to evaluate their opinions (Whites, 45-55 in age, Christian religions, Male & Female), but for any other Racial, Age, or Less Common Religious groups I doubt that the survey would accurately reflect their views and the statistical differences of the population.