ANOVA
Question 1a - Test equality of urban/rural availability of water (ANOVA)
The GLM Procedure
Data dictionaries
SAS Data set REGRESSIONDATA
Alphabetic List of Variables and Attributes
# | Variable | Type | Len | Description | Units |
18 | AllC | Num | 8 | All Cause mortality | Deaths per 100,000 population |
19 | Commun | Num | 8 | Communicable Disease mortality | Deaths per 100,000 population |
10 | Region | Char | 4 | WHO Global Region | |
11 | Urb_Rur | Char | 5 | Urban or rural residence | |
4 | all_pcs | Num | 8 | Per capita health care spending from all sources | Purchasing power equivalent dollars per person |
1 | country | Char | 11 | Country name | |
14 | drink00 | Num | 8 | Percent of population with access to clean drinking water in year 2000 | Percent |
16 | drink05 | Num | 8 | Percent of population with access to clean drinking water in year 2005 | Percent |
12 | drink95 | Num | 8 | Percent of population with access to clean drinking water in year 1995 | Percent |
3 | gov_pcs | Num | 8 | Per capita health care spending from government sources | Purchasing power equivalent dollars per person |
5 | income_pc | Num | 8 | Per capita income | Purchasing power equivalent dollars per person |
20 | noncom | Num | 8 | Non-communicable Disease mortality | Deaths per 100,000 population |
6 | phys_num | Num | 8 | Number of Physicians | BEST32. |
8 | phys_pt | Num | 8 | Physicians per thousand population | Number of Physicians per thousand population |
7 | popn | Num | 8 | Population of Country | BEST32. |
15 | sanit00 | Num | 8 | Percent of population with access to sanitation in year 2000 | Percent |
17 | sanit05 | Num | 8 | Percent of population with access to sanitation in year 2005 | Percent |
13 | sanit95 | Num | 8 | Percent of population with access to sanitation in year 1995 | Percent |
9 | subregion | Num | 8 | Subregions within WHO regions | |
2 | year | Num | 8 | Calendar year | |
SAS Data set Watertrim
Alphabetic List of Variables and Attributes
# | Variable | Type | Len | Description | Units |
1 | Country | Char | 11 | Country name | |
4 | Region | Char | 4 | WHO Global Region | |
2 | Urb_Rur | Char | 5 | Urban/Rural | |
7 | drink00 | Num | 8 | Percent of population with access to clean drinking water in year 2000 | Percent |
9 | drink05 | Num | 8 | Percent of population with access to clean drinking water in year 2005 | Percent |
5 | drink95 | Num | 8 | Percent of population with access to clean drinking water in year 1995 | Percent |
8 | sanit00 | Num | 8 | Percent of population with access to sanitation in year 2000 | Percent |
10 | sanit05 | Num | 8 | Percent of population with access to sanitation in year 2005 | Percent |
6 | sanit95 | Num | 8 | Percent of population with access to sanitation in year 1995 | Percent |
3 | subregion | Num | 8 | WHO subregion within global region | |
Names of WHO Regions.
WHO Region Name | Abbreviation |
African Region | AFRO |
Region of the Americas | PAHO |
South-East Asia Region | SEAR |
European Region | EURO |
Eastern Mediterranean Region | EMRO |
Western Pacific Region | WPRO |
The final is due on 26 June before midnight by email. Email your finished final exam and include the SAS output and the written answers to the questions as two separate files. In the written answers, please refer to the page of the SAS output that you derived your answer from (applies to most questions, but may not be applicable to some) and indicate the location of the answer on the SAS output (highlights, arrows, boxes).
The output has titles with the question number and section.
If you have any questions, send them by email; if I can answer them without giving away too much, I will send the question and the answer to everyone.
Answer the following questions based on analysis output you have been provided.
Availability of drinking water and sanitation.
The file “Water” contains data on the percentage of the population that has access to drinking water and sanitation in rural and urban areas and the country overall (Total) in 193 countries at three times: 1995, 2000, and 2005. Countries are categorized into regions and sub-regions.
For this exercise, the percentages of the population with access to drinking water and sanitation are treated as continuous variables (even though they are actually proportions).
Analysis of Variance (ANOVA) is used to test if the availability of drinking water is different in urban and rural regions worldwide in 2005 (drink05).
What statistic was used in SAS to test for differences in drinking water availability?
What is the value of this statistic?
What is the p-Value?
What proportion of the variance in drinking water availability is explained by the model? What numbers shown on the output can be used to calculate this statistic?
A t-test is used to test if the availability of drinking water is different in urban and rural regions worldwide in 2005 (drink05).
What type of test was used in SAS to test for differences in water and sanitation availability?
What is the value of this statistic?
What is the p-Value?
What in the output suggests that these data may not be suitable for this test? Why?
ANOVA is used to test if the difference in the availability of sanitation between urban and rural areas in 2005 is different across regions of the world.
Is availability of sanitation different across regions of the world?
What type of test was used in SAS to test for differences in water and sanitation availability?
What is the value of this statistic?
What is the p-Value?
Is the difference in availability of sanitation between urban and rural areas different across regions of the world?
What is the value of the statistic used in SAS to test if urban-rural differences are different across regions?
What is the p-Value?
What proportion of the variance in sanitation availability is explained by the ANOVA model?
Association of communicable and non-communicable disease with economic and health resources.
The file “Resources” contains data for mortality rates from communicable disease, non-communicable disease, and all causes (deaths per 100,000 per year for 2008) for 193 countries. Additional variables include data on the number of physicians per thousand population, percentage of the population with access to drinking water and sanitation (from file “Water” above), per capita health care spending from government and all sources (purchasing power equivalent dollars), and per capita incomes (purchasing power equivalent dollars).
Using PROC CORR, look at the correlations between the independent variables.
Which variables are likely to be redundant (i.e., they are likely to be similarly associated with the dependent variables)?
How does this influence the starting model for regression analysis?
Identify possible sets of variables for use in regression.
Using PROC REG, determine the association between mortality rates from communicable diseases (variable commun) and independent variables for drinking water and sanitation (for year 2000, drink00 and sanit00), income and health care spending, and physicians per thousand population.
What is the final model? Write the equation for it.
Interpret the model in words. What does it say about the relationship between the independent and dependent variables?
Evaluate how well the model fits the data.
How much of the variation in communicable disease mortality is explained by the final model?
How much is explained by the variables removed from the full model?
Run the same model for non-communicable disease mortality (variable noncom). Start with the same full model and determine the final reduced model.
What is the final model? Write the equation for it.
Interpret the model in words. What does it say about the relationship between the independent and dependent variables?
Evaluate how well the data fits the assumptions for least squares regression.
The graphs show some clustering of the data. Where are these data clustered? What does this clustered data represent in the real world? Do you think this clustering affects the model?
Run the model from Part C but restrict the year to 2005.
How do the results differ?
What might account for the differences in the results of the same model run for two different years?
The variable Region cannot be included in the PROC REG models because it is categorical.
How could you determine if Region might have an influence on the rates of communicable and non-communicable disease?
Solution
Introduction to Biostatistics Final Exam Project
Answer the following questions based on analysis output you have been provided.
- Availability of drinking water and sanitation.
- Analysis of Variance (ANOVA) is used to test if the availability of drinking water is different in urban and rural regions worldwide in 2005 (drink05).
- What statistic was used in SAS to test for differences in drinking water availability?
- What is the value of this statistic?
- What is the p-Value?
- What proportion of the variance in drinking water availability is explained by the model? What numbers shown on the output can be used to calculate this statistic?
- A t-test is used to test if the availability of drinking water is different in urban and rural regions worldwide in 2005 (drink05).
- What type of test was used in SAS to test for differences in water and sanitation availability?
- What is the value of this statistic?
- What is the p-Value?
- What in the output suggests that these data may not be suitable for this test? Why?
- ANOVA is used to test if the difference in the availability of sanitation between urban and rural areas in 2005 is different across regions of the world.
- Is availability of sanitation different across regions of the world?
- What type of test was used in SAS to test for differences in water and sanitation availability?
- What is the value of this statistic?
- What is the p-Value?
- Is the difference in availability of sanitation between urban and rural areas different across regions of the world?
- What is the value of the statistic used in SAS to test if urban-rural differences are different across regions?
- What is the p-Value?
- What proportion of the variance in sanitation availability is explained by the ANOVA model?
- Association of communicable and non-communicable disease with economic and health resources.
- Using PROC CORR, look at the correlations between the independent variables.
- Which variables are likely to be redundant (i.e., they are likely to be similarly associated with the dependent variables)?
- How does this influence the starting model for regression analysis?
- Identify possible sets of variables for use in regression.
- Using PROC REG, determine the association between mortality rates from communicable diseases (variable commun) and independent variables for drinking water and sanitation (for year 2000, drink00 and sanit00), income and health care spending, and physicians per thousand population.
- What is the final model? Write the equation for it.
- Interpret the model in words. What does it say about the relationship between the independent and dependent variables?
- Evaluate how well the model fits the data.
- How much of the variation in communicable disease mortality is explained by the final model?
- How much is explained by the variables removed from the full model?
- Run the same model for non-communicable disease mortality (variable noncom). Start with the same full model and determine the final reduced model.
- What is the final model? Write the equation for it.
- Interpret the model in words. What does it say about the relationship between the independent and dependent variables?
- Evaluate how well the data fits the assumptions for least squares regression.
- The graphs show some clustering of the data. Where are these data clustered? What does this clustered data represent in the real world? Do you think this clustering affects the model?
- Run the model from Part C but restrict the year to 2005.
- How do the results differ?
- What might account for the differences in the results of the same model run for two different years?
- The variable Region cannot be included in the PROC REG models because it is categorical.
- How could you determine if Region might have an influence on the rates of communicable and non-communicable disease?
-
-
- The F-statistic
- 77.81
-
- <0.001
- 17.77%. It can be calculated from the Model and Error sums of squares: R-squared = SS(Model) / (SS(Model) + SS(Error)).
-
- A two-sample t-test (the TTEST Procedure), reported under both the equal-variance and unequal-variance assumptions
- Equal variances: t = -8.93. Unequal variances: t = -8.86
- < 0.001 in both cases
- The histograms and Q-Q plots suggest that drink05 is not normally distributed in either group. Because the t-test assumes normality within each group, its results may not be valid
-
-
- One-Way ANOVA test
- 42.84
-
- <0.001
-
- 41.03
- < 0.001
- 52.38%
-
-
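- The highly correlated groups of independent variables are gov_pcs, all_pcs and income_pc; sanit95, sanit00 and sanit05; and drink95, drink00 and drink05.
- This problem is called multicollinearity. It can inflate the variance of the coefficient estimators, so the starting model should avoid including several variables from the same highly correlated group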
-
- Possibly: gov_pcs, phys_pt, drink00 and sanit00
-
- When the percentage of the population with access to sanitation in 2000 increases by 1 percentage point, the communicable disease mortality rate decreases on average by 11.52 deaths per 100,000 population. The intercept, 1207.27, is the average communicable disease mortality rate when the percentage of the population with access to sanitation in 2000 is 0. The fitted equation implied by these estimates is commun = 1207.27 - 11.52 × sanit00
- The goodness of fit of a linear model is usually measured by the R-squared statistic. In this case, the model explains 75.34% of the variability in communicable disease mortality (the dependent variable). It is important to note, though, that the residuals appear to be distributed differently at different levels of the dependent variable, which could mean that the original variables need to be transformed to obtain a truly linear relationship between the independent and dependent variables.
- 75.34%
- 0.40%
-
- When the percentage of the population with access to sanitation in 2000 increases by 1 percentage point, holding per capita income constant, the Communicable disease mortality rate decreases on average by 2.2257 deaths per 100,000 population. When per capita income increases by 1 unit, holding access to sanitation constant, the Communicable disease mortality rate decreases on average by 0.0097 deaths per 100,000 population. The intercept, 888.2166, is the average Communicable disease mortality rate when both independent variables are 0
- The residuals in this model do not deviate much from the normality assumption, and they do not appear to be correlated with the fitted values. The independent variables percent of population with access to sanitation in year 2000 (sanit00) and per capita income (income_pc) have a moderately high correlation of 0.61, which does not strictly violate the no-multicollinearity assumption but may cause a fairly high variance of the coefficient estimators. There are a few observations with a high Cook's distance or high leverage that might be distorting the results.
- These points are clustered in the lowest range of per capita income and represent low-income countries. This could affect the model, because linear models work better when the independent variables have a high variance
-
- In the 2005 model, income_pc is no longer included but phys_pt is, and R-squared is higher.
- The behavior of the variables may have changed over time. The difference could also simply be due to the sample size
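Two of the calculations behind the answers above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sums of squares used below are hypothetical placeholders (the actual values appear on the SAS output and are not reproduced here), the function names are made up for this example, and the intercept 1207.27 and slope -11.52 are the estimates quoted in the interpretation of the communicable disease model.

```python
def r_squared(ss_model: float, ss_error: float) -> float:
    """Proportion of variance explained: SS(Model) / (SS(Model) + SS(Error))."""
    ss_total = ss_model + ss_error
    if ss_total <= 0:
        raise ValueError("total sum of squares must be positive")
    return ss_model / ss_total


def predict_commun(sanit00: float,
                   intercept: float = 1207.27,
                   slope: float = -11.52) -> float:
    """Fitted communicable disease mortality (deaths per 100,000)
    for a given percent of population with access to sanitation in 2000.
    Default coefficients are the estimates quoted in the answers."""
    return intercept + slope * sanit00


# Hypothetical sums of squares, NOT taken from the SAS output.
example_r2 = r_squared(ss_model=20.0, ss_error=80.0)
print(f"R-squared = {example_r2:.2%}")

# Fitted values at a few sanitation levels, using the quoted estimates.
for pct in (0, 50, 100):
    print(pct, round(predict_commun(pct), 2))
```

Plugging the Model and Error sums of squares from the drink05 ANOVA table into r_squared should give the 17.77% reported above. The fitted values at the extremes (0% and 100% access) are extrapolations of a straight line and should be read with the residual pattern noted in the answers in mind.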