Wisconsin COVID-19 Data By Census Tract Boundary
Variable Name | Variable Description |
GEOID | Geographic ID |
State | State |
CENSUS_TRACT | Census Tract Number |
COUNTY | County Name |
DATE | Last Date of Report |
POSITIVE | Number of Positive on COVID19 Test |
NEGATIVE | Number of Negative on COVID19 Test |
DEATHS | Number of Deaths by COVID19 |
HOSP_YES | Number of Hospitalized by COVID19 |
HOSP_NO | Number of Not Hospitalized by COVID19 |
HOSP_UNKNOWN | Unknown Number of Hospitalized by COVID19 |
AREA_LAND | Land Area Size |
AREA_WATER | Water Area Size |
POPULATION | Total Population |
POP_LT18 | Percent of Population that is Less Than 18 Years |
POP_65P | Percent of Population that is 65 Years and Over |
HOUS_NO_VEH | Percent of households with no vehicle available |
ADULT_LIMITED_ENGLISH | Percent of adults 18 years and over who have limited English ability |
ADULT_SPANISH_LENG | Percent of adults 18 years and over who speak Spanish and have limited English ability |
POP_BELOWPOV | Percent of Population whose income in the past 12 months is below poverty level |
POP_DISABILITY | Percent of Population with a Disability |
POP_MEDICAD | Percent of Population with Medicaid/Means-Tested Public Coverage |
POP_MEDICARE | Percent of Population with Medicare Coverage |
POP_HEALTHINS | Percent of Population with No Health Insurance Coverage |
HOUS_NOSMARTPHN | Percent of Households that Have No Smartphone |
HOUS_NOINTERNET | Percent of Households with No Internet Access |
SAS Code
Here is the SAS code to load the data into your SAS program.
filenamewebdaturl "http://bigblue.depaul.edu/jlee141/econdata/eco520/COVID19_WI.csv" ;
proc import datafile=webdat out = COVID19 DBMS = csv replace ; run ;
run ;
/*Select 500 randomly selected census tracts in WI using YourDePaulID */
proc surveyselect data= COVID19 method=srs seed= YourID
N=500 out= MYCOVID19 ;
run;
proc contents data=MYCOVID19 ;
run ;
Use SAS code to answer the following questions using MYCOVID19 data
Descriptive Analytics Questions.
Use descriptive statistics and plots to answer the following questions.
(Confirmed cases means the test results show positive from the COVID19 test.)
1) Find the average number of confirmed cases, hospitalized, and deaths per 1000 persons by Census tract.
2) Find the average number of confirmed cases, hospitalized, and deaths per 1000 persons by County.
3) Find the five highest Census tracts and the five highest counties in terms of the number of confirmed cases.
4) The five Census tracts that have the highest probability to be hospitalized among the confirmed cases.
5) Find the correlation coefficients relate to the number of confirmed cases with demographic, social status, and health care service variables. Find any nonlinear relationship using scatter plots.
6) Suppose you are working as a consultant for the state of Wisconsin and make some recommendation. According to the descriptive statistics, where are the most venerable areas that the gov’t agent need to spend their resources? Explain clearly.
Hierarchical and Non-hierarchical Clustering Analysis
1) Hierarchical Clustering Analysis for the Demographic clusters: POP_65P, POP_BELOWPOV, POP_DISABILITY
2) Non-Hierarchical Clustering Analysis for the Demographic clusters: POP_65P, POP_BELOWPOV, POP_DISABILITY
3) Compare the results 1) and 2), and define or name the clusters using descriptive statistics and plots.
4) Test if the clusters are significant to the number of confirmed cases per 1000 persons using the ANOVA test
5) Test if the clusters are significant to the number of hospitalized per 1000 persons using the ANOVA test.
Predictive Analytics using Regression Model
Use only TRAIN data (80%) to estimate the models and use the TEST data (20%) to perform the out-of-sample prediction
Suppose you are trying to predict the number of COVID19 confirmed cases using the demographic variables from Census data. (none of the variables related to COVID19 can be used to predict the COVID19 related dependent variables).
1) Find the regression models to explain the variation of the number of confirmed cases per 1000 persons. You can use any variables such as nonlinear variables and cluster variables to make the best models.
Model 1: Your own choices of variables 1
Model 2: Your own choices of variables 2
Model 3: Stepwise
Model 4: adjusted R square
2) Perform the out of sample prediction using the observation that were not used in the estimation. Find the following statistics and compare the results. Which model is the best performing model in terms of the following statistics?
a. MSE (mean square error)
b. RMSE (root mean square error)
c. MPE (mean percentage error)
d. MAE (mean absolute error)
3) Do the regression analysis for the number of hospitalized per 1000 persons as the dependent variable following the step 1) and 2). Explain what you find.
Solution
Use SAS code to answer the following questions using MYCOVID19 data
4. Descriptive Analytics Questions.
Use descriptive statistics and plots to answer the following questions. ( 9 points).
(Confirmed cases means the test results show positive from the COVID19 test.)
7) Find the average number of confirmed cases, hospitalized, and deaths per 1000 persons by Census tract.
procmeansdata=mycovid19;
classcensus_tract;
var positive hosp_yes deaths;
outputout = _1mean(where=(_stat_='MEAN'));
run;
procprintdata=_1mean;
varcensus_tract positive hosp_yes deaths;
run;
8) Find the average number of confirmed cases, hospitalized, and deaths per 1000 persons by County.
procmeansdata=mycovid19 nway;
class county;
var positive HOSP_YES deaths;
outputout=_2mean(where=(_stat_='MEAN'));
run;
procprintdata=_2mean;
var county positive hosp_yes deaths;
run;
9) Find the five highest Census tracts and the five highest counties in terms of the number of confirmed cases.
procmeansdata=mycovid19 nway;
classcensus_tract;
var positive;
outputout = _3max(where=(_stat_='MAX'));
run;
procsortdata=_3max(obs=5);
bydescending positive;
run;
procmeansdata=mycovid19 nway;
class county;
var positive;
outputout = _3max1(where=(_stat_='MAX'));
run;
procsortdata=_3max1(obs=5);
bydescending positive;
10) The five Census tracts that have the highest probability to be hospitalized among the confirmed cases.
data mycovid19_1;
set mycovid19;
prob=hosp_yes/positive;
run;
procmeansdata=mycovid19_1 nway;
class county;
varprob;
outputout = _4max(where=(_stat_='MAX'));
run;
procsortdata=_4max(obs=5);
bydescendingprob;
run;
11) Find the correlation coefficients relate to the number of confirmed cases with demographic, social status, and health care service variables. Find any nonlinear relationship using scatter plots.
proccorrdata=mycovid19;
var positive population HOUS_NO_VEH ADULT_LIMITED_ENGLISH POP_BELOWPOV POP_MEDICAD POP_HEALTHINS;
run;
The correlation between confirmed cases and:
i. Population is 0.8604.
ii. Percent of households with no vehicle available is -0.187.
iii. Percent of Population whose income in the past 12 months is below poverty level is -0.2111.
iv. Percent of Population with Medicaid/Means-Tested Public Coverage is -0.1946
Scatter plots
i. There is a linear relationship between confirmed cases and Population
procsgplotdata=mycovid19;
scattery = positive x = population;
run;
ii. There is a linear relationship between confirmed cases and Percent of households with no vehicle available.
procsgplotdata=mycovid19;
scattery = positive x = HOUS_NO_VEH;
run;
iii. There is a linear relationship between confirmed cases and Percent of Population whose income in the past 12 months is below poverty level.
procsgplotdata=mycovid19;
scattery = positive x = POP_BELOWPOV;
run;
iv. There is a linear relationship between confirmed cases and Percent of Population with Medicaid/Means-Tested Public Coverage.
procsgplotdata=mycovid19;
scattery = positive x = POP_MEDICAD;
run;
12) Suppose you are working as a consultant for the state of Wisconsin and make some recommendation. According to the descriptive statistics, where are the most venerable areas that the gov’t agent need to spend their resources? Explain clearly.
Based on the COVID-19 data collected on the state of Wisconsin through the DHS, there seems to be a relationship between population size and number of confirmed cases. My suggestion is therefore, to closely monitor the areas with a large population, to make sure the occupants are adhering to the covid-19 protocols set by WHO. The census tracts that have the highest confirmed cases are 1, 2, 2.02, 2.02 and 2.05; and the counties with the highest confirmed cases are: Brown, Barron, Bayfiel, Adams and Ashland. These sites and areas should be closely monitored, and strict guidelines put in place to ensure no further spread.
5. Hierarchical and Non-hierarchical Clustering Analysis on ( 7 points)
6) Hierarchical Clustering Analysis for the Demographic clusters: POP_65P, POP_BELOWPOV, POP_DISABILITY
procclusterdata=mycovid19 method=ward;
var POP_65P POP_BELOWPOV POP_DISABILITY;
run;
7) Non-Hierarchical Clustering Analysis for the Demographic clusters: POP_65P, POP_BELOWPOV, POP_DISABILITY
procfastclusdata=mycovid19 maxclusters=10maxiter=100converge=0
mean=mean out=prelim;
var POP_65P POP_BELOWPOV POP_DISABILITY;
run;
8) Compare the results 1) and 2), and define or name the clusters using descriptive statistics and plots.
9) Test if the clusters are significant to the number of confirmed cases per 1000 persons using the ANOVA test
10) Test if the clusters are significant to the number of hospitalized per 1000 persons using the ANOVA test.
6. Predictive Analytics using Regression Model (9 points)
Use only TRAIN data (80%) to estimate the models and use the TEST data (20%) to perform the out-of-sample prediction
Suppose you are trying to predict the number of COVID19 confirmed cases using the demographic variables from Census data. (none of the variables related to COVID19 can be used to predict the COVID19 related dependent variables).