Understanding Survival analysis
Survival analysis is a statistical method used in analyzing data on the time of an event such as death, device failure, heart attack, etc. The analysis is important during legal proceedings. It includes gathering and analyzing data to come up with ideal conclusions.
Research Questions
Environmental issues have been one of the major subjects of debate recently and it is thought that human and economic activities have endangered the environment through the emission of C02 and other toxic substances from production activities to our air water and land. There has been an unprecedented reduction in the ozone layer and has led to climate changes that have affected economies of the world generally and Australia specifically. Thus, we want to examine the impact of economic activities and structure on pollution. The broad research question is
“What is the effect of economic activities and structure on the emission of C02 and other toxic substances”. The specific research questions are
1. is there a significant effect of economic growth measured by GDP growth on emission
2. is there a significant effect of population on emission
3.is there a significant effect of the number of approved buildings on emission
Data Cleaning
I have four data sets which are on emissions, buildings, population growth, and economic growth rate. I start with the cleaning emissions data set. The data set is disaggregated by facilities and emission type. I will aggregate it by summing it over the years so that I can have the total for each year. There are three sources of emission in the data set (air, water, and land), I sum the three to have total emission. The year in the data set was written as 2018/2019, I create a new variable “year” which extract the lead year (2019 from 2018/2019) and aggregated over this year variable. Finally, I have yearly data from 1999 to 2019. The code and the sample of final data is shown in the chunk below
library(stringr)
##DATA CLEANING
emission<-read.csv("emissions.csv")
emission$totalemission=rowSums(emission[,c("air_total_emission_kg","water_emission_kg","land_emission_kg")],na.rm=T)
emission$year=as.numeric(str_sub(emission$report_year,6,9))
emission<-emission[,c("totalemission","year")]
emission<-aggregate(emission,by=list(emission$year),FUN=sum)
emission$year=emission$Group.1
emission=emission[,2:3]
head(emission)
## totalemission year
## 1 1163652271 1999
## 2 3241358298 2000
## 3 3249103564 2001
## 4 3952112597 2002
## 5 3908653089 2003
## 6 3968629002 2004
Next, I clean the building's data set which is made up of monthly data from 1983 to 2020. First I remove the first 9 rows that are not observations but provide information about the data and then extract the variable that I need. Then I average over years to give a yearly time series data from 1983 to 2020. The code and the sample of final data are shown in the chunk below.
buildings<-read.csv("buildings.csv",stringsAsFactors = F)
buildings<-buildings[10:450,]
buildings$dwellings=as.numeric(buildings[,3])
buildings$year=as.numeric(str_sub(buildings$X,5,8))
buildings=buildings[,c(23,24)]
buildings=na.omit(buildings)
buildings=aggregate(buildings,by=list(buildings$year),FUN=mean)
buildings=buildings[,c(2,3)]
head(buildings)
## dwellings year
## 1 9656.667 1983
## 2 10250.667 1984
## 3 10009.333 1985
## 4 8072.583 1986
## 5 8364.750 1987
## 6 11301.417 1988
Next, I clean the population growth data set which is made up of quarterly data from 1981 to 2019. First I remove the first 9 rows that are not observations but provide information about the data and then extract the variable that I need. Then I average over years to give a yearly time series data from 1983 to 2020. The code and the sample of final data is shown in the chunk below.
pop<-read.csv("population_growth.csv",stringsAsFactors=F)
pop=pop[10:163,]
pop$population=as.numeric(pop$Estimated.Resident.Population..ERP.....Australia..)
pop$year=as.numeric(str_sub(pop$X,5,8))
pop=pop[,16:17]
pop<-aggregate(pop,by=list(pop$year),FUN=mean)
pop<-pop[,2:3]
head(pop)
## population year
## 1 14988.70 1981
## 2 15208.52 1982
## 3 15415.55 1983
## 4 15604.17 1984
## 5 15816.33 1985
## 6 16048.42 1986
There is no much to clean in the economic growth rate data set. I only create a year variable to denote year of the observations.The code and the sample of final data is shown in the chunk below.
growth<-read.csv("growth.csv")
growth$year=as.numeric(str_sub(growth$Date,1,4))+1
growth=growth[,2:3]
head(growth)
## Growth.in.real.GDP.per.capita year
## 1 0.2 1961
## 2 -1.1 1962
## 3 4.2 1963
## 4 5.0 1964
## 5 3.9 1965
## 6 0.4 1966
I have all my data cleaned but the lenghts are different. I see that observations for year 1999 to 2017 are available for all variables, thus I restrict my sample to this number and combine the data set. Moreover,I took the log of total emission, population and number of approved buildings while I did not take that of growth because it contains negative values.
growth=growth[39:57,]
pop=pop[19:37,]
emission=emission[1:19,]
buildings=buildings[17:35,]
data<-data.frame(growth,pop,emission,buildings)
data<-data[,c(2,1,3,5,7)]
data$ltotalemission=log(data$totalemission)
data$ldwellings=log(data$dwellings)
data$lpopulation=log(data$population)
Data Analysis
Exploratory data analysis
##time series plot of variables
par(mfrow=c(2,2))
plot(data$year,data$Growth.in.real.GDP.per.capita,ylab="GDP growth",type="l",col=1)
plot(data$year,data$population,ylab="Population",type="l",col=2)
plot(data$year,data$totalemission,ylab="total emissions (kg)",type="l",col=3)
plot(data$year,data$dwellings,ylab="Total approved Buildings",type="l",col=4)
We see from the plot that total emissions jumped drastically from the initial period and reached the peak around 2008 then falls very slowly till date. total approved buildings have flctuated over time. Same can be seen for economic growth. However, population have been increasing throughout time.
##scatterplot
par(mfrow=c(2,2))
plot(data$Growth.in.real.GDP.per.capita,data$totalemission,xlab="GDP growth", ylab="total emission",col=1)
plot(data$population,data$totalemission,xlab="Population", ylab="total emission",col=2)
plot(data$dwellings,data$totalemission,xlab="Total approved buildings ", ylab="total emission",col=3)
The scatterplot of total emissions against the independent variables is shown above. We hardly see any pattern of relationship between total emission and the independent variables.
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
corr.mat<-rcorr(as.matrix(data[,2:5]))
corr.mat
## Growth.in.real.GDP.per.capita population
## Growth.in.real.GDP.per.capita 1.00 -0.62
## population -0.62 1.00
## totalemission -0.43 0.32
## dwellings -0.01 0.09
## totalemission dwellings
## Growth.in.real.GDP.per.capita -0.43 -0.01
## population 0.32 0.09
## totalemission 1.00 -0.15
## dwellings -0.15 1.00
##
## n= 19
##
##
## P
## Growth.in.real.GDP.per.capita population
## Growth.in.real.GDP.per.capita 0.0044
## population 0.0044
## totalemission 0.0632 0.1818
## dwellings 0.9821 0.7074
## totalemission dwellings
## Growth.in.real.GDP.per.capita 0.0632 0.9821
## population 0.1818 0.7074
## totalemission 0.5439
## dwellings 0.5439
The result shows that there is negative medium correlation between GDP growth and total emmsions (r=-0.43) and is significnat at 10% alpha level (p=0.0632). there is negative weak (r=-0.15) but insignificant (p=0.1818>0.05) between total emissions and number of approved buildings. There exist poistive medium (r=0.32) but insignificant (p=0.5439>0.05) relationship between total emissions and dwellings.
Linear and Predictive Modelling
model<-lm(ltotalemission~Growth.in.real.GDP.per.capita+lpopulation+ldwellings,data=data)
summary(model)
##
## Call:
## lm(formula = ltotalemission ~ Growth.in.real.GDP.per.capita +
## lpopulation + ldwellings, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76158 -0.10547 0.00244 0.15319 0.32270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.44494 10.78909 2.266 0.0387 *
## Growth.in.real.GDP.per.capita -0.11043 0.07472 -1.478 0.1601
## lpopulation 0.34170 0.91284 0.374 0.7134
## ldwellings -0.62367 0.70515 -0.884 0.3904
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2597 on 15 degrees of freedom
## Multiple R-squared: 0.2755, Adjusted R-squared: 0.1305
## F-statistic: 1.901 on 3 and 15 DF, p-value: 0.1729
The regression result above shows that the estimated regression model is ltotalemissions=24.44-0.11growth+0.34lpopulation-0.62ldwellings. The result shows that contrary to expectation, the number of approved buildings and growth in GDP has a negative effect on total emission. Specifically, a 1% increase in the number of approved buildings reduces emission by 0.62% reduction in emission while a 1% increase in GDP growth reduces total emission by 11.04%. Finally, a 1% increase in population increases totals emission by 0.34%. However, all these variables are not significant as all have p-values greater than 0.05. The adjusted R-squared shows that 13.05% of the variation is explained by the independent variables. ###Predictive Modelling.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(58)
rfm<-randomForest(ltotalemission~.,data=data[,c(2,6,7,8)],ntree=500,importance=T)
print(rfm)
##
## Call:
## randomForest(formula = ltotalemission ~ ., data = data[, c(2, 6, 7, 8)], ntree = 500, importance = T)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 0.07360822
## % Var explained: -0.19
The mean of squared residuals is 0.0736 with 0.19 percent of variance explained.
importance(rfm)
## %IncMSE IncNodePurity
## Growth.in.real.GDP.per.capita 0.91483892 0.4931572
## ldwellings -0.09948159 0.1740888
## lpopulation 9.85744741 0.4757064
varImpPlot(rfm)
The result above shows that the population is the most important variable explaining total emissions as it has the highest mean decrease accuracy.
Conclusion
We have examined whether economic growth and structure contribute to environmental pollution using yearly time series data from 1999 to 2017. The result of the multiple linear regression model shows that none of the considered economic variables are significant which means there is no effect of economic growth and structure on the emission of a toxic substance to the environment. The limitation of this is that we cannot rule out the possibility of omitted variable bias given that the omnibus F-test is not significant which means the model needs to be improved upon by finding variables that may be significant but have been left out. However, this provides a good starting point in investigating this important topic.