Zillow Model Code

I need to continue with the Zillow data sets you helped me cleanse last week.  I can send the links to the original data files again if needed.  For the next assignment I need to:

  1. Create the distribution of outcomes (logerror), i.e. the difference log(Zestimate) - log(Saleprice), in a histogram.
  2. Create correlation matrices with logerror and absolute logerror.
  3. Build four (4) different time series models and explain why you selected each modeling approach. The model output needs to match the Sample Submission CSV format to submit to Kaggle for grading. Our professor has created a couple of sample models for review in the Sample Zillow Model Code file.
  4. If you could provide directions on how to get the model output from R to a CSV file, it would also be very much appreciated.
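For item 4, a minimal sketch of getting predictions into the sample-submission CSV layout (the parcel IDs and values below are made-up placeholders; swap in your own model output):

```r
# Assemble predictions in the Kaggle sample-submission layout:
# parcelid plus one column per forecast month (YYYYMM).
preds <- data.frame(
  parcelid = c(10754147, 10759547),   # placeholder parcel IDs
  "201610" = c(0.01, -0.02),
  "201611" = c(0.01, -0.02),
  "201612" = c(0.01, -0.02),
  "201710" = c(0.00, -0.01),
  "201711" = c(0.00, -0.01),
  "201712" = c(0.00, -0.01),
  check.names = FALSE                 # keep the numeric column names as-is
)

# write.csv adds row names by default; suppress them for Kaggle.
write.csv(preds, "my_submission.csv", row.names = FALSE)
```

The `row.names = FALSE` and `check.names = FALSE` arguments matter: without them R adds a row-number column and mangles the `201610`-style headers to `X201610`.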

Sample Zillow Model Code

Sample 1

NOTE: the models I am providing just show you how to “roll your own” models! I am really only demonstrating how to merge datasets and begin the building of simple models. From these models here, you can build all sorts of time series / external regression models that will help forecast the logerror for the next periods. Fundamentally, this is simple linear modeling with time series components.

train16=read.csv("C:/Users/lfult/Desktop/Zillow/train_2016_v2.csv") #read 2016 training data
train17=read.csv("C:/Users/lfult/Desktop/Zillow/train_2017.csv") #read 2017 training data
prop16=read.csv("C:/Users/lfult/Desktop/Zillow/properties_2016.csv") #read 2016 property data
prop17=read.csv("C:/Users/lfult/Desktop/Zillow/properties_2017.csv") #read 2017 property data

#############Ignoring EDA##################

myprop16=merge(train16, prop16, by="parcelid") ##merge 2016 data
myprop17=merge(train17, prop17, by="parcelid") ##merge 2017 data

myprop16$mo=as.factor(substr(myprop16$transactiondate,6,7)) ##extract transaction month
myprop16$da=as.factor(substr(myprop16$transactiondate,9,10)) ##extract transaction day

myprop17$mo=as.factor(substr(myprop17$transactiondate,6,7)) ##extract transaction month
myprop17$da=as.factor(substr(myprop17$transactiondate,9,10)) ##extract transaction day

myprop16$parcelid=as.factor(myprop16$parcelid)
myprop17$parcelid=as.factor(myprop17$parcelid)

mylm1=lm(logerror~mo+da,data=myprop16) ##build stupid model for 2016
mylm2=lm(logerror~mo+da,data=myprop17) ##build stupid model for 2017

Sample 2

One thing some of you are struggling with is dealing with multiple observations in a single time period. Normally, we would use panel-like methods; however, we won't get to that until later. Here is a simplification that runs an ARIMA with an xreg (external regressor). Remember, ARIMA models are nothing more than regression-type models, so you can build them all manually if you would like. But if you don't want to do so, here is a simplification that allows you to evaluate time components with some of Hyndman's tools. Remember, you can always roll your own.

read.csv("C:/Users/lfult/Desktop/Zillow/train_v3") #read the csv (note: the result is not assigned here)

library(data.table) #great library…simplification to collapse multivariate data into means
newdata=myprop16[order(myprop16$transactiondate),] # order the data by transaction date
for(i in 1:ncol(newdata)){ #naive imputations…use MICE or something else!
newdata[is.na(newdata[,i]), i]= mean(newdata[,i], na.rm = TRUE)
}

newdata2=data.table(newdata, key="transactiondate") #make a data table with transactiondate as the key column
mydata=newdata2[,list(logerror=mean(logerror), yearbuilt=mean(na.omit(yearbuilt))),by=transactiondate] #build whatever variables are needed
myts=ts(mydata$logerror, frequency=1, start=c(2016,1))
myarima1=arima(myts, order=c(2,2,2), seasonal=list(order=c(1,1,1)), xreg=mydata$yearbuilt) #include regression elements with ARIMA elements
myarima2=arima(myts, order=c(2,2,2), seasonal=list(order=c(1,1,1)))
library(forecast) #for ets() and forecast()
myets=ets(myts)
plot(forecast(myarima2))

version1.r

setwd("C:/Users/AKOSMOS/Desktop/freelancer/tutorials")

#Data import: this takes a long time because the files are enormous.

#2016

donnees_2016 <- read.csv("properties_2016.csv", header=TRUE, sep=",", dec=".")

len2016=length(names(donnees_2016))

#Drop a column if more than 50% of its values are NA.
#(Dropping by index inside a loop shifts the remaining indices, so filter in one step instead.)
na_frac=colMeans(is.na(donnees_2016))
donnees_2016=donnees_2016[, na_frac<=0.5]

#Descriptive statistics:

L=nrow(donnees_2016)
C=ncol(donnees_2016)

#simple stats
for(i in 1:C){
  print(names(donnees_2016)[i])
  #max, min, and quantiles
  print(summary(donnees_2016[,i]))
  if(is.numeric(donnees_2016[,i])){
    #mean
    print(mean(donnees_2016[,i], na.rm=TRUE))
    #standard deviation
    print(sd(donnees_2016[,i], na.rm=TRUE))
  }
}

#Visualization
for(i in 1:C){
  if(is.numeric(donnees_2016[,i])){
    plot(donnees_2016[,i], main=names(donnees_2016)[i], type="p")
  } else {
    barplot(table(donnees_2016[,i]), main=names(donnees_2016)[i]) #hist() does not accept factors
  }
}

#Treatment of missing values: impute numeric columns with the column mean
for(i in 1:C){
  if(is.numeric(donnees_2016[,i])){
    donnees_2016[is.na(donnees_2016[,i]),i]=mean(donnees_2016[,i], na.rm=TRUE)
  }
}

#Outliers: boxplot (boîte à moustaches)
for(i in 1:C){
  if(is.numeric(donnees_2016[,i])){
    box=boxplot(donnees_2016[,i], plot=FALSE)
    if(length(box$out)>0){
      Position=match(box$out, donnees_2016[,i])
      donnees_2016[Position,i]=mean(donnees_2016[,i], na.rm=TRUE) #replace outliers with the column mean
    }
  }
}

#Recoding variables / dummies
for(i in 1:C){
  if(is.factor(donnees_2016[,i])){
    lev=levels(donnees_2016[,i])
    for(k in seq_along(lev)){
      colonne=as.numeric(donnees_2016[,i]==lev[k])
      donnees_2016=cbind(donnees_2016, colonne)
    }
  }
}

#Backward regression approach (requires a fitted full model first, e.g. full=lm(response~., data=donnees_2016))

step(full, direction="backward")

x=donnees_2016[,1]

capture.output(summary(as.numeric(x)), file="summary1.txt")

head(x)

donnees_2017 <- read.csv("properties_2017.csv", header=TRUE, sep=",", dec=".")

length(names(donnees_2017))

#data=rbind(donnees_2016,donnees_2017)

#We export all variable names to variables.txt

a=names(donnees_2016)

write(a, "variables.txt")

Solution 

corr matrix (Autosaved).R

(identical to version1.r above)

hist.R 

train<-train_2016_v2 #first load these files in RStudio

hist(train$logerror, prob=TRUE) #logerror histogram for 2016
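For the assignment's correlation matrices (item 2), one approach is to add an absolute-logerror column and call cor() on the numeric columns. A minimal self-contained sketch; the toy data frame here is a stand-in for the merged myprop16 data from Sample 1, and with the real data the pairwise-complete option matters because of the many NAs:

```r
# Toy stand-in for the merged training data (replace with myprop16).
d <- data.frame(
  logerror     = c(0.03, -0.05, 0.01, 0.10),
  yearbuilt    = c(1960, 1980, 1975, 2001),
  garagecarcnt = c(1, 2, 2, 3)
)
d$abslogerror <- abs(d$logerror)   # absolute logerror column

# Pairwise-complete correlations tolerate the NAs in the real data.
cormat <- round(cor(d, use = "pairwise.complete.obs"), 2)
print(cormat)
```

With the real data, first subset to numeric columns (e.g. `d <- myprop16[sapply(myprop16, is.numeric)]`) before calling cor().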

TS_garage.R 

#ARIMA with regression on number of garages

properties<-properties_2016

train<-train_2016_v2

myprop16<-merge(properties, train, by="parcelid")

library(data.table)

newdata<-myprop16[order(myprop16$transactiondate),]

for(i in 1:ncol(newdata)){
newdata[is.na(newdata[,i]), i]=mean(newdata[,i], na.rm=TRUE)
} #there may be warnings, since in some cases we'll be averaging non-numeric columns, so NAs would come out, but they don't matter since we are not using those columns in the code

newdata2=data.table(newdata, key="transactiondate") #make a data table with transactiondate as the key column

mydata=newdata2[,list(logerror=mean(logerror), garagecarcnt=mean(na.omit(garagecarcnt))),by=transactiondate]

myts=ts(mydata$logerror, frequency=1, start=c(2016,1))

library(forecast)

arima_build<-auto.arima(myts,xreg=mydata$garagecarcnt)

accuracy(arima_build) #gives accuracy parameters for the time series trained on 2016 data

testset<- properties_2016 #added

X=matrix(0,nrow(testset),7)

for ( i in 1:nrow(testset)){

P=fitted(arima_build,xreg=testset$garagecarcnt[i]) #this works

X[i,1]=testset$parcelid[i]

X[i,2]=mean(P[274:304]) #average of October daily predictions

X[i,3]=mean(P[305:335])

X[i,4]=mean(P[336:351])

}

#calculation for 2017 data

properties<-properties_2017

train<-train_2017

myprop17<-merge(properties, train, by="parcelid")

library(data.table)

newdata<-myprop17[order(myprop17$transactiondate),]

for(i in 1:ncol(newdata)){
newdata[is.na(newdata[,i]), i]=mean(newdata[,i], na.rm=TRUE)
}

newdata2=data.table(newdata, key="transactiondate")

mydata=newdata2[,list(logerror=mean(logerror), garagecarcnt=mean(na.omit(garagecarcnt))),by=transactiondate]

myts=ts(mydata$logerror, frequency=1, start=c(2017,1))

library(forecast)

arima_build<-auto.arima(myts,xreg=mydata$garagecarcnt)

accuracy(arima_build) #gives accuracy parameters for the time series trained on 2017 data

testset<- properties_2017 #added

for ( i in 1:nrow(testset)){

xreg=c(rep(testset$garagecarcnt[i],100))

P=as.vector(forecast(arima_build,100,xreg=xreg)$mean) #this works

X[i,5]=mean(P[1:31]) #average of October daily predictions

X[i,6]=mean(P[32:61])

X[i,7]=mean(P[62:92])

}#the upper level isn't given, so there may be warnings, but they don't matter since we only need the first 92 values

#printing X

colnames(X)<-c("parcelid","201610","201611","201612","201710","201711","201712")

X<-data.frame(X, check.names=FALSE) #keep the numeric column names for the Kaggle format

X[is.na(X)]<-0

X<-X[order(X$parcelid),]

X<-round(X,4)

write.csv(X, "Krista3.csv", row.names=FALSE) #row.names=FALSE matches the sample-submission format

TS_totaltax.R 

#ARIMA with regression on totaltax

properties<-properties_2016

train<-train_2016_v2

myprop16<-merge(properties, train, by="parcelid")

library(data.table)

newdata<-myprop16[order(myprop16$transactiondate),]

for(i in 1:ncol(newdata)){
newdata[is.na(newdata[,i]), i]=mean(newdata[,i], na.rm=TRUE)
}

newdata2=data.table(newdata, key="transactiondate") #make a data table with transactiondate as the key column

mydata=newdata2[,list(logerror=mean(logerror), taxvaluedollarcnt=mean(na.omit(taxvaluedollarcnt))),by=transactiondate]

myts=ts(mydata$logerror, frequency=1, start=c(2016,1))

library(forecast)

arima_build<-auto.arima(myts,xreg=mydata$taxvaluedollarcnt)

accuracy(arima_build) #gives accuracy parameters for the time series trained on 2016 data

testset<- properties_2016 #added

X=matrix(0,nrow(testset),7)

for ( i in 1:nrow(testset)){

P=fitted(arima_build,xreg=testset$taxvaluedollarcnt[i]) #this works

X[i,1]=testset$parcelid[i]

X[i,2]=mean(P[274:304]) #average of October daily predictions

X[i,3]=mean(P[305:335])

X[i,4]=mean(P[336:351])

}

#calculation for 2017 data

properties<-properties_2017

train<-train_2017

myprop17<-merge(properties, train, by="parcelid")

library(data.table)

newdata<-myprop17[order(myprop17$transactiondate),]

for(i in 1:ncol(newdata)){
newdata[is.na(newdata[,i]), i]=mean(newdata[,i], na.rm=TRUE)
}

newdata2=data.table(newdata, key="transactiondate") #make a data table with transactiondate as the key column

mydata=newdata2[,list(logerror=mean(logerror), taxvaluedollarcnt=mean(na.omit(taxvaluedollarcnt))),by=transactiondate]

myts=ts(mydata$logerror, frequency=1, start=c(2017,1))

library(forecast)

arima_build<-auto.arima(myts,xreg=mydata$taxvaluedollarcnt)

accuracy(arima_build) #gives accuracy parameters for the time series trained on 2017 data

testset<- properties_2017 #added

for ( i in 1:nrow(testset)){

xreg=c(rep(testset$taxvaluedollarcnt[i],100))

P=as.vector(forecast(arima_build,100,xreg=xreg)$mean) #this works

X[i,5]=mean(P[1:31]) #average of October daily predictions

X[i,6]=mean(P[32:61])

X[i,7]=mean(P[62:92])

}#the upper level isn't given, so there may be warnings, but they don't matter since we only need the first 92 values

#printing X

colnames(X)<-c("parcelid","201610","201611","201612","201710","201711","201712")

X<-data.frame(X, check.names=FALSE) #keep the numeric column names for the Kaggle format

X<-X[order(X$parcelid),]

X[is.na(X)]<-0

X<-round(X,4)

write.csv(X, "Krista4.csv", row.names=FALSE) #row.names=FALSE matches the sample-submission format

Krista_buildyear.R 

#ARIMA with regression on yearbuilt

properties<-properties_2016

train<-train_2016_v2

myprop16<-merge(properties, train, by="parcelid") #merge the 2 datasets

library(data.table)

newdata<-myprop16[order(myprop16$transactiondate),]

for(i in 1:ncol(newdata)){
newdata[is.na(newdata[,i]), i]=mean(newdata[,i], na.rm=TRUE)
}#missing-value imputation; there may be warnings, since in some cases we'll be averaging non-numeric columns, so NAs would come out, but they don't matter since we are not using those columns in the code

newdata2=data.table(newdata, key="transactiondate") #make a data table with transactiondate as the key column

mydata=newdata2[,list(logerror=mean(logerror), yearbuilt=mean(na.omit(yearbuilt))),by=transactiondate]

myts=ts(mydata$logerror, frequency=1, start=c(2016,1)) #our training series

library(forecast)

arima_build<-auto.arima(myts,xreg=mydata$yearbuilt)

accuracy(arima_build) #gives accuracy parameters for the time series trained on 2016 data

testset<- properties_2016 #added

X=matrix(0,nrow(testset),7)

for ( i in 1:nrow(testset)){

P=fitted(arima_build,xreg=c(rep(testset$yearbuilt[i],length(myts))) )#this works

X[i,1]=testset$parcelid[i]

X[i,2]=mean(P[274:304]) #average of October daily predictions

X[i,3]=mean(P[305:335])

X[i,4]=mean(P[336:351])

}

#calculation for 2017 data

properties<-properties_2017

train<-train_2017

myprop17<-merge(properties, train, by="parcelid")

library(data.table)

newdata<-myprop17[order(myprop17$transactiondate),]

for(i in 1:ncol(newdata)){
newdata[is.na(newdata[,i]), i]=mean(newdata[,i], na.rm=TRUE)
}

newdata2=data.table(newdata, key="transactiondate") #make a data table with transactiondate as the key column

mydata=newdata2[,list(logerror=mean(logerror), yearbuilt=mean(na.omit(yearbuilt))),by=transactiondate]

myts=ts(mydata$logerror, frequency=1, start=c(2017,1))

library(forecast)

arima_build<-auto.arima(myts,xreg=mydata$yearbuilt)

accuracy(arima_build) #gives accuracy parameters for the time series trained on 2017 data

testset<-properties_2017 #added

for ( i in 1:nrow(testset)){

xreg=c(rep(testset$yearbuilt[i],100))

P=as.vector(forecast(arima_build,100,xreg=xreg)$mean) #this works

X[i,5]=mean(P[1:31]) #average of October daily predictions

X[i,6]=mean(P[32:61])

X[i,7]=mean(P[62:92])

}#the upper level isn't given, so there may be warnings, but they don't matter since we only need the first 92 values

#printing X

colnames(X)<-c("parcelid","201610","201611","201612","201710","201711","201712")

X<-data.frame(X, check.names=FALSE) #keep the numeric column names for the Kaggle format

X<-X[order(X$parcelid),]

X[is.na(X)]<-0

X<-round(X,4)

write.csv(X, "Krista1.csv", row.names=FALSE) #row.names=FALSE matches the sample-submission format

Krista_totalarea.R

#ARIMA with regression on total calculated area (calculatedfinishedsquarefeet)

properties<-properties_2016

train<-train_2016_v2

myprop16<-merge(properties, train, by="parcelid")

library(data.table)

newdata<-myprop16[order(myprop16$transactiondate),]

for(i in 1:ncol(newdata)){
newdata[is.na(newdata[,i]), i]=mean(newdata[,i], na.rm=TRUE)
} #there may be warnings, since in some cases we'll be averaging non-numeric columns, so NAs would come out, but they don't matter since we are not using those columns in the code

newdata2=data.table(newdata, key="transactiondate") #make a data table with transactiondate as the key column

mydata=newdata2[,list(logerror=mean(logerror), calculatedfinishedsquarefeet=mean(na.omit(calculatedfinishedsquarefeet))),by=transactiondate]

myts=ts(mydata$logerror, frequency=1, start=c(2016,1))

library(forecast)

arima_build<-auto.arima(myts,xreg=mydata$calculatedfinishedsquarefeet)

accuracy(arima_build) #gives accuracy parameters for the time series trained on 2016 data

testset<- properties_2016 #added

X=matrix(0,nrow(testset),7)

for ( i in 1:nrow(testset)){

P=fitted(arima_build,xreg=testset$calculatedfinishedsquarefeet[i]) #this works

X[i,1]=testset$parcelid[i]

X[i,2]=mean(P[274:304]) #average of October daily predictions

X[i,3]=mean(P[305:335])

X[i,4]=mean(P[336:351])

}

#calculation for 2017 data

properties<-properties_2017

train<-train_2017

myprop17<-merge(properties, train, by="parcelid")

library(data.table)

newdata<-myprop17[order(myprop17$transactiondate),]

for(i in 1:ncol(newdata)){
newdata[is.na(newdata[,i]), i]=mean(newdata[,i], na.rm=TRUE)
}

newdata2=data.table(newdata, key="transactiondate") #make a data table with transactiondate as the key column

mydata=newdata2[,list(logerror=mean(logerror), calculatedfinishedsquarefeet=mean(na.omit(calculatedfinishedsquarefeet))),by=transactiondate]

myts=ts(mydata$logerror, frequency=1, start=c(2017,1))

library(forecast)

arima_build<-auto.arima(myts,xreg=mydata$calculatedfinishedsquarefeet)

accuracy(arima_build) #gives accuracy parameters for the time series trained on 2017 data

testset<- properties_2017 #added

for ( i in 1:nrow(testset)){

xreg=c(rep(testset$calculatedfinishedsquarefeet[i],100))

P=as.vector(forecast(arima_build,100,xreg=xreg)$mean) #this works

X[i,5]=mean(P[1:31]) #average of October daily predictions

X[i,6]=mean(P[32:61])

X[i,7]=mean(P[62:92])

}#the upper level isn't given, so there may be warnings, but they don't matter since we only need the first 92 values

#printing X

colnames(X)<-c("parcelid","201610","201611","201612","201710","201711","201712")

X<-data.frame(X, check.names=FALSE) #keep the numeric column names for the Kaggle format

X<-X[order(X$parcelid),]

X[is.na(X)]<-0

X<-round(X,4)

write.csv(X, "Krista2.csv", row.names=FALSE) #row.names=FALSE matches the sample-submission format