OutLie..R: Down and Dirty Forecasting: Part 2

Friday, May 24, 2013

Down and Dirty Forecasting: Part 2

This is the second part of the forecasting exercise, where I am looking at a multiple regression. To keep it simple I chose the states that boarder WI and the US unemployment information for the regression. Again this is a down and dirty analysis, I would not call this complete in any sense.

#getting the data for WI, US, IL, MI, MN, IO
 
#Using Quandl data, great little site
wi<-read.csv('http://www.quandl.com/api/v1/datasets/FRED/WIUR.csv?&auth_token=gigXwpxd6Ex91cjgz1B7&trim_start=1976-01-01&trim_end=2013-04-01&sort_order=desc', colClasses=c('Date'='Date'))
 
us<-read.csv('http://www.quandl.com/api/v1/datasets/FRED/UNRATE.csv?&auth_token=gigXwpxd6Ex91cjgz1B7&trim_start=1976-01-01&trim_end=2013-04-01&sort_order=desc', colClasses=c('Date'='Date'))
 
il<-read.csv('http://www.quandl.com/api/v1/datasets/FRED/ILUR.csv?&auth_token=gigXwpxd6Ex91cjgz1B7&trim_start=1976-01-01&trim_end=2013-04-01&sort_order=desc', colClasses=c('Date'='Date'))
 
mi<-read.csv('http://www.quandl.com/api/v1/datasets/FRED/MIUR.csv?&auth_token=gigXwpxd6Ex91cjgz1B7&trim_start=1976-01-01&trim_end=2013-04-01&sort_order=desc', colClasses=c('Date'='Date'))
 
mn<-read.csv('http://www.quandl.com/api/v1/datasets/FRED/MNUR.csv?&auth_token=gigXwpxd6Ex91cjgz1B7&trim_start=1976-01-01&trim_end=2013-04-01&sort_order=desc', colClasses=c('Date'='Date'))
 
io<-read.csv('http://www.quandl.com/api/v1/datasets/FRED/IAUR.csv?&auth_token=gigXwpxd6Ex91cjgz1B7&trim_start=1976-01-01&trim_end=2013-04-01&sort_order=desc', colClasses=c('Date'='Date'))
 
 
#merging the data into one dataframe
#I started with WI so that I could isolate the rest of the variables better
unemp<-merge(wi, io, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io')
unemp<-merge(unemp, mi, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi')
unemp<-merge(unemp, mn, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi', 'mn')
unemp<-merge(unemp, us, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi', 'mn', 'us')
unemp<-merge(unemp, il, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi', 'mn', 'us', 'il')
 
save(unemp, file='Unemployment WI.RData')
 
 
library(car)
library(lmtest)
library(forecast)
 
 
load('Unemployment WI.RData')
unemp.b<-unemp[1:436,]
unemp.p<-unemp[437:448,]
 
plot(unemp$Date, unemp$us, col='black',

main='Unemployment WI Region', type='l', xlab='Date',

ylab='Unemp. Rate', ylim=c(2, 17))
lines(unemp$Date, unemp$il, col='blue')
lines(unemp$Date, unemp$io, col='dark green')
lines(unemp$Date, unemp$mi, col='purple')
lines(unemp$Date, unemp$mn, col='gray')
lines(unemp$Date, unemp$wi, col='red')
leg.txt<-c('US', 'IL', 'IO', 'MI', 'MN', "WI")
leg.col<-c('black','blue', 'dark green', 'purple', 'gray', 'red')
legend('topleft', leg.txt, col=leg.col, pch=15)
 
summary(unemp)
 
#correlation matrix
 
cm<-cor(unemp[,-1])
cm
 
#From the graph and correlation matrix, everything

#is highly correlated, the lowest r value is 0.7879

#between Iowa and US.
#plots of each of the data, then each individual,

#everything looks really good, nice linear trends,

#although Michigan has quite bit of volatility

#compared to the other variables.
 
plot(unemp.b, pch='.')
model<-lm(wi~., data=unemp.b[,-1])
step.m<-step(model, direction='backward')
summary(step.m)
 
#testing for outliers
outlierTest(step.m)
 
# After conducting the outlier test, looks like it is michigan 
#Sept 1980 where unemployment was 13.2 percent, point 57 is 
#close at 12.9 those are the two biggest, but the outlierTest 
#isolated row 56
 
dwtest(step.m)
dwtest(step.m, alternative='two.sided')
#yep lots of autocorrelation, so a multiple ARAIMA might be in order
 
pred<-unemp.p[3:7]
mult.f<-predict(step.m, newdata=pred, interval='prediction')
a1<-accuracy(mult.f, unemp.p$wi)
 
#because the MASE is missing I needed to add the following to compare later
a1<-c(a1,'0')
names(a1)<-c("ME",  "RMSE", "MAE", "MPE", "MAPE", "MASE")
 
 
# ARIMA
 
auto.r<-auto.arima(unemp.b$wi, xreg=unemp.b[,3:7])
auto.r
auto.f<-forecast(auto.r, xreg=unemp.p[,3:7], h=12)
a2<-accuracy(auto.f, unemp.p$wi)
a2
 
#Just remember that the 0 is just a place holder
r.table<-rbind(a1, a2)
row.names(r.table)<-c('Multiple Regression', 'Reg. with ARIMA Errors')
r.table

At this point I combined the two accuracy tables to see which one was the best.

b.table<-rbind(a.table[,1:6], r.table)
b.table<-b.table[order(b.table$MAPE),]
b.table

Created by Pretty R at inside-R.org

The winner is: Regression with ARIMA Errors, for most of the analysis the simple mean like predictions dominated, even the multiple regression is considerably low on the list.

3 comments:

tomdireillMay 30, 2013 at 2:48 PM
Good use of Quandl!

Getting Dirtier. Yes ARIMA is found to be helpful, but there are other things going on too. Due to a max of 4,096 I had to edit out about 20 outliers from the model

A change in variance was detected using TSAY's variance change detection test, automatically beginning at 9/1990 and WLS was used. We don't show the H-Weights here.

The model includes SOME of the causals(no leads or lags were detected) and outliers and three changes in seasonality(August and two in January). It also has an AR1 of .704. The forecasts were as follows with a 1.22% MAPE.

7
7
7
7
6.9
6.8
6.7
6.7
7
7.1
7.1
7.1

[(1-B**1)]Y(T) = +[X1(T)][(1-B**1)][(+ .0330)] io
+[X2(T)][(1-B**1)][(+ .274)] mn
+[X3(T)][(1-B**1)][(+ .0429)] us
+[X4(T)][(1-B**1)][(+ .206)] il
+[X5(T)][(1-B**1)][(+ 1.5171)] :PULSE 56 1980/ 8
+[X6(T)][(1-B**1)][(- .751)] :PULSE 144 1987/ 12
+[X7(T)][(1-B**1)][(+ .0258)] :PULSE 51 1980/ 3
+[X8(T)][(1-B**1)][(+ .0347)] :PULSE 52 1980/ 4
+[X9(T)][(1-B**1)][(- .0432)] :PULSE 205 1993/ 1

+[X12(T)[(1-B**1)][(- .114)] :PULSE 87 1983/ 3
+[X13(T)[(1-B**1)][(- .113)] :PULSE 88 1983/ 4

+[X37(T)[(1-B**1)][(- .0702)] :PULSE 190 1991/ 10
+[X38(T)[(1-B**1)][(- .0440)] :PULSE 382 2007/ 10
+[X39(T)[(1-B**1)][(+ .0536)] :PULSE 111 1985/ 3
+[X40(T)[(1-B**1)][(- .110)] :PULSE 42 1979/ 6
+[X41(T)[(1-B**1)][(+ .0757)] :SEASONAL PULSE 409 2010/ 1
+[X42(T)[(1-B**1)][(- .0212)] :SEASONAL PULSE 121 1986/ 1
+[X43(T)[(1-B**1)][(+ .0011)] :PULSE 192 1991/ 12
+ [(1- .704B** 1)]**-1 [A(T)]

Here is the Variance Change Detection :

DIAGNOSTIC CHECK #5: THE TSAY VARIANCE CONSTANCY TEST

The Critical value used for this test : .01
The minimum group or interval size was: 87

DIRECTION TIME DATE F VALUE P VALUE
(T)

DECREASING 177 1990/ 9 .335430 .0000

Since the automatic model fixup option for the variance stability test is
enabled, the program will now estimate the parameters of the model with a set
of weights that adjusts the residuals to account for the variance change(s).

Tom Reilly
www.autobox.com
ReplyDelete
Replies
adam888May 31, 2013 at 4:49 AM
I get an error on the line below:

mult.f<-predict(step.m, newdata=pred, interval='prediction')

a1<-accuracy(mult.f, unemp.p$wi)
Error: $ operator is invalid for atomic vectors

How can I fix this? Thanks
ReplyDelete
Replies
UnknownAugust 5, 2017 at 11:23 AM
Hello Everybody,
My name is Ahmad Asnul Brunei, I contacted Mr Osman Loan Firm for a business loan amount of $250,000, Then i was told about the step of approving my requested loan amount, after taking the risk again because i was so much desperate of setting up a business to my greatest surprise, the loan amount was credited to my bank account within 24 banking hours without any stress of getting my loan. I was surprise because i was first fall a victim of scam! If you are interested of securing any loan amount & you are located in any country, I'll advise you can contact Mr Osman Loan Firm via email osmanloanserves@gmail.com

LOAN APPLICATION INFORMATION FORM
First name......
Middle name.....
2) Gender:.........
3) Loan Amount Needed:.........
4) Loan Duration:.........
5) Country:.........
6) Home Address:.........
7) Mobile Number:.........
8) Email address..........
9) Monthly Income:.....................
10) Occupation:...........................
11)Which site did you here about us.....................
Thanks and Best Regards.
Derek Email osmanloanserves@gmail.com
ReplyDelete
Replies

Add comment