Thursday, September 19, 2013

How do you draw a Pirate Hat?: rnorm()

Since today, 19 Sept is talk like a pirate day, I would like to relate a story about stats and pirates. While driving, my wife was in the back seat with our young child asked me how to draw a pirate hat. Since I was driving I could not show her how to do it, so I had to figure out a way to describe it that both of us could understand. My reply, "Draw a normal distribution curve."

She looked at me kind of weird, then drew it, half laughing to her self, and it worked. A simple pirate hat kind of looks like a normal distribution. My child was happy, my wife and I were kind of giggling our selves, because some how I was able to bring stats into a child's drawing. We both were amused. Below is the R code and very crude picture.

x<-rnorm(n=10000, mean=1, sd=.25)
plot(density(x), type='l', main='', ylab='', xlab='')
Created by Pretty R at

Wednesday, May 29, 2013

Look Familiar? Mapping in R

For those who have been following the R-Bloggers this picture should be yesterdays news, but there was an article on the BBC on it, full story HERE. I find it interesting how it was on Drudge Report ( which gets over 1 million hits a day. The article gives the perspective of 5 different points of view, the 'data visualization' interpretation is pretty interesting, as are the rest. For those who want to do more than just look at a pretty picture here are some links to help recreate a similar, if not better map.


Mapping the world Airlines

Worlds Largest Airlines

Data for Airlines

Tutorial on how to create the great circles

Friday, May 24, 2013

Down and Dirty Forecasting: Part 2

This is the second part of the forecasting exercise, where I am looking at a multiple regression. To keep it simple I chose the states that boarder WI and the US unemployment information for the regression. Again this is a down and dirty analysis, I would not call this complete in any sense.

#getting the data for WI, US, IL, MI, MN, IO
#Using Quandl data, great little site
wi<-read.csv('', colClasses=c('Date'='Date'))
us<-read.csv('', colClasses=c('Date'='Date'))
il<-read.csv('', colClasses=c('Date'='Date'))
mi<-read.csv('', colClasses=c('Date'='Date'))
mn<-read.csv('', colClasses=c('Date'='Date'))
io<-read.csv('', colClasses=c('Date'='Date'))
#merging the data into one dataframe
#I started with WI so that I could isolate the rest of the variables better
unemp<-merge(wi, io, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io')
unemp<-merge(unemp, mi, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi')
unemp<-merge(unemp, mn, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi', 'mn')
unemp<-merge(unemp, us, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi', 'mn', 'us')
unemp<-merge(unemp, il, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi', 'mn', 'us', 'il')
save(unemp, file='Unemployment WI.RData')
load('Unemployment WI.RData')
plot(unemp$Date, unemp$us, col='black', 
main='Unemployment WI Region', type='l', xlab='Date', 
ylab='Unemp. Rate', ylim=c(2, 17))
lines(unemp$Date, unemp$il, col='blue')
lines(unemp$Date, unemp$io, col='dark green')
lines(unemp$Date, unemp$mi, col='purple')
lines(unemp$Date, unemp$mn, col='gray')
lines(unemp$Date, unemp$wi, col='red')
leg.txt<-c('US', 'IL', 'IO', 'MI', 'MN', "WI")
leg.col<-c('black','blue', 'dark green', 'purple', 'gray', 'red')
legend('topleft', leg.txt, col=leg.col, pch=15)
#correlation matrix
#From the graph and correlation matrix, everything 
#is highly correlated, the lowest r value is 0.7879 
#between Iowa and US.
#plots of each of the data, then each individual, 
#everything looks really good, nice linear trends, 
#although Michigan has quite bit of volatility 
#compared to the other variables.
plot(unemp.b, pch='.')
model<-lm(wi~., data=unemp.b[,-1])
step.m<-step(model, direction='backward')
#testing for outliers
# After conducting the outlier test, looks like it is michigan 
#Sept 1980 where unemployment was 13.2 percent, point 57 is 
#close at 12.9 those are the two biggest, but the outlierTest 
#isolated row 56
dwtest(step.m, alternative='two.sided')
#yep lots of autocorrelation, so a multiple ARAIMA might be in order
mult.f<-predict(step.m, newdata=pred, interval='prediction')
a1<-accuracy(mult.f, unemp.p$wi)
#because the MASE is missing I needed to add the following to compare later
names(a1)<-c("ME",  "RMSE", "MAE", "MPE", "MAPE", "MASE")
auto.r<-auto.arima(unemp.b$wi, xreg=unemp.b[,3:7])
auto.f<-forecast(auto.r, xreg=unemp.p[,3:7], h=12)
a2<-accuracy(auto.f, unemp.p$wi)
#Just remember that the 0 is just a place holder
r.table<-rbind(a1, a2)
row.names(r.table)<-c('Multiple Regression', 'Reg. with ARIMA Errors')
At this point I combined the two accuracy tables to see which one was the best.

b.table<-rbind(a.table[,1:6], r.table)
Created by Pretty R at

The winner is: Regression with ARIMA Errors, for most of the analysis the simple mean like predictions dominated, even the multiple regression is considerably low on the list.

Down and Dirty Forecasting: Part 1

I wanted to see what I could do in a hurry using the commands found at Forecasting: Principles and Practice . I chose a simple enough data set of Wisconsin Unemployment from 1976 to the present (April 2013). I kept the last 12 months worth of data to test the accuracy of the models. The next blog post will include a multiple regression analysis. The analysis is lacking many important steps, particularly the ARIMA, but this is a down and dirty exercise.

#State Unemployment seasonally adjusted
#Using Quandl data, great little site
wi<-read.csv('', colClasses=c('Date'='Date'))
#some minor clean up
colnames(wi)<-c('date', 'rate')
#base data, 1-436, test data 437-448
wi.ts<-ts(wi.b$rate, start=c(1976, 1), frequency=12)
wi.p.ts<-ts(wi.p$rate, start=c(2012, 5), frequency=12)
#Lets test some models
mean<-meanf(wi.ts, 12)
naive<-rwf(wi.ts, 12)
s.naive<-snaive(wi.ts, 12)
drift<-rwf(wi.ts, 12, drift=T)
#linear fit
#checking for autocorrelation
res1 <- residuals(m1)
plot(res1, ylab="Residuals",xlab="Year")
Acf(res1, main="ACF of residuals")
res2 <- residuals(m2)
plot(res2, ylab="Residuals",xlab="Year")
Acf(res2, main="ACF of residuals")
#Durbin-Watson Test
dwtest(m1, alt="two.sided")
dwtest(m2, alt="two.sided")
#yep autocorrelation city! No surprize here, due to the nature of unemployment
#STL ETS Decomposition
m3<-stl(wi.ts, s.window='periodic')
m4<-ets(wi.ts, model='ZZZ')
plot(forecast(m5, h=12))
#neural networks
plot(forecast(m6, h=12))
#Testing for accuracy the first 4 models
a1<-accuracy(mean, wi.p.ts)
a2<-accuracy(naive, wi.p.ts)
a3<-accuracy(s.naive, wi.p.ts)
a4<-accuracy(drift, wi.p.ts)
a.table<-rbind(a1, a2, a3, a4)
#Creating the forecast and accuracy for the next 6 models
f1<-forecast(m1, h=12)
f2<-forecast(m2, h=12)
f3<-forecast(m3, h=12)
f4<-forecast(m4, h=12)
f5<-forecast(m5, h=12)
f6<-forecast(m6, h=12)
a5<-accuracy(f1, wi.p.ts)
a6<-accuracy(f2, wi.p.ts)
a7<-accuracy(f3, wi.p.ts)
a8<-accuracy(f4, wi.p.ts)
a9<-accuracy(f5, wi.p.ts)
a10<-accuracy(f6, wi.p.ts)
#Combining into a table with row names
a.table<-rbind(a.table, a5, a6, a7, a8, a9, a10)
row.names(a.table)<-c('Mean', 'Naive', 'S. Naive', 'Drift', 'Lm~Trend', 
'Lm~Trend+Sea', 'STL', 'ETS', 'ARIMA', 'Neuro')
#make into a data frame so the best model is first, according to MAPE
Created by Pretty R at

Results so far: Looks like the mean like forecasts are doing the best, the fancy models are not doing very well.

Saturday, April 13, 2013

Spring Cleaning Data: 6 of 6- Saving the Data

With all the cleaning done, the only thing left to do is save the data to be analyzed, for future use, and I hope by others. The data I thought would be simple, but there were a few interesting twist, like the Primary Credit*, and using ifelse() to edit the districts.

I have included the product as well as the R-code in a single file for people to use and learn from. I would like to thanks all those who made comments, I find all of them helpful. Below are the links to the files generated and used in the series, and the r-code used to exporting and reloading the data.

List of files used and their links

#Export the data, csv and RData
setwd("C:/Users Defined/")
write.csv(dw, file='DiscountWindow.csv')
save(dw, file='DiscountWindow.RData')
#note when loading the data the envir= needs to be defined
#with larger files the RData is definately the way to go
#this file is small enough it does not matter
load('DiscountWindow.RData', envir=.GlobalEnv)
dw<-read.csv(file.choose(), header=T)
Created by Pretty R at

Previous Posts (Part 1Part 2Part 3Part 4, Part 5)

Friday, April 12, 2013

Spring Cleaning Data: 5 of 6- 2 ifelse vs Merge

The blog in the data cleaning series looks at separating out the Federal Reserve Districts. What I wanted was two additional columns, where I had the name of the city and the number for each district. Since I was on a separation kick I thought it would be fun to do this using ifelse() function.

Well, what started out as a fun romp in the fields turned to an exercise in precision and frustration that did end well, but took too much time, and too many lines of code to do what I wanted.

While I was banging my head against the keyboard in frustration, the thought occurred to me. Instead of using the ifelse() function, create a table with the new columns of data then merge the original data with the table just created. Two lines of code for both columns of data, definitely one of those eureka moments.

The lesson in all of this, ifelse() functions are good within a limited use, I would say 5 or less. Unless you really like doing them, then have fun. If there are limited number of occurrences like this example 12 different districts, the table works very well. What took me 2 hours of work using the ifelse() function, took me 15 minutes using the table method. The code is simpler, and easier to understand. Sure, there is the extra table to be imported, but it is small and very manageable. 

I have placed the code below, with the merge code first, followed by the ifelse() code. The table I used can be downloaded from here (District Data). Read the district data in by using the read.csv() then merge the two files using the 'district' as the column they both have in common. The ifelse(logic, true, false), the logic is if the column looks like one of the districts, if true a 1/Boston, at the end there is the 'Error' just in case.

#Merging the data

dist<-read.csv(file.choose(), header=T)
dw<-merge(dw, dist, by='district')
#re-coding the district data to numerical
tmp1<-ifelse(dw$district=='Boston (1)', 1,
   ifelse(dw$district=='New York (2)', 2,
   ifelse(dw$district=='Philadelphia (3)', 3,
   ifelse(dw$district=='Cleveland (4)', 4,
   ifelse(dw$district=='Richmond (5)', 5,
   ifelse(dw$district=='Atlanta (6)', 6,
   ifelse(dw$district=='Chicago (7)', 7,
   ifelse(dw$district=='St. Louis (8)', 8,
   ifelse(dw$district=='Minneapolis (9)', 9,
   ifelse(dw$district=='Kansas City (10)', 10,
   ifelse(dw$district=='Dallas (11)', 11,
   ifelse(dw$district=='San Francisco (12)', 12,
#Isolating the names, making to factor
tmp2<-ifelse(dw$district=='Boston (1)', 'Boston',
   ifelse(dw$district=='New York (2)', 'New York',
   ifelse(dw$district=='Philadelphia (3)', 'Philadelphia',
   ifelse(dw$district=='Cleveland (4)', 'Cleveland',
   ifelse(dw$district=='Richmond (5)', 'Richmond',
   ifelse(dw$district=='Atlanta (6)', 'Atlanta',
   ifelse(dw$district=='Chicago (7)', 'Chicago',
   ifelse(dw$district=='St. Louis (8)', 'St. Louis',
   ifelse(dw$district=='Minneapolis (9)', 'Minneapolis',
   ifelse(dw$district=='Kansas City (10)', 'Kansas City',
   ifelse(dw$district=='Dallas (11)', 'Dallas',
   ifelse(dw$district=='San Francisco (12)', 'San Francisco',
Created by Pretty R at

Previous Posts (Part 1, Part 2, Part 3, Part 4)

Thursday, April 11, 2013

Spring Cleaning Data: 4 of 6- Combining the files & Changing the Dates/Credit Type

So far the individual files have been left on their own, it is now time to combine using the rbind function, simple enough after all we have done so far, then the quick check with summary.

Now that we have one data frame, time to make larger changes to the data. The first is to get the dates into a format that R can understand. The as.Date() function does this by defining the variable, then the pattern for the date. At this point, I had a hard time figuring out what each one meant; basically you are defining what the date looks like now in the data frame, not in the future.

For this data set the '%b %d %Y' or in other words Feb 01 2011, if the date looked like Feb-01-2011, then the code would be '%b-%d-%Y', or if the date was 02-02-2011, then '%m-%d-%Y'. For a more comprehensive tutorial, see the post on Quick-R.

#Changing the date variables, then 
#isolating the year variable for alter use
dw$<-as.Date(dw$, '%b %d %Y')
dw$<-as.Date(dw$, '%b %d %Y')
dw$<-as.Date(dw$, '%b %d %Y')

At this point, I like to have two extra variables so I can aggregate the data later for some nice results, in particular the year and the month. The reason is I want to know if there is a difference in the years.  I know there are only 2 years so far, but every quarter new data will be released so I am setting up the code for it now. The month I want to know if there is any seasonality to it. If I choose to I can isolate the day, but this gets messy because February has 28/29 days, then the rest of the months fluctuate between 30 and 31. The data is scattered and blotchy as is, making the day too small of a unit to be useful.

The code assumes the date has been changed to the R default of YYYY-MM-DD, for the year I selected the first 4 numbers using the str_sub() function, while making it a numerical value- as.numeric(). The year and date variable I made it a factor for easier sorting and categorizing, with a similar process as above except I want both.

#Create a year variable
dw$year<-as.numeric(str_sub(dw$, start=1, 
#Create a year and month variable
   start=1, end=7))

The next step is to change the credit type to something simpler for tables and graphs. I used the gsub, one of the most interesting and fun functions I never knew existed until I did this. Basically it will take a string then replace it with another. For this data I wanted to replace the "Primary Credit" with "primary" because it make things so much easier for graphs and tables. Then I changed it to a factor.
#Changing the type of credit to one word
   gsub("Primary Credit", 'primary',
   gsub("Seasonal Credit", 'seasonal',
   gsub("Secondary Credit", 'secondary',
#change to factor
Created by Pretty R at

Links to the previous posts (post 1, post 2, post 3)