Since today, 19 Sept is talk like a pirate day, I would like to relate a story about stats and pirates. While driving, my wife was in the back seat with our young child asked me how to draw a pirate hat. Since I was driving I could not show her how to do it, so I had to figure out a way to describe it that both of us could understand. My reply, "Draw a normal distribution curve."
She looked at me kind of weird, then drew it, half laughing to her self, and it worked. A simple pirate hat kind of looks like a normal distribution. My child was happy, my wife and I were kind of giggling our selves, because some how I was able to bring stats into a child's drawing. We both were amused. Below is the R code and very crude picture.
Created by Pretty R at inside-R.org
OutLie..R
Analysis and tutorials for statistics, economics, and public policy using the statistical programming language R, from R-Cran.
Thursday, September 19, 2013
Wednesday, May 29, 2013
Look Familiar? Mapping in R
For those who have been following the R-Bloggers this picture should be yesterdays news, but there was an article on the BBC on it, full story HERE. I find it interesting how it was on Drudge Report (www.drudgereport.com) which gets over 1 million hits a day. The article gives the perspective of 5 different points of view, the 'data visualization' interpretation is pretty interesting, as are the rest. For those who want to do more than just look at a pretty picture here are some links to help recreate a similar, if not better map.
Links:
Mapping the world Airlines
Worlds Largest Airlines
Data for Airlines
Tutorial on how to create the great circles
Friday, May 24, 2013
Down and Dirty Forecasting: Part 2
This is the second part of the forecasting exercise, where I am looking at a multiple regression. To keep it simple I chose the states that boarder WI and the US unemployment information for the regression. Again this is a down and dirty analysis, I would not call this complete in any sense.
Created by Pretty R at inside-R.org
The winner is: Regression with ARIMA Errors, for most of the analysis the simple mean like predictions dominated, even the multiple regression is considerably low on the list.
#getting the data for WI, US, IL, MI, MN, IO #Using Quandl data, great little site wi<-read.csv('http://www.quandl.com/api/v1/datasets/FRED/WIUR.csv?&auth_token=gigXwpxd6Ex91cjgz1B7&trim_start=1976-01-01&trim_end=2013-04-01&sort_order=desc', colClasses=c('Date'='Date')) us<-read.csv('http://www.quandl.com/api/v1/datasets/FRED/UNRATE.csv?&auth_token=gigXwpxd6Ex91cjgz1B7&trim_start=1976-01-01&trim_end=2013-04-01&sort_order=desc', colClasses=c('Date'='Date')) il<-read.csv('http://www.quandl.com/api/v1/datasets/FRED/ILUR.csv?&auth_token=gigXwpxd6Ex91cjgz1B7&trim_start=1976-01-01&trim_end=2013-04-01&sort_order=desc', colClasses=c('Date'='Date')) mi<-read.csv('http://www.quandl.com/api/v1/datasets/FRED/MIUR.csv?&auth_token=gigXwpxd6Ex91cjgz1B7&trim_start=1976-01-01&trim_end=2013-04-01&sort_order=desc', colClasses=c('Date'='Date')) mn<-read.csv('http://www.quandl.com/api/v1/datasets/FRED/MNUR.csv?&auth_token=gigXwpxd6Ex91cjgz1B7&trim_start=1976-01-01&trim_end=2013-04-01&sort_order=desc', colClasses=c('Date'='Date')) io<-read.csv('http://www.quandl.com/api/v1/datasets/FRED/IAUR.csv?&auth_token=gigXwpxd6Ex91cjgz1B7&trim_start=1976-01-01&trim_end=2013-04-01&sort_order=desc', colClasses=c('Date'='Date')) #merging the data into one dataframe #I started with WI so that I could isolate the rest of the variables better unemp<-merge(wi, io, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io') unemp<-merge(unemp, mi, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi') unemp<-merge(unemp, mn, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi', 'mn') unemp<-merge(unemp, us, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi', 'mn', 'us') unemp<-merge(unemp, il, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi', 'mn', 'us', 'il') save(unemp, file='Unemployment WI.RData') library(car) library(lmtest) library(forecast) load('Unemployment WI.RData') unemp.b<-unemp[1:436,] unemp.p<-unemp[437:448,] plot(unemp$Date, unemp$us, col='black',
main='Unemployment WI Region', type='l', xlab='Date',
ylab='Unemp. Rate', ylim=c(2, 17)) lines(unemp$Date, unemp$il, col='blue') lines(unemp$Date, unemp$io, col='dark green') lines(unemp$Date, unemp$mi, col='purple') lines(unemp$Date, unemp$mn, col='gray') lines(unemp$Date, unemp$wi, col='red') leg.txt<-c('US', 'IL', 'IO', 'MI', 'MN', "WI") leg.col<-c('black','blue', 'dark green', 'purple', 'gray', 'red') legend('topleft', leg.txt, col=leg.col, pch=15) summary(unemp) #correlation matrix cm<-cor(unemp[,-1]) cm #From the graph and correlation matrix, everything
#is highly correlated, the lowest r value is 0.7879
#between Iowa and US. #plots of each of the data, then each individual,
#everything looks really good, nice linear trends,
#although Michigan has quite bit of volatility
#compared to the other variables. plot(unemp.b, pch='.') model<-lm(wi~., data=unemp.b[,-1]) step.m<-step(model, direction='backward') summary(step.m) #testing for outliers outlierTest(step.m) # After conducting the outlier test, looks like it is michigan #Sept 1980 where unemployment was 13.2 percent, point 57 is #close at 12.9 those are the two biggest, but the outlierTest #isolated row 56 dwtest(step.m) dwtest(step.m, alternative='two.sided') #yep lots of autocorrelation, so a multiple ARAIMA might be in order pred<-unemp.p[3:7] mult.f<-predict(step.m, newdata=pred, interval='prediction') a1<-accuracy(mult.f, unemp.p$wi) #because the MASE is missing I needed to add the following to compare later a1<-c(a1,'0') names(a1)<-c("ME", "RMSE", "MAE", "MPE", "MAPE", "MASE") # ARIMA auto.r<-auto.arima(unemp.b$wi, xreg=unemp.b[,3:7]) auto.r auto.f<-forecast(auto.r, xreg=unemp.p[,3:7], h=12) a2<-accuracy(auto.f, unemp.p$wi) a2 #Just remember that the 0 is just a place holder r.table<-rbind(a1, a2) row.names(r.table)<-c('Multiple Regression', 'Reg. with ARIMA Errors') r.table
At this point I combined the two accuracy tables to see which one was the best.
b.table<-rbind(a.table[,1:6], r.table) b.table<-b.table[order(b.table$MAPE),] b.table
The winner is: Regression with ARIMA Errors, for most of the analysis the simple mean like predictions dominated, even the multiple regression is considerably low on the list.
Down and Dirty Forecasting: Part 1
I wanted to see what I could do in a hurry using the commands found at Forecasting: Principles and Practice . I chose a simple enough data set of Wisconsin Unemployment from 1976 to the present (April 2013). I kept the last 12 months worth of data to test the accuracy of the models. The next blog post will include a multiple regression analysis. The analysis is lacking many important steps, particularly the ARIMA, but this is a down and dirty exercise.
Created by Pretty R at inside-R.org
Results so far: Looks like the mean like forecasts are doing the best, the fancy models are not doing very well.
library(forecast) library(lmtest) library(caret) #State Unemployment seasonally adjusted #http://www.quandl.com/FRED-Federal-Reserve-Economic-Data/WIUR-Unemployment-Rate-in-Wisconsin #Using Quandl data, great little site wi<-read.csv('http://www.quandl.com/api/v1/datasets/FRED/WIUR.csv?&auth_token=gigXwpxd6Ex91cjgz1B7&trim_start=1976-01-01&trim_end=2013-04-01&sort_order=desc', colClasses=c('Date'='Date')) #some minor clean up colnames(wi)<-c('date', 'rate') wi$date<-as.Date(wi$date) summary(wi) #base data, 1-436, test data 437-448 wi.b<-wi[1:436,] wi.p<-wi[437:448,] wi.ts<-ts(wi.b$rate, start=c(1976, 1), frequency=12) wi.p.ts<-ts(wi.p$rate, start=c(2012, 5), frequency=12) plot.ts(wi.ts) #Lets test some models mean<-meanf(wi.ts, 12) naive<-rwf(wi.ts, 12) s.naive<-snaive(wi.ts, 12) drift<-rwf(wi.ts, 12, drift=T) #linear fit m1<-tslm(wi.ts~trend) m2<-tslm(wi.ts~trend+season) #checking for autocorrelation res1 <- residuals(m1) par(mfrow=c(1,2)) plot(res1, ylab="Residuals",xlab="Year") Acf(res1, main="ACF of residuals") res2 <- residuals(m2) par(mfrow=c(1,2)) plot(res2, ylab="Residuals",xlab="Year") Acf(res2, main="ACF of residuals") par(mfrow=c(1,1)) #Durbin-Watson Test dwtest(m1, alt="two.sided") dwtest(m2, alt="two.sided") #yep autocorrelation city! No surprize here, due to the nature of unemployment #STL ETS Decomposition m3<-stl(wi.ts, s.window='periodic') plot(m3) m4<-ets(wi.ts, model='ZZZ') plot(m4) #ARIMA m5<-auto.arima(wi.ts) plot(forecast(m5, h=12))
#neural networks m6<-nnetar(wi.ts) m6 plot(forecast(m6, h=12)) #Testing for accuracy the first 4 models a1<-accuracy(mean, wi.p.ts) a2<-accuracy(naive, wi.p.ts) a3<-accuracy(s.naive, wi.p.ts) a4<-accuracy(drift, wi.p.ts) a.table<-rbind(a1, a2, a3, a4) #Creating the forecast and accuracy for the next 6 models f1<-forecast(m1, h=12) f2<-forecast(m2, h=12) f3<-forecast(m3, h=12) f4<-forecast(m4, h=12) f5<-forecast(m5, h=12) f6<-forecast(m6, h=12) a5<-accuracy(f1, wi.p.ts) a6<-accuracy(f2, wi.p.ts) a7<-accuracy(f3, wi.p.ts) a8<-accuracy(f4, wi.p.ts) a9<-accuracy(f5, wi.p.ts) a10<-accuracy(f6, wi.p.ts) #Combining into a table with row names a.table<-rbind(a.table, a5, a6, a7, a8, a9, a10) row.names(a.table)<-c('Mean', 'Naive', 'S. Naive', 'Drift', 'Lm~Trend', 'Lm~Trend+Sea', 'STL', 'ETS', 'ARIMA', 'Neuro') #make into a data frame so the best model is first, according to MAPE a.table<-as.data.frame(a.table) a.table<-a.table[order(a.table$MAPE),] a.table
Results so far: Looks like the mean like forecasts are doing the best, the fancy models are not doing very well.
Saturday, April 13, 2013
Spring Cleaning Data: 6 of 6- Saving the Data
With all the cleaning done, the only thing left to do is save the data to be analyzed, for future use, and I hope by others. The data I thought would be simple, but there were a few interesting twist, like the Primary Credit*, and using ifelse() to edit the districts.
I have included the product as well as the R-code in a single file for people to use and learn from. I would like to thanks all those who made comments, I find all of them helpful. Below are the links to the files generated and used in the series, and the r-code used to exporting and reloading the data.
I have included the product as well as the R-code in a single file for people to use and learn from. I would like to thanks all those who made comments, I find all of them helpful. Below are the links to the files generated and used in the series, and the r-code used to exporting and reloading the data.
List of files used and their links
Created by Pretty R at inside-R.org
Previous Posts (Part 1, Part 2, Part 3, Part 4, Part 5)
- DiscountWindow.R -the code used to download and clean up discount window data
- Districts.csv -the table used in post 5 of 6 to seperate the districts
- DiscountWindow.csv -a csv of the final product
- DiscountWindow.RData -RData of the final cleaned up product
#Export the data, csv and RData setwd("C:/Users Defined/") write.csv(dw, file='DiscountWindow.csv') save(dw, file='DiscountWindow.RData') #note when loading the data the envir= needs to be defined #with larger files the RData is definately the way to go #this file is small enough it does not matter load('DiscountWindow.RData', envir=.GlobalEnv) dw<-read.csv(file.choose(), header=T)
Previous Posts (Part 1, Part 2, Part 3, Part 4, Part 5)
Friday, April 12, 2013
Spring Cleaning Data: 5 of 6- 2 ifelse vs Merge
The blog in the data cleaning series looks at separating out
the Federal Reserve Districts. What I wanted was two additional columns, where
I had the name of the city and the number for each district. Since I was on a separation
kick I thought it would be fun to do this using ifelse() function.
Created by Pretty R at inside-R.org
Previous Posts (Part 1, Part 2, Part 3, Part 4)
Well, what
started out as a fun romp in the fields turned to an exercise in precision and frustration
that did end well, but took too much time, and too many lines of code to do
what I wanted.
While I was banging my head against the keyboard in frustration,
the thought occurred to me. Instead of using the ifelse() function, create a
table with the new columns of data then merge the original data with the table
just created. Two lines of code for both columns of data, definitely one of
those eureka moments.
The lesson in all of this, ifelse() functions are good
within a limited use, I would say 5 or less. Unless you really like doing them,
then have fun. If there are limited number of occurrences like this example 12
different districts, the table works very well. What took me 2 hours of work
using the ifelse() function, took me 15 minutes using the table method. The code
is simpler, and easier to understand. Sure, there is the extra table to be
imported, but it is small and very manageable.
I have placed the code below,
with the merge code first, followed by the ifelse() code. The table I used can
be downloaded from here (District Data). Read the district data in by using the read.csv() then merge the two files using the 'district' as the column they both have in common. The ifelse(logic, true, false), the logic is if the column looks like one of the districts, if true a 1/Boston, at the end there is the 'Error' just in case.
#Merging the data
dist<-read.csv(file.choose(), header=T) dw<-merge(dw, dist, by='district') #re-coding the district data to numerical tmp1<-ifelse(dw$district=='Boston (1)', 1, ifelse(dw$district=='New York (2)', 2, ifelse(dw$district=='Philadelphia (3)', 3, ifelse(dw$district=='Cleveland (4)', 4, ifelse(dw$district=='Richmond (5)', 5, ifelse(dw$district=='Atlanta (6)', 6, ifelse(dw$district=='Chicago (7)', 7, ifelse(dw$district=='St. Louis (8)', 8, ifelse(dw$district=='Minneapolis (9)', 9, ifelse(dw$district=='Kansas City (10)', 10, ifelse(dw$district=='Dallas (11)', 11, ifelse(dw$district=='San Francisco (12)', 12, 'Error')))))))))))) dw$dist.no<-as.numeric(tmp1) #Isolating the names, making to factor tmp2<-ifelse(dw$district=='Boston (1)', 'Boston', ifelse(dw$district=='New York (2)', 'New York', ifelse(dw$district=='Philadelphia (3)', 'Philadelphia', ifelse(dw$district=='Cleveland (4)', 'Cleveland', ifelse(dw$district=='Richmond (5)', 'Richmond', ifelse(dw$district=='Atlanta (6)', 'Atlanta', ifelse(dw$district=='Chicago (7)', 'Chicago', ifelse(dw$district=='St. Louis (8)', 'St. Louis', ifelse(dw$district=='Minneapolis (9)', 'Minneapolis', ifelse(dw$district=='Kansas City (10)', 'Kansas City', ifelse(dw$district=='Dallas (11)', 'Dallas', ifelse(dw$district=='San Francisco (12)', 'San Francisco', 'Error')))))))))))) dw$dist.city<-as.factor(tmp2)
Previous Posts (Part 1, Part 2, Part 3, Part 4)
Labels:
as.factor,
as.numeric,
ifelse,
merge,
new column,
r,
r-code,
read.csv
Thursday, April 11, 2013
Spring Cleaning Data: 4 of 6- Combining the files & Changing the Dates/Credit Type
So far the individual files have been left on their own, it is now time
to combine using the rbind function, simple enough after all we have done so
far, then the quick check with summary.
Created by Pretty R at inside-R.org
Links to the previous posts (post 1, post 2, post 3)
Now that we have one data frame, time to make larger changes
to the data. The first is to get the dates into a format that R can understand.
The as.Date() function does this by defining the variable, then the pattern for
the date. At this point, I had a hard time figuring out what each one meant;
basically you are defining what the date looks like now in the data frame, not
in the future.
For this data set the '%b %d %Y' or in other words Feb 01
2011, if the date looked like Feb-01-2011, then the code would be '%b-%d-%Y',
or if the date was 02-02-2011, then '%m-%d-%Y'. For a more comprehensive tutorial,
see the post on Quick-R.
#Changing the date variables, then #isolating the year variable for alter use library(stringr) dw$loan.date<-as.Date(dw$loan.date, '%b %d %Y') dw$mat.date<-as.Date(dw$mat.date, '%b %d %Y') dw$repay.date<-as.Date(dw$repay.date, '%b %d %Y')
At this point, I like to have two extra variables so I can
aggregate the data later for some nice results, in particular the year and the
month. The reason is I want to know if there is a difference in the years. I know there are only 2 years so far, but
every quarter new data will be released so I am setting up the code for it now.
The month I want to know if there is any seasonality to it. If I choose to I
can isolate the day, but this gets messy because February has 28/29 days, then
the rest of the months fluctuate between 30 and 31. The data is scattered and
blotchy as is, making the day too small of a unit to be useful.
The code assumes the date has been changed to the R default of YYYY-MM-DD, for the year I selected the first 4 numbers using the str_sub() function, while making it a numerical value- as.numeric(). The year and date variable I made it a factor for easier sorting and categorizing, with a similar process as above except I want both.
The next step is to change the credit type to something simpler for tables and graphs. I used the gsub, one of the most interesting and fun functions I never knew existed until I did this. Basically it will take a string then replace it with another. For this data I wanted to replace the "Primary Credit" with "primary" because it make things so much easier for graphs and tables. Then I changed it to a factor.
The code assumes the date has been changed to the R default of YYYY-MM-DD, for the year I selected the first 4 numbers using the str_sub() function, while making it a numerical value- as.numeric(). The year and date variable I made it a factor for easier sorting and categorizing, with a similar process as above except I want both.
The next step is to change the credit type to something simpler for tables and graphs. I used the gsub, one of the most interesting and fun functions I never knew existed until I did this. Basically it will take a string then replace it with another. For this data I wanted to replace the "Primary Credit" with "primary" because it make things so much easier for graphs and tables. Then I changed it to a factor.
#Changing the type of credit to one word dw$type.credit<-with(dw, gsub("Primary Credit", 'primary', type.credit)) dw$type.credit<-with(dw, gsub("Seasonal Credit", 'seasonal', type.credit)) dw$type.credit<-with(dw, gsub("Secondary Credit", 'secondary', type.credit)) #change to factor dw$type.credit<-as.factor(dw$type.credit) summary(dw)
Links to the previous posts (post 1, post 2, post 3)
Labels:
as.Date,
as.factor,
as.numeric,
dates,
dates and r,
gsub,
r,
r-code,
stringr,
with
Subscribe to:
Posts (Atom)