Handling date-times in R Cole Beck August 30, 2012 1 Introduction Date-time variables are a pain to work with in any language. We’ll discuss some of the common issues and how to overcome them. Before we examine the combination of dates and times, let’s focus on dates. Even by themselves dates can be a pain. One idea is to refuse to use them. Generally dates are interally stored as integers, and depending on the operations you’ll need, you could choose to do the same. As an added bonus, converting dates to integers may be useful in de-identifying your dataset. One easy way to do this is by having a lookup table. date.lookup <- format(seq(as.Date("2000-01-01"), as.Date("2010-12-31"), by = "1 day")) match("2004-01-19", date.lookup) ## [1] 1480 date.lookup[1337] ## [1] "2003-08-29" If that works for you, then that works for me and we’re done here... let’s assume you actually want dates though. 2 Date Class The simplest data type to use for dates is the ”Date” class. As mentioned, these will be internally stored as integers. In my example, day one was 2000-01-01. The specific date used to index your dates is called the origin. Typically programming languages use a default origin of 1970-01-01, though it is really day zero, not day one (negative values are perfectly valid). # create a date as.Date("2012-08-30") ## [1] "2012-08-30" # specify the format as.Date("08/30/2012", format = "%m/%d/%Y") ## [1] "2012-08-30" # use a different origin, for instance importing values from Excel as.Date(41149, origin = "1900-01-01") ## [1] "2012-08-30" 1 # take a difference Sys.Date() - as.Date("1970-01-01") ## Time difference of 15582 days # alternate method with specified units difftime(Sys.Date(), as.Date("1970-01-01"), units = "days") ## Time difference of 15582 days # see the internal integer representation unclass(Sys.Date()) ## [1] 15582 3 POSIXt Date-Time Class Dates are pretty simple and most of the operations that we could use for them will also apply to date-time variables. There are at least three good options for date-time data types: built-in POSIXt, chron package, lubridate package. I rarely support using external packages (spoiler: I use POSIXt), if for no other reason than I don’t like forcing dependencies on my code. But it can also be burdensome to write code that someone else has already written, so add-on packages are worth examining. There are two POSIXt types, POSIXct and POSIXlt. ”ct” can stand for calendar time, it stores the number of seconds since the origin. ”lt”, or local time, keeps the date as a list of time attributes (such as ”hour” and ”mon”). # current time as POSIXct unclass(Sys.time()) ## [1] 1.346e+09 # and as POSIXlt unclass(as.POSIXlt(Sys.time())) ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## $sec [1] 9.166 $min [1] 54 $hour [1] 13 $mday [1] 30 $mon [1] 7 $year [1] 112 $wday [1] 4 2 ## ## ## ## ## ## ## ## ## $yday [1] 242 $isdst [1] 1 attr(,"tzone") [1] "" "CST" "CDT" 3.1 Format Thus far we’ve made the bold assumption that we want our date-times in the format YYYY-MM-DD. It’s not hard to set the format when creating or printing date-times. In fact it’s not a bad idea to always set the format so that you can be certain that your input/output looks like you expect. View the strftime help file for an idea on how to specify format. # create POSIXct variables as.POSIXct("080406 10:11", format = "%y%m%d %H:%M") ## [1] "2008-04-06 10:11:00 CDT" as.POSIXct("2008-04-06 10:11:01 PM", format = "%Y-%m-%d %I:%M:%S %p") ## [1] "2008-04-06 22:11:01 CDT" as.POSIXct("08/04/06 22:11:00", format = "%m/%d/%y %H:%M:%S") ## [1] "2006-08-04 22:11:00 CDT" # convert POSIXct variables to character strings format(as.POSIXct("080406 10:11", format = "%y%m%d %H:%M"), "%m/%d/%Y %I:%M %p") ## [1] "04/06/2008 10:11 AM" as.character(as.POSIXct("080406 10:11", format = "%y%m%d %H:%M"), format = "%m-%d-%y %H:%M") ## [1] "04-06-08 10:11" 3.2 Example 1 We’ve finally covered the basics, let’s run some practical examples. # when do I turn 1 billion seconds old? billbday <- function(bday, age = 10^9, format = "%Y-%m-%d %H:%M:%S") { x <- as.POSIXct(bday, format = format) + age togo <- round(difftime(x, Sys.time(), units = "days")) if (togo > 0) { msg <- sprintf("You will be %s seconds old on %s, which is %s days from now.", age, format(x, "%Y-%m-%d"), togo) } else { msg <- sprintf("You turned %s seconds old on %s, which was %s days ago.", age, format(x, "%Y-%m-%d"), -1 * togo) } 3 if (age > 125 * 365.25 * 86400) msg <- paste(msg, "Good luck with that.") print(msg) format(x, "%Y-%m-%d") } billbday("1981-04-13 15:00:00") ## [1] "You will be 1e+09 seconds old on 2012-12-20, which is 112 days from now." ## [1] "2012-12-20" 3.3 Data.frames Hate You That was easy enough, let’s use a data.frame. dts <- data.frame(day = c("20081101", "20081101", "20081101", "20081101", "20081101", "20081102", "20081102", "20081102", "20081102", "20081103"), time = c("01:20:00", "06:00:00", "12:20:00", "17:30:00", "21:45:00", "01:15:00", "06:30:00", "12:50:00", "20:00:00", "01:05:00"), value = c("5", "5", "6", "6", "5", "5", "6", "7", "5", "5")) dts1 <- paste(dts$day, dts$time) dts2 <- as.POSIXct(dts1, format = "%Y%m%d %H:%M:%S") dts3 <- as.POSIXlt(dts1, format = "%Y%m%d %H:%M:%S") dts.all <- data.frame(dts, ct = dts2, lt = dts3) # lt changed to POSIXct! str(dts.all) ## 'data.frame': 10 obs. of 5 variables: ## $ day : Factor w/ 3 levels "20081101","20081102",..: 1 1 1 1 1 2 2 2 ## $ time : Factor w/ 10 levels "01:05:00","01:15:00",..: 3 4 6 8 10 2 5 ## $ value: Factor w/ 3 levels "5","6","7": 1 1 2 2 1 1 2 3 1 1 ## $ ct : POSIXct, format: "2008-11-01 01:20:00" "2008-11-01 06:00:00" ## $ lt : POSIXct, format: "2008-11-01 01:20:00" "2008-11-01 06:00:00" 2 3 7 9 1 ... ... Not even a warning? Stupid date-times, I don’t care if the documentation states that you should use ct’s for data.frames. Let’s try again. dts.all <- dts dts.all$ct <- dts2 dts.all$lt <- dts3 str(dts.all) ## 'data.frame': 10 obs. of 5 variables: ## $ day : Factor w/ 3 levels "20081101","20081102",..: 1 1 1 1 1 2 2 2 ## $ time : Factor w/ 10 levels "01:05:00","01:15:00",..: 3 4 6 8 10 2 5 ## $ value: Factor w/ 3 levels "5","6","7": 1 1 2 2 1 1 2 3 1 1 ## $ ct : POSIXct, format: "2008-11-01 01:20:00" "2008-11-01 06:00:00" ## $ lt : POSIXlt, format: "2008-11-01 01:20:00" "2008-11-01 06:00:00" 2 3 7 9 1 ... ... unclass(dts.all$ct) ## ## [1] 1.226e+09 1.226e+09 1.226e+09 1.226e+09 1.226e+09 1.226e+09 1.226e+09 1.226e+09 [9] 1.226e+09 1.226e+09 4 ## attr(,"tzone") ## [1] "" unclass(dts.all$lt) ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## $sec [1] 0 0 0 0 0 0 0 0 0 0 $min [1] 20 0 20 30 45 15 30 50 $hour [1] 1 6 12 17 21 1 0 5 6 12 20 1 $mday [1] 1 1 1 1 1 2 2 2 2 3 $mon [1] 10 10 10 10 10 10 10 10 10 10 $year [1] 108 108 108 108 108 108 108 108 108 108 $wday [1] 6 6 6 6 6 0 0 0 0 1 $yday [1] 305 305 305 305 305 306 306 306 306 307 $isdst [1] 1 1 1 1 1 1 0 0 0 0 That worked? Suspicious behaviour now may yield suspicious behaviour later. Let’s try rounding our times. # round sets to POSIXlt and fails! dts.all[, "ct"] <- round(dts.all[, "ct"], units = "hours") ## Warning: provided 9 variables to replace 1 variables # force back to POSIXct dts.all[, "ct"] <- as.POSIXct(round(dts2, units = "hours")) # rounding our POSIXlt column also fails dts.all[, "lt"] <- round(dts3, units = "hours") ## Warning: provided 9 variables to replace 1 variables # while this works dts.all$lt <- round(dts3, units = "hours") Let’s try to replace one element of one row. 5 # this fails dts.all[1, "lt"] <- as.POSIXlt("2008-11-01 01:00:00") ## Warning: ## Error: provided 9 variables to replace 1 variables ’origin’ must be supplied # this works dts.all$lt[1] <- as.POSIXlt("2008-11-01 01:00:00") # this works? dts.all[1, "lt"] <- as.POSIXct("2008-11-01 01:00:00") There’s magic in the $. But it seems we can use this trick to force things (that shouldn’t be forced?). So maybe there’s a reason we’re supposed to use POSIXct times in data.frames. While tricks are discouraged, and it might take some effort, it’s certainly nice to have access to our list attributes. time1 <- dts.all$lt[1] time2 <- dts.all$lt[2] # presumably do something with the data from time1 to time2 while (time1 < time2) { time1$hour <- time1$hour + 1 } print(sprintf("%s -- %s", time1, time2)) ## [1] "2008-11-01 06:00:00 -- 2008-11-01 06:00:00" 3.4 Daylight Savings and Time Zones [also hate you] # another example time1 <- dts.all$lt[5] time2 <- dts.all$lt[7] while (time1 < time2) { time1$hour <- time1$hour + 1 # notice how hour is the only attribute to change print(unlist(time1)) } ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## sec 0 sec 0 sec 0 sec 0 sec 0 sec 0 sec 0 sec 0 min 0 min 0 min 0 min 0 min 0 min 0 min 0 min 0 hour 23 hour 24 hour 25 hour 26 hour 27 hour 28 hour 29 hour 30 mday 1 mday 1 mday 1 mday 1 mday 1 mday 1 mday 1 mday 1 mon 10 mon 10 mon 10 mon 10 mon 10 mon 10 mon 10 mon 10 year 108 year 108 year 108 year 108 year 108 year 108 year 108 year 108 wday 6 wday 6 wday 6 wday 6 wday 6 wday 6 wday 6 wday 6 yday 305 yday 305 yday 305 yday 305 yday 305 yday 305 yday 305 yday 305 6 isdst 1 isdst 1 isdst 1 isdst 1 isdst 1 isdst 1 isdst 1 isdst 1 ## ## ## ## sec 0 sec 0 min 0 min 0 hour 31 hour 32 mday 1 mday 1 mon 10 mon 10 year 108 year 108 wday 6 wday 6 yday isdst 305 1 yday isdst 305 1 print(sprintf("%s -- %s", time1, time2)) ## [1] "2008-11-02 08:00:00 -- 2008-11-02 07:00:00" # DST and tzone are confusing they're still equal time1 == time2 ## [1] TRUE # combining them seems to work c(time1, time2) ## [1] "2008-11-02 07:00:00 CST" "2008-11-02 07:00:00 CST" # default to right tzone as.POSIXlt(as.POSIXct(time1)) ## [1] "2008-11-02 07:00:00 CST" Remember our first solution, where we decided to skip dates altogther and convert everything to integer values? It sure would be nice if we didn’t have to worry about daylight savings or local time zones. We can use universal time (UTC) for that. Actually, like specifying the format, it’s a good idea to always specify your time zone even if you don’t use UTC. # universal time dts4 <- round(as.POSIXlt(dts1, format = "%Y%m%d %H:%M:%S", tz = "UTC"), units = "hours") dts4 ## [1] ## [4] ## [7] ## [10] "2008-11-01 "2008-11-01 "2008-11-02 "2008-11-03 01:00:00 18:00:00 07:00:00 01:00:00 UTC" "2008-11-01 06:00:00 UTC" "2008-11-01 12:00:00 UTC" UTC" "2008-11-01 22:00:00 UTC" "2008-11-02 01:00:00 UTC" UTC" "2008-11-02 13:00:00 UTC" "2008-11-02 20:00:00 UTC" UTC" # central standard time round(as.POSIXlt(dts1, format = "%Y%m%d %H:%M:%S", tz = "CST"), units = "hours") ## [1] ## [4] ## [7] ## [10] 3.5 "2008-11-01 "2008-11-01 "2008-11-02 "2008-11-03 01:00:00 18:00:00 07:00:00 01:00:00 CST" "2008-11-01 06:00:00 CST" "2008-11-01 12:00:00 CST" CST" "2008-11-01 22:00:00 CST" "2008-11-02 01:00:00 CST" CST" "2008-11-02 13:00:00 CST" "2008-11-02 20:00:00 CST" CST" Example 2 Let’s examine a function that will fill in values when we have missing dates. # POSIXct mydata.ct # POSIXlt mydata.lt data.frame <- data.frame(date = dts.all$ct, value = dts.all$value) data.frame <- data.frame(date = NA, value = dts.all$value) 7 mydata.lt$date <- dts.all$lt # POSIX.lt data.frame using UTC mydata.lt.utc <- data.frame(date = NA, value = dts.all$value) mydata.lt.utc$date <- dts4 fill.dates <- function(x) { dates <- seq(x[1, "date"], x[nrow(x), "date"], by = "hours") x2 <- data.frame(date = rep(NA, length(dates)), value = NA) x2$date <- as.POSIXlt(dates) x2$value[match(as.character(x$date), as.character(x2$date))] <- x$value for (i in which(is.na(x2$value))) x2[i, "value"] <- x2[i - 1, "value"] x2 } data.ct <- fill.dates(mydata.ct) data.lt <- fill.dates(mydata.lt) data.lt.utc <- fill.dates(mydata.lt.utc) cbind(data.ct[21:30, ], data.lt[21:30, ], data.lt.utc[21:30, ]) ## ## ## ## ## ## ## ## ## ## ## 4 21 22 23 24 25 26 27 28 29 30 2008-11-01 2008-11-01 2008-11-01 2008-11-02 2008-11-02 2008-11-02 2008-11-02 2008-11-02 2008-11-02 2008-11-02 date value date value date value 21:00:00 2 2008-11-01 21:00:00 2 2008-11-01 21:00:00 2 22:00:00 1 2008-11-01 22:00:00 1 2008-11-01 22:00:00 1 23:00:00 1 2008-11-01 23:00:00 1 2008-11-01 23:00:00 1 00:00:00 1 2008-11-02 00:00:00 1 2008-11-02 00:00:00 1 01:00:00 1 2008-11-02 01:00:00 1 2008-11-02 01:00:00 1 01:00:00 1 2008-11-02 01:00:00 1 2008-11-02 02:00:00 1 02:00:00 1 2008-11-02 02:00:00 1 2008-11-02 03:00:00 1 03:00:00 1 2008-11-02 03:00:00 1 2008-11-02 04:00:00 1 04:00:00 1 2008-11-02 04:00:00 1 2008-11-02 05:00:00 1 05:00:00 1 2008-11-02 05:00:00 1 2008-11-02 06:00:00 1 Chron Package Let’s examine the chron package. ”chron” is short for chronological, not chronic pain, which would be logical given the amount of pain programmers have had to deal with implementing date-times over the years (last joke, I promise). It’s not a base package, so you’ll have to install it, as will everyone else that runs your code. It’s important to note that chron expects numeric values to be time since origin. If you want 101010 to mean 2010-10-10, you’ll need to set it as a character string first (otherwise it’s 2246-07-23). library(chron) # fails on a factor chron(dts$day, format = "ymd") ## Error: object dates. must be numeric or character chron(as.character(dts$day), format = "ymd") ## [1] 081101 081101 081101 081101 081101 081102 081102 081102 081102 081103 chron.dts <- chron(date = as.character(dts$day), time = dts$time, format = c("ymd", "h:m:s")) cdts.all <- dts cdts.all$cd <- chron.dts # I hate two-digit year format; methods to display 4 are awful format(as.POSIXlt(cdts.all$cd, tz = "UTC"), "%Y%m%d %H:%M:%S") 8 ## ## ## [1] "20081101 01:20:00" "20081101 06:00:00" "20081101 12:20:00" "20081101 17:30:00" [5] "20081101 21:45:00" "20081102 01:15:00" "20081102 06:30:00" "20081102 12:50:00" [9] "20081102 20:00:00" "20081103 01:05:00" # options(chron.year.abb=FALSE) # don't use options as a global variable So what does chron have to offer? In my opinion, very little, it mostly replicates what’s already there. 5 Lubridate Package Lubridate is another package you may wish to install. It’s documentation boasts that it ”makes working with dates fun instead of frustrating”. It does include many functions unavailable to us with POSIXt objects. More importantly the basic stuff is easy to do. library(lubridate) ymd_hms("2012-12-31 23:59:59") ## [1] "2012-12-31 23:59:59 UTC" ldate <- mdy_hms("12/31/2012 23:59:59") ldate + seconds(1) ## [1] "2013-01-01 UTC" month(ldate) <- 8 Understanding the different objects used to represent time is the key to using lubridate. 5.1 Instants An instant is a specific moment in time. now() ## [1] "2012-08-30 13:54:09 CDT" # last day of the month ceiling_date(now(), unit = "month") - days(1) ## [1] "2012-08-31 CDT" # is it morning? am(now()) ## [1] FALSE 5.2 Intervals, Durations, Periods An interval is the time in between two instants. Know it exists, but assume you don’t care. Durations are exact time spans, measured in seconds. It is the absolute time between two events. Periods are used to handle relative time measurements between two events. # how do you want to handle leap years? ldate - dyears(1) 9 ## [1] "2011-09-01 23:59:59 UTC" ldate - years(1) ## [1] "2011-08-31 23:59:59 UTC" # or DST? force_tz(ymd_hms(as.character(dts.all$lt[5])), "America/Chicago") + dhours(6) ## [1] "2008-11-02 03:00:00 CST" force_tz(ymd_hms(as.character(dts.all$lt[5])), "America/Chicago") + hours(6) ## [1] "2008-11-02 04:00:00 CST" # within is a nice function to use with intervals now() %within% new_interval(ymd("2012-07-01"), ymd("2013-06-30")) ## [1] TRUE I don’t know about fun, but you could make a very strong argument to use lubridate to handle your date-time variables. 6 Bonus Example: Determine Your Format This simple function will try to determine your date-time format. You’ll have problems if the format isn’t consistent. This function could easily be extended, but as is, it may be helpful. # FUNCTION guessDateFormat @x vector of character dates/datetimes @returnDates return # actual dates rather than format convert character datetime to POSIXlt datetime, or at # least guess the format such that you could convert to datetime guessDateFormat <- function(x, returnDates = FALSE, tzone = "") { x1 <- x # replace blanks with NA and remove x1[x1 == ""] <- NA x1 <- x1[!is.na(x1)] if (length(x1) == 0) return(NA) # if it's already a time variable, set it to character if ("POSIXt" %in% class(x1[1])) { x1 <- as.character(x1) } dateTimes <- do.call(rbind, strsplit(x1, " ")) for (i in ncol(dateTimes)) { dateTimes[dateTimes[, i] == "NA"] <- NA } # assume the time part can be found with a colon timePart <- which(apply(dateTimes, MARGIN = 2, FUN = function(i) { any(grepl(":", i)) })) # everything not in the timePart should be in the datePart datePart <- setdiff(seq(ncol(dateTimes)), timePart) # should have 0 or 1 timeParts and exactly one dateParts if (length(timePart) > 1 || length(datePart) != 1) stop("cannot parse your time variable") 10 timeFormat <- NA if (length(timePart)) { # find maximum number of colons in the timePart column ncolons <- max(nchar(gsub("[^:]", "", na.omit(dateTimes[, timePart])))) if (ncolons == 1) { timeFormat <- "%H:%M" } else if (ncolons == 2) { timeFormat <- "%H:%M:%S" } else stop("timePart should have 1 or 2 colons") } # remove all non-numeric values dates <- gsub("[^0-9]", "", na.omit(dateTimes[, datePart])) # sep is any non-numeric value found, hopefully / or sep <- unique(na.omit(substr(gsub("[0-9]", "", dateTimes[, datePart]), 1, 1))) if (length(sep) > 1) stop("too many seperators in datePart") # maximum number of characters found in the date part dlen <- max(nchar(dates)) dateFormat <- NA # when six, expect the century to be omitted if (dlen == 6) { if (sum(is.na(as.Date(dates, format = "%y%m%d"))) == 0) { dateFormat <- paste("%y", "%m", "%d", sep = sep) } else if (sum(is.na(as.Date(dates, format = "%m%d%y"))) == 0) { dateFormat <- paste("%m", "%d", "%y", sep = sep) } else stop("datePart format [six characters] is inconsistent") } else if (dlen == 8) { if (sum(is.na(as.Date(dates, format = "%Y%m%d"))) == 0) { dateFormat <- paste("%Y", "%m", "%d", sep = sep) } else if (sum(is.na(as.Date(dates, format = "%m%d%Y"))) == 0) { dateFormat <- paste("%m", "%d", "%Y", sep = sep) } else stop("datePart format [eight characters] is inconsistent") } else { stop(sprintf("datePart has unusual length: %s", dlen)) } if (is.na(timeFormat)) { format <- dateFormat } else if (timePart == 1) { format <- paste(timeFormat, dateFormat) } else if (timePart == 2) { format <- paste(dateFormat, timeFormat) } else stop("cannot parse your time variable") if (returnDates) return(as.POSIXlt(x, format = format, tz = tzone)) format } # generate some dates mydates <- format(as.POSIXct(sample(31536000, 20), origin = "2011-01-01", tz = "UTC"), "%m/%d/%Y %H:%M") mydates ## ## ## [1] "02/07/2011 06:51" "11/21/2011 17:03" "09/17/2011 22:42" "02/16/2011 13:45" [5] "12/14/2011 19:11" "09/08/2011 09:22" "12/06/2011 14:06" "02/02/2011 11:00" [9] "03/27/2011 06:12" "01/05/2011 15:09" "04/15/2011 04:17" "10/20/2011 14:20" 11 ## [13] "11/13/2011 21:46" "02/26/2011 03:24" "12/29/2011 11:02" "03/17/2011 02:24" ## [17] "02/27/2011 13:51" "06/27/2011 08:36" "03/14/2011 10:54" "01/28/2011 14:14" guessDateFormat(mydates) ## [1] "%m/%d/%Y %H:%M" 12