Saturday, July 13, 2013

Read from Excel

We get a lot of data in the form of Excel spreadsheets. Generally I have saved each worksheet as a separate text file and read those into R. But that is extra work and goes against my preference of having a single source for any data.

I guess I'd never tried hard enough to read directly from Excel into R, but seeing Jeffrey Breen's slides last year on Tapping the Data Deluge with R inspired me to give the XLConnect package a try.

It works well, but I missed one thing. I often put comments at the beginning of the text file starting with a "#" or some other comment character, which read.table can handle. Martin Studer had a suggestion for handling such comments in the Excel file.

The function below expands on that to allow comment lines and blank lines throughout the Excel dataset, for example like this:
# some data
X Y Z
1 2 3
2 3 4
3   5#where is Y?
4 5 6

# note things change here
17 15 13
16 14 12

So, in the function:
  • The whole worksheet is read in as a data.frame d.all, possibly skipping a fixed number of rows.
  • Blank rows are removed.
  • The data.frame is written to a temporary text file.
  • The text file is read back in as the desired data.frame, skipping the first row of column names added by write.table and handling the comments.
readFromExcel <- function(wbName, sheetName = 1, comment.char = "#", header = TRUE,
                          skip = NULL, as.is = TRUE, sep = "\t") {
  require(XLConnect)
  if (is.null(skip)) skip <- 0
  # temporary text file in the working directory, named after the workbook
  tempFile <- basename(paste0(wbName, "TEMP.txt"))
  # read the whole worksheet with no header, optionally skipping leading rows
  d.all <- readWorksheetFromFile(wbName, sheetName, header = FALSE, startRow = skip + 1)
  d.all <- d.all[rowSums(is.na(d.all)) != ncol(d.all), ]  # remove blank rows
  # round-trip through a text file so read.table can deal with the comments
  write.table(d.all, tempFile, quote = FALSE, row.names = FALSE, sep = sep)
  # skip = 1 drops the row of auto-generated column names written by write.table
  df <- read.table(tempFile, header = header, comment.char = comment.char,
                   sep = sep, skip = 1, as.is = as.is)
  file.remove(tempFile)
  df
}
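For a worksheet laid out like the example above, a call would look something like this (the workbook and sheet names here are just for illustration):

d <- readFromExcel("mydata.xlsx", sheetName = "Sheet1")
head(d)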
The function could be improved by allowing more arguments to pass to readWorksheet such as colTypes or dateTimeFormat.
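A minimal sketch of that idea, assuming the extra arguments can simply be forwarded through readWorksheetFromFile to readWorksheet, would be to add a ... argument; the rest of the function stays the same:

readFromExcel <- function(wbName, sheetName = 1, comment.char = "#", header = TRUE,
                          skip = NULL, as.is = TRUE, sep = "\t", ...) {
  require(XLConnect)
  if (is.null(skip)) skip <- 0
  tempFile <- basename(paste0(wbName, "TEMP.txt"))
  # the dots are passed along, so e.g. colTypes or dateTimeFormat can be supplied
  d.all <- readWorksheetFromFile(wbName, sheetName, header = FALSE,
                                 startRow = skip + 1, ...)
  d.all <- d.all[rowSums(is.na(d.all)) != ncol(d.all), ]
  write.table(d.all, tempFile, quote = FALSE, row.names = FALSE, sep = sep)
  df <- read.table(tempFile, header = header, comment.char = comment.char,
                   sep = sep, skip = 1, as.is = as.is)
  file.remove(tempFile)
  df
}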
