Speeding up lots of GLMs in R

First off, sorry about the long post. Figured it's better to give context to get good answers (I hope!). Some time ago I wrote an R function that will get all pairwise interactions of variables in a data frame. This worked fine at the time, but now a colleague would like me to do this with a much larger dataset. They don't know how many variables they are going to have in the end but they are guessing approximately 2,500 - 3,000. My function below is way too slow for this (4 minutes for 100 variables). At the bottom of this post I have included some timings for various numbers of variables and total numbers of interactions. I have the results of calling Rprof() on the 100 variables run of my function, so If anyone wants to take a look at it let me know. I don't want to make a super long any longer than it needs to be. What I'd like to know is if there is anything I can do to speed this function up. I tried looking going directly to glm.fit, but as far as I understood, for that to be useful the computation of the design matrices and all of that other stuff that I frankly don't understand, needs to be the same for each model, which is not the case for my analysis, although perhaps I am wrong about this. Any ideas on how to make this run faster would be greatly appreciated. I am planning on using parallelization to run the analysis in the end but I don't know how many CPU's I am going to have access to but I'd say it won't be more than 8. Thanks in advance, Cheers Davy. getInteractions2 = function(data, fSNPcol, ccCol) { #fSNPcol is the number of the column that contains the first SNP #ccCol is the number of the column that contains the outcome variable require(lmtest) a = data.frame() snps = names(data)[-1:-(fSNPcol-1)] names(data)[ccCol] = "PHENOTYPE" terms = as.data.frame(t(combn(snps,2))) attach(data) fit1 = c() fit2 = c() pval = c() for(i in 1:length(terms$V1)) { fit1 = glm(PHENOTYPE~get(as.character(terms$V1[i]))+get(as.character(terms$V2[i])),family="binomial") fit2 = glm(PHENOTYPE~get(as.character(terms$V1[i]))+get(as.character(terms$V2[i]))+I(get(as.character(terms$V1[i]))*get(as.character(terms$V2[i]))),family="binomial") a = lrtest(fit1, fit2) pval = c(pval, a[2,"Pr(>Chisq)"]) } detach(data) results = cbind(terms,pval) return(results) } In the table below is the system.time results for increasing numbers of variables being passed through the function. n is the number, and Ints, is the number of pair-wise interactions given by that number of variables. n Ints user.self sys.self elapsed time 10 45 1.20 0.00 1.30 time 15 105 3.40 0.00 3.43 time 20 190 6.62 0.00 6.85 ... time 90 4005 178.04 0.07 195.84 time 95 4465 199.97 0.13 218.30 time 100 4950 221.15 0.08 242.18 Some code to reproduce a data frame in case you want to look at timings or the Rprof() results. Please don't run this unless your machine is super fast, or your prepared to wait for about 15-20 minutes. df = data.frame(paste("sid",1:2000,sep=""),rbinom(2000,1,.5)) gtypes = matrix(nrow=2000, ncol=3000) gtypes = apply(gtypes,2,function(x){x=sample(0:2, 2000, replace=T);x}) snps = paste("rs", 1000:3999,sep="") df = cbind(df,gtypes) names(df) = c("sid", "status", snps) times = c() for(i in seq(10,100, by=5)){ if(i==100){Rprof()} time = system.time((pvals = getInteractions2(df[,1:i], 3, 2))) print(time) times = rbind(times, time) if(i==100){Rprof(NULL)} } numI = function(n){return(((n^2)-n)/2)} timings = cbind(seq(10,100,by=5), sapply(seq(10,100,by=5), numI),times)
You can always rent more CPUs from Amazon's EC2 cloud...

以上就是Speeding up lots of GLMs in R的详细内容,更多请关注web前端其它相关文章!

赞(0) 打赏
未经允许不得转载:web前端首页 » JavaScript 答疑

评论 抢沙发

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址

前端开发相关广告投放 更专业 更精准