memory - Merging really not that large data.tables immediately results in R being killed


I have 32GB of RAM on this machine; I can't get R killed any faster ;)

Example

The goal here is to achieve an rbind() of two data.tables using functions that make use of data.table's efficiency.

Input:

rm(list=ls())
gc()

Output:

          used (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells 1604987 85.8    2403845  128.4   2251281  120.3
Vcells 3019405 23.1  537019062 4097.2 468553954 3574.8

Input:

tmp.table <- data.table(x1=sample(1:7,4096000,replace=TRUE),
                        x2=as.factor(sample(1:2,4096000,replace=TRUE)),
                        x3=sample(1:1000,4096000,replace=TRUE),
                        x4=sample(1:256,4096000,replace=TRUE),
                        x5=sample(1:16,4096000,replace=TRUE),
                        x6=rnorm(4096000))

setkey(tmp.table,x1,x2,x3,x4,x5,x6)

join.table <- data.table(x1 = integer(), x2 = factor(),
                         x3 = integer(), x4 = integer(),
                         x5 = integer(), x6 = numeric())

setkey(join.table,x1,x2,x3,x4,x5,x6)

tables()

Output:

     NAME            NROW  MB COLS              KEY
[1,] join.table         0   1 x1,x2,x3,x4,x5,x6 x1,x2,x3,x4,x5,x6
[2,] tmp.table  4,096,000 110 x1,x2,x3,x4,x5,x6 x1,x2,x3,x4,x5,x6
Total: 111MB

Input:

join.table <- merge(join.table,tmp.table,all.y=TRUE)

Output:

Ha! Nope. RStudio restarts the session.

Question

What's going on here? Explicitly setting the factor levels in join.table (see the sketch below) had no effect. Using rbind() instead of merge() didn't help either: exact same behavior. I have done far more complicated and bulky things with this data without problems.
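For reference, "explicitly setting factor levels" means something like the following minimal sketch. The levels c("1","2") are an assumption, chosen to match x2 = as.factor(sample(1:2, ...)) above:

# Sketch (assumed attempt): declare the empty table's factor column
# with explicit levels matching tmp.table's x2
join.table <- data.table(x1 = integer(),
                         x2 = factor(levels = c("1", "2")),
                         x3 = integer(), x4 = integer(),
                         x5 = integer(), x6 = numeric())
setkey(join.table, x1, x2, x3, x4, x5, x6)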

Version info

$platform
[1] "x86_64-pc-linux-gnu"

$arch
[1] "x86_64"

$os
[1] "linux-gnu"

$system
[1] "x86_64, linux-gnu"

$version.string
[1] "R version 3.0.2 (2013-09-25)"

$nickname
[1] "Frisbee Sailing"

> rstudio::versionInfo()
$version
[1] ‘99.9.9’

$mode
[1] "server"

The data.table version is 1.8.11.

Update: this has been fixed in commit 1123 of v1.8.11. From NEWS:

o   rbindlist with at least one factor column, along with the presence of at least one empty data.table, resulted in a segfault (or, on Linux/Mac, reported an error related to hash tables). This is now fixed, #5355. Thanks to Trevor Alexander for reporting (and mnel for filing the bug report): merging really not that large data.tables immediately results in R being killed
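The minimal trigger described in that NEWS entry can double as a regression check on a fixed build (a sketch; on data.table built from v1.8.11 at or after commit 1123, this should return a one-row data.table rather than crashing):

library(data.table)
# One empty data.table with a factor column plus one non-empty one:
# exactly the combination that used to segfault before the fix
rbindlist(list(data.table(x = factor()), data.table(x = factor(1))))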


This can be reproduced with a single-row data.table with a factor column and a zero-row data.table with a factor column.

library(data.table)
a <- data.table(x=factor(1), key='x')
b <- data.table(x=factor(), key='x')
merge(b, a, all.y=TRUE)
# RStudio -> R encountered a fatal error
# R GUI   -> the R windoze GUI has stopped working

Using debugonce(data.table:::merge.data.table), this can be traced to the line rbind(dt,yy), which is the equivalent of

rbind(b,a) 

which, if you run it, gives the same error.
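Until a fixed build is installed, one possible workaround (a sketch, not from the original report) is to drop zero-row tables before calling rbind(), so the empty-table-with-factor-column path is never reached:

# Workaround sketch: filter out zero-row tables first, then rbind()
tabs <- list(b, a)
tabs <- tabs[vapply(tabs, nrow, integer(1L)) > 0L]
do.call(rbind, tabs)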

This has been reported to the package authors as issue #5355.

