R: Largest set of columns having at least k shared rows
I have a large data frame (50k × 5k). I would like to make a smaller data frame using the following rule: given k, 0 < k < n, select the largest set of columns such that at least k rows have non-NA values in all of these columns.
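For concreteness, here is a toy sketch of the rule (the 4x4 matrix and the values of k below are my own made-up illustration, not the real data):

# hypothetical 4x4 matrix: each of rows 2-4 is missing one value
toy = matrix(c( 1,  2,  3,  4,
                5, NA,  6,  7,
                8,  9, NA, 10,
               11, 12, 13, NA),
             nrow = 4, byrow = TRUE,
             dimnames = list(NULL, c("a", "b", "c", "d")))

# rows with non-NA values in both "a" and "b": rows 1, 3, 4
sum(complete.cases(toy[, c("a", "b")]))  # 3

# so for k = 3 a largest set is a pair such as c("a", "b"),
# while for k = 4 it shrinks to the single column "a"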
This seems like it might be hard to compute on a big data frame, but I'm hoping it is possible. I have written code for this operation below.
My way of doing it seems unnecessarily complex. It relies on (1) computing the list of all possible subsets of the set of columns, and (2) checking how many shared rows each subset has. Even for small numbers of columns, step (1) gets slow (e.g., 45 seconds for 25 columns).
Question: Is it theoretically possible to find the largest set of columns sharing at least k non-NA rows? If so, what would be a more realistic approach?
@alexis_laz's elegant answer to a similar question takes the inverse approach to mine, examining (fixed-size) subsets of observations/samples/draws/units and checking which variables are present in all of them.
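As a minimal sketch of that inverse approach (assuming hasVal = 1 - is.na(mat), the 0/1 indicator matrix built in the code below): for a fixed k, the columns complete on a given set of k rows are those whose column sums over that set equal k, and the widest such column set over all k-row subsets is exactly the answer. It is only feasible when choose(n, k) is small:

# sketch: brute force over k-row subsets; assumes hasVal = 1 - is.na(mat)
bestColsForK = function(hasVal, k) {
  best = character(0)
  for (rows in combn(nrow(hasVal), k, simplify = FALSE)) {
    # columns with non-NA values in every one of these k rows
    cols = colnames(hasVal)[colSums(hasVal[rows, , drop = FALSE]) == k]
    if (length(cols) > length(best)) best = cols
  }
  best
}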
Taking combinations of the n observations is also difficult for large n. For example, length(combn(1:500, 3, simplify = FALSE)) yields 20,708,500 combinations for 500 observations, and on my computer it fails to produce the combinations for subset sizes greater than 3. This makes me worry that it won't be possible for large n and p with either approach.
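The growth rate can be checked directly with choose(), without materialising any combinations:

choose(500, 3)  # 20,708,500
choose(500, 4)  # 2,573,031,125 -- over a hundred times more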
I have included an example matrix for reproducibility.
require(dplyr)

# generate example matrix
set.seed(123)
n = 100
p = 25
missing = 25
mat = rnorm(n * p)
mat[sample(1:(n * p), missing)] = NA
mat = matrix(mat, nrow = n, ncol = p)
colnames(mat) = 1:p

# matrix reporting whether each value is non-NA
hasVal = 1 - is.na(mat)

system.time(
  # collect all possible subsets of the columns' indices
  nameSubsets <<- unlist(lapply(1:ncol(mat), combn, x = colnames(mat),
                                simplify = FALSE),
                         recursive = FALSE, use.names = FALSE)
)

# how many observations have all of the subset's variables?
countObsWithVars = function(varsVec) {
  selectedCols = as.matrix(hasVal[, varsVec])
  countInRow = apply(selectedCols, 1, sum) # for each row, the number of non-NA values
  numMatching = sum(countInRow == length(varsVec)) # rows with all selected columns
}

system.time(
  numObsWithVars <<- unlist(lapply(nameSubsets, countObsWithVars))
)

# collect results in a data.frame
df = data.frame(subsetNum = 1:length(numObsWithVars),
                numObsWithVars = numObsWithVars,
                numVarsInSubset = unlist(lapply(nameSubsets, length)),
                varsInSubset = I(nameSubsets))

# find the largest set of columns for each number of shared rows
maxdf = df %>%
  group_by(numObsWithVars) %>%
  filter(numVarsInSubset == max(numVarsInSubset)) %>%
  arrange(numObsWithVars)
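Given the df built above, the rule from the top of the post can then be read off for any particular k; a sketch, with k = 90 as an arbitrary illustrative value (any subset with at least k complete rows qualifies, and we take the widest):

k = 90
best = df %>%
  filter(numObsWithVars >= k) %>%
  filter(numVarsInSubset == max(numVarsInSubset)) %>%
  slice(1)
best$varsInSubset[[1]]  # column names of a largest qualifying set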