r - Largest set of columns having at least k shared rows


I have a large data frame (50k rows by 5k columns). I want to make a smaller data frame using the following rule: given k, 0 < k < n, select the largest set of columns such that at least k rows have non-NA values in all of these columns.
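For example, with this toy matrix (made up just to illustrate the rule, not part of my real data):

# Toy illustration of the rule (made-up data, not my real problem)
toy = matrix(c(1, NA, 3, 4,     # column a: non-NA in rows 1, 3, 4
               5, 6, NA, 8,     # column b: non-NA in rows 1, 2, 4
               9, 10, 11, 12),  # column c: non-NA in all rows
             nrow = 4, dimnames = list(NULL, c("a", "b", "c")))
# For k = 3, the largest qualifying column sets are {a, c} and {b, c}:
# each pair has 3 rows with non-NA values in both columns,
# while {a, b, c} has only 2 such rows (rows 1 and 4).
# For k = 2, all three columns qualify together.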

This seems like it might be hard to compute on a big data frame, but I'm hoping it is possible. I have written code for the operation below.

My way of doing this seems needlessly complex. It relies on (1) computing the list of all possible subsets of the set of columns, and (2) checking how many shared rows each subset has. Even for small numbers of columns, step (1) gets slow (e.g. 45 seconds for 25 columns).
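The underlying problem with step (1) is that the number of non-empty column subsets grows as 2^p - 1, so enumerating them is hopeless well before p = 5,000:

# Number of non-empty column subsets explodes exponentially with p
p = 25
sum(choose(p, 1:p))   # 33,554,431 subsets (2^25 - 1) just for 25 columns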

Question: is it theoretically possible to find the largest set of columns sharing at least k non-NA rows? If so, what would a more realistic approach look like?

@alexis_laz's elegant answer to a similar question takes the inverse approach to mine, examining (fixed-size) subsets of the observations/samples/draws/units and checking which variables are present in all of them. (I sketch my understanding of that approach after my code below.)

Taking combinations of the n observations is also difficult for large n. For example, length(combn(1:500, 3, simplify = FALSE)) yields 20,708,500 combinations for 500 observations, and on my computer combn fails to produce the combinations for subset sizes greater than 3. This makes me worry that neither approach will be feasible for large n and p.
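For a sense of how quickly this blows up even one step further:

choose(500, 3)   # 20,708,500
choose(500, 4)   # about 2.57 billion -- far too many to enumerate with combn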

I have included an example matrix for reproducibility.

require(dplyr)

# Generate example matrix
set.seed(123)
n = 100
p = 25
missing = 25
mat = rnorm(n * p)
mat[sample(1:(n * p), missing)] = NA
mat = matrix(mat, nrow = n, ncol = p)
colnames(mat) = 1:p

# Matrix reporting whether each value is non-NA
hasval = 1 - is.na(mat)

system.time(
  # Collect all possible subsets of the columns' names
  namesubsets <<- unlist(lapply(1:ncol(mat), combn, x = colnames(mat), simplify = FALSE),
                         recursive = FALSE,
                         use.names = FALSE)
)

# How many observations have all of the subset's variables?
countobswithvars = function(varsvec){
  selectedcols = as.matrix(hasval[, varsvec])
  countinrow = apply(selectedcols, 1, sum)           # for each row, number of non-NA values among the selected columns
  nummatching = sum(countinrow == length(varsvec))   # rows with all selected columns present
  nummatching
}

system.time(
  numobswithvars <<- unlist(lapply(namesubsets, countobswithvars))
)

# Collect results into a data.frame
df = data.frame(subsetnum = 1:length(numobswithvars),
                numobswithvars = numobswithvars,
                numvarsinsubset = unlist(lapply(namesubsets, length)),
                varsinsubset = I(namesubsets))

# Find the largest set of columns for each number of shared rows
maxdf = df %>%
  group_by(numobswithvars) %>%
  filter(numvarsinsubset == max(numvarsinsubset)) %>%
  arrange(numobswithvars)
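For reference, here is a rough sketch of how I understand the inverse (row-subset) approach, using the example matrix above. This is my own paraphrase, not @alexis_laz's actual code, and it still enumerates choose(n, k) row subsets:

# Sketch of the inverse approach (my paraphrase, not the linked answer's code):
# for every subset of k rows, count the columns that are non-NA in all of them,
# then keep the row subset yielding the most such columns.
k = 3
rowsubsets = combn(1:nrow(mat), k, simplify = FALSE)
colsshared = sapply(rowsubsets, function(rows) sum(colSums(hasval[rows, , drop = FALSE]) == k))
best = rowsubsets[[which.max(colsshared)]]
which(colSums(hasval[best, , drop = FALSE]) == k)   # largest column set sharing those k rows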

