python - Data preparation for scikit-learn decision tree
I'm trying to prepare a dataset for scikit-learn, planning to build a pandas DataFrame to feed to a decision tree classifier.
The data represents different companies with varying criteria, and some criteria can have multiple values - such as "customer segment" - which, for a given company, could be any or all of: SMB, midmarket, enterprise, etc. There are other criteria/columns with multiple possible values. I need decisions made upon the individual values, not the aggregate - company A for SMB, company A for midmarket, and not the "grouping" of company A for SMB and midmarket.
Is there guidance on how to handle this? Do I need to generate rows for every variant of a given company to be fed into the learning routine? Such that an input of:
company,segment
a,smb:mm:ent
becomes:
a,smb
a,mm
a,ent
The same would apply to other variants that may come from additional criteria/columns - for example, "customer vertical" could also include multiple values. That seems like it would increase the dataset size considerably. Is there a better way to structure this data and/or handle this scenario?
My ultimate goal is to let users complete a short survey with simple questions, and map their responses to values to get a prediction of the "right" company for a given segment, vertical, product category, etc. I'm struggling to build the right learning dataset to accomplish that.
Let's try this:
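    import pandas as pd

    df = pd.DataFrame({'company': ['a', 'b'], 'segment': ['smb:mm:ent', 'smb:mm']})
    # Split the colon-separated segments into one column per value
    expanded_segment = df.segment.str.split(':', expand=True)
    expanded_segment.columns = ['segment' + str(i) for i in range(len(expanded_segment.columns))]
    wide_df = pd.concat([df.company, expanded_segment], axis=1)
    # Melt back to long format: one row per (company, segment value) pair
    result = pd.melt(wide_df, id_vars=['company'], value_vars=list(set(wide_df.columns) - set(['company'])))
    # Drop the empty rows left by companies with fewer segments
    result.dropna()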
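If more than one column is multi-valued, the same reshaping can probably be done more directly with `DataFrame.explode`. The following is only a sketch, not the answerer's method: it assumes pandas 1.1 or later (for `ignore_index`), and the "vertical" column and its values are made up for illustration. It also one-hot encodes the result with `pd.get_dummies`, since scikit-learn trees expect numeric features.

```python
import pandas as pd

# Same toy data as above, plus a hypothetical multi-valued "vertical" column.
df = pd.DataFrame({
    "company": ["a", "b"],
    "segment": ["smb:mm:ent", "smb:mm"],
    "vertical": ["retail:finance", "retail"],  # made-up values for illustration
})

long_df = df.copy()
for col in ["segment", "vertical"]:
    # Turn "smb:mm:ent" into ["smb", "mm", "ent"], then emit one row per element.
    long_df[col] = long_df[col].str.split(":")
    long_df = long_df.explode(col, ignore_index=True)

# One-hot encode the categorical features; the company is the label to predict.
X = pd.get_dummies(long_df[["segment", "vertical"]])
y = long_df["company"]

print(long_df)
print(X.shape)
```

Exploding each column in turn yields every segment/vertical combination per company, so the row count grows multiplicatively, which is the size increase the question worries about. Under these assumptions, `X` and `y` could then be passed to something like `DecisionTreeClassifier().fit(X, y)`.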