python - Data preparation for scikit learn decision tree -

- September 15, 2015

I'm trying to prepare a dataset for scikit-learn, and I'm planning to build a pandas DataFrame to feed to a decision tree classifier.

The data represents different companies that vary across several criteria, and some criteria can have multiple values. "Customer segment" is one such criterion: for any given company it could be any, or all, of SMB, midmarket, enterprise, etc. There will be other criteria/columns with multiple possible values. I need decisions to be made on the individual values, not on the aggregate: the company is SMB, the company is midmarket, and so on, not a "grouping" of customers that are both SMB and midmarket.

Is there any guidance on how to handle this? Do I need to generate rows for every variant of a given company to be fed into the learning routine? Such that an input of:

company,segment
a,smb:mm:ent

becomes:

a,smb
a,mm
a,ent

And what about the other variants that may come from additional criteria/columns, for example if "customer vertical" can also include multiple values? It seems this would increase the dataset size. Is there a better way to structure the data and/or handle this scenario?

My ultimate goal is to let users complete a short survey with simple questions, and then map their responses to values to get a prediction of the "right" company for a given segment, vertical, product category, etc. I'm struggling to build the right learning dataset to accomplish that.

Let's give it a try.

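    import pandas as pd

    df = pd.DataFrame({'company': ['a', 'b'], 'segment': ['smb:mm:ent', 'smb:mm']})
    # split the colon-separated segments into one column per value
    expanded_segment = df.segment.str.split(':', expand=True)
    expanded_segment.columns = ['segment' + str(i) for i in range(len(expanded_segment.columns))]
    wide_df = pd.concat([df.company, expanded_segment], axis=1)
    # melt back to long format: one row per (company, segment) pair
    result = pd.melt(wide_df, id_vars=['company'], value_vars=list(set(wide_df.columns) - set(['company'])))
    # companies with fewer segments leave NaN rows behind; drop them
    result = result.dropna()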
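As a possible alternative, here is a minimal sketch of two other ways to shape the same toy data. It assumes pandas 0.25 or newer (for `DataFrame.explode`), and the `segment_` column prefix and the variable names `long_df` and `multi_hot_df` are illustrative choices, not something from the question. The first result reproduces the one-row-per-value layout; the second keeps one row per company and adds a 0/1 indicator column per segment value, which may help with the dataset-size concern because extra multi-valued criteria add columns rather than multiplying rows.

```python
import pandas as pd

df = pd.DataFrame({'company': ['a', 'b'],
                   'segment': ['smb:mm:ent', 'smb:mm']})

# Long format: one row per (company, segment) pair, as asked in the question.
long_df = (
    df.assign(segment=df['segment'].str.split(':'))
      .explode('segment')
      .reset_index(drop=True)
)

# Multi-hot format: one row per company, one 0/1 column per segment value.
# The 'segment_' prefix is an arbitrary naming choice for this sketch.
indicators = df['segment'].str.get_dummies(sep=':').add_prefix('segment_')
multi_hot_df = pd.concat([df[['company']], indicators], axis=1)

print(long_df)
print(multi_hot_df)
```

Because each segment value gets its own indicator column, a decision tree can still split on individual values such as "is SMB" or "is midmarket". Whether this layout suits the survey use case depends on how the responses are encoded, so treat it as a starting point rather than a recommendation.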
