python - Pandas DataFrame [cell=(label,value)], split into 2 separate dataframes -

- April 15, 2014

i found awesome way parse html pandas. data in kind of weird format (attached below). want split data 2 separate dataframes.

notice how each cell separated , ... is there efficient method split of these cells , create 2 dataframes, 1 labels , 1 ( value ) in parenthesis?

numpy has ufuncs, there way can use them on string dtypes since can converted np.array df.as_matrix()? i'm trying steer clear of for loops, iterate through indices , populate empty array that's pretty barbaric.

i'm using beaker notebook btw, it's cool (highly recommended)

#set url destination url = "http://www.reef.org/print/db/stats"  #process raw table df_raw = pd.pandas.read_html(url)[0]  #get start/end indices of table start_label = "10 frequent species"; start_idx = (df_raw.iloc[:,0] == start_label).argmax() end_label = "top 10 sites species richness"; end_idx = (df_raw.iloc[:,0] == end_label).argmax()  #process table df_freqspecies = pd.dataframe(                               df_raw.as_matrix()[(start_idx + 1):end_idx,:],                               columns = df_raw.iloc[0,:] ) df_freqspecies  #split these 2 separate dataframes

here's naive way of doing such:

import re df_species = pd.dataframe(np.zeros_like(df_freqspecies),columns=df_freqspecies.columns) df_freq = pd.dataframe(np.zeros_like(df_freqspecies).astype(str),columns=df_freqspecies.columns)  dims = df_freqspecies.shape in range(dims[0]):     j in range(dims[1]):         #parse current dataframe         species, freq = re.split("\s\(\d",df_freqspecies.iloc[i,j])         freq = float(freq[:-1])         #populate split dataframes         df_species.iloc[i,j] = species         df_freq.iloc[i,j] = freq

i want these 2 dataframes output:

(1) species; , (2) frequencies

you can way:

df1:

in [182]: df1 = df_freqspecies.replace(r'\s*\(\d+\.*\d*\)', '', regex=true)  in [183]: df1.head() out[183]: 0 tropical western atlantic california, pacific northwest , alaska  \ 0                  bluehead                          copper rockfish 1                 blue tang                                  lingcod 2      stoplight parrotfish                        painted greenling 3        bicolor damselfish                           sunflower star 4              french grunt                          plumose anemone  0                      hawaii tropical eastern pacific  \ 0               saddle wrasse           king angelfish 1  hawaiian whitespotted toby          mexican hogfish 2       raccoon butterflyfish               barberfish 3            manybar goatfish            flag cabrilla 4                moorish idol   panamic sergeant major  0              south pacific northeast , eastern canada  \ 0            regal angelfish                          cunner 1  bluestreak cleaner wrasse                 winter flounder 2           manybar goatfish                     rock gunnel 3             brushtail tang                         pollock 4       two-spined angelfish                  grubby sculpin  0 south atlantic states       central indo-pacific 0         slippery dick               moorish idol 1       belted sandfish       three-spot dascyllus 2        black sea bass  bluestreak cleaner wrasse 3               tomtate     blacklip butterflyfish 4                cubbyu        clark's anemonefish

and df2

in [193]: df2 = df_freqspecies.replace(r'.*\((\d+\.*\d*)\).*', r'\1', regex=true)  in [194]: df2.head() out[194]: 0 tropical western atlantic california, pacific northwest , alaska hawaii  \ 0                        85                                     54.6     92 1                      84.8                                     53.2   85.8 2                        81                                     50.8   85.7 3                      79.9                                     50.2   85.7 4                      74.8                                     49.7   82.9  0 tropical eastern pacific south pacific northeast , eastern canada  \ 0                     85.7            79                            67.4 1                     82.5          77.3                            46.6 2                     75.2          73.9                            26.2 3                     68.9          73.3                            25.2 4                     67.9          72.8                            23.7  0 south atlantic states central indo-pacific 0                  79.7                 80.1 1                  78.5                 75.6 2                  78.5                 73.5 3                  72.7                 71.4 4                  65.7                 70.2

regex debugging , explanation:

we want remove everything, except number in parentheses:

(\d+\.*\d*) - group(1) - it's our number

\((\d+\.*\d*)\) - our number in parentheses

.*\((\d+\.*\d*)\).* - whole thing - before '(', '(', our number, ')', till end of cell

it replaced group(1) - our number

Search This Blog

First Image

python - Pandas DataFrame [cell=(label,value)], split into 2 separate dataframes -

Comments

Post a Comment

Popular posts from this blog

php - Passing multiple values in a url using checkbox -

java - nested exception is org.hibernate.exception.SQLGrammarException: could not extract ResultSet Hibernate+SpringMVC -

sql - Postgresql tables exists, but getting "relation does not exist" when querying -