python - Pandas DataFrame [cell=(label,value)], split into 2 separate dataframes -
i found awesome way parse
html
pandas
. data in kind of weird format (attached below). want split data 2 separate dataframes
.
notice how each cell
separated ,
... is there efficient method split of these cells , create 2 dataframes, 1 labels , 1 ( value )
in parenthesis?
numpy
has ufuncs
, there way can use them on string
dtypes
since can converted np.array
df.as_matrix()
? i'm trying steer clear of for loops
, iterate through indices , populate empty array that's pretty barbaric.
i'm using beaker notebook
btw, it's cool (highly recommended)
#set url destination url = "http://www.reef.org/print/db/stats" #process raw table df_raw = pd.pandas.read_html(url)[0] #get start/end indices of table start_label = "10 frequent species"; start_idx = (df_raw.iloc[:,0] == start_label).argmax() end_label = "top 10 sites species richness"; end_idx = (df_raw.iloc[:,0] == end_label).argmax() #process table df_freqspecies = pd.dataframe( df_raw.as_matrix()[(start_idx + 1):end_idx,:], columns = df_raw.iloc[0,:] ) df_freqspecies #split these 2 separate dataframes
here's naive way of doing such:
import re df_species = pd.dataframe(np.zeros_like(df_freqspecies),columns=df_freqspecies.columns) df_freq = pd.dataframe(np.zeros_like(df_freqspecies).astype(str),columns=df_freqspecies.columns) dims = df_freqspecies.shape in range(dims[0]): j in range(dims[1]): #parse current dataframe species, freq = re.split("\s\(\d",df_freqspecies.iloc[i,j]) freq = float(freq[:-1]) #populate split dataframes df_species.iloc[i,j] = species df_freq.iloc[i,j] = freq
i want these 2 dataframes output:
you can way:
df1:
in [182]: df1 = df_freqspecies.replace(r'\s*\(\d+\.*\d*\)', '', regex=true) in [183]: df1.head() out[183]: 0 tropical western atlantic california, pacific northwest , alaska \ 0 bluehead copper rockfish 1 blue tang lingcod 2 stoplight parrotfish painted greenling 3 bicolor damselfish sunflower star 4 french grunt plumose anemone 0 hawaii tropical eastern pacific \ 0 saddle wrasse king angelfish 1 hawaiian whitespotted toby mexican hogfish 2 raccoon butterflyfish barberfish 3 manybar goatfish flag cabrilla 4 moorish idol panamic sergeant major 0 south pacific northeast , eastern canada \ 0 regal angelfish cunner 1 bluestreak cleaner wrasse winter flounder 2 manybar goatfish rock gunnel 3 brushtail tang pollock 4 two-spined angelfish grubby sculpin 0 south atlantic states central indo-pacific 0 slippery dick moorish idol 1 belted sandfish three-spot dascyllus 2 black sea bass bluestreak cleaner wrasse 3 tomtate blacklip butterflyfish 4 cubbyu clark's anemonefish
and df2
in [193]: df2 = df_freqspecies.replace(r'.*\((\d+\.*\d*)\).*', r'\1', regex=true) in [194]: df2.head() out[194]: 0 tropical western atlantic california, pacific northwest , alaska hawaii \ 0 85 54.6 92 1 84.8 53.2 85.8 2 81 50.8 85.7 3 79.9 50.2 85.7 4 74.8 49.7 82.9 0 tropical eastern pacific south pacific northeast , eastern canada \ 0 85.7 79 67.4 1 82.5 77.3 46.6 2 75.2 73.9 26.2 3 68.9 73.3 25.2 4 67.9 72.8 23.7 0 south atlantic states central indo-pacific 0 79.7 80.1 1 78.5 75.6 2 78.5 73.5 3 72.7 71.4 4 65.7 70.2
regex debugging , explanation:
we want remove everything, except number in parentheses:
(\d+\.*\d*)
- group(1) - it's our number
\((\d+\.*\d*)\)
- our number in parentheses
.*\((\d+\.*\d*)\).*
- whole thing - before '(', '(', our number, ')', till end of cell
it replaced group(1) - our number
Comments
Post a Comment