python - Pandas read_fwf: specify dtype
I am reading in a huge fixed-width text file in chunks and exporting the data to CSV. Because pandas.read_fwf does not allow me to specify dtypes, I am wondering what other ways exist to force the columns to strings. The reason is that pandas infers some columns as float even though they are not, and I do not want the .0 within those columns.
I am using data[column] = data[column].astype(str),
but that does not get rid of the decimals. Converting the columns of float64 dtype to int doesn't work either, since NAs cannot be converted. Any ideas?
Here's a snippet of my code:

dat = pd.read_fwf(file_to_read, colspecs=cols, header=None,
                  chunksize=100000, names=header)

# first chunk
data.info()
# Int64Index: 100000 entries, 0 to 99999
# Columns: 562 entries
# dtypes: float64(405), int64(4), object(153)
# memory usage: 429.5+ MB

for column in data.columns:
    if data[column].dtype == 'float64':
        data[column] = data[column].astype(int)
    else:
        pass
I also tried str().replace('.0', ''), but I want to find an easier way, since iterating through every column takes a lot of time.
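As a side note on why the two attempts above fall short, here is a minimal sketch (the column name code is made up for illustration): astype(str) keeps the trailing .0 and renders NaN as the literal string 'nan', while a per-value formatter can emit clean integer strings and leave NAs empty:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'code': [1.0, 3.0, np.nan]})

# astype(str) keeps the trailing .0 and turns NaN into the string 'nan'
print(df['code'].astype(str).tolist())   # ['1.0', '3.0', 'nan']

# One workaround: format each value as an integer string, leaving NaN empty
as_str = df['code'].map(lambda v: '' if pd.isna(v) else str(int(v)))
print(as_str.tolist())                   # ['1', '3', '']
```

This still touches every value, though, so it does not solve the speed problem; preventing the float inference at read time (as in the answer below) is cheaper.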
The converters parameter can be used to preserve the data as strings, since pd.read_fwf does not try to guess the dtype when a converter is specified:
import pandas as pd
try:
    # Python 2
    from cStringIO import StringIO
except ImportError:
    # Python 3
    from io import StringIO

content = '''\
1.0 2 a
3.0 4 b
5   x c
m   y d
'''

header = ['foo', 'bar', 'baz']
for df in pd.read_fwf(StringIO(content), header=None, chunksize=2,
                      names=header, converters={h: str for h in header}):
    print(df)

df.info()
yields
   foo bar baz
0  1.0   2   a
1  3.0   4   b
   foo bar baz
0    5   x   c
1    m   y   d
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
foo    2 non-null object
bar    2 non-null object
baz    2 non-null object
dtypes: object(3)
memory usage: 120.0+ bytes
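For what it's worth, newer pandas versions forward extra keyword arguments from read_fwf to the underlying parser, so passing dtype=str directly may also work and spares you the converters dict. This is a sketch assuming a pandas version where read_fwf accepts dtype (older versions raised an error on it):

```python
import pandas as pd
from io import StringIO

content = '''\
1.0 2 a
3.0 4 b
'''

# dtype=str asks the parser to keep every column as strings
df = pd.read_fwf(StringIO(content), header=None,
                 names=['foo', 'bar', 'baz'], dtype=str)
print(df.dtypes)  # all three columns come back as object
```

If your pandas version rejects dtype here, the converters approach above remains the portable fallback.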