python 2.7 - Getting CParserError: Error tokenizing data. C error: Expected 281 fields in line 1025974, saw 331
I have a 17 GB tab-separated file and get the above error when using Python/pandas.
I am doing the following:
data = pd.read_csv('/tmp/testdata.tsv', sep='\t')
I have tried adding encoding='utf8', tried read_table, and various flags including low_memory=False; I get the same error at the same line.
I ran the following on the file:
awk -F"\t" 'FNR==1025974 {print NF}' /tmp/testdata.tsv
and it returns 281, so awk is telling me the line has the expected 281 columns, while read_csv claims 331.
To rule out an off-by-one (zero- vs one-indexed line numbers), I ran the same awk command on lines 1025973 and 1025975; both also come back with 281 fields.
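The same field-count check can be done in Python, which avoids any doubt about awk's indexing. This is a sketch, not from the original post; the helper name `field_counts` and the synthetic demo file are my own, and the real file in the question lives at /tmp/testdata.tsv.

```python
def field_counts(path, line_numbers, sep='\t'):
    """Return {line_number: field_count} for the requested 1-indexed lines."""
    wanted = set(line_numbers)
    counts = {}
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if lineno in wanted:
                # A line with N separators has N+1 fields.
                counts[lineno] = line.rstrip('\n').count(sep) + 1
            if len(counts) == len(wanted):
                break
    return counts

# Tiny demo on a synthetic file; on the real data you would pass
# '/tmp/testdata.tsv' and [1025973, 1025974, 1025975].
import os
import tempfile

tmp = tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False)
tmp.write('a\tb\tc\n1\t2\t3\n')
tmp.close()
print(field_counts(tmp.name, [1, 2]))  # {1: 3, 2: 3}
os.unlink(tmp.name)
```

Note that this counts raw tab characters, exactly as awk does with -F"\t"; it does not apply any quoting rules, which is precisely where it can disagree with read_csv.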
What am I missing here?
To debug this, I took the header line plus the single offending line and ran just those two through read_csv. That produced a different error:
Error tokenizing data. C error: EOF inside string starting at line 1
The problem turned out to be that, by default, read_csv treats a double quote appearing right after the delimiter as opening a quoted field, and keeps consuming input (including tabs and newlines) until it finds the closing quote.
I had incorrectly assumed that specifying sep="\t" meant the parser would split on tabs only and not care about any other characters.
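The behavior is easy to reproduce with the standard library's csv module, which follows the same quoting convention. The sample data below is my own construction: one field begins with an unmatched double quote, so the default parser swallows the following tabs and newline while hunting for the closing quote.

```python
import csv
import io

# Tab-separated data where one field starts with an unmatched double quote.
data = 'a\t"b\tc\nd\te\tf\n'

# Default quoting: the '"' after a tab opens a quoted field, so the parser
# consumes tabs and the newline as field content, merging everything that
# follows into a single field of a single row.
default_rows = list(csv.reader(io.StringIO(data), delimiter='\t'))

# QUOTE_NONE: quote characters are ordinary data; rows split on tabs alone.
none_rows = list(csv.reader(io.StringIO(data), delimiter='\t',
                            quoting=csv.QUOTE_NONE))

print(default_rows)  # one merged row
print(none_rows)     # [['a', '"b', 'c'], ['d', 'e', 'f']]
```

This is exactly the mechanism behind both errors above: a stray quote makes the parser see a different number of fields than a plain tab count would suggest.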
Long story short: to fix this, pass the following flag to read_csv:
quoting=3 (that is, csv.QUOTE_NONE)
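Putting it together, a minimal sketch of the fix (the inline sample data stands in for the 17 GB file in the question, and using the named constant csv.QUOTE_NONE instead of the literal 3 is my suggestion):

```python
import csv
import io

import pandas as pd

# Stand-in for the real file: a field starting with an unmatched quote.
data = 'col1\tcol2\tcol3\na\t"b\tc\n'

# quoting=csv.QUOTE_NONE (== 3) tells the parser to treat quote characters
# as ordinary data, so rows are split on tabs alone.
df = pd.read_csv(io.StringIO(data), sep='\t', quoting=csv.QUOTE_NONE)

print(df.shape)             # (1, 3)
print(df.iloc[0].tolist())  # ['a', '"b', 'c']
```

On the real file this becomes `pd.read_csv('/tmp/testdata.tsv', sep='\t', quoting=csv.QUOTE_NONE)`; note that the stray quote is then kept verbatim in the field value.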