python 2.7 - Getting CParserError: Error tokenizing data. C error: Expected 281 fields in line 1025974, saw 331


I have a 17 GB tab-separated file and I get the above error when reading it with Python/pandas.

I am doing the following:

data = pd.read_csv('/tmp/testdata.tsv',sep='\t') 

I have tried adding encoding='utf8', and also tried read_table and various flags, including low_memory=True, but I always get the same error at the same line.

I ran the following on the file:

awk -F"\t" 'FNR==1025974 {print NF}' /tmp/testdata.tsv

and it returns 281 as the number of fields. So awk is telling me that the line has the correct 281 columns, but read_csv is telling me it has 331.

I also tried the above awk on lines 1025973 and 1025975, just to be sure the line number wasn't relative to 0, and both come back with 281 fields.

What am I missing here?

So, to debug this, I took the header line and the single line reported above and ran just those two through read_csv. I got this error:

Error tokenizing data. C error: EOF inside string starting at line 1

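The problem turned out to be that, by default, read_csv treats a double quote that appears right after the delimiter as the start of a quoted field, and then keeps reading until it finds the closing double quote.

I had incorrectly assumed that if I specified sep="\t" it would split only on tabs and not care about any other characters.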
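
To see why awk and a quote-aware parser can disagree on the field count, here is a minimal sketch using Python's standard csv module as a stand-in (it is not the pandas C tokenizer, but it follows the same quoting convention). The tiny three-column SAMPLE and the field_counts helper are made up for illustration; they are not taken from the real file.

```python
import csv
import io

# Hypothetical 3-column sample. Every line has exactly 3 tab-separated
# fields, but the last two lines each contain one stray double quote.
SAMPLE = 'a\tb\tc\n7\t8\t9\n1\t2\t"x\n"\t5\t6\n'


def field_counts(text, quoting):
    """Return the number of fields in each record as csv.reader sees it."""
    reader = csv.reader(io.StringIO(text), delimiter='\t', quoting=quoting)
    return [len(row) for row in reader]


# Quote-aware parsing: the quote opened on one line is closed by the quote
# on the next line, so the two lines are merged into one longer record.
quote_aware = field_counts(SAMPLE, csv.QUOTE_MINIMAL)

# QUOTE_NONE: quotes are ordinary characters, which matches what awk counts.
quote_blind = field_counts(SAMPLE, csv.QUOTE_NONE)

print(quote_aware)
print(quote_blind)
```

With quoting enabled, the last two lines collapse into a single record that has more fields than any physical line, which is the same kind of mismatch as "expected 281, saw 331".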

Long story short, to fix this, add the following flag to read_csv:

quoting=3, which is csv.QUOTE_NONE.
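
As a rough sketch of the fix, assuming a recent pandas on Python 3 (where the exception is pd.errors.ParserError rather than CParserError), the snippet below reads a small made-up sample twice: once with the defaults and once with quoting=csv.QUOTE_NONE. SAMPLE is hypothetical and only stands in for the real 17 GB file.

```python
import csv
import io

import pandas as pd

# Hypothetical 3-column sample with stray double quotes on the last two lines.
SAMPLE = 'a\tb\tc\n7\t8\t9\n1\t2\t"x\n"\t5\t6\n'

# Default quoting: the stray quotes merge two lines into one oversized record.
try:
    pd.read_csv(io.StringIO(SAMPLE), sep='\t')
    default_failed = False
except pd.errors.ParserError as exc:
    default_failed = True
    print('default read failed:', exc)

# quoting=csv.QUOTE_NONE (the constant equals 3): split on tabs only and
# keep any double quotes as literal characters in the data.
df = pd.read_csv(io.StringIO(SAMPLE), sep='\t', quoting=csv.QUOTE_NONE)
print(df.shape)
```

For the real file the call would then be pd.read_csv('/tmp/testdata.tsv', sep='\t', quoting=csv.QUOTE_NONE). Note that with this flag any fields that really were quoted keep their surrounding quote characters.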

