python 2.7 - Getting CParserError: Error tokenizing data. C error: Expected 281 fields in line 1025974, saw 331
I have a 17 GB tab-separated file and get the above error when using Python/pandas.
I am doing the following:
data = pd.read_csv('/tmp/testdata.tsv', sep='\t')
I have tried adding encoding='utf8', tried read_table, and various flags including low_memory=False; I get the same error at the same line.
I ran the following on the file:
awk -F"\t" 'FNR==1025974 {print NF}' /tmp/testdata.tsv
and it returns 281, so awk is telling me the line has the expected 281 columns, while read_csv claims 331.
To rule out an off-by-one (zero- vs one-indexed line numbers), I ran the same awk command on lines 1025973 and 1025975; both also come back with 281 fields.
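The same field-count check can be done in Python, which avoids any doubt about awk's indexing. This is a sketch, not from the original post; the helper name `field_counts` and the synthetic demo file are my own, and the real file in the question lives at /tmp/testdata.tsv.

```python
def field_counts(path, line_numbers, sep='\t'):
    """Return {line_number: field_count} for the requested 1-indexed lines."""
    wanted = set(line_numbers)
    counts = {}
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if lineno in wanted:
                # A line with N separators has N+1 fields.
                counts[lineno] = line.rstrip('\n').count(sep) + 1
            if len(counts) == len(wanted):
                break
    return counts

# Tiny demo on a synthetic file; on the real data you would pass
# '/tmp/testdata.tsv' and [1025973, 1025974, 1025975].
import os
import tempfile

tmp = tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False)
tmp.write('a\tb\tc\n1\t2\t3\n')
tmp.close()
print(field_counts(tmp.name, [1, 2]))  # {1: 3, 2: 3}
os.unlink(tmp.name)
```

Note that this counts raw tab characters, exactly as awk does with -F"\t"; it does not apply any quoting rules, which is precisely where it can disagree with read_csv.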
What am I missing here?
To debug this, I took the header line plus the single offending line and ran just those two through read_csv. That produced a different error:
Error tokenizing data. C error: EOF inside string starting at line 1
The problem turned out to be that, by default, read_csv treats a double quote appearing right after the delimiter as opening a quoted field, and keeps consuming input (including tabs and newlines) until it finds the closing quote.
I had incorrectly assumed that specifying sep="\t" meant the parser would split on tabs only and not care about any other characters.
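The behavior is easy to reproduce with the standard library's csv module, which follows the same quoting convention. The sample data below is my own construction: one field begins with an unmatched double quote, so the default parser swallows the following tabs and newline while hunting for the closing quote.

```python
import csv
import io

# Tab-separated data where one field starts with an unmatched double quote.
data = 'a\t"b\tc\nd\te\tf\n'

# Default quoting: the '"' after a tab opens a quoted field, so the parser
# consumes tabs and the newline as field content, merging everything that
# follows into a single field of a single row.
default_rows = list(csv.reader(io.StringIO(data), delimiter='\t'))

# QUOTE_NONE: quote characters are ordinary data; rows split on tabs alone.
none_rows = list(csv.reader(io.StringIO(data), delimiter='\t',
                            quoting=csv.QUOTE_NONE))

print(default_rows)  # one merged row
print(none_rows)     # [['a', '"b', 'c'], ['d', 'e', 'f']]
```

This is exactly the mechanism behind both errors above: a stray quote makes the parser see a different number of fields than a plain tab count would suggest.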
Long story short: to fix this, pass the following flag to read_csv:
quoting=3 (that is, csv.QUOTE_NONE)
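Putting it together, a minimal sketch of the fix (the inline sample data stands in for the 17 GB file in the question, and using the named constant csv.QUOTE_NONE instead of the literal 3 is my suggestion):

```python
import csv
import io

import pandas as pd

# Stand-in for the real file: a field starting with an unmatched quote.
data = 'col1\tcol2\tcol3\na\t"b\tc\n'

# quoting=csv.QUOTE_NONE (== 3) tells the parser to treat quote characters
# as ordinary data, so rows are split on tabs alone.
df = pd.read_csv(io.StringIO(data), sep='\t', quoting=csv.QUOTE_NONE)

print(df.shape)             # (1, 3)
print(df.iloc[0].tolist())  # ['a', '"b', 'c']
```

On the real file this becomes `pd.read_csv('/tmp/testdata.tsv', sep='\t', quoting=csv.QUOTE_NONE)`; note that the stray quote is then kept verbatim in the field value.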