Monday, December 21, 2009

 

Parsing Windows-encoded CSV file, again

This post continues the topic of how to deal with UTF-16-LE encoding; see this post.

If writing UTF-16-LE is no picnic, reading is even more challenging, unless you want to read the whole file as one line. If you prefer, or are forced, to use a per-line approach, you'd better be aware that a properly encoded UTF-16-LE file includes a zero byte '\x00' after the end-of-line symbol; that is, every line ends with the four bytes 0x0D 0x00 0x0A 0x00, or if you will '\r\x00\n\x00'. However, when reading line by line, Python splits right after '\n', so the trailing '\x00' gets interpreted as belonging to the next line!
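To see what happens, here is a minimal Python 2 sketch (the file name "demo.csv" is made up) that writes a tiny two-line UTF-16-LE file and reads it back in binary mode:

# quick demo of the splitting problem ("demo.csv" is just an example name)
out = open("demo.csv", "wb")
out.write("\xff\xfe" + u"one\r\ntwo\r\n".encode("utf_16_le"))
out.close()

for raw in open("demo.csv", "rb"):
    print repr(raw)

# prints:
# '\xff\xfeo\x00n\x00e\x00\r\x00\n'    <- stops right after '\n', zero byte missing
# '\x00t\x00w\x00o\x00\r\x00\n'        <- stray '\x00' glued to the next line
# '\x00'                               <- leftover zero byte at the end of the file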

To rectify this problem, you can use this Python "generator" to read a UTF-16-LE-encoded file line by line; it decodes each line and yields it re-encoded as UTF-8:

def readiterator(file) :
    fh = open ( file, "rb" )
    for line in fh :
        # the very last chunk is the leftover zero byte after the final '\n'
        if line == '\x00' : continue
        if line[:2] == '\xff\xfe' :
            # first line: strip the BOM and restore the '\x00' cut off after '\n'
            line = line[2:] + "\x00"
        else :
            # other lines: strip the stray '\x00' carried over from the previous line
            line = line[1:] + "\x00"
        res = unicode ( line, "utf_16_le" )
        yield res.encode ( "utf-8" )
    fh.close ()

One typical application of that would be parsing a CSV file (e.g., the result of an Excel CSV export) using the Python built-in "csv" module, which has no knowledge of encodings but luckily does accept UTF-8-encoded input:

csvreader = csv.reader(readiterator(input_csv_file))
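
For instance, assuming "import csv" at the top of the script and a hypothetical input file name, the reader yields each row as a list of UTF-8-encoded strings:

import csv

# "contacts.csv" is a placeholder name; each cell comes back UTF-8-encoded
for row in csv.reader(readiterator("contacts.csv")):
    print [unicode(cell, "utf-8") for cell in row]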

If you decide to (for example) make changes to the table and save it again as a CSV file, you'll quickly discover that you can't use csv.writer() directly in a similar way, since it does not work with unicode strings at all. You will have to play a trick taken directly from the official Python documentation: first encode to UTF-8 and dump the CSV row to a temporary string, then read this string back and convert it to unicode. Here is one way to do that:

import csv, StringIO

class MyCSVWriter:
    def __init__ (self,file_writer) :
        self.stream = file_writer
        # rows are first written, UTF-8-encoded, into an in-memory buffer
        self.queue = StringIO.StringIO ()
        self.writer = csv.writer(self.queue)

    def writerow (self,row) :
        self.writer.writerow([s.encode("utf-8") for s in row])
        # read the buffered CSV line back, decode it and hand it downstream as unicode
        self.stream.write(unicode(self.queue.getvalue(),"utf-8"))
        self.queue.truncate(0)

    def close(self) :
        self.stream.close ()

Of course, you still need a backend to dump the unicode data as a UTF-16-LE-encoded file:

class MyFileWriter:
    def __init__ ( self, file ) :
        self.fh = open ( file, "wb" )
        self.lineno = 0

    def write (self, line) :
        # prepend the byte-order mark before the very first line
        if self.lineno == 0 :
            self.fh.write ( '\xff\xfe' )
        self.lineno += 1
        self.fh.write ( line.encode ( "utf_16_le" ) )

    def close (self) :
        self.fh.close()

These two classes finally make it possible to create a CSV "writer" which can be used to write back the data just retrieved by the aforementioned "reader":

csvwriter = MyCSVWriter(MyFileWriter(output_csv_file))
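
For example, a complete round trip could look roughly like this; the file names and the fixup() helper are placeholders for whatever per-row changes you actually need to make:

import csv

# "export.csv", "fixed.csv" and fixup() are placeholders, not part of the real script
csvwriter = MyCSVWriter(MyFileWriter("fixed.csv"))
for row in csv.reader(readiterator("export.csv")):
    # writerow() expects unicode strings, so decode the UTF-8 cells first
    row = fixup([unicode(cell, "utf-8") for cell in row])
    csvwriter.writerow(row)
csvwriter.close()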

All of these code snippets are taken from a utility, parsegab.py, which I wrote to make some very specific changes to the Google Address Book, using the workflow "export – fix – erase all – import back".


