Monday, April 28, 2008
Dealing with "native" Windows encoding
Microsoft Windows and Microsoft applications such as Microsoft Office use UTF-16LE as their default Unicode encoding; where they offer a choice of encodings, they call it simply "Unicode". If the goal is to generate Unicode files that all Microsoft applications can open, they had better be in UTF-16LE.
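Many languages and libraries offer built-in conversion to UTF-16LE; however, one must be aware of two potential problems: (1) the standard 2-byte header, the byte order mark FF FE, that Windows expects on input (and writes on output), and (2) the built-in DOS line-ending translation ("text mode"), which inserts an extra 0D byte before every 0A byte and so corrupts the 2-byte code units. Files must therefore be written in "binary" mode, with the "\r\n" line endings spelled out explicitly.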
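To make both pitfalls concrete, here is a small sketch, assuming Python 3. It does not open any file; it only simulates, at the byte level, roughly what a text-mode write would do to UTF-16LE data. The helper name `decodes_cleanly` is made up for this illustration.

```python
import codecs

# The byte order mark for UTF-16LE is two bytes: FF FE.
bom = codecs.BOM_UTF16_LE

# A bare "\n" becomes the two bytes 0A 00 in UTF-16LE.
encoded = "a\nb".encode("utf-16-le")

# Simulate a text-mode write on Windows: every 0A byte
# gets a 0D byte inserted in front of it.
mangled = encoded.replace(b"\n", b"\r\n")


def decodes_cleanly(data):
    """Return True if data is valid UTF-16LE (hypothetical helper)."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


print(len(bom), len(encoded), len(mangled))
print(decodes_cleanly(encoded), decodes_cleanly(mangled))
```

The extra 0D byte shifts everything after it by one byte, so the mangled data no longer splits into 2-byte code units.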
The proper way to create a UTF-16LE file in Python is this:
    # -*- coding: utf-8 -*-
    fh = open("Test.txt", "wb")   # binary mode: no line-ending translation
    fh.write(b"\xff\xfe")         # 2-byte UTF-16LE byte order mark
    fh.write(u"Проверка\r\n".encode("UTF-16LE"))
    fh.close()
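The same recipe can be wrapped in a pair of reusable functions. This is only a sketch, assuming Python 3; `write_utf16le_file` and `read_utf16le_file` are hypothetical names, not part of any library, and the temporary directory is used just to keep the example self-contained.

```python
import codecs
import os
import tempfile


def write_utf16le_file(path, text):
    """Write text as UTF-16LE with a BOM and CRLF line endings (hypothetical helper)."""
    # Normalize first so existing "\r\n" pairs are not doubled.
    normalized = text.replace("\r\n", "\n").replace("\n", "\r\n")
    with open(path, "wb") as fh:  # binary mode: bytes are written untouched
        fh.write(codecs.BOM_UTF16_LE)
        fh.write(normalized.encode("utf-16-le"))


def read_utf16le_file(path):
    """Read the file back, checking for and stripping the BOM (hypothetical helper)."""
    with open(path, "rb") as fh:
        data = fh.read()
    if not data.startswith(codecs.BOM_UTF16_LE):
        raise ValueError("missing UTF-16LE byte order mark")
    return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")


# Usage: write a one-line file and read it back.
path = os.path.join(tempfile.mkdtemp(), "Test.txt")
write_utf16le_file(path, "Проверка\n")
round_trip = read_utf16le_file(path)
```

Normalizing the line endings inside the writer is a design choice made for this sketch: the caller can pass either "\n" or "\r\n" and still get the line endings Windows applications expect.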
Labels: python, unicode, windows