Thursday, February 17, 2011
MPM
In my mod-python/apache-based server implementation, at some point I generate a certain amount of data (from a MySQL database) used in processing user input. This generation takes a while (~10 secs on my development server, ~30 secs on the production VPS-based server), and the data is virtually unchanged from one invocation to the next. Why couldn't I save it in a disk cache and use that instead? OK, I tried just that.
You can imagine my surprise when I realized that reading from the cache took exactly the same time as the original data generation! How is that even possible?
It turned out that the reason for this phenomenon was that my data consisted mostly of compiled Python REs (regular expressions). However, Python's underlying serialization engine ("pickle") cannot save compiled REs as such: it stores only the pattern string and flags, so on the pickle.load() call these REs are automatically (re-)compiled. This is what I experienced: most of the time which I initially attributed to DB access and processing of the results was in fact spent compiling REs. When I switched to a file cache, I gained nothing: the same REs were recompiled again on cache load.
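A quick way to see this behavior for yourself (it is the same on Python 2 and 3): round-tripping a compiled pattern through pickle succeeds, but only the pattern string and flags survive, and loading triggers a fresh compilation:

```python
import pickle
import re

pat = re.compile(r"\d+")
blob = pickle.dumps(pat)
# pickle serialized only (pattern, flags); loads() calls re.compile()
# again, paying the full compilation cost on every cache read
restored = pickle.loads(blob)
assert restored.pattern == r"\d+"
assert restored.match("123") is not None
```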
Why such a limitation? No one is really sure; however, this is exactly how things stand at the moment:
Q: Is it possible to store these regular expressions in a cache on disk in a pre-compiled manner to avoid having to execute the regex compilations on each import?
A: Not easily. You'd have to write a custom serializer that hooks into the C sre implementation of the Python regex engine. Any performance benefits would be vastly outweighed by the time and effort required.
So what should, or could, I do to optimize my server? It is really annoying to have to wait half a minute every time for the server to generate the same data over and over again.
Looking for an answer to this question, I turned to the mod-python documentation, section on session management. It appears that mod-python supports three kinds of session management engines: MemorySession, DbmSession and FileSession; naturally, MemorySession implements persistent storage in memory, whereas DbmSession and FileSession are essentially different ways to provide disk-based caching.
Now, there is little doubt that said caching internally uses Python's standard pickle engine, which would take me right back to square one.
Can I use MemorySession? mod-python's "Session" implementation makes its own determination of which session engine to use. The documentation has this to say regarding when MemorySession is chosen:
If the session type option is not found, the function queries the MPM and based on that returns either a new instance of DbmSession or MemorySession. MemorySession will be used if the MPM is threaded and not forked (such is the case on Windows), or if it is threaded, forked, but only one process is allowed (the worker MPM can be configured to run this way). In all other cases DbmSession is used.
What is it talking about, and what is an "MPM"? It took me a while to figure that out. MPM stands for Multi-Processing Module, and it has to do with how Apache distributes incoming requests. Borrowing an excellent explanation from here:
Apache can operate in a number of different modes dependent on the platform being used and the way in which it is configured. This ranges from multiple processes being used with only one request being handled at a time within each process, to one or more processes being used with concurrent requests being handled in distinct threads executing within the same process or distinct processes.
The UNIX "prefork" Mode
This mode is the most commonly used. It was the only mode of operation available in Apache 1.3 and is still the default mode on UNIX systems in Apache 2.0 and 2.2. In this setup, the main Apache process will at startup create multiple child processes. When a request is received by the parent process, it will be handed off to one of the child processes to be handled.
The UNIX "worker" Mode
The "worker" mode is similar to "prefork" mode except that within each child process there will exist a number of worker threads. Instead of a request being handed off to the next available child process with the handling of the request being the only thing the child process is doing, the request will be handed off to a specific worker thread within a child process with other worker threads in the same child process potentially handling other requests at the same time.
You can find out which mode is used with these function calls (from a mod-python request handler):
threaded = apache.mpm_query(apache.AP_MPMQ_IS_THREADED)
forked = apache.mpm_query(apache.AP_MPMQ_IS_FORKED)
The last question is, how to switch from one mode to another? In Debian, simply install one of the apache2-mpm-XXX packages:
debian-linux% apt-cache search ^apache2-mpm
apache2-mpm-itk - multiuser MPM for Apache 2.2
apache2-mpm-event - Apache HTTP Server - event driven model
apache2-mpm-prefork - Apache HTTP Server - traditional non-threaded model
apache2-mpm-worker - Apache HTTP Server - high speed threaded model
I am not sure what apache2-mpm-event is, but apache2-mpm-prefork and apache2-mpm-worker, when installed, will automatically uninstall (!) the other one and automatically change /etc/apache2/apache2.conf to turn on the respective mode, like this:
# worker MPM
# StartServers: initial number of server processes to start
# MaxClients: maximum number of simultaneous client connections
# MinSpareThreads: minimum number of worker threads which are kept spare
# MaxSpareThreads: maximum number of worker threads which are kept spare
# ThreadsPerChild: constant number of worker threads in each server process
# MaxRequestsPerChild: maximum number of requests a server process serves
<IfModule mpm_worker_module>
StartServers 2
MaxClients 150
MinSpareThreads 25
MaxSpareThreads 75
ThreadsPerChild 25
MaxRequestsPerChild 0
</IfModule>

All that remains is to add somewhere in the Apache configuration
PythonOption mod_python.session.session_type MemorySession
... and verify it in Python code
if req.get_options().get('mod_python.session.session_type') == 'MemorySession' \
   and apache.mpm_query(apache.AP_MPMQ_IS_THREADED) != 0 :
    # using persistent in-memory storage
    session = Session.Session(req)
And that's it: the processing which used to take half a minute is now done in a fraction of a second!
Labels: apache, debian, development, linux, mod-python, python, server
Time delta in Python
By way of example, if you want to measure time delta between two events, there are two ways to do it. Run an interactive Python session and try it yourself:
Unix% python
Python 2.5.2 (r252:60911, Jan 24 2010, 14:53:14)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import time, datetime
>>> x = time.time(); y = datetime.datetime.now()
>>> time.time() - x; (datetime.datetime.now() - y).seconds + (datetime.datetime.now() - y).microseconds/1000000.0
11.006932973861694
11.007103000000001
(In Python 2.7 they at least added the method timedelta.total_seconds(), making the two approaches slightly more consistent.)
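With total_seconds() available, the datetime-based measurement no longer requires combining .seconds and .microseconds by hand; a minimal sketch:

```python
import datetime
import time

start = datetime.datetime.now()
time.sleep(0.05)
delta = datetime.datetime.now() - start
# total_seconds() folds days, seconds and microseconds into one float
elapsed = delta.total_seconds()
assert elapsed >= 0.05
```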
Labels: development, python
Monday, December 21, 2009
Parsing Windows-encoded CSV file, again
This post continues a topic "how to deal with UTF-16-LE encoding", see this post.
If writing UTF-16-LE is no picnic, reading is even more challenging, unless you want to read the whole file as one line. If you prefer, or are forced, to use a per-line approach, then you'd better be aware that a properly encoded UTF-16-LE file includes a zero byte '\x00' after the end-of-line symbol; that is, every line ends with the four bytes 0x0d 0x00 0x0a 0x00, or if you will '\r\x00\n\x00'. However, when reading line by line, processing stops at '\n', and the following '\x00' is interpreted as belonging to the next line!
To rectify this problem, you can use this Python-based "generator" to read a UTF-16-LE-encoded file line by line, yielding each line re-encoded as UTF-8:
def readiterator(file) :
    fh = open ( file, "rb" )
    for line in fh :
        # a lone '\x00' is the orphaned tail of the file's last line
        if line == '\x00' : continue
        if line[:2] == '\xff\xfe' :
            # first line: strip the BOM, restore the trailing '\x00'
            # that was split off into the next line
            line = line[2:] + "\x00"
        else :
            # other lines: strip the leading leftover '\x00', restore
            # the trailing one
            line = line[1:] + "\x00"
        res = unicode ( line, "utf_16_le" )
        yield res.encode ( "utf-8" )
    fh.close ()
One typical application of this would be parsing a CSV file (e.g., the result of an Excel CSV export) using Python's built-in "csv" module, which has no knowledge of encoding, though it luckily does accept the UTF-8-encoded input produced above:
cvsreader = csv.reader(readiterator(input_csv_file))
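For what it's worth, on Python 3 (or via Python 2's io module) the manual byte juggling can be avoided entirely: opening the file with encoding="utf-16" makes the io layer strip the BOM and decode each line, so the csv module never sees the stray '\x00' bytes. A sketch, using a throwaway temp file:

```python
import csv
import io
import os
import tempfile

# Create a sample UTF-16 CSV file with BOM and DOS line endings
path = os.path.join(tempfile.mkdtemp(), "test.csv")
with io.open(path, "wb") as fh:
    fh.write(u"a,b\r\nc,d\r\n".encode("utf-16"))  # "utf-16" prepends the BOM

# The io layer decodes UTF-16 (consuming the BOM); csv sees clean text
with io.open(path, "r", encoding="utf-16", newline="") as fh:
    rows = list(csv.reader(fh))
assert rows == [["a", "b"], ["c", "d"]]
```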
If you decide (for example) to make changes to the table and save it again as a CSV file, you'll quickly discover that you can't similarly use csv.writer() directly, since it does not work with unicode strings at all. You will have to play a trick taken directly from the official Python documentation: first convert to UTF-8 and dump the CSV to a temporary string, and then read this string back and convert to unicode. Here is one way to do that:
class MyCSVWriter:
    # requires "import csv, StringIO"
    def __init__ (self, file_writer) :
        self.stream = file_writer
        self.queue = StringIO.StringIO ()
        self.writer = csv.writer(self.queue)
    def writerow (self, row) :
        # dump the row as UTF-8 into the temporary buffer...
        self.writer.writerow([s.encode("utf-8") for s in row])
        # ...then read it back, convert to unicode and pass downstream
        self.stream.write(unicode(self.queue.getvalue(), "utf-8"))
        self.queue.truncate(0)
    def close(self) :
        self.stream.close ()
Of course, you still need a backend to dump unicode data as UTF-16-LE-encoded file:
class MyFileWriter:
def __init__ ( self, file ) :
self.fh = open ( file, "wb" )
self.lineno = 0
def write (self, line) :
if self.lineno == 0 :
self.fh.write ( '\xff\xfe' )
self.lineno += 1
self.fh.write ( line.encode ( "utf_16_le" ) )
def close (self) :
self.fh.close()
These two classes finally make it possible to create a CSV "writer" which can be used to write back the data just retrieved by the aforementioned "reader":
csvwriter = MyCSVWriter(MyFileWriter(output_csv_file))
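Again, on Python 3 the csv module works with text directly, so the StringIO round-trip becomes unnecessary: opening the output file with encoding="utf-16" makes the io layer emit the BOM and do the conversion. A sketch under that assumption:

```python
import csv
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.csv")
# The io layer handles the encoding; csv.writer just sees text
with io.open(path, "w", encoding="utf-16", newline="") as fh:
    csv.writer(fh).writerow([u"Проверка", u"test"])

with io.open(path, "rb") as fh:
    raw = fh.read()
assert raw[:2] in (b"\xff\xfe", b"\xfe\xff")  # BOM written automatically
```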
All of these code snippets are taken from a utility parsegab.py which I wrote to make some very specific changes to Google Address Book, using the workflow "export – fix – erase all – import back".
Labels: CSV, python, unicode, windows
Monday, September 21, 2009
Auto-generating JavaScript code
Languages which are trying to be smarter make it harder to auto-generate code.
Indeed, I wrote a small piece of code to output a JavaScript Array, something along these lines (in Python):
write ("var myarr=Array(" + ",".join([str(o.id) for o in objects]) + "); \n")
Of course, it was working just fine until I encountered a case where the length of the incoming Python list objects is exactly one, at which point it immediately broke, because the meaning of the JavaScript initialization
var myarr=Array(10);
is not at all what the above piece of code was silently expecting.
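(Array(10) creates an empty array of length 10, not a one-element array.) One way to sidestep the constructor's special case is to emit an array literal instead; a sketch of such a generating function (names are illustrative):

```python
def js_array(ids):
    # An array literal is unambiguous: "[10]" is always a one-element
    # array, whereas "Array(10)" means "an empty array of length 10"
    return "var myarr=[" + ",".join(str(i) for i in ids) + "];\n"

assert js_array([10]) == "var myarr=[10];\n"
assert js_array([1, 2, 3]) == "var myarr=[1,2,3];\n"
```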
Labels: development, javascript, python
Tuesday, August 25, 2009
Potential problems with mutable default arguments in Python
Just spent about half an hour tracking down a mysterious problem in one of my web servers implemented in Python, where the same page occasionally showed obviously duplicated info in some table cells while loading correctly on other invocations.
The bug was, of course, a rather stupid one. I was using my own implementation of a container and wanted to initialize it with a regular array, like this:
class Container :
def __init__ (self, initial_arr) :
self.content = initial_arr
.........................................
a = Container( ["one","two","three"] )
Of course, this was already problematic enough, since I was initializing my container with the (original) reference and not a copy, but provided that a consumer of this class is aware of this behavior, it is not incorrect per se.
The real problem emerged when, naturally, I wanted to provide a default initialization to an empty container, and did it like this:
class Container :
def __init__ (self, initial_arr=[]) :
self.content = initial_arr
So that I could write
a = Container()
With this "improvement", this is no longer just poor style, but a clear bug: a new instance of the class is initialized with a reference to the default argument, and if the instance's content is modified, so is the default argument, in effect carrying the changes over to the next, unrelated, instance.
No wonder I was seeing duplicated cells returned by server…
Here is the complete script which illustrates the problem:
#! /usr/bin/env python
class Container :
def __init__ (self, initial_arr=[]) :
self.content = initial_arr
def __repr__(self) :
return repr(self.content)
def append(self,elm) :
self.content.append(elm)
def foo (use_default) :
if use_default :
a = Container ()
else :
a = Container ( ["one","two","three"] )
a.append ( "four" )
print "a = %r" % a
print "\nUsing explicit argument"
foo (use_default=False)
foo (use_default=False)
print "\nUsing default argument"
foo (use_default=True)
foo (use_default=True)
This script generates the following output:
Using explicit argument
a = ['one', 'two', 'three', 'four']
a = ['one', 'two', 'three', 'four']

Using default argument
a = ['four']
a = ['four', 'four']
The ideal solution would have been to force the default argument (if mutable) to be read-only. Unfortunately, Python does not support read-only values.
Lacking that, the only option available is to never use mutable values as default arguments. For example, the above script could have been written like this:
class Container :
def __init__ (self, initial_arr=None) :
if initial_arr is None : initial_arr = []
self.content = initial_arr
.........................................
With this simple update, the class behaves "as intended" (whether this is good behavior to begin with is another question entirely).
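A quick check that the None-default version really isolates instances from one another:

```python
class Container:
    def __init__(self, initial_arr=None):
        # A fresh list is created on every call, so instances no longer
        # share state through the default argument
        if initial_arr is None:
            initial_arr = []
        self.content = initial_arr

a = Container()
a.content.append("four")
b = Container()
assert b.content == []            # unaffected by changes made through "a"
assert a.content == ["four"]
```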
Update (14-Sep-09). One more manifestation of the same problem is initializing class attributes to mutable values outside of the __init__() method. For example, this code
class Foo :
m_x = [] # DANGEROUS!!!
def __init__ (self) :
self.m_y = [] # This is much better!
def append(self,obj) :
self.m_x.append (obj)
self.m_y.append (obj)
def __repr__ (self) :
return "Foo(m_x=%r,m_y=%r)" % (self.m_x,self.m_y)
v = Foo ()
v.append ("A")
w = Foo ()
w.append ("B")
print "v = %r, w = %r" %(v,w)
generates the output
v = Foo(m_x=['A', 'B'],m_y=['A']), w = Foo(m_x=['A', 'B'],m_y=['B'])
Labels: development, python
Wednesday, July 22, 2009
Inline JavaScript
I write a lot of Web interfaces in Python. Occasionally the generated Web pages use moderately complicated JavaScript code. In this situation, you can do one of two things:
- Insert JavaScript code in Python source as a (multi-line) string;
- Use a separate JavaScript file.
The first approach is a good one when there are relatively few lines of JavaScript code, while the second approach makes sense if the stand-alone file can be re-used.
However, neither approach works well if the JavaScript logic is complicated but specific to the current page.
A possible alternative is to keep the JavaScript in a separate source file in the repository, but embed a copy of it into the Python files immediately before check-in.
This solution could be used directly, or combined with some JavaScript compression and/or obfuscation technology.
First, about compression and obfuscation.
I found it most productive to use a combination of two separate tools: the popular compression tool JSMin (available in 10 different languages; I am using the Python version) and ShrinkSafe, an obfuscation/compression tool written in Java.
Provided that you have Sun Java available, js.jar (the Rhino "JavaScript in pure Java" engine from Mozilla) is somewhere in your CLASSPATH, and jsmin.py is in your PATH, the complete process looks like this:
java -jar <path>/shrinksafe.jar <original>.js | jsmin.py > <new>.js
The next step is to use a Python utility jsprocess.py which I wrote. It reads a list of Python input files and looks for all calls to the utility inline_js() (which must not exist other than the version created by a previous run of jsprocess.py). For every call which looks like inline_js("file_name.js"), e.g.
fh.write ( inline_js("docform.js") )
it will append to the end of the file a definition of inline_js() which embeds a compressed/obfuscated copy of "docform.js". If the function inline_js() is already defined (anywhere), it will be replaced.
Labels: development, javascript, python
Monday, April 28, 2008
Dealing with "native" Windows encoding
Microsoft Windows and other Microsoft products, like Microsoft Office, use the encoding "UTF-16LE" by default; when they offer you multiple choices of encoding, they call it simply "Unicode". If the goal is to generate Unicode files which can be opened by all Microsoft applications, these had better be in UTF-16LE.
Multiple languages and libraries offer built-in conversion to UTF-16LE; however, one must be aware of two potential problems with that: (1) the standard 2-byte byte-order mark (0xFF 0xFE) that Windows expects at the start of the file (and writes on output), and (2) a potential problem with the built-in DOS line-ending mode ("text mode"): files must be written in "binary" mode.
The proper way to create a UTF-16LE file in Python would be this:
fh = open ( "Test.txt", "wb" )
fh.write ( "\xff\xfe")
fh.write ( u"Проверка\r\n".encode("UTF-16LE" ) )
fh.close()
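The same can be expressed through the io layer, letting the codec do the conversion while the byte-order mark is written as the U+FEFF character; a sketch using a throwaway temp file:

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "Test.txt")
# newline="" disables newline translation (the "binary mode" concern);
# u"\ufeff" encodes to the 0xFF 0xFE byte-order mark under UTF-16-LE
with io.open(path, "w", encoding="utf-16-le", newline="") as fh:
    fh.write(u"\ufeff" + u"Проверка\r\n")

with io.open(path, "rb") as fh:
    raw = fh.read()
assert raw == b"\xff\xfe" + u"Проверка\r\n".encode("utf-16-le")
```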
Labels: python, unicode, windows
Monday, October 16, 2006
do-while loops in Python
It just occurred to me today that apparently Python does not have a "do-while" type of loop, or any equivalent. The most obvious usage for this would be :
f = open ( filename )
do
    buf = f.read(blocksize)
    do_something_with_buf(buf)
while buf
Suggestions on how to handle situations like this in Python vary, from using "break" in a traditional "while" loop to avoiding (any form of) while altogether, using a Python-style "for" instead.
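The break-in-while variant is the closest idiomatic substitute; a self-contained sketch of the read loop above (using an in-memory stream for illustration):

```python
import io

def read_blocks(stream, blocksize):
    # "while True" plus a post-condition "break" runs the body at least
    # once, just like a do-while loop would
    blocks = []
    while True:
        buf = stream.read(blocksize)
        blocks.append(buf)
        if not buf:
            break
    return blocks

blocks = read_blocks(io.BytesIO(b"abcdefgh"), 3)
assert blocks == [b"abc", b"def", b"gh", b""]
```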
Apparently, there is PEP (Python Enhancement Proposal) 315, pending since 2003, that requests just that: a "do-while" loop implementation. It was considered at some point as a possible candidate for Python 2.5, but was rejected as not that important, and this comment and a recent discussion on the python-dev mailing list revealed a lack of consensus on a simple, implementable, Python-style syntax for "do-while" loops.
It seems, therefore, that this will have to wait (at best) till the "Python 3000" version of Python (a.k.a. "Py3k" or sometimes even "p3yk"), which is kind of sad.
BTW, this presentation has a nice overview of Py3k plans and ideas.
Labels: python
Wednesday, May 10, 2006
Installing Python + ZSI on Windows
Now, who could ever think that in setting up my new laptop the most difficult thing would be to set up Python?
To be sure, Python is a very nice language, completely dynamic and OO, which allows you to implement a fast prototype of a complex object interaction. Then you might want to re-implement it in a more "static" language like Java, or just leave it in Python if speed and reliability aren't among your first priorities.
However, another very nice feature of Python is a very well-done and well-supported Windows port, including a native Windows installer. There is a price to pay for this beauty: each Python release is built with a specific Visual Studio C++/.NET version; e.g., all 2.4.* releases (the latest stable releases at this moment) are built with Visual Studio .NET 2003 (internal version "7.1"), whereas the 2.3.* releases were built with Visual C++ 6.0 (internal version "6.0"). Meanwhile, the latest suite from Microsoft (right now) is Visual Studio .NET 2005 (internal version "8.0"), and this is exactly what I have (by default) installed on my new laptop.
Praises to the Python Windows port above notwithstanding, the file msvccompiler.py, part of the standard Python distribution, does not do the best possible job of detecting the user's Visual Studio environment. It did not occur to the author that the latest version it knows about (7.1) would sooner or later be superseded by a newer one; as a result, on my laptop an attempt to install any Python extension that contains C code fails with the dubious message "The .NET Framework SDK needs to be installed before building extensions for Python"; a message sure to puzzle someone who knows damn well that the .NET SDK is installed on his machine...
As a final remark, I must say that I am using Python for (effectively) RPC calls via TCP/IP using SOAP and Python extension called ZSI (along with mod-python on the server). I was using version 1.7 of ZSI, which only worked for me after applying simple patch to the client code.
Anyway, let me without further ado present my sequence of actions:
- Installed the binary distribution of Python for Windows; latest stable release 2.4.3;
- Downloaded and installed latest ZSI build 2.0rc2 (no C code so installed flawlessly); I noticed that client code has changed dramatically since 1.7 so that my patch may be no longer required;
- Run test script. It appears to fail because the API (specifically function ZSI.client.Binding) changed in an incompatible way. What's more, there is no API to tell me version number, so there is no simple way to write client code compatible with both 1.7 and post-1.7 ZSI API. After a while, I solve this problem by parsing function documentation string Binding.__init__.__doc__;
- Run test script. It fails complaining that it cannot load "xml.dom.ext.reader";
- This is actually very peculiar, since Python is of course well-equipped with XML DOM parsers; but yes, I vaguely remember that indeed for some mysterious reasons ZSI depends on an external expat-based XML parser;
- OK, I go ahead and download the latest source release of PyXML (I mistakenly think binaries are not available for Python 2.4 since this version of PyXML is rather old, but in fact they are);
- Build fails with message "The .NET Framework SDK needs to be installed before building extensions for Python" (see above);
- I try to modify file msvccompiler.py to convince it to use my installed version of Visual Studio. After a while, it does work and installation of PyXML succeeds;
- Test script now crashes Python executable. This is perhaps related to incompatibilities of two dynamic runtimes that Python itself (7.1) and PyXML are trying to load;
- I download source distribution of Python (2.4) and try to compile it from source using Visual Studio 2005. It builds simple python.exe and it crashes on startup, invoking debugger and stalling build;
- I remove all previous installations of Python and install the older Python version 2.3 from scratch (binaries) along with the Visual C++ 6.0 environment;
- Following the steps described above, both ZSI (2.0-rc2) and PyXML now install successfully;
- Test script fails somewhere in ZSI client code. An attempt to debug it reveals that function Binding::RPC is called from _Caller with (default) argument replytype=None, which then fails in parsing. An attempt to fix this (TC.Any()) improves the result a little bit, but not all that much. It appears that ZSI changed XML marshalling logic and thus I cannot have post-1.7 client and 1.7 server;
- I try to install older version of ZSI (1.7 + my patch) above the previously installed (Python extension installation mechanism does not give me any simple way to uninstall); this results in empty SOAP message being passed to the server;
- Desperate, I simply erase the sub-directory d:\Python23\Lib\site-packages\ZSI and reinstall ZSI 1.7;
- Run the test script; finally it works.
Labels: python, SOAP, visual studio, windows
Thursday, February 02, 2006
Controlling ssh from (python) script
It appears ssh only reads its password from the TTY, making it difficult to supply one via a script. Here is a possible solution in Python; the trick is, of course, to create a pseudo-TTY:

#!/usr/bin/env python
# Remote command through SSH using user/password
import os, time

def pause(d=0.2):
    time.sleep(d)

def rcmd(user, rhost, pw, cmd):
    # Fork a child process, using a new pseudo-terminal as the child's
    # controlling terminal
    pid, fd = os.forkpty()
    # If child, execute the external process
    if pid == 0:
        os.execv("/bin/ssh", ["/bin/ssh", "-l", user, rhost] + cmd)
    # If parent, read/write with the child through the file descriptor
    else:
        pause()
        # Get password prompt; ignore
        os.read(fd, 1000)
        pause()
        # Write password
        os.write(fd, pw + "\n")
        pause()
        res = ''
        # Read response from child process
        s = os.read(fd, 1)
        while s:
            res += s
            s = os.read(fd, 1)
        return res

# Example: execute ls on server 'serverdomain.com'
print rcmd('username', 'serverdomain.com', 'Password', ['ls -l'])
Labels: python
