Thursday, February 17, 2011
In my mod-python/Apache based server implementation, at some point I generate a certain amount of data (from a MySQL database) that is used in processing user input. This generation takes a while (~10 secs on my development server, ~30 secs on the production VPS-based server), and the data is virtually unchanged from one invocation to the next. Why can't I save it in a disk cache and use that instead? OK, I tried just that.
You can imagine my surprise when I realized that reading from the cache took exactly the same time as the original data generation! How is that even possible?
It turned out that the reason for this phenomenon was that my data consisted mostly of compiled Python RE's (regular expressions). However, Python's underlying serialization engine ("pickle") cannot save the compiled form of an RE; when you pickle one, only the pattern source and flags are stored, and on the pickle.load() call these RE's are automatically (re-)compiled. This is what I experienced: most of the time which I initially attributed to DB access and processing of the results was in fact spent compiling RE's. When I switched to a file cache, I gained nothing: the same RE's were recompiled again on cache load.
Why such a limitation? No one is really sure; however, this is exactly how things are at the moment:
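To see this for yourself, here is a minimal sketch using only the standard library (Python 3 syntax; the sample pattern is made up for illustration). It looks up the reducer that the re module registers with copyreg for pattern objects and shows that it hands pickle a constructor plus the pattern source and flags, not any compiled state.

```python
import copyreg
import pickle
import re

# Hypothetical sample pattern, used only for illustration.
compiled = re.compile(r"(\d+)-(\w+)")

# The re module registers a reducer for pattern objects with copyreg.
# The reducer returns a constructor and its arguments (source, flags),
# so the compiled state itself never reaches the pickle stream.
reducer = copyreg.dispatch_table[type(compiled)]
constructor, args = reducer(compiled)
print(constructor.__name__, args[0])

# Round trip: loading calls the constructor again, i.e. it recompiles
# (or hits re's small in-process cache, which a fresh process lacks).
payload = pickle.dumps(compiled)
restored = pickle.loads(payload)
print(restored.pattern == compiled.pattern)
```

In a fresh server process the internal cache of the re module is empty, so every pattern loaded from a disk cache pays the full compilation cost again.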
Q: Is it possible to store these regular expressions in a cache on disk in a pre-compiled manner to avoid having to execute the regex compilations on each import?
A: Not easily. You'd have to write a custom serializer that hooks into the C sre implementation of the Python regex engine. Any performance benefits would be vastly outweighed by the time and effort required.
So what should or could I do to optimize my server? It is really annoying to have to wait half a minute every time for a server to generate the same data over and over again.
Looking for an answer to this question, I turned to the mod-python documentation, section on session management. It appears that mod-python supports three kinds of session management engines: MemorySession, DbmSession and FileSession. Naturally, MemorySession keeps session data in the memory of the server process, whereas DbmSession and FileSession are essentially different ways to provide disk-based caching.
Now, there is little doubt that said disk-based caching will internally use Python's standard pickle engine, which would take me back exactly to square one.
Can I use MemorySession? The mod-python "Session" implementation can determine by itself which session engine to use. The documentation has this to say regarding when MemorySession is chosen:
If session type option is not found, the function queries the MPM and based on that returns either a new instance of DbmSession or MemorySession. MemorySession will be used if the MPM is threaded and not forked (such is the case on Windows), or if it threaded, forked, but only one process is allowed (the worker MPM can be configured to run this way). In all other cases DbmSession is used.
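The rule quoted above can be restated as a small decision function. This is only a sketch of the documented behaviour in plain Python, not mod-python's actual source; the function name and its parameters are hypothetical.

```python
def default_session_type(threaded, forked, max_processes):
    """Paraphrase of the documented default when no session type is set.

    threaded, forked: truthy/falsy answers about the MPM.
    max_processes: how many server processes the MPM may run.
    All three names are assumptions made for this illustration.
    """
    if threaded and (not forked or max_processes == 1):
        # One process with many threads: memory is shared by all requests.
        return "MemorySession"
    # Several processes: each has its own memory, so fall back to disk.
    return "DbmSession"


print(default_session_type(True, False, 1))   # threaded, not forked
print(default_session_type(True, True, 1))    # worker MPM, single process
print(default_session_type(True, True, 6))    # worker MPM, several processes
print(default_session_type(False, True, 10))  # prefork MPM
```

The point of the rule is that an in-memory store only makes sense when every request lands in the same process.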
What is it talking about, and what is "MPM"? It took me a while to figure that out. MPM in fact stands for Multi-Processing Module, and it has to do with how Apache distributes incoming requests. Borrowing an excellent explanation from here:
Apache can operate in a number of different modes dependent on the platform being used and the way in which it is configured. This ranges from multiple processes being used with only one request being handled at a time within each process, to one or more processes being used with concurrent requests being handled in distinct threads executing within the same process or distinct processes.
The UNIX "prefork" Mode
This mode is the most commonly used. It was the only mode of operation available in Apache 1.3 and is still the default mode on UNIX systems in Apache 2.0 and 2.2. In this setup, the main Apache process will at startup create multiple child processes. When a request is received by the parent process, it will be handed off to one of the child processes to be handled.
The UNIX "worker" Mode
The "worker" mode is similar to "prefork" mode except that within each child process there will exist a number of worker threads. Instead of a request being handed off to the next available child process with the handling of the request being the only thing the child process is doing, the request will be handed off to a specific worker thread within a child process with other worker threads in the same child process potentially handling other requests at the same time.
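Why this matters for caching: threads inside one process share memory, so an object built by one request is visible to all the others. The following is a toy sketch with plain Python threads standing in for the worker threads of a single child process; the cache, the sample patterns and the handler function are all hypothetical and have nothing to do with mod-python itself.

```python
import re
import threading

_cache = {}                      # lives as long as the process does
_cache_lock = threading.Lock()   # guards the one-time build
build_count = 0                  # how many times the slow build ran


def get_patterns():
    """Return the shared compiled patterns, building them only once."""
    global build_count
    with _cache_lock:
        if "patterns" not in _cache:
            build_count += 1
            # Stand-in for the slow step (DB query plus RE compilation).
            _cache["patterns"] = [re.compile(p) for p in (r"\d+", r"[a-z]+")]
        return _cache["patterns"]


def handle_request(results, index):
    """Stand-in for a request handler running in a worker thread."""
    results[index] = get_patterns()


results = [None] * 8
threads = [threading.Thread(target=handle_request, args=(results, i))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(build_count)
```

With separate processes, as in "prefork" mode, each process would have its own copy of the cache and would have to run the slow build for itself.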
You can find out which mode is used with these function calls (from a mod-python request handler):
threaded = apache.mpm_query(apache.AP_MPMQ_IS_THREADED)
forked = apache.mpm_query(apache.AP_MPMQ_IS_FORKED)
The last question is, how to switch from one mode to another? In Debian, simply install one of the apache2-mpm-XXX packages:
debian-linux% apt-cache search ^apache2-mpm
apache2-mpm-itk - multiuser MPM for Apache 2.2
apache2-mpm-event - Apache HTTP Server - event driven model
apache2-mpm-prefork - Apache HTTP Server - traditional non-threaded model
apache2-mpm-worker - Apache HTTP Server - high speed threaded model
I am not sure what apache2-mpm-event is, but installing either apache2-mpm-prefork or apache2-mpm-worker will automatically uninstall (!) the other one and automatically change /etc/apache2/apache2.conf to turn on the respective mode, like this:
# worker MPM
# StartServers: initial number of server processes to start
# MaxClients: maximum number of simultaneous client connections
# MinSpareThreads: minimum number of worker threads which are kept spare
# MaxSpareThreads: maximum number of worker threads which are kept spare
# ThreadsPerChild: constant number of worker threads in each server process
# MaxRequestsPerChild: maximum number of requests a server process serves
<IfModule mpm_worker_module>
    StartServers 2
    MaxClients 150
    MinSpareThreads 25
    MaxSpareThreads 75
    ThreadsPerChild 25
    MaxRequestsPerChild 0
</IfModule>
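A quick bit of arithmetic on those numbers, as a sketch: for the worker MPM the upper bound on child processes is MaxClients divided by ThreadsPerChild. The helper name is hypothetical, and the sketch ignores any additional cap such as ServerLimit.

```python
def max_worker_processes(max_clients, threads_per_child):
    """Upper bound on worker MPM child processes (ignoring ServerLimit)."""
    return max_clients // threads_per_child


# Values from the default configuration shown above.
default_processes = max_worker_processes(max_clients=150, threads_per_child=25)

# Assumed alternative: MaxClients equal to ThreadsPerChild.
single_process = max_worker_processes(max_clients=25, threads_per_child=25)

print(default_processes, single_process)  # prints "6 1"
```

So with the default values Apache may run up to six processes, each with its own memory; setting MaxClients equal to ThreadsPerChild is one way to arrive at the "only one process is allowed" case mentioned in the documentation quote.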
All that remains is to add the following somewhere in the Apache configuration:
PythonOption mod_python.session.session_type MemorySession
... and verify it in Python code
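from mod_python import apache, Session

# Option values are strings, so compare against the name 'MemorySession'.
if req.get_options().get('mod_python.session.session_type') == 'MemorySession' and \
        apache.mpm_query(apache.AP_MPMQ_IS_THREADED) != 0:
    # using persistent storage
    session = Session.Session(req)
    ......................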
And that's it - the processing which used to take half a minute is now done in a fraction of a second!