I've been meaning to post this here - a week ago I was in charge of the monthly Baypiggies night in Mountain View. I'd suggested on the mailing list that we have a "Tools Night" and of course was "volunteered" to do a presentation and round up the other speakers. We had a great night (and a great post-meeting dinner at the Tied House) and I've put my slides up on my blog over at simeonfranklin.com. Check it out if you're interested in virtualenv (python development environments), pip (a new python installer which plays nicely with virtualenv and has some very nice features for developers), or fabric (the Pythonic remote deployment tool)...
I abandoned Baypiggies for an evening last week and instead went to the first meeting of the new DjangoSF group. Eric Florenzano started a mailing list on November 30th and a meetup group a few days later and on December 10th some forty-odd people crowded into a conference room at Six Apart.
I was a few minutes late (I missed the BART train I wanted and had to jog from the Powell St station back six or seven blocks to Six Apart). The sign on the door said to text Eric to come down and open the door...
Once I got up to the conference room it was standing room only - I'd guess about 6-10 people ended up standing in the two entrances. I had a decent view of the slides as Andrew Badr of Disqus (who provide the comments for my blog) talked about scaling Django.
All in all this was a great time. It's presenting me with some problems, however, as I'd like to go back to DjangoSF but definitely can't make both Baypiggies and DjangoSF on back-to-back nights. Additionally I found out after the Baypiggies meeting that we're losing the Google Mountain View location as host. This could be good for me - I'd love some place accessible from BART so I don't have to drive quite so far... I haven't heard anything about a new spot yet though; if any readers have access to a largish conference room in the Bay Area I'd love to hear about it.
I'm also conflicted because I volunteered to head up a Python dev tools night in March. So far I have two speakers and am considering doing a talk myself on using Fabric/Pip/Virtualenv to package and deploy Python server apps and environments. I'm just switching from easy_install to Pip (after seeing James Bennett's recent posts on the virtues of Pip) and have been using Fabric to deploy for a while. I've only been using virtualenv on my dev box but am intrigued by the idea of easily packaging a whole virtualenv environment with Pip. I'm still soliciting talks (no text editors please - tools should be things you use to develop, deploy, test, document, etc. your Python applications...) so comment if you'd like to give a talk and are going to be in the vicinity of the Bay Area on March 10th...
Anyways, with that on the schedule I can hardly abandon Baypiggies... I may just have a full schedule of user group attendance in the next year...
Aaaaaaaaargh!
Ok - I feel a little better.
I hate computers. Specifically I hate being a systems administrator. I don't really have the experience or aptitude to be a sysadmin - while I don't mind debugging my own software (whose inner workings I designed), tweaking software settings trying to get something to work right is just frustrating to me. In the interest of sparing my humble reader any of my pain...
Some of the sites I run are on a VPS. It's a basic cPanel/CentOS setup which comes with the ancient, venerable Apache 1.3. I've got some sites running PHP on it and a few running Django. Now fast_cgi wasn't installed, upgrading to Apache 2.x wasn't an option (I'm guessing it would cause cPanel breakage) and when I first set this up a year ago I wasn't aware of mod_wsgi. Plus I was in the mood to experiment. So... I have lighttpd running solely in order to fastcgi my Django instances and a mod_rewrite rule to proxy requests that should be dynamic (everything not starting with /media/ or /admin_media/) from Apache to the lighttpd instances. It's more or less worked for me and let me play around with the Django deployment side of things without messing with the stable and working Apache. A sample lighty section and .htaccess file follow:
.htaccess:
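# Skip real files and the fcgi script itself; proxy everything else
# (the %{...} variable names here are assumed)
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_URI} !(django.fcgi)
RewriteRule ^(.*)$ http://127.0.0.1:9006/$1 [QSA,P]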
and lighttpd.conf
$SERVER["socket"] == "127.0.0.1:9006" {
    server.document-root = "/home/simeon/public_html"
    fastcgi.server = (
        "/lighty.fcgi" => (
            "main" => (
                "socket" => "/home/simeon/mysite.sock",
                "check-local" => "disable",
            )
        ),
    )
    url.rewrite-once = (
        "^(/.*)$" => "/lighty.fcgi$1",
    )
}
Pretty simple stuff. And that's the way I like it (did I mention that I'm not a sys-admin?)
Ok - the first problem I just solved wasn't all that painful and was really my own fault to boot. I've been running my Django sites off of trunk but stopped tracking back when NFA merged. I figured I'd wait for 1.0 before doing any backwards compatibility breaking stuff. So the arrival of 1.0 was great for me - I'd played with NFA and set up a couple of Django sites on dedicated servers with mod_python. No surprises, no problems.
Until I went to deploy a 1.0 site on the VPS with Apache proxied to lighty. All of a sudden the lighty.fcgi started showing up as the root url for my Django site. The {% url %} tag was prefixing paths ( /lighty.fcgi/foo/ instead of just plain /foo/). I tried modifying the lighttpd.conf file to use a blank string or just "/" as my root url, I tried adding uri-strip clauses to the config... Nothing worked.
Let's stop right there to see how the 45 minutes of pain are all my fault so far. Where do we go, boys and girls, when we upgrade Django and stuff that used to work no longer does? That's right - the official list of Backwards Incompatible Changes. Now in fairness this list is getting pretty long (which is what happens when you go so long between releases - things should be better going forward) but halfway down the list is item 52 - "Changed the way URL paths are determined" - which explains that various servers break the SCRIPT_NAME and PATH_INFO variables in various ways. Django now does the right thing by paying attention to SCRIPT_NAME but has introduced a new setting, FORCE_SCRIPT_NAME, so that you can override this if your particular choice of server software is doing dumb things. The added line to my settings.py
FORCE_SCRIPT_NAME = "" #not "/" as they suggested, interestingly enough
put me back in business. No more phantom script names.
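To picture what was going on, here is a toy sketch in Python. It is not Django's actual code: build_url and its arguments are made-up names for illustration only. It shows how a script-name prefix reported by the server ends up glued onto every generated URL, and how an empty-string override makes it disappear:

```python
def build_url(path, script_name, force_script_name=None):
    """Toy model: prepend the script prefix to an application path.

    force_script_name plays the role of a FORCE_SCRIPT_NAME-style override:
    when it is not None it replaces whatever the server reported.
    """
    prefix = script_name if force_script_name is None else force_script_name
    # Avoid a doubled slash when the prefix ends with "/"
    return prefix.rstrip("/") + path


# The server reports the fcgi script as the script name...
print(build_url("/foo/", "/lighty.fcgi"))      # /lighty.fcgi/foo/
# ...and an empty override removes the phantom prefix.
print(build_url("/foo/", "/lighty.fcgi", ""))  # /foo/
```

In this toy version "" and "/" behave the same way; in my real setup only the empty string did the trick.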
The next problem was harder and I still don't understand what's wrong. I have fixed it however... and I hope to spare some future tormented soul a bout of frenzied swearing. Here's the first manifestation of the problem I noticed: my Django sites are slow! This isn't a performance issue - memory usage is fine and the processor isn't loaded at all. It takes 5 seconds, however, for even a simple page to display. All the static media (served by Apache) comes across fast (100ms), but the main request that is proxied to lighttpd takes 5 seconds!
After some poking around I discover that running the apache benchmark tool directly on the lighty instance (ab -n10 http://127.0.0.1:9006/) from a shell session on the VPS shows millisecond response times (<100ms) but running it on the actual domain (ab -n10 http://simeonfranklin.com/) generates response times that are all greater than 5 seconds! The issue isn't Django, it isn't lighty, it's the Apache proxy => lighty interaction that is somehow causing the slowdown.
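If ab isn't handy, roughly the same comparison can be sketched in Python. This is only a minimal sketch: mean_latency is a made-up helper name, and the commented-out usage assumes the two URLs from above are reachable from wherever you run it.

```python
import time


def mean_latency(fetch, n=10):
    """Call fetch() n times and return the mean wall-clock seconds per call."""
    total = 0.0
    for _ in range(n):
        start = time.perf_counter()
        fetch()
        total += time.perf_counter() - start
    return total / n


# Example use (commented out because it needs a live server):
# from urllib.request import urlopen
# direct = mean_latency(lambda: urlopen("http://127.0.0.1:9006/").read())
# proxied = mean_latency(lambda: urlopen("http://simeonfranklin.com/").read())
# print(direct, proxied)
```

Timing the direct and the proxied URL side by side is what points the finger at the hop in between rather than at either server.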
How do you go about troubleshooting this? My error logs are clean so I don't have any immediate clues. Googling "apache proxy slow" yields a host of non-helpful complaints. I started thinking about things that might be causing a delay and spent some time turning off KeepAlives and checking every setting that involved keeping an HTTP connection open. No joy.
Eventually (I'm seriously embarrassed to admit how much time I've spent on this) I find a post to the zope mailing list with a possible answer - if Apache is forced to do DNS resolution it may cause consistent time delays on proxy requests. Aha! I add the FQDN to my /etc/hosts file, verify my resolv.conf, check my httpd.conf file to make sure there aren't any DNS problems... Still no joy! I replace hostnames with IP addresses and fiddle with Apache settings to keep it from doing DNS lookups. Still the same frustrating, maddening delay!
Fine. I get it. I'm not going to be able to fix this and I start looking at compiling mod_wsgi for Apache 1.3. First I put the domain names back in my httpd.conf file and for some reason feel moved to try a domain in my RewriteRule instead of an IP address. Un-be-lievable.
When my .htaccess file looks like this:
RewriteRule ^(.*)$ http://127.0.0.1:9006/$1 [QSA,P]
requests take 5+ seconds to return. When I put in this:
RewriteRule ^(.*)$ http://localhost:9006/$1 [QSA,P]
I suddenly start getting sub 200ms response times.
Does that make sense to anybody? Me neither... Forcing DNS resolution (well - localhost presumably gets looked up in /etc/hosts) instead of using an IP directly results in an order of magnitude speedup? I hate computers. Oh and if anybody else finds themselves in the same situation and this is helpful - email me for my home address - I like the chocolates with creamy fillings.
Yesterday I went to the Django 1.0 Alpha sprint at Whiskey Media in Sausalito.
I didn't actually contribute much to the Alpha goal - Jacob Kaplan-Moss (Django BDFL) and Malcolm Tredinnick (Django committer) seemed to be busy committing batches of backwards incompatible patches to SVN and coordinating with others who were doing the same. I see Brian Rosner merged the newforms-admin branch yesterday at 5pm so it looks like 1.0 is rolling.
The rest of the non-committers there mostly worked on low-hanging fruit in the bug list. I was only there from 10am-3:30pm (it takes a while to drive there and back from Modesto) and I fixed one bug in the paginator class and patched some failing tests after a discussion with Malcolm (see #6997 and #6444).
Whiskey Media (who employs Jacob) was gracious to host us and I believe Industrial Light and Magic paid for our lunch - one of the other devs works for them and paid for our pizzas.
I had fun and snapped a few pics during the day -

Jacob Explaining things in the middle. On the left, haloed by the light Malcolm is coding away.
Jacob is eating pizza. Leah Culver is working behind the pizza boxes. I didn't catch the name of the guy on the left but he bought us lunch (and talked about Python at IL&M).

David Kellerman sat next to me and admired the time it took (multiple hours) to run the entire Django test suite under Windows.
Early in the afternoon I went across the street to get coffee with Jacob & David and Jacob insisted on paying - the least he could do, he said, since we're donating a day's work. I wonder how many days of his work I've used? At any rate it was nice to put faces to some of the names I knew and fun to help out in a small way... I don't think I'll be in person at any of the other sprints (unless somebody wants to pay my way to Dallas or Lawrence) but I'll probably be on IRC. 1.0, here we come!
OK, you're killing me out there!
I'm turning into a crank, I know, but the quality of conversation
around PHP development lately is really bugging me.
I ranted at length in the comments at the Sitepoint blog post about
microbenchmarks and thought I got it out of my system.
No such luck - this morning in my feeds was a link to a blog post
titled "in_array is quite slow". DO NOT WANT!
I'm going to explain one more time what's bugging me and then start ignoring it...
First take a look at the really badly titled in_array
post. Basically the author is repeatedly importing a large dataset
from XML to a database and wants to eliminate duplicate records. The
script starts running slowly (as in hours) and he discovers that the
duplicate elimination logic consists of building an array of all the
existing unique ids in the database and another array of all the
ids from the XML data. Then for each id in the list of new ids it
searches the array of existing ids to see if the id already exists,
inserting it only if it's new ...
With me so far? Notice how I cleverly didn't use the words
"in_array". In fact, let's write some sample code to
accomplish this algorithm without resorting to in_array.
foreach($new_ids as $new_id)
{
    $found = false;
    foreach($old_ids as $old_id)
    {
        if($old_id == $new_id)
        {
            $found = true;
            break;
        }
    }
    if(!$found) insert_new_id($new_id);
}
So what's wrong with that code? Nothing, if my datasets stay
small. The problem is that it's an O(n²) algorithm. This is
Big O
notation and completely worth reading about if you are or intend
to be a programmer. Big O notation gives you a way to categorize the
speed of algorithms. I've heard that O stands for "on the
order of" but I don't see that on the wikipedia page. In
any event, Big O analysis doesn't care about the actual speed of an
algorithm, only about the way in which the number of operations varies
as the dataset the algorithm is performed on varies. In
this case it is pretty easy to see that if each of our lists has
10 items the inner loop will run up to 10x10 times (say if there were
no matches between the two lists). If there are 1000 items in each
list it will run 1000x1000 times. And to abstract this - if there are
"n" items in each list the inner loop will run up to n²
times - hence the O(n²) label. N squared is really bad because
quadratic growth doesn't scale the way we intuitively want things to
scale. Instinctively I always assume algorithms are linear. If
processing 1000 items in a list takes 1 second, processing 2000 items
ought to take 2 seconds, right? That's true for O(n) algorithms
only!
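To make the growth concrete, here's a small Python sketch of the same nested-loop search that just counts comparisons instead of inserting anything. The function name count_comparisons and the made-up id ranges are mine, purely for illustration:

```python
def count_comparisons(new_ids, old_ids):
    """Count how many == comparisons the nested-loop search performs."""
    comparisons = 0
    for new_id in new_ids:
        for old_id in old_ids:
            comparisons += 1
            if old_id == new_id:
                break
    return comparisons


for n in (10, 100, 1000):
    old_ids = list(range(n))           # ids already stored
    new_ids = list(range(n, 2 * n))    # worst case: no overlap at all
    print(n, count_comparisons(new_ids, old_ids))
# 10 100
# 100 10000
# 1000 1000000
```

Ten times the data means a hundred times the work, which is how a script that was fine on a test file ends up running for hours.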
So back to the blog post - what's wrong with in_array() that makes
it run so slowly? Nothing - in_array() is a search function - probably
more sophisticated than my for loop code but essentially the same
idea. Using in_array() in an inner loop gives you n-squared
runtimes!
Now using the $old_ids list as a hash-table instead (values in the
keys) lets you convert this back to a linear
runtime. Replacing the for loop (or in_array) with
if(!isset($old_ids[$new_id])) insert_new_id($new_id);
makes the code run much faster: down from hours to 0.8 seconds in
this case. But notice that this has nothing to do with in_array - instead
it has to do with choices of data-structures and algorithms. It
doesn't help that PHP's hashtable and list type are the same structure
(that's not per se bad, as long as you are aware that there are two
different such data structures)! We now have an O(n) algorithm because
the speed of a hashtable lookup is independent of the number of items
it contains. The whole "inner loop" becomes a single constant time
operation that doesn't vary with the length of the input data. And of
course a better blog title would be "Picking the right data structure"
or perhaps even "Duh! Searching is slower than lookup!"
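The same point isn't specific to PHP. Here's a rough Python sketch of the two approaches side by side (the function names and sample id ranges are invented for the example): a list membership test is a linear search much like in_array(), while a set is hash-based, so each lookup takes roughly constant time on average.

```python
def find_new_ids_slow(new_ids, old_ids):
    """O(n*m): 'in' on a list is a linear search, like in_array()."""
    return [i for i in new_ids if i not in old_ids]


def find_new_ids_fast(new_ids, old_ids):
    """O(n+m) on average: build a hash-based set once, then do lookups."""
    seen = set(old_ids)
    return [i for i in new_ids if i not in seen]


old_ids = list(range(0, 2000))
new_ids = list(range(1000, 3000))
# Both approaches agree on the answer; only the amount of work differs.
assert find_new_ids_slow(new_ids, old_ids) == find_new_ids_fast(new_ids, old_ids)
print(len(find_new_ids_fast(new_ids, old_ids)))  # 1000
```

Building the set costs one pass over the old ids, and that one-time cost is what buys the cheap lookups afterwards.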
And here's where my rant comes in. I'm not going to speculate
about the causes (ease of use and ubiquitous deployment, in my book,
plus the lingering awkwardness of the language that causes some of the
solid developers to defect (eg: I still have Paul Bissex and Simon
Willison filed under PHP in my RSS reader...)) but the quality of
commentary in the PHP community is bothering me. Shallowness and
bikeshedding abound. The Sitepoint post I mentioned at the beginning
was pointing to yet another PHP microbenchmark discussion - should you
use single quotes or double quotes around strings? Is while faster
than foreach? To reference or not to reference when passing data
structures around... All discussed un-ironically as "PHP best
practices".
Stop it already people!
Here's some semi-constructive advice. Don't ever write about syntax
unless you use the term newbie in the title. (The "for loop" for
newbies is just fine). For everybody else - syntax is not
programming! Saying you are a programmer because you "know
PHP" (ie understand the syntax of the language) is like me saying
I'm a painter because I can name all the colors. Especially don't pair
discussions of syntax and speed - in fact don't talk about speed at
all! Unless of course you cite the rules for optimization:
- Don't!
- (for experts only) Don't yet
Or at least Knuth's aphorism ("premature optimization is the root of all
evil"). In fact... I'm coining my own aphorism here (naming your statements makes
them sound more official). Henceforth metapundit's law must be respected: don't optimize (or
blog about optimization) unless you know who Knuth is. And if you've got some solid additions to my PHP
feeds (is Harry Fuecks still writing?) I'd be glad to hear them.