This week we're writing about a particularly technical situation we ran into with a client. The symptoms: an architecture that has slowed down and is behaving poorly. There are warnings about excessive memory usage, and requests are timing out after 30 seconds, leaving some pages completely unresponsive for clients with more than 300 offices.
Currently, the client has the following arguments being passed to waitress:
$ waitress-serve --threads=8 --port=$PORT --send-bytes=50 wsgi:application
This tells waitress to run 8 threads (waitress is a single process, so there are no separate workers) on the port Heroku passes us, and to flush the buffer once 50 bytes have been queued for sending. The reason for this small queue is to work around a problem on Heroku where streaming responses are not sent out until a buffer is filled. Also, for reference, the worker count gunicorn recommends is (2 * CPU_COUNT) + 1; its default is a single worker.
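To make the gunicorn figure concrete, here is a minimal sketch of the (2 * CPU_COUNT) + 1 arithmetic. The function name is ours, purely for illustration, and it is only a rule of thumb to start from, not a tuned value for this client.

```python
import os


def suggested_gunicorn_workers(cpu_count=None):
    """Return the (2 * CPU_COUNT) + 1 figure for a given core count."""
    if cpu_count is None:
        # os.cpu_count() can return None on some platforms; fall back to 1.
        cpu_count = os.cpu_count() or 1
    return (2 * cpu_count) + 1


print(suggested_gunicorn_workers(1))  # 3
print(suggested_gunicorn_workers(4))  # 9
```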
The problems, as we can see from Heroku's analytics, are (a) memory usage growing too high over time and (b) responses hitting the 30-second limit Heroku allows for generating a response.
(a) memory problems
Why are we using so much memory?
Probably because we're running 8 threads. This application is much more IO bound than CPU bound, so using threads makes sense, but in our case 8 is probably just plain too many. As an aside: the reason many threads on a single CPU make sense is that we spend most of our time waiting for a read from disk or the network. We're not waiting for the CPU to crunch numbers; it's more like the CPU is waiting an eternity for a phone call.
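A quick way to see why threads help IO-bound work is to fake the waiting with a sleep. This is only a sketch: fake_io_call is a hypothetical stand-in for a database or network call, and the timings will vary from machine to machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_call(delay=0.1):
    """Stand-in for a database or network call: the thread just waits."""
    time.sleep(delay)
    return delay


def run_serial(n):
    """Make n calls one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        fake_io_call()
    return time.perf_counter() - start


def run_threaded(n, threads):
    """Make n calls on a thread pool and return the elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(lambda _: fake_io_call(), range(n)))
    return time.perf_counter() - start


serial = run_serial(4)
threaded = run_threaded(4, threads=4)
print(f"serial: {serial:.2f}s, threaded: {threaded:.2f}s")
```

The threaded run finishes in roughly the time of one call because the waits overlap. The flip side is that every extra thread holds on to its own memory while it waits, which is where the trade-off in this post comes from.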
How to fix?
To fix this problem, I'll spawn fewer threads, which should use much less memory. I found a handy document from Heroku that suggests a number of workers per dyno. We'll go with 3 for now and see if that stops the memory usage problems.
(b) timeouts
I bet the source of the timeouts is an API request generating way too many queries to Postgres, or potentially a deadlock (events out of sync and something forced to wait indefinitely).
To start investigating, I switched our staging server into a "debug mode" that let me print out all of the SQL queries that could be causing problems. After browsing for a moment I found some culprits:
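One way to get this kind of SQL output in Django is to route the django.db.backends logger to the console. This is a sketch of a settings fragment, not necessarily the exact "debug mode" used here. Django only logs SQL statements to that logger when DEBUG is True, so keep this on staging and off production.

```python
import logging.config

# Hypothetical fragment of a Django settings module for a staging server.
DEBUG = True  # Django only logs SQL statements when DEBUG is on

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        # Django emits one DEBUG record per executed SQL statement here.
        "django.db.backends": {
            "handlers": ["console"],
            "level": "DEBUG",
            "propagate": False,
        },
    },
}

# Django applies LOGGING itself at startup. This call only checks that the
# dict is a valid logging configuration when the snippet is run on its own.
logging.config.dictConfig(LOGGING)
```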
/api/orders is generating 176 queries, with lots of duplicates that could potentially be avoided:
SELECT ... FROM "businesses_address" WHERE ("businesses_address"."user_id" = ... AND "businesses_address"."active" = true) LIMIT 1
Duplicated 20 times
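If you don't have a tool that flags duplicates for you, counting them by hand is not much work. The sketch below assumes a list of dicts with an "sql" key, which is the shape of django.db.connection.queries when DEBUG is on. The sample data and the "orders_order" table name are made up for illustration.

```python
from collections import Counter


def duplicated_queries(queries, min_count=2):
    """Return (sql, count) pairs for statements run at least min_count times."""
    counts = Counter(q["sql"] for q in queries)
    return [(sql, n) for sql, n in counts.most_common() if n >= min_count]


# Hypothetical sample shaped like django.db.connection.queries.
sample = (
    [{"sql": 'SELECT ... FROM "businesses_address" LIMIT 1', "time": "0.002"}] * 20
    + [{"sql": 'SELECT ... FROM "orders_order"', "time": "0.010"}]
)

for sql, n in duplicated_queries(sample):
    print(f"{n}x {sql}")
```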
The offending query is generated by our django-rest-framework serializer:
    class OrderSerializer(serializers.ModelSerializer):
        class Meta:
            model = Order
            fields = (
                'date',
                'office',
                ...
            )
However, this time the culprit seems to be just plain grabbing too much data back from the database. We were fetching the companies and offices of each user who made an order, not just their name and address! The query count plummeted from 176 to 42 in one change. A couple more fixes (a nice use of select_related) and we're down from 42 to... 4! Very acceptable.
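To show why select_related cuts the count so sharply, here is a small stand-in using sqlite3 from the standard library rather than the Django ORM. The tables and data are hypothetical. In Django the equivalent change would be along the lines of Order.objects.select_related('office'), which makes the ORM fetch the related row with a JOIN instead of issuing one extra query per order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE office (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, office_id INTEGER);
    INSERT INTO office (id, name) VALUES (1, 'North'), (2, 'South');
    INSERT INTO orders (id, office_id) VALUES (1, 1), (2, 1), (3, 2), (4, 2);
""")

statements = []
conn.set_trace_callback(statements.append)  # record every SQL statement run


def offices_one_by_one():
    """N+1 pattern: one query for the orders, then one per order for its office."""
    rows = conn.execute("SELECT id, office_id FROM orders ORDER BY id").fetchall()
    return [
        conn.execute("SELECT name FROM office WHERE id = ?", (office_id,)).fetchone()[0]
        for _, office_id in rows
    ]


def offices_joined():
    """The select_related idea: fetch the related row in the same query."""
    rows = conn.execute(
        "SELECT office.name FROM orders "
        "JOIN office ON office.id = orders.office_id "
        "ORDER BY orders.id"
    ).fetchall()
    return [name for (name,) in rows]


statements.clear()
slow = offices_one_by_one()
n_plus_one_count = len(statements)

statements.clear()
fast = offices_joined()
joined_count = len(statements)

print(f"one by one: {n_plus_one_count} queries, joined: {joined_count} query")
```

Both functions return the same office names; only the number of round trips to the database differs, and that gap grows with every extra order.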
In the most astonishing instance, I added another flag and dropped queries from 511 to 4.
Common slowdowns in Django
These common issues didn't end up slowing down the server in this case, but they do come up often:
- How are you serving static files? Make sure you're using WhiteNoise with Django on Heroku.
- Use Redis for caching to avoid repeating expensive work.
- Make sure your static assets are served with headers that allow them to be cached.
- Serve from a CDN when possible (although this won't cause a Heroku timeout, it's a good tip!)
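For the first three tips, a Django settings fragment might look roughly like the sketch below. It is not this client's configuration. The REDIS_URL environment variable and the local fallback address are assumptions, and the rest of the middleware list is left out.

```python
import os

# Hypothetical excerpt from a Django settings module.
MIDDLEWARE = [
    "django.middleware.security.SecurityMiddleware",
    # WhiteNoise should sit directly after SecurityMiddleware.
    "whitenoise.middleware.WhiteNoiseMiddleware",
    # ... the rest of the project's middleware ...
]

STATIC_URL = "/static/"

# Hashed file names plus compression, so assets can be cached for a long time.
# (On Django 4.2+ the same backend is configured through the STORAGES setting.)
STATICFILES_STORAGE = "whitenoise.storage.CompressedManifestStaticFilesStorage"

# Django's built-in Redis cache backend (available from Django 4.0).
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        # Assumption: the Redis connection string is provided as REDIS_URL.
        "LOCATION": os.environ.get("REDIS_URL", "redis://127.0.0.1:6379"),
    }
}
```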
Thanks for following along. It's always fun to use the same amount of resources and still get a quick 5x performance boost!