Django – Correctly Wiring to AWS CloudFront for Static and Media Files

When it comes to “How To Set Up a CDN for Django,” most people suggest the following “halfway” setup, which can lead to stale-cache problems and slowness in the admin.

  1. Configure Django to upload media and static files to S3. That is done by installing django-storages and boto, then setting DEFAULT_FILE_STORAGE to the S3 storage backend.
  2. Configure your AWS credentials.
  3. Set up your CloudFront distribution to pull from your S3 bucket.
  4. Set Django’s MEDIA_URL and STATIC_URL settings to point to the public URL of the CloudFront distribution.
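The steps above, sketched as django-storages settings (the bucket name, distribution ID, and credential values are placeholders, and the backend path assumes the boto3 backend django-storages ships):

```python
# settings.py -- the "halfway" setup described above (placeholder values)
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto3.S3Boto3Storage'
AWS_STORAGE_BUCKET_NAME = 'mysite-assets'
AWS_ACCESS_KEY_ID = '...'       # usually pulled from the environment
AWS_SECRET_ACCESS_KEY = '...'

# point the public URLs at the CloudFront distribution, not the bucket
MEDIA_URL = 'https://{media-distribution-id}.cloudfront.net/'
STATIC_URL = 'https://{static-distribution-id}.cloudfront.net/'
```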

For simple sites this can be an okay way to go. Wiring it up takes just a few minutes and you get automatic backups of all uploaded media. This approach is required in some hosting environments, such as Heroku, where you don’t get a permanent local disk to store files on.

I’m not sold on it though…

Using S3 as the Media / Static Root Can Be a Bad Idea:

1) CDNs cache everything aggressively AND pass Expires headers to clients that tell them to cache aggressively too.

Unless you are actively managing the state of the CDN cache (which takes forethought with CloudFront), if the content of a file changes, some of your users are guaranteed to get the stale version and see a broken version of your site.

This will happen pretty much every time you push new code.

There is a mechanism in CloudFront to invalidate files, but it isn’t instantaneous, and beyond 1,000 invalidation paths per month it starts costing you.
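For reference, this is roughly what an invalidation looks like with boto3 (the distribution ID is a placeholder, and the helper below is just a sketch of the payload CloudFront expects):

```python
# Sketch: invalidating cached paths in CloudFront via boto3.
# Note: the first 1,000 invalidation paths per month are free; beyond
# that each path costs extra, and propagation takes minutes, not seconds.
import time

def build_invalidation_batch(paths):
    """Build the InvalidationBatch payload that create_invalidation expects."""
    return {
        'Paths': {'Quantity': len(paths), 'Items': list(paths)},
        'CallerReference': str(time.time()),  # must be unique per request
    }

# Actual call (requires AWS credentials; distribution ID is hypothetical):
# import boto3
# client = boto3.client('cloudfront')
# client.create_invalidation(
#     DistributionId='E2EXAMPLE',
#     InvalidationBatch=build_invalidation_batch(['/css/mysite.min.css']),
# )
```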

For the kind of high profile Django sites I work on, it would be detrimental if the user got to a page that was a franken-build mix of new and old assets because of stale cache.


2) Write performance to S3 is slow.

Many of the sites I’ve worked on use VersatileImageField to generate a handful of cropped / resized versions of images after they get uploaded through the admin (large, medium, thumbnail). When S3 is the default file storage, the admin runs slowly because of the intensive IO work needed to generate the images and push them to S3. With VersatileImageField, every time you save the model it revalidates all those files, which causes it to scan the S3 bucket (really slow).

So if you are doing a lot of interesting things with your media files, your site will bog down while Django talks back and forth to S3.

 

How I Set Up CloudFront with Django:

1) Don’t use S3 to store files…

Instead configure MEDIA_ROOT and STATIC_ROOT to be on your server like you would normally.

There’s a twist for STATIC_ROOT, see below.

2) Set up your server as normal, like you would without CloudFront, so it serves /media/ and /static/ on its own, but set some extra headers.

Configure your server so /media/ and /static/ are served as normal from your domain. It is helpful to set the CORS header Access-Control-Allow-Origin: * so things like custom fonts work correctly when served from the CloudFront domain.

Example Apache 2.4 configuration for serving media and static files, with CORS header:

# STATIC DIRECTORY
Alias /static /var/mysite/static
<Directory "/var/mysite/static/">
   Options +FollowSymLinks -Indexes -MultiViews
   Require all granted
   # add the CORS header so the AWS CloudFront distribution can serve all files, including fonts
   Header set Access-Control-Allow-Origin "*"
</Directory>

# MEDIA DIRECTORY
Alias /media /var/mysite/media
<Directory "/var/mysite/media/">
   Options +FollowSymLinks -Indexes -MultiViews
   Require all granted
   # add the CORS header so the AWS CloudFront distribution can serve all files, including fonts
   Header set Access-Control-Allow-Origin "*"
</Directory>

Note: this is only an option if your hosting environment gives you access to a persistent local disk, which isn’t always the case (Heroku, for example).

3) Setup two CloudFront distributions using your server as the origin.

Create one distribution for the media files and a second distribution for the static files. Serving assets from two hostnames lets browsers download more files in parallel, which improves performance during page load.

4) Add a variable into the static path / url that busts the cache for each deploy.

AWS calls this strategy Using Versioned Object Names.

The trick is to configure Django’s STATIC_ROOT and STATIC_URL so they include a configuration variable that gets updated every deploy.

STATIC_ROOT = '/var/mysite/static/' + BUILD_VERSION + '/'
STATIC_URL = 'https://{static-distribution-id}.cloudfront.net/' + BUILD_VERSION + '/'

When the deploy happens and collectstatic is executed, all the files go into a new folder just for that version of the app. Those files are totally new to the CDN and all your users. When your users hit your site, they are guaranteed to get the new version of all static files.

Note: for media files I’m not setting up a dynamic path. This is okay for most use cases. If a user uploads a file or image with the same name, Django will take care of giving it a unique name.

Example settings.py file:

BUILD_VERSION = 'mysite-1.0'

#############################################
# MEDIA & STATIC FILES
#############################################

# store media files locally on the server
MEDIA_ROOT = '/var/mysite/media/'

# Serve media files from the CloudFront distribution
MEDIA_URL = 'https://{media-distribution-id}.cloudfront.net/'

# When collectstatic runs, it will copy the files to a path that appends the BUILD_VERSION.
# This way each deploy gets a fresh URL folder, so there is no need to worry about invalidating stale caches in the CDN.
STATIC_ROOT = '/var/mysite/static/' + BUILD_VERSION + '/'

# serve static files from the CloudFront distribution
STATIC_URL = 'https://{static-distribution-id}.cloudfront.net/' + BUILD_VERSION + '/'

This requires updating BUILD_VERSION before each deploy. I like having that as a pre-deploy step. I update that value at the same time I minify the CSS/JavaScript files, then commit it all. It makes a nice marker in the commit history along with a new version tag.

For systems with automated builds updating that variable can easily be part of the overall build / test / deploy process.
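A minimal sketch of how an automated build might feed that value in, assuming the build system exports an environment variable (the variable name and the fallback default are assumptions, not part of the original setup):

```python
# settings.py -- derive BUILD_VERSION from the CI environment instead of
# editing it by hand; fall back to a manual default for local development.
import os

BUILD_VERSION = os.environ.get('BUILD_VERSION', 'mysite-1.0')

STATIC_ROOT = '/var/mysite/static/' + BUILD_VERSION + '/'
STATIC_URL = 'https://{static-distribution-id}.cloudfront.net/' + BUILD_VERSION + '/'
```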

How It Works:

Initially, your site’s CSS file would live at:
https://mysite.com/static/mysite-1.0/css/mysite.min.css

But it would be served through the CDN at:
https://{static-distribution-id}.cloudfront.net/mysite-1.0/css/mysite.min.css

As part of the next deploy, you’d change BUILD_VERSION to mysite-1.1. After the deploy the next page request to your site would generate a request for the CSS file at:
https://{static-distribution-id}.cloudfront.net/mysite-1.1/css/mysite.min.css

The CDN would pull it off your server as a totally fresh file and cache it across the entire network.

Why I like it:

I opt for a simple solution that works well all of the time.

Other solutions to busting the cache do things like adding a query string parameter to CSS / JS files. For example:
https://mysite.com/static/css/mysite.min.css?version=mysite-1.0.1

In practice, that works okay but not 100% of the time because some proxy servers ignore the query string for the purposes of the cache! This is true even if you are forwarding query strings with CloudFront. That is the kind of bug you only run into once in a blue moon and it is impossible to replicate.

One minor drawback: each time the site is deployed, collectstatic copies all the files to a new folder. At some point someone or something will need to clean up the old folders, but only when you know it is safe to do so. Storage is cheap and the static directory is usually pretty small, so I don’t see this as a huge problem.
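That cleanup could be a small script along these lines (a sketch under the assumption that the static root layout matches the settings above; run it only once you are sure no cached pages still reference the old build):

```python
# Sketch: prune old build folders from the static root, keeping only the
# folder for the current BUILD_VERSION. Paths mirror the example settings.
import shutil
from pathlib import Path

def prune_old_builds(static_base, keep_version):
    """Delete every version folder under static_base except keep_version."""
    removed = []
    for entry in Path(static_base).iterdir():
        if entry.is_dir() and entry.name != keep_version:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed

# Example: prune_old_builds('/var/mysite/static', 'mysite-1.1')
```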

Where not to use this idea:

If you have load balancers with multiple servers, this solution may not be for you.

If your media files are only changed through the admin then it may work, but you still have to make sure the distribution’s origin points to a location on your side that will always see the newly uploaded files. That can defeat the purpose of load balancing…

If you have lots of users uploading media content across all your servers it would be better to send the new uploads directly into S3 (or save locally at first then push to S3 in the background). In that case it would be important to avoid file name collisions – where user A uploads cat.gif and user B uploads another cat.gif in the same instant, and user A’s cat.gif gets overwritten.  In theory Django will automatically check for file name collisions when uploading new media, but S3 has such high latency I’d take extra precautions.
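One of those extra precautions could be an upload_to callable that gives every file a unique name, so two simultaneous cat.gif uploads can never collide (the function name and the uploads/ prefix are my own; the ImageField wiring shown in comments requires Django):

```python
# Sketch: collision-proof upload naming -- keep the extension, replace
# the file name with a UUID so concurrent uploads can never overwrite
# each other regardless of storage backend latency.
import os
import uuid

def unique_upload_path(instance, filename):
    """Return a storage path with a UUID-based name and the original extension."""
    ext = os.path.splitext(filename)[1].lower()
    return 'uploads/{}{}'.format(uuid.uuid4().hex, ext)

# In a Django model:
# image = models.ImageField(upload_to=unique_upload_path)
```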

Another approach would be to set up an NFS share that all your load-balanced servers tie into for media storage. Then you could continue with the setup above. But setting up NFS adds complexity and another potential point of failure.

Thank you Alberto for bringing up this scenario.

Closing thoughts…

No matter what, it is absolutely essential to set up a CDN. It is dirt cheap, it will make the site load fast for all your users no matter where they are, and it is a well-established best practice. Just don’t cause bugs you will never see or be able to replicate because of a stale cache…

