The truth is painful. No matter how thoroughly your application is tested, your code will eventually fail at some point. The cause can be anything: an unexpected exception, resource exhaustion, a crash of some backing service, a network outage, or simply an issue in an external library. Some of the possible issues (such as resource exhaustion) can be predicted and prevented in advance with proper monitoring. Unfortunately, there will always be something that gets past your defenses, no matter how hard you try.
What you can do instead is prepare for such scenarios and make sure that no error passes unnoticed. In most cases, an unexpected failure results in an exception raised by the application and logged through the logging system. The log destination can be stdout, stderr, a log file, or whatever output you have configured for logging. Depending on your implementation, this may or may not result in the application quitting with some system exit code.
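As a minimal sketch of this behavior, assuming nothing beyond the standard library, the following wrapper logs any unhandled exception to stderr and turns it into a nonzero exit code. The names `run` and `failing_job` are hypothetical and serve only as illustration:

```python
import logging
import sys

# send log records to stderr; a file or other handler would work the same way
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logger = logging.getLogger("app")


def run(job):
    """Call job() and return a process exit code (0 on success, 1 on failure)."""
    try:
        job()
    except Exception:
        # logger.exception logs at ERROR level and appends the stack trace
        logger.exception("Unhandled exception in %s", job.__name__)
        return 1
    return 0


def failing_job():
    # hypothetical job that always fails
    return 1 / 0


exit_code = run(failing_job)
```

A typical entry point would pass the result on to the operating system with `sys.exit(run(failing_job))`.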
You could, of course, depend solely on the log files stored in the filesystem for finding and monitoring your application errors. Unfortunately, observing errors in plain textual form is quite painful and does not scale beyond running code in development. You will eventually be forced to use some service designed for log collection and analysis. Proper log processing is very important for other reasons (that will be explained a bit later) but does not work well for tracking and debugging errors. The reason is simple. The most common form of an error log is just a Python stack trace. If you stop there, you will soon realize that it is not enough to find the root cause of your issues. This is especially true when errors occur in unknown patterns or only under certain load conditions.
What you really need is as much context information about the error occurrence as possible. It is also very useful to have a full history of the errors that occurred in the production environment that you can browse and search in some convenient way.
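To make the idea of context more concrete, here is a small sketch of what an error report could carry in addition to the stack trace. The function `build_error_report` and the fields it collects are assumptions chosen for illustration, not the format of any particular tool:

```python
import platform
import sys
import traceback
from datetime import datetime, timezone


def build_error_report(exc, **context):
    """Collect the stack trace together with surrounding context."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "type": type(exc).__name__,
        "message": str(exc),
        "traceback": traceback.format_exception(
            type(exc), exc, exc.__traceback__
        ),
        "python": platform.python_version(),
        "host": platform.node(),
        "argv": list(sys.argv),
        # caller-supplied details, for example a user or request identifier
        "context": context,
    }


try:
    {}["missing"]
except KeyError as error:
    report = build_error_report(error, user_id=42, endpoint="/orders")
```

Storing such reports in a searchable place is what gives you the browsable error history described above.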
One of the most common tools that gives such capabilities is Sentry (https://getsentry.com). It is a battle-tested service for tracking exceptions and collecting crash reports. It is available as open source, written in Python, and originated as a tool for backend web developers. It has since outgrown its initial ambitions and supports many more languages, including PHP, Ruby, and JavaScript, but it remains the tool of choice for many Python web developers.
It is common that web applications do not exit on unhandled exceptions because HTTP servers are obliged to return an error response with a status code from the 5XX group if any server error occurs. Most Python web frameworks do such things by default. In such cases, the exception is, in fact, handled either on the internal web framework level or by the WSGI server middleware. Anyway, this will usually still result in the exception stack trace being printed (usually on standard output).
Sentry is available as a paid software-as-a-service offering, but it is open source, so you can host it for free on your own infrastructure. The library that provides integration with Sentry is sentry-sdk (available on PyPI). If you haven't worked with it yet and want to test it but have no access to your own Sentry server, you can easily sign up for a free trial of Sentry's hosted service. Once you have access to a Sentry server and have created a new project, you will obtain a string called the Data Source Name (DSN). This DSN string is the minimal configuration setting needed to integrate your application with Sentry. It contains the protocol, credentials, server location, and your organization/project identifier in the following form:
'{PROTOCOL}://{PUBLIC_KEY}:{SECRET_KEY}@{HOST}/{PATH}{PROJECT_ID}'
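Because the DSN follows ordinary URL syntax, its parts can be inspected with the standard library. The sketch below splits a made-up DSN (not a real credential) into the components named in the template; the `components` dictionary and its key names are only illustrative:

```python
from urllib.parse import urlsplit

# made-up DSN in the form shown above; not a real credential
dsn = "https://public123:secret456@sentry.example.com/prefix/42"

parts = urlsplit(dsn)
# the project identifier is the last path segment
path, _, project_id = parts.path.rpartition("/")

components = {
    "protocol": parts.scheme,
    "public_key": parts.username,
    "secret_key": parts.password,
    "host": parts.hostname,
    "path": path + "/",
    "project_id": project_id,
}
```

In practice you rarely need to parse the DSN yourself, since the SDK does that for you. The point is only that the string bundles everything the client needs to reach the right project.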
Once you have the DSN, the integration is pretty straightforward, as shown in the following code:
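import sentry_sdk

# initialize the SDK once, as early as possible in the application
sentry_sdk.init(
    dsn='https://<key>:<secret>@app.getsentry.com/<project>'
)

try:
    1 / 0
except Exception as e:
    # report the handled exception to Sentry explicitly
    sentry_sdk.capture_exception(e)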
The older library for Sentry integration is Raven. It is still maintained and available on PyPI but is being phased out, so it is best to start your Sentry integration using the newer sentry-sdk package. It is possible, though, that some framework integrations or Raven extensions haven't been ported to the new SDK, so in such situations, Raven is still a feasible integration path.
The Sentry SDK ships with integrations for the most popular Python frameworks, such as Django, Flask, Celery, and Pyramid. These integrations automatically provide additional context that is specific to the given framework. If your web framework of choice does not have dedicated support, the sentry-sdk package provides generic WSGI middleware that makes it compatible with any WSGI-based web server, as shown in the following code:
import sentry_sdk
from sentry_sdk.integrations.wsgi import SentryWsgiMiddleware

sentry_sdk.init(
    dsn='https://<key>:<secret>@app.getsentry.com/<project>'
)

# ...
# note: application is some WSGI application object defined earlier
application = SentryWsgiMiddleware(application)
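To illustrate what such middleware does conceptually, here is a simplified stand-in built only on the standard library. It is not the sentry-sdk implementation; `ReportingMiddleware` and the `report` callable are hypothetical names used for this sketch:

```python
import sys


class ReportingMiddleware:
    """Hypothetical WSGI middleware: report exceptions, then re-raise."""

    def __init__(self, app, report):
        self.app = app
        self.report = report  # callable receiving sys.exc_info() and environ

    def __call__(self, environ, start_response):
        try:
            return self.app(environ, start_response)
        except Exception:
            # attach request details that a bare stack trace would lack
            self.report(sys.exc_info(), environ)
            raise  # let the server produce its usual 5XX response


def broken_app(environ, start_response):
    # hypothetical application that always fails
    raise RuntimeError("boom")


captured = []
wrapped = ReportingMiddleware(
    broken_app,
    lambda exc_info, environ: captured.append(
        (exc_info[0], environ["PATH_INFO"])
    ),
)

try:
    wrapped({"PATH_INFO": "/checkout"}, lambda status, headers: None)
except RuntimeError:
    pass  # a real WSGI server would turn this into an error response
```

This sketch only covers exceptions raised while the application callable runs. Production-grade middleware also has to handle errors raised later, while the response body is being iterated.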
The other notable integration is the ability to track messages logged through Python's built-in logging module. Enabling such support requires only the following few additional lines of code:
import logging

import sentry_sdk
from sentry_sdk.integrations.logging import LoggingIntegration

sentry_logging = LoggingIntegration(
    level=logging.INFO,         # record INFO and above as breadcrumbs
    event_level=logging.ERROR,  # send ERROR and above as events
)

sentry_sdk.init(
    dsn='https://<key>:<secret>@app.getsentry.com/<project>',
    integrations=[sentry_logging],
)
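The two thresholds can be confusing at first, so here is a rough model of the idea using a plain logging handler. It is a simplification under the assumption that lower-severity records are kept as a trail of breadcrumbs and attached to the next record that is severe enough to become an event; `BreadcrumbHandler` is a hypothetical class, not part of sentry-sdk:

```python
import logging


class BreadcrumbHandler(logging.Handler):
    """Hypothetical handler mimicking the level/event_level split."""

    def __init__(self, level, event_level):
        super().__init__(level=level)
        self.event_level = event_level
        self.breadcrumbs = []
        self.events = []

    def emit(self, record):
        if record.levelno >= self.event_level:
            # an event carries the breadcrumbs recorded before it
            self.events.append((record.getMessage(), list(self.breadcrumbs)))
        self.breadcrumbs.append(record.getMessage())


handler = BreadcrumbHandler(level=logging.INFO, event_level=logging.ERROR)
logger = logging.getLogger("breadcrumb_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the demo output away from the root logger
logger.addHandler(handler)

logger.debug("ignored")         # below level: dropped by the handler
logger.info("user logged in")   # kept as a breadcrumb only
logger.error("payment failed")  # becomes an event with its breadcrumb trail
```

After these three calls, `handler.events` holds a single entry for "payment failed" with "user logged in" attached as its trail, while the debug message was never recorded.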
Capturing logging messages comes with some caveats, so make sure to read the official documentation on that topic if you are interested in this feature. This should save you from unpleasant surprises.
The last note is about running your own Sentry as a way to save some money. There ain't no such thing as a free lunch.
You will eventually pay additional infrastructure costs and Sentry will be just another service to maintain. Maintenance = additional work = costs! As your application grows, the number of exceptions grows, so you will be forced to scale Sentry as you scale your product. Fortunately, Sentry is a very robust project, but it will not give you any value if it is overwhelmed with too much load. Also, keeping Sentry prepared for a catastrophic failure scenario, where thousands of crash reports can arrive every second, is a real challenge. So you must decide which option is really cheaper for you, and whether you have enough resources to do all of this by yourself. There is, of course, no such dilemma if security policies in your organization forbid sending any data to third parties. If so, just host it on your own infrastructure. There are costs, of course, but ones that are definitely worth paying.
Next, we will discuss the monitoring system and application metrics.