June 4, 2024 • Written By Tim Liu
TL;DR: Flask and FastAPI are great technologies with thriving communities, but they were designed for IO-intensive applications (web applications). ML workloads require a different architecture and set of features because they are often compute and memory intensive.
It’s no secret that lots of ML models today are being deployed with Flask and FastAPI. Both provide simple interfaces and battle-tested deployment patterns. In fact, the initial version of our product at BentoML was built on top of Flask because at the time, it was considered the most straightforward framework for deploying APIs.
But, after years of supporting BentoML deployments backed by Flask, we came to the conclusion that Flask and its successor, FastAPI, are actually not the best tools to serve ML models at scale.
Why? The reason brings me to a key learning I’ve had as a developer: All technologies come with tradeoffs, and making the best choice is often dependent on the use case.
For ML use cases where the workload is compute-intensive and requires specialized hardware, you need a different architecture than what’s required for a typical web application.
In this article, we address where both Flask and FastAPI don’t hold up for deploying ML models from an architecture and feature perspective.
Flask is a web framework that makes it really simple to deploy web applications and has a rich ecosystem around it. But it was released in 2010, and is unfortunately starting to show its age.
There’s often one key difference that comes up while comparing Flask to newer frameworks: Flask does not implement the latest ASGI standard for web serving.
Flask implements an older standard called WSGI, meaning that every request blocks a worker until it is completed, limiting the number of parallel requests Flask can handle. This design is called a synchronous request architecture. While Flask provides thread support (which kind of helps), multithreading in Python can produce unintuitive and often counterproductive results because of the Global Interpreter Lock.
ASGI, on the other hand, lets a single worker keep many requests in flight at once through asynchronous endpoints, allowing the framework to handle far more requests at scale. Modern web frameworks that implement the ASGI standard are not only more efficient, but also provide backward compatibility with WSGI in case it's needed.
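To make the contrast concrete, here's a minimal sketch (the `slow_io_call` helpers are just stand-ins for real IO such as a database or HTTP call): the blocking Flask route ties up its worker for the whole request, while the async endpoint on an ASGI framework yields the event loop whenever it waits.

```python
import asyncio
import time

from flask import Flask
from fastapi import FastAPI

def slow_io_call():
    time.sleep(1)            # stand-in for a blocking IO call (DB, HTTP, ...)
    return "done"

async def slow_io_call_async():
    await asyncio.sleep(1)   # stand-in for a non-blocking IO call
    return "done"

# Flask (WSGI): the worker handling this request is occupied until it returns.
wsgi_app = Flask(__name__)

@wsgi_app.route("/work")
def work():
    return {"result": slow_io_call()}

# ASGI (here FastAPI/Starlette): awaiting IO yields the event loop, so a single
# worker can keep many requests in flight concurrently.
asgi_app = FastAPI()

@asgi_app.get("/work")
async def work_async():
    return {"result": await slow_io_call_async()}
```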
When Flask was first released, the practice of applying DevOps best practices to production services was still in its infancy. Because Flask has such a large community, you can still, for the most part, find a module for any integration. However, these best practices are now table stakes for any production-worthy service, which is why newer frameworks like BentoML integrate them as first-class citizens.
Data validation
For applications that handle large amounts of data, it’s important that the data you’re about to process is clean and in a known format. Because Flask does not provide data validation, developers often need to write large if/else statements to check the data in the endpoint.
As an aside, in the first version of BentoML, we built a data validation layer on top of Flask using pydantic to enable users to easily create data quality checks on their API endpoints in a much more standard and data driven way.
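As a rough illustration of the pattern (the schema and field names here are made up for the example), a single declarative pydantic model can replace a wall of manual if/else checks:

```python
from pydantic import BaseModel, Field, ValidationError

class IrisFeatures(BaseModel):
    sepal_length: float = Field(gt=0)
    sepal_width: float = Field(gt=0)
    petal_length: float = Field(gt=0)
    petal_width: float = Field(gt=0)

def parse_request(payload: dict) -> IrisFeatures:
    # One declarative schema replaces a pile of hand-written checks.
    try:
        return IrisFeatures(**payload)
    except ValidationError as err:
        # In a real endpoint, you'd turn this into a 400 response.
        raise ValueError(err.errors()) from err

print(parse_request({"sepal_length": 5.1, "sepal_width": 3.5,
                     "petal_length": 1.4, "petal_width": 0.2}))
```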
Documentation and testing
In the last few years, the OpenAPI specification has become the de facto standard for exposing, documenting, and testing an API-based service. Implementing this specification allows frameworks to easily integrate and expose helpful tools like Swagger UIs that allow you to test and share endpoints, making APIs way more accessible. Newer frameworks support the OpenAPI standard out of the box and can easily generate a host of documentation and interactive UIs without needing the developer to configure anything.
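For instance, in a sketch like the following (the request and response models are illustrative), FastAPI derives the OpenAPI schema and a Swagger UI at /docs purely from the type hints:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Sentiment API sketch")

class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str
    score: float

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    # A placeholder "model"; the point is that the typed request/response
    # models above are all FastAPI needs to publish an OpenAPI schema and an
    # interactive Swagger UI at /docs, with no extra configuration.
    return PredictResponse(label="positive", score=0.99)
```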
Monitoring and debugging
These days, metric monitoring and alerting are standard DevOps practice, and tools like Prometheus and Grafana are commonplace. They've proven to be simple, effective ways to keep track of production systems and alert the right people when things aren't looking good.
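As a rough sketch of what this looks like with the standard prometheus_client library (the metric names and port are arbitrary choices for the example):

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("predict_requests_total", "Total prediction requests")
LATENCY = Histogram("predict_latency_seconds", "Prediction latency in seconds")

def predict(x):
    REQUESTS.inc()
    with LATENCY.time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for model inference
        return x

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes http://localhost:9100/metrics
    while True:
        predict(42)
```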
BentoML implements other open standards as well, like OpenTelemetry, which offers the ability to trace multiple levels of calls within an individual service as well as across microservices. Often these ML services are only one piece of the puzzle, and with different microservices calling them, it's important to have trace IDs that correlate an individual request across services for easier debugging.
The drawbacks of Flask bring us to one of the biggest innovations in Python in the last few years: FastAPI. FastAPI was designed as a modern web framework that addresses many of Flask's aging design decisions. It implements ASGI and ships with niceties like Swagger UI and pydantic data validation built in.
A lot has been written about how FastAPI is a better solution than Flask for ML serving, but even so, it was still written with web applications in mind — again, failing to implement the right patterns to easily scale an ML service.
As the more modern alternative to Flask, FastAPI is built on the ASGI standard, allowing it to handle many requests fairly efficiently. However, when scaling beyond a single process, an ASGI server uses the same strategy as WSGI: multiple copies of the main process are created as workers to service more requests in parallel. This means that if you have multiple CPUs on the same machine, multiple workers can take advantage of all available CPUs rather than using only one. The problem is that for an ML application, the model itself could have a fairly large memory footprint and/or be very computationally intensive.
• For large models, you’ll want to make sure you run fewer copies of the model in memory and potentially more web request workers to handle the incoming and outgoing transformations.
• For computationally-intensive models, you may want to run your model worker on a GPU and only run as many model workers as there are GPUs.
FastAPI has few easy ways to configure these types of scenarios. There are workarounds, such as managing your own executor pools or putting a large model in shared memory, but none of them are first-class solutions for ML use cases, and they are difficult to implement.
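Here's a sketch of the memory problem, assuming a typical uvicorn deployment (the numpy array is just a stand-in for real model weights):

```python
# app.py -- a typical "load the model at import time" serving pattern.
import numpy as np
from fastapi import FastAPI

# Every worker process started by `uvicorn app:app --workers 4` imports this
# module and therefore loads its own private copy of the weights. For a model
# measured in gigabytes, memory use multiplies with the worker count, even if
# a single copy of the model could keep up with the inference load.
model_weights = np.zeros((1000, 1000))  # stand-in for real model weights

app = FastAPI()

@app.get("/predict")
def predict():
    return {"result": float(model_weights.sum())}
```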
Because FastAPI is a generic web framework, it does not incorporate ML specific features. This is a big downside because certain features can greatly improve performance for ML specific workloads.
No support for micro-batching
One of the first things data scientists learn as they run predictions is to avoid loops. That's because most ML libraries support vectorized inference: combining many inputs into a batch and calculating the results far more efficiently, a technique that pairs framework-level features with specialized hardware like GPUs built for parallel computation. An online serving system should therefore be able to combine multiple parallel requests into a single batch before sending it to the model for inference. This kind of micro-batching is one of the best ways to scale a high-throughput system and make the best use of your GPU cycles, and FastAPI implements nothing like it; you're forced to handle one input at a time.
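Here's a toy numpy sketch of why batching pays off (the "model" is just a random weight matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((512, 10))  # stand-in for a small linear model
inputs = rng.standard_normal((64, 512))   # 64 requests arriving close together

# One request at a time: 64 separate matrix-vector products.
one_by_one = np.stack([x @ weights for x in inputs])

# Micro-batched: a single matrix-matrix product over the whole batch, which
# vectorized libraries (and especially GPUs) execute far more efficiently.
batched = inputs @ weights

assert np.allclose(one_by_one, batched)
```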
No support for async inference requests
While FastAPI does support async calls at the web request level, there is no way to call model predictions in an async manner. This matters because prediction requests are often compute-intensive and still bound to a synchronous native library. If this is the case, even if the endpoint is async, the inference request will block the main Python event loop, thus blocking all other requests from being processed in the main web serving process. In addition, if you are running the inference in a synchronous manner, there is no opportunity to micro-batch between requests.
This issue compounds if you're trying to efficiently split your request processing between the CPU (which handles request transformation and validation) and the GPU (purpose-built hardware for ML inference). A blocked event loop effectively stops your CPU from processing new requests, even though the inference work isn't running on the CPU at all, but on the GPU.
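One common workaround is to push the blocking call onto a thread pool yourself, as in the sketch below (DummyModel stands in for a real model, and this only helps if the native library releases the GIL during inference). Even then, nothing combines concurrent requests into a batch for you:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from fastapi import FastAPI

app = FastAPI()
executor = ThreadPoolExecutor(max_workers=2)

class DummyModel:
    def predict(self, x):
        # stand-in for a heavy, synchronous native-library call
        return float(np.sum(np.asarray(x)))

model = DummyModel()

@app.post("/predict")
async def predict(features: list[float]):
    loop = asyncio.get_running_loop()
    # Run the blocking call in a worker thread so the event loop keeps serving
    # other requests. This only helps if the underlying library releases the
    # GIL (most C/CUDA-backed frameworks do during inference), and it still
    # does nothing to micro-batch concurrent requests together.
    result = await loop.run_in_executor(executor, model.predict, features)
    return {"result": result}
```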
It was clear to us at BentoML that both Flask and FastAPI weren’t quite fully serving the needs of ML practitioners in terms of performance or ease of use — which is why the BentoML framework exists today. (Much respect to the folks at both Flask and FastAPI for their contributions to the wider Python ecosystem though.)
We built BentoML on a lesser-known framework called Starlette (which, it turns out, is the same framework FastAPI is built on) for a couple of reasons:
1. It provides performance primitives like async endpoints and basic HTTP functionality
2. It isn’t opinionated about validation, compute or scaling (which all need special attention for ML systems)
BentoML builds upon Starlette with optional patterns for data validation, documentation, monitoring and more. The real key is the ML specific primitives that BentoML brings to the table on top of Starlette.
BentoML Services provide an ML-specific abstraction that allows you to encapsulate the functionalities of one or multiple models. This modular design allows each Service to be independently deployed and scaled according to its specific resource needs, such as CPU or GPU requirements.
Creating a BentoML Service is as simple as adding a decorator to a standard Python class, where you can specify the necessary computational resources. For example, certain models may require GPU support for inference tasks, while others can run with CPU resources. These Services can then be assigned to the optimal infrastructure on our AI inference platform BentoCloud.
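As a rough sketch of the shape this takes (the model, resource numbers, and timeout are illustrative; check the BentoML docs for the exact current API):

```python
import bentoml

@bentoml.service(resources={"gpu": 1}, traffic={"timeout": 30})
class TextClassifier:
    def __init__(self) -> None:
        # Load the model once per Service worker (from the BentoML model
        # store, a framework loader, etc.). A lambda stands in for it here.
        self.model = lambda text: {"label": "positive", "score": 0.99}

    @bentoml.api
    def classify(self, text: str) -> dict:
        return self.model(text)
```

A CPU-bound model would declare CPU resources instead, and each Service scales independently of the others.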
Flask and FastAPI were built to be generic web frameworks that optimize for IO-heavy workloads. In doing so, they don’t address some of ML’s most annoying problems.
Dependency management and versioning
ML model versioning, and the dependency management that comes with it, are problems particular to the ML world. Our library makes sure that your model is reproducible in production, regardless of where you save your model or which specific framework you use.
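As a rough sketch of the model store pattern (assuming a scikit-learn model; the name "iris_clf" is arbitrary):

```python
import bentoml
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200).fit(X, y)

# Saving assigns a versioned tag (e.g. "iris_clf:<generated-version>") and
# records framework and dependency metadata alongside the model artifact.
saved = bentoml.sklearn.save_model("iris_clf", clf)
print(saved.tag)

# Later, load that exact same version back for serving.
model = bentoml.sklearn.load_model(saved.tag)
print(model.predict(X[:1]))
```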
Ability to deploy anywhere
When it comes to deployment, ML models come with a whole host of challenges unknown to typical web applications, because the software is much more tightly coupled to the underlying native libraries and hardware. To solve these challenges, BentoML automatically reads your dependencies and assembles an appropriate Docker image with the correct versions of the native libraries for your use case. If you've ever experienced "CUDA hell" (CUDA being the notoriously finicky native library that takes advantage of Nvidia GPUs), BentoML automatically provisions the right version of the library and creates an easily deployable Docker image along with your prediction service.
At BentoML, we want to provide ML practitioners with a practical model serving framework that’s easy to use out-of-the-box and able to scale in production. Next time you’re building an ML service, be sure to give our open source framework a try! For more resources, check out our GitHub page and join our Slack group. We’re innovating every day to build a better framework based on the needs of our community.