As you know from “A Brief History of Presto”, development of Presto started at Facebook. Ever since it was open sourced in 2013, its use has picked up and spread widely across a large variety of industries and companies.
In this chapter, you’ll see a few key numbers and characteristics that will give you a better idea about the potential for your own use of Presto. Keep in mind that all these companies started with a smaller cluster and learned on the go. Of course, many smaller and larger companies are using Presto. The data you find here should just give you a glimpse of how your Presto use can grow.
Also keep in mind that many platforms embed Presto. These platforms don’t even necessarily expose the fact that Presto is under the hood. And these platforms do not typically expose user numbers, architecture, and other characteristics.
The cited numbers and stats in this chapter are all based on publicly available information. The book repository contains links to sources such as blog posts, presentations from conferences, slide decks, and videos (see “Book Repository”). By the time you read this book, the data may have become outdated or inaccurate. However, the growing use of Presto and the general content give you a good understanding of what is possible with Presto and how other users successfully deploy, run, and use it.
Where and how you run your Presto deployment is an important factor. It impacts everything from low-level technical requirements to high-level user-facing aspects. The technical aspects include the level of direct involvement necessary to run Presto, the tooling used, and the required operations know-how. From a user perspective, the platform influences aspects such as overall performance, speed of change, and adaptation to different requirements.
Last but not least, other aspects might influence your choice, such as use of a specific platform in your company and the expected costs. Let’s see what common practices are out there.
Where does Presto run? Here are some points to consider:
Clusters of bare-metal servers are becoming rather rare.
Virtual machines are most commonly used now.
Container use is becoming the standard and is poised to overtake VMs.
As a modern, horizontally scaling application, Presto can be found running in all the common deployment platforms:
Private on-premises clouds such as OpenStack
Various public cloud providers including AWS, GCP, and Azure
Mixed deployments
Here are some examples:
Lyft runs Presto on AWS.
Pinterest runs Presto on AWS.
Twitter runs Presto on a mix of on-premises cloud and GCP.
The industry trend of moving to containers and Kubernetes has made an impact on Presto deployments. An increasing number of Presto clusters run in that environment and in the related Kubernetes offerings from public cloud providers.
The size of the Presto clusters at some of the larger users is truly astounding, even in this age of big data everywhere. Presto has proven for many years now that it can handle scale in production.
So far, we’ve mostly talked about running a Presto cluster and have hinted at the fact that you can run multiple clusters.
When you look at real-world use of Presto at scale, you find that most large deployments use multiple clusters. Various approaches to using multiple clusters are available in terms of Presto configuration, data sources, and user access:
Identical
Different
Mixed
Here are some advantages you can gain from identical clusters:
Using a load balancer in front of the clusters enables you to upgrade one cluster at a time without visible downtime for your users.
Coordinator failures do not take the system offline.
Horizontal scaling for higher cluster performance can include use of multiple runtime platforms; for example, you might use an internal Kubernetes cluster at all times and an additional cluster running in a Kubernetes offering from a public cloud provider for peak usage.
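The load-balancing idea behind identical clusters can be illustrated with a minimal sketch. It assumes a simple round-robin policy and a manual drain step during upgrades; the `ClusterPool` class, its methods, and the coordinator URLs are hypothetical names chosen for illustration and are not part of Presto or of any deployment described in this chapter.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ClusterPool:
    """Round-robin router over identical clusters (hypothetical helper)."""

    coordinators: list[str]
    draining: set[str] = field(default_factory=set)
    _counter: count = field(default_factory=count, repr=False)

    def drain(self, url: str) -> None:
        # Stop sending new queries to this cluster, e.g. before an upgrade.
        self.draining.add(url)

    def restore(self, url: str) -> None:
        # Put the cluster back into rotation once it is healthy again.
        self.draining.discard(url)

    def next_coordinator(self) -> str:
        # Only clusters that are not draining receive new queries.
        active = [c for c in self.coordinators if c not in self.draining]
        if not active:
            raise RuntimeError("no cluster available")
        return active[next(self._counter) % len(active)]


# Assumed example URLs; replace with your own coordinator addresses.
pool = ClusterPool([
    "http://presto-a.example:8080",
    "http://presto-b.example:8080",
])
pool.drain("http://presto-a.example:8080")  # upgrade cluster A
target = pool.next_coordinator()            # queries now go to cluster B only
pool.restore("http://presto-a.example:8080")
```

A production load balancer would also track running queries and perform health checks; the sketch only shows why users see no downtime while one of several identical clusters is taken out of rotation.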
Separate clusters, on the other hand, allow you to clearly separate different use cases and tune the cluster in various aspects:
Available data sources
Location of cluster near data sources
Different users and access rights
Tuning of worker and coordinator configuration for different use cases; for example, ad hoc interactive queries compared to long-running batch jobs
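With separate clusters, something has to decide which cluster a given query goes to. The following is a minimal sketch of such a routing rule, assuming each client supplies a label describing its workload. The `etl-` prefix convention, the `pick_cluster` function, and the cluster URLs are assumptions made for this example.

```python
# Assumed mapping from workload type to a dedicated, separately tuned cluster.
CLUSTERS = {
    "interactive": "http://presto-adhoc.example:8080",
    "batch": "http://presto-etl.example:8080",
}


def pick_cluster(source: str) -> str:
    """Map a client-supplied source label to a cluster URL.

    Labels starting with "etl-" are treated as long-running batch jobs;
    everything else is treated as ad hoc interactive use.
    """
    workload = "batch" if source.startswith("etl-") else "interactive"
    return CLUSTERS[workload]


dashboard_target = pick_cluster("dashboard")    # interactive cluster
nightly_target = pick_cluster("etl-nightly")    # batch cluster
```

Keeping the rule in one place makes it straightforward to add further clusters later, for example one located close to a specific data source.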
The following companies are known to run several Presto clusters:
Comcast
Lyft
Most other organizations mentioned in this chapter probably also use multiple clusters. We just did not find a public reference to that fact.
After having a look at the number of clusters, let’s look at the number of nodes. There are some truly massive deployments and others at a scale you might end up reaching yourself in the future:
Facebook: more than 10,000 nodes across multiple clusters
FINRA: more than 120 nodes
LinkedIn: more than 500 nodes
Lyft: more than 400 nodes, 100–150 nodes per cluster
Netflix: more than 300 nodes
Twitter: more than 2,000 nodes
Wayfair: 300 nodes
Yahoo! Japan: more than 600 nodes
Probably still the most common reason to adopt Presto is the migration from Hive to allow compliant and performant SQL access to the data in HDFS. This includes the desire to query not just HDFS, but also Amazon S3, S3-compatible systems, and many other distributed storage systems.
This first use case for Presto is often the springboard to wide adoption of Presto for other uses.
Companies using Presto to expose data in these systems include Comcast, Facebook, LinkedIn, Twitter, Netflix, and Pinterest. And here is a small selection of numbers:
Facebook: 300 PB
FINRA: more than 4 PB
Lyft: more than 20 PB
Netflix: more than 100 PB
Beyond the typical Hadoop/Hive/distributed storage use case, Presto adoption is gaining ground for many other data sources. This includes use of the connectors available from the Presto project, as well as significant use of other data sources through third-party connectors from other open source projects, internal development, and commercial vendors.
Here are some examples:
Comcast: Apache Cassandra, Microsoft SQL Server, MongoDB, Oracle, Teradata
Facebook: MySQL
Insight: Elasticsearch, Apache Kafka, Snowflake
Wayfair: MemSQL, Microsoft SQL Server, Vertica
Last but not least, you can learn a bit more about the users of these Presto deployments and the number of queries they typically run. Users include business analysts working on dashboards and ad hoc queries, developers mining log files and test results, and many others.
Here is a small selection of user counts:
Arm Treasure Data: approximately 3,500 users
Facebook: more than 1,000 employees daily
FINRA: more than 200 users
LinkedIn: approximately 1,000 users
Lyft: approximately 1,500 users
Pinterest: more than 1,000 monthly active users
Wayfair: more than 200 users
Queries range from very small, simple queries to large, complex analysis queries or even ETL-type workloads. As such, the number of queries tells only part of the story, although it still gives you a sense of the scale at which Presto operates:
Arm Treasure Data: approximately 600,000 queries per day
Facebook: more than 30,000 queries per day
LinkedIn: more than 200,000 queries per day
Lyft: more than 100,000 queries per day and more than 1.5 million queries per month
Pinterest: more than 400,000 queries per month
Slack: more than 20,000 queries per day
Twitter: more than 20,000 queries per day
Wayfair: up to 180,000 queries per month
What a wide range of scale and usage! As you can see, Presto is widely used across various industries. As a beginning user, you can feel confident that Presto scales with your demands and is ready to take on the load and the work that you expect it to process.
We encourage you to learn more about Presto from the various resources, and especially to join the Presto community and attend community events.