The magic behind Uber’s data-driven success
Uber, the ride-hailing giant, is a household name worldwide. We all recognize it as the platform that connects riders with drivers for hassle-free transportation. But what most people don’t realize is that behind the scenes, Uber is not just a transportation service; it’s a data and analytics powerhouse. Every day, millions of riders use the Uber app, unwittingly contributing to a complex web of data-driven decisions. This blog takes you on a journey into the world of Uber’s analytics and the critical role that Presto, the open source SQL query engine, plays in driving their success.
Uber’s DNA as an analytics company
At its core, Uber’s business model is deceptively simple: connect a customer at point A to their destination at point B. With a few taps on a mobile device, riders request a ride; then, Uber’s algorithms work to match them with the nearest available driver and calculate the optimal price. But the simplicity ends there. Every transaction, every cent matters. A ten-cent difference in each transaction translates to a staggering $657 million annually. Uber’s prowess as a transportation, logistics and analytics company hinges on their ability to leverage data effectively.
The pursuit of hyperscale analytics
The scale of Uber’s analytical endeavor requires careful selection of data platforms with high regard for limitless analytical processing. Consider the magnitude of Uber’s footprint.1 The company operates in more than 10,000 cities with more than 18 million trips per day. To maintain analytical superiority, Uber keeps 256 petabytes of data in store and processes 35 petabytes of data every day. They support 12,000 monthly active users of analytics running more than 500,000 queries every single day.
To power this mammoth analytical undertaking, Uber chose the open source Presto distributed query engine. Teams at Facebook developed Presto to handle high numbers of concurrent queries on petabytes of data and designed it to scale up to exabytes of data. Presto was able to achieve this level of scalability by completely separating analytical compute from data storage. This allowed them to focus on SQL-based query optimization to the nth degree.
What is Presto?
Presto is an open source distributed SQL query engine for data analytics and the data lakehouse, designed for running interactive analytic queries against datasets of all sizes, from gigabytes to petabytes. It excels in scalability and supports a wide range of analytical use cases. Presto’s cost-based query optimizer, dynamic filtering and extensibility through user-defined functions make it a versatile tool in Uber’s analytics arsenal. To achieve maximum scalability and support a broad range of analytical use cases, Presto separates analytical processing from data storage. When a query is constructed, it passes through a cost-based optimizer, then data is accessed through connectors, cached for performance and analyzed across a series of servers in a cluster. Because of its distributed nature, Presto scales for petabytes and exabytes of data.
The evolution of Presto at Uber
Beginning of a data analytics journey
Uber began their analytical journey with a traditional analytical database platform at the core of their analytics. However, as their business grew, so did the amount of data they needed to process and the number of insight-driven decisions they needed to make. The cost and constraints of traditional analytics soon reached their limit, forcing Uber to look elsewhere for a solution.
Uber understood that digital superiority required the capture of all their transactional data, not just a sampling. They stood up a file-based data lake alongside their analytical database. While this side-by-side strategy enabled data capture, they quickly discovered that the data lake worked well for long-running queries, but it was not fast enough to support the near-real time engagement necessary to maintain a competitive advantage.
To address their performance needs, Uber chose Presto because of its ability, as a distributed platform, to scale in linear fashion and because of its commitment to ANSI-SQL, the lingua franca of analytical processing. They set up a couple of clusters and began processing queries at a much faster speed than anything they had experienced with Apache Hive, a distributed data warehouse system, on their data lake.
Continued high growth
As the use of Presto continued to grow, Uber joined the Presto Foundation, the neutral governing body behind the Presto open source project, as a founding member alongside Facebook. Their initial contributions were based on their need for growth and scalability. Uber focused on contributing to several key areas within Presto:
Automation: To support growing usage, the Uber team went to work on automating cluster management to make it simple to keep up and running. Automation enabled Uber to grow to their current state with more than 256 petabytes of data, 3,000 nodes and 12 clusters. They also put process automation in place to quickly set up and take down clusters.
Workload Management: Because different kinds of queries have different requirements, Uber made sure that traffic is well-isolated. This enables them to batch queries based on speed or accuracy. They have even created subcategories for a more granular approach to workload management.
Because much of the work done on their data lake is exploratory in nature, many users want to execute untested queries on petabytes of data. Large, untested workloads run the risk of hogging all the resources. In some cases, the queries run out of memory and do not complete.
To address this challenge, Uber created and maintains sample versions of datasets. If they know a certain user is doing exploratory work, they simply route them to the sampled datasets. This way, the queries run much faster. There may be inaccuracy because of sampling, but it allows users to discover new viewpoints within the data. If the exploratory work needs to move on to testing and production, they can plan appropriately.
Security: Uber adapted Presto to take users’ credentials and pass them down to the storage layer, specifying the precise data to which each user has access permissions. As Uber has done with many of its additions to Presto, they contributed their security upgrades back to the open source Presto project.
The technical value of Presto at Uber
Analyzing complex data types with Presto
As a digital native company, Uber continues to expand its use cases for Presto. For traditional analytics, they are bringing data discipline to their use of Presto. They ingest data in snapshots from operational systems. It lands as raw data in HDFS. Next, they build model data sets out of the snapshots, cleanse and deduplicate the data, and prepare it for analysis as Parquet files.
For more complex data types, Uber uses Presto’s complex SQL features and functions, especially when dealing with nested or repeated data, time-series data or data types like maps, arrays, structs and JSON. Presto also applies dynamic filtering that can significantly improve the performance of queries with selective joins by avoiding reading data that would be filtered by join conditions. For example, a parquet file can store data as BLOBS within a column. Uber users can run a Presto query that extracts a JSON file and filters out the data specified by the query. The caveat is that doing this defeats the purpose of the columnar state of a JSON file. It is a quick way to do the analysis, but it does sacrifice some performance.
Extending the analytical capabilities and use cases of Presto
To extend the analytical capabilities of Presto, Uber uses many out-of-the-box functions provided with the open source software. Presto provides a long list of functions, operators, and expressions as part of its open source offering, including standard functions, maps, arrays, mathematical, and statistical functions. In addition, Presto also makes it easy for Uber to define their own functions. For example, tied closely to their digital business, Uber has created their own geospatial functions.
Uber chose Presto for the flexibility it provides with compute separated from data storage. As a result, they continue to expand their use cases to include ETL, data science, data exploration, online analytical processing (OLAP), data lake analytics and federated queries.
Pushing the real-time boundaries of Presto
Uber also upgraded Presto to support real-time queries and to run a single query across data in motion and data at rest. To support very low latency use cases, Uber runs Presto as a microservice on their infrastructure platform and moves transaction data from Kafka into Apache Pinot, a real-time distributed OLAP data store, used to deliver scalable, real-time analytics.
According to the Apache Pinot website, “Pinot is a distributed and scalable OLAP (Online Analytical Processing) datastore, which is designed to answer OLAP queries with low latency. It can ingest data from offline batch data sources (such as Hadoop and flat files) as well as online data sources (such as Kafka). Pinot is designed to scale horizontally, so that it can handle large amounts of data. It also provides features like indexing and caching.”
This combination supports a high volume of low-latency queries. For example, Uber has created a dashboard called Restaurant Manager in which restaurant owners can look at orders in real time as they are coming into their restaurants. Uber has made the Presto query engine connect to real-time databases.
To summarize, here are some of the key differentiators of Presto that have helped Uber:
Speed and Scalability: Presto’s ability to handle massive amounts of data and process queries at lightning speed has accelerated Uber’s analytics capabilities. This speed is essential in a fast-paced industry where real-time decision-making is paramount.
Self-Service Analytics: Presto has democratized data access at Uber, allowing data scientists, analysts and business users to run their queries without relying heavily on engineering teams. This self-service analytics approach has improved agility and decision-making across the organization.
Data Exploration and Innovation: The flexibility of Presto has encouraged data exploration and experimentation at Uber. Data professionals can easily test hypotheses and gain insights from large and diverse datasets, leading to continuous innovation and service improvement.
Operational Efficiency: Presto has played a crucial role in optimizing Uber’s operations. From route optimization to driver allocation, the ability to analyze data quickly and accurately has led to cost savings and improved user experiences.
Federated Data Access: Presto’s support for federated queries has simplified data access across Uber’s various data sources, making it easier to harness insights from multiple data stores, whether on-premises or in the cloud.
Real-Time Analytics: Uber’s integration of Presto with real-time data stores like Apache Pinot has enabled the company to provide real-time analytics to users, enhancing their ability to monitor and respond to changing conditions rapidly.
Community Contribution: Uber’s active participation in the Presto open source community has not only benefited their own use cases but has also contributed to the broader development of Presto as a powerful analytical tool for organizations worldwide.
The power of Presto in Uber’s data-driven journey
Today, Uber relies on Presto to power some impressive metrics. From their latest Presto presentation in August 2023, here’s what they shared:
Uber’s success as a data-driven company is no accident. It’s the result of a deliberate strategy to leverage cutting-edge technologies like Presto to unlock the insights hidden in vast volumes of data. Presto has become an integral part of Uber’s data ecosystem, enabling the company to process petabytes of data, support diverse analytical use cases, and make informed decisions at an unprecedented scale.
Getting started with Presto
If you’re new to Presto and want to check it out, we recommend this Getting Started page where you can try it out.
Alternatively, if you’re ready to get started with Presto in production you can check out IBM watsonx.data, a Presto-based open data lakehouse. Watsonx.data is a fit-for-purpose data store, built on an open lakehouse architecture, supported by querying, governance and open data formats to access and share data.
Request a live demo here to see Presto and watsonx.data in action
1 Uber. EMA Technical Case Study, sponsored by Ahana. Enterprise Management Associates (EMA). 2023.