Observability at Scale: Building Uber’s Alerting Ecosystem
Uber’s alerting ecosystem is a vital component in maintaining the stability and scalability of its thousands of microservices.
The Observability team has developed two primary alerting systems: uMonitor, which focuses on metrics-based alerts, and Neris, which handles host-level infrastructure alerts.
uMonitor operates on a flexible platform, allowing for easy alert management and diverse use cases, while Neris executes alert checks directly on hosts to efficiently handle high-resolution, high-cardinality metrics.
Handling the challenge of high cardinality is central to Uber’s alerting approach.
Origami, the deduplication and notification engine, consolidates notifications and lets alerts be aggregated by criteria such as city, product, or app version.
This reduces noise and ensures that engineers receive only relevant alerts.
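The deduplication and aggregation idea can be illustrated with a minimal sketch. This is not Origami's actual implementation or API: the `Alert` type, its fields, the `aggregate_alerts` function, and the sample data are all hypothetical, chosen only to mirror the criteria named above (city, product, app version).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape; the real alert schema is not described in the source.
    name: str
    city: str
    product: str
    app_version: str


def aggregate_alerts(alerts, group_by):
    """Drop exact duplicate alerts, then bucket the rest by the chosen fields."""
    groups = defaultdict(list)
    seen = set()
    for alert in alerts:
        if alert in seen:  # frozen dataclasses are hashable, so repeats are skipped
            continue
        seen.add(alert)
        key = tuple(getattr(alert, field) for field in group_by)
        groups[key].append(alert)
    return dict(groups)


# Invented sample data: the second entry repeats the first.
alerts = [
    Alert("trip_failures", "san_francisco", "rides", "5.1"),
    Alert("trip_failures", "san_francisco", "rides", "5.1"),
    Alert("trip_failures", "san_francisco", "rides", "5.2"),
    Alert("trip_failures", "new_york", "eats", "5.1"),
]

by_city = aggregate_alerts(alerts, group_by=("city",))
for key, members in by_city.items():
    print(key, len(members))
```

Grouping by `("city", "product")` instead would yield one bucket per city and product pair, so a single notification could stand in for every alert in that bucket.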
Overall, Uber’s alerting ecosystem is tailored to handle the scale and complexity of its infrastructure, with a focus on flexibility, scalability, and relevance of notifications. (Source)
Uber’s Big Data Observability and Chargeback Platform
Uber’s data infrastructure is composed of a wide variety of compute engines, execution solutions, and storage solutions.
With such a complex and diverse data infrastructure, it is challenging to give stakeholders a holistic view of performance and resource consumption across the various compute engines and storage solutions.
This is where DataCentral comes in.
It is a comprehensive platform that gives users essential insights into big data applications and queries.
DataCentral helps data platform users by offering detailed information on workflows and apps, enhancing productivity, and reducing debugging time.
The following are the key features of DataCentral.
- Observability
It provides granular insights into performance trends, costs, and degradation signals for big data jobs.
DataCentral also offers historical trends for metrics such as cost, duration, efficiency, data read/written, and shuffle, enabling faster detection and debugging of application issues.
- Chargeback
It tracks metrics and resource usage for big data tools and engines such as Presto, YARN, HDFS, and Kafka, allowing stakeholders to understand costs at various granularities: user, pipeline, application, schedule, and queue.
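Rolling costs up at different granularities can be sketched as a simple group-and-sum. The record fields, the cost values, and the `rollup_cost` function below are assumptions for illustration only; they do not reflect DataCentral's actual data model.

```python
from collections import defaultdict

# Hypothetical usage records; field names and cost figures are invented.
usage_records = [
    {"user": "alice", "pipeline": "daily_etl", "queue": "prod", "cost": 12.5},
    {"user": "alice", "pipeline": "adhoc_report", "queue": "adhoc", "cost": 7.5},
    {"user": "bob", "pipeline": "daily_etl", "queue": "prod", "cost": 30.0},
    {"user": "bob", "pipeline": "backfill", "queue": "adhoc", "cost": 2.25},
]


def rollup_cost(records, level):
    """Sum cost per distinct value of `level` (for example "user" or "queue")."""
    totals = defaultdict(float)
    for record in records:
        totals[record[level]] += record["cost"]
    return dict(totals)


for level in ("user", "pipeline", "queue"):
    print(level, rollup_cost(usage_records, level))
```

The same records answer different questions depending on the chosen level, which is the point of tracking cost at several granularities.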
- Consumption Reduction Programs
DataCentral powers cost reduction initiatives by providing insights into expensive pipelines, continuously failing workloads, and unnecessary computing.
- Contactless
This system aims to troubleshoot failed queries and applications efficiently by improving error discoverability, identifying root causes, and providing user-friendly explanations and suggestions.
It matches exception traces against rules set by engine teams to surface relevant messages. (Source)
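Matching exception traces against engine-owned rules can be sketched as a small pattern lookup. The `Rule` type, the patterns, the messages, and the `explain_failure` function are all hypothetical; the source does not describe the real rule format.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    engine: str   # engine team that owns the rule
    pattern: str  # regular expression searched for in the exception trace
    message: str  # user-friendly explanation shown when the pattern matches


# Invented example rules; real rules would be maintained by engine teams.
RULES = [
    Rule("presto", r"exceeded .*memory limit",
         "The query used more memory than allowed; try filtering earlier or reducing joins."),
    Rule("yarn", r"Container killed",
         "A container was killed; check the memory settings of the application."),
]


def explain_failure(engine, trace, rules=RULES):
    """Return the message of the first rule for `engine` that matches, else None."""
    for rule in rules:
        if rule.engine == engine and re.search(rule.pattern, trace):
            return rule.message
    return None


print(explain_failure("presto", "Query exceeded per-node memory limit of 1GB"))
print(explain_failure("presto", "Unrecognized failure"))
```

Returning `None` for unmatched traces keeps the fallback explicit, so a caller can show the raw trace when no rule applies.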