Modern application architectures are shifting to the "cloud-native" model: containerized, multi-service, and orchestrated in environments like Kubernetes and Mesos. In this new world, where cross-service communication is a critical part of application behavior, the requirement for resilient applications becomes a requirement for resilient communication.
In this talk, we introduce the notion of a "service mesh": an infrastructure layer for cross-service communication, designed to handle unexpected load, manage tail latencies, and degrade gracefully in the presence of component failure. We describe an open source implementation called linkerd, a lightweight HTTP router and load balancer built on Finagle and Netty, used in production today at banks, AI startups, government labs, and more. We detail linkerd's multi-layered approach to handling failure (and its pernicious cousin, latency), including latency-aware load balancing, failure accrual, deadline propagation, retry budgets, and nacking. Finally, we describe linkerd's routing model and show how it can be used for complex traffic-shifting strategies, including ad-hoc staging clusters, blue-green deploys, and cross-datacenter failover.
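As a taste of the routing model: linkerd expresses routing policy as a dtab (delegation table), a list of prefix-rewrite rules that can also split traffic by weight. A hypothetical dtab for a weighted blue-green shift on Kubernetes might look like the following (the `web-v1`/`web-v2` service names are illustrative):

```
/svc/web => 9 * /#/io.l5d.k8s/default/http/web-v1
          & 1 * /#/io.l5d.k8s/default/http/web-v2;
```

This routes roughly 90% of `/svc/web` traffic to the v1 cluster and 10% to v2; the same rules can also be applied per-request via the `l5d-dtab` header, which is what makes ad-hoc staging clusters possible without redeploying anything.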
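To give a flavor of one of these mechanisms: a retry budget caps retries at a fixed percentage of recent request volume, so a struggling backend never faces an unbounded retry storm. The following is a minimal sketch of the idea in Python, not linkerd's actual implementation (which lives in Finagle); the class name and the 20% default are illustrative.

```python
class RetryBudget:
    """Cap retries at a percentage of observed request volume.

    Each original request deposits `retry_percent` tokens; each
    retry withdraws 100. With the default of 20, ten requests
    earn enough budget for two retries.
    """

    def __init__(self, retry_percent=20):
        self.retry_percent = retry_percent
        self.tokens = 0  # tracked in "percent" units to stay integral

    def record_request(self):
        # Every original (non-retry) request grows the budget a little.
        self.tokens += self.retry_percent

    def try_retry(self):
        # A retry spends one full request's worth of budget; if the
        # budget is exhausted, the retry is dropped rather than
        # amplifying load on an already-failing service.
        if self.tokens >= 100:
            self.tokens -= 100
            return True
        return False
```

The key property, in contrast to a fixed per-request retry count, is that total retry load is bounded relative to offered load, so retries cannot cascade into a self-inflicted outage.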