Like many startups, Uber began its journey with a monolithic architecture, built for a single offering in a single city. At the time, all of Uber was our UberBLACK option and our “world” was San Francisco. Having one codebase seemed “clean” at the time, and solved our core business problems, which included connecting drivers with riders, billing, and payments. It was reasonable back then to have all of Uber’s business logic in one place. As we rapidly expanded into more cities and introduced new products, this quickly changed.

As core domain models grew and new features were introduced, our components became tightly coupled, which made it difficult to enforce encapsulation and keep concerns separated. Continuous integration turned into a liability because deploying the codebase meant deploying everything at once. Our engineering team experienced rapid growth and scaling, which not only meant handling more requests but also handling a significant increase in developer activity. Adding new features, fixing bugs, and resolving technical debt all in a single repo became extremely difficult. Tribal knowledge was required before attempting to make a single change.


Moving to a SOA

We decided to follow the lead of other hyper-growth companies—Amazon, Netflix, SoundCloud, Twitter, and others—and break up the monolith into multiple codebases to form a service-oriented architecture (SOA). Specifically, since the term SOA tends to mean a variety of different things, we adopted a microservice architecture. This design pattern enforces the development of small services dedicated to specific, well-encapsulated domain areas. Each service can be written in its own language or framework, and can have its own database or lack thereof.


Migrating from a monolithic codebase to a distributed SOA solved many of our problems, but it created a few new ones as well. These problems fall into three main areas: 

  1. Obviousness
  2. Safety
  3. Resilience



With 500+ services, finding the appropriate service becomes arduous. Once identified, how to utilize the service is not obvious, since each microservice is structured in its own way. Services providing REST or RPC endpoints (where you can access functionality within that domain) typically offer weak contracts, and in our case these contracts vary greatly between microservices. Adding JSON Schema to a REST API can improve safety and the process of developing against the service, but it is not trivial to write or maintain. Finally, these solutions do not provide any guarantees regarding fault tolerance or latency. There’s no standard way to handle client-side timeouts and outages, or ensure an outage of one service does not cause cascading outages. The overall resiliency of the system would be negatively impacted by these weaknesses. As one developer put it, we “converted our monolithic API into a distributed monolithic API”.

It has become clear we need a standard way of communication that provides type safety, validation, and fault tolerance. Other goals include:

  • Simple ways to provide client libraries
  • Cross language support
  • Tunable default timeouts and retry policies
  • Efficient testing and development

At this stage in our hyper-growth, Uber engineers continue to evaluate technologies and tools to fit our goals. One thing we do know is that using an existing Interface Definition Language (IDL) that provides lots of pre-built tooling from day one is ideal.

We evaluated the existing tools and found that Apache Thrift (made popular by Facebook and Twitter) met our needs best. Thrift is a set of libraries and tools for building scalable cross-language services. To accomplish this, datatypes and service interfaces are defined in a language-agnostic file. Code is then generated from that file to abstract the transport and encoding of RPC messages between services written in all of the languages we support (Python, Node, Go, etc.).

In addition to Thrift, we’re creating lifecycle tooling to publish these clients to packaging systems (such as pip for Python and npm for Node). Discovering and contributing to the service then becomes a manageable task. Service clients also act as learning tools, in addition to docs and wikis.



The most compelling argument for Thrift is its safety. Thrift guarantees safety by binding services to use strict contracts. The contract describes how to interact with that service including how to call service procedures, what inputs to provide, and what output to expect. In the following Thrift IDL we have defined a service Zoo with a function makeSound that takes a string animalName and returns a string or throws an exception.

  struct Animal {
    1: i32 id
    2: string name
    3: string sound
  }

  exception NotFoundException {
    1: i32 what
    2: string why
  }

  service Zoo {
    /**
     * Returns the sound the given animal makes.
     */
    string makeSound(1: string animalName) throws (
      1: NotFoundException noAnimalFound
    )
  }

Adhering to a strict contract means less time is spent figuring out how to communicate with a service and dealing with serialization. In addition, as a microservice evolves we do not have to worry about interfaces changing suddenly, and are able to deploy services independently from consumers. This is very good news for Uber engineers. We’re able to move on to other projects and tools since Thrift solves the problem of safety out of the box.
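As a rough illustration of what the Zoo contract asks of a service, the sketch below hand-writes Python stand-ins for the types that Thrift would normally generate, plus a handler for makeSound. It is not actual Thrift output: the `ZooHandler` class, the in-memory lookup, and the error code 404 are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Animal:
    """Stand-in for the generated Animal struct."""
    id: int
    name: str
    sound: str


class NotFoundException(Exception):
    """Stand-in for the generated NotFoundException."""

    def __init__(self, what, why):
        super().__init__(why)
        self.what = what
        self.why = why


class ZooHandler:
    """Hypothetical server-side implementation of the Zoo service."""

    def __init__(self, animals):
        self._by_name = {animal.name: animal for animal in animals}

    def makeSound(self, animalName):
        # The contract allows exactly two outcomes:
        # a string result or a NotFoundException.
        animal = self._by_name.get(animalName)
        if animal is None:
            # 404 is an arbitrary code chosen for this sketch.
            raise NotFoundException(what=404, why=f"no animal named {animalName!r}")
        return animal.sound
```

A consumer only needs to handle the string result and `NotFoundException`, because the contract rules out any other response shape.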



Lastly, we drew inspiration from fault tolerance and latency libraries in other companies facing similar challenges, such as Netflix’s Hystrix library and Twitter’s Finagle library, to tackle the problem of resiliency. With those libraries in mind, we wrote libraries that ensure clients are able to deal with failure scenarios successfully (which will be discussed in more detail in a future post).
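One pattern popularized by libraries of this kind is the circuit breaker, which stops calling a failing dependency for a while so that one outage does not cascade. The sketch below is a simplified illustration of that idea only. It is not the API of Hystrix or Finagle, nor the libraries mentioned above, and the class name, thresholds, and reset behavior are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a timeout, which gives the struggling service room to recover.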


Tradeoffs and Where We’re Headed

Of course, no solution is perfect and all solutions have challenges. Unfortunately, Thrift’s toolset is relatively young and tools for Python and Node are not abundant. There is a risk that a lot of time will be invested in creating these tools. Additionally, there is no higher-level support for headers. Authentication and cross-service tracing, for example, are two challenging problems, since the higher-level metadata they depend on would have to be passed in explicitly with every call.
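To show why the lack of headers is awkward, the following sketch threads authentication and tracing metadata through two calls as an ordinary argument. Everything in it is hypothetical: `CallContext`, the token check, and both functions exist only to illustrate the burden of passing the metadata by hand.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CallContext:
    """Hypothetical metadata that a header mechanism would otherwise carry."""
    auth_token: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)


SOUNDS = {"cow": "moo"}  # illustrative data only


def make_sound(ctx, animal_name):
    """Downstream call: ctx is an ordinary argument, checked by hand."""
    if ctx.auth_token != "secret":  # placeholder check, not real authentication
        raise PermissionError(f"trace {ctx.trace_id}: bad token")
    return SOUNDS[animal_name]


def describe_animal(ctx, animal_name):
    """Upstream call: it must remember to forward ctx, or the trace is lost."""
    return f"{animal_name} says {make_sound(ctx, animal_name)}"
```

Every function signature in the call chain gains a `ctx` parameter, and a single forgotten forward breaks the trace. That is the cost the paragraph above describes.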

Dismantling our well-worn monolith has been a long time coming. While it has been a key component that enabled our explosive growth in the past, it has grown cumbersome and difficult to scale further and maintain.

Our goal for the remainder of 2015 is to get rid of this repo entirely—promoting clear ownership, offering better organizational scalability, and providing more resilience and fault tolerance through our commitment to microservices.




