What I learned about Software Design

Lesson 1 – Run your services as a remote service from the service client.

The biggest thing I learned was that it seems everyone has pulled their service implementations out of their applications and into separate processes. We have traditionally been nervous about doing this because of the performance impact; local calls will always be faster. However, over the course of building our redesigned site, managing revisions of our service implementations has proven to be a huge challenge. Every time we fix a critical defect or add new functionality to a service implementation, we have to update every single web application that uses that service and redeploy it. We have almost twenty web applications and numerous other deployment artifacts, and for our core services an upgrade can take our RM group all day to roll through all the applications that need updating.

At this point, I would rather worry a bit about performance than live with the rigidity our current system forces on us. Running remotely means each service client only needs the service interface as a dependency, plus some generic, shared client code for accessing the service.

Therefore, Lesson 1 is: run your services as a remote process from the service client. An adjunct to Lesson 1 is that hardwired host and port numbers are a bad idea, especially in a cloud environment; use a naming service and a URI to locate services.
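To make the naming-service idea concrete, here is a minimal sketch of a service locator that maps logical service names to URIs. All class and service names here are hypothetical; a real implementation would query an actual naming service (DNS, ZooKeeper, etc.) rather than an in-memory map.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ServiceLocator {
    // Maps a logical service name to its current URI.
    private final Map<String, URI> registry = new ConcurrentHashMap<>();

    // In a real system this entry would come from a naming service,
    // not a local registration call.
    public void register(String serviceName, URI uri) {
        registry.put(serviceName, uri);
    }

    // Clients resolve by name at call time, so hosts and ports can move
    // without redeploying every client.
    public URI resolve(String serviceName) {
        URI uri = registry.get(serviceName);
        if (uri == null) {
            throw new IllegalStateException("No instance registered for " + serviceName);
        }
        return uri;
    }
}
```

The point of the indirection is that the client code only ever knows the logical name; where the service actually lives can change at any time.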

One question still remains: if a service is remote, how much visibility should the developer have into that remoteness, or should it be hidden entirely? Opinions appear, as usual, to differ.

Lesson 2 – Layers and Abstraction are your friend

This is a lesson we need to constantly relearn. Often we find ourselves moving too fast, and we end up breaking through layers or crossing within layers to quickly make something work. At the end of the day we create a "Big Ball of Mud"; no one wants to, but we have a hard time escaping it. We need to constantly remind ourselves to look at our abstractions, make sure they are right, and make sure we are not creating crossed dependencies.

An interesting concept that Netflix presented was a type system that allows a separation between domain models for back-end and front-end concerns. Using an adaptable type model, each problem domain gets the object in the shape it needs. For example, Adrian Cockcroft from Netflix used the example of adapting a video object for front-end presentation: video.asA(PresentationVideo). The Sling project from Day, available at Apache, has a similar idea in its Adaptable interface; Netflix, however, takes a slightly different approach by introducing a type manager to help manage conversions.
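A rough sketch of the adaptable-type idea, loosely modeled on Sling's Adaptable interface and the video.asA(...) example above. The adapter registry plays the role of a miniature "type manager"; every class name here is hypothetical, not Netflix's or Sling's actual API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class AdaptableVideo {
    private final String title;

    // Registry of adapters keyed by target type: a "type manager" in miniature.
    private static final Map<Class<?>, Function<AdaptableVideo, ?>> ADAPTERS = new HashMap<>();

    public AdaptableVideo(String title) { this.title = title; }

    public String getTitle() { return title; }

    public static <T> void registerAdapter(Class<T> target, Function<AdaptableVideo, T> adapter) {
        ADAPTERS.put(target, adapter);
    }

    // video.asA(PresentationVideo.class) hands back the front-end view of the object.
    public <T> T asA(Class<T> target) {
        Function<AdaptableVideo, ?> adapter = ADAPTERS.get(target);
        if (adapter == null) {
            throw new IllegalArgumentException("No adapter registered for " + target);
        }
        return target.cast(adapter.apply(this));
    }
}

// A front-end-specific shape of the same domain object.
class PresentationVideo {
    final String displayTitle;
    PresentationVideo(String displayTitle) { this.displayTitle = displayTitle; }
}
```

The nice property is that the back-end model never has to know what the front end needs; each consumer registers its own adapter.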

Lesson 3 – Degrade gracefully all the way down the stack

Our front-end team does a great job of degrading gracefully; in the middle tier, however, we do not. If a service is slow or unavailable, requests time out or fail. eBay presented a lot of great ideas for managing failure and argued that, given the right frameworks, developers don't even have to recode the failure recovery.

Failure recovery can mean many things: pages with components that simply do not render, components that render with stale cached data, or components that render with default data. The worst thing to do is just e.printStackTrace(), something that happens a lot more often than one may care to admit (ok fine, sometimes people use throw new RuntimeException(e)!).

By degrading gracefully when a lower-tier service times out, the response time seen by the calling service becomes bounded. Rather than hanging the calls above them, you essentially guarantee that the slowest response is the timeout itself.
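A minimal sketch of that bounded-call pattern: wrap the remote call in a timeout and fall back to cached or default data rather than propagating the failure. The class and method names are hypothetical, not any particular framework's API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DegradingClient<T> {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Returns the remote result if it arrives within timeoutMillis,
    // otherwise the fallback (e.g. stale cached data or a default value).
    // The caller's worst case is now the timeout, never an open-ended hang.
    public T callWithFallback(Callable<T> remoteCall, T fallback, long timeoutMillis) {
        Future<T> future = pool.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            future.cancel(true); // stop waiting on the slow dependency
            return fallback;
        }
    }

    public void shutdown() {
        pool.shutdownNow();
    }
}
```

The fallback value here is where the hard product questions from the comments below live: stale cache, default data, or an empty component.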

In our messaging system we already have an exponential backoff coded into our transport. However, this is an area where we are not as good in our own code.
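For reference, the backoff calculation is simple enough to sketch in a few lines; the base delay and cap below are illustrative values, not what our transport actually uses.

```java
public class BackoffPolicy {
    private static final long BASE_MILLIS = 100;    // first retry delay
    private static final long MAX_MILLIS = 30_000;  // cap so waits stay bounded

    // Delay doubles with each failed attempt (attempt 0, 1, 2, ...),
    // capped at MAX_MILLIS so repeated failures never wait forever.
    public static long delayMillis(int attempt) {
        long delay = BASE_MILLIS * (1L << Math.min(attempt, 20));
        return Math.min(delay, MAX_MILLIS);
    }
}
```

Many implementations also add random jitter to the delay so that failed clients don't all retry in lockstep.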

Lesson 4 – Use tracing not profiling to detect performance problems

When we run into performance issues, we tend to hook a profiler up to a VM in production and watch call times. Instead of doing this, Shopzilla, eBay, Netflix, and others all use frameworks that enable tracing for a percentage of traffic. By tracing all calls for a subset of traffic, you can diagnose performance issues just as you would with a profiler and narrow them down to a small set of calls. But you also gain a monitoring tool that helps you notice errors before they get bad: by watching the traces you can spot deviations from the norm and work to address them.
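The "percentage of traffic" part comes down to a per-request sampling decision. A minimal sketch (the class name and rate are illustrative, not any specific framework):

```java
import java.util.concurrent.ThreadLocalRandom;

public class TraceSampler {
    private final double sampleRate; // e.g. 0.01 traces 1% of requests

    public TraceSampler(double sampleRate) {
        this.sampleRate = sampleRate;
    }

    // Called once per incoming request; only sampled requests pay the
    // cost of recording every downstream call.
    public boolean shouldTrace() {
        return ThreadLocalRandom.current().nextDouble() < sampleRate;
    }
}
```

In a real system the sampling decision made at the edge would be propagated to downstream services (usually via a request header) so the whole call tree for a sampled request gets traced.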

Lesson 5 – Caching is your friend

This is an old lesson; in the past, however, we abused caching to hide underlying performance issues. Having been bitten by that laziness, we chose to focus on raw performance above all and not cache at all. It is time to revisit this approach and recognize that we can use caching for scale. Our current system is fast, but at the end of the day, scaling it as-is will be much more expensive than scaling with caching.
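The kind of cache in question can be as small as a bounded LRU map in front of an expensive backend call. Here is a minimal sketch built on LinkedHashMap's access-order mode; the capacity is illustrative, and a production cache would also need expiry and concurrency control.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        // access-order = true: iteration order tracks recency of use,
        // which is what makes removeEldestEntry evict the LRU entry.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict least-recently-used when over capacity
    }
}
```

Bounding the cache is the point: it lets you trade a fixed amount of memory for backend load without the unbounded-growth failure mode.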

Next up…Pretotyping Rocks!


3 Responses to What I learned about Software Design

  1. Great roundup!
    I agree that Lesson 1 (run your services as a remote service from the service client) was a big revelation, considering how unintuitive it was. Also, Lesson 5 (caching) could address any performance implications that might arise from implementing remote services. So in a way, the drawback to Lesson 1 is taken care of by Lesson 5.
    Randy Shoup of eBay made a statement in one of his presentations that I thought was very interesting. He said, “the consumer needs to be able to recover when its dependencies are down.” If an application (i.e. consumer) is using a remote service, how could it recover gracefully when that service is down? Could a local cache at the consumer layer be an effective failover mechanism, especially for read-only service calls? Should it just stop working and notify the user that an error has occurred?
    These are some of the questions we’ll have to tackle as we open up our platform next year. The fact that we expect 3rd-party developers to use our services answers, in a way, your question about how much visibility the developer needs to have into the architecture of our services. I think the answer is: none. All the developer needs to care about is what methods to call and what data comes back, and in what format. The fun details should be hidden, I think.

  2. Paddy Hannon says:

    Failing gracefully is a lesson that I am still trying to wrap my head around. Both Adrian Cockcroft and Randy Shoup discussed the concept and talked about returning a default value, returning nothing, or perhaps a cached value. However, the specifics of what to return and how are tricky. One thing is for sure: an error 500 page is bad, and so is the fail whale; sometimes, however, it may be unavoidable.
    An example: if our vehicle service fails, what should a client that needs a list of all new Toyota model names do? Should the service wrapper return nothing? If the service wrapper has cached data, that would be ideal, but it may not; then what? As we explore untangling our service layer, I am puzzled by these questions and excited about trying to find the answers.

  3. Maybe another way of thinking about it is to ask the following question: what are the unacceptable outcomes when a service request times out or fails? Maybe by clearly identifying what we “don’t” want to happen we can find out what the best solution is.
