Lesson 1 – Run your services as a remote process, separate from the service client.
The biggest thing I learned is that seemingly everyone has pulled their service implementations out of their applications and into separate processes. We have traditionally been nervous about doing this because of the performance impact; local calls will always be faster. However, over the course of building our redesigned site, managing revisions of our service implementations has proven to be a huge challenge. Every time we fix a critical defect or add new functionality to a service implementation, we have to update every single web application that uses that service and redeploy it. We have almost twenty web applications and numerous other deployment artifacts, and for our core services an upgrade can take our RM group all day to roll through all the applications that need updating.
At this point, I would rather worry a bit about performance than deal with the rigidity our current system forces on us. Running remotely means each service client needs only the service interface as a dependency, plus some generic, shared client code for accessing the service.
Therefore, Lesson 1 is: run your services as a remote process, separate from the service client. An adjunct to Lesson 1 is that hardwired host and port numbers are a bad idea, especially in a cloud environment; use a naming service and a URI to locate services.
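To make the adjunct concrete, here is a minimal sketch of the naming-service idea: clients resolve a logical service name to a URI at call time instead of hardwiring hosts and ports. All names here (ServiceRegistry, "inventory-service") are illustrative assumptions; a real system would back this with something like ZooKeeper or DNS rather than an in-memory map.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a naming service maps logical service names to URIs,
// so clients never hardwire hosts and ports.
public class ServiceRegistry {
    private final Map<String, URI> entries = new ConcurrentHashMap<>();

    // In a real deployment this registry would be a shared, replicated
    // service (ZooKeeper, DNS SRV records, etc.), not an in-memory map.
    public void register(String serviceName, URI location) {
        entries.put(serviceName, location);
    }

    public URI lookup(String serviceName) {
        URI location = entries.get(serviceName);
        if (location == null) {
            throw new IllegalStateException("No instance registered for " + serviceName);
        }
        return location;
    }

    public static void main(String[] args) {
        ServiceRegistry registry = new ServiceRegistry();
        registry.register("inventory-service", URI.create("http://10.0.3.17:8080/inventory"));
        // The client resolves by name at call time; when the service moves,
        // only the registry entry changes, not the client.
        System.out.println(registry.lookup("inventory-service"));
    }
}
```

The point of the indirection is that redeploying or moving a service changes only its registry entry, never the twenty-odd applications that call it.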
One question still remains: if a service is remote, how much visibility should the developer have into that remoteness, and how much should be hidden? Opinions, as usual, differ.
Lesson 2 – Layers and Abstraction are your friend
This is a lesson we need to constantly relearn. Often we find ourselves moving too fast, and we end up breaking through layers, or crossing within them, to quickly make something work. At the end of the day we create a "Big Ball of Mud". No one wants to, but we have a hard time escaping from it. We need to constantly remind ourselves to examine our abstractions, ensure they are right, and check that we are not creating crossed dependencies.
An interesting concept that Netflix presented was a type system that allows a separation between domain models for back-end and front-end concerns. Using an adaptable type model, each problem domain gets the object in the shape it needs. For example, Adrian Cockcroft from Netflix used the example of adapting a video object for front-end presentation: video.asA(PresentationVideo). The Sling project from Day, available at Apache, had a similar idea in its Adaptable interface; however, Netflix takes a slightly different approach by creating a type manager to manage the conversions.
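The idea can be sketched roughly as follows. This is not the actual Netflix or Sling API: TypeManager, Video, and PresentationVideo are illustrative names I am assuming, showing only the shape of asA() delegating to a central registry of adapters.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch of an adaptable type model in the spirit of
// video.asA(PresentationVideo). All names here are assumptions.
class TypeManager {
    private static final Map<Class<?>, Function<Video, ?>> adapters = new HashMap<>();

    static <T> void register(Class<T> target, Function<Video, T> adapter) {
        adapters.put(target, adapter);
    }

    @SuppressWarnings("unchecked")
    static <T> T adapt(Video source, Class<T> target) {
        Function<Video, ?> adapter = adapters.get(target);
        if (adapter == null) {
            throw new IllegalArgumentException("No adapter for " + target.getSimpleName());
        }
        return (T) adapter.apply(source);
    }
}

class Video {
    final String title;
    final long runtimeSeconds;
    Video(String title, long runtimeSeconds) {
        this.title = title;
        this.runtimeSeconds = runtimeSeconds;
    }

    // asA delegates to the type manager, keeping the domain model
    // free of presentation concerns.
    <T> T asA(Class<T> target) { return TypeManager.adapt(this, target); }
}

class PresentationVideo {
    final String displayTitle;
    PresentationVideo(String displayTitle) { this.displayTitle = displayTitle; }
}

public class AdaptableDemo {
    public static void main(String[] args) {
        TypeManager.register(PresentationVideo.class,
                v -> new PresentationVideo(v.title + " (" + v.runtimeSeconds / 60 + " min)"));
        Video video = new Video("Up", 5760);
        PresentationVideo pv = video.asA(PresentationVideo.class);
        System.out.println(pv.displayTitle);  // Up (96 min)
    }
}
```

The back-end Video never learns how the front end wants it displayed; that knowledge lives in the registered adapter.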
Lesson 3 – Degrade gracefully all the way down the stack
Our front-end team does a great job of degrading gracefully; in the middle tier, however, we do not. If a service is slow or unavailable, requests time out or fail. eBay presented a lot of great ideas for managing failure and argued that, given the right frameworks, developers don't even have to hand-code the failure recovery.
Failure recovery can mean many things: pages with components that simply do not render, components that render with stale cached data, or components that render with default data. The worst thing to do is just e.printStackTrace(), something that happens a lot more often than one may care to admit (OK, fine, sometimes people use throw new RuntimeException(e)!).
By degrading gracefully when a lower-tier service times out, the response time seen by the caller becomes bounded. Rather than hanging the calls above it, you essentially guarantee that the slowest response time is the timeout itself.
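The timeout-plus-fallback pattern above can be sketched like this. The names (GracefulCall, callWithFallback) and the specific timeouts are my own illustration, not a particular framework's API; the point is that the caller's worst case is the timeout, never the backend's hang.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Sketch of a degrade-gracefully wrapper: the call either returns within
// the timeout or we fall back to default (or stale cached) data.
public class GracefulCall {
    private static final ExecutorService pool = Executors.newCachedThreadPool();

    static <T> T callWithFallback(Supplier<T> call, T fallback, long timeoutMillis) {
        Future<T> future = pool.submit(call::get);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException | InterruptedException e) {
            future.cancel(true);  // stop waiting on the slow or failed service
            return fallback;      // render the component with default data instead
        }
    }

    public static void main(String[] args) {
        // A simulated slow backend: the caller's worst case is the 100 ms
        // timeout, not the 5 s hang.
        String result = callWithFallback(() -> {
            try { Thread.sleep(5000); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "live recommendations";
        }, "popular items (default)", 100);
        System.out.println(result);  // prints "popular items (default)"
        pool.shutdownNow();
    }
}
```

A page built from components wrapped this way renders something for every component, even when a backing service is down.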
In our messaging system we already have exponential backoff coded into our transport. However, this is an area where our own application code is not as strong.
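For reference, the core of exponential backoff is small. This is a generic sketch, not our transport's actual code; the base delay, cap, and retry count are illustrative values.

```java
import java.util.function.BooleanSupplier;

// Minimal exponential backoff sketch for a retry loop: the delay doubles
// on each failed attempt, up to a cap, so a struggling service is not
// hammered with immediate retries.
public class Backoff {
    static long delayForAttempt(int attempt, long baseMillis, long capMillis) {
        // base, 2*base, 4*base, ... capped at capMillis.
        long delay = baseMillis << Math.min(attempt, 20);  // clamp shift to avoid overflow
        return Math.min(delay, capMillis);
    }

    static boolean retryWithBackoff(BooleanSupplier operation, int maxAttempts)
            throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (operation.getAsBoolean()) return true;  // succeeded
            Thread.sleep(delayForAttempt(attempt, 100, 10_000));
        }
        return false;  // gave up after maxAttempts
    }

    public static void main(String[] args) {
        for (int i = 0; i < 8; i++) {
            System.out.println("attempt " + i + " -> wait "
                    + delayForAttempt(i, 100, 10_000) + " ms");
        }
    }
}
```

Production implementations usually also add random jitter to the delay so that many failed clients do not all retry in lockstep.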
Lesson 4 – Use tracing not profiling to detect performance problems
When we run into performance issues we tend to hook a profiler up to a VM in production and watch call times. Instead of doing this, Shopzilla, eBay, Netflix, and others all use frameworks that enable tracing for a percentage of traffic. By tracing all calls for a subset of traffic, you can diagnose performance issues just as you would with a profiler and narrow them down to a small set of calls. You also gain a monitoring tool that helps you notice errors before they get bad: by watching the traces you can see deviations from the norm and work to address them.
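A rough sketch of percentage-based trace sampling follows. TraceSampler and its methods are assumed names for illustration; real tracing frameworks also propagate the sampling decision downstream with the request so an entire call chain is traced consistently.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Sketch of sampling-based tracing: instead of attaching a profiler,
// trace a small fraction of requests end to end, always on.
public class TraceSampler {
    private final double sampleRate;  // fraction of requests to trace, e.g. 0.01

    TraceSampler(double sampleRate) { this.sampleRate = sampleRate; }

    // Decide once at the edge of the system; the decision travels with the
    // request so every downstream call is traced (or not) consistently.
    boolean shouldTrace() {
        return ThreadLocalRandom.current().nextDouble() < sampleRate;
    }

    <T> T timed(String spanName, boolean traced, Supplier<T> work) {
        if (!traced) return work.get();  // untraced requests pay almost nothing
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            long micros = (System.nanoTime() - start) / 1_000;
            // A real system would report this span to a collector for
            // aggregation and alerting, not print it.
            System.out.println(spanName + " took " + micros + " us");
        }
    }

    public static void main(String[] args) {
        TraceSampler sampler = new TraceSampler(0.01);  // trace 1% of traffic
        boolean traced = sampler.shouldTrace();         // decided once per request
        String page = sampler.timed("renderHome", traced, () -> "home page");
        System.out.println(page);
    }
}
```

Because the sampled traces flow continuously, the same data that diagnoses a slowdown also serves as the baseline you watch for deviations.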
Lesson 5 – Caching is your friend
This is an old lesson; in the past, however, we abused caching to hide underlying performance issues. Having been bitten by that laziness, we chose to focus on raw performance above all and not use any caching. It is time to revisit this approach and recognize that we can use caching for scale. Our current system is fast, but at the end of the day, scaling it as-is will be much more expensive than scaling with caching.
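The scale argument boils down to shedding repeated backend reads. Here is a minimal sketch of a TTL cache in front of a service call; the class name, the TTL, and the lack of eviction are all simplifying assumptions for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of caching for scale: a TTL cache in front of a service call,
// so repeated reads within the TTL never reach the backend.
public class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAt;
        Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<K, Entry<V>> store = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public V get(K key, Supplier<V> loader) {
        long now = System.currentTimeMillis();
        Entry<V> entry = store.get(key);
        if (entry != null && now < entry.expiresAt) {
            return entry.value;  // cache hit: no backend call at all
        }
        V value = loader.get();  // miss or expired: hit the backend once
        store.put(key, new Entry<>(value, now + ttlMillis));
        return value;
    }

    public static void main(String[] args) {
        TtlCache<String, String> cache = new TtlCache<>(60_000);
        // First call loads from the "service"; the second is served from memory.
        System.out.println(cache.get("product:42", () -> "loaded from service"));
        System.out.println(cache.get("product:42", () -> "should not be called"));
    }
}
```

A production cache would add size-bounded eviction (and likely a distributed tier such as memcached), but the scaling effect is the same: N reads per TTL window become one backend call.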
Next up…Pretotyping Rocks!