What I learned about Infrastructure

I learned a lot from Andrew Shaft's talk on building a private cloud. My biggest takeaway, perhaps not Andrew's main point, perhaps not even a point he was making, is this: clouds are not about virtualization. Virtualization is one implementation of a cloud infrastructure; one could just as easily build a cloud infrastructure with physical boxes.

A cloud infrastructure is more about automation and the ability to grow and manage one's infrastructure without physically logging into machines to manage them. Clouds are about tooling, not about a specific tool for managing services. Machines are basically worthless; services provide all the value, so focus on the service, not the hardware or software. Scott McNealy made a similar point years ago in his keynote at JavaOne, back in 1999 (or 2000?), the JavaOne where we all bought real cheap Palm Vs! He pointed out that hardware by itself and software by itself have no value. However, when you take a Sun box + Sendmail, you get a service that has real value to people. OK, so clouds == deployment and management of services via tools.

Everyone who talked about "the cloud" stated categorically that unless you treat your infrastructure the way you treat your code, there is no way you are ready for the move. Obviously, this assumes that your code is managed in source control, that you have plenty of automated tests, and that you have a repeatable build and release process. What we struggle with here at Edmunds is the configuration and deployment aspects. We have gotten a lot better at configuring our software; however, we have a long way to go with deploying our applications and configuring our systems and environments in a repeatable manner. Too often we find ourselves SSH'ing into a box to check or change a configuration. Once in the cloud, you won't have enough time in the day to check all the boxes. For that matter, once we have additional data centers we won't be able to do this either.

Puppet was the tool of the day; everyone talked about Puppet. Regardless of the tool, however, the goal is to have your infrastructure be self-describing. With Puppet this means the Puppet DSL. The interesting point for me was that by using a tool that is close to the tooling your developers are using, the dev and ops teams start speaking the same language and working through similar processes. Just like the code, the configuration is checked into source control, versioned, labeled, built, tested, and deployed. There is even a testing harness for Puppet using Cucumber!
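As a rough illustration of what "self-describing" looks like (the module, package, and file names here are hypothetical, not our actual setup), a Puppet manifest declares the desired state of a service:

```puppet
# Hypothetical example: the desired state of an app server, checked into
# source control and applied by Puppet rather than by hand over SSH.
class myapp::server {
  package { 'myapp':
    ensure => '1.2.3',   # pinned version, just like a code dependency
  }

  file { '/etc/myapp/app.conf':
    ensure  => file,
    content => template('myapp/app.conf.erb'),
    require => Package['myapp'],
  }

  service { 'myapp':
    ensure    => running,
    enable    => true,
    subscribe => File['/etc/myapp/app.conf'],  # restart when config changes
  }
}
```

The point is less the syntax than the workflow: this file lives next to the code, gets reviewed, versioned, and tested like the code.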
Who writes the configuration? While the ops team traditionally manages all configuration, it would make sense to have developers write the configuration for their applications. After all, they wrote the application, so they should know how to configure and deploy it. The real answer is both: just like code, the configuration, now in a language common to both teams, is a shared responsibility. There is no single owner of our code base, and there would be no single owner of our configuration.
So we get it: configuration as code. We understand we can and should test our configurations and use all the standard change control systems we use for code. However, just as without a test there is no code, in infrastructure without a monitor there is no deployment. Monitoring is often a bunch of data exposed to the operations team, which then has to figure out how to chart and alert off that data. The whole time, the developer basically knows what a working system looks like and what a broken system looks like. Rather than exposing a ton of data, why not start with a simple "hey, I'm working" piece of data? The alternative is to just let ops ping your service…really?!
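To make the "hey, I'm working" idea concrete, here is a minimal sketch (all names here are mine, not from any talk): the developers who know what "working" means write small checks, and the service aggregates them into a single working/broken signal that ops can alert on directly.

```java
// Hypothetical sketch: a service self-reports one working/broken signal
// instead of dumping raw metrics on the ops team.
public class HealthCheck {

    public enum Status { WORKING, BROKEN }

    // Each developer-written check answers one question, e.g. "can I
    // reach my database?" or "is my queue draining?"
    public interface Check {
        Status check();
    }

    // Aggregate: the service is WORKING only if every check passes.
    public static Status overall(Check... checks) {
        for (Check c : checks) {
            if (c.check() == Status.BROKEN) {
                return Status.BROKEN;
            }
        }
        return Status.WORKING;
    }
}
```

A monitor then asks one question per service instead of interpreting a pile of raw data.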
The final two lessons I learned (the first is really a recap, but it is important):
  1. If you don't have automated configuration management you can't build a cloud; once you have it, you will probably realize it is cheaper to use a provider!
  2. People and culture are usually the biggest barriers to approaching infrastructure as code! (No surprise there; it's the same as with TDD!)
Next up…Continuous Deployments

What I learned about Pretotyping

Patrick Copeland from Google gave a fascinating talk about a concept called Pretotyping (or Pretend-o-typing). His talk dovetailed nicely with the conversations we've had at work about design thinking and illustrated in a powerful way how to use design thinking to push you towards making the right "it".

For me the biggest takeaway was that innovators beat ideas. Everyone talks about ideas, but ideas in and of themselves are worthless. We all have ideas and, frankly, can generate them a dozen a minute (or faster!). It is not really as much about the idea as about the person behind the idea, the innovator. Innovators are a much rarer commodity. Innovators do not need to come up with a new idea, but what they do is more important. They test, refine, and reject ideas until they figure out what works, what the "it" that needs to be built is. As the pretotyping.org site says:

Because the number of ideas is practically infinite while the number of innovators is very finite, the innovators to ideas ratio is – for all practical purposes – close to zero. That makes innovators incredibly valuable.

Thus, Mr. Copeland claims (rightly!) that "Innovators beat Ideas."

At Edmunds, we are about to embark on a Design Thinking approach to building products. With this approach, we will interview, ideate, paper prototype, iterate, etc. What hit me was a very simple idea: use the paper prototype over and over to test whether the idea flies. Don't just test it once on a user; pretend to use it, pretend-o-type!

As you use the pretend application, Mr. Copeland stressed tracking usage and really finding out how much the prototype is used and what the return rate is for users. Taking this a step further, even after a product is launched, one should measure new vs. repeat visitor rates. If repeat users are falling off, the idea is probably not a useful one. Here is where the courage comes in: if the idea is not working, don't agonize over it; kill it and move on.

"Kill it and move on" is a scary concept for most businesses. Even the greats at Google, I am sure, spent a lot of time agonizing over the failure of Google Wave. However, in the end, they did kill it and they are moving on. The lesson here was to make a flop metric, stick to it, and not be afraid to admit failure and move on!

A side note: one can over-test. Mr. Copeland alluded to Google testing 100 shades of blue, which is probably a bit of overkill.

Back to pretend-o-typing: how does one go about testing ideas quickly? Mr. Copeland showed a picture of the Android "prototyping" kit he handed out. It consisted of a pad of paper and a pen! Draw a quick UI for your idea and stick the paper on your Android phone. Track how much you use your own idea. If you use it a lot the first week, how much do you use it the second? The third? If you notice your usage drifting towards zero, you probably need to move on. Total investment to find out if your idea was good? Almost zilch. Compare that to a normal working prototype and the time and energy spent building it. With the investment in a normal prototype, one becomes very attached to one's idea.

The next step is to remember that ideas beat features. The original Gmail and Facebook had far fewer features than their competitors; however, they were better ideas and thus took off. If the Gmail or Facebook teams had spent the time to deliver what they have today, they would have missed the boat and probably would have built the wrong features. Your users will tell you what they want, but they can't without something in front of them.

So here is what I learned:

1) Innovators are priceless

2) Paper + Data == Awesome idea killer

3) Don't be afraid to kill your idea

4) Get your idea out there, test it, and refine it – don't try to make something perfect…you will fail.


Next up, Infrastructure.


What I learned about Software Design

Lesson 1 – Run your services as a remote process, separate from the service client.

The biggest thing I learned was that seemingly everyone has pulled their service implementations out of their applications and into separate processes. We have traditionally been nervous about doing this because we were concerned about the performance impact; local calls will always be faster. However, over the course of building our redesigned site, managing revisions of our service implementations has proven to be a huge challenge. Every time we fix a critical defect or add new functionality to our service implementations, we have to update every single web application that uses that service and redeploy it. We have almost twenty web applications and numerous other deployment artifacts, and for our core services an upgrade can take our RM group all day to roll through all the applications that need updating.

At this point, I would rather be a bit worried about performance than deal with the rigidity that our current system forces us into. Running remotely means each service client only needs the service interface as a dependency, plus some generic, shared client code for accessing the service.

Therefore, Lesson 1 is: run your services as a remote process, separate from the service client. An adjunct to Lesson 1 is that hardwired host and port numbers are a bad idea, especially in a cloud environment; use a naming service and a URI to locate services.
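A minimal sketch of the naming-service idea (the registry API and service names here are hypothetical; a real deployment would use DNS or a dedicated naming service): clients look up a logical name and get back a URI, so hosts and ports are never compiled into the client.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: clients resolve a logical service name to a URI
// at runtime rather than hardwiring host and port.
public class ServiceRegistry {

    private final Map<String, URI> services = new ConcurrentHashMap<>();

    // In a cloud environment this registration would happen automatically
    // as instances come up and down.
    public void register(String name, URI location) {
        services.put(name, location);
    }

    public URI locate(String name) {
        URI uri = services.get(name);
        if (uri == null) {
            throw new IllegalStateException("No instance registered for " + name);
        }
        return uri;
    }
}
```

When a service moves to a new box, only the registry entry changes; no client needs to be rebuilt or redeployed.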

One question still remains: if the service is remote, how much visibility should the developer have into that remoteness, or should it be hidden entirely? Opinions appear, as usual, to differ.

Lesson 2 – Layers and Abstraction are your friend

This is a lesson that we need to constantly relearn. Often we find ourselves running too fast, and we end up breaking through layers or crossing within layers to quickly make something work. At the end of the day we create a "Big Ball of Mud." No one wants to, but we have a hard time escaping from it. We need to constantly remind ourselves to look at our abstractions and ensure they are right and that we are not creating crossed dependencies.

An interesting concept that Netflix presented was having a type system that allows a separation between domain models for back-end and front-end concerns. Using an adaptable type model, one can ensure that each problem domain gets the object the way it needs it. For example, Adrian Cockcroft from Netflix used the example of adapting a video object for front-end presentation: video.asA(PresentationVideo). The Sling project from Day, available at Apache, had a similar idea in its Adaptable interface; however, it seems that Netflix takes a slightly different approach by creating a type manager to help manage the conversions.
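Neither talk showed internals, so this is only a speculative sketch of what a type manager behind an asA(...)-style call might look like: conversions are registered centrally, and each tier asks for a domain object as its own view.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Speculative sketch of a type manager: adapters from domain objects to
// tier-specific views are registered in one place. All names here are
// made up for illustration, not Netflix's or Sling's actual API.
public class TypeManager {

    private final Map<Class<?>, Function<Object, ?>> adapters = new HashMap<>();

    public <T> void registerAdapter(Class<T> target, Function<Object, T> adapter) {
        adapters.put(target, adapter);
    }

    // The moral equivalent of video.asA(PresentationVideo.class).
    public <T> T adapt(Object source, Class<T> target) {
        Function<Object, ?> adapter = adapters.get(target);
        if (adapter == null) {
            throw new IllegalArgumentException("No adapter for " + target);
        }
        return target.cast(adapter.apply(source));
    }
}
```

The appeal is that back-end and front-end models can evolve separately, with the conversions kept explicit and in one place.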

Lesson 3 – Degrade gracefully all the way down the stack

Our front-end team does a great job of degrading gracefully; however, in the middle tier we do not. If a service is slow or unavailable, requests time out or fail. eBay presented a lot of great ideas for managing failure and argued that, given the right frameworks, developers don't even have to hand-code the failure recovery.

Failure recovery can mean many things and could result in pages with components that simply do not render, components that render with stale cached data, or even components that render with default data. The worst thing to do is just e.printStackTrace(), something that happens a lot more often than one may care to admit (OK, fine, sometimes people use throw new RuntimeException(e)!).

By degrading gracefully when a lower-tier service times out, the calling service's response time becomes predictable. Rather than hanging the calls above it, you essentially guarantee that the slowest response time is the time to timeout.
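One way to sketch that guarantee (my own illustration, not eBay's framework): wrap the lower-tier call in a future with a hard timeout, and fall back to stale or default data when it expires, so the caller never waits longer than the timeout.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch: bound every lower-tier call with a timeout and a
// fallback instead of letting failures propagate up the stack.
public class GracefulCall {

    private static final ExecutorService POOL =
            Executors.newCachedThreadPool(r -> {
                Thread t = new Thread(r);
                t.setDaemon(true); // don't keep the JVM alive for hung calls
                return t;
            });

    public static <T> T callWithFallback(Callable<T> remoteCall,
                                         long timeoutMillis,
                                         Supplier<T> fallback) {
        Future<T> future = POOL.submit(remoteCall);
        try {
            // Worst case, the caller waits timeoutMillis, never longer.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            future.cancel(true);   // don't leak the hung call
            return fallback.get(); // stale cache, default data, or "don't render"
        }
    }
}
```

The fallback supplier is where the per-component choices from above live: stale cached data, default data, or a marker telling the page not to render that component.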

In our messaging system we already have an exponential backoff coded into our transport. However, this is an area we are not as good at in our own code.
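For reference, the exponential backoff idea is simple to state in code (a generic sketch, not our transport's actual implementation): double the retry delay after each failed attempt, capped at a maximum so retries never wait unreasonably long.

```java
// Generic sketch of exponential backoff: delay doubles per attempt,
// capped at maxMillis.
public class Backoff {

    public static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        // attempt 0 -> base, attempt 1 -> 2*base, attempt 2 -> 4*base, ...
        long delay = baseMillis << Math.min(attempt, 62);
        // Guard against shift overflow, then apply the cap.
        return Math.min(delay <= 0 ? maxMillis : delay, maxMillis);
    }
}
```

Production implementations usually also add random jitter so that many failed clients do not all retry at the same instant.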

Lesson 4 – Use tracing not profiling to detect performance problems

When we run into performance issues, we tend to hook a profiler up to a VM in production and watch call times. Instead of doing this, Shopzilla, eBay, Netflix, and others all use frameworks that enable tracing for a percentage of traffic. By tracing all calls for a subset of traffic, you can diagnose performance issues just as you would with a profiler and narrow issues down to a small set of calls. However, you also now have a monitoring tool to help notice errors before they get bad. By monitoring the traces you can see deviations from the norms and work to address them.
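The sampling decision itself is trivial; here is a sketch of the idea (the frameworks the speakers described are of course far more sophisticated, propagating trace context across service calls):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of sampled tracing: trace a fixed percentage of requests in
// production rather than attaching a profiler to a single VM.
public class TraceSampler {

    private final double sampleRate; // e.g. 0.01 traces 1% of traffic

    public TraceSampler(double sampleRate) {
        this.sampleRate = sampleRate;
    }

    // Called once per incoming request; when true, every downstream call
    // made for that request is recorded with its timing.
    public boolean shouldTrace() {
        return ThreadLocalRandom.current().nextDouble() < sampleRate;
    }
}
```

Because the rate is small, the overhead is negligible, so tracing can stay on permanently and double as a monitor for creeping latency.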

Lesson 5 – Caching is your friend

This is an old lesson; however, in the past we abused caching to hide underlying performance issues. Having been bitten by that laziness, we chose to focus on raw performance above all and not use any caching. It is time to revisit this approach and realize that we can use caching for scale. Our current system is fast, but at the end of the day, scaling it as-is will be much more expensive than scaling with caching.
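As a reminder of how cheap a basic cache can be, here is a bounded LRU cache built on LinkedHashMap (a generic in-process sketch, not a distributed grid like Coherence): the point is that caching for scale need not start as a big infrastructure project.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A small bounded LRU cache: the least-recently-used entry is evicted
// when the cache exceeds maxEntries.
public class LruCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true); // access-order: gets refresh an entry's recency
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

The discipline is using it to absorb load from a system that is already correct and fast, not to paper over an underlying performance problem.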

 Next up…Pretotyping Rocks!


QCon 2010 Lessons I learned

I just got back from the 2010 QCon SF conference. It was a very exciting week, filled with great talks and some not-so-great ones. As usually happens after a conference, I am filled with new ideas and excited about how we can apply what I heard to our work at Edmunds. Sometimes the ideas one comes away with work and sometimes they don't; regardless, one always learns something new, and trying new ideas, even those that don't work, teaches you valuable lessons.

Over the course of the next week I'll be covering "What I learned at QCon" focusing on the following:

  • Software Design
  • Pretotyping Rocks
  • Infrastructure
  • Continuous Deployments

Each post will cover several speakers as well as discuss thoughts on how to apply what I heard to Edmunds.


Another Coherence Video

We just posted another "Paddy Does Coherence" video on YouTube. Check it out:

Also, Karim Qazi, one of our engineering directors, wrote a great post on data versioning and Coherence. You can find it on the technology.edmunds.com blog: http://technology.edmunds.com/blog/2010/10/keeping-data-backward-compatible-with-coherence-pof.html


Coherence video

We just posted a short video describing Coherence that was shot a while back for our technology teams.



Data Services Topology

We are currently using Oracle's Coherence product as the primary data store behind our website. Coherence provides a scalable, fast data grid that allows developers to easily access data and allows our data to be versioned so that structures can morph over time.

To date, we have deployed our grid following the same pattern we used for our relational databases, creating a separate grid for each integration environment. As integration environments grow, we need to create more and more grids. Our ultimate goal for environments is to have a virtual stack that development teams can spin up and down as needed; given our current deployment architecture, this would greatly multiply the number of grids we need.

A potential solution is to treat our data services more like a shared service that can be used by anyone. In this model there would be one production data grid that provides tested, approved data services to development, QA, and production environments.

The upside is that data consistency would be easy to maintain, and deployment and management would be simpler. The data services would roll through their own deployment model and have internal integration environments that the services team could use for testing prior to releasing a new version of a service. Service upgrades would be independent of all other code (in many cases they already are) and would be instantly available to all consumers. The instant availability would also force the services team to focus on backwards compatibility.

There are also many downsides, most notably that non-production applications could negatively impact our production website. Such impacts could be mitigated by ensuring that applications only use production-released data services clients (our services all have client access libraries); however, that may not be enough to prevent a rogue process from impacting end-user performance and, thus, revenue.

Perhaps there is a hybrid approach based on SLAs, such as an internal-facing production grid and an external-facing production grid. Regardless of the eventual topology, our move towards virtualization has an enormous impact on "shared" resources, and I believe that as technologists we need to keep the big picture in view so that small decisions, seemingly disconnected from one another, do not lead us towards a future in which we trap ourselves in a corner. Every day we make small decisions that, made without regard for the bigger picture, lead us towards an architecture that is not thought through and can have long-term negative impacts. As a technology manager/leader, it is my job to ensure that the big-picture questions are at least asked.
