Unrelated to Software…

My wife and I have been working on a new project for the last year called Machines for Freedom. Our goal is to create a cycling brand that provides the best technical apparel on the market for women. Check us out at MachinesForFreedom.com.


More Oozie Learnings

I just finished moving the code from my last post from our dev and qa boxes to production. While the process was running in QA I should have tested it as a cron-style job, but did not. Running it that way has taught me a lot about how Oozie handles time. I learned three key lessons:

  1. All times entered in Oozie are interpreted as GMT, regardless of the time zone specified in the configuration file.
  2. ${coord:current(0)} was resolving to the prior day because of when the job ran relative to GMT. Once I understood the GMT behavior I had to switch to ${coord:current(-1)}.
  3. Manipulating the initial-instance attribute on the datasets allowed me to backfill prior days. To give our DWH team more test data, I set it to a week before the first actual run I wanted to do (see the sketch below).
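
For illustration, a minimal coordinator fragment along these lines shows both adjustments: the initial-instance pushed back a week so that earlier daily instances can be materialized for backfill, and the input instance selected with ${coord:current(-1)} now that the GMT behavior is understood. The dates shown are hypothetical, not the actual production values.

<datasets>
    <dataset name="logs" frequency="${coord:days(1)}"
             initial-instance="2011-06-13T00:00Z" timezone="America/Los_Angeles">
        <uri-template>${nameNode}/user/oozie/log_archive/prod/${YEAR}/${MONTH}/${DAY}</uri-template>
        <done-flag></done-flag>
    </dataset>
</datasets>
<input-events>
    <data-in name="input" dataset="logs">
        <!-- previous day's logs, since all coordinator times are treated as GMT -->
        <instance>${coord:current(-1)}</instance>
    </data-in>
</input-events>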

Tonight I will have another test run; fingers crossed, everything should work well.


Using Oozie to Process Daily Logs

At Edmunds we are working to move our existing data warehouse system to a new system based on Hadoop and Netezza. Our data warehouse team chose ad impression data from DoubleClick DART as the first production deliverable. Last week we achieved a major milestone: DART data is now being delivered on a nightly basis through our new Hadoop/Netezza infrastructure. Once in Hadoop, the data is scrubbed and dumped into files that mirror our Netezza table structure. To handle the DART processing, we wrote a fair amount of code to deal with daily log rotation and things of that nature.

Toward the end of the project we started using Oozie to coordinate our processing workflows. We chose Oozie because we wanted a system that would let us schedule jobs, coordinate workflows, and give us better visibility into what is running when.

Recently I took on the task of processing some of the logs we generate internally. These internal logs are rotated daily. Given the functionality provided by Oozie, we thought it would be great to remove our code that handles log rotation, file names, and date calculations and let Oozie do the work. As powerful as Oozie is for date based processing, getting it to work was another story.

This post describes the configuration I used to get our jobs running using Oozie. I will also go into some of the mistakes I made so that others can save time and effort using Oozie.

Background

My goal was to process files that are delivered to Hadoop every night. The files are delivered to HDFS in a set of directories organized by date and server name, and all of the files have the same name. The directory format is shown below:

/log_archive/prod/2011/06/06/prodserver/app_server_events.log
/log_archive/prod/2011/06/06/prodserver2/app_server_events.log

I have a simple mapper (no reducer) that takes the input files and converts them to a format that our data loaders can handle. Based on the directory format, I set the mapper configuration’s mapred.input.dir to:

/log_archive/prod/2011/06/06/*/app_server_events.log

I wanted to maintain the date based directory structure for the output files, but since I was dealing with files from all the servers I set the mapred.output.dir to:

/log_archive/prod/2011/06/06/app_server_events_processed.log

To complete the project, after all processing is done, I need to deliver the processed file to an NFS mounted filesystem.
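
For reference, a map-only job of this shape might look roughly like the sketch below. The class layout and the transformation it performs are hypothetical stand-ins, not the actual ServerEventMapJob code.

package com.edmunds.dwh.processor.serverevent;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/**
 * Hypothetical map-only job: reads raw server event lines and re-emits them in a
 * loader-friendly key/value format. It mirrors the shape of the mapper configured
 * later in the workflow, not its actual logic.
 */
public class ServerEventMapJob {

    public static class ServerEventMapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        @Override
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            // Illustrative transformation only: split the raw log line on the first tab
            // and emit the first field as the key and the remainder as the value.
            String[] fields = line.toString().split("\t", 2);
            if (fields.length == 2) {
                output.collect(new Text(fields[0]), new Text(fields[1]));
            } else {
                reporter.incrCounter("ServerEvents", "malformed", 1L);
            }
        }
    }
}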

Using hard coded values

My first stab at getting the above process working was to just hard code everything. I figured that once I had it working I could back into a more dynamic system with calculated file names based on run dates. In addition, to keep things simple, I didn’t worry about how I was going to get the data out of Hadoop and onto the mount where my user needed it; I just left the processed files in HDFS.

In order to allow Oozie to run my process, I had to get the mapper jar and its dependencies into Hadoop. To do this I created a directory in the Oozie user’s home directory. The directory structure is shown in listing 1 below.

+-/user/oozie/server-events-workflow/
  +-coordinator.xml
  +-workflow.xml
  +-lib/
    +-*.jar

Listing 1 – Workflow job directory structure

Since I don’t want to run things from the Oozie user’s home directory, prior to moving to production I will move the directory structure to something more generic, like /apps/workflow/WORKFLOW-NAME. For now, the structure in listing 1 works. I could move the coordinator outside of the workflow, but this example does not have to coordinate multiple workflows, so I chose to place the coordinator and workflow files together. To package up my code and dependencies I relied on Maven and the Maven Assembly plugin’s fileSet and dependencySet directives to assemble the Oozie artifacts into a directory and a zip file for easy deployment.
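
As a rough illustration, an assembly descriptor along the following lines would collect the coordinator and workflow files plus the runtime dependencies into the layout of listing 1. The source directory and descriptor id are assumptions, not the actual descriptor from our build.

<assembly>
    <id>oozie-workflow</id>
    <formats>
        <format>zip</format>
        <format>dir</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <fileSets>
        <!-- coordinator.xml and workflow.xml kept under a hypothetical src/main/oozie -->
        <fileSet>
            <directory>src/main/oozie</directory>
            <outputDirectory>/</outputDirectory>
            <includes>
                <include>*.xml</include>
            </includes>
        </fileSet>
    </fileSets>
    <dependencySets>
        <!-- the mapper jar and its runtime dependencies land in lib/ -->
        <dependencySet>
            <outputDirectory>lib</outputDirectory>
            <scope>runtime</scope>
        </dependencySet>
    </dependencySets>
</assembly>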

The coordinator.xml and workflow.xml files have almost everything hardcoded. I knew that in the future I would want more dynamic, date based file names, so I chose to move the variable definitions for file and directory names into the coordinator.xml file. In addition, I moved some standard location definitions, such as the workflow path, name node, start date, and end date, into a separate property file that is passed into the Oozie job at run time. Putting the configuration into an external text file allowed me to move from my local environment to our integration and production servers easily.

Listing 2 is a sample of the command line script that is used to execute the workflow. Note the use of the -config flag to pass in the properties file to configure default Oozie actions.

#!/bin/bash

#############################################
# Run the server events workflow job once   #
#############################################

abspath=`dirname "$(cd "${0%/*}" 2>/dev/null; echo "$PWD"/"${0##*/}")"`
echo $abspath

whoami=`whoami`

if [ "x${whoami}" != "xoozie" ] ; then
	echo "ERROR: This script should only be run by the oozie user"
	exit 127
fi

oozie job -oozie http://localhost:11000/oozie -config $abspath/coordinator-manual.properties -run

Listing 2 – Shell script to run an Oozie job

The properties file defines the common values that are passed as variables to coordinator.xml. To help with testing I created two properties files: one sets the start and end dates so that I can run the Oozie job manually, and the other allows me to run the job on a daily basis.

oozie.coord.application.path=hdfs://namenode:8020/user/oozie/apps/server-events-workflow
workflowApplicationPath=hdfs://namenode:8020/user/oozie/apps/server-events-workflow
jobTracker=jobtracker:8021
nameNode=hdfs://namenode:8020
maxErrors=1000
startDate=2011-06-01T01:00Z
endDate=2011-06-01T01:00Z

Listing 3 – Sample of the manual-run properties file
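
The daily, cron style properties file would differ mainly in its date range; a rough sketch (these dates are illustrative only, not the real values) pushes the end date far enough into the future that the coordinator keeps materializing one new action per day:

startDate=2011-06-14T01:00Z
endDate=2011-12-31T01:00Z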

The coordinator.xml is passed the start and end dates as well as the name node and other configuration items. The coordinator then sets the input and output parameters that are passed into the workflow.xml definition.

<coordinator-app name="server-event-processor" frequency="${coord:days(1)}" start="${startDate}" end="${endDate}" timezone="America/Los_Angeles" xmlns="uri:oozie:coordinator:0.1">
    <action>
        <workflow>
            <app-path>${workflowApplicationPath}</app-path>
            <configuration>
                <property>
                    <name>jobTracker</name>
                    <value>${jobTracker}</value>
                </property>
                <property>
                    <name>nameNode</name>
                    <value>${nameNode}</value>
                </property>
                <property>
                    <name>queueName</name>
                    <value>default</value>
                </property>
                <property>
                    <name>maxErrors</name>
                    <value>${maxErrors}</value>
                </property>
                <property>
                    <name>inputs</name>
                    <value>${nameNode}/user/oozie/log_archive/prod/2011/06/06/*/server_events.log</value>
                </property>
                <property>
                    <name>outputs</name>
                    <value>${nameNode}/user/oozie/log_archive/prod/2011/06/06/server_events_processed.log</value>
                </property>
                <property>
                    <name>oozie.wf.workflow.notification.url</name>
                    <value>http://eventpublisher:8080/hadoop-event-publisher-webapp/service/oozieNotification/updateWorkflow?jobId=$jobId&amp;status=$status</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>

Listing 4 – Hard coded coordinator.xml

Since the process is just a simple file conversion and does not yet include the local file move, the workflow file in listing 5 has a single action defined. The action specifies only a mapper, no reducer, and sets the number of reduce tasks to zero. The mapred queue name is set to the queueName property supplied by the coordinator.xml file, and the mapred.input.dir and mapred.output.dir properties are set to the inputs and outputs properties defined there as well. Note that I use the prepare directive to delete any output from previous runs, which makes debugging and testing easier.

<workflow-app xmlns="uri:oozie:workflow:0.1" name="server-events-informatica-wf">
    <start to="ServerInformaticaProcessor"/>
    <action name="ServerInformaticaProcessor">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${outputs}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>com.edmunds.dwh.processor.serverevent.ServerEventMapJob$ServerEventMapClass</value>
                </property>
                <property>
                    <name>mapred.speculative.execution</name>
                    <value>false</value>
                </property>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.input.format.class</name>
                    <value>org.apache.hadoop.mapred.TextInputFormat</value>
                </property>
                <property>
                    <name>mapred.output.format.class</name>
                    <value>org.apache.hadoop.mapred.TextOutputFormat</value>
                </property>
                <property>
                    <name>mapred.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapred.output.value.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapred.reduce.tasks</name>
                    <value>0</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputs}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputs}</value>
                </property>
                <property>
                    <name>informaticaSupport</name>
                    <value>true</value>
                </property>
                <property>
                    <name>validate</name>
                    <value>false</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail-server-informatica"/>
    </action>
    <kill name="fail-server-informatica">
        <message>Informatica Processor failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Listing 5 – workflow.xml file with properties supplied by coordinator.xml and cleanup prior to execution of workflow

At this point I had a coordinator and workflow definition that would execute my mapper with a single hard coded file name. I ran into a problem when I first tried to run everything: my coordinator ran and exited with a status of SUCCEEDED, yet my workflow was never executed. For the longest time I could not figure out why. While looking at my coordinator properties file, I noticed the start and end date values shown in listing 6 below.

oozie.coord.application.path=hdfs://namenode:8020/user/oozie/apps/server-events-workflow
workflowApplicationPath=hdfs://namenode:8020/user/oozie/apps/server-events-workflow
jobTracker=jobtracker:8021
nameNode=hdfs://namenode:8020
maxErrors=1000
startDate=2011-06-01T01:00Z
endDate=2011-06-01T00:00Z

Listing 6 – Properties file

Since I was trying to run the workflow manually, I had just copied a start and end date from some other file. My blind copying created a problem: Oozie diligently noticed that the end date was before the start date and simply exited without materializing any actions. Once I updated the end date to be after the start date, everything worked fine.

I now had two problems remaining: 1) I needed to get the output from HDFS to a specific mount point and 2) I needed to run this job daily and update the path for inputs and outputs accordingly. As the file move seemed the easier of the two problems, I decided to start with getting data out of HDFS onto a local file system.

Moving files from HDFS to local file system

I spent a bunch of time poring through the Oozie documentation to see if there was a built in function to move mapred output from HDFS to a local file system. While Oozie does have an Fs action, it only works with HDFS. That restriction meant I had to figure out my own way of doing the file copy.

Oozie has support for Java actions. A Java action is run as a single mapper that executes the class’s main() method, and it supports passing in arguments and setting Java options. This gave me a hook through which I could create a simple Java class to do the move for me. The next thing I needed to figure out was how to take the multiple file outputs from my mapper and copy them to a single file on a non-HDFS filesystem.

I knew there was a method, moveToLocalFile(), on the Hadoop FileSystem class. However, the method only appears to work for a single file. I searched through the Hadoop JavaDoc and found a FileUtil class that has a static method called copyMerge(). The method, copyMerge(), takes a source filesystem, source path, destination file system, destination path, and a flag to delete the source path on success. The copyMerge() method sounded like something that would work for me. So, I created a simple class that is constructed with a source and destination file system and contains a single method: move(). The move() method takes a source and destination path. If the source path is determined to be a directory it calls copyMerge() with the delete flag set to true. If the source path is a simple file the move() method calls the moveToLocalFile() method on the source file system object.

The Oozie java action expects a main() method. So, I created a simple main() method that creates a configuration and two file systems, then calls move() on the paths built from the passed-in file names. The code is shown in listing 7 below. Granted, it could use some nicer error handling and logging, but for now it is a good start.

package com.edmunds.dwh.processor;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

/**
 * This class moves a file or directory from HDFS to a local file or directory.
 * Copyright (C) 2011 Edmunds.com
 * <p/>
 * Date: 6/9/11:10:27 AM
 *
 * @author phannon
 */
public final class HdfsToLocalFileMove {
    private final FileSystem sourceFileSystem;
    private final FileSystem destFileSystem;
    private final Configuration conf;

    public HdfsToLocalFileMove(Configuration conf, FileSystem sourceFileSystem, FileSystem destFileSystem) throws IOException {
        this.sourceFileSystem = sourceFileSystem;
        this.destFileSystem = destFileSystem;
        this.conf = conf;
    }

    public void move(Path source, Path destination) throws IOException {
        // Remove any previous output at the destination so reruns are idempotent.
        if (destFileSystem.exists(destination)) {
            destFileSystem.delete(destination, true);
        }
        if (sourceFileSystem.getFileStatus(source).isDir()) {
            // Directories (e.g. mapper output with multiple part files) are merged into
            // a single destination file; the source is deleted on success.
            FileUtil.copyMerge(sourceFileSystem, source, destFileSystem, destination, true, conf, null);
        } else {
            // Single files are simply moved to the local file system.
            sourceFileSystem.moveToLocalFile(source, destination);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the HDFS source path, args[1] the local destination path.
        Path source = new Path(args[0]);
        Path dest = new Path(args[1]);
        Configuration conf = new Configuration();
        FileSystem fsSource = source.getFileSystem(conf);
        FileSystem fsDest = FileSystem.getLocal(conf);
        HdfsToLocalFileMove move = new HdfsToLocalFileMove(conf, fsSource, fsDest);
        move.move(source, dest);
    }
}

Listing 7 – HDFS File Move Class

Now that I had a Java class that could be called by the Oozie java action, I had to wire the action into the workflow (I’ll skip covering the rebuild and re-copy of the libraries and configuration files into HDFS).

I wanted to make sure that when I moved away from hard coded paths my workflow.xml file would not need to change, and that whenever I ran the job I could pass in the local directory where the output should be stored. To get that flexibility I added the directory name to my properties file and built the output path in the coordinator.xml file. The changes to each file are outlined in listings 8 and 9 below.

localDir=file:///misc/phannon

Listing 8 – New property in the property file.

Note that the path uses normal URI syntax. The file:// scheme is important; leave it off and your file will end up in HDFS.

<coordinator-app name="server-event-processor" frequency="${coord:days(1)}" start="${startDate}" end="${endDate}"
                 timezone="America/Los_Angeles"
                 xmlns="uri:oozie:coordinator:0.1">
    <action>
        <workflow>
            <app-path>${workflowApplicationPath}</app-path>
            <configuration>
            ...SNIP...
                <property>
                    <name>localDest</name>
                    <value>${localDir}/server_events_processed_2011_06_06.log</value>
                </property>
            ...SNIP...
            </configuration>
        </workflow>
    </action>
</coordinator-app>

Listing 9 – Exposing new property for the workflow

The additional property in listing 9 allows my workflow.xml file to access a parameter called ${localDest}, which is used to configure the Java file move action. The new workflow is shown below in listing 10. In order for the file move action to be called, I updated the ok transition on the file processor action to point to the move-to-local action, so the workflow continues on to the file move when the file processor action succeeds.

<workflow-app xmlns="uri:oozie:workflow:0.1" name="server-events-informatica-wf">
    <start to="ServerInformaticaProcessor"/>

    <action name="ServerInformaticaProcessor">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${outputs}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>com.edmunds.dwh.processor.serverevent.ServerEventMapJob$ServerEventMapClass</value>
                </property>
                <property>
                    <name>mapred.speculative.execution</name>
                    <value>false</value>
                </property>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.input.format.class</name>
                    <value>org.apache.hadoop.mapred.TextInputFormat</value>
                </property>
                <property>
                    <name>mapred.output.format.class</name>
                    <value>org.apache.hadoop.mapred.TextOutputFormat</value>
                </property>
                <property>
                    <name>mapred.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapred.output.value.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapred.reduce.tasks</name>
                    <value>0</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputs}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputs}</value>
                </property>
                <property>
                    <name>informaticaSupport</name>
                    <value>true</value>
                </property>
                <property>
                    <name>validate</name>
                    <value>false</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="move-to-local"/>
        <error to="fail-server-informatica"/>
    </action>
    <action name="move-to-local">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <main-class>com.edmunds.dwh.processor.HdfsToLocalFileMove</main-class>
            <arg>${outputs}</arg>
            <arg>${localDest}</arg>
            <capture-output/>
        </java>
        <ok to="end"/>
        <error to="fail-move-to-local"/>
    </action>
    <kill name="fail-server-informatica">
        <message>Informatica Processor failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <kill name="fail-move-to-local">
        <message>Copy to local failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Listing 10 – workflow.xml with file move action added

At this point I had a hard coded coordinator and workflow configuration that translates a file from one format to another and moves the output from HDFS to a local file system. This left me with one final problem: I needed to run the job daily and update the input and output paths accordingly. Luckily, Oozie provides all the functionality needed to solve that problem.

Moving to dynamically calculated file names

The input files I receive are sorted into directories based on year, month, and day. I wanted the output files to be stored with year, month, and day appended to the file name. In order to get Oozie to generate the file names, I first had to use dataset tags to specify the file name format and then use input and output events to specify the particular instances of the dataset files I needed to read or generate. I had a bit of a hard time with this part; Oozie does provide documentation, but how it all worked didn’t really sink in until I had played with the settings for a while.

The dataset specification was easy enough. I simply added the lines in listing 11 to the top of my coordinator.xml file.

   <datasets>
        <dataset name="logs" frequency="${coord:days(1)}"
                 initial-instance="2011-01-01T24:00Z" timezone="America/Los_Angeles">
            <uri-template>${nameNode}/user/oozie/log_archive/prod/${YEAR}/${MONTH}/${DAY}</uri-template>
            <done-flag></done-flag>
        </dataset>
        <dataset name="processed" frequency="${coord:days(1)}"
                 initial-instance="2011-01-01T24:00Z" timezone="America/Los_Angeles">
            <uri-template>${nameNode}/user/oozie/log_archive/prod/${YEAR}/${MONTH}/${DAY}/server_events_processed.log</uri-template>
        </dataset>
        <dataset name="localDir" frequency="${coord:days(1)}"
                 initial-instance="2011-01-01T24:00Z" timezone="America/Los_Angeles">
            <uri-template>${localDir}/${environment}/${YEAR}/${MONTH}/${DAY}/server_events_processed.log</uri-template>
        </dataset>
    </datasets>

Listing 11 – Data set definitions for input and output uri templates

Then I created the input and output event blocks right underneath the dataset declarations.

    <input-events>
        <data-in name="input" dataset="logs">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <output-events>
        <data-out name="output" dataset="processed">
            <instance>${coord:current(0)}</instance>
        </data-out>
        <data-out name="localOutput" dataset="localDir">
            <instance>${coord:current(0)}</instance>
        </data-out>
    </output-events>

Listing 12 – Event block definitions

To pass these to the workflow.xml I then updated the properties for the workflow in the coordinator.xml.

                <property>
                    <name>inputs</name>
                    <value>${coord:dataIn('input')}/*/server_events.log</value>
                </property>
                <property>
                    <name>outputs</name>
                    <value>${coord:dataOut('output')}</value>
                </property>
                <property>
                    <name>localDest</name>
                    <value>${coord:dataOut('localOutput')}</value>
                </property>

Listing 13 – Updated properties for calculated file names

The complete xml file is shown in listing 14 below.

<coordinator-app name="server-event-processor" frequency="${coord:days(1)}" start="${startDate}" end="${endDate}"
                 timezone="America/Los_Angeles"
                 xmlns="uri:oozie:coordinator:0.1">
    <datasets>
        <dataset name="logs" frequency="${coord:days(1)}"
                 initial-instance="2011-01-01T24:00Z" timezone="America/Los_Angeles">
            <uri-template>${nameNode}/user/oozie/log_archive/prod/${YEAR}/${MONTH}/${DAY}</uri-template>
            <done-flag></done-flag>
        </dataset>
        <dataset name="processed" frequency="${coord:days(1)}"
                 initial-instance="2011-01-01T24:00Z" timezone="America/Los_Angeles">
            <uri-template>${nameNode}/user/oozie/log_archive/prod/${YEAR}/${MONTH}/${DAY}/server_events_processed.log</uri-template>
        </dataset>
        <dataset name="localDir" frequency="${coord:days(1)}"
                 initial-instance="2011-01-01T24:00Z" timezone="America/Los_Angeles">
            <uri-template>${localDir}/${environment}/${YEAR}/${MONTH}/${DAY}/server_events_processed.log</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="logs">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <output-events>
        <data-out name="output" dataset="processed">
            <instance>${coord:current(0)}</instance>
        </data-out>
        <data-out name="localOutput" dataset="localDir">
            <instance>${coord:current(0)}</instance>
        </data-out>
    </output-events>
    <action>
        <workflow>
            <app-path>${workflowApplicationPath}</app-path>
            <configuration>
                <property>
                    <name>jobTracker</name>
                    <value>${jobTracker}</value>
                </property>
                <property>
                    <name>nameNode</name>
                    <value>${nameNode}</value>
                </property>
                <property>
                    <name>queueName</name>
                    <value>default</value>
                </property>
                <property>
                    <name>maxErrors</name>
                    <value>${maxErrors}</value>
                </property>
                <property>
                    <name>inputs</name>
                    <value>${coord:dataIn('input')}/*/server_events.log</value>
                </property>
                <property>
                    <name>outputs</name>
                    <value>${coord:dataOut('output')}</value>
                </property>
                <property>
                    <name>localDest</name>
                    <value>${coord:dataOut('localOutput')}</value>
                </property>
                <property>
                    <name>oozie.wf.workflow.notification.url</name>
                    <value>http://eventpublisher:8080/hadoop-event-publisher-webapp/service/oozieNotification/updateWorkflow?jobId=$jobId&amp;status=$status</value>
                </property>

            </configuration>
        </workflow>
    </action>
</coordinator-app>

Listing 14 – Complete coordinator.xml

Since I had already set up the workflow.xml with variables, I didn’t have to make any changes to the workflow file. The dataset block has some useful markers, like DAY, MONTH, and YEAR, to help with URI construction.

Notice that I did not put the /*/server_events.log in the dataset definition because I did not think it belonged there; it made more sense in the workflow configuration. I may revisit that decision if wildcards are supported in the dataset block; from the documentation I cannot tell. I had hoped that the DAY, MONTH, and YEAR markers could be used in other blocks within the coordinator file. However, once I figured out that the input-events block selects an actual instance of the URI pattern, I understood that the DAY, MONTH, and YEAR EL variables behave more like macros than variables that can be referenced elsewhere. If I wanted access to date information I could also have used the ${coord:current(int)} EL function. In my next revision of the mapper I will use the current EL function to validate that the log data is within the date range the log file is supposed to contain.

The first time I ran this code the coordinator job went into a waiting state, and I had no idea why. After staring at the logs I noticed that every few seconds Oozie was checking for the existence of my files. At first I didn’t notice, but Oozie was actually looking for a specific file called “_SUCCESS” in my source directory. I remembered seeing something about that in the documentation but hadn’t paid attention. Looking back at the documentation, I realized that the “_SUCCESS” file is a flag that lets Oozie know the source data is ready. Since our system does not produce such a flag, I had to add the empty done-flag element you see in the coordinator dataset definition.

Next, I missed the now obvious fact that my properties file had to change to a more realistic start and stop time so that the coordinator could correctly calculate today, yesterday, and so on. Our logs are rotated at midnight and my test set was for 06/06/2011, so I set my start time to 06/07/2011 01:00 and my stop time to 06/07/2011 02:00. I figured that once my start date was set I could use current(-1) to give me the previous day. However, current does not appear to work that way, at least not with the frequency setting I was using. The documentation states:

The ${coord:days(int n)} EL function returns the number of minutes for ‘n’ complete days starting with the day of the specified nominal time for which the computation is being done.

I guess that makes sense? So, I set my frequency to ${coord:days(1)} because I wanted the process to run once every day.

Next, I had to feed a specific instance of my dataset in as the input event trigger. The Oozie documentation on input events describes a special EL function called coord:current:

${coord:current(int n)} represents the nth dataset instance for a synchronous dataset, relative to the coordinator action creation (materialization) time. The coordinator action creation (materialization) time is computed based on the coordinator job start time and its frequency. The nth dataset instance is computed based on the dataset’s initial-instance datetime, its frequency and the (current) coordinator action creation (materialization) time.

n can be a negative integer, zero or a positive integer.

I figured that ${coord:current(-1)} for a start date of 06/07/2011 01:00 would resolve to 06/06/2011. It does not. However, ${coord:current(0)} does. I think if I had used a start date of 06/07/2011 24:00 it would work that way. Go figure.

Conclusion

Once I figured all this out I ran my script one final (for that day) time. Everything worked! I expected a bit more of a rush; however, it didn’t feel like I had accomplished much for the time spent. Hence this blog post, to ensure others don’t spend a day working through all of this. In the end, reading through the configurations, it all basically makes sense:

  • You have to write custom code to move data out of HDFS
  • Start and stop dates and times matter
  • You can have multiple output events for different actions
  • If you want to process logs from 06/06/2011, set the start date to 06/07/2011 and use current(0) (unless, I guess, you set the start date to 06/07/2011 24:00)
  • Oozie allowed me to replace a bunch of custom code with some fairly simple XML

Cliff Click at SM JUG

The first ever Santa Monica JUG meetup will be held at Shopzilla's offices on the 17th at 7pm. Mark your calendars and check us out on meetup: http://www.meetup.com/SM-JUG.

See you there!

Paddy


Santa Monica Java User Group

Over the past few weeks a few of us from Edmunds, Shopzilla, and eHarmony have been discussing how to foster a broader developer community here on the west side. More and more, Santa Monica, and the west side in general, has become a center for high tech, web focused businesses; however, there is not a strong sense of developer community. To help create one, we will be holding monthly Java user group meetings rotating between three locations.

The first meeting will be at Shopzilla in February, followed by one at Edmunds in March, and a third at eHarmony in April. We will hold meetings in a variety of formats; however, we will always remain focused on the developers.

I will keep you all updated as more details are fleshed out.


What I learned about Continuous Deployment

While at QCon I took a great class on continuous deployment from the folks at ThoughtWorks. The single most important thing I learned from the class was that if something is painful, do it a lot! Having just sat through at least three war rooms in the last two months to manage deployments, that statement rings true!

[Image: the war room from Dr. Strangelove]

However, when I mentioned this late one night to the team I got a lot of blank looks and confusion. Given our current environment and processes I don't blame them for the skepticism. In fact there was a lot of "if we did this all the time, none of us would be here for long." They are right in more ways than they know! The trick with doing something painful a lot is that you learn what the painful parts are and work to make them pain free.

The biggest part of the problem is the monumental amount of change that went into our last few releases. Starting with the first release of our beta site we tried to move to a three week release schedule. For the most part we did; the problem, however, was that the time from code complete on the first release to code complete for the second and third releases was months. We held off from putting much change into our second release and treated the third release as the one that introduced most of the business facing changes. What ended up happening is that we crammed months of work into a three week release window. Not fun, and not the point of the three week cycle. If we had made use of the techniques that the continuous deployment folks are pushing, we could have greatly reduced the pain of our process.

The ways to get to a short or continuous release cycle are:

  • Automate testing through the whole stack, including acceptance testing
  • Create feature flags to allow functionality to be released but turned off
  • Keep everything in mainline; don't branch
  • Keep configuration in SCM with your code but deploy it separately
  • Use a green/blue, red/black, or A/B style deployment model

At Edmunds we have started to automate testing; however, we have a ways to go. We have automated unit (fine grained) tests, automated UI tests using Selenium, and mostly automated load tests using JMeter. However, we are missing coarser grained, integration style testing, and our ability to create automated acceptance tests needs work. There is a mythical land where all acceptance tests are fully automated; once one has arrived in that land, one can push code with confidence that nothing has broken.

I am still trying to figure out how to get to that mythical land. We are setting 80% code coverage targets for our unit tests. Next we need to build up our integration tests. My current thinking is that we build integration tests based on pre-production and production bugs and incidents: as these issues come up, we create tests to reproduce the problem and then prove that the problem has been fixed. Our Selenium tests have pretty good coverage, but do not account for business user acceptance criteria. BDD style frameworks such as JBehave may help; however, this is the area I know the least about.

One of the goals for my group next year will be to work with our QA staff to figure out how we are going to address testing to eliminate part of our deployment pain.

Part of the goal of continuous deployment is that every build could be pushed to production. Even with good test coverage there are times when you need to delay the release of a feature because it is not ready for some reason. To allow any build to be deployed to production you need a mechanism to turn those features off. Feature flags are a concept that lets you do this: using a configuration file, or some other mechanism, you put checks in the code to determine whether a feature should be on or off. That implementation feels heavyweight, and I think we need to spend some time figuring out a more elegant approach than a bunch of hard coded if/else statements. Perhaps we could use annotations? For us the simplest option might be to use some of the A/B functionality we have built into our CMS. By switching out templates or modules on the site we could, essentially, build new functionality as an extension of an existing module and switch the module include based on feature flag configuration files. We have done something like this in the past using flags to enable or disable features. A simple flag check might look something like the sketch below.
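
As a minimal sketch, assuming a properties-backed flag store (the class, file, and flag names are hypothetical, not part of our actual system):

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

/** Hypothetical feature flag helper backed by a properties file on the classpath. */
public final class FeatureFlags {

    private static final Properties FLAGS = load();

    private FeatureFlags() {
    }

    private static Properties load() {
        Properties props = new Properties();
        InputStream in = FeatureFlags.class.getResourceAsStream("/feature-flags.properties");
        if (in != null) {
            try {
                props.load(in);
            } catch (IOException e) {
                // A missing or unreadable file simply means every flag defaults to off.
            } finally {
                try {
                    in.close();
                } catch (IOException ignored) {
                    // Nothing useful to do if the stream will not close.
                }
            }
        }
        return props;
    }

    /** Returns true when the named feature is switched on; features default to off. */
    public static boolean isOn(String feature) {
        return Boolean.parseBoolean(FLAGS.getProperty(feature, "false"));
    }
}

A call site then reduces to the hard coded if/else style described above, for example if (FeatureFlags.isOn("new-comparator")) { ... } else { ... }. The CMS approach would replace those branches with a module or template include switched on the same flag value.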

Another goal for my team next year will be to design a feature flag system for our front end applications to use. 

Once you have feature flags and comprehensive tests you can stop using feature branches. Eliminating feature branches removes the complexity of managing the code base and gives us a higher probability of having builds that are ready for production. We have already begun the move toward mainline only development.

An area we have spent a lot of time on, and that still needs more work, is configuration. We have leveraged DNS for certain configuration items using DNS TXT records; the TXT records let our systems know which environment they are running in. A simple use case of environment name based configuration is URL rewriting. For all non-production environments we use a URL like qa-www.edmunds.com, while production is http://www.edmunds.com. To make this switch we have a simple tag library that prepends the environment name plus "-" to all URLs and images. Prior to this setup we had a tool that would switch your hosts file based on which environment you wanted to look at. We also make extensive use of configuration files outside of our code base and use Spring to wire in the configuration if an external file is found. We try our best to do configuration by exception using a combination of these two techniques. Where we need to improve is on the SCM side of configuration. Many of the one off files are kept on machines and are not checked in; during deployments this practice leads to confusion and often results in missed configuration steps. We will be moving configuration into SCM and deploying it as part of our roll-out. Our use of DNS names for environments should help with this, as the configuration could be pulled down automatically. Additionally, our use of ZooKeeper could provide a more robust configuration system; we just need to build it :-)
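
For illustration only, reading the environment name out of a DNS TXT record can be done with the JDK's JNDI DNS provider. The record name below is a made-up example, and this is a sketch of the technique rather than our actual tag library code:

import java.util.Hashtable;

import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

/** Hypothetical helper that resolves the environment name from a DNS TXT record. */
public final class EnvironmentResolver {

    private EnvironmentResolver() {
    }

    /** Looks up the TXT record for the given name, e.g. "env.example.internal". */
    public static String environmentName(String recordName) throws NamingException {
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
        DirContext ctx = new InitialDirContext(env);
        try {
            // The TXT record payload (for example "qa") drives URL rewriting such as
            // prepending "qa-" to www.edmunds.com in non-production environments.
            Attribute txt = ctx.getAttributes(recordName, new String[]{"TXT"}).get("TXT");
            return txt == null ? null : txt.get().toString();
        } finally {
            ctx.close();
        }
    }
}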

The last piece of the puzzle is something we are currently working on (in fact, I am in a war room right now launching the first B version of our server farm). Having two full production stacks is a luxury to be sure, but it gives us the ability to fail back rapidly in case of a problem with a build. Again, make something painful less painful! By having an A/B set of servers we can deploy to one, give it traffic, and watch for failure. In the case of failure we can easily fall back. In addition, we do not have to worry about deployments being as safe, so we can build fast upgrade processes.

Doing something painful more often is a recipe for disaster if you keep doing what you have always done. However, if you use the practice as a mechanism to improve, it will be a great benefit. I look forward, over the course of 2011, to implementing more of these practices and moving us closer to the nirvana that potentially awaits ;-)

 


Edmunds.com Redesign Goes Live :-)

After about two years of development we have actually pulled off something that is usually considered a terrible idea that will lead to utter disaster: we have rewritten our entire technology stack and redesigned our website in one shot. We have tried in the past to do incremental updates; however, we were never able to pull it off. Eventually, the weight of our technology stack, combined with years of incremental site improvements, made the decision to start over not terribly difficult from either a business or engineering perspective.

Some highlights

  • Our ops team built a new process that significantly reduced the amount of time it takes to provision servers
  • Moved from a monolithic deployment to an extremely modular system with more than fifty deployment artefacts
  • Introduced a Coherence based data grid for storing and retrieving data for page rendering
  • Introduced Solr as our search engine
  • Built a new CMS
  • Built and open sourced a content based routing mechanism, available on github: https://github.com/edmunds
  • Built a new message based content and data publishing system
  • Introduced Hadoop for part of our ETL process
  • Introduced Netezza as a reporting platform

This was a huge release, and we took on a lot of risk. The whole company really pulled together to make the redesign a reality. It is an incredible achievement, and I am very proud of everyone!
