hadoop-test-cluster

This package provides a Docker setup and command-line tool for setting up and managing realistic Hadoop clusters. These can be useful for demos, but their main intended use is development and testing. The Hadoop documentation is emphatic about the need for testing:

If you don’t test your YARN application in a secure Hadoop cluster, it won’t work.

Hadoop’s security model is notoriously tricky to get right. It is so hard that one of the main contributors wrote an online book comparing its difficulties to Lovecraftian horrors. The clusters provided here use settings chosen to expose potential issues in Hadoop and YARN applications, so you can fix your code early rather than in production. If your code runs on these clusters, it should (hopefully) run on any cluster.

Highlights

  • Easy-to-use command-line tool

  • Supports both Hadoop 2 (CDH 5) and Hadoop 3 (CDH 6)

  • Supports both Kerberos and Simple authentication modes

  • Comes with common development and testing dependencies pre-installed (miniconda, maven, git, …)

  • Designed to minimize resource usage - runs fine both on laptops and on common CI servers such as Travis CI

  • Configured with a set of options designed to expose bugs in Hadoop applications. If your code runs on these clusters, it should (hopefully) run anywhere.

Installation

hadoop-test-cluster is available on PyPI:

$ pip install hadoop-test-cluster

You can also install from source on github:

$ pip install git+https://github.com/jcrist/hadoop-test-cluster.git

Docker and Docker Compose must already be installed on your system; consult their documentation for installation instructions.

Overview

The main entry point for this package is the htcluster command-line tool. This tool can be used to start, stop, and interact with test Hadoop clusters.
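
A typical session strings these together; for example (all of these commands are covered in the sections below):

$ htcluster startup --image cdh5 --config simple    # start a cluster
$ htcluster login                                   # work interactively on the edge node
$ htcluster shutdown                                # tear it down when done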

Startup a Cluster

To start a cluster, use the htcluster startup command. Two parameters are used to describe which cluster to run:

  • --image: which Docker image to use

  • --config: which Hadoop configuration to run the cluster with

Images

Currently supported images are:

  • cdh5: Hadoop 2, from CDH 5

  • cdh6: Hadoop 3, from CDH 6

To determine which version of Hadoop a cluster is running, check the HADOOP_TESTING_VERSION environment variable (it will be set to either cdh5 or cdh6).

Configurations

Currently two different configurations are supported:

  • simple: uses simple authentication (unix user permissions)

  • kerberos: uses kerberos for authentication

To determine which configuration a cluster is running, check the HADOOP_TESTING_CONFIG environment variable (it will be set to either simple or kerberos).
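
Both variables are set inside the cluster, so you can check them from outside with htcluster exec (described below). For example, on a cdh5/simple cluster this should print the values shown:

$ htcluster exec -- bash -c 'echo "$HADOOP_TESTING_VERSION, $HADOOP_TESTING_CONFIG"'
cdh5, simple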

Examples

Start a CDH 5 cluster with simple authentication:

$ htcluster startup --image cdh5 --config simple

Start a CDH 6 cluster with kerberos authentication:

$ htcluster startup --image cdh6 --config kerberos

Start a cluster, mounting the current directory to ~/workdir:

$ htcluster startup --image cdh5 --mount .:workdir
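
Assuming the mount succeeded, the contents of the current directory are then visible from inside the cluster:

$ htcluster exec -- bash -c 'ls ~/workdir'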

Login to a Cluster

For interactive work, you can log in to a cluster using the htcluster login command.

Examples

Login to the edge node as the default user:

$ htcluster login

Login to the master node as root:

$ htcluster login --user root --service master

Execute a Command on a Cluster

Instead of logging in, you can also run a command on a cluster using the htcluster exec command.

Examples

Run py.test on a cluster:

$ htcluster exec -- py.test /path/to/my/tests

Follow the HDFS Namenode logs:

$ htcluster exec --user root --service master \
  -- tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode.log

Shutdown a Cluster

To shutdown a cluster, use the htcluster shutdown command.

Example

$ htcluster shutdown

Cluster Details

Here we provide more details on what each cluster supports.

Layout

Each cluster has three containers:

  • One master node running the hdfs-namenode and yarn-resourcemanager, as well as the kerberos daemons. Hostname is master.example.com.

  • One worker node running the hdfs-datanode and yarn-nodemanager. Hostname is worker.example.com.

  • One edge node for interacting with the cluster. Hostname is edge.example.com.
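
You can verify which container a command runs on by checking its hostname; for example, on the master node:

$ htcluster exec --user root --service master -- hostname
master.example.com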

Installed Packages

All clusters come with common development and testing dependencies pre-installed, including:

  • miniconda

  • maven

  • git

Additional packages can be installed at runtime with yum, conda, or pip.
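
For example (the package names here are arbitrary illustrations):

$ htcluster exec -- conda install -y numpy             # per-user packages via conda/pip
$ htcluster exec --user root -- yum install -y htop    # system packages require root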

Users

A default user account has been created for testing purposes:

  • Login: testuser

  • Password: testpass

When using kerberos, a keytab for this user has been placed at /home/testuser/testuser.keytab, so you can kinit easily with kinit -kt /home/testuser/testuser.keytab testuser.
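
For example, after logging in you can acquire a ticket and then inspect it (klist is part of the standard Kerberos client tools):

$ kinit -kt /home/testuser/testuser.keytab testuser
$ klist    # should show a ticket for testuser@EXAMPLE.COM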

An admin kerberos principal has also been created for use with kadmin:

  • Login: root/admin

  • Password: adminpass
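
For example, from a shell on the cluster you could list all principals in the realm (using standard MIT kadmin flags; you'll be prompted for the password above):

$ kadmin -p root/admin -q "listprincs"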

Ports

The following ports are exposed:

Master Node

  • NameNode RPC: 9000

  • NameNode Web UI: 50070

  • ResourceManager Web UI: 8088

  • Kerberos KDC: 88

  • Kerberos Kadmin: 749

Worker Node

  • DataNode Web UI: 50075

  • NodeManager Web UI: 8042

Edge Node

  • User Defined: 8888

  • User Defined: 8786

The full address for accessing these depends on the IP address of your docker-machine driver, which can be found with:

$ docker-machine inspect --format {{.Driver.IPAddress}}

If you frequently want access to the Web UIs, it’s recommended to add the following lines to your /etc/hosts:

<docker-machine-ip>    edge.example.com
<docker-machine-ip>    master.example.com
<docker-machine-ip>    worker.example.com
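
With those entries in place, the Web UIs can be addressed by hostname. For example, with the simple configuration the NameNode Web UI should respond directly:

$ curl http://master.example.com:50070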

Authenticating with Kerberos from outside Docker

With the kerberos configuration, the Web UIs are secured by Kerberos, and so won’t be accessible from your browser unless you configure things properly. This is doable, but takes a few steps:

  1. Kerberos/SPNEGO requires that the requested URL match the host’s domain. The easiest way to do this is to modify your /etc/hosts and add a line for master.example.com:

    # Add a line to /etc/hosts pointing master.example.com to your docker-machine
    # driver ip address.
    # Note that you probably need to run this command as a super user.
    $ echo "$(docker-machine inspect --format {{.Driver.IPAddress}})  master.example.com" >> /etc/hosts
    
  2. You must have kinit installed locally. You may already have it; otherwise it’s available through most package managers.

  3. You need to tell Kerberos where the krb5.conf for this domain is. This is done with an environment variable (KRB5_CONFIG). To make this easy, htcluster has a command to do this:

    $ eval $(htcluster kerbenv)
    
  4. At this point you should be able to kinit as testuser:

    $ kinit testuser@EXAMPLE.COM
    
  5. To access Kerberos-secured pages in your browser you’ll need to do a bit of (simple) configuration. See this documentation from Cloudera for information on what’s needed for your browser.

  6. Since environment variables are only visible to processes started from the shell in which they were set, you have three options here:

    • Restart your browser from the shell in which you added the environment variables

    • Manually get a ticket for the HTTP/master.example.com principal. Note that this will delete your other tickets, but works fine if you just want to see the webpage

      $ kinit -S HTTP/master.example.com testuser
      
    • Use curl to authenticate the first time, at which point you’ll already have the proper tickets in your cache, and the browser authentication will just work. Note that your version of curl must support the GSS-API.

      $ curl -V  # Check your version of curl supports GSS-API
      curl 7.59.0 (x86_64-apple-darwin17.2.0) libcurl/7.59.0 SecureTransport zlib/1.2.11
      Release-Date: 2018-03-14
      Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp smb smbs smtp smtps telnet tftp
      Features: AsynchDNS IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz UnixSockets
      
      $ curl --negotiate -u : http://master.example.com:50070  # get an HTTP ticket for master.example.com
      

    After doing one of these, you should be able to access any of the pages from your browser.