Installing single-node Hadoop on a Windows machine

1) Install the GCC compiler and related build tools with the Cygwin setup executable by using the following command:
C:\cygwin64>setup-x86_64.exe -q -P wget -P gcc-g++ -P make -P diffutils -P libmpfr-devel -P libgmp-devel -P libmpc-devel
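Once setup finishes, a quick sanity check (assuming the default C:\cygwin64 install location) is to confirm the toolchain is in place:
REM paths below assume the default Cygwin install directory C:\cygwin64
C:\cygwin64\bin\gcc --version
C:\cygwin64\bin\make --version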

2) Download and install the prerequisites for Hadoop (the full list is in the BUILDING.txt file included with the Hadoop source); a quick way to verify them is sketched after the list:

* JDK 1.6+
* Maven 3.0 or later
* Findbugs 1.3.9 (if running findbugs)
* ProtocolBuffer 2.5.0 (ProtoBuf)
* Windows SDK or Visual Studio 2010 Professional
* Unix command-line tools from GnuWin32 or Cygwin: sh, mkdir, rm, cp, tar, gzip
* zlib headers (if building native code bindings for zlib). Make sure zlib1.dll and all the header files (downloaded from the zlib site) are inside ZLIB_HOME, and add that directory to the PATH
* Internet connection for first build (to fetch all Maven and Hadoop dependencies)
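Before starting the build, it helps to confirm the main tools are reachable from the console; a minimal check (the exact version output will vary with your installs) is:
REM each command should print a version banner if the tool is on the PATH
java -version
mvn -version
protoc --version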

3) Set the Platform variable from the console (an admin console is used for these steps, so make sure any permanent environment variables are defined under SYSTEM, not USER, variables). Note that the variable name is case-sensitive:
set Platform=x64 (when building on a 64-bit system)
set Platform=Win32 (when building on a 32-bit system)
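To confirm the variable is visible to the current console, echo it back; the value printed should match what you set:
echo %Platform%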

4) Download the Hadoop source (from the Hadoop repository), which contains everything required to build the distribution (common, HDFS, MapReduce, etc.)

5) Navigate to the downloaded Hadoop source folder and build Hadoop with Maven using the following command:
mvn package -Pdist,native-win -DskipTests -Dtar
Possible ERRORS:

1)[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (pre-dist) on project hadoop-project-dist: An Ant BuildException has occured: Execute failed: java.io.IOException: Cannot run program "sh" (in directory "C:\JAVA\hadoopsource\hadoop-2.6.0-src\hadoop-project-dist\target"): CreateProcess error=2, The system cannot find the file

Fixes:
a) Add Cygwin's bin directory to the PATH
b) Add protobuf (the directory containing protoc.exe) to the PATH
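For example, assuming Cygwin is installed under C:\cygwin64 and protoc.exe lives in C:\protobuf (both locations are assumptions; adjust them to your machine), the PATH for the current console can be extended with:
REM C:\cygwin64\bin and C:\protobuf are assumed locations; point them at your actual installs
set PATH=%PATH%;C:\cygwin64\bin;C:\protobuf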

2)[ERROR] Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:2.5.0-SNAPSHOT:protoc (compile-protoc) on project hadoop-common: org.apache.maven.plugin.MojoExecutionException: 'protoc --version' did not return a version -> [Help 1]

Fix: this means protoc.exe is not available on the PATH. Make sure the directory containing protoc.exe is on the PATH (see the PATH example under error 1) so that 'protoc --version' succeeds from the console.

3) [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2:exec (compile-ms-winutils) on project hadoop-common: Command execution failed. Process exited with an error: 1(Exit value: 1) -> [Help 1]

Fix: zlib1.dll and the header files (downloaded from the zlib site) should be in ZLIB_HOME, and that directory should be added to the PATH.
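A minimal sketch, assuming the zlib files were unpacked to C:\zlib (an assumed location):
REM C:\zlib is an assumed location for zlib1.dll and the zlib headers
set ZLIB_HOME=C:\zlib
set PATH=%PATH%;%ZLIB_HOME%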

After the build is successful, the hadoop-dist\target folder will contain the corresponding distribution tar file.
Let's extract it to c:\hdp.
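A minimal sketch using Cygwin's tar, run from the source folder (the tarball name below is an assumption; use the actual file produced under hadoop-dist\target). The --strip-components=1 option drops the top-level folder so the distribution lands directly in c:\hdp:
REM the tarball name is an assumption; substitute the file actually produced by the build
mkdir c:\hdp
C:\cygwin64\bin\tar -xf hadoop-dist/target/hadoop-2.6.0.tar.gz -C /cygdrive/c/hdp --strip-components=1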

NOTE:
I ran the console (Windows cmd) in administrator mode to execute these steps.

References:
  1. Apache Wiki
  2. Marius Blog
Testing:
  1. We'll be configuring Hadoop for a Single Node (pseudo-distributed) Cluster.
  2. As part of configuring HDFS, update the files:
    1. near the end of "\hdp\etc\hadoop\hadoop-env.cmd" add the following lines:
      set HADOOP_PREFIX=c:\hdp
      set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
      set YARN_CONF_DIR=%HADOOP_CONF_DIR%
      set PATH=%PATH%;%HADOOP_PREFIX%\bin
    2. modify "\hdp\etc\hadoop\core-site.xml" with the following:
      <configuration>
        <property>
          <name>fs.default.name</name>
          <value>hdfs://0.0.0.0:19000</value>
        </property>
      </configuration>
    3. modify "\hdp\etc\hadoop\hdfs-site.xml" with:
      <configuration>
        <property>
          <name>dfs.replication</name>
          <value>1</value>
        </property>
      </configuration>
    4. Finally, make sure "\hdp\etc\hadoop\slaves" has the following entry:
      localhost
    5. and create the c:\tmp directory, as the default configuration puts HDFS metadata and data files under \tmp on the current drive
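      From the same console, for example:
      mkdir c:\tmp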
  3. As part of configuring YARN, update files:
    1. add the following entries to "\hdp\etc\hadoop\mapred-site.xml", replacing %USERNAME% with your Windows user name:
      <configuration>
        <property>
          <name>mapreduce.job.user.name</name>
          <value>%USERNAME%</value>
        </property>
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
        </property>
        <property>
          <name>yarn.apps.stagingDir</name>
          <value>/user/%USERNAME%/staging</value>
        </property>
        <property>
          <name>mapreduce.jobtracker.address</name>
          <value>local</value>
        </property>
      </configuration>
    2. modify "\hdp\etc\hadoop\yarn-site.xml" with:
      <configuration>
        <property>
          <name>yarn.server.resourcemanager.address</name>
          <value>0.0.0.0:8020</value>
        </property>
        <property>
          <name>yarn.server.resourcemanager.application.expiry.interval</name>
          <value>60000</value>
        </property>
        <property>
          <name>yarn.server.nodemanager.address</name>
          <value>0.0.0.0:45454</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
          <name>yarn.server.nodemanager.remote-app-log-dir</name>
          <value>/app-logs</value>
        </property>
        <property>
          <name>yarn.nodemanager.log-dirs</name>
          <value>/dep/logs/userlogs</value>
        </property>
        <property>
          <name>yarn.server.mapreduce-appmanager.attempt-listener.bindAddress</name>
          <value>0.0.0.0</value>
        </property>
        <property>
          <name>yarn.server.mapreduce-appmanager.client-service.bindAddress</name>
          <value>0.0.0.0</value>
        </property>
        <property>
          <name>yarn.log-aggregation-enable</name>
          <value>true</value>
        </property>
        <property>
          <name>yarn.log-aggregation.retain-seconds</name>
          <value>-1</value>
        </property>
        <property>
          <name>yarn.application.classpath</name>
          <value>%HADOOP_CONF_DIR%,%HADOOP_COMMON_HOME%/share/hadoop/common/*,%HADOOP_COMMON_HOME%/share/hadoop/common/lib/*,%HADOOP_HDFS_HOME%/share/hadoop/hdfs/*,%HADOOP_HDFS_HOME%/share/hadoop/hdfs/lib/*,%HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/*,%HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_YARN_HOME%/share/hadoop/yarn/*,%HADOOP_YARN_HOME%/share/hadoop/yarn/lib/*</value>
        </property>
      </configuration>
  4. because Hadoop doesn't recognize JAVA_HOME from the Windows "Environment Variables" settings (and has problems with spaces in path names):
    1. copy your JDK to some directory (e.g. "c:\hdp\java\jdk1.8.0_40")
    2. edit "\hdp\etc\hadoop\hadoop-env.cmd" and update
      set JAVA_HOME=c:\hdp\java\jdk1.8.0_40
    3. initialize Environment Variables by running cmd in "Administrator Mode" and executing: "c:\hdp\etc\hadoop\hadoop-env.cmd"
  5. Format the FileSystem
    c:\hdp\bin\hdfs namenode -format
  6. Start HDFS Daemons
    c:\hdp\sbin\start-dfs.cmd
  7. Start YARN Daemons
    c:\hdp\sbin\start-yarn.cmd
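    A quick way to confirm the daemons came up is jps (it ships with the JDK and lists running Java processes); it should show NameNode, DataNode, ResourceManager and NodeManager:
    jps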
  8. Run an example YARN job
    c:\hdp\bin\yarn jar c:\hdp\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.0.jar wordcount c:\hdp\LICENSE.txt /out
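    To inspect the job output, list and read the /out directory on HDFS (part-r-00000 is the usual single-reducer output file name):
    c:\hdp\bin\hdfs dfs -ls /out
    c:\hdp\bin\hdfs dfs -cat /out/part-r-00000
    If the job instead reports that the input path does not exist, copy the file into HDFS first with "c:\hdp\bin\hdfs dfs -put" and point wordcount at that HDFS path.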
  9. Check the following pages in your browser:
    Resource Manager:  http://localhost:8088
    Web UI of the NameNode daemon:  http://localhost:50070
    Web UI of the NodeManager daemon:  http://localhost:8042

