Installing Hadoop on Mac

Hadoop is an open source framework for distributed storage and processing whose file system, HDFS, is modeled on the Google File System, and it underlies much of the current interest in Big Data. If you want to explore the Big Data ecosystem or are seeking a career in it, learning Hadoop is a must. Installing Hadoop is not straightforward, because it requires a specific version of Java and the right configuration. The steps below describe the method I used to deploy Hadoop on my personal Mac.

Installing Homebrew

Homebrew (brew) is a package manager for the Mac. It is similar to the Mac App Store, but it is not controlled by Apple and it distributes open source software. We can install brew using the following command.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Installing Java

Install Java using the following command. Newer Homebrew releases have replaced the cask subcommand with brew install --cask.

brew cask install java
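Before moving on, it can help to confirm that both tools are actually on the PATH. The snippet below is only a minimal sketch: require_cmd is a helper name I made up for illustration, not something Homebrew or Java provides.

```shell
#!/usr/bin/env bash
# require_cmd is a hypothetical helper: it reports whether a command
# can be found on the PATH and returns non-zero when it cannot.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Check the tools installed so far; '|| true' keeps going if one is missing.
require_cmd brew || true
require_cmd java || true
```

If java is reported as found, running java -version shows which release was installed.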

Configuring the SSH Key

Hadoop uses SSH keys to authenticate the user, so we must be able to ssh into the Mac. Verify whether remote login (SSH) is enabled using the following command.

sudo systemsetup -getremotelogin

If the above command reports that remote login is off, turn it on using the following command.

sudo systemsetup -setremotelogin on

If you don't have an SSH key, create one using the following command. By default it writes the private key to .ssh/id_rsa and the public key to .ssh/id_rsa.pub.

ssh-keygen -t rsa

We then have to append the public key to the authorized_keys file.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
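Running the cat command more than once adds the same key again each time. As a rough sketch of a repeatable version, the hypothetical helper add_key_once below (my own name, not part of OpenSSH) appends the key only when it is not already present and tightens the file permissions.

```shell
#!/usr/bin/env bash
# add_key_once is a hypothetical helper: it appends a public key file to an
# authorized_keys file only if that exact line is not already there.
add_key_once() {
  local pub="$1" auth="$2"
  mkdir -p "$(dirname "$auth")"
  touch "$auth"
  # -q quiet, -x whole-line match, -F fixed strings, -f read patterns from file
  if ! grep -qxFf "$pub" "$auth"; then
    cat "$pub" >> "$auth"
  fi
  # sshd may reject an authorized_keys file that others can write to
  chmod 600 "$auth"
}

# Usage with the default paths from this post (not executed here):
#   add_key_once ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```

Afterwards, ssh -o BatchMode=yes localhost true should return without asking for a password if the key setup worked.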

Installing Hadoop

We can install any program available in brew using brew install <program_name>. Use the following command to install Hadoop.

brew install hadoop

Configuring Hadoop Environment Variables

Hadoop environment variables are stored in hadoop-env.sh. In my environment the file is at ‘/usr/local/Cellar/hadoop/3.1.1/libexec/etc/hadoop/hadoop-env.sh’. You can find the location of hadoop-env.sh on your machine using the following command.

find / -name "hadoop-env.sh" 2>/dev/null

We then need to modify the HADOOP_OPTS and JAVA_HOME environment variables in that file.

export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk<version>.jdk/Contents/Home"

The installed Java version can be found using the following command.

ls /Library/Java/JavaVirtualMachines/
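As an alternative to hard-coding the version, macOS ships a small tool, /usr/libexec/java_home, that prints the home directory of the default JDK. The sketch below wraps it in a hypothetical helper, detect_java_home (a name I chose for illustration), so that JAVA_HOME is only exported when a JDK is actually found.

```shell
#!/usr/bin/env bash
# detect_java_home is a hypothetical helper. It runs the java_home tool
# (by default the one shipped with macOS) and prints the JDK home directory.
# It returns non-zero if the tool is not present or finds no JDK.
detect_java_home() {
  local tool="${1:-/usr/libexec/java_home}"
  [ -x "$tool" ] || return 1
  "$tool"
}

# Export JAVA_HOME only when a JDK was actually found.
if jh="$(detect_java_home 2>/dev/null)"; then
  export JAVA_HOME="$jh"
fi
```

If several JDKs are installed, /usr/libexec/java_home -v 1.8 selects Java 8 and /usr/libexec/java_home -V lists all of them.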

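Configuring core-site.xml

The core-site.xml file defines the HDFS address and port number. It resides in the same directory as hadoop-env.sh. Recent Hadoop releases name the address property fs.defaultFS; the older name fs.default.name used here is deprecated but still accepted.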

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

Configuring mapred-site.xml

The mapred-site.xml file is used to configure the JobTracker address and port number for MapReduce.

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

Configuring hdfs-site.xml

The hdfs-site.xml file sets the replication factor of HDFS. The default value is 3; we change it to 1 for our single-node HDFS development server.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
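To check that Hadoop actually picks up the edited files, the hdfs getconf -confKey subcommand prints the effective value of a configuration key. The following is a small sketch around it; show_conf is a hypothetical helper of my own, and the key list simply mirrors the properties set in this post.

```shell
#!/usr/bin/env bash
# show_conf is a hypothetical helper that prints "key = value" for each key,
# using the 'hdfs getconf -confKey' subcommand to look up the value.
show_conf() {
  local key
  for key in "$@"; do
    printf '%s = %s\n' "$key" "$(hdfs getconf -confKey "$key")"
  done
}

# Only query Hadoop when the hdfs command is actually on the PATH.
if command -v hdfs >/dev/null 2>&1; then
  show_conf fs.defaultFS hadoop.tmp.dir dfs.replication
fi
```

If a value shown here differs from what was written in the XML files, the most likely cause is that the files were edited in a different directory than the one Hadoop reads.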

Formatting HDFS

Before we start using HDFS we need to format it. Use the command below to do so.

hdfs namenode -format

Launching Hadoop

The package also provides some useful scripts to start and shut down the resources in the cluster. They are located in /usr/local/sbin.

cd /usr/local/sbin
./start-dfs.sh
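Once the start script has finished, a quick way to see whether HDFS is really up is to write a small file and read it back. The helper below, hdfs_smoke_test, is a hypothetical sketch of my own; the file name hello.txt and the /user/<name> directory are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# hdfs_smoke_test is a hypothetical helper: it writes a small file into HDFS
# and reads it back, which fails if the NameNode or DataNode is not running.
hdfs_smoke_test() {
  local dir="/user/$(whoami)" tmp rc
  tmp="$(mktemp)"
  echo "hello hdfs" > "$tmp"
  # Create a home directory, upload the file (overwriting), then print it.
  hdfs dfs -mkdir -p "$dir" &&
    hdfs dfs -put -f "$tmp" "$dir/hello.txt" &&
    hdfs dfs -cat "$dir/hello.txt"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}

# Usage, after start-dfs.sh has finished (not executed here):
#   hdfs_smoke_test
```

The jps command from the JDK is also useful at this point: with HDFS running it should list NameNode, DataNode and SecondaryNameNode processes. In Hadoop 3.x the NameNode web interface is served on http://localhost:9870 by default.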

You can then stop HDFS using the following command.

./stop-dfs.sh