Installing and Configuring Hadoop and HDFS on Linux

Published on: April 3, 2025


Hadoop has become a foundational tool in the big data ecosystem, widely used for distributed storage and processing of large datasets. This article walks through the installation and configuration of Hadoop 3.4.1 and HDFS on a Kali Linux system. Designed for students and professionals alike, the steps below outline a hands-on, local setup using a non-root user—ideal for experimentation, learning, and lightweight development.

Understanding the Basics

Hadoop is an open-source framework that allows distributed processing of large data sets across clusters of computers using simple programming models. Its core modules include:

  • HDFS (Hadoop Distributed File System): For scalable, fault-tolerant storage
  • MapReduce: A computation model for parallel processing
  • YARN: Yet Another Resource Negotiator, used for cluster resource management

Before diving into setup, it's essential to understand the significance of each component and why configuration matters. Setting up Hadoop locally on Kali Linux (or any Unix-based OS) provides valuable insights into its core mechanisms and how real-world clusters are managed.

Prerequisites

Before diving into Hadoop, ensure your system is ready:

1. Update System & Install Java

Hadoop requires Java. Hadoop 3.x runs on Java 8 or Java 11; we'll use OpenJDK 11, which is what current Kali repositories provide.

BASH
sudo apt update
sudo apt install openjdk-11-jdk -y
java -version; javac -version

2. Install SSH

Hadoop's control scripts (start-dfs.sh, start-yarn.sh) use SSH to launch daemons, even on a single machine.

BASH
sudo apt install openssh-server openssh-client -y

3. Create a Hadoop User

Avoid using root. Create a dedicated user:

BASH
sudo adduser hdoop
sudo adduser hdoop sudo
su - hdoop

4. Setup Passwordless SSH

This lets the Hadoop start/stop scripts log in to localhost without prompting for a password.

BASH
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost

Download and Extract Hadoop

BASH
wget https://downloads.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
tar xzf hadoop-3.4.1.tar.gz
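Optionally, verify the integrity of the download before extracting. Apache publishes a SHA-512 checksum next to each release archive (the .sha512 URL below follows the standard Apache download layout; if your copy of the checksum file is in a format `sha512sum -c` doesn't accept, compare the hashes manually):

```shell
# Fetch the published checksum and compare it against the downloaded archive.
wget https://downloads.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz.sha512
sha512sum -c hadoop-3.4.1.tar.gz.sha512
```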

Configure Environment Variables

Edit your ~/.bashrc (no sudo needed, since it is your own file):

BASH
nano ~/.bashrc

Add these lines at the end of the file:

BASH
export HADOOP_HOME=/home/hdoop/hadoop-3.4.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"


Then reload the file:

BASH
source ~/.bashrc
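To confirm the variables took effect, check that the hadoop binary now resolves from your PATH (this requires the extracted archive to be at the HADOOP_HOME path set above):

```shell
# Should print the Hadoop release (3.4.1) and build information.
hadoop version
```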


Configure Hadoop Core Files

Edit hadoop-env.sh
BASH
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Add this line, adjusting the path if your installed JDK differs (you can find it with readlink -f $(which javac)):

BASH
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64


Edit core-site.xml
BASH
nano $HADOOP_HOME/etc/hadoop/core-site.xml
XML
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hdoop/tmpdata</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Edit hdfs-site.xml
BASH
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
XML
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hdoop/dfsdata/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hdoop/dfsdata/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
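Formatting the NameNode will create these directories if they don't exist, but creating them up front as the hdoop user avoids permission surprises. Note that ~ below expands to /home/hdoop when you are logged in as that user, matching the paths in the XML above:

```shell
# Create the temp and HDFS storage directories referenced in the config files.
mkdir -p ~/tmpdata ~/dfsdata/namenode ~/dfsdata/datanode
```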

Edit mapred-site.xml
BASH
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
XML
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
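On Hadoop 3.x, MapReduce jobs submitted to YARN commonly fail with a "Could not find or load main class" error unless the job classpath is set. If you hit that, a common fix (shown in the official single-node setup guide) is adding this property inside the same configuration block; HADOOP_MAPRED_HOME is the variable exported in .bashrc earlier:

```xml
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
```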

Edit yarn-site.xml
BASH
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
XML
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>127.0.0.1</value>
  </property>
</configuration>

Start Hadoop Services

Format the NameNode

BASH
hdfs namenode -format

Start HDFS

BASH
start-dfs.sh

Start YARN

BASH
start-yarn.sh

Check Running Processes

BASH
jps

You should see NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager among the listed processes.
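Beyond jps, Hadoop 3.x also exposes web UIs you can use to confirm the services are healthy (9870 and 8088 are the 3.x defaults; the NameNode UI moved from port 50070 in Hadoop 2.x). With the cluster running, each should return HTTP 200:

```shell
# NameNode web UI: HDFS health, live DataNodes, filesystem browser
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9870
# ResourceManager web UI: cluster applications and nodes
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088
```

You can also open these URLs directly in a browser.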


By following the steps in this guide, you've successfully installed and configured Hadoop and HDFS in a standalone environment. This local setup is perfect for learning Hadoop's core components like MapReduce, HDFS, and YARN and for running small test jobs.
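As a quick smoke test of the finished setup, you can round-trip a file through HDFS (the paths below are arbitrary examples):

```shell
# Create a home directory for the hdoop user in HDFS,
# then copy a local file in and read it back.
hdfs dfs -mkdir -p /user/hdoop
echo "hello hdfs" > hello.txt
hdfs dfs -put hello.txt /user/hdoop/
hdfs dfs -ls /user/hdoop
hdfs dfs -cat /user/hdoop/hello.txt
```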

Whether you're preparing for a data engineering role, experimenting with distributed computing, or just curious about big data infrastructure, this Kali Linux-based Hadoop setup offers a controlled, educational sandbox.

From here, you can:

  • Write MapReduce programs in Java or Python
  • Explore data ingestion tools like Apache Sqoop or Flume
  • Connect Hadoop to Hive or Spark for advanced analytics

This practical foundation prepares you to take on larger-scale deployments in the cloud or on actual clusters with confidence.