Advent of Hadoop
To understand Hadoop better, let's start with an example. Say we have a site for selling mobile phones called MobileZone, and our technical system looks something like the diagram below. This data, which comes from different sources and in different formats, is called Big Data. We need something new to store and process this Big Data, and that is where Hadoop comes into the picture.
Hadoop
In a crude way, think of Hadoop as a very big data warehouse that takes data from any source and in any format. It hosts a master node and many worker nodes, and it gives us two services: storage and processing. After Hadoop processes the data, the processed data can be loaded into analytics and reporting tools, so that we can predict or make decisions about future sales.
Hadoop is a framework for the distributed processing of large data sets across clusters of computers, using simple programming models.
Architecture
Hadoop Processing:
Let us say we have a site and we need to create a dashboard that shows how many people liked or viewed the site. Our first task is to set up the cluster, so the Hadoop admin sets one up with one master node, also called the NameNode, and four DataNodes. We will see more about the NameNode and DataNodes further on. Once the cluster is set up, the data is ingested into Hadoop.
For example: we have the file facebook.json (640 MB), which, when ingested into Hadoop, is broken into 128 MB blocks. Each 128 MB block is then replicated 3 times for fault tolerance; this is what makes Hadoop so reliable. So in total we have 5 blocks × 3 replicas = 15 blocks.
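As a rough sketch of how this looks from client code, the HDFS Java API can ask the NameNode where the blocks of a file landed once it has been ingested (for example with hadoop fs -put facebook.json /data/). The path and class name here are assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; assume facebook.json was already ingested into HDFS.
        FileStatus status = fs.getFileStatus(new Path("/data/facebook.json"));

        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        System.out.println("Number of blocks: " + blocks.length); // 640 MB / 128 MB = 5
        for (BlockLocation block : blocks) {
            // Each block is replicated, so several hosts are listed per block.
            System.out.println(block.getOffset() + " -> "
                    + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}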
To process facebook.json, the metadata (the data about the data loaded into the DataNodes) is stored in the NameNode, while the actual data blocks are stored in the respective DataNodes.
The NameNode hosts a service called the JobTracker, and the DataNodes each run a service called the TaskTracker. Once the data is loaded, the JobTracker reads the metadata stored in the NameNode and assigns tasks to the respective TaskTrackers, which perform their jobs locally on the data they hold. Once the data is processed, the results can be loaded into analytics and reporting tools.
NameNode :
The NameNode is the master node in an HDFS file system. It maintains the file system namespace and keeps track of the metadata for the blocks stored across the DataNodes. It stores only the metadata of the data, not the data itself. Client applications talk to the NameNode whenever they need to locate a file, or when they want to add/copy/move/delete a file, and in response the NameNode returns a list of the relevant DataNode servers where the data lives.
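As a minimal sketch of that client-side conversation, using the HDFS Java API with hypothetical paths, each of the calls below first goes to the NameNode to resolve or update metadata:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Locate and read a file: the NameNode returns the DataNodes holding
        // each block, and the client then reads the bytes from those DataNodes.
        try (FSDataInputStream in = fs.open(new Path("/data/facebook.json"))) {
            byte[] buffer = new byte[4096];
            int read = in.read(buffer); // bytes come from a nearby replica
            System.out.println("Read " + read + " bytes");
        }

        // Namespace operations (move/delete) are metadata changes on the NameNode.
        fs.rename(new Path("/data/facebook.json"), new Path("/archive/facebook.json"));
        fs.delete(new Path("/tmp/old-dump.json"), false); // false = non-recursive

        fs.close();
    }
}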
Secondary NameNode :
The NameNode is a single point of failure for the HDFS cluster, since the metadata is stored only on the NameNode; this means HDFS is not a high-availability system by default. When the NameNode goes down, the file system goes down. We can host an optional Secondary NameNode on a separate machine. It only creates checkpoints of the namespace by merging the edits log into the fsimage file, and hence does not provide any real redundancy. Hadoop 0.21+ adds a BackupNameNode that the user can configure to improve availability.
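How often the Secondary NameNode creates a checkpoint is configurable. A sketch, assuming a Hadoop 1.x-era setup (the property was fs.checkpoint.period there; later releases renamed it dfs.namenode.checkpoint.period), with the default of 3600 seconds:

<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value> <!-- merge the edits log into fsimage every hour -->
</property>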
DataNodes :
An HDFS cluster can have many DataNodes. DataNodes store the blocks of data, and blocks from different files can be stored on the same DataNode. Each DataNode periodically marks its presence by sending a heartbeat message, essentially saying "I am alive." This helps the NameNode keep track of the DataNodes and maintain the metadata accordingly.
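The heartbeat frequency is configurable; as a sketch for hdfs-site.xml (dfs.heartbeat.interval, which defaults to 3 seconds):

<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value> <!-- seconds between DataNode heartbeats to the NameNode -->
</property>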
JobTracker Service :
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster according to the client's job, ideally the nodes that hold the data, or at least nodes in the same rack.
TaskTracker :
A TaskTracker is a service that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker. Every TaskTracker is configured with a set of slots, which indicate the number of tasks it can accept. To schedule a task, the JobTracker first looks for an empty slot on the server that hosts the DataNode containing the data; if none is available, it looks for an empty slot on a machine in the same rack. The TaskTracker starts a separate JVM process to do the actual work, so that a process failure does not take down the TaskTracker itself. The TaskTracker monitors these processes, capturing the output and exit codes. Once a process completes, successfully or not, the TaskTracker notifies the JobTracker accordingly. The TaskTrackers also send heartbeat messages to the JobTracker to confirm that they are still alive and active, so that the JobTracker can update its metadata about the empty slots.
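In classic (MRv1) Hadoop, these slot counts are configured per TaskTracker in mapred-site.xml; a sketch with illustrative values:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value> <!-- concurrent map tasks this TaskTracker can run -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value> <!-- concurrent reduce tasks -->
</property>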
MapReduce Execution Process :
- The job or task is submitted to the JobTracker (a minimal job sketch follows this list).
- The JobTracker connects to the NameNode to find the location of the data.
- The JobTracker locates TaskTracker nodes with available slots at or near the data.
- The JobTracker assigns the tasks to the identified available TaskTracker nodes.
- The TaskTracker nodes are monitored through their periodic heartbeat signals; if a TaskTracker appears to have failed, its tasks are assigned to a different TaskTracker.
- The JobTracker is notified when a task fails. It may then resubmit the job elsewhere, skip the specific record that caused the failure, or blacklist the TaskTracker as unreliable.
- The status is updated once the TaskTracker completes the task.
- Client applications can query the JobTracker for information on the jobs processed.
- The JobTracker is a single point of failure for the MapReduce service; all running jobs are halted if it goes down.
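To make the flow concrete, here is a minimal word-count job written against the classic org.apache.hadoop.mapred API that this JobTracker/TaskTracker model describes. JobClient.runJob is the call that submits the job to the JobTracker; the input and output paths are placeholders:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split; this runs
    // on a TaskTracker local to the data block wherever possible.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                out.collect(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word after the shuffle phase.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            out.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        // Placeholder HDFS paths for illustration.
        FileInputFormat.setInputPaths(conf, new Path("/data/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/data/output"));

        // Submits the job to the JobTracker and blocks until it completes.
        JobClient.runJob(conf);
    }
}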