What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop cluster?
JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster, and it runs in its own JVM process. In a typical production cluster it runs on a separate machine. Each slave node is configured with the location of the JobTracker node. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. The JobTracker performs the following actions (from the Hadoop wiki):
1. Client applications submit jobs to the JobTracker.
2. The JobTracker talks to the NameNode to determine the location of the data.
3. The JobTracker locates TaskTracker nodes with available slots at or near the data.
4. The JobTracker submits the work to the chosen TaskTracker nodes.
5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
6. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
7. When the work is completed, the JobTracker updates its status.
8. Client applications can poll the JobTracker for information.
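To make step 1 concrete, here is a minimal driver sketch using the classic org.apache.hadoop.mapred API; it assumes the library helper classes TokenCountMapper and LongSumReducer, and the class name WordCountDriver and the argument handling are illustrative, not taken from the original text. JobClient.runJob hands the configured job to the JobTracker and polls it until the job completes.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    // Key/value types emitted by the mapper and reducer.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    // Library helpers: TokenCountMapper emits (token, 1), LongSumReducer sums the counts.
    conf.setMapperClass(TokenCountMapper.class);
    conf.setReducerClass(LongSumReducer.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submits the job to the JobTracker configured via mapred.job.tracker
    // and blocks until the job finishes.
    JobClient.runJob(conf);
  }
}
```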
How does the JobTracker schedule a task?
The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.
What is a TaskTracker in Hadoop? How many instances of TaskTracker run on a Hadoop cluster?
A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. Only one TaskTracker process runs on any Hadoop slave node, and it runs in its own JVM process. Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. The TaskTracker starts separate JVM processes to do the actual work (called task instances); this ensures that a process failure does not take down the TaskTracker itself. The TaskTracker monitors these task instances, capturing the output and exit codes. When the task instances finish, successfully or not, the TaskTracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
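As a hedged illustration of how those slots are typically sized, the sketch below sets the classic Hadoop 1.x slot properties programmatically; in practice they live in mapred-site.xml on each slave node, and the values shown (4 map slots, 2 reduce slots) are only example numbers.

```java
import org.apache.hadoop.conf.Configuration;

public class SlotConfig {
  public static void main(String[] args) {
    // Normally these properties are set in mapred-site.xml on every slave node;
    // they are set programmatically here purely for illustration.
    Configuration conf = new Configuration();
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);    // map slots per TaskTracker
    conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2); // reduce slots per TaskTracker

    System.out.println("Map slots: "
        + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
  }
}
```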
What is a task instance in Hadoop? Where does it run?
Task instances are the actual MapReduce tasks that run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a task instance); this ensures that a process failure does not take down the TaskTracker. Each task instance runs in its own JVM process. There can be multiple task instance processes running on a slave node, based on the number of slots configured on the TaskTracker. By default, a new task instance JVM process is spawned for each task.
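That per-task JVM behaviour can be relaxed through JVM reuse. A minimal sketch, assuming the Hadoop 1.x JobConf API (the class name JvmReuseExample is illustrative):

```java
import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(JvmReuseExample.class);

    // Default is 1: a fresh task-instance JVM is spawned for every task.
    // -1 allows a task JVM to be reused for any number of tasks of the same job
    // (the same setting as the mapred.job.reuse.jvm.num.tasks property).
    conf.setNumTasksToExecutePerJvm(-1);
  }
}
```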
How many daemon processes run on a Hadoop system?
Hadoop is comprised of five separate daemons, and each of these daemons runs in its own JVM. The following three daemons run on the master nodes:
NameNode - stores and maintains the metadata for HDFS.
Secondary NameNode - performs housekeeping functions for the NameNode.
JobTracker - manages MapReduce jobs and distributes individual tasks to the machines running a TaskTracker.
The following two daemons run on each slave node:
DataNode - stores the actual HDFS data blocks.
TaskTracker - responsible for instantiating and monitoring individual map and reduce tasks.
What is the configuration of a typical slave node on a Hadoop cluster? How many JVMs run on a slave node?
A single instance of a TaskTracker runs on each slave node, as a separate JVM process. A single instance of the DataNode daemon runs on each slave node, also as a separate JVM process. One or multiple task instances run on each slave node, and each task instance runs as a separate JVM process. The number of task instances can be controlled by configuration. Typically, a high-end machine is configured to run more task instances.
What is the difference between HDFS and NAS?
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. The following are differences between HDFS and NAS: In HDFS, data blocks are distributed across the local drives of all machines in a cluster, whereas in NAS data is stored on dedicated hardware. HDFS is designed to work with the MapReduce system, since computation is moved to the data; NAS is not suitable for MapReduce since data is stored separately from the computation. HDFS runs on a cluster of machines and provides redundancy using a replication protocol, whereas NAS is provided by a single machine and therefore does not provide data redundancy.
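To make the redundancy point concrete, the sketch below sets the block replication factor, first through the dfs.replication property and then per file through the FileSystem API; the path /user/demo/input.txt is a hypothetical example, not something from the original text.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Number of DataNodes that hold a copy of each block (HDFS default is 3).
    conf.setInt("dfs.replication", 3);

    FileSystem fs = FileSystem.get(conf);
    // The replication factor can also be changed per file after it is written.
    // "/user/demo/input.txt" is a hypothetical path used only for illustration.
    fs.setReplication(new Path("/user/demo/input.txt"), (short) 3);
  }
}
```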
How does the NameNode handle DataNode failures?
The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, the DataNode is marked as dead. Since its blocks will then be under-replicated, the system begins replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates the replication of data blocks from one DataNode to another. The replication data transfer happens directly between DataNodes, and the data never passes through the NameNode.
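The heartbeat frequency itself is configurable. A minimal sketch, assuming the Hadoop 1.x dfs.heartbeat.interval property (normally set in hdfs-site.xml rather than in code, and shown here only for illustration):

```java
import org.apache.hadoop.conf.Configuration;

public class HeartbeatConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Interval in seconds between heartbeats sent by each DataNode to the NameNode;
    // 3 seconds is the usual default.
    conf.setInt("dfs.heartbeat.interval", 3);
    System.out.println(conf.getInt("dfs.heartbeat.interval", 3));
  }
}
```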
Does the MapReduce programming model provide a way for reducers to communicate with each other? In a MapReduce job, can a reducer communicate with another reducer?
No, the MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation.
Can I set the number of reducers to zero?
Yes, setting the number of reducers to zero is a valid configuration in Hadoop. When you set the number of reducers to zero, no reducers will be executed and the output of each mapper will be stored in a separate file on HDFS. [This is different from the case when the number of reducers is greater than zero, where the mappers' output (intermediate data) is written to the local file system (NOT HDFS) of each mapper's slave node.]
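A minimal map-only job sketch, again assuming the classic org.apache.hadoop.mapred API and the library TokenCountMapper helper (the class name MapOnlyJob and the argument handling are illustrative): with zero reducers, each mapper's output is written straight to the HDFS output directory.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlyJob.class);
    conf.setJobName("map-only");

    conf.setMapperClass(TokenCountMapper.class);
    conf.setNumReduceTasks(0);                 // no reducers: map output goes directly to HDFS
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
```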
Where is the mapper output (intermediate key-value data) stored?
The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
Finance economy finance & insurance money derivatives wall street young money got money cash money get money
Finance economy finance & insurance money derivatives wall street young money got money cash money get money
What are combiners? When should I use a combiner in my MapReduce job?
Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on the individual mapper nodes. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. However, the execution of the combiner is not guaranteed: Hadoop may or may not execute a combiner, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
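For example, summing token counts is both commutative and associative, so the same reducer class can be registered as the combiner. A hedged sketch against the classic API, assuming the library TokenCountMapper and LongSumReducer helpers (the class name CombinerExample is illustrative):

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class CombinerExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(CombinerExample.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    conf.setMapperClass(TokenCountMapper.class);
    // Summing is commutative and associative, so the reducer can double as the combiner.
    conf.setCombinerClass(LongSumReducer.class);
    conf.setReducerClass(LongSumReducer.class);
  }
}
```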
is Writable & WritableComparable interface? org.apache.hadoop.io.Writable
is a Java interface. Any key or value type in the Hadoop Map-Reduce framework
implements this interface. Implementations typically implement a static
read(DataInput) method which constructs a new instance, calls
readFields(DataInput) and returns the instance.
org.apache.hadoop.io.WritableComparable is a Java interface. Any type which is
to be used as a key in the Hadoop Map-Reduce framework should implement this
interface. WritableComparable objects can be compared to each other using Comparators.
What is the Hadoop MapReduce API contract for a key and a value class?
The key must implement the org.apache.hadoop.io.WritableComparable interface. The value must implement the org.apache.hadoop.io.Writable interface.
What is an IdentityMapper and an IdentityReducer in MapReduce?
org.apache.hadoop.mapred.lib.IdentityMapper
implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default value. org.apache.hadoop.mapred.lib.IdentityReducer performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default value.
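Setting them explicitly has the same effect as leaving them unset; a minimal sketch (the class name IdentityJob is illustrative):

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class IdentityJob {
  public static void main(String[] args) {
    JobConf conf = new JobConf(IdentityJob.class);
    // These two classes are what the framework falls back to when
    // setMapperClass / setReducerClass are never called.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
  }
}
```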
What is the meaning of speculative execution in Hadoop? Why is it important?
Speculative execution is a way of coping with uneven individual machine performance. In large clusters where hundreds or thousands of machines are involved, there may be machines which are not performing as fast as the others, and a single slow machine can delay the whole job. To avoid this, speculative execution in Hadoop can run multiple copies of the same map or reduce task on different slave nodes. The results from the first node to finish are used.
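Speculative execution is enabled by default and can be toggled per job. A minimal sketch using the Hadoop 1.x JobConf setters (the class name SpeculationConfig is illustrative):

```java
import org.apache.hadoop.mapred.JobConf;

public class SpeculationConfig {
  public static void main(String[] args) {
    JobConf conf = new JobConf(SpeculationConfig.class);
    // Jobs whose tasks have side effects (e.g. writing to an external system)
    // often disable speculative duplicates; speculation is on by default.
    conf.setMapSpeculativeExecution(false);
    conf.setReduceSpeculativeExecution(false);
  }
}
```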