Thursday, September 30, 2010

Hadoop: No namenode to stop, no tasktracker to stop error

The error is similar to the one reported here:
http://www.mail-archive.com/common-user@hadoop.apache.org/msg03019.html
or here:
http://www.mail-archive.com/common-user@hadoop.apache.org/msg01935.html

How it was solved:

First, clean up all the files generated by hadoop and stored in the /tmp folder. These files store the PIDs of the namenode, jobtracker, datanode, tasktracker, and so on. To be safe, back them up rather than deleting them outright:

-bash-4.0$ cd /tmp
-bash-4.0$ mkdir HadoopBackup
-bash-4.0$ mv hadoop* HadoopBackup/
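
The hadoop-daemon.sh scripts write these PID files to /tmp by default (assuming HADOOP_PID_DIR has not been changed), with names like hadoop-&lt;user&gt;-namenode.pid, so you can check what is about to be backed up first:

-bash-4.0$ ls /tmp/hadoop-*.pid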

Then, check all the running tasks related to java:

-bash-4.0$ ps -ef|grep java

It turns out that there are lots of tasks related to hadoop. Since every version of hadoop has supposedly been stopped, this means some tasks were not shut down correctly by hadoop. When hadoop starts again, these leftover tasks somehow prevent the corresponding new tasks from running; as a result, when hadoop is stopped, it cannot find the corresponding services to stop.
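
A quicker way to see only the Hadoop daemons rather than every Java process is the jps tool that ships with the JDK; each daemon is listed under its class name (NameNode, DataNode, JobTracker, TaskTracker, ...). If daemons that were supposedly stopped still show up here, they are exactly the leftover tasks described above:

-bash-4.0$ jps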

Next, kill all the out-of-control tasks (the extra grep -v grep keeps the grep process itself out of the PID list):

-bash-4.0$ ps -ef|grep java |grep -v grep |awk '{print $2}'|xargs kill -9

And recheck:

-bash-4.0$ ps -ef|grep java
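
Note that this one-liner kills every Java process on the machine, not just Hadoop's. If other Java services are running on the same box, a more targeted variant (a sketch, assuming all the stray processes have "hadoop" somewhere on their command line) is:

-bash-4.0$ ps -ef|grep hadoop |grep -v grep |awk '{print $2}'|xargs kill -9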

At this point the problem should be solved.

Do some preparation for restarting if necessary. Note that formatting the namenode wipes the existing HDFS metadata, so only do it if you can afford to lose whatever is stored in HDFS:

-bash-4.0$ bin/hadoop namenode -format

-bash-4.0$ cd conf
-bash-4.0$ vim hadoop-env.sh
-bash-4.0$ vim hdfs-site.xml
-bash-4.0$ vim mapred-site.xml
-bash-4.0$ vim core-site.xml

-bash-4.0$ cd ..
-bash-4.0$ vim conf/masters
-bash-4.0$ vim conf/slaves
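
For reference, the settings that decide which machines act as the namenode and the jobtracker live in core-site.xml and mapred-site.xml. A minimal sketch, assuming a single-node setup with placeholder host and ports (not the actual values from my cluster), and with conf/masters and conf/slaves simply listing the relevant hostnames one per line:

-bash-4.0$ cat conf/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
-bash-4.0$ cat conf/mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>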

Restart!

-bash-4.0$ bin/start-all.sh

-bash-4.0$ cd logs/
-bash-4.0$ tail -f *.log

-bash-4.0$ cd ..
-bash-4.0$ bin/hadoop dfsadmin -report
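
A quick jps is also a good sanity check at this point. In a pseudo-distributed setup you would expect something like the following (the PIDs are only examples); on a real cluster the daemons are spread over the master and the slaves:

-bash-4.0$ jps
12001 NameNode
12002 DataNode
12003 SecondaryNameNode
12004 JobTracker
12005 TaskTracker
12006 Jps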

Everything is fine up to this point.

The likely cause of this problem in my case is that there are two (actually three, by the time I was debugging) versions of hadoop installed on the server. These versions use different machines as the namenode and jobtracker, so there can be some confusion when doing a wrong operation, such as launching with one version first but stopping with another version.

3 comments:

    1. Hello Felix,
      I too had the same problem. My namenode was not starting up. I followed your post and finally it started popping up in the jps results.
      I am running in pseudo-distributed mode with only a single version of hadoop's source code, so I do not think the issue is due to multiple versions of hadoop.

      I also see that many of them have the same issue. Wonder why? :)
