Reference URLs: http://blog.chinaunix.net/uid-20682147-id-5520795.html
http://www.powerxing.com/install-hadoop-in-centos/ (covers the MapReduce test in more detail)
Machine list
10.45.53.143 hadoop1.i.zhihuishu.com
10.47.109.190 hadoop2.i.zhihuishu.com
10.24.31.188 hadoop3.i.zhihuishu.com
10.168.227.144 hadoop4.i.zhihuishu.com
10.168.248.3 hadoop5.i.zhihuishu.com
NameNode
hadoop1.i.zhihuishu.com hadoop2.i.zhihuishu.com
JournalNode
hadoop1.i.zhihuishu.com hadoop2.i.zhihuishu.com hadoop3.i.zhihuishu.com
DataNode
hadoop3.i.zhihuishu.com hadoop4.i.zhihuishu.com hadoop5.i.zhihuishu.com
ZooKeeper
hadoop1.i.zhihuishu.com hadoop2.i.zhihuishu.com hadoop3.i.zhihuishu.com
Directory layout:
Hadoop
/data/hadoop/hadoop
zookeeper
/data/hadoop/zookeeper
JDK
JAVA_HOME=/usr/local/jdk1.8.0_45
Preparation:
10.45.53.143 hadoop1.i.zhihuishu.com rizhi143
10.47.109.190 hadoop2.i.zhihuishu.com rizhi190
10.24.31.188 hadoop3.i.zhihuishu.com rizhi188
10.25.2.16 hadoop4.i.zhihuishu.com rizhi16
10.168.248.3 hadoop5.i.zhihuishu.com rizhi3
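Add these host mappings to /etc/hosts on every machine. Remember to map the short hostname as well, because many components communicate by hostname.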
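Before moving on, it can help to confirm that every name really has a mapping. The sketch below is a hypothetical helper (missing_hosts is not part of Hadoop) that prints each name with no entry in a hosts-format file. It only compares whole fields and ignores full-line comments.

```shell
# missing_hosts FILE NAME...
# Prints every NAME that has no entry in the hosts-format FILE.
# Hypothetical helper for illustration; not part of Hadoop.
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Skip comment lines; compare NAME with fields 2..NF (field 1 is the IP).
        awk -v n="$name" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit(found ? 0 : 1) }
        ' "$hosts_file" || echo "$name"
    done
}
```

Typical use: missing_hosts /etc/hosts hadoop1.i.zhihuishu.com rizhi143. No output means every name is mapped.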
SSH authentication
cd ~/.ssh/
ssh-keygen -t rsa
cat id_rsa.pub >> authorized_keys
chmod 600 ./authorized_keys
Remember that the two NameNodes (name1 and name2) need passwordless SSH to every node, while the DataNodes only need SSH trust with each other.
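The commands above only authorize a machine to log in to itself. One way to push the public key to the other nodes is ssh-copy-id. The sketch below is a dry run: it only prints the commands so they can be reviewed first. The function name and the hadoop user are assumptions for illustration.

```shell
# print_ssh_copy_cmds USER HOST...
# Prints one ssh-copy-id command per host instead of running it (dry run).
# Hypothetical helper; USER is whatever account runs Hadoop on the target hosts.
print_ssh_copy_cmds() {
    user=$1
    shift
    for host in "$@"; do
        echo "ssh-copy-id -i ~/.ssh/id_rsa.pub $user@$host"
    done
}
```

For example, print_ssh_copy_cmds hadoop hadoop3.i.zhihuishu.com hadoop4.i.zhihuishu.com prints two commands that can then be run by hand on a NameNode.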
File synchronization
mkdir -p /data/hadoop
cd /data/hadoop/
tar -zxvf hadoop.tar.gz
Add to /etc/profile
export HADOOP_HOME=/data/hadoop/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Add the slave nodes
vi etc/hadoop/slaves
hadoop3.i.zhihuishu.com
hadoop4.i.zhihuishu.com
hadoop5.i.zhihuishu.com
Initialize the configuration files
cd /data/hadoop/hadoop
cp share/doc/hadoop/hadoop-project-dist/hadoop-common/core-default.xml etc/hadoop/core-site.xml
cp share/doc/hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml etc/hadoop/hdfs-site.xml
cp share/doc/hadoop/hadoop-yarn/hadoop-yarn-common/yarn-default.xml etc/hadoop/yarn-site.xml
cp share/doc/hadoop/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml etc/hadoop/mapred-site.xml
cd /data/hadoop/hadoop/etc/hadoop
wget -O hdfs-site.xml http://222.66.154.110/install/hadoop/hdfs-site.xml
wget -O core-site.xml http://222.66.154.110/install/hadoop/core-site.xml
wget -O yarn-site.xml http://222.66.154.110/install/hadoop/yarn-site.xml
wget -O mapred-site.xml http://222.66.154.110/install/hadoop/mapred-site.xml
mkdir -p /data/hadoop/hadoop/tmp
mkdir -p /data/hadoop/hadoop/tmp/dfs/name
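The configuration files are attached; use the attachment as a reference.
The QJM configuration follows the official documentation:
For the full list of hdfs-site.xml settings, see:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
If yarn.nodemanager.hostname is set to a specific IP, such as 10.12.154.79, every NodeManager ends up with a different configuration. For the full list of settings, see:
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
For YARN HA configuration, see:
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html
Problems encountered:
1: NameNode 2 was started incorrectly for HA and kept reporting that nn2 could not be registered.
2: The proxy was not configured, so zhscluster could not be accessed.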
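The notes do not name the missing property. If the "proxy" in problem 2 was the client failover proxy provider (an assumption), the hdfs-site.xml entry would look like the output of this sketch. The function name is hypothetical; the property name and class come from the HDFS HA documentation.

```shell
# print_failover_proxy_property NAMESERVICE
# Prints the hdfs-site.xml <property> that tells HDFS clients how to find
# the active NameNode of NAMESERVICE. Hypothetical helper for illustration.
print_failover_proxy_property() {
    cat <<EOF
<property>
  <name>dfs.client.failover.proxy.provider.$1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
EOF
}
```

Running print_failover_proxy_property zhscluster prints the block to paste inside the configuration element of hdfs-site.xml.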
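9. Startup order
ZooKeeper -> JournalNode -> format NameNode -> initialize JournalNode
-> create the namespace (zkfc) -> NameNode -> DataNode -> ResourceManager -> NodeManager.
Note that the NameNode must be formatted before it is started for the first time, and that the standby NameNode has its own startup procedure.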
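As a quick reference, the order can be condensed into a printed checklist. This is a sketch only: the function name is hypothetical, the commands are the ones used in sections 10 and 11, and the initializeSharedEdits step is left out because it applies only when converting a non-HA cluster.

```shell
# print_startup_plan
# Prints where each command runs, in startup order. Nothing is executed.
print_startup_plan() {
    cat <<'EOF'
all ZooKeeper nodes:   zkServer.sh start
all JournalNodes:      hadoop-daemon.sh start journalnode
primary NameNode:      hdfs namenode -format            # first start only
primary NameNode:      hdfs zkfc -formatZK              # first start only
primary NameNode:      hadoop-daemon.sh start namenode
standby NameNode:      hdfs namenode -bootstrapStandby  # first start only
standby NameNode:      hadoop-daemon.sh start namenode
both NameNodes:        hadoop-daemon.sh start zkfc
all DataNodes:         hadoop-daemon.sh start datanode
ResourceManager nodes: yarn-daemon.sh start resourcemanager
NodeManager nodes:     yarn-daemon.sh start nodemanager
EOF
}
```

Printing the plan rather than executing it keeps the sketch safe to run anywhere; each line still has to be run on the right host.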
10. Starting HDFS
Before starting HDFS, the NameNode must be formatted.
10.1. Create the directories
mkdir -p /data/hadoop/hadoop-2.7.2/tmp/dfs/name
10.2. Start ZooKeeper
./zkServer.sh start
Note: start ZooKeeper before starting anything else.
10.3. Create the namespace
Run on one of the NameNodes:
./hdfs zkfc -formatZK
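Note: this step must be run after the namespace has been initialized.
10.4. Start all JournalNodes
The NameNode records its metadata edit log on the JournalNodes, and the active and standby NameNodes synchronize metadata through the log stored on the JournalNodes.
Run on every JournalNode (note that the command takes two arguments, and that this step comes before "hdfs namenode -format"):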
./hadoop-daemon.sh start journalnode
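Note: the JournalNodes must be running before "hdfs namenode -format" is executed, and the format in turn must happen before the NameNode is started.
10.5. Initialize the JournalNodes
This step is only needed when converting a non-HA cluster to HA. Run it on the existing NameNode, since the command initializes the JournalNodes from the local NameNode edits directories: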
./hdfs namenode -initializeSharedEdits
This command is interactive by default; add the -force argument to make it non-interactive.
Create the following directory on every JournalNode:
mkdir -p /data/hadoop/hadoop/journal/mycluster/current
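10.6. Format the NameNode
Note: this step is only needed for a new cluster, and it only has to be run on the primary NameNode.
1) Go to the $HADOOP_HOME/bin directory
2) Format: ./hdfs namenode -format
If the output contains "INFO util.ExitUtil: Exiting with status 0", the format succeeded.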
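The success check can be scripted by saving the format output and searching it for that message. A minimal sketch, assuming the output was captured to a file; the function name and the log path in the usage line are made up for illustration.

```shell
# format_succeeded LOGFILE
# Succeeds if the captured output of "hdfs namenode -format" contains the
# "Exiting with status 0" message. Hypothetical helper for illustration.
format_succeeded() {
    grep -q 'Exiting with status 0' "$1"
}
```

For example: ./hdfs namenode -format > /tmp/nn-format.log 2>&1; format_succeeded /tmp/nn-format.log && echo "format OK".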
If /etc/hosts has no mapping between the hostname and its IP (for example "172.25.40.171 VM-40-171-sles10-64"), formatting fails with the following error:
14/04/17 03:44:09 WARN net.DNS: Unable to determine local hostname -falling back to "localhost"
java.net.UnknownHostException: VM-40-171-sles10-64: VM-40-171-sles10-64: unknown error
at java.net.InetAddress.getLocalHost(InetAddress.java:1484)
at org.apache.hadoop.net.DNS.resolveLocalHostname(DNS.java:264)
at org.apache.hadoop.net.DNS.<clinit>(DNS.java:57)
at org.apache.hadoop.hdfs.server.namenode.NNStorage.newBlockPoolID(NNStorage.java:945)
at org.apache.hadoop.hdfs.server.namenode.NNStorage.newNamespaceInfo(NNStorage.java:573)
at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:144)
at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:845)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1256)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1370)
Caused by: java.net.UnknownHostException: VM-40-171-sles10-64: unknown error
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:907)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1302)
at java.net.InetAddress.getLocalHost(InetAddress.java:1479)
… 8 more
10.7. Start the primary NameNode
1) Go to the $HADOOP_HOME/sbin directory
2) Start the primary NameNode:
./hadoop-daemon.sh start namenode
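If the following error appears at startup, the NameNode cannot log in to itself over SSH without a password. If passwordless login to itself already works by IP, the usual cause is that it has never logged in to itself by hostname. The fix is to SSH once using the hostname, for example ssh hadoop@VM_40_171_sles10_64, and then start again.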
Starting namenodes on [VM_40_171_sles10_64]
VM_40_171_sles10_64: Host key not found from database.
VM_40_171_sles10_64: Key fingerprint:
VM_40_171_sles10_64: xofiz-zilip-tokar-rupyb-tufer-tahyc-sibah-kyvuf-palik-hazyt-duxux
VM_40_171_sles10_64: You can get a public key's fingerprint by running
VM_40_171_sles10_64: % ssh-keygen -F publickey.pub
VM_40_171_sles10_64: on the keyfile.
VM_40_171_sles10_64: warning: tcgetattr failed in ssh_rl_set_tty_modes_for_fd: fd 1: Invalid argument
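10.8. Start the standby NameNode
1) ./hdfs namenode -bootstrapStandby
2) ./hadoop-daemon.sh start namenode
If step 1 is skipped and the NameNode is started directly, the following error appears:
No valid image files found
Or the following error shows up in that NameNode's log: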
2016-04-08 14:08:39,745 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage
java.io.IOException: NameNode is not formatted.
10.9. Start the failover controller
Start the failover controller process on every NameNode:
./hadoop-daemon.sh start zkfc
HDFS can only fail over automatically once the DFSZKFailoverController process is running.
Note: zkfc stands for ZooKeeper failover controller.
10.10. Start all DataNodes
Run on each DataNode:
./hadoop-daemon.sh start datanode
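If a DataNode process fails to come up, try deleting the DataNode logs under the logs directory and then restarting it.
10.11. Check that startup succeeded
1) Use the jps command shipped with the JDK to check that the expected processes are running
2) Check the log and out files under $HADOOP_HOME/logs for exceptions.
After startup both nn1 and nn2 are in standby state; switch nn1 to active: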
./hdfs haadmin -transitionToActive nn1
10.11.1. DataNode
Run jps (note: jps ships with the JDK, not the JRE) to see the DataNode process:
$ jps
18669 DataNode
24542 Jps
10.11.2. NameNode
Run jps to see the NameNode process:
$ jps
18669 NameNode
24542 Jps
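The jps check is easy to automate. The sketch below is a hypothetical helper that reads jps output on standard input and succeeds only if a process with the given name is listed. It assumes the two-column "PID NAME" format shown above.

```shell
# has_java_process NAME
# Reads "jps" output on stdin; succeeds if a process called NAME is listed.
# Hypothetical helper for illustration.
has_java_process() {
    awk -v n="$1" '$2 == n { found = 1 } END { exit(found ? 0 : 1) }'
}
```

For example: jps | has_java_process DataNode || echo "DataNode is not running".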
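10.12. Running HDFS commands
Run a few HDFS commands to verify further that the installation and configuration are correct. For usage, run hdfs or hdfs dfs with no arguments to print the help text.
10.12.1. Check that the DataNodes started correctly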
hdfs dfsadmin -report
Note: if fs.default.name in core-site.xml is set to file:///, the command reports:
report: FileSystem file:/// is not an HDFS file system
Usage: hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning]
To fix this, set fs.default.name to the same value as fs.defaultFS (fs.default.name is the deprecated name of fs.defaultFS).
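A small guard can catch this misconfiguration before dfsadmin is run. The sketch below only tests whether a value starts with hdfs://. The function name is hypothetical; the usage line relies on "hdfs getconf -confKey", which prints the effective value of a configuration key.

```shell
# is_hdfs_uri VALUE
# Succeeds if VALUE starts with "hdfs://". Hypothetical helper for illustration.
is_hdfs_uri() {
    case $1 in
        hdfs://*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

For example: is_hdfs_uri "$(hdfs getconf -confKey fs.defaultFS)" || echo "fs.defaultFS does not point at HDFS".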
10.12.2. Check the active/standby state of the NameNodes
For example, to see whether NameNode1 and NameNode2 are active or standby:
$ hdfs haadmin -getServiceState nm1
standby
$ hdfs haadmin -getServiceState nm2
active
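To find the active NameNode without reading each state by eye, the states can be collected as "ID STATE" pairs and filtered. A minimal sketch; the function name is hypothetical and the IDs nm1 and nm2 are the ones from the example above.

```shell
# active_namenodes
# Reads "ID STATE" pairs on stdin and prints the IDs whose state is "active".
# Hypothetical helper for illustration.
#
# Possible use on a live cluster:
#   for id in nm1 nm2; do
#       echo "$id $(hdfs haadmin -getServiceState "$id")"
#   done | active_namenodes
active_namenodes() {
    awk '$2 == "active" { print $1 }'
}
```

A healthy HA pair should yield exactly one ID.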
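10.12.3. hdfs dfs -ls
"hdfs dfs -ls" takes one argument. If the argument starts with "hdfs://URI" it accesses HDFS; otherwise the path is resolved against the default file system (fs.defaultFS), which behaves like ls when that is the local file system. Here URI is the IP or hostname of the NameNode and may include a port number, that is, the value of "dfs.namenode.rpc-address" in hdfs-site.xml.
"hdfs dfs -ls" assumes the default port 8020. If the NameNode is configured with another port, such as 9000, the port must be given explicitly; otherwise it can be omitted, much like a browser accessing a URL. Example: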
> hdfs dfs -ls hdfs://172.25.40.171:9001/
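The slash after 9001 is required; without it the argument is treated as a file. If port 9001 is not given, the default 8020 is used. "172.25.40.171:9001" is the value of "dfs.namenode.rpc-address" in hdfs-site.xml.
As this shows, "hdfs dfs -ls" can work against different HDFS clusters simply by being given different URIs.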
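The port and trailing-slash rules can be captured in a small helper that builds the URI. This is a sketch with a hypothetical function name; it leaves the port out when it equals the default 8020 and always ends the root path with a slash.

```shell
# hdfs_uri HOST [PORT] [PATH]
# Builds an hdfs:// URI. PORT defaults to 8020 and is omitted when it is 8020;
# PATH defaults to "/". Hypothetical helper for illustration.
hdfs_uri() {
    host=$1
    port=${2:-8020}
    path=${3:-/}
    if [ "$port" = "8020" ]; then
        echo "hdfs://$host$path"
    else
        echo "hdfs://$host:$port$path"
    fi
}
```

For example: hdfs dfs -ls "$(hdfs_uri 172.25.40.171 9001)".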
An uploaded file is stored under the DataNode's data directory (set by "dfs.datanode.data.dir" in the DataNode's hdfs-site.xml), for example:
$HADOOP_HOME/data/current/BP-139798373-172.25.40.171-1397735615751/current/finalized/blk_1073741825
The "blk" in the file name stands for block. By default blk_1073741825 is one complete block of the file; Hadoop does no further processing on it.
10.12.4. hdfs dfs -put
Uploads a file. Example:
> hdfs dfs -put /etc/SuSE-release hdfs://172.25.40.171:9001/
10.12.5. hdfs dfs -rm
Deletes a file. Example:
> hdfs dfs -rm hdfs://172.25.40.171:9001/SuSE-release
Deleted hdfs://172.25.40.171:9001/SuSE-release
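10.12.6. How does a new NameNode join?
When a NameNode machine fails, a new NameNode has to replace it. Change the configuration to point at the new NameNode, then start the new NameNode as a standby so that it joins the cluster: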
1) ./hdfs namenode -bootstrapStandby
2) ./hadoop-daemon.sh start namenode
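10.12.7. HDFS allows only two NameNodes, one active and one standby
If you try to configure three NameNodes, for example:
<property>
  <name>dfs.ha.namenodes.test</name>
  <value>nm1,nm2,nm3</value>
  <description>The prefix for a given nameservice, contains a comma-separated
  list of namenodes for a given nameservice (eg EXAMPLENAMESERVICE).</description>
</property>
then running "hdfs namenode -bootstrapStandby" fails with the following error, which shows that in Hadoop 2.x a single namespace cannot have more than 2 NameNodes: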
16/04/11 09:51:57 ERROR namenode.NameNode: Failed to start namenode.
java.io.IOException: java.lang.IllegalArgumentException: Expected exactly 2 NameNodes in namespace 'test'. Instead, got only 3 (NN ids were 'nm1','nm2','nm3'
at org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.run(BootstrapStandby.java:425)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1454)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
Caused by: java.lang.IllegalArgumentException: Expected exactly 2 NameNodes in namespace 'test'. Instead, got only 3 (NN ids were 'nm1','nm2','nm3'
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:115)
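10.12.8. Storage balancing with start-balancer.sh
Example: start-balancer.sh -threshold 10
The value 10 means the cluster counts as balanced when disk utilization across machines deviates by less than 10%; otherwise blocks are moved to even it out. "start-balancer.sh" runs "hdfs start balancer" to do the balancing, and stop-balancer.sh stops it.
Balancing is very slow, but HDFS remains fully usable while it runs, including uploading files.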
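To make the threshold concrete, the sketch below flags nodes whose utilization is more than the threshold away from the average. It is a simplification under a stated assumption: all DataNodes have equal capacity, so the cluster utilization equals the plain average of the per-node percentages. The function name is hypothetical and this is not the balancer's own code.

```shell
# over_threshold THRESHOLD UTIL...
# Each UTIL is one DataNode's disk utilization in percent.
# Prints the 1-based positions of nodes whose utilization differs from the
# average by more than THRESHOLD percentage points.
# Assumes equal node capacity; hypothetical helper for illustration.
over_threshold() {
    threshold=$1
    shift
    printf '%s\n' "$@" | awk -v t="$threshold" '
        { u[NR] = $1; sum += $1 }
        END {
            avg = sum / NR
            for (i = 1; i <= NR; i++) {
                d = u[i] - avg
                if (d < 0) d = -d
                if (d > t) print i
            }
        }'
}
```

With utilizations 50, 55, 45 and 90 the average is 60, so over_threshold 10 50 55 45 90 reports nodes 3 and 4 (15 and 30 points away); node 1 sits exactly 10 points away and still counts as balanced.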
[VM2016@hadoop-030 /data4/hadoop/sbin]$ hdfs balancer # start-balancer.sh can be used instead
16/04/08 14:26:55 INFO balancer.Balancer: namenodes = [hdfs://test] // test is the name of the HDFS cluster
16/04/08 14:26:55 INFO balancer.Balancer: parameters = Balancer.Parameters[BalancingPolicy.Node, threshold=10.0, max idle iteration = 5, number of nodes to be excluded = 0, number of nodes to be included = 0]
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
16/04/08 14:26:56 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.231:50010
16/04/08 14:26:56 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.229:50010
16/04/08 14:26:56 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.213:50010
16/04/08 14:26:56 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.208:50010
16/04/08 14:26:56 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.232:50010
16/04/08 14:26:56 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.207:50010
16/04/08 14:26:56 INFO balancer.Balancer: 5 over-utilized: [192.168.1.231:50010:DISK, 192.168.1.229:50010:DISK, 192.168.1.213:50010:DISK, 192.168.1.208:50010:DISK, 192.168.1.232:50010:DISK]
16/04/08 14:26:56 INFO balancer.Balancer: 1 underutilized: [192.168.1.207:50010:DISK] # data will be moved to this node
16/04/08 14:26:56 INFO balancer.Balancer: Need to move 816.01 GB to make the cluster balanced. # 816.01 GB must be moved to reach balance
16/04/08 14:26:56 INFO balancer.Balancer: Decided to move 10 GB bytes from 192.168.1.231:50010:DISK to 192.168.1.207:50010:DISK # move 10 GB from 192.168.1.231 to 192.168.1.207
16/04/08 14:26:56 INFO balancer.Balancer: Will move 10 GB in this iteration
16/04/08 14:32:58 INFO balancer.Dispatcher: Successfully moved blk_1073749366_8542 with size=77829046 from 192.168.1.231:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.213:50010
16/04/08 14:32:59 INFO balancer.Dispatcher: Successfully moved blk_1073749386_8562 with size=77829046 from 192.168.1.231:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.231:50010
16/04/08 14:33:34 INFO balancer.Dispatcher: Successfully moved blk_1073749378_8554 with size=77829046 from 192.168.1.231:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.231:50010
16/04/08 14:34:38 INFO balancer.Dispatcher: Successfully moved blk_1073749371_8547 with size=134217728 from 192.168.1.231:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.213:50010
16/04/08 14:34:54 INFO balancer.Dispatcher: Successfully moved blk_1073749395_8571 with size=134217728 from 192.168.1.231:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.231:50010
Apr 8, 2016 2:35:01 PM 0 478.67 MB 816.01 GB 10 GB
16/04/08 14:35:10 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.213:50010
16/04/08 14:35:10 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.229:50010
16/04/08 14:35:10 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.232:50010
16/04/08 14:35:10 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.231:50010
16/04/08 14:35:10 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.208:50010
16/04/08 14:35:10 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.207:50010
16/04/08 14:35:10 INFO balancer.Balancer: 5 over-utilized: [192.168.1.213:50010:DISK, 192.168.1.229:50010:DISK, 192.168.1.232:50010:DISK, 192.168.1.231:50010:DISK, 192.168.1.208:50010:DISK]
16/04/08 14:35:10 INFO balancer.Balancer: 1 underutilized: [192.168.1.207:50010:DISK]
16/04/08 14:35:10 INFO balancer.Balancer: Need to move 815.45 GB to make the cluster balanced.
16/04/08 14:35:10 INFO balancer.Balancer: Decided to move 10 GB bytes from 192.168.1.213:50010:DISK to 192.168.1.207:50010:DISK
16/04/08 14:35:10 INFO balancer.Balancer: Will move 10 GB in this iteration
16/04/08 14:41:18 INFO balancer.Dispatcher: Successfully moved blk_1073760371_19547 with size=77829046 from 192.168.1.213:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.213:50010
16/04/08 14:41:19 INFO balancer.Dispatcher: Successfully moved blk_1073760385_19561 with size=77829046 from 192.168.1.213:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.213:50010
16/04/08 14:41:22 INFO balancer.Dispatcher: Successfully moved blk_1073760393_19569 with size=77829046 from 192.168.1.213:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.213:50010
16/04/08 14:41:23 INFO balancer.Dispatcher: Successfully moved blk_1073760363_19539 with size=77829046 from 192.168.1.213:50010:DISK to 192.168.1.207:50010:DISK through 192.168.1.213:50010
10.12.9. Adding JournalNodes
Pick an existing JournalNode and edit its hdfs-site.xml to include the new JournalNodes. For example, starting from
qjournal://hadoop-030:8485;hadoop-031:8485;hadoop-032:8485/test
add the two JournalNodes hadoop-033 and hadoop-034:
qjournal://hadoop-030:8485;hadoop-031:8485;hadoop-032:8485;hadoop-033:8485;hadoop-034:8485/test
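Then copy the installation directory and the data directory (the directory set by dfs.journalnode.edits.dir in hdfs-site.xml) to the new nodes.
If the JournalNode data directory is not copied, the JournalNode on the new node reports "Journal Storage Directory /data/journal/test not formatted"; a future version may synchronize it automatically.
After that, start the JournalNode on the new nodes (no initialization is needed) and restart the NameNodes. Watch the JournalNode log to confirm that startup succeeded; INFO-level entries like the following indicate success: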
2016-04-26 10:31:11,160 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /data/journal/test/current/edits_inprogress_0000000000000194269 -> /data/journal/test/current/edits_0000000000000194269-0000000000000194270
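Editing the qjournal URI by hand is error-prone because the separators are semicolons and the journal ID must stay at the end. The sketch below, with a hypothetical function name, appends host:port entries to an existing URI using plain parameter expansion.

```shell
# qjournal_add URI HOST:PORT...
# Appends JournalNode addresses to a qjournal:// URI, keeping the journal ID
# (the text after the last "/") at the end. Hypothetical helper.
qjournal_add() {
    uri=$1
    shift
    journal_id=${uri##*/}   # e.g. "test"
    nodes=${uri%/*}         # e.g. "qjournal://a:8485;b:8485"
    for node in "$@"; do
        nodes="$nodes;$node"
    done
    echo "$nodes/$journal_id"
}
```

Calling it with the three-node URI from this section plus hadoop-033:8485 and hadoop-034:8485 reproduces the five-node URI shown in the example. Quote the URI on the command line, since the shell treats an unquoted semicolon as a command separator.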
11. Starting YARN
11.1. Start YARN
1) Go to the $HADOOP_HOME/sbin directory
2) Run start-yarn.sh on both the active and the standby machine to start YARN
If startup succeeds, jps on the master node shows ResourceManager:
> jps
24689 NameNode
30156 Jps
28861 ResourceManager
On the slave nodes, jps shows NodeManager:
$ jps
14019 NodeManager
23257 DataNode
15115 Jps
To start only the ResourceManager on a given node:
./yarn-daemon.sh start resourcemanager
And for a NodeManager:
./yarn-daemon.sh start nodemanager
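11.2. Running YARN commands
11.2.1. yarn node -list
Lists all NodeManagers in the YARN cluster (mind the spaces between arguments; running yarn with no arguments prints the help), for example: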
> yarn node -list
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
localhost:45980 RUNNING localhost:8042 0
localhost:47551 RUNNING localhost:8042 0
localhost:58394 RUNNING localhost:8042 0
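When scripting a health check, the listing can be reduced to a count of RUNNING nodes. A minimal sketch with a hypothetical function name, assuming the column layout shown above where the second column is the node state.

```shell
# count_running_nodes
# Reads "yarn node -list" output on stdin and prints how many nodes are RUNNING.
# Hypothetical helper for illustration.
count_running_nodes() {
    awk '$2 == "RUNNING" { n++ } END { print n + 0 }'
}
```

For example: yarn node -list | count_running_nodes, then compare the number with the count of hosts in the slaves file.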
11.2.2. yarn node -status
Shows the status of a given NodeManager, for example:
> yarn node -status localhost:47551
Node Report :
Node-Id : localhost:47551
Rack : /default-rack
Node-State : RUNNING
Node-Http-Address : localhost:8042
Last-Health-Update : 星期五 18/四月/14 01:45:41:555GMT
Health-Report :
Containers : 0
Memory-Used : 0MB
Memory-Capacity : 8192MB
CPU-Used : 0 vcores
CPU-Capacity : 8 vcores
11.2.3. yarn rmadmin -getServiceState rm1
Shows whether rm1 is active or standby.
11.2.4. yarn rmadmin -transitionToStandby rm1
Switches rm1 from active to standby.
For more yarn commands, see:
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnCommands.html
12. Running a MapReduce program
The share/hadoop/mapreduce subdirectory of the installation contains ready-made example programs:
hadoop@VM-40-171-sles10-64:~/hadoop> ls share/hadoop/mapreduce
hadoop-mapreduce-client-app-2.7.2.jar hadoop-mapreduce-client-jobclient-2.7.2-tests.jar
hadoop-mapreduce-client-common-2.7.2.jar hadoop-mapreduce-client-shuffle-2.7.2.jar
hadoop-mapreduce-client-core-2.7.2.jar hadoop-mapreduce-examples-2.7.2.jar
hadoop-mapreduce-client-hs-2.7.2.jar lib
hadoop-mapreduce-client-hs-plugins-2.7.2.jar lib-examples
hadoop-mapreduce-client-jobclient-2.7.2.jar sources
Try running one of the examples:
hdfs dfs -put /etc/hosts hdfs://test/in/
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount hdfs://test/in/ hdfs://test/out/
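While it runs, the Java jps command shows that YARN has started processes named YarnChild.
When wordcount finishes, the results are saved in the out directory, in files named like "part-r-00000". Two points need attention when running this example:
1) The in directory must contain text files, or in can itself be the text file to count. It can be a file or directory on HDFS, or a local file or directory
2) The out directory must not exist. The program creates it itself and reports an error if it already exists.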
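Point 2 is a common cause of failed reruns, so a pre-flight check is handy. The sketch below wraps "hdfs dfs -test -e", which exits with 0 when the path exists. The function name and the HDFS_CMD override are assumptions added for illustration, and the check treats any non-zero exit, including an unreachable cluster, as "path is free".

```shell
# output_dir_is_free PATH
# Succeeds when PATH does not exist yet, so wordcount can create it.
# HDFS_CMD may be overridden (for example in a dry run); it defaults to "hdfs".
# Hypothetical helper for illustration.
output_dir_is_free() {
    if ${HDFS_CMD:-hdfs} dfs -test -e "$1"; then
        echo "output path $1 already exists" >&2
        return 1
    fi
    return 0
}
```

For example: output_dir_is_free hdfs://test/out/ && hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount hdfs://test/in/ hdfs://test/out/.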
The hadoop-mapreduce-examples-2.7.2.jar package contains several example programs; run it without arguments to see the usage:
> hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount
Usage: wordcount <in> [<in>...] <out>
> hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
Set the log level to DEBUG and print to the console:
export HADOOP_ROOT_LOGGER=DEBUG,console