https://hadoop.apache.org/# — the official Hadoop site; the official documentation is reached from here.
On the site, click "Getting started" to open the docs; the "Documentation" navigation entry does not link through.
Before downloading Hadoop, install JDK 8 first: Hadoop has supported JDK 8 since the 2.7 line, and Hadoop 3.x requires Java 8 or later.
JDK 8: https://www.oracle.com/java/technologies/downloads/#java8
If your system is 32-bit, choose the file whose name contains i586; if it is 64-bit, choose the one containing x64.
You must register an Oracle account and sign in before you can download.
Extract the downloaded JDK inside the Downloads directory: tar -zxvf jdk-8u361-linux-x64.tar.gz
Move the extracted JDK into /usr/lib/jvm (create the directory first if it does not exist):
sudo mkdir -p /usr/lib/jvm
sudo mv jdk1.8.0_361 /usr/lib/jvm
vim ~/.bashrc
Put these four lines near the top of the file:
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_361
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
source ~/.bashrc   # apply the configuration
java -version      # verify the Java setup
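To confirm the exports actually took effect, you can check that the java found on PATH is the one under JAVA_HOME. A small sketch (java_from_home is a made-up helper for this check, not part of the JDK):

```shell
# java_from_home JAVA_HOME_DIR: print "yes" if the `java` on PATH
# resolves to JAVA_HOME_DIR/bin/java, else "no".
java_from_home() {
  case "$(command -v java)" in
    "$1"/bin/java) echo "yes" ;;
    *)             echo "no"  ;;
  esac
}

# After `source ~/.bashrc` you would expect:
# java_from_home /usr/lib/jvm/jdk1.8.0_361
```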
Ubuntu comes with an SSH client preinstalled, so only the SSH server still needs to be installed:
sudo apt-get install ssh
sudo apt-get install pdsh
Set up passwordless SSH login:
cd ~/.ssh/
ssh-keygen -t rsa
cat ./id_rsa.pub >> ./authorized_keys
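The steps above can be sketched end to end; this version runs against a scratch directory so it does not disturb a real ~/.ssh (note the chmod: sshd ignores an authorized_keys file that is group- or world-accessible):

```shell
# Generate an RSA key pair with an empty passphrase and authorize it.
KEYDIR=$(mktemp -d)                          # scratch stand-in for ~/.ssh
ssh-keygen -q -t rsa -P "" -f "$KEYDIR/id_rsa"
cat "$KEYDIR/id_rsa.pub" >> "$KEYDIR/authorized_keys"
chmod 600 "$KEYDIR/authorized_keys"          # sshd requires restrictive permissions
# With the real ~/.ssh, `ssh localhost` should now log in without a password.
```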
https://dlcdn.apache.org/hadoop/common/ — official download area for the various Hadoop releases.
sudo tar -zxf ~/Downloads/hadoop-3.3.5.tar.gz -C /usr/local
cd /usr/local
sudo mv ./hadoop-3.3.5/ ./hadoop
sudo chown -R username:group ./hadoop   # usually both are simply your current username
cd /usr/local/hadoop
./bin/hadoop version   # check the installed Hadoop version
The following steps set up the pseudo-distributed installation:
vim etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_361
The remaining steps follow the official pseudo-distributed guide: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
etc/hadoop/core-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/usr/local/hadoop/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8032</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>local</value>
</property>
</configuration>
hadoop.tmp.dir holds temporary files. If it is left unset, the default temporary directory is /tmp/hadoop-${user.name}, which the system may clean out on reboot and cause unexpected problems, so this parameter must be configured. fs.defaultFS specifies the address used to access HDFS; 9000 is the port number.
etc/hadoop/hdfs-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/tmp/dfs/data</value>
</property>
</configuration>
In hdfs-site.xml, dfs.replication sets the number of replicas: HDFS stores data redundantly for reliability and availability, but a pseudo-distributed deployment has only one node, so at most one replica is possible and dfs.replication is set to 1. dfs.namenode.name.dir sets the directory where the NameNode keeps its metadata, and dfs.datanode.data.dir sets the directory where the DataNode keeps its data blocks; both must be set, or errors will occur later.
Note that Hadoop's mode of operation (standalone vs. pseudo-distributed) is determined entirely by the configuration files, which are read at startup. To switch from pseudo-distributed back to standalone mode, simply remove the properties from core-site.xml.
etc/hadoop/mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>localhost:19888</value>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/tmp/hadoop-yarn/staging/history/done</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
etc/hadoop/yarn-site.xml:
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>localhost:8030</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
</configuration>
cd /usr/local/hadoop
./bin/hdfs namenode -format   # format the filesystem
./sbin/start-dfs.sh           # start NameNode, DataNode and SecondaryNameNode
sudo touch /etc/pdsh/rcmd_default
sudo vim /etc/pdsh/rcmd_default
Write "ssh" into the file.
This prevents the error "localhost: rcmd: socket: Permission denied".
Then run jps to list the running daemons and confirm the pseudo-distributed installation succeeded.
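After start-dfs.sh succeeds, jps should show NameNode, DataNode and SecondaryNameNode (plus Jps itself). As a sketch, a small helper that inspects jps-style output for the expected daemons (the function name is my own, not a Hadoop tool):

```shell
# check_hdfs_daemons: read `jps`-style output passed as $1 and report
# whether the daemons expected for pseudo-distributed HDFS are present.
check_hdfs_daemons() {
  missing=""
  for d in NameNode DataNode SecondaryNameNode; do
    # jps prints "<pid> <name>", so match the daemon name at end of line
    printf '%s\n' "$1" | grep -q " $d\$" || missing="$missing $d"
  done
  if [ -z "$missing" ]; then
    echo "HDFS daemons OK"
  else
    echo "missing:$missing"
  fi
}

# On a live system: check_hdfs_daemons "$(jps)"
```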
Configure the PATH variable:
vim ~/.bashrc
export PATH=$PATH:/usr/local/hadoop/sbin:/usr/local/hadoop/bin
source ~/.bashrc   # apply the configuration
After that, you can start Hadoop from any directory simply by typing start-dfs.sh; likewise, stop it from any directory with stop-dfs.sh.
Hadoop fully distributed (cluster) configuration
Configuration follows the official docs: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html
etc/hadoop/core-site.xml: # mainly sets the NameNode address
fs.defaultFS # the address and port on which the NameNode serves HDFS to clients
To query the configured value (fs.default.name is the deprecated alias of fs.defaultFS):
hdfs getconf -confKey fs.default.name
hadoop.security.authorization # whether service-level authorization is enabled; default is false
etc/hadoop/hdfs-site.xml: # mainly sets the NameNode and DataNode storage locations
dfs.namenode.name.dir # directory where the NameNode stores its metadata, including the FsImage
dfs.datanode.data.dir # directory where the DataNode stores HDFS data blocks
dfs.namenode.secondary.http-address # HTTP address of the SecondaryNameNode
dfs.replication # number of file replicas; the default is 3
etc/hadoop/mapred-site.xml:
mapreduce.framework.name # tells MapReduce to use the YARN resource manager
etc/hadoop/yarn-site.xml:
yarn.nodemanager.aux-services # auxiliary service name; for MapReduce, set it to mapreduce_shuffle
yarn.resourcemanager.address # address the ResourceManager exposes to clients, used to submit or kill applications; default port 8032
yarn.resourcemanager.scheduler.address # address the ResourceManager exposes to ApplicationMasters, used to request and release resources; default port 8030
yarn.resourcemanager.resource-tracker.address # address the ResourceManager exposes to NodeManagers, used for heartbeats and task assignment; default port 8031
yarn.nodemanager.aux-services # services the NodeManager loads at run time; here mapreduce_shuffle
yarn.nodemanager.env-whitelist # environment variables that tasks are allowed to inherit
yarn.resourcemanager.address # address and port the ResourceManager runs on
yarn.resourcemanager.scheduler.address # address and port of the Scheduler
yarn.resourcemanager.resource-tracker.address # address and port of the ResourceTracker
yarn.resourcemanager.ha.enabled # enables ResourceManager HA mode
yarn.resourcemanager.ha.rm-ids # comma-separated list of ResourceManager IDs identifying the RM instances
yarn.resourcemanager.hostname.X # per-RM-ID property giving the hostname or IP of each ResourceManager
ha.zookeeper.quorum # hostnames or IPs of the ZooKeeper ensemble used by ZKFC and ResourceManager HA
yarn.resourcemanager.zk-state-store.address # connection address of the ZooKeeper ensemble used as the RM HA state store
yarn.resourcemanager.zk-address # connection address clients use to reach ZooKeeper
These parameters are used when configuring Hadoop with ZooKeeper for a distributed deployment; noting them here, to be revised later.
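As a rough sketch of how the HA parameters above fit together in yarn-site.xml (the hostnames rm1/rm2, master1/master2 and zk1-zk3 are made-up placeholders, not from this setup):

```xml
<!-- Sketch only: a two-ResourceManager HA fragment for yarn-site.xml -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
```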
I connect from IDEA on Windows to Hadoop on Ubuntu, which also requires a Hadoop environment (HADOOP_HOME) on the Windows side; I followed a CSDN post on the error "java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset".
hadoop version # check the downloaded Hadoop version
The winutils I used is the Hadoop 3.3.5 winutils package from the CSDN downloads area (a hadoop3.3.3-winutils package is also available there).
For configuring the Hadoop environment variables on Windows, see the CSDN post "winutils解决hadoop跨平台问题".
netstat -tpnl # list listening ports on Ubuntu and check whether Hadoop's port 9000 is open to external connections (on newer systems, ss -tpnl gives the same information)
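If the port turns out to be listening only on 127.0.0.1, remote clients cannot reach it. A minimal reachability probe, assuming bash's /dev/tcp pseudo-device (port_open is a made-up helper for this sketch):

```shell
# port_open HOST PORT: print "open" or "closed" using bash's /dev/tcp probe.
port_open() {
  if (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# From the client machine, probe the Ubuntu host before debugging the Java
# side, e.g.: port_open <ubuntu-ip> 9000
```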
The following Maven dependencies are used to connect to Hadoop 3.3.5:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.3.5</version>
</dependency>
<!-- hadoop hdfs -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>3.3.5</version>
</dependency>
<!-- hadoop mapreduce -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>3.3.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-auth</artifactId>
<version>3.3.5</version>
</dependency>
<!-- log4j -->
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
Once the environment is configured you can connect to Hadoop directly; the port must be open to external connections, otherwise the connection will fail.
log4j.properties:
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.Target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ABSOLUTE} %5p %c{1}:%L - %m%n
log4j.rootLogger=INFO, console
The following code lists the entries under the HDFS root directory.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.IOException;

public class connect {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // replace with your Ubuntu machine's IP address
        conf.set("fs.defaultFS", "hdfs://<your-ubuntu-ip>:9000");
        try (FileSystem fs = FileSystem.get(conf)) {
            // list everything directly under the HDFS root
            FileStatus[] fileStatus = fs.listStatus(new Path("/"));
            for (FileStatus status : fileStatus) {
                System.out.println(status.getPath().toString());
            }
        }
    }
}
If you want to connect to the pseudo-distributed setup from another machine, add this to hdfs-site.xml so the NameNode RPC service binds to all interfaces:
<property>
<name>dfs.namenode.rpc-bind-host</name>
<value>0.0.0.0</value>
</property>