Hadoop ChainMapper, ChainReducer [duplicate]

2024-01-24

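I'm fairly new to Hadoop and am trying to figure out how to chain jobs programmatically (multiple mappers and reducers) using ChainMapper and ChainReducer. I've found a few partial examples, but not a single complete, working one.

My current test code is: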

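import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ChainJobs extends Configured implements Tool {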

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

public static class Map2 extends MapReduceBase implements Mapper<Text, IntWritable, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Text key, IntWritable value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // The word arrives as the key; the value is only the count emitted by the first mapper.
        String line = key.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken().concat("Justatest"));
            output.collect(word, one);
        }
    }
}

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

@Override
public int run(String[] args) throws Exception {

    Configuration conf = getConf();
    JobConf job = new JobConf(conf);

    job.setJobName("TestforChainJobs");
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    JobConf map1Conf = new JobConf(false);
    ChainMapper.addMapper(job, Map.class, LongWritable.class, Text.class, Text.class, IntWritable.class, true, map1Conf);

    JobConf map2Conf = new JobConf(false);
    ChainMapper.addMapper(job, Map2.class, Text.class, IntWritable.class, Text.class, IntWritable.class, true, map2Conf);

    JobConf reduceConf = new JobConf(false);
    ChainReducer.setReducer(job, Reduce.class, Text.class, IntWritable.class, Text.class, IntWritable.class, true, reduceConf);

    JobClient.runJob(job);
    return 0;

}

public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new ChainJobs(), args);
    System.exit(res);
}

}

But it fails with:

MapAttempt TASK_TYPE="MAP" TASKID="task_201210162337_0009_m_000000" TASK_ATTEMPT_ID="attempt_201210162337_0009_m_000000_0" TASK_STATUS="FAILED" FINISH_TIME="1350397216365" HOSTNAME="localhost\.localdomain" ERROR="java\.lang\.RuntimeException: Error in configuring object
    at org\.apache\.hadoop\.util\.ReflectionUtils\.setJobConf(ReflectionUtils\.java:106)
    at org\.apache\.hadoop\.util\.ReflectionUtils\.setConf(ReflectionUtils\.java:72)
    at org\.apache\.hadoop\.util\.ReflectionUtils\.newInstance(ReflectionUtils\.java:130)
    at org\.apache\.hadoop\.mapred\.MapTask\.runOldMapper(MapTask\.java:389)
    at org\.apache\.hadoop\.mapred\.MapTask\.run(MapTask\.java:327)
    at org\.apache\.hadoop\.mapred\.Child$4\.run(Child\.java:268)
    at java\.security\.AccessController\.doPrivileged(Native Method)
    at javax\.security\.auth\.Subject\.doAs(Subject\.java:396)

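Any hints, or a very simple working example, would be much appreciated.

I have written a word-count job based on ChainMapper. The code uses the new API (org.apache.hadoop.mapreduce) and runs fine :)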

import java.io.IOException;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

//implementing CHAIN MAPREDUCE without using a custom input format

//SPLIT MAPPER: first mapper in the chain, emits (token, 1) for every space-separated token
class SplitMapper extends Mapper<Object,Text,Text,IntWritable>
{
    private IntWritable dummyValue=new IntWritable(1);
    private String tokens[];
    @Override
    public void map(Object key,Text value,Context context)throws IOException,InterruptedException{
        tokens=value.toString().split(" ");
        for(String x:tokens)
        {
            context.write(new Text(x), dummyValue);
        }
    }
}




//UPPER CASE MAPPER
class UpperCaseMapper extends Mapper<Text,IntWritable,Text,IntWritable>
{
    @Override
    public void map(Text key,IntWritable value,Context context)throws IOException,InterruptedException{
        String val=key.toString().toUpperCase();
        Text newKey=new Text(val);
        context.write(newKey, value);
    }
}

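
//ChainMapReducer: sums the counts of each word
class ChainMapReducer extends Reducer<Text,IntWritable,Text,IntWritable>
{
    @Override
    public void reduce(Text key,Iterable<IntWritable>values,Context context)throws IOException,InterruptedException{
        //sum must be local to reduce(): as an instance field it is never reset between keys,
        //which yields running totals like those in the sample output further down
        int sum=0;
        for(IntWritable value:values)
        {
            sum+=value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}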
public class FirstClass extends Configured implements Tool{
    public int run (String args[])throws IOException,InterruptedException,ClassNotFoundException{
        //use the configuration that ToolRunner has already populated
        Configuration cf=getConf();
        Job j=Job.getInstance(cf);

        /**************CHAIN MAPPER AREA STARTS********************************/
        Configuration splitMapConfig=new Configuration(false);
        //below we add the 1st mapper class under ChainMapper Class
        ChainMapper.addMapper(j, SplitMapper.class, Object.class, Text.class, Text.class, IntWritable.class, splitMapConfig);

        //configuration for second mapper
        Configuration upperCaseConfig=new Configuration(false);
        //below we add the 2nd mapper, the upper case mapper, to the Chain Mapper class
        ChainMapper.addMapper(j, UpperCaseMapper.class, Text.class, IntWritable.class, Text.class, IntWritable.class, upperCaseConfig);
        /**************CHAIN MAPPER AREA FINISHES********************************/

        //now proceeding with the normal delivery
        j.setJarByClass(FirstClass.class);
        j.setCombinerClass(ChainMapReducer.class);
        //without this line the job falls back to the identity reducer and only the combiner sums
        j.setReducerClass(ChainMapReducer.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        Path p=new Path(args[1]);

        //set the input and output URI
        FileInputFormat.addInputPath(j, new Path(args[0]));
        FileOutputFormat.setOutputPath(j, p);
        //remove any previous output directory so the job can be rerun
        p.getFileSystem(cf).delete(p, true);
        return j.waitForCompletion(true)?0:1;
    }
    public static void main(String args[])throws Exception{
        int res=ToolRunner.run(new Configuration(), new FirstClass(), args);
        System.exit(res);
    }
}

Part of the output looks like this:

A       619
ACCORDING       636
ACCOUNT 638
ACROSS? 655
ADDRESSES       657
AFTER   674
AGGREGATING,    687
AGO,    704
ALL     721
ALMOST  755
ALTERING        768
AMOUNT  785
AN      819
ANATOMY 820
AND     1198
ANXIETY 1215
ANY     1232
APACHE  1300
APPENDING       1313
APPLICATIONS    1330
APPLICATIONS.   1347
APPLICATIONS.ï¿½        1364
APPLIES 1381
ARCHITECTURE,   1387
ARCHIVES        1388
ARE     1405
AS      1422
BASED   1439

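You may see some special or unwanted characters because I did not do any cleaning to strip punctuation. I was only focusing on getting the chain mapper to work. Thanks :)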
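If the punctuation matters, one option is to normalize each token before it is emitted, for example in an extra mapper placed in the chain. The sketch below shows only the string handling, in plain Java without Hadoop; `TokenCleanerSketch`, `clean` and `cleanTokens` are hypothetical names, and the rule used (keep letters and digits, drop everything else) is an assumption rather than something taken from the original job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenCleanerSketch {

    // Drops every character that is not a Unicode letter or decimal digit, then upper-cases.
    static String clean(String token) {
        return token.replaceAll("[^\\p{L}\\p{Nd}]", "").toUpperCase(Locale.ROOT);
    }

    // Splits a line on whitespace and keeps only tokens that are non-empty after cleaning.
    static List<String> cleanTokens(String line) {
        List<String> cleaned = new ArrayList<>();
        for (String raw : line.split("\\s+")) {
            String token = clean(raw);
            if (!token.isEmpty()) {
                cleaned.add(token);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // Prints [ACROSS, AGO, APPLICATIONS]
        System.out.println(cleanTokens("across? ago, applications."));
    }
}
```

With this rule, keys such as `ACROSS?`, `AGO,` and `APPLICATIONS.` would collapse into `ACROSS`, `AGO` and `APPLICATIONS`, so their counts would be merged with the unpunctuated forms.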

Hadoop

MapReduce

chaining