Hadoop Streaming: how do I give the reducer a list of values per key?

When we write a map/reduce program in Java, the mapper collects the data and the reducer receives the list of values per key, like

map(k, v) -> (k1, v1), then the shuffle and sort happens, and then the reducer gets reduce(k1, List<values>)

to work on. But is it possible to do the same with Python using streaming? I used this as a reference, and it seems the reducer gets its data one line at a time, as supplied on the command line.

--------------Solutions-------------

In Hadoop Streaming, the mapper writes key/value pairs to sys.stdout. Hadoop does the shuffle and sort and directs the results to the reducer on sys.stdin. How you actually handle the map and the reduce is entirely up to you, as long as you follow that model (map to stdout, reduce from stdin). That is why it can be tested independently of Hadoop with cat data | map | sort | reduce on the command line.
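
For illustration, here is what a minimal streaming mapper could look like in Python (a word-count sketch of my own, not from the question; the file name mapper.py and the word-count logic are assumptions):

#!/usr/bin/env python
# mapper.py -- hypothetical word-count mapper for Hadoop Streaming.
# Reads raw text lines from stdin and writes one "key<TAB>value" pair
# per word to stdout; Hadoop shuffles and sorts these lines before they
# reach the reducer.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # key = the word, value = 1; tab is the default streaming separator
        print("%s\t%s" % (word, 1))

You can then test the whole pipeline locally with something like cat data.txt | ./mapper.py | sort | ./reducer.py (the file names here are placeholders).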

The input to the reducer is the same key/value pairs that the mapper emitted, but they arrive sorted by key. You can iterate through the lines and accumulate totals as the example demonstrates, or you can go further and pass the input to itertools.groupby(), which gives you the equivalent of the (k1, List<values>) input that you are used to and works well with the reduce() builtin.

The point being that it's up to you to implement the reduce.
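
As a sketch of the groupby approach (my own example, continuing the hypothetical word count above; it assumes tab-separated key/value lines and integer counts):

#!/usr/bin/env python
# reducer.py -- hypothetical word-count reducer using itertools.groupby.
# stdin arrives as sorted "key<TAB>value" lines, one pair per line;
# groupby collapses consecutive lines that share a key, which recovers
# the reduce(k1, List<values>) shape familiar from the Java API.
import sys
from itertools import groupby
from operator import itemgetter

def parse(stdin):
    for line in stdin:
        # split on the first tab only: everything after it is the value
        yield line.rstrip("\n").split("\t", 1)

for key, group in groupby(parse(sys.stdin), key=itemgetter(0)):
    total = sum(int(value) for _, value in group)
    print("%s\t%d" % (key, total))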

Maybe this can help you. I found the following on apache.org:

Customizing the Way to Split Lines into Key/Value Pairs

As noted earlier, when the Map/Reduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. By default, the prefix of the line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value.

However, you can customize this default. You can specify a field separator other than the tab character (the default), and you can specify the nth (n >= 1) character rather than the first character in a line (the default) as the separator between the key and value. For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer

In the above example, -D stream.map.output.field.separator=. specifies "." as the field separator for the map outputs, and the prefix up to the fourth "." in a line will be the key and the rest of the line (excluding the fourth ".") will be the value. If a line has fewer than four "."s, then the whole line will be the key and the value will be an empty Text object (like the one created by new Text("")). Note that the -D generic options must be placed before the other streaming options, as shown above.
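
To make the rule concrete, here is a rough Python illustration of that splitting behavior (my own sketch, not Hadoop's actual code):

# Rough illustration of the documented splitting rule, with separator "."
# and four key fields (not Hadoop's implementation).
def split_line(line, sep=".", num_key_fields=4):
    parts = line.split(sep)
    if len(parts) <= num_key_fields:
        # fewer separators than key fields: the whole line is the key
        # and the value is empty
        return line, ""
    key = sep.join(parts[:num_key_fields])
    value = sep.join(parts[num_key_fields:])
    return key, value

print(split_line("10.1.2.3.some payload"))  # ('10.1.2.3', 'some payload')
print(split_line("10.1.2"))                 # ('10.1.2', '')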

Similarly, you can use -D stream.reduce.output.field.separator=SEP and -D stream.num.reduce.output.fields=NUM to specify the nth field separator in a line of the reduce outputs as the separator between the key and the value.

Similarly, you can specify stream.map.input.field.separator and stream.reduce.input.field.separator as the input separator for map/reduce inputs. By default the separator is the tab character.

PipeReducer is the reducer implementation for Hadoop Streaming. The reducer gets the key and an iterator of values, iterates over them, and writes each key/value pair to the reducer script's stdin one at a time, not as a key with a list of values. This is the default behavior of Hadoop Streaming. I don't see any option to change this, unless the Hadoop code is modified.

public void reduce(Object key, Iterator values, OutputCollector output,
                   Reporter reporter) throws IOException {
    // ...
    while (values.hasNext()) {
        Writable val = (Writable) values.next();
        // ...
        // each key/value pair is written to the reducer script's stdin
        // separately -- the values are never batched into a list
        inWriter_.writeKey(key);
        inWriter_.writeValue(val);
        // ...
    }
}
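
That is why, in a Python streaming reducer, you have to group the values yourself; besides itertools.groupby, a common pattern is to track the current key manually (again a hypothetical word-count sketch assuming tab-separated integer counts):

#!/usr/bin/env python
# reducer.py -- hypothetical streaming reducer that groups keys by hand.
# PipeReducer feeds the script one "key<TAB>value" line at a time, so the
# script detects when the key changes and flushes the running total.
import sys

current_key = None
total = 0

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, total))
        current_key = key
        total = 0
    total += int(value)

if current_key is not None:
    # flush the last key
    print("%s\t%d" % (current_key, total))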


