ahalan · 2011-12-17 · Original thread
MapReduce is applicable wherever you can partition the data and process each part independently of others.

I used Hadoop/HBase for EEG time-series analysis, looking for certain oscillation patterns (basically classic time-series classification), and it was an embarrassingly parallel problem:

Map:

1. Partition the data into fixed segments (either temporal, say 1-hour chunks, or location-based, say 10x10 blocks of pixels). Alternatively you can use a 'sliding window' and extract features as you go. In some cases you can use symbolic representation/piecewise approximation to reduce dimensionality, as in iSAX: http://www.cs.ucr.edu/~eamonn/iSAX/iSAX.html , "sketches" as described here: http://www.amazon.com/High-Performance-Discovery-Time-Techni... or some other time-series segmentation techniques: http://scholar.google.com/scholar?q=time+series+segmentation

2. Extract features for each segment (either linear statistics/moments or non-linear signatures: http://www.nbb.cornell.edu/neurobio/land/PROJECTS/Complexity... ). The most difficult part here has nothing to do with MapReduce: it's deciding which features carry the most information. I found the ID3 criterion helpful: http://en.wikipedia.org/wiki/ID3_algorithm, also see http://www.quora.com/Time-Series/What-are-some-time-series-c... and http://scholar.google.com/scholar?hl=en&as_sdt=0,33&... (a rough code sketch of steps 1 and 2 follows).
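
Something like this, as a minimal sketch: it assumes numpy, a fixed-length sliding window, and a few crude statistical moments as the fingerprint. The window size, step and rounding are placeholders, not what I actually used.

    import numpy as np

    def segments(signal, window=256, step=128):
        """Step 1: fixed-length sliding window over one channel."""
        for start in range(0, len(signal) - window + 1, step):
            yield start, signal[start:start + window]

    def fingerprint(seg):
        """Step 2: a few linear moments, coarsely quantized so that
        similar segments collapse onto the same key."""
        stats = (seg.mean(), seg.std(), np.abs(np.diff(seg)).mean())
        return tuple(round(float(s), 1) for s in stats)

    def map_phase(channel_id, signal):
        """Emit (fingerprint, pointer-to-segment) pairs."""
        signal = np.asarray(signal, dtype=float)
        for start, seg in segments(signal):
            yield fingerprint(seg), (channel_id, start)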

Reduce:

3. Aggregate the results into a hash-table where the keys are segment signatures/features/fingerprints and the values are arrays of pointers to the corresponding segments (depending on its size, this table can either sit on a single machine or be distributed across multiple HDFS nodes). The aggregation itself is sketched below.

Essentially you do time-series clustering at the Reduce stage, with each 'basket' in the hash-table containing a group of similar segments. The table can be used as an index for similarity or range searches (for fast in-memory retrieval you can use HBase, which sits on top of HDFS). You can also build multiple indices for different feature sets.
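
The reduce side is really just that grouping; in a real Hadoop job the shuffle delivers the pairs to the reducer already sorted by key, so all the reducer does is append. Roughly, in plain Python (using the hypothetical map_phase above, not the actual HBase code):

    from collections import defaultdict

    def reduce_phase(pairs):
        """Step 3: hash-table index, fingerprint -> list of segment pointers."""
        index = defaultdict(list)
        for fp, pointer in pairs:
            index[fp].append(pointer)
        return index

    # Similar segments land in the same basket:
    # index = reduce_phase(map_phase('chan01', samples))
    # candidates = index[fingerprint(query_segment)]   # similarity lookup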

-----

The hard part is problem decomposition, i.e. dividing the work into independent units: replacing one big nested loop/sigma over the entire dataset with smaller loops that can run in parallel on parts of the dataset. Once you've done that, MapReduce is just a natural way to execute the job and aggregate the results.
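
The same point in miniature, outside Hadoop: once the per-partition work is an independent function, any parallel map will do. A toy sketch with Python's multiprocessing, where the sum-of-squares job is just a placeholder for the real per-chunk work:

    from multiprocessing import Pool

    def work(chunk):
        # placeholder for the real per-partition job (segment + extract features)
        return sum(x * x for x in chunk)

    if __name__ == '__main__':
        data = list(range(1000000))
        chunks = [data[i:i + 100000] for i in range(0, len(data), 100000)]
        with Pool() as pool:
            partials = pool.map(work, chunks)   # the parallel 'map'
        print(sum(partials))                    # the 'reduce'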
