MapReduce is applicable wherever you can partition the data and process each part independently of others.
I used Hadoop/HBase for EEG time series analysis, looking for certain oscillation patterns (basically classic time-series classification), and it was an embarrassingly parallel problem:
Map:
1. Partition the data into fixed segments (either temporal, say 1-hour chunks, or location-based, say 10x10 blocks of pixels). Alternatively you can use a 'sliding window' and extract features as you go. In some cases you can use symbolic representation/piecewise approximation to reduce dimensionality, as in iSAX: http://www.cs.ucr.edu/~eamonn/iSAX/iSAX.html , "sketches" as described here: http://www.amazon.com/High-Performance-Discovery-Time-Techni... , or other time-series segmentation techniques: http://scholar.google.com/scholar?q=time+series+segmentation
2. Extract features for each segment (either linear statistics/moments or non-linear signatures: http://www.nbb.cornell.edu/neurobio/land/PROJECTS/Complexity... ). The most difficult part here has nothing to do with MapReduce: it's deciding which features carry the most information. I found the ID3 criterion helpful: http://en.wikipedia.org/wiki/ID3_algorithm , also see http://www.quora.com/Time-Series/What-are-some-time-series-c... and http://scholar.google.com/scholar?hl=en&as_sdt=0,33&... (a mapper sketch covering steps 1-2 follows below)
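
To make the Map side concrete, here is a rough streaming-style mapper sketch in Python. The segment length, the quantization step, the input format (one tab-separated record id and sample value per line) and the particular features (mean, variance, zero-crossing rate) are all placeholders, not the exact setup I used:

    #!/usr/bin/env python3
    # Hadoop Streaming-style mapper sketch: cut one channel into fixed
    # segments, compute a few cheap features per segment, quantize them
    # into a signature, and emit "signature <TAB> segment pointer".
    # SEGMENT_LEN, QUANT, the input format and the features are
    # illustrative placeholders.
    import sys

    SEGMENT_LEN = 1024   # samples per fixed segment (assumed)
    QUANT = 0.5          # bucket width used to coarsen features into a signature

    def features(window):
        n = len(window)
        mean = sum(window) / n
        var = sum((x - mean) ** 2 for x in window) / n
        # zero-crossing rate around the mean: a crude oscillation measure
        zcr = sum(1 for a, b in zip(window, window[1:])
                  if (a - mean) * (b - mean) < 0) / (n - 1)
        return (mean, var, zcr)

    def signature(feats):
        # quantize so that similar segments map to the same key
        return ",".join(str(round(f / QUANT)) for f in feats)

    def main():
        buf, seg_idx = [], 0
        for line in sys.stdin:
            record_id, value = line.rstrip("\n").split("\t")
            buf.append(float(value))
            if len(buf) == SEGMENT_LEN:
                print(f"{signature(features(buf))}\t{record_id}:{seg_idx}")
                buf, seg_idx = [], seg_idx + 1

    if __name__ == "__main__":
        main()
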
Reduce:
3. Aggregate the results into a hash-table where the keys are segments' signatures/features/fingerprints, and the values are arrays of pointers to the corresponding segments (depending on its size, this table can either sit on a single machine or be distributed over multiple HDFS nodes).
Essentially you do time-series clustering at the Reduce stage, with each 'basket' in the hash-table containing a group of similar segments, as sketched below. It can be used as an index for similarity or range searches (for fast in-memory retrieval you can use HBase, which sits on top of HDFS). You can also have multiple indices for different feature sets.
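
The Reduce side then boils down to "collect everything that shares a signature". A rough streaming-style sketch (the plain-text output is a placeholder; each basket could instead be written as an HBase row keyed by the signature):

    #!/usr/bin/env python3
    # Hadoop Streaming-style reducer sketch: group all segment pointers
    # that share a signature into one basket (one row of the index).
    # Streaming delivers input sorted by key, so a single pass suffices.
    import sys

    def emit(key, basket):
        print(f"{key}\t{' '.join(basket)}")

    def main():
        current_key, basket = None, []
        for line in sys.stdin:
            key, pointer = line.rstrip("\n").split("\t")
            if key != current_key and current_key is not None:
                emit(current_key, basket)
                basket = []
            current_key = key
            basket.append(pointer)
        if current_key is not None:
            emit(current_key, basket)

    if __name__ == "__main__":
        main()

Wiring the two scripts up with the standard streaming jar looks roughly like

    hadoop jar hadoop-streaming.jar -input eeg/ -output index/ -mapper mapper.py -reducer reducer.py

(paths and file names here are made up).
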
-----
The hard part is problem decomposition, i.e. dividing the work into independent units: replacing one big nested loop/sigma over the entire dataset with smaller loops that can run in parallel on parts of the dataset. Once you've done that, MapReduce is just a natural way to execute the job and aggregate the results.
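
A toy illustration of that transformation (the sum-of-squares statistic and the chunking are arbitrary, just to show the shape):

    # One global loop vs. the same computation split into independent
    # per-chunk loops plus a merge step -- the shape MapReduce executes.

    def sum_of_squares(data):
        total = 0.0
        for x in data:                     # one big loop over the whole dataset
            total += x * x
        return total

    def map_partial(chunk):
        return sum(x * x for x in chunk)   # independent work on one partition ("map")

    def reduce_partials(partials):
        return sum(partials)               # merge the partial results ("reduce")

    chunks = [[1.0, 2.0], [3.0, 4.0], [5.0]]
    whole = [x for chunk in chunks for x in chunk]
    assert sum_of_squares(whole) == reduce_partials(map_partial(c) for c in chunks)
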