I’ve mentioned Google’s MapReduce algorithm in a previous post and, guess what, I’ve dug deeper and found a video lecture (
http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html) released officially by Google discussing the topic.
If you’re sick of lectures, here are some summaries collected online:
* Processing more data is going to mean using more machines at the same time; single-machine scaling has hit its limits – Moore’s law no longer delivers proportionally faster single processors, and memory bandwidth is a bottleneck
* Anytime multiple processes need to cooperate, we need synchronization primitives, which bring their own problems (e.g. deadlocks)
* A real distributed system requires consideration of network topology and recovery when individual units fail – and at scale they will fail, by definition
* NFS – each file lives on a single server; the protocol is stateless
* GFS – the master keeps metadata about which chunk servers hold which data. Data is replicated across multiple chunk servers. Files are written mostly by appending, not random-access writes, and reads are streamed. All master metadata is kept in memory – on the order of 64 bytes per chunk, with chunks of 64 MB
* Logs are used to keep the master data intact – write the log record, then the metadata, terminated with an END token. If the END token is missing on recovery, the record is incomplete and must be rewritten
* Fault Tolerance – High Availability, fast recovery, chunk replication (default: 3 replicas), shadow masters. Data Integrity – checksum every 64 KB block in each chunk