Managing Massive Data

I was really curious about how massive data sets are managed, especially by search engines such as Google or by utility companies (electricity, water and gas), since they often store data for an entire state or country. Google does release some technical papers on how it does this. For those of you who are as curious as I am, you can at least take a peek at how they do it.

From Google:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
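To make that description a bit more concrete for myself, here is a minimal sketch of the idea in Python, using the classic word-count example. It runs in a single process and only mimics the model: the names `map_fn`, `reduce_fn` and `run_mapreduce` are my own for illustration and are not taken from Google's implementation, which distributes this work across many machines.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take a (document name, text) pair and emit (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy single-process driver (hypothetical helper, not a real framework)."""
    # Group every intermediate value under its intermediate key.
    groups = defaultdict(list)
    for key, value in inputs:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    # Call reduce once per intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())


docs = [("doc1", "big data is big"), ("doc2", "data is everywhere")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

The user only writes the two small functions; everything in between (grouping values by key, and in a real system moving them between machines) is the framework's job, which is roughly why the model scales.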

From Lucene:

Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data.

Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.

Hadoop has been demonstrated on clusters with 2000 nodes. The current design target is 10,000 node clusters.
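The part about replicas and processing data "where it is located" can be sketched as a toy model too. This is only my own illustration, assuming a simple round-robin placement and three replicas per block; it is not how HDFS actually chooses nodes, and `place_replicas` and `schedule` are hypothetical names, not Hadoop APIs.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: a simplified placement rule chosen for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def schedule(placement, busy=()):
    """Run each block's task on a node that already holds a copy of it.

    Picks the first replica holder that is not busy, or None if all are busy.
    """
    assignment = {}
    for block, holders in placement.items():
        free = [node for node in holders if node not in busy]
        assignment[block] = free[0] if free else None
    return assignment


nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_0", "blk_1", "blk_2"]
placement = place_replicas(blocks, nodes)
assignment = schedule(placement, busy={"node1"})
print(placement)
print(assignment)
```

Because every block lives on several nodes, losing or overloading one node (here `node1`) still leaves another node that can process the block locally instead of copying it over the network.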

And seriously, I wasn’t literate enough to study persistently what these technologies are and how they were implemented. Yes, I do hold a bachelor’s degree in computer systems engineering, so what? You still need time to sit down and read, analyze, test and re-read what the paper truly means. It reminds me of the good old days of uni, flipping through tons of manuals, textbooks and websites, calculating with the all-time-favorite engineering calculator, etc. If you are in uni, make good use of it, or else don’t waste your time and money and QUIT!
