Warming up for my foray into big data analytics

Part of the preparation for my foray into big data analytics is... mapping out the field. There are so many things to learn, so I have to get organized. I also need to gather in one place the things I've read so far (for further reference, or simply to offload them from my head). This is just a note of my current understanding, which will evolve (be corrected and refined) along the way. Without further ado, here it is:

---
A few additional words (added 1 May 2013)
---

As the name suggests, this is a study note. It has inaccuracies and gaps, and some of the reasoning is not as strong as I would like it to be. But I don't want to fall into analysis paralysis, so here is my write-up, the distilled result of lots of googling :).

My exercise laundry list:

(1) In data warehousing, there's Pentaho. My reference is "Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL" ( http://amzn.to/104lh8k ).

(2) In the area of NoSQL, my reference is "Seven Databases in Seven Weeks" ( http://amzn.to/10XxTWe ). I have already finished reading the book; now on to doing its exercises. I will be focusing on PostgreSQL (as a reference point, from the familiar SQL standpoint) and MongoDB (the most popular NoSQL db).

(3) In machine learning, my reference is "Data Mining: Practical Machine Learning Tools and Techniques" ( http://amzn.to/13OjqbM ). I'm using Weka and KNIME. I have to be very pragmatic, because there is a lot of mathematical rigor in the book. In the short term, I only need to familiarize myself with the underlying ideas behind those algorithms and take advantage of already-implemented libraries (such as Weka and Mahout, to name a few). This, like any technique that is closer to art, is not something I can acquire in a short time, because it depends on intuition, which takes time and lots of drills to develop.

(4) Data cleansing: a key skill in data analysis. Thanks to Google for giving away Google Refine for free ( https://code.google.com/p/google-refine/ ). My reference: "Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work" ( http://amzn.to/10Xz1ZI ).

Now... I need to switch back this weekend to finishing my VoiceXML autotest before taking on that laundry list :D. Cya!

---

Note: the PDF can be downloaded from https://www.box.com/s/6lbafigfkbrta5cnll5u