Warming up for my foray into big data analytics

Part of the preparation for my foray into big data analytics is mapping out the field. There are so many things to learn, so I have to get organized. I also need to gather in one place the things I've read so far (for further reference, and simply to offload them from my head). This is just a note of my current understanding, which will evolve (be corrected and refined) along the way. Without further ado, here it is:

---
A few additional words (added 1 May 2013)
---

As the name suggests, this is a study note. It has inaccuracies and gaps, and some of the reasoning is not as strong as I would like it to be. But I don't want to fall into analysis paralysis, so here is my write-up, the distilled result of lots of googling :)

My exercise laundry list:

(1) In data warehousing, there's Pentaho. My reference is "Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL" ( http://amzn.to/104lh8k ).

(2) In the area of NoSQL, my reference is "Seven Databases in Seven Weeks" ( http://amzn.to/10XxTWe ). I've already finished reading the book; now on to doing its exercises. I will be focusing on PostgreSQL (as a reference point, from the familiar SQL standpoint) and MongoDB (the most popular NoSQL db).

(3) In machine learning, my reference is "Data Mining: Practical Machine Learning Tools and Techniques" ( http://amzn.to/13OjqbM ). I'm using Weka and KNIME. I have to be very pragmatic, because there is a lot of mathematical rigor in the book. In the short term, I only need to get familiar with the underlying ideas behind those algorithms and take advantage of already-implemented libraries (such as Weka and Mahout, to name a few). This, like any technique that is closer to art, is not something I can acquire in a short time, because it depends on intuition, which takes time and lots of drills to develop.

(4) Data cleansing: a key skill in data analysis. Thanks to Google for giving away Google Refine for free ( https://code.google.com/p/google-refine/ ). My reference: "Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work" ( http://amzn.to/10Xz1ZI ).

Now... I need to switch back this weekend to finishing my VoiceXML autotest, before taking on that laundry list :D . Cya!

---

Note: the PDF can be downloaded from https://www.box.com/s/6lbafigfkbrta5cnll5u

VoiceXML Autotest: Switching to Bladeware VXML

UPDATE: Progress as of 5 May 2013 is available at: http://jananuraga.blogspot.mx/2013/05/progress-on-voicexml-autotest-using.html

---
Yep, I decided to switch to BladewareVxml as the basis for the VoiceXML Autotest I'm working on. The reasons are pragmatic:

  • BladewareVxml is VoiceXML 2.1 compliant.
  • It's based on OpenVXI, which to my knowledge is used by many commercial products out there. So, hopefully, people will be more likely to use this tool.

What have I done so far? Something very basic: I got BladewareVxml compiled :) It was a no-brainer. I only had to fix a few build configurations and one header file, VxiCommon.h, by adding #include <iterator>. Download it from here: https://www.box.com/s/l9yniyq9pcbgb4nhro90

Next: I'm going to write a C++ interface and process it with SWIG, as a first step toward integration with Python. The idea: software testers will write their test scripts in Python, and that C++ interface will act as a bridge between the test script and the BladewareVXML interpreter object.

The C++ class that implements the interface will:
  • Accept (through its constructors) callback functions written in Python, such as:
    • readInput
    • renderPrompts
    • inputRecognized
    • inputNomatch
    • inputNoinput
    • transferPerformed
    • submit
  • Accept (through its constructors) VoiceXML platform properties.
  • Have methods that can be called from the test script. These methods will interact with the Bladeware VoiceXML interpreter object created during construction:
    • runVxml
    • feedInputSpoken
    • feedInputDTMF
    • feedNoInput
    • hangup
  • Be implemented as a singleton, allowing me to obtain a reference to it from any point in Bladeware's Vxml code.
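
To make the plan above a bit more concrete, here is a rough sketch of how a test script might look from the Python side. Nothing here exists yet: FakeVxmlBridge is a pure-Python stand-in I made up for the future SWIG-wrapped C++ class, and the instance() accessor, the callback arguments, and the sample URL are all assumptions. Only the callback and method names come from the list above.

```python
class FakeVxmlBridge:
    """Hypothetical pure-Python stand-in for the planned SWIG-wrapped C++ class."""

    _instance = None  # singleton slot, set by the constructor

    def __init__(self, callbacks, properties):
        # callbacks: dict mapping a callback name (e.g. "renderPrompts") to a function
        # properties: dict of VoiceXML platform properties
        self.callbacks = dict(callbacks)
        self.properties = dict(properties)
        self.running = False
        FakeVxmlBridge._instance = self

    @classmethod
    def instance(cls):
        """Return the single bridge object (assumed accessor name)."""
        return cls._instance

    def _fire(self, name, *args):
        # Call the registered callback, if the test script supplied one.
        callback = self.callbacks.get(name)
        if callback is not None:
            callback(*args)

    def runVxml(self, url):
        # The real bridge would hand the URL to the interpreter; here we only
        # pretend that the document rendered a prompt.
        self.running = True
        self._fire("renderPrompts", ["(prompts of %s)" % url])

    def feedInputSpoken(self, utterance):
        self._fire("inputRecognized", "voice", utterance)

    def feedInputDTMF(self, digits):
        self._fire("inputRecognized", "dtmf", digits)

    def feedNoInput(self):
        self._fire("inputNoinput")

    def hangup(self):
        self.running = False


# A tester's script: register callbacks, then drive the call step by step.
events = []
bridge = FakeVxmlBridge(
    callbacks={
        "renderPrompts": lambda prompts: events.append(("prompts", prompts)),
        "inputRecognized": lambda mode, value: events.append(("recognized", mode, value)),
        "inputNoinput": lambda: events.append(("noinput",)),
    },
    properties={"timeout": "5s"},
)
bridge.runVxml("http://localhost/hello.vxml")  # hypothetical sample document
bridge.feedInputDTMF("1")
bridge.feedNoInput()
bridge.hangup()
```

The events list is where a real test would compare what the interpreter did against the expected scenario.
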
Before I get to that, I will have to study the Bladeware Vxml code in debug mode. Now... where are the starting points? :D Maybe I can start from these.... Ok, that's all for now, until next weekend.





GuiceXML & VoiceXML Autotest, (r.e).u.n.i.t.e.d. !

STOP PRESS :)  20 April 2013: I decided to use Bladeware Vxml instead of JVoiceXML as the engine for this tool. See the post "VoiceXML Autotest: Switching to Bladeware VXML" to find out more about it.

***********************

Ok, quickie :)

GuiceXML: I already explained it in a previous blog entry. This weekend I had a chance to make some improvements to the code, and it's available here for download: https://www.box.com/s/3bph4o8096489spnhip2 . Don't forget to download the sample VoiceXML files (just deploy them to your webserver), here: https://www.box.com/s/v0tmupsij9ogscullh1k .

I also had a chance to take a screencast of GuiceXML, so you can have a better idea of what it is. Here's the vid:



VoiceXML Autotest: it was explained in this old blog entry (it actually precedes GuiceXML). With that library you can check the VoiceXML flow against your expectations (a scenario), expressed in the following way:


The good thing is: both now share the same code base. :)

Well, that's all for tonight. I'm sorry I don't have time tonight to write a bit about the code and where it's heading. I hope I'll be able to do it tomorrow night. Cya!