VoiceXML Automated Testing, using JVoiceXML

UPDATE: GuiceXML and VoiceXML Autotest (re)united. This way. New code, a screencast to give you a (more) concrete idea of what GuiceXML is about, and a screenshot of the code used to express your VoiceXML automated-test scenario.

----- Original Entry -----

Okay, just some quick & drafty entry here.

I needed a way to automate the testing of IVR applications (VoiceXML). Quick googling turns up this (Microsoft) and this (Empirix). I can't comment on either of those options because I haven't used them; Tellme retired its free developer-account service, so I can no longer access Tellme Studio. As for Empirix, based on what I read on its webpage, I don't think it's exactly what I was looking for (it talks about recognition error rates, prompt quality, etc., while my concern is mainly the flow of the dialogs).


UPDATE: I guess this product named "Voiyager" is close to what I'm looking for. Link: http://www.syntellect.com/pages/products/voiyager_eng.aspx

What I want is really simple (to begin with): I want to verify (quickly) that if I press "1" in a dialog that asks "What do you want to drink? Press 1 for coffee, Press 2 for tea.", then the next prompt would be "You selected coffee. What type of coffee? Press 1 for cappuccino, press 2 for espresso".

This is based on my observation of how clients specify their expectations: as a theater script with two actors in it, the IVR and the user. Like this:

IVR: What do you want to drink? Press 1 for coffee, Press 2 for tea. (or: audio_01.wav)
User: Press 1
IVR: You selected coffee. What type of coffee? Press 1 for cappuccino, press 2 for espresso. (or: audio_02.wav)
... (and so on).
So I thought it would be nice if we could make a little program that takes that script and checks it against a running IVR.

I've seen some people attempt to automate the test using an automation tool like AutoIt that basically (1) starts a softphone, (2) dials the IVR, (3) inputs DTMF (by pressing buttons in the softphone app), and that's it.

The problem with that is:
  1. There's no (easy) way to verify the prompts. A tool like AutoIt is a GUI test tool, designed for testing desktop applications by checking the properties of GUI elements. You can't use it to capture audio, let alone compare that audio against our expectation (which would be expressed in text form).
  2. Without a way to verify the prompts, the test is useless.
And then I came across JVoiceXML, an open-source VoiceXML interpreter written in Java. Somebody else had already come up with the idea for the test tool (described here: http://sourceforge.net/apps/mediawiki/jvoicexml/index.php?title=UnitTest ). I simply took that idea and implemented part of it. I started with something really simple: I want to be able to express each scenario in the following format (a plain text file):
i.Please enter 1 to go to formB, or 2 to go to formC%1
a.You are in prompt B
i.Please enter 34 followed by # to go to formD, or 35 followed by # to go to formE%34#
a.You are in prompt D
A line that starts with "i." means an input collection is expected: the user will be prompted with the question between the first "." and the "%", and will respond by pressing the sequence of digits to the right of the "%". A line that starts with "a." means the user will simply be prompted with the question to the right of the first ".".
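Just to make the format concrete, here is a minimal sketch of how such a line could be parsed. The class and field names are mine, invented for illustration; this is not code from the actual tool.

// Illustrative parser for one scenario line in the format described above.
final class ScenarioStep {
    final boolean expectsInput;   // true for "i." lines, false for "a." lines
    final String expectedPrompt;  // the question the IVR is expected to play
    final String dtmfInput;       // the digits to press; null on "a." lines

    ScenarioStep(boolean expectsInput, String expectedPrompt, String dtmfInput) {
        this.expectsInput = expectsInput;
        this.expectedPrompt = expectedPrompt;
        this.dtmfInput = dtmfInput;
    }

    static ScenarioStep parse(String line) {
        boolean input = line.startsWith("i.");
        String body = line.substring(2);      // drop the "i." / "a." prefix
        if (input) {
            int sep = body.lastIndexOf('%');  // prompt left of "%", digits right of it
            return new ScenarioStep(true, body.substring(0, sep), body.substring(sep + 1));
        }
        return new ScenarioStep(false, body, null);
    }
}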

And, to run the test (scenario), I would only have to type this command in the console:
java JVoiceXmlTest http://mywebserver/index.jsp scenario_01.txt
Where the first parameter (http://mywebserver/index.jsp) is the URL of the IVR's landing page, and the second (scenario_01.txt) is the name of the text file that contains the scenario.
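For the curious, the driver's skeleton is little more than argument handling plus the line parsing sketched earlier. This is a plausible reconstruction, not the actual code; the part that drives the interpreter is elided because it depends on the JVoiceXML internals discussed below.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public final class JVoiceXmlTest {
    public static void main(String[] args) throws Exception {
        String ivrUrl = args[0];        // e.g. http://mywebserver/index.jsp
        String scenarioFile = args[1];  // e.g. scenario_01.txt
        List<String> lines = Files.readAllLines(Paths.get(scenarioFile), StandardCharsets.UTF_8);
        for (String line : lines) {
            ScenarioStep step = ScenarioStep.parse(line);
            // ... feed step.dtmfInput to the interpreter running against ivrUrl,
            //     or verify the produced prompt against step.expectedPrompt ...
        }
    }
}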

So, here's what I got so far (video below). Nothing interesting :) yet, just some scrolling text in a console. What's more interesting is some of the findings I made while modifying the JVoiceXML source code.

[screencast omitted: a console run of the test scrolling through a scenario]
Ok, now the findings (I hope these can be useful feedback for the JVoiceXML team in their refactoring effort):
  1. JVoiceXML has a dependency on RMI (i.e., it binds itself to JNDI during startup). That may be fine for the intended use of JVoiceXML (as a networked application). However, for a testing tool like this one, I just want to run it as a standalone component; in particular, I'm only interested in the VoiceXML interpreter core. For now, I simply commented out the lines of code related to JNDI and RMI. I hope future versions will be refactored to let us use the VoiceXML interpreter as a plain Java object.
  2. JVoiceXML has an architecture that allows you to swap the "platform factory". A platform factory is basically an object that creates other objects that know how to obtain and process the (spoken) input and the (audio) output. Inputs and outputs go through an instance of "Telephony" (it's the channel).

    The good thing is that JVoiceXML comes with a "text platform factory" that takes input as text and produces output as text. A slight modification was needed, though, because its Telephony reads input from / writes output to a server socket. I don't need that; I needed to bypass it and use simple method invocations. So I created a wrapper around it: that's the PruebaPlatformFactory.
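In sketch form, the bypass amounts to replacing the socket with in-memory queues that plain method calls can drive. This is a hypothetical, simplified illustration of the idea behind PruebaPlatformFactory, not JVoiceXML's actual interfaces.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Socket-free text telephony: the test driver and the interpreter exchange
// text through method invocations instead of a server socket.
final class InMemoryTextTelephony {
    private final BlockingQueue<String> inputs = new LinkedBlockingQueue<>();
    private final BlockingQueue<String> outputs = new LinkedBlockingQueue<>();

    // Test-driver side: feed the scripted DTMF, fetch the next prompt.
    void feedInput(String dtmf) { inputs.add(dtmf); }
    String nextPrompt() throws InterruptedException { return outputs.take(); }

    // Interpreter side: called where the socket reads/writes used to happen.
    String readInput() throws InterruptedException { return inputs.take(); }
    void writeOutput(String prompt) { outputs.add(prompt); }
}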
Another finding (a lengthy one, and the most challenging): the issue of semantic interpretation of grammars.

Currently the implementation of its GrammarChecker does not support semantic interpretation, so you cannot associate a (custom) value with the phrases your grammar accepts. This limits the usefulness of JVoiceXML.

In my case, for example, I need to capture a 4-digit PIN. For that you can actually just take the "utterance" and treat it as the value. Now suppose the user is required to complete the input by pressing "#" after the fourth digit; you can still simply take the utterance and drop the trailing "#" in your "business logic". You don't need semantic interpretation for such cases.
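Something like this, to illustrate (a hypothetical helper, not code from JVoiceXML):

static String extractPin(String utterance) {
    // Treat the raw utterance as the value; drop the "#" terminator if present.
    return utterance.endsWith("#")
            ? utterance.substring(0, utterance.length() - 1)
            : utterance;
}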

Anyway, semantic interpretation is important; any decent VoiceXML browser must support it (conforming to a specification like SISR). So I set out to solve this, as an exercise. I ran into some difficulties, because I still don't have a good grasp of how the GrammarChecker and (to a lesser degree) the SrgsXmlGrammarParser work, principles and logic included. I guess the difficulties stem from the fact that the tree structure of the static model (the SRGS grammar) is transformed into a linear structure when the input is checked against the grammar.

So I just put my modifications in some sensible places in the code where I can intercept the event of a <tag> node being visited. My code simply collects the content of those tags (each is basically a line of JavaScript code) and stitches those lines of JS together (in some order) once the walk is completed. It then feeds the stitched JS to the embedded JS engine (Rhino), and I simply take the return value of the execution and assign it as the "semantic interpretation".

Take the following grammar, for instance:

[screenshot of the SRGS grammar omitted; its root rule concatenates the literal 'form' with the MEANING set by a digits rule, as the generated code below reflects]
If the input from the user is "34#", then the JS code that the GrammarChecker will produce would be:
var digits = new Object();
digits.MEANING='D';
var root_rule = new Object();
root_rule.MEAN2=digits.MEANING;
root_rule.MEAN1='form';
root_rule.whereToGo=root_rule.MEAN1+root_rule.MEAN2;
The return value of the JS execution is always the value of the last line, so effectively we get the string "formD".
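The stitching-and-evaluation step boils down to something like the sketch below. The class and method names are mine, but Context and evaluateString are Rhino's actual public API; fed the six lines above, it returns "formD", the value of the last assignment.

import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;

// Sketch: evaluate the stitched <tag> contents with Rhino and take the value
// of the last evaluated expression as the semantic interpretation.
final class SemanticEvaluator {
    static Object interpret(String stitchedJs) {
        Context cx = Context.enter();
        try {
            Scriptable scope = cx.initStandardObjects();
            // evaluateString returns the value of the last expression; for the
            // code above that is root_rule.whereToGo, i.e. the string "formD".
            return cx.evaluateString(scope, stitchedJs, "<grammar-tags>", 1, null);
        } finally {
            Context.exit();
        }
    }
}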

My current fix is kind of hackish. I just did the minimal thing to make the cases listed in http://box.net/files#/files/0/f/0/1/f_945790295 pass. The ideal solution would be one that passes the SISR 1.0 conformance tests. For that I will need to take a closer look at the SrgsXmlGrammarParser, the GrammarChecker, and the related classes & interfaces. I feel the need for refactoring in that area: the way GrammarNodes and SrgsNode are (currently) structured doesn't make it easy to navigate the tree during a walk, which might be required for an efficient implementation of a semantic interpreter. I was also thinking: why not use ANTLR to generate the bulk of the grammar interpreter? I guess that would be easier and would produce cleaner code.

Last finding: this time about notification. I need a way to get notified whenever either of these two things occurs:
  1. The interpreter is waiting for input, so that I can put in code that programmatically feeds the input.
  2. The interpreter is playing a prompt, so that I can put in code that compares the prompt with the one specified in the scenario file.
JVoiceXML employs the strategy design pattern (see the interface TagStrategy), which I exploit here to achieve the two things mentioned above: I simply implement a TagStrategy that wraps around the default strategy, so I can do some interception and fire the notifications from there (sketched below). Hmm, well, I was lying. I mean, that would be the right way to do it, but for now (because I don't have much time) I simply modified the implementation of the default strategies. Told you, it was hackish :).
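For completeness, the "right way" decorator would look roughly like this. The TagStrategy interface shown here is deliberately reduced to a single method for illustration; the real JVoiceXML interface has more methods and different signatures, but the wrap-and-intercept idea carries over.

// Reduced, illustrative version of the strategy interface; not JVoiceXML's real one.
interface TagStrategy {
    void execute(String tagName, String tagContent) throws Exception;
}

// The two notifications the test driver needs.
interface InterpreterListener {
    void waitingForInput();          // hook to feed the scripted DTMF
    void playingPrompt(String text); // hook to compare against the scenario file
}

// Decorator that fires notifications, then delegates to the default strategy.
final class NotifyingTagStrategy implements TagStrategy {
    private final TagStrategy delegate;
    private final InterpreterListener listener;

    NotifyingTagStrategy(TagStrategy delegate, InterpreterListener listener) {
        this.delegate = delegate;
        this.listener = listener;
    }

    @Override
    public void execute(String tagName, String tagContent) throws Exception {
        if ("prompt".equals(tagName)) {
            listener.playingPrompt(tagContent);
        } else if ("field".equals(tagName)) {
            listener.waitingForInput();
        }
        delegate.execute(tagName, tagContent);
    }
}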

The modified JVoiceXML is available here (it is based on JVoiceXML 0.7.4.1). Actually, it's of little use to the public right now (it's really yucky!); I will have to rework it sometime later anyway, once I have a firm understanding of the grammar interpreter, in order to make it SISR 1.0 compliant.

Okay, that's all for now!