UPDATE: GuiceXML and VoiceXML Autotest (re)united. This way. New code, a screencast to give you a (more) concrete idea of what GuiceXML is about, and a screenshot of the code used to express your VoiceXML automated-test scenario.

----- Original Entry -----

After a long hiatus, it's time to resurrect some old projects. One of them is something I call... GuiceXML. It's a play on GUI + VoiceXML. This is a cool (and useful) thing. Several companies are doing it; check out this one:

VoiceXML is a language for describing dialog flows. Originally it was meant for interactive VOICE response (IVR). But... it's easy to see that those same dialog flows could be presented VISUALLY on a smartphone. VoiceXML-based IVR has proliferated: the language has been around since 2000, and nowadays the majority of IVR systems are written in VoiceXML, so it's pretty much the standard, just as HTML is for websites. Smartphones, meanwhile, are commonplace. Connect the dots and you can quickly see the opportunity for this kind of thing (an adaptor: VoiceXML app -> mobile app).

Of course, in practice it wouldn't be as straightforward as it sounds. To begin with, there are inherent differences between visual and audible user interaction. What are those fundamental differences? Well, I still have to read more about it :), so I can't write much on that matter now. In the meantime, this link can provide some hints: . So... a good adaptor must provide a good API & tooling to overcome this first challenge. I suspect a significant amount of the tinkering and work would go into tackling it (note to product manager :)).

Maybe this technical description can illustrate some points. In a GUI, users are accustomed to seeing all the fields that belong to the same form on the same screen. Additionally, in most cases users are free to decide the sequence in which they fill in the input fields. That's normally not the case with a VUI, where users are kind of "forced" to give input one by one, in a predefined (and inflexible) sequence, even though there may be no dependency between the inputs. This is not a limitation of VoiceXML per se, but of UI design (most VoiceXML applications are coded in that simplistic way). VoiceXML actually provides a way to implement mixed-initiative dialogs. Check out this link:
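To make "mixed initiative" concrete, here's a minimal sketch of what such a form can look like in VoiceXML 2.0. The form name, field names, and grammar file are made up for illustration; the point is the `<initial>` element plus a form-level grammar, which let the caller fill several fields in one utterance, after which the interpreter only asks for whatever is still missing:

```xml
<!-- Hypothetical mixed-initiative form. The <initial> element prompts for
     everything at once; the form-level grammar can fill origin and
     destination from a single utterance, and the form interpretation
     algorithm then prompts only for the fields still empty. -->
<form id="flight">
  <grammar src="flight.grxml" type="application/srgs+xml"/>
  <initial name="start">
    <prompt>Where do you want to fly from, and to?</prompt>
  </initial>
  <field name="origin">
    <prompt>From which city?</prompt>
  </field>
  <field name="destination">
    <prompt>To which city?</prompt>
  </field>
  <filled mode="all">
    <submit next="book.jsp" namelist="origin destination"/>
  </filled>
</form>
```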

Alright, you might ask: why not simply apply some XSL transformations to the VoiceXML documents to produce HTML, then?

Well, it's not that simple, and one of the reasons is... something I would call the dynamism embedded in the VoiceXML spec itself, without even involving JavaScript (which is yet another thing that can be used inside a VoiceXML document). For example: an input field (`<field>`) can have a "cond" attribute that specifies a Boolean expression, which must evaluate to 'true' in order for the element to be visited and executed. Furthermore, this expression can involve the value of one or more other fields. Meaning: in that case, there _is_ a dependency between fields. Contrast this with HTML, where you would have to use some JavaScript to get similar behavior.
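Here's an illustrative (made-up) form showing that kind of dependency: the second field carries a `cond` expression over the first field's value, so whether it is visited at all depends on runtime state.

```xml
<!-- Hypothetical form: "topping" is only visited when "size" ends up
     being 'large', so these two fields cannot naively be rendered as
     two independent inputs on one static screen. -->
<form id="order">
  <field name="size">
    <prompt>What size pizza would you like?</prompt>
    <grammar src="size.grxml" type="application/srgs+xml"/>
  </field>
  <field name="topping" cond="size == 'large'">
    <prompt>Large pizzas come with a free topping. Which one?</prompt>
    <grammar src="topping.grxml" type="application/srgs+xml"/>
  </field>
</form>
```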

A good adaptor must figure out whether there are dependencies between the fields in a form, and adjust the way it asks for input accordingly. We can think of parsing the fields in a form to see if any of them has a "cond" attribute, for example; if not, then it might be safe to present all the fields of the form on a single screen. It's easy to imagine that the task of adapting becomes even more challenging when JavaScript code exists in the VoiceXML document.
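That heuristic can be sketched in a few lines of Java. This is a minimal sketch of the idea, not actual GuiceXML code; the class and method names are mine:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FormAnalyzer {

    /**
     * Returns true if no <field> in the given <form> carries a "cond"
     * attribute, i.e. it might be safe to render all fields on one screen.
     */
    public static boolean canGroupFields(Element form) {
        NodeList fields = form.getElementsByTagName("field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            if (field.hasAttribute("cond")) {
                return false; // visiting this field depends on runtime state
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        String vxml = "<vxml version=\"2.1\"><form id=\"order\">"
                + "<field name=\"size\"/>"
                + "<field name=\"topping\" cond=\"size == 'large'\"/>"
                + "</form></vxml>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(vxml.getBytes(StandardCharsets.UTF_8)));
        Element form = (Element) doc.getElementsByTagName("form").item(0);
        System.out.println(canGroupFields(form)); // false: "topping" depends on "size"
    }
}
```

Note this only catches `cond`; expressions hidden in `<script>` blocks or `expr` attributes would defeat it, which is exactly why static analysis alone isn't enough.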

The crux of the matter: a simple, static transformation (such as one with XSL) wouldn't work. The adaptor _must_ execute the VoiceXML (and the embedded JavaScript, if any). We don't want to require programmers to tweak existing VoiceXML code just to make it presentable as a GUI. We want to provide 100% assurance that existing VoiceXML will be processed the same way it is by an existing VoiceXML browser, and that no logic inside existing VoiceXML code that might impact business operations will be ignored / dropped. In short: a VoiceXML-compliant adaptor.

For that reason, I decided to pick JVoiceXML, an open-source VoiceXML interpreter, as the base for this adaptor. What needs to be done is to extend it through various mechanisms (subscriber/listener, injection of an alternative strategy, wrapper + delegation, etc.). In some situations I even had to make modifications directly in the JVoiceXML code, and I tried not to break its VoiceXML compliance. It took a lot of reverse engineering / debugging tricks to understand how it works and to know where & how to make the modifications / extensions (link to the list of modifications in the JVoiceXML code). GuiceXML alpha 0.1 is based on JVoiceXML 0.7.5.
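To illustrate the wrapper + delegation mechanism mentioned above, here's a toy sketch. The `SynthesizedOutput` interface below is a simplified stand-in I invented for this post, not JVoiceXML's real API; the pattern is what matters — wrap the original implementation, notify your own listener, then delegate so the original behavior is untouched:

```java
// Simplified stand-in for an interpreter output interface (NOT the real
// JVoiceXML API; invented here to illustrate the pattern).
interface SynthesizedOutput {
    void queuePrompt(String text);
}

// Wraps an existing implementation: notifies a listener, then delegates,
// leaving the wrapped object's behavior unchanged.
class InterceptingOutput implements SynthesizedOutput {

    interface PromptListener {
        void onPrompt(String text);
    }

    private final SynthesizedOutput delegate;
    private final PromptListener listener;

    InterceptingOutput(SynthesizedOutput delegate, PromptListener listener) {
        this.delegate = delegate;
        this.listener = listener;
    }

    @Override
    public void queuePrompt(String text) {
        listener.onPrompt(text);    // hook: the GUI side reacts to the prompt
        delegate.queuePrompt(text); // original behavior preserved
    }
}
```

The appeal of this mechanism is that, when the core library exposes the right interfaces, it requires no modification of the core library at all.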

Here's the result so far (link to the code). You will need NetBeans to open the project (I'm using the latest version, 7.3). To run the project, create a Run configuration like this one:

The GuiceXML code belongs to a package named net.raka.jvxmltd.

This sequence of screens will give you a picture of what to expect from this alpha version. It still does things in the most naive way, as I mentioned above: asking for input one field at a time. The example requires you to deploy two VoiceXML documents (sample_01.vxml and sample_02.vxml) to your web server. You just have to change the URL of the landing VoiceXML page (sample_01.vxml) to point to your web server.

Also, this is of course a very rough version of GuiceXML, where I basically put hooks here and there in order to intercept and react to a few events I'm interested in, such as announcements and input collection. In fact, in one place I had to resort to a dirty trick of inspecting the execution stack trace, just to figure out whether a prompt should be played as part of input collection (i.e., inside a `<field>`) or as an informational message (i.e., inside a `<block>`). If it's part of input collection, GuiceXML prints it in the text area inside the frame. Otherwise, it displays a blocking message dialog.
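For the curious, the dirty trick boils down to something like the sketch below (class and method names are mine, not the actual GuiceXML code): walk the current thread's stack trace and guess the prompt's context from which interpreter class is currently executing.

```java
// Sketch of the stack-trace "dirty trick" described above. Brittle by
// design: it couples the adaptor to the internal class names of the
// interpreter, which is exactly why it's a smell worth refactoring away.
final class PromptContext {

    /** True if any frame on the current stack belongs to a class whose
     *  name contains the given fragment. */
    static boolean calledFrom(String classNameFragment) {
        for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
            if (frame.getClassName().contains(classNameFragment)) {
                return true;
            }
        }
        return false;
    }
}
```

Usage would be along the lines of `calledFrom("FieldFormItem")` (a hypothetical interpreter class name) to decide whether the prompt belongs to input collection.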

Actually, that dirty trick, along with the modifications I made directly in the JVoiceXML code, leads me to think that the JVoiceXML code itself needs some refactoring. Ideally, to implement an adaptor / extension like GuiceXML, one shouldn't need to modify the core library (JVoiceXML); the hooks provided by the library should be sufficient. However, that's not the case with JVoiceXML 0.7.5.

Refactoring is also needed in JVoiceXML to abstract away several classes that the current GuiceXML depends on. For example, JVoiceXML's TextTelephony (used by GuiceXML's TextDriverTelephony) communicates with its peer classes over a socket. That's not desirable in GuiceXML, because the idea is to have GuiceXML running in a mobile environment (e.g., a J2ME / Android phone) and to minimize the use of network resources (only HTTP will be needed, to fetch the VoiceXML documents from the app server that hosts the IVR application). TextTelephony, I think, should be coded in such a way that the I/O channel it uses can be injected, with the I/O channel itself described as a Java interface, of which a network socket is only one possible implementation. Click here to understand the role of Telephony in JVoiceXML.
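The refactoring I have in mind looks roughly like this. All the names below are my own, not actual JVoiceXML types; the point is that the telephony class names an interface, not a transport, so a socket becomes just one possible implementation and an in-process channel (what a mobile host wants) becomes another:

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

// The I/O channel as a plain interface: a network socket would be only
// one of the possible implementations.
interface IoChannel {
    void send(String message) throws IOException;
    String receive() throws IOException;
}

// In-process implementation for a mobile/GUI host: no sockets involved.
class InMemoryChannel implements IoChannel {
    private final Deque<String> queue = new ArrayDeque<>();
    @Override public void send(String message) { queue.addLast(message); }
    @Override public String receive() { return queue.pollFirst(); }
}

// A telephony class coded this way never mentions a concrete transport.
class InjectableTextTelephony {
    private final IoChannel channel;

    InjectableTextTelephony(IoChannel channel) {
        this.channel = channel;
    }

    void play(String prompt) throws IOException { channel.send(prompt); }
    String collect() throws IOException { return channel.receive(); }
}
```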

An additional note: this visual IVR idea can be combined with callbacks. I often get frustrated dealing with a contact center over the phone (voice), for two reasons: (1) the tedious IVR, and (2) the long wait to speak to an agent (when the call gets transferred at the end of the IVR interaction). With visual IVR, the interaction can be sped up (and companies can reuse their existing IVR-app investment), and the caller doesn't have to wait to be attended by an agent: the contact center agent is the one who makes the call back.

Here's a picture of a callback app concept I sketched a while back, and mentioned in this blog:

Alright, that's all for now. Drop me a line if you're interested in the project. ( raka.angga _at_ ). Cya!