Coding by voice with Python

Time to read: 8 minutes

In the beginning...

I am not a programmer by trade, but rather a chemical engineer.

I picked up programming for fun in 2014 while at University, and well, I liked it so I kept on doing it.

In May 2018 I got my first programming job, building a full-stack e-commerce platform with some of the most talented people I have ever met.

But in July 2018, I got a Repetitive Strain Injury (RSI). I spent 7 months recovering, but I only spent 5 days off work.

This blog is about RSI, coding by voice, and general well-being.

TL;DR

Let's begin with some TypeScript.

const title = "Coding by voice"

[GIF: coding example]

Below is a full list of the tools that made me productive when I had limited use of a keyboard and mouse:

  • Aenea - A dictation toolbox that enables control of Linux machines from a remote Windows host.
    • I keep my own fork so that I always have access to it, and I keep my own grammars for common tasks in there too.
  • Vimium - A Chrome extension that enables Vim-style navigation in the browser.
    • Vimium puts a short character hint on every hyperlink, and you call the characters out using a phonetic alphabet to navigate.
  • VSCode - I still used a rich IDE. In fact, I would recommend not changing from the tools you are most comfortable with if you can.

Setup

Aenea was fundamentally the key to my success. Without it, I do not know where I would be.

The Aenea server runs locally on a Linux machine, Ubuntu in my case, while a companion client runs inside a Windows Virtual Machine (VM). It's outside the scope of this blog to describe the system architecture, so why not enjoy the GIF below of me using the system instead?

[GIF: terminal folder navigation]

Aenea was easy to set up in my case, but I can see how it may prove difficult. I identified early on that the VM was the complex, value-adding component, so I have it backed up on a physical disk and in the cloud. Now I can run the VM on any device from my super fast USB stick and pull my GitHub repository of Aenea and my grammars. I can set up my voice coding environment on any device in just 10 minutes.

Grammar Development

Grammar(s) - A collection of voice commands that map to machine commands or macros to execute. (My own definition)

The grammar is the important part and, in my experience, each user seems to develop their own.

It is not simple to develop, because there are a number of interweaving concerns:

  1. The language you use/speak:

    1. In my case English. It is not an infinite set, so the possibilities are constrained. In reality, your imagination will run dry before you reach the limits of this set.
  2. The programs you use:

    1. Aenea is great for this. It has deep integration with Linux and can determine the application context. Here is an example of how my Ruby grammar works:

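      from aenea import *
      
      # Active only when the focused window looks like a Ruby file.
      ruby_context = aenea.wrappers.AeneaContext(
          # Context checked on the Linux side: case-insensitive match on the
          # window title. The dot is escaped so only a literal ".rb" matches.
          ProxyAppContext(match='regex', title=r'(?i).*\.rb.*'),
          # Context checked locally on the Windows side.
          AppContext(executable='Ruby')
          )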

      This means that my Ruby-specific grammar is only active when I am inside Ruby files. I don't advocate having the same command do different things in different contexts, but having specific contexts limits the possible outcomes for the recognition engine running in the Windows virtual machine, which means it is less likely to misinterpret your intent.
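To get a feel for how a title pattern like this narrows the context, it can be tried against a few window titles with Python's standard re module. This is only a sketch: the window titles are made up, and applying the pattern with re.match is an assumption for illustration, not a statement about how Aenea evaluates it internally. It compares a pattern with an unescaped dot against one with an escaped dot, because an unescaped dot matches any character.

```python
import re

# Two variants of a "Ruby file" title pattern. The unescaped dot matches
# any character; the escaped dot only matches a literal ".".
LOOSE_PATTERN = r'(?i).*(.rb).*'
STRICT_PATTERN = r'(?i).*\.rb.*'

# Made-up window titles, purely for illustration.
TITLES = [
    'user.rb - shop - Visual Studio Code',
    'USER_SPEC.RB - shop - Visual Studio Code',
    'herbs.txt - notes - Visual Studio Code',
    'README.md - shop - Visual Studio Code',
]


def matching_titles(pattern, titles):
    """Return the titles that the pattern matches from the start."""
    return [title for title in titles if re.match(pattern, title)]


loose_hits = matching_titles(LOOSE_PATTERN, TITLES)
strict_hits = matching_titles(STRICT_PATTERN, TITLES)

for title in TITLES:
    print(title, '| loose:', title in loose_hits, '| strict:', title in strict_hits)
```

The loose pattern also accepts "herbs.txt" because "erb" satisfies ".rb", while the strict pattern only accepts the two Ruby files.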

What makes a grammar

In 2018 I was predominantly a Ruby on Rails developer, so I had some handy utterances I could drop into my microphone. If I was in the Linux terminal, the commands below would be recognised; otherwise, they would do nothing.

# Voice command mappings for the application's command-line scripts

from aenea import *

class ShiftRule(MappingRule):
    # Spoken phrase -> keystrokes sent to the terminal.
    # Text(...) + Key('enter') runs the command straight away, while a
    # Text('... ') with a trailing space leaves room to dictate arguments.
    mapping = {
        # Launching the application
        'shifty boot': Text('./bin/dev/boot') + Key('enter'),
        'shifty yawn': Text('./bin/dev/yarn '),
        'shifty post': Text('psql -p 5432 -h localhost -U postgres'),
        'shifty console': Text('./bin/dev/console') + Key('enter'),
        'shifty test': Text('./bin/dev/rspec '),
        'shifty rails': Text('./bin/dev/rails '),
        'shifty bundle': Text('./bin/dev/bundle '),
        'shifty break': Text('./bin/dev/rake '),
    }
    # Placeholders available to commands (not used by the mappings above)
    extras = [
        Dictation("text"),
        IntegerRef("n", 1, 100),
    ]
    defaults = {
        "n": 1,
    }
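For readers without a dictation setup, the idea behind a mapping can be modelled in plain Python. The sketch below is a hypothetical stand-in, not Aenea or Dragonfly code: COMMANDS and keystrokes_for are names invented for illustration, and a newline stands in for pressing Enter. It shows the difference between a command that runs immediately and one that leaves a trailing space so arguments can follow.

```python
# Hypothetical stand-in for a mapping rule: each spoken phrase maps to the
# text to type and whether Enter is pressed afterwards.
COMMANDS = {
    'shifty boot': ('./bin/dev/boot', True),
    'shifty test': ('./bin/dev/rspec ', False),
    'shifty rails': ('./bin/dev/rails ', False),
}


def keystrokes_for(utterance, commands=COMMANDS):
    """Return the characters that would be sent, or None if not recognised."""
    entry = commands.get(utterance.lower().strip())
    if entry is None:
        return None  # unknown phrases do nothing
    text, press_enter = entry
    return text + ('\n' if press_enter else '')


print(repr(keystrokes_for('shifty boot')))
print(repr(keystrokes_for('shifty test')))
print(repr(keystrokes_for('open the pod bay doors')))
```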

Flow State

As a programmer you may want to attain a flow state, or not. It's up to you!

Popularized by positive psychologists Mihaly Csikszentmihalyi and Jeanne Nakamura, flow state describes a feeling where, under the right conditions, you become fully immersed in whatever you are doing. (Source: headspace.com)

But when you have to adapt to a new situation, it can be difficult to find that familiar rhythm of productivity.

For me, the approach was all or nothing, and one of the first milestones was being able to extend my voice coding grammars with my voice. This meant that any time a new, repetitive task came up, I could code a new macro for it by voice and then say my keyword, "Force field". This would reload the grammars within the remote Windows machine in under 2 seconds, and I would have access to the new command I needed.
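The edit-then-reload loop can be sketched in plain Python. This is a simplified, hypothetical model rather than how Aenea reloads grammars: GrammarRegistry, the file name shift_grammar.py, and the use of runpy to re-read a mapping from disk are all assumptions made for illustration. Only the reload keyword comes from my setup.

```python
import runpy
import tempfile
from pathlib import Path

RELOAD_PHRASE = 'force field'  # the reload keyword


class GrammarRegistry:
    """Hypothetical registry that re-reads its mapping from a Python file."""

    def __init__(self, grammar_path):
        self.grammar_path = Path(grammar_path)
        self.mapping = {}
        self.reload()

    def reload(self):
        # run_path executes the file from source each time, so edits made
        # since the last load are picked up.
        namespace = runpy.run_path(str(self.grammar_path))
        self.mapping = dict(namespace['mapping'])

    def handle(self, utterance):
        """Return the text for a phrase, 'reloaded' for the keyword, else None."""
        if utterance == RELOAD_PHRASE:
            self.reload()
            return 'reloaded'
        return self.mapping.get(utterance)


# Usage: start with one command, add a second to the file, then reload.
workdir = Path(tempfile.mkdtemp())
grammar_file = workdir / 'shift_grammar.py'
grammar_file.write_text("mapping = {'shifty boot': './bin/dev/boot'}\n")

registry = GrammarRegistry(grammar_file)
before = registry.handle('shifty rails')  # not defined yet

grammar_file.write_text(
    "mapping = {'shifty boot': './bin/dev/boot',\n"
    "           'shifty rails': './bin/dev/rails '}\n"
)
status = registry.handle(RELOAD_PHRASE)
after = registry.handle('shifty rails')
print(before, status, repr(after))
```

The new command only becomes available after the reload keyword is handled, which mirrors the workflow described above.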

The future of voice coding

I would love to see this level of accessibility begin to bleed into modern Integrated Development Environments.

My own pipe dream is an editor, like VSCode, where IntelliSense and language servers provide hints when voice coding. Inside VSCode, the intelligence is already in place for predicting what you may type next. Could it be used to help reason about what you are most likely trying to say?

Instead of the up-front investment in grammar development, auto-complete would limit the set of possible values, and constrain the voice model, so we could arrive at likely intent. This would mean more people could dip their toes into coding by voice. Once there is a larger community, there would be faster advancement, and more experimentation, and so on and so forth until coding by voice would be something anyone could try for an hour, enjoy, and go back to the keyboard.
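As a toy illustration of that idea, the sketch below snaps a misheard word to the closest entry in a list of completions. Everything here is an assumption: the completion list is invented, pick_completion is a hypothetical helper, and simple string similarity from Python's difflib stands in for a real speech model, which would work on audio rather than spelling.

```python
import difflib


def pick_completion(heard, completions, cutoff=0.6):
    """Return the completion closest to what was heard, or None."""
    matches = difflib.get_close_matches(heard, completions, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical completions an editor might offer at the cursor.
completions = ['length', 'lower', 'lstrip']

best = pick_completion('lenth', completions)
nothing = pick_completion('banana', completions)
print(best, nothing)
```

Because the candidate set is small, a slightly wrong transcription can still land on the intended name, and an unrelated word is rejected rather than forced into a match.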

I am hopeful of that future.