Dev Patel

introduction

Cua is open-source infrastructure for Computer-Use Agents: sandboxes, SDKs, and benchmarks for training and evaluating AI agents that can control full desktops (macOS, Linux, Windows). After @Francesco Bonacci (Founder of Cua YC X25) reached out to me, my task was to build an Android Docker provider for the Cua Computer SDK. This would make it possible to run Android devices and control them using the existing Cua agent framework.

At first the task seemed daunting but fun, as I had never worked with Android emulators, let alone an Android device. On top of that, it was my first time working with an existing codebase that I had to understand before I could start building on it.

initial outlook

Going into the project, I had no idea what I was getting into. All I knew was that I had to use an existing docker image (https://github.com/budtmo/docker-android) and build an Android Docker provider for the Cua Computer SDK.

In hindsight, the requirements were pretty straightforward. Implement AndroidDockerProvider (which extends BaseVMProvider) and register it in VMProviderFactory.

After getting familiar with how the existing providers (Lume, Lumier, Docker, etc.) were implemented, it was merely a matter of following the same development pattern. Of course, that was easier said than done.

It turns out I had completely forgotten about the specifications and requirements of the Docker image and assumed it would be a simple installation via Docker. Having no clue how Android emulators work, I spent a ton of time reading documentation for the image itself instead of Cua's (I probably spent more time getting the emulator to run than actually implementing the provider!). And that was the least of my struggles. The Docker image requires KVM (Kernel-based Virtual Machine), which is a Linux kernel feature, and I was working on a macOS device with an ARM64 chip rather than one of the older Intel chips, so KVM was not available to me.

Simply put, to run the Docker image itself, I had to be on a machine that had KVM support. That's where my first step into the rabbit hole came. I started off by installing VirtualBox to run a Linux system, all so I could actually start visualizing the emulator...

That clearly didn't work. I had assumed that a Linux virtual machine would have KVM enabled, but nested virtualization wasn't supported on my machine. That was the first time I learned about nested virtualization and how it works!

Learning 1: If you're running a Linux VM on a physical machine and want to use KVM to host additional VMs inside that guest, the physical host must support and enable nested virtualization.

I experimented with VirtualBox, Parallels, and even a remote connection to a Windows machine that Francesco had given me, but none of them worked. It was not until Francesco provisioned a Linux VM for me that I could finally check for KVM support, and it was enabled!

vmuser@dev-linux:~$ lsmod | grep kvm
kvm_intel             479232  0
kvm                  1388544  1 kvm_intel
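The same check can be done from Python before trying to start the container. This is a minimal sketch, not part of the Cua codebase: the function name `kvm_available` is hypothetical, and it only assumes that a usable KVM setup exposes a readable and writable `/dev/kvm` device node on the Linux host.

```python
import os


def kvm_available(device_path: str = "/dev/kvm") -> bool:
    """Return True if the KVM device node exists and is readable and writable.

    Hypothetical helper: it checks the device node only, not whether
    the emulator will actually boot.
    """
    return os.path.exists(device_path) and os.access(device_path, os.R_OK | os.W_OK)
```

If this returns False on a Linux machine, the usual suspects are a missing kernel module, virtualization being disabled in the firmware, or the current user lacking permission on the device.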

android emulator

Now that I was finally in an environment where I could start the actual development work, I had to make sure the emulator at least showed up, even though I had no idea how to do that. Thinking ahead, I saw that Cua itself was built for three OSes: Linux, Windows, and macOS. I figured it would be smart to use a Linux Docker container as my environment for running the Docker image. So, using Docker, I pulled the budtmo image and started it up.

This is where another issue came up. Although the docker-android image was running as seen above, the emulator itself was not showing up. I had no idea what was going on, but I did know that I had to use VNC to connect to the container and run the emulator. Up to this point I had yet to integrate the Cua provider, but I knew that once the emulator was running, I could start on the provider easily, since that would just be a change to the docker run command.

Although multiple emulator devices were available in the Docker image, I saw that a default device had to be set at runtime when starting the container. That's where I figured out how the docker run command would fit into the provider.

docker run --privileged -d \
  -p 6080:6080 \
  -p 5554:5554 \
  -p 5555:5555 \
  -p 5900:5900 \
  -e EMULATOR_DEVICE="Samsung Galaxy S10" \
  -e WEB_VNC=true \
  --device /dev/kvm \
  --name android-container \
  budtmo/docker-android:emulator_11.0

You'll see that there are quite a few flags in the docker run command. They were chosen so that the emulator would appear in the noVNC web interface and so that a few ports would be exposed, which were later used for ADB connections and VNC clients.

I've put together a detailed list of what each parameter of the docker run command does.

docker run | Create and start a new container from an image
--privileged | Grant the container extended Linux capabilities needed for the emulator and nested virtualization features
-d | Run the container detached in the background
-p 6080:6080 | Map host port 6080 to container 6080 to access the web VNC UI in a browser
-p 5554:5554 | Map the emulator console port for emulator control connections
-p 5555:5555 | Map the ADB over TCP port to connect adb from the host to the emulator
-p 5900:5900 | Map native VNC port for standard VNC clients
-e EMULATOR_DEVICE="Samsung Galaxy S10" | Set the emulator device profile to a Galaxy S10 preset inside the image
-e WEB_VNC=true | Enable the noVNC web interface served on port 6080 for browser-based viewing
--device /dev/kvm | Pass through the host's KVM device so the emulator can use hardware virtualization acceleration
--name android-container | Assign a readable name to the container instance
budtmo/docker-android:emulator_11.0 | Use the budtmo/docker-android image variant preconfigured with Android 11 emulator
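To show how these flags could be assembled programmatically, here is a rough sketch that builds the same command as an argument list. The function name `build_docker_run_command` and its parameters are hypothetical and are not taken from the actual provider; only the flags and values come from the command above.

```python
from typing import List, Sequence


def build_docker_run_command(
    image: str = "budtmo/docker-android:emulator_11.0",
    container_name: str = "android-container",
    device_profile: str = "Samsung Galaxy S10",
    ports: Sequence[int] = (6080, 5554, 5555, 5900),
) -> List[str]:
    """Build the docker run argv shown above (hypothetical helper)."""
    cmd = ["docker", "run", "--privileged", "-d"]
    # Publish each port on the same number on the host.
    for port in ports:
        cmd += ["-p", f"{port}:{port}"]
    cmd += [
        "-e", f"EMULATOR_DEVICE={device_profile}",
        "-e", "WEB_VNC=true",
        "--device", "/dev/kvm",
        "--name", container_name,
        image,
    ]
    return cmd
```

Because the result is a list rather than a shell string, it can be passed straight to `subprocess.run(cmd, check=True)` and the space in the device name needs no extra quoting.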

cua implementation

After setting up the emulator, the next step was to implement the provider into the existing factory method. This was pretty straightforward as I had already done the hard testing with execution commands above when I tried running the emulator.

I started off with the dependency check and availability flag in androiddocker/__init__.py, mirroring the provider implementations that were already built.

"""Verify that the Docker dependency is installed.

Sets the HAS_ANDROID flag to True if Docker is available (the factory uses
this flag to check whether the provider can be used).
"""

try:
    import subprocess

    # Only check for Docker - ADB is inside the container, not needed on host
    subprocess.run(["docker", "--version"], capture_output=True, check=True)
    HAS_ANDROID = True
except (subprocess.SubprocessError, FileNotFoundError):
    HAS_ANDROID = False

from .provider import AndroidDockerProvider

__all__ = ["AndroidDockerProvider", "HAS_ANDROID"]

From there, I created the provider class in androiddocker/provider.py. The snippet below shows the constructor, which lists the parameters needed to run the emulator.

def __init__(
    self,
    port: int = 8000,
    host: str = "localhost",
    image: str = "budtmo/docker-android:emulator_11.0",
    verbose: bool = False,
    storage: Optional[str] = None,
    ephemeral: bool = True,
    vnc_port: int = 6080,
    adb_port: int = 5555,
    device_profile: str = "Samsung Galaxy S10",
    **kwargs
):

You'll see that some of the parameters have default values and some are optional. Most of these follow from how providers are set up in the factory method. In particular, I want to talk about the port mappings.

6080 (noVNC): Browser-based VNC viewers like noVNC are typically served from an HTTP endpoint that upgrades to WebSockets and forwards to a VNC server on 5900+x. 6080 is the widely adopted default listen port for that websockify/noVNC endpoint, which makes it easy to remember and consistent across tooling.

5555 (ADB over TCP): Android Debug Bridge uses TCP 5555 by convention for networked devices/emulators; most Android tooling assumes 5555 unless specified, simplifying adb connect :5555 workflows.

8000 (service HTTP API/UI): Lightweight development servers frequently default to 8000 for local APIs and dashboards, avoiding collisions with 80/443 and staying familiar to developers; many Python frameworks use 8000 by default.
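Since these defaults are all popular ports, a collision on the host is plausible, and Docker will refuse to start a container whose published port is already taken. The sketch below is one way to check up front. It is an assumption on my part rather than something the provider does, and the names `port_is_free` and `busy_ports` are hypothetical.

```python
import socket
from typing import Iterable, List


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def busy_ports(ports: Iterable[int], host: str = "0.0.0.0") -> List[int]:
    """Return the subset of ports that cannot currently be bound."""
    return [port for port in ports if not port_is_free(port, host)]
```

For example, `busy_ports([8000, 6080, 5555])` returns the ports that would need to be remapped. The check is inherently racy, since another process can grab a port between the check and the docker run call, so it is a convenience and not a guarantee.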

Additionally, I registered AndroidDocker into the supported VMProvider types and listed it in the factory implementation.

# base.py
class VMProviderType(StrEnum):
    """Enum of supported VM provider types."""

    LUME = "lume"
    LUMIER = "lumier"
    CLOUD = "cloud"
    WINSANDBOX = "winsandbox"
    DOCKER = "docker"
    ANDROID = "android"
    UNKNOWN = "unknown"


# factory.py
# ...
elif provider_type == VMProviderType.ANDROID:
    try:
        from .androiddocker import AndroidDockerProvider, HAS_ANDROID

        if not HAS_ANDROID:
            raise ImportError(
                "AndroidDockerProvider requires Docker to be installed and running. "
                "Please ensure Docker is installed and the Docker daemon is running."
            )
        return AndroidDockerProvider(
            port=port,
            host=host,
            image=image or "budtmo/docker-android:emulator_11.0",
            verbose=verbose,
            **kwargs
        )
    except ImportError as e:
        logger.error(f"Failed to import AndroidDockerProvider: {e}")
        raise ImportError(
            "Cannot use AndroidDockerProvider: Docker is required. "
            "Please install Docker and ensure the Docker daemon is running."
        ) from e

agent workaround

Now this is where the fun begins: actually implementing the actions! This turned out to be much more difficult than expected. Initially, the existing computer agent was supposed to handle it directly, but the Docker image does not come preloaded with pyautogui (which is what the Cua computer agent uses to perform actions). So there needed to be a way to take the natural-language requests the agent receives and convert them into ADB commands that can be executed on the emulator.

Initially, the idea was to create some sort of websocket bridge that would intercept the request and convert it into ADB commands. This turned out to be a challenge because of how the ports are exposed.

In containerized Android environments like budtmo/docker-android, the VNC port structure becomes particularly problematic. These containers typically expose multiple ports - VNC on 5900, noVNC WebSocket proxy on 6080, and ADB on 5555. The issue is that each VNC session requires its own dedicated port (5900+N pattern), and mapping these dynamically through a WebSocket bridge becomes complex when you need to maintain session isolation and handle multiple concurrent connections.

That's where I got the idea to bypass the websocket implementation altogether. This is a quick fix and not the most secure or scalable solution, but it gets the job done. It uses direct docker exec commands to communicate with the Android container: instead of trying to establish a websocket bridge between the application and the container's VNC/ADB services, the code executes ADB commands inside the running container using subprocess calls.
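The docker exec approach described above can be sketched roughly as follows. This is an illustration and not the actual provider code: the helper names are hypothetical, and it assumes `adb` is on the PATH inside the container and already attached to the emulator. The `input tap`, `input swipe`, and `input keyevent` shell commands and the key codes are standard Android ones.

```python
import subprocess
from typing import List

# Standard Android KeyEvent codes.
KEYCODE_HOME = 3
KEYCODE_BACK = 4
KEYCODE_APP_SWITCH = 187  # recent apps


def adb_in_container(container: str, *adb_args: str) -> List[str]:
    """Build the argv that runs `adb <adb_args>` inside the named container."""
    return ["docker", "exec", container, "adb", *adb_args]


def tap(container: str, x: int, y: int) -> List[str]:
    return adb_in_container(container, "shell", "input", "tap", str(x), str(y))


def swipe(container: str, x1: int, y1: int, x2: int, y2: int, duration_ms: int = 300) -> List[str]:
    coords = [str(v) for v in (x1, y1, x2, y2, duration_ms)]
    return adb_in_container(container, "shell", "input", "swipe", *coords)


def key_event(container: str, keycode: int) -> List[str]:
    return adb_in_container(container, "shell", "input", "keyevent", str(keycode))


def run(cmd: List[str]) -> str:
    """Execute a built command and return its stdout; raises on a non-zero exit."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

With a container named android-container, `run(tap("android-container", 640, 360))` would tap the middle of a 1280x720 screen. Keeping command construction separate from execution makes the builders easy to test without Docker.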

This pipeline is similar to one I currently use in my work at Fidelity Investments for automating script calls using NLP.

system_prompt = f"""You are an Android automation assistant. Convert user requests into ADB commands.

You can SEE the Android screen in the image provided. Analyze what's visible and determine the correct actions.

Available ADB functions (call these directly):
- home() - Go to home screen
- back() - Press back button
- recents() - Show recent apps
- open_app(package) - Open app by package name (e.g., "com.android.settings")
- open_url(url) - Open URL in browser (automatically adds https:// if missing)
- tap(x, y) - Tap at coordinates (screen is {screen_width}x{screen_height})
- swipe(x1, y1, x2, y2, duration) - Swipe gesture (screen is {screen_width}x{screen_height})
- type_text(text) - Type text
- key_event(keycode) - Send key event (66=Enter, 67=Backspace)

IMPORTANT: Look at the image to find UI elements. The screen resolution is {screen_width}x{screen_height}. When you see UI elements in the image, provide coordinates that match this resolution. If user says "tap the 3 dots top right", look at the image, find the 3 dots icon, estimate its coordinates relative to {screen_width}x{screen_height}, and tap there.

Common package names:
- Settings: com.android.settings
- Chrome: com.android.chrome
- Calculator: com.android.calculator2

For URLs: You can use domain names directly (e.g., "nvidia.com" or "www.nvidia.com"). The system will add https:// automatically.

You can execute MULTIPLE commands in sequence! For example, "open chrome and go to nvidia website":
[
  {{"function": "open_app", "args": {{"package": "com.android.chrome"}}}},
  {{"function": "open_url", "args": {{"url": "nvidia.com"}}}}
]

Another example with navigation:
[
  {{"function": "home"}},
  {{"function": "open_app", "args": {{"package": "com.android.settings"}}}},
  {{"function": "tap", "args": {{"x": 640, "y": 360}}}}
]

Respond with a JSON array of commands to execute. Only return the JSON array, nothing else."""
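On the receiving end, the model's reply has to be parsed and checked before anything is executed. The following is a minimal sketch of that step under my own assumptions: the names `parse_commands` and `normalize_url` are hypothetical, and the allowed function list is copied from the prompt above. It also shows the "adds https:// if missing" behavior that the prompt promises.

```python
import json
from typing import Any, Dict, List

# Function names the prompt above advertises to the model.
ALLOWED_FUNCTIONS = {
    "home", "back", "recents", "open_app", "open_url",
    "tap", "swipe", "type_text", "key_event",
}


def parse_commands(response_text: str) -> List[Dict[str, Any]]:
    """Parse the model's JSON array and reject anything outside the allow-list."""
    commands = json.loads(response_text)  # raises ValueError on invalid JSON
    if not isinstance(commands, list):
        raise ValueError("expected a JSON array of commands")
    parsed = []
    for item in commands:
        if not isinstance(item, dict):
            raise ValueError(f"command must be an object: {item!r}")
        name = item.get("function")
        if not isinstance(name, str) or name not in ALLOWED_FUNCTIONS:
            raise ValueError(f"unknown function: {name!r}")
        args = item.get("args", {})
        if not isinstance(args, dict):
            raise ValueError(f"args must be an object: {args!r}")
        parsed.append({"function": name, "args": args})
    return parsed


def normalize_url(url: str) -> str:
    """Prefix https:// when the URL has no http(s) scheme."""
    if url.startswith(("http://", "https://")):
        return url
    return "https://" + url
```

Validating against an allow-list matters here because the output of a language model ends up as commands run inside the container, so anything unexpected should fail loudly instead of being executed.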

Here's a quick demo below!

ending remarks

Wrapping up this project, I've honestly learned a ton by getting the Android Docker system up and running. Messing around with container setup, emulator stuff, and figuring out all the quirks along the way was challenging, but actually pretty fun. I picked up a lot about how everything fits together under the hood, and now I feel way more confident dealing with this kind of tech. Overall, it's been a great hands-on experience, diving into computer use agents and Cua in general!