At Stradigi AI, our best ideas often stem from collaboration between departments, and the conception of our American Sign Language (ASL) Alphabet Game was no different.
Developed as a demo for the NeurIPS conference, it took creativity, brain power, and meticulous coordination to bring this idea to life in a very short time.
What we are presenting today is how our project came to life, and how this demo, rooted in our mission to promote AI for Good, advocates the idea that artificial intelligence really can make people’s lives better.
Here’s how the ASL Alphabet Game came to be:
A Demo-cratic Idea
Like any good story, we’ll start at the very beginning. A few months ago, Carolina Bessega, Stradigi AI’s Chief Scientific Officer, announced during a weekly team breakfast that we would be attending NeurIPS. She challenged us to come up with an idea for a demo with three defining characteristics: it had to be realistic, interactive, and, of course, useful.
In order to make an unbiased and democratic decision, a selection committee was assembled from various departments across the company. For the selection process, each proposed idea was discussed and rated based on four main criteria:
- How appealing the demo would be;
- How complex it would be to execute;
- Whether we could reuse it in the future (can we build on top of it and create a product that extends beyond NeurIPS?);
- Whether the idea would have a practical application.
We received 18 (!) submissions, all of which were added to a presentation circulated amongst the selection committee. There was no mention of who had come up with each idea, only a reference number and a short description. This, by design, kept bias out of the process.
The selected idea, submitted by Ben Tang, a Business Analyst here at Stradigi AI, served as the starting point for what would eventually become the final product: a sign language alphabet learning aid and translation tool in the form of a game. A player would race against a timer and attempt to win as many points as possible based on how many signs they could match correctly.
We knew that there were already a number of ASL alphabet demos that were using AI, but they only showcased a model. We wanted to create something interactive where the user objective and UX would promote learning. The model is not center stage, but it is crucial in the overall experience.
Given the project’s tight development timeline, we decided to focus on the ASL alphabet, since it required clear detection of a single hand, simplifying both the data gathering process and the training of the model. The gamification component would come later, with points awarded based on letters completed and speed.
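The gamification described above (points for letters completed, plus a speed bonus) can be sketched as a small scoring function. This is a hypothetical illustration: `score_round` and its weights are stand-ins of our own, not the game’s actual formula.

```python
def score_round(letters_completed: int, elapsed_seconds: float,
                base_points: int = 100, time_bonus_cap: int = 50) -> int:
    """Score one round: points per correctly matched letter, plus a speed bonus.

    The weights here are illustrative, not the real game's values.
    """
    letter_points = letters_completed * base_points
    # Faster rounds earn a larger bonus, capped so that speed
    # never outweighs accuracy.
    speed_bonus = max(0, time_bonus_cap - int(elapsed_seconds))
    return letter_points + speed_bonus

print(score_round(letters_completed=5, elapsed_seconds=12.0))  # → 538
```

Capping the speed bonus keeps the incentive on signing correctly first, quickly second, which matches the learning-first goal of the game.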
The Technical Process
Choosing the Right Programs
Rosalia Stephany and Yaser Mohammadtaheri are two members of our Research team whose areas of expertise are Image and Video Understanding. They were tasked with creating our model and began the project by selecting the technologies needed to bring it to life. The goal was not to demonstrate our full technological stack, but to create a teaser of the idea.
Here’s what they chose:
- Convolutional pose machines for hand pose estimation;
- A CNN for image classification;
- AutoML for hyperparameter optimization.
They first had to figure out how to detect where the hand appeared in the frame, so that the model would know where to look when recognizing the 26 signs of the ASL alphabet. For our game, this detection needed to be accurate, smooth (we needed a hand position in every frame, because missed detections would cause problems), and real-time. They decided to use convolutional pose machines, which worked but were far too slow, so they reduced the number of stages from the original six to three.
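The stage-wise structure of convolutional pose machines, and why cutting stages saves compute, can be sketched in plain Python. This is a structural toy, not the real network: `toy_stage` stands in for the convolutional layers of one refinement stage.

```python
from typing import Callable, List

def run_pose_machine(image_features: List[float],
                     stage_fn: Callable[[List[float], List[float]], List[float]],
                     num_stages: int = 3) -> List[float]:
    """Stage-wise refinement, as in convolutional pose machines: each stage
    re-estimates the belief map from the image features plus the previous
    stage's output. Fewer stages means proportionally less compute per frame.
    """
    belief = [0.0] * len(image_features)  # initial (empty) belief map
    for _ in range(num_stages):
        belief = stage_fn(image_features, belief)
    return belief

# Toy stage: nudge the belief map halfway toward the feature values.
def toy_stage(features, prev):
    return [0.5 * f + 0.5 * p for f, p in zip(features, prev)]

print(run_pose_machine([1.0, 0.0], toy_stage, num_stages=3))  # → [0.875, 0.0]
```

The sketch shows the trade-off the team made: each extra stage refines the estimate a little more, but at three stages the refinement was already good enough for smooth real-time play.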
In order to classify the direction of the hand, they tested several state-of-the-art classification CNNs, but ultimately decided to design and optimize their own convolutional network for the task.
By applying our own AutoML algorithm with the help of our researcher Zhi Chen, we optimized the architecture designed by our Image and Video Understanding team, resulting in improved accuracy and speed compared to other classification networks.
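Stradigi’s AutoML algorithm is in-house and its internals aren’t described here, so as a stand-in, here is what a minimal random-search hyperparameter loop looks like. The objective and search space are dummies; a real run would score each configuration by training and validating a candidate network.

```python
import random

def random_search(objective, search_space, n_trials=50, seed=0):
    """Minimal random-search hyperparameter optimization: sample
    configurations uniformly and keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in search_space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Dummy objective standing in for validation accuracy: peaks at
# 4 layers and a learning rate of 1e-3.
def dummy_objective(cfg):
    return -abs(cfg["layers"] - 4) - abs(cfg["lr_exp"] + 3)

space = {"layers": [2, 3, 4, 5, 6], "lr_exp": [-2, -3, -4]}
best, score = random_search(dummy_objective, space)
print(best, score)
```

Random search is only the simplest member of the AutoML family; the same loop shape underlies more sophisticated strategies, which replace uniform sampling with a model of which configurations are promising.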
Our team of developers, Inna Gutnik and Simon Silvain, got to work and started with a hosted-app concept: a public URL pointing to a VueJS app, connected to the ML model through a hosted web service over the network.
We discovered that the real-time user experience was greatly affected by the speed of the internet connection. We knew that NeurIPS was expecting over 8,000 attendees, so we looked for an alternate solution to guard against an unreliable network.
In the second version, we switched to a desktop application. The Python service had direct access to the webcam, and the web application, hosted on the local machine, streamed from it directly. We experienced no network delays, and also discovered that only a machine with a dedicated graphics card could deliver enough power for fast ML processing of the webcam images.
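The local-service architecture can be sketched with the standard library alone. This is an illustrative skeleton, not the actual demo code: `predict_sign` stubs out the real model, and a real app would read frames from the webcam (e.g. via OpenCV) rather than passing `None`.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

def predict_sign(frame):
    """Stub for the real model; in the actual app the frame
    would come from the local webcam."""
    return {"letter": "A", "confidence": 0.97}

class PredictionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Everything stays on localhost, so there is no
        # conference-network hop in the request path.
        body = json.dumps(predict_sign(frame=None)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

# Port 0 asks the OS for any free port on the local machine.
server = HTTPServer(("127.0.0.1", 0), PredictionHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

with urlopen(f"http://127.0.0.1:{server.server_port}/predict") as resp:
    result = json.loads(resp.read())
server.shutdown()
print(result)
```

Because the front end and the model talk over the loopback interface, latency becomes a function of local compute (hence the dedicated graphics card) rather than of the venue’s Wi-Fi.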
The Data Gathering Process
To train our models, we first looked at what images of hand signs were publicly available. However, the images were plagued with bad lighting and inconsistencies in the signs (in spoken language, these might be thought of as “accents”). Another troubling aspect was the lack of diversity of hands. As a company, we actively work to ensure that the algorithms we create are as unbiased as possible. We would eventually come up with some pretty creative ways of broadening our sample beyond our office walls.
It became very clear that we needed to gather a lot of good, clean, and diverse data. An added benefit of being part of a company going through hypergrowth is the incredible number of enthusiastic hands at our disposal. When we reached out to invite our colleagues to help us train our model, everyone welcomed the chance to be a part of our project.
Some of our initial gameplay challenges included:
- Players had difficulty telling which direction the gestures for “O” and “C” were facing;
- Choosing between mirrored and actual graphics of the hands was difficult. We had to decide whether the icons for each sign should be shown from the participant’s perspective or the camera’s. Ultimately, user experience trumped consistency, since we valued people understanding and enjoying the game over appearance;
- It was often hard for players to determine which direction the instructional hand was facing. To solve this, we added fingernails to the hands; acting as a visual cue, the nails helped people figure out what the gesture was communicating;
- The complexity of gesturing against different backgrounds was also causing problems. To solve this, we had our volunteers sign in front of a white wall.
Adding to all this was our swift realization that none of us were American Sign Language experts. For the sake of inclusiveness and solidarity, we needed to consult people who actually use ASL every day.
Elitsa Asenova, our Product Owner and QA on the project, contacted the Deaf Anglo Literacy Center (DALC), a Montreal organization serving the Deaf and Deaf-Blind communities, which volunteered to be part of our training data and graciously offered to answer any questions we might have. Our visit there was very productive: we were able to clarify our sign designs and got realistic training data from a verified source, which we then replicated in our lab with more volunteers. Perhaps even more importantly, we wanted to build a relationship and show ourselves as allies.
What we later realized was the important (and somewhat unexpected) impact this collaboration had on both us and them. We were thrilled to see that what we were working on could have a very positive impact, and this put our project into perspective. Respect became the driving force in the next steps of our development. It was crucial that our designs communicated their language as accurately as possible, so that we could properly replicate it in our lab setting and share it with a wider audience with care and sensitivity.
The best feedback we received was from someone who said they wished they had this technology when they were first learning ASL. The positive and encouraging reaction to our demo made us realize that what you create and what you put out in the world truly can change and improve people’s lives. This is a lesson we would not have learned without leaving our lab and having this human to human interaction.
Our UI/UX designer, Anne-Marie Lafontaine, had a lot to consider when designing the user interface of the game. A pixel-style monospace secondary font made it simple to adopt a ’90s-inspired theme.
The first iterations of the in-game UI were too sparse and lacked structure. They did not give the player clear direction, lacking a visual path to follow. After we decided to use sentences for the gameplay instead of single words, the UI had to be rethought.
Signs and Cards
The signs were tested multiple times with potential players to ensure their clarity, essentially making sure the icons communicated how to position one’s hand to successfully recreate the sign for the camera. We also checked the accuracy of the icons with the help of an ASL expert to ensure they would be just right.
The letters “J” and “Z” presented a challenge, as they require motion (unlike the other letters, which are static), which our model was not designed to handle. After consulting the DALC, we were told that the “still” version (essentially the final position of the movement) would not be offensive, so we moved forward with the change.
The decision to add sentences to the game meant displaying them on screen in addition to the active word being played. Trying to display full sentences with the cards and signs would have made the UX overwhelming. To keep focus on the active word, we decided to display the sentence separately, below the active word (as in karaoke), with an indicator of where the active word sits in the sentence.
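The karaoke-style indicator can be illustrated with a tiny text-mode sketch. `render_karaoke_line` is hypothetical (the real game drew this graphically), but it shows the idea of marking the active word within the sentence.

```python
def render_karaoke_line(sentence: str, active_index: int) -> str:
    """Render the sentence below the active word, karaoke-style,
    with a caret line marking which word is currently in play."""
    words = sentence.split()
    line = " ".join(words)
    # Offset of the active word within the joined line:
    # the lengths of all preceding words plus one space each.
    start = sum(len(w) + 1 for w in words[:active_index])
    marker = " " * start + "^" * len(words[active_index])
    return line + "\n" + marker

print(render_karaoke_line("THE QUICK BROWN FOX", active_index=2))
# THE QUICK BROWN FOX
#           ^^^^^
```

Keeping the full sentence on one line with only a positional marker mirrors the design goal: the player’s attention stays on the active word, while the sentence provides context without competing for focus.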
Building this project only scratched the surface of what we are capable of doing and the technology that is available. Ultimately, the goal is to expand the model to recognize two hands, as well as movement, to open up the possibility of having a complete sign language tool.
We would like to thank our friends at the DALC for their participation and encouragement. Also, we want to give a shout-out to our Project Manager Christian Bisson and congratulate the entire ASL team for their hard work!