A while ago, I was really blown away by the results of an experiment using the TensorFlow object detection API to track hands in an image. I made the trained model and source code available, and since then it has been used to prototype some rather interesting use cases (a tool to help kids spell, extensions to predict sign language, hand ping pong, etc.). However, while many individuals wanted to experiment with the trained model, a large number still had issues setting up Tensorflow (installation, TF version issues, exporting graphs, etc.). Luckily, Tensorflow.js addresses several of these installation/distribution issues, as it is optimized to run in the standardized environment of browsers. To this end, I created Handtrack.js as a library that allows developers to quickly prototype hand/gesture interactions powered by a trained hand detection model.
Runtime:
- 22 FPS on a Macbook Pro 2018 (2.2 GHz), Chrome browser.
- 13 FPS on a Macbook Pro 2014 (2.2 GHz).
The goal of the library is to abstract away the steps associated with loading the model files, provide helpful functions, and allow a user to detect hands in an image without any ML experience. You do not need to train a model (you can if you want). You do not need to export any frozen graphs or saved models. You can just get started by including handtrack.js in your web application (details below) and calling the library methods.
An interactive demo built using Handtrack.js is here, and the source code is on GitHub here. Love tinkering in Codepen? Here's a handtrack.js example pen you can modify.
How Do I Use It in a Web App?
You can use handtrack.js simply by including the library URL in a script tag or by importing it from npm using build tools.
Using Script Tag
The Handtrack.js minified JavaScript file is currently hosted on jsDelivr, a free open source CDN that lets you include any npm package in your web application.
```html
<script src="https://cdn.jsdelivr.net/npm/handtrackjs/dist/handtrack.min.js"></script>
```
Once the above script tag has been added to your html page, you can reference handtrack.js using the handTrack variable as follows.
```javascript
const img = document.getElementById('img');

handTrack.load().then(model => {
  model.detect(img).then(predictions => {
    console.log('Predictions: ', predictions); // bbox predictions
  });
});
```
The snippet above prints out bounding box predictions for an image passed in via the img tag. By submitting frames from a video or camera feed, you can then “track” hands in each frame (you will need to keep state of each hand as frames progress).
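As a sketch of that idea, the loop below repeatedly calls detect() on a media source and throttles with setTimeout so predictions do not flood the main thread. The runDetectionLoop helper and its intervalMs parameter are my own illustration, not part of the handtrack.js API:

```javascript
// Illustrative sketch (not part of handtrack.js): poll a media source for
// hand predictions at a fixed interval. `source` can be a video element.
function runDetectionLoop(model, source, onPredictions, intervalMs = 100) {
  let running = true;
  function step() {
    if (!running) return;
    model.detect(source).then(predictions => {
      onPredictions(predictions);
      if (running) setTimeout(step, intervalMs); // throttle predictions
    });
  }
  step();
  return () => { running = false; }; // returns a function that stops the loop
}
```

In a real page you would pass the model returned by handTrack.load() and a video element; lowering intervalMs trades UI responsiveness for smoother tracking.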
Using NPM
You can install handtrack.js as an npm package using the following:

```shell
npm install --save handtrackjs
```
An example of how you can import and use it in a React app is given below.
```javascript
import * as handTrack from 'handtrackjs';

const img = document.getElementById('img');

// Load the model.
handTrack.load().then(model => {
  console.log("model loaded");
  // Detect objects in the image.
  model.detect(img).then(predictions => {
    console.log('Predictions: ', predictions);
  });
});
```
Handtrack.js API
Several methods are provided. The two main ones are load(), which loads a hand detection model, and detect(), which returns predictions.

load() accepts optional model parameters that let you control the performance of the model. It loads a pretrained hand detection model in the webmodel format (also hosted via jsDelivr).

detect() accepts an input source parameter (an html img, video or canvas object) and returns bounding box predictions on the location of hands in the image.
```javascript
const modelParams = {
  flipHorizontal: true,   // flip e.g. for video
  imageScaleFactor: 0.7,  // reduce input image size
  maxNumBoxes: 20,        // maximum number of boxes to detect
  iouThreshold: 0.5,      // IoU threshold for non-max suppression
  scoreThreshold: 0.79,   // confidence threshold for predictions
};

const img = document.getElementById('img');

handTrack.load(modelParams).then(model => {
  model.detect(img).then(predictions => {
    console.log('Predictions: ', predictions);
  });
});
```
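To make the IoU and score thresholds concrete: the sketch below computes intersection-over-union for two boxes in the [x, y, width, height] format used by the library, and filters predictions by score. These helpers are my own illustration, not part of the handtrack.js API:

```javascript
// Illustrative helpers (not part of handtrack.js) showing what the
// IoU and score threshold parameters measure.
function iou(boxA, boxB) {
  const [ax, ay, aw, ah] = boxA;
  const [bx, by, bw, bh] = boxB;
  // Width and height of the overlap rectangle between the two boxes.
  const ix = Math.max(0, Math.min(ax + aw, bx + bw) - Math.max(ax, bx));
  const iy = Math.max(0, Math.min(ay + ah, by + bh) - Math.max(ay, by));
  const intersection = ix * iy;
  const union = aw * ah + bw * bh - intersection;
  return union > 0 ? intersection / union : 0;
}

// Keep only predictions at or above the confidence threshold.
function filterByScore(predictions, scoreThreshold) {
  return predictions.filter(p => p.score >= scoreThreshold);
}
```

During non-max suppression, two detections whose boxes have an IoU above the threshold are treated as duplicates, and only the higher-scoring one is kept.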
Prediction results are of the form:

```javascript
[{
  bbox: [x, y, width, height],
  class: "hand",
  score: 0.8380282521247864
}, {
  bbox: [x, y, width, height],
  class: "hand",
  score: 0.74644153267145157
}]
```
Other helper methods are also provided:

- model.getFPS(): get FPS calculated as number of detections per second.
- model.renderPredictions(predictions, canvas, context, mediasource): draw bounding boxes (and the input mediasource image) on the specified canvas.
- model.getModelParameters(): returns model parameters.
- model.setModelParameters(modelParams): updates model parameters.
- dispose(): delete model instance.
- startVideo(video): start camera video stream on the given video element. Returns a promise that can be used to check whether the user granted video permission.
- stopVideo(video): stop video stream.
Library Size and Model Size
- Main library: relatively large, because it is bundled with the tensorflow.js library (there are some open issues with recent versions that break the library).
- Models: ~18 MB. This is what causes the initial wait when the page is loaded. Tensorflow.js webmodels are typically sharded into multiple files (in this case four 4.2 MB files and one 1.7 MB file).
How it Works
Underneath, Handtrack.js uses the Tensorflow.js library, a flexible and intuitive API for building and training models from scratch in the browser. It provides a low-level JavaScript linear algebra library and a high-level layers API.
Creating the Handtrack.js Library
Data Assembly
The data used in this project is primarily from the Egohands dataset. It consists of 4800 images of the human hand with bounding box annotations in various settings (indoor, outdoor), captured using a Google Glass device.
Model Training
A model is trained to detect hands using the Tensorflow Object Detection API. For this project, a Single Shot MultiBox Detector (SSD) was used with the MobileNetV2 architecture. Results from the trained model were then exported as a savedmodel. Additional details on how the model was trained can be found here and on the Tensorflow Object Detection API github repo.
Model Conversion
Tensorflow.js provides a model conversion tool that allows you to convert a savedmodel trained in Tensorflow python to the Tensorflow.js webmodel format that can be loaded in the browser. The process mainly consists of mapping operations in Tensorflow python to their equivalent implementation in Tensorflow.js. It makes sense to inspect the saved model graph to understand what is being exported. Finally, I followed the suggestion by the authors of the Tensorflow coco-ssd example [2] to remove the post-processing part of the object detection model graph during conversion. This optimization effectively doubled the speed of the detection/prediction operation in the browser.
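The post-processing stripped from the graph is essentially non-max suppression, which then has to happen outside the model. A minimal greedy sketch of that step (my own illustration, not the library's actual implementation) looks like this:

```javascript
// Illustrative sketch of greedy non-max suppression: keep the highest
// scoring boxes and drop any box that overlaps a kept box too much.
// Boxes are [x, y, width, height]; not the library's actual code.
function nonMaxSuppression(boxes, scores, iouThreshold) {
  const iou = (a, b) => {
    const ix = Math.max(0, Math.min(a[0] + a[2], b[0] + b[2]) - Math.max(a[0], b[0]));
    const iy = Math.max(0, Math.min(a[1] + a[3], b[1] + b[3]) - Math.max(a[1], b[1]));
    const inter = ix * iy;
    const union = a[2] * a[3] + b[2] * b[3] - inter;
    return union > 0 ? inter / union : 0;
  };
  // Visit boxes from highest to lowest score.
  const order = scores.map((s, i) => i).sort((i, j) => scores[j] - scores[i]);
  const keep = [];
  for (const i of order) {
    if (keep.every(k => iou(boxes[i], boxes[k]) <= iouThreshold)) keep.push(i);
  }
  return keep; // indices of the surviving boxes, highest score first
}
```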
Library Wrapper and Hosting
The library was modeled after the tensorflowjs coco-ssd example (but not written in typescript). It consists of a main class with methods to load the model and detect hands in an image, plus a set of other helpful functions, e.g. startVideo(), stopVideo(), getFPS(), renderPredictions(), getModelParameters(), setModelParameters(), etc. A full description of the methods is on Github.
The source file is then bundled using rollup.js and published (with the webmodel files) on npm. This is particularly valuable as jsDelivr automatically provides a CDN for npm packages. (Hosting the files on other CDNs might be faster, and the reader is encouraged to try out other options.) At the moment, handtrack.js is bundled with tensorflowjs (v0.13.5), mainly because at the time of writing there were version issues where TFJS (v0.15) had datatype errors loading image/video tags as tensors. As new versions fix this issue, it will be updated.
When Should I Use Handtrack.js?
If you are interested in prototyping gesture based (body as input) interactive experiences, Handtrack.js can be useful. The user does not need to attach any additional sensors or hardware but can immediately take advantage of engagement benefits that result from gesture based/body-as-input interactions.
Some (not all) relevant scenarios are listed below:
- When mouse motion can be mapped to hand motion for control purposes.
- When an overlap of a hand and other objects can represent meaningful interaction signals (e.g. a touch or selection event for an object).
- Scenarios where human hand motion can be a proxy for activity recognition (e.g. automatically tracking movement activity from a video or images of individuals playing chess, or tracking a person's golf swing). Or simply counting how many humans are present in an image or video frame.
- Interactive art installations. Handtrack.js could be a fun set of controls for interactive art installations.
- Teaching others about ML/AI. The handtrack.js library provides a valuable interface to demonstrate how changes in the model parameters (confidence threshold, IoU threshold, image size, etc.) can affect detection results.
- When you want an accessible demonstration that anyone can easily run or try out with minimal setup.
Limitations
- Browsers are single threaded: care must be taken to ensure prediction operations do not block the UI thread. Each prediction can take between 50 and 150 ms, which becomes noticeable to a user. For example, when integrating Handtrack.js in an application where the entire screen is rendered many times per second (e.g. in a game), I found it useful to reduce the number of predictions requested per second.
- Hands are tracked on a frame-by-frame basis: if you are interested in identifying hands across frames, you will need to write additional code to infer the IDs of detected hands as they enter, move through and leave successive frames. Hint: keeping state on the location of each prediction (and the euclidean distance between predictions) across frames can help.
- Incorrect predictions: there will be the occasional incorrect prediction (sometimes a face is detected as a hand). I found that each camera and lighting condition needed different model parameter settings (especially confidence thresholds) to get good detection. More importantly, this can be improved with additional data.
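As a starting point for the frame-to-frame tracking hint above, the sketch below assigns each new detection the ID of the nearest previous detection (by euclidean distance between box centers), or a fresh ID if none is close enough. The tracker and its maxDistance parameter are my own illustration, not part of handtrack.js:

```javascript
// Illustrative sketch (not part of handtrack.js): naive nearest-neighbour
// tracker that carries hand IDs across frames. bbox is [x, y, width, height].
function createTracker(maxDistance = 60) {
  let previous = []; // [{ id, cx, cy }] from the last frame
  let nextId = 0;
  const center = ([x, y, w, h]) => ({ cx: x + w / 2, cy: y + h / 2 });

  return function track(predictions) {
    const unmatched = [...previous];
    const current = predictions.map(p => {
      const { cx, cy } = center(p.bbox);
      // Find the closest unclaimed detection from the previous frame.
      let best = -1, bestDist = Infinity;
      unmatched.forEach((prev, i) => {
        const d = Math.hypot(prev.cx - cx, prev.cy - cy);
        if (d < bestDist) { bestDist = d; best = i; }
      });
      const id = best >= 0 && bestDist <= maxDistance
        ? unmatched.splice(best, 1)[0].id  // reuse the ID, claim the match
        : nextId++;                        // otherwise a new hand entered
      return { ...p, id, cx, cy };
    });
    previous = current.map(({ id, cx, cy }) => ({ id, cx, cy }));
    return current;
  };
}
```

A real tracker would also handle occlusion and re-entry more carefully, but this is enough to keep stable IDs for slowly moving hands.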
I really look forward to seeing how others who use or extend this project solve some of these limitations.
What's Next?
Handtrack.js represents really early steps with respect to the overall potential of enabling new forms of human-computer interaction with AI in the browser. Already, there have been excellent ideas such as posenet for human pose detection, and handsfree.js for facial expression detection in the browser.
Above all, the reader is invited to imagine. Imagine interesting use cases where knowing the location of a user's hand can make for more engaging interactions.
In the meantime, I will be spending more time on the following:
- Better hand model: creating a robust benchmark to evaluate the underlying hand model, and collecting additional data that improves accuracy and robustness metrics.
- Additional vocabulary: as I worked through building the samples, one thing that became apparent is the limited vocabulary of this interaction method. There is clearly a need to support at least one more state, perhaps a fist and an open hand. This will mean relabelling the dataset (or some semi-supervised approaches).
- Additional model quantization: right now, we are using the fastest model with respect to architecture size and accuracy, MobileNetV2 SSD. Are there optimizations that can make things even faster? Any ideas or contributions here are welcome.
If you would like to discuss this in more detail, feel free to reach out on Twitter, Github or Linkedin. Many thanks to Kesa Oluwafunmilola, who helped with proofreading this article.
References
- [1] Sandler, Mark, et al. "MobileNetV2: Inverted Residuals and Linear Bottlenecks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. arxiv.org/abs/1801.04…
- [2] Tensorflow.js Coco-ssd example.
This library uses code and guidance from the Tensorflow.js coco-ssd example, which provides a library for object detection trained on the MSCOCO dataset. The optimization suggested in the repo (stripping out a post-processing layer) was really helpful (2x speedup).