JeVois (jevois.org) is a portable open-source device for general vision tasks. I propose to build an interactive gesture plugin for GNOME with a quantized neural network.
I train on this dataset, resizing the images to 128x128 before feeding them into the neural network.
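The resizing step can be sketched in plain numpy (in practice I would use cv2.resize; the nearest-neighbor scheme and the function name here are my own illustration):

```python
import numpy as np

def resize_nearest(img: np.ndarray, size: int = 128) -> np.ndarray:
    """Nearest-neighbor resize of an HxWxC image to size x size."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    return img[rows[:, None], cols]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
small = resize_nearest(frame)            # shape (128, 128, 3)
```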
In my deployment setting, the hand should be very conspicuous because the camera is close to it, whereas in this dataset the hand occupies a relatively small area:
My first encoder-decoder network works poorly. I found some inspiration in the paper Fully Convolutional Networks for Semantic Segmentation. The U-Net-like structure works well and detects the area of the hand precisely on the training data. It seems that skin color is distinctive enough to separate the hand from the environment. On the training set, my model even gives more precise (pixel-wise) predictions than the ground truth, because the labels are given as bounding boxes, which I transform into Gaussian labels.
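The bounding-box-to-Gaussian conversion can be sketched as follows (the `sigma_scale` parameter and the exact parametrization are my assumptions, not the values used in training):

```python
import numpy as np

def bbox_to_gaussian(h, w, box, sigma_scale=0.5):
    """Turn a bounding box (x0, y0, x1, y1) into a Gaussian heatmap label.

    The peak value is 1.0 at the box centre; sigma is tied to the box size,
    so a bigger hand yields a wider blob.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1.0)
    sy = max((y1 - y0) * sigma_scale, 1.0)
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2.0)

label = bbox_to_gaussian(128, 128, (40, 50, 80, 90))
```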
Next, I have to alleviate the overfitting problem and test the model with my camera in real time. So far, it works poorly in my workplace.
Prediction on the training set (RGB image, ground truth label, prediction):
Prediction on the testing set (RGB image, ground truth label, prediction):
It can be observed from multiple test cases that the model overfits and learns to detect skin-colored areas rather than the hand.
And it works poorly in my workplace:
Use python-uinput to control the mouse/keyboards.
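A sketch of the mouse-control side: the position-to-movement mapping below is my own illustration (gain and deadzone values are guesses), and the commented-out python-uinput calls show how the events would be emitted (requires /dev/uinput permissions):

```python
def hand_to_mouse_delta(prev, cur, gain=3.0, deadzone=2):
    """Map two consecutive hand positions (x, y) in image coordinates to a
    relative mouse movement, ignoring jitter smaller than `deadzone` pixels."""
    dx, dy = cur[0] - prev[0], cur[1] - prev[1]
    if abs(dx) < deadzone:
        dx = 0
    if abs(dy) < deadzone:
        dy = 0
    return int(dx * gain), int(dy * gain)

# Emitting the events with python-uinput would look like:
# import uinput
# device = uinput.Device([uinput.REL_X, uinput.REL_Y, uinput.BTN_LEFT])
# dx, dy = hand_to_mouse_delta((60, 40), (65, 40))
# device.emit(uinput.REL_X, dx)
# device.emit(uinput.REL_Y, dy)
```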
Use Gtk+3 to develop the window interface:
- Find a human-labeled dataset
- Use it to train a standard neural network that is powerful enough to detect hands
- Keep the camera rolling in my workplace at the same resolution as the deployment setting, and record the footage
- Use the standard network to label hands in the recorded footage
- Use the resulting labels to train a tiny (maybe quantized) neural network that is efficient and computationally cheap enough to run on JeVois in real time
- Construct a customized gesture library
- Train another network to discriminate between the different gestures in the library
- Deploy the two networks as a pipeline
Tried different datasets
Although the data collected from http://lttm.dei.unipd.it/downloads/gesture/ is similar to my deployment setting, the result is still not good. Maybe I have to augment the data to make the model invariant to changes in setting (e.g., illumination, image quality).
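An illumination augmentation could be as simple as random gamma/gain jitter (the parameter ranges here are guesses, not tuned values):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_illumination(img, max_gamma=1.8, max_gain=0.4):
    """Randomly perturb brightness and contrast so the model sees many
    illumination conditions during training."""
    x = img.astype(np.float32) / 255.0
    gamma = rng.uniform(1.0 / max_gamma, max_gamma)  # nonlinear brightness
    gain = 1.0 + rng.uniform(-max_gain, max_gain)    # linear scaling
    out = np.clip((x ** gamma) * gain, 0.0, 1.0)
    return (out * 255).astype(np.uint8)
```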
This dataset http://cims.nyu.edu/~tompson/NYU_Hand_Pose_Dataset.htm#overview provides extensive RGB-D pictures (92 GB). However, due to storage limitations, I failed to unzip the compressed file.
Matching hand with SIFT:
Issues: (1) the matching box keeps jumping and it is very hard to get a stable match; (2) it cannot detect the hand when the template and the test hand show different gestures.
Neural network: first step
In previous attempts, I tried to generate the heat map with a complex neural network similar to those used in semantic segmentation. I added some skip connections between layers to make the prediction more precise, inspired by https://arxiv.org/abs/1411.4038.
Although this model can generate a pixel-wise prediction, it takes a long time to train and its runtime in the deployment scenario is slow.
To simplify this model, I keep only the first part, namely the convolutional network. I add a convolutional layer with one filter to generate a low-resolution prediction and upsample directly from the convolved result. The output is blocky but properly detects the hand's position.
The following figure is the result of a network with four convolutional layers with 8, 8, 16, and 1 filters respectively. The blue dot is the pixel with the highest confidence in this frame and the green one is the moving maximum.
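The peak search plus the moving maximum can be sketched like this (the smoothing factor `alpha` is a made-up value; I read "moving maximum" as an exponential moving average of the peak):

```python
import numpy as np

def track_peak(conf, prev, alpha=0.8):
    """Pick the highest-confidence cell in a low-resolution map (blue dot)
    and smooth it with an exponential moving average (green dot)."""
    r, c = np.unravel_index(np.argmax(conf), conf.shape)
    peak = np.array([r, c], dtype=float)
    if prev is None:                     # first frame: nothing to smooth
        return peak, peak
    smooth = alpha * prev + (1 - alpha) * peak
    return peak, smooth

conf = np.zeros((16, 16))
conf[5, 9] = 1.0
peak, smooth = track_peak(conf, prev=np.array([4.0, 8.0]))
```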
There are a few issues with this model to work on:
- It seems the model learns to recognize the color of the hand rather than the hand itself. Therefore red objects, elbows, and heads might be mispredicted as hands.
- Use an average filter to find the area with high confidence rather than a single pixel.
- A model that works properly at night (with artificial illumination) fails to work during the day (with natural illumination).
- Low FPS
- Skip the upsampling/sigmoid process. Directly search in the low-resolution prediction.
- Background subtraction cannot help the recognition process.
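The averaging-filter and direct low-resolution-search ideas from the list above might be combined like this (a numpy sketch; the window size `k` is arbitrary, and the integral-image trick is just one way to box-filter):

```python
import numpy as np

def best_region(conf, k=3):
    """Box-filter a low-resolution confidence map and return the centre of
    the k x k window with the highest average confidence.  Searching the
    low-resolution map directly skips the upsampling/sigmoid step."""
    h, w = conf.shape
    # Integral image: each window sum becomes four lookups.
    ii = np.zeros((h + 1, w + 1))
    ii[1:, 1:] = conf.cumsum(0).cumsum(1)
    sums = ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]
    r, c = np.unravel_index(np.argmax(sums), sums.shape)
    return r + k // 2, c + k // 2    # window centre in map coordinates

conf = np.zeros((16, 16))
conf[4:7, 8:11] = 1.0                # a 3x3 blob of high confidence
```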
Possible Bugs when programming with OpenCV
- The capture format is BGR, and cv2.imshow expects BGR as well. If your algorithm doesn't work with OpenCV, output the captured image to a file (not via cv2.imshow) to check whether it is in the proper color space.
- The coordinate system is transposed relative to numpy indexing: a point (x, y) corresponds to img[y, x].
- Reference vs. instance: use the copy() function to create a new instance rather than a reference.
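The three pitfalls can be demonstrated with plain numpy (no camera needed):

```python
import numpy as np

# 1. OpenCV stores pixels as BGR; reversing the channel axis converts RGB<->BGR.
rgb = np.array([[[255, 0, 0]]], dtype=np.uint8)   # a 1x1 pure-red RGB "image"
bgr = rgb[..., ::-1]
assert bgr[0, 0, 2] == 255                        # red ends up in the last channel

# 2. A point (x, y) from OpenCV (e.g. a mouse callback) indexes as img[y, x].
img = np.zeros((480, 640), dtype=np.uint8)
x, y = 100, 20
img[y, x] = 1                                     # NOT img[x, y]

# 3. Slices are views: mutating the slice mutates the original array too.
view = img[0:10, 0:10]
view[:] = 9
assert img[0, 0] == 9
safe = img[0:10, 0:10].copy()                     # independent instance
safe[:] = 0
assert img[0, 0] == 9                             # original is untouched
```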
Bugs with JeVois
- In the script of `jevois-usbsd`, `$username` should be `$user` in Ubuntu.
- `jevois-usbsd` fails when two cameras are connected to the machine.
Bulky but accurate network
This bulky model can detect the position of the hand very accurately and is robust to noise. However, it has too many parameters to deploy in real time. Another difficulty is that Darknet handles discrete labels by default, and it is hard to use an image as the label. I am still working on the traditional method because the neural network consumes more resources than I expected.
Since the RGB camera provides no depth information to separate the hand from the background, I am considering limiting the deployment scenario to a static background. This doesn't mean the model can only work in my workplace; it means one has to reset the background information after the camera is moved. In fact, the static-background assumption is usually satisfied when one uses a desktop PC with the webcam fixed in place.
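Under the static-background assumption, a simple frame-differencing mask is enough to isolate moving objects (the threshold value here is arbitrary):

```python
import numpy as np

def foreground_mask(frame, background, thresh=30):
    """Per-pixel foreground mask against a stored static background.
    `background` must be re-captured whenever the camera is moved."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff.max(axis=-1) > thresh   # True where any channel changed enough

background = np.full((128, 128, 3), 100, dtype=np.uint8)
frame = background.copy()
frame[40:60, 40:60] = 200               # a "hand" enters the scene
mask = foreground_mask(frame, background)
```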
JeVois meets DeepSymphony (Silent)
This model generates music in which the mean pitch and the density of notes can be controlled. When composing, the system uses the position of the hand to decide the pitch and density: the pitch goes higher as the hand moves up, and the notes become denser as the hand moves right.
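The position-to-music mapping might look like this (the pitch and density ranges are illustrative guesses, not DeepSymphony's actual parameters):

```python
def hand_to_music(x, y, width=128, height=128,
                  pitch_range=(48, 84), density_range=(1, 8)):
    """Map a hand position in the image to (mean pitch, notes per beat):
    moving up raises the pitch, moving right raises the density."""
    up = 1.0 - y / height           # y grows downward in image coordinates
    right = x / width
    pitch = pitch_range[0] + up * (pitch_range[1] - pitch_range[0])
    density = density_range[0] + right * (density_range[1] - density_range[0])
    return int(round(pitch)), int(round(density))
```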
Because the piano synthesizer occupies the audio channel, I cannot synthesize and record simultaneously. I will find a way to record the audio as well.
The 'miss' notification does not mean JeVois fails to track the hand. I set a 0.02 s timeout when reading the hand position from JeVois over serial-USB; when JeVois fails to respond within that time, I use the last recorded position.
- Not smooth enough. The serial reading/writing keeps blocking the generation; maybe I can do it asynchronously. Also, the pitch and density take several notes to complete a transition, which I think is acceptable.
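The asynchronous-reading idea can be sketched with a background thread (the `readline` callable stands in for pyserial's `ser.readline`, which is an assumption here; the demo feed below is fake data):

```python
import queue
import threading

class HandTracker:
    """Read 'x y' lines from JeVois on a background thread so the generation
    loop never blocks; fall back to the last position on a short timeout."""

    def __init__(self, readline):
        self._q = queue.Queue()
        self._last = (0, 0)
        self._readline = readline            # e.g. serial.Serial(...).readline
        threading.Thread(target=self._pump, daemon=True).start()

    def _pump(self):
        while True:
            line = self._readline()
            if not line:                     # EOF / closed port: stop pumping
                return
            x, y = map(int, line.split())
            self._q.put((x, y))

    def position(self, timeout=0.02):
        try:
            self._last = self._q.get(timeout=timeout)
        except queue.Empty:
            pass                             # JeVois was too slow: reuse last
        return self._last

# Self-contained demo with a fake readline:
feed = iter(["64 32", ""])
tracker = HandTracker(lambda: next(feed, ""))
```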