# JeVois (Gesture Recognition with Quantized Neural Network)

Started on October 05, 2017

Introduction:

JeVois (jevois.org) is a portable open-source device for general vision tasks. I propose to build an interactive gesture plugin (for GNOME) with a Quantized Neural Network.

08/22/2017 00:00

Start

08/29/2017 00:00

09/06/2017 00:00

10/05/2017 20:53

10/05/2017 21:01

10/06/2017 04:26

#### Hand Detection

Train on this dataset. I resized the images to 128x128 before feeding them into the neural network.

In my deployment setting, the hand should be very conspicuous because the camera is near the hand, while in this dataset the hand occupies a relatively small area:

My first encoder-decoder network works poorly. I found some inspiration in the paper Fully Convolutional Networks for Semantic Segmentation. The U-Net-like structure works well and detects the area of the hand precisely on the training data. It seems that skin color is distinguishing enough to separate the hand from the environment. On the training set, my model even gives more precise (pixel-wise) predictions than the ground truth, because the labels are given as bounding boxes, which are transformed into Gaussian labels.
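
The bounding-box-to-Gaussian transformation mentioned above can be sketched as follows; the sigma rule (a quarter of the box size) is my own assumption, since I did not note the exact value:

```python
import numpy as np

# Sketch: turn a bounding-box label (x, y, w, h) into a soft Gaussian
# heat map centered on the box, for training the segmentation network.
# The sigma = box_size / 4 rule is an assumption, not the recorded value.
def gaussian_label(shape, box):
    """shape: (H, W); box: (x, y, w, h). Returns a float heat map in [0, 1]."""
    h, w = shape
    x, y, bw, bh = box
    cx, cy = x + bw / 2.0, y + bh / 2.0          # box center
    sx, sy = max(bw / 4.0, 1.0), max(bh / 4.0, 1.0)
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-(((xs - cx) ** 2) / (2 * sx ** 2)
                    + ((ys - cy) ** 2) / (2 * sy ** 2)))
```

The peak of the label sits at the box center and falls off smoothly, so the network is not penalized for predicting a tight hand mask inside a loose box.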

Next, I have to alleviate the overfitting problem and test the model with my camera in real time. So far it works poorly in my workplace.

Prediction on the training set (RGB image, ground truth label, prediction):

Prediction on the testing set (RGB image, ground truth label, prediction):

It can be observed from multiple testing cases that the model overfits and learns to detect skin-colored areas rather than the hand.

And it works poorly in my workplace:

10/07/2017 05:27

#### OutOfBox

Maybe the trajectory is helpful for locating the hand.

10/09/2017 18:39

#### GNOME

GNOME Shell

Use python-uinput to control the mouse/keyboard.
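
As a rough sketch of the control path, the position-to-motion mapping might look like the following; the gain value and the helper name are my own assumptions, and the commented python-uinput calls require /dev/uinput access:

```python
# Sketch: convert two detected hand positions into a relative mouse move.
# The gain of 4.0 is an assumed, untuned value.
def hand_to_mouse_delta(prev, curr, gain=4.0):
    """prev, curr: (x, y) hand positions. Returns integer (dx, dy)."""
    dx = int(round((curr[0] - prev[0]) * gain))
    dy = int(round((curr[1] - prev[1]) * gain))
    return dx, dy

# With python-uinput (needs root / /dev/uinput):
#   import uinput
#   device = uinput.Device([uinput.REL_X, uinput.REL_Y, uinput.BTN_LEFT])
#   dx, dy = hand_to_mouse_delta((60, 40), (63, 38))
#   device.emit(uinput.REL_X, dx)
#   device.emit(uinput.REL_Y, dy)
```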

Use Gtk+3 to develop the window interface:

10/10/2017 00:00

#### Todo

Dataset Construction:

1. Find a human-labeled dataset $\mathcal{D}_0$
2. Use $\mathcal{D}_0$ to train a standard neural network $\mathbb{F}_0$ that is powerful enough to detect hands
3. Keep the camera rolling in my workplace at the same resolution as the deployment setting. Denote this dataset as $\mathcal{D}$
4. Use $\mathbb{F}_0$ to label hands in $\mathcal{D}$
5. Use $\mathcal{D}$ to train a tiny (maybe quantized) neural network $\mathbb{F}$ that is efficient and computationally cheap enough to run on JeVois in real time
6. Construct a customized gesture library $\mathcal{G}$
7. Train another network $\mathbb{G}$ to discriminate between the gestures in $\mathcal{G}$
8. Deploy $\mathbb{F}$ and $\mathbb{G}$ as a pipeline.
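
The pipeline of step 8 can be sketched with stand-in models; both stubs below are placeholders for the trained detector and classifier, not real implementations:

```python
# Sketch of the deployment pipeline: F locates the hand, G classifies
# the gesture on the cropped region. F and G here are hard-coded stubs.
def F(frame):
    """Hand detector stub: return a bounding box (x, y, w, h)."""
    return (32, 32, 64, 64)

def G(crop):
    """Gesture classifier stub: return a gesture label."""
    return "open_palm"

def pipeline(frame):
    """Run detection, crop the hand region, then classify the gesture."""
    x, y, w, h = F(frame)
    crop = [row[x:x + w] for row in frame[y:y + h]]
    return G(crop)
```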

10/14/2017 00:00

10/15/2017 23:10

#### SURF/SIFT matching

Matching hand with SIFT:

Issues: (1) the matching box keeps jumping, and it is very hard to get a stable match; (2) it cannot detect the hand when the template and the test hand show different gestures.
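
One standard way to reduce the jumping matches is Lowe's ratio test, which rejects ambiguous nearest-neighbour matches. A minimal numpy sketch, with small arrays standing in for SIFT's 128-D descriptors:

```python
import numpy as np

# Sketch: nearest-neighbour descriptor matching with Lowe's ratio test.
# A match is kept only if the best candidate is clearly closer than the
# second best, which filters out the unstable, ambiguous matches.
def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    """desc_a, desc_b: (N, D) arrays. Returns kept (i, j) index pairs."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))
    return matches
```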

10/21/2017 00:00

#### Neural Network firststep

In previous attempts, I tried to generate the heat map with a complex neural network similar to those used in semantic segmentation. Inspired by https://arxiv.org/abs/1411.4038, I added skip connections between layers to make the prediction more precise.

Although this model can generate a pixel-wise prediction, it takes a long time to train and runs slowly in the deployment scenario.

To simplify the model, I keep only its first part, the convolutional network. I add a convolutional layer with a single filter to generate a low-resolution prediction and upsample directly from the convolved results. The result looks mosaic-like but properly detects the hand's position.

The following figure shows the result of a network with four convolutional layers, whose filter counts are 8, 8, 16, and 1 respectively. The blue dot is the pixel with the highest confidence in this frame, and the green one is the moving maximum.
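
Locating the two dots can be sketched as follows; the exponential moving average is my assumption, since I did not record how the green moving maximum is computed:

```python
import numpy as np

# Sketch: per-frame peak of the confidence heat map (the blue dot) and a
# smoothed running estimate (the green dot). The EMA smoothing rule and
# alpha value are assumptions.
def frame_peak(heatmap):
    """(row, col) of the most confident pixel in this frame."""
    return np.unravel_index(np.argmax(heatmap), heatmap.shape)

def smooth_peak(prev, curr, alpha=0.3):
    """Blend the new peak into the running estimate."""
    if prev is None:
        return curr
    return ((1 - alpha) * prev[0] + alpha * curr[0],
            (1 - alpha) * prev[1] + alpha * curr[1])
```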

There are a few issues with this model to work on:

• It seems the model learns to recognize the color of the hand rather than the hand itself, so red objects, elbows, and heads may be mis-predicted as hands.
• Use an average filter to find an area with high confidence rather than a single pixel.
• A model that works properly at night (with artificial illumination) fails during the day (with natural illumination).
• Low FPS.
• Skip the upsampling/sigmoid step and search directly in the low-resolution prediction.
• Background subtraction does not help the recognition process.
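
The average-filter idea might look like this minimal sketch, which scores each k×k window by its mean confidence instead of trusting a single pixel:

```python
import numpy as np

# Sketch: find the k-by-k window with the highest mean confidence.
# A brute-force scan; a box filter (e.g. cv2.boxFilter) would do the
# same thing faster in deployment.
def best_window(heatmap, k=3):
    """Top-left (row, col) of the k-by-k window with the highest mean."""
    h, w = heatmap.shape
    best, best_pos = -np.inf, (0, 0)
    for r in range(h - k + 1):
        for c in range(w - k + 1):
            score = heatmap[r:r + k, c:c + k].mean()
            if score > best:
                best, best_pos = score, (r, c)
    return best_pos
```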

10/24/2017 14:50

#### Possible Bugs when programming with OpenCV

• The capture format is BGR, and cv2.imshow expects BGR as well. If your algorithm doesn't work with OpenCV, write the captured image to a file (instead of using cv2.imshow) to check whether it is in the proper color space.
• The (x, y) coordinate system is transposed compared to numpy's (row, column) indexing.
• Reference vs. instance: use the copy() function to create a new instance rather than a reference to the original.
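
A minimal demonstration of the reference-vs-instance pitfall with numpy (the same applies to frames captured by OpenCV, which are numpy arrays):

```python
import numpy as np

# Sketch: slicing a numpy array yields a view that shares memory with the
# original, so in-place edits leak back; copy() makes an independent instance.
frame = np.zeros((2, 2), dtype=np.uint8)
view = frame[0]             # shares memory with frame
snapshot = frame[0].copy()  # independent instance

view[0] = 255
# frame[0, 0] is now 255 as well, while snapshot[0] is still 0
```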

10/25/2017 04:05

#### Bug with Jevois

• In the jevois-usbsd script, $username should be $user on Ubuntu.
• jevois-usbsd fails when two cameras are connected to the machine.

11/05/2017 19:32

#### Bulky but accurate network

This bulky model detects the position of the hand very accurately and is robust to noise. However, it has too many parameters to deploy in real time. Another difficulty is that Darknet handles discrete labels by default, so it is hard to use an image as the label. I am still working on the traditional method because the neural network consumes more resources than I expected.

Since the RGB camera has no depth information to separate the hand from the background, I am considering limiting the deployment scenario to a static background. This doesn't mean the model can only work in my workplace; it means one has to reset the background information after the camera is moved. In fact, the static-background assumption is usually satisfied when one uses a desktop PC with the webcam fixed in place.
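
The resettable static-background idea can be sketched as follows; the threshold value is an assumption:

```python
import numpy as np

# Sketch: keep one stored background frame and mask pixels that differ
# from it. reset() is what the user calls after moving the camera.
# The per-pixel threshold of 25 is an assumed, untuned value.
class StaticBackground:
    def __init__(self, threshold=25):
        self.background = None
        self.threshold = threshold

    def reset(self, frame):
        """Store the current frame as the new background."""
        self.background = frame.astype(np.int16)

    def foreground_mask(self, frame):
        """Boolean mask of pixels that moved away from the background."""
        diff = np.abs(frame.astype(np.int16) - self.background)
        return diff > self.threshold
```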

11/07/2017 03:58

#### Jevois meets DeepSymphony (Silent)

This model generates music in which the mean pitch and the density of notes can be controlled. While composing, the system uses the position of the hand to decide the pitch and density: the pitch goes higher as the hand moves up, and the notes become denser as the hand moves right.

Because the piano synthesizer occupied the audio channel, I failed to synthesize and record simultaneously. I will find a way to record the audio as well.

The ‘miss’ notification does not mean JeVois fails to track the hand. I set a 0.02 s timeout when reading the hand position from JeVois over serial-USB; when JeVois fails to respond within that time, I use the last recorded position.
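
The fallback rule can be sketched like this; read_line is a placeholder for something like serial.Serial(..., timeout=0.02).readline():

```python
# Sketch: read an 'x,y' position line from the device; if the read times
# out (returns nothing within 0.02 s), reuse the last recorded position
# and flag the frame as a miss.
def read_position(read_line, last_position):
    """Returns ((x, y), missed)."""
    raw = read_line()
    if not raw:                      # timed out, nothing arrived in time
        return last_position, True
    x, y = (int(v) for v in raw.strip().split(","))
    return (x, y), False
```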

To improve:

• Not fluid enough. The serial reading/writing keeps blocking the generation; maybe I can do it asynchronously. Also, the pitch and density of the generation take several notes to complete a transition, which I think is acceptable.
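
The asynchronous idea could be sketched with a background reader thread, so the generator never blocks on serial I/O; read_fn stands in for the blocking serial read:

```python
import threading

# Sketch: poll the hand position on a background thread and expose only
# the latest value, so the music generation loop never waits on serial I/O.
class AsyncReader:
    def __init__(self, read_fn):
        self.read_fn = read_fn          # blocking read, e.g. over serial
        self.latest = None
        self.lock = threading.Lock()
        self._stop = threading.Event()
        self.thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        while not self._stop.is_set():
            value = self.read_fn()      # may block; runs off the main thread
            with self.lock:
                self.latest = value

    def start(self):
        self.thread.start()

    def stop(self):
        self._stop.set()

    def get(self):
        """Non-blocking: return the most recent position (or None)."""
        with self.lock:
            return self.latest
```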

11/07/2017 17:35

#### Jevois meets DeepSymphony (Cherry-pickingless)

I didn't cherry-pick the video, so you can see that there are many aspects to improve:

• overfitted generation
• delayed response
• unstable hand segmentation
• abrupt transitions
• weak response along the density axis