Can we generate Automatic Cricket Commentary using Neural Networks?

Like everything else, the world of cricket has gone through a lot of technological transformation in recent years. The way cricket is played and how it is viewed around the world have both changed as a result. In this post we discuss whether neural networks are capable of generating cricket commentary just by watching the game.
There has been some work in the literature (can be found here, here and here), but it does not use neural networks. Being a believer in end-to-end deep learning, I think neural networks will seal the deal on this task in the near future. This is a hard problem to tackle because, apart from visual feature extraction, it involves very complex temporal dynamics and handling of long-term dependencies: commentary is generally highly contextualized by the development of the current game, its significance in a broader perspective (friendly match vs. tournament), and the histories of the teams and players involved. A decontextualized explanation of what is happening appears to be an easier problem to solve, and I can think of an architecture that can be used for modelling this.
Drawing ideas from the recent emergence of spatio-temporal neural networks, I think a reasonable architecture should include a convolutional neural network to extract visual features from static frames, a recurrent neural network to model the complex non-linear temporal dynamics of these features, and an encoder-decoder architecture on top of them for end-to-end (video to commentary) learning. It seems manageable to build a decent amount of training data for this network, with cricket footage as input and commentary as the supervision signal. I spy a very promising project idea here for anyone interested!
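To make the idea a bit more concrete, here is a minimal PyTorch sketch of such a pipeline: a tiny CNN applied per frame, an LSTM encoder over the frame features, and an LSTM decoder over commentary tokens. Everything in it is an assumption for illustration only: the class name `VideoToCommentary`, the layer sizes, the vocabulary size and the random inputs are all made up, and a real system would likely use a pretrained backbone, attention and a proper tokenizer.

```python
import torch
import torch.nn as nn


class VideoToCommentary(nn.Module):
    """Toy CNN + LSTM encoder with an LSTM decoder over commentary tokens."""

    def __init__(self, vocab_size, feat_dim=64, hidden_dim=128):
        super().__init__()
        # Per-frame visual feature extractor (tiny stand-in for a real CNN backbone).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # Recurrent encoder over the sequence of frame features.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Decoder that scores the next commentary token given the previous ones.
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, tokens):
        # frames: (batch, time, 3, H, W); tokens: (batch, length) of token ids.
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, state = self.encoder(feats)  # final (h, c) state summarizes the clip
        dec_out, _ = self.decoder(self.embed(tokens), state)
        return self.out(dec_out)  # (batch, length, vocab_size) logits


model = VideoToCommentary(vocab_size=1000)
frames = torch.randn(2, 8, 3, 64, 64)     # 2 clips of 8 random "frames"
tokens = torch.randint(0, 1000, (2, 12))  # 2 random token sequences
logits = model(frames, tokens)
print(logits.shape)
```

Training would pair each clip with its commentary and minimize cross-entropy between these logits and the commentary shifted by one token (teacher forcing); that loop is left out here.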
Cricket shot classification appears to be a vital component of this automatic commentary generation system. A very interesting recent work focuses on this problem and uses a CNN and LSTM based architecture to classify video clips into the relevant shots; it shows promising results. Player localization and pose estimation are very important for accurate shot classification. In the following sections, we will do a rudimentary exploration of the efficacy of human pose estimation for identifying cricket shots from static images.
I collected a small data set of about 120 images in total for ‘cut shot’, ‘sweep’ and ‘drive’ from Google Images. This data set is very small, and if you plan to engineer a meaningful project out of this idea, you should probably collect a much larger one. I use OpenPose for pose estimation; it offers 2D real-time multi-person keypoint detection. I used the lightest and fastest model, ‘mobilenet_thin’, for simplicity and fast execution; you can try a different model for a speed-accuracy trade-off of your choice. OpenPose gives the x and y location (within the 2D image) of the 18 key points of the player. I then build a random forest classifier on top of the feature space formed by these 18 key points to classify each instance as one of the three shots. With a train/test split of 70/30, I am getting F1-scores around 0.8 on the test set. This is pretty good considering how little data I used. Some predictions on the test set are shown below.



Code is available on GitHub.
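For readers who just want the shape of the classification step, here is a minimal scikit-learn sketch. It is not the code from the repository. It assumes each pose is flattened into one vector of 36 values (x and y for each of the 18 key points), and it uses synthetic random poses in place of real pose estimation output, so the score it prints says nothing about real cricket images.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

SHOTS = ["cut shot", "sweep", "drive"]
N_KEYPOINTS = 18

rng = np.random.default_rng(0)

# Synthetic stand-in for real pose output: 40 images per shot, each with
# 18 (x, y) key points in normalized image coordinates. Each class gets its
# own random "mean pose" plus noise, so these are NOT real cricket poses.
poses, labels = [], []
for label in range(len(SHOTS)):
    mean_pose = rng.uniform(0.2, 0.8, size=(N_KEYPOINTS, 2))
    poses.append(mean_pose + rng.normal(0.0, 0.05, size=(40, N_KEYPOINTS, 2)))
    labels.append(np.full(40, label))
poses = np.concatenate(poses)    # shape (120, 18, 2)
labels = np.concatenate(labels)  # shape (120,)

# Flatten each pose into one feature vector: x0, y0, x1, y1, ...
X = poses.reshape(len(poses), -1)  # shape (120, 36)

# 70/30 split, stratified so every shot appears in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
print(f"macro F1 on synthetic data: {macro_f1:.2f}")
```

To use it on real data, replace the synthetic `poses` array with the key points returned by your pose estimator, and decide how to handle key points the estimator fails to detect (for example by filling them with a sentinel value).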
Things that can improve accuracy:
- Larger data set (no-brainer)
- Use a larger and more accurate pose estimation model (refer to their GitHub).
- Use images from the same camera angle, or alternatively build a multi-view feature space using all the available camera angles.
- Use data augmentation. Horizontal flips will help generalize across right-handed and left-handed batsmen. Don’t use large rotations if you plan to use fixed camera angles. Jittering and random crops would only help if you are also fine-tuning the pose estimation model on your data set.
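As a rough illustration of the horizontal flip idea, the flip can be applied directly to the key points instead of the image. The sketch below assumes coordinates normalized to [0, 1] and the usual 18-point COCO-style ordering (nose, neck, then right and left limbs, eyes and ears). The index pairs and the helper name `flip_keypoints` are assumptions, so check them against the pose model you actually use.

```python
import numpy as np

# Assumed (right, left) key point index pairs for an 18-point COCO-style layout:
# shoulders, elbows, wrists, hips, knees, ankles, eyes, ears.
LEFT_RIGHT_PAIRS = [(2, 5), (3, 6), (4, 7), (8, 11), (9, 12), (10, 13), (14, 15), (16, 17)]


def flip_keypoints(keypoints):
    """Mirror an (18, 2) array of normalized (x, y) key points horizontally."""
    flipped = keypoints.copy()
    flipped[:, 0] = 1.0 - flipped[:, 0]  # mirror x around the image center
    for right, left in LEFT_RIGHT_PAIRS:
        # After mirroring, the anatomical left side sits where the right was.
        flipped[[right, left]] = flipped[[left, right]]
    return flipped


pose = np.random.default_rng(0).uniform(0.0, 1.0, size=(18, 2))  # random stand-in pose
mirrored = flip_keypoints(pose)
```

Adding `mirrored` to the training set with the same shot label as `pose` gives the classifier a left-handed counterpart for every right-handed example, and vice versa.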
Have fun!





