Urwa Muaz

Summary

The website discusses the potential of using neural networks for generating automatic cricket commentary by analyzing visual features and temporal dynamics of the game.

Abstract

The article explores the technological advancements in cricket and proposes the use of neural networks for automated commentary generation. It suggests that while previous work has not utilized neural networks, an end-to-end deep learning approach could effectively handle the complex task of commentary generation, which involves understanding visual cues, temporal dynamics, and contextual information such as the significance of the match and player histories. The proposed architecture includes convolutional neural networks for visual feature extraction, recurrent neural networks for modeling temporal dynamics, and an encoder-decoder setup for video-to-commentary translation. The article also highlights the importance of cricket shot classification, referencing recent work that uses CNN and LSTM for shot classification and discussing the use of human pose estimation for identifying shots from static images. A small dataset experiment using OpenPose for pose estimation and a random forest classifier achieved promising F1-scores, indicating the feasibility of the approach despite the limited data.

Opinions

  • The author is a proponent of end-to-end deep learning solutions, believing neural networks will soon excel in generating cricket commentary.
  • The task of automatic commentary generation is recognized as challenging due to the need to capture complex temporal dynamics and long-term dependencies in cricket matches.
  • Decontextualized commentary generation is considered a simpler problem compared to generating context-aware commentary.
  • The author suggests that spatio-temporal neural networks could form the basis of a reasonable architecture for this task.
  • Cricket shot classification is seen as a vital component of the automatic commentary generation system, with recent work showing promising results in this area.
  • Player localization and pose estimation are emphasized as crucial for accurate shot classification.
  • The author reports successful preliminary results using a small dataset and OpenPose for pose estimation, followed by a random forest classifier.
  • The article encourages the collection of a larger dataset and the exploration of more accurate pose estimation models to improve classification accuracy.
  • The author suggests that data augmentation techniques like horizontal flipping, jittering, and random crops could enhance the model's generalization capabilities, especially when dealing with fixed camera angles.

Can we generate Automatic Cricket Commentary using Neural Networks?

Like everything else, the world of cricket has gone through a lot of technological transformation in recent years. The way cricket is played and how it is viewed around the world have both changed as a result. In this post we discuss whether neural networks are capable of generating cricket commentary just by watching the game.

There has been some work in the literature (can be found here, here and here), but it does not use neural networks. Being a believer in end-to-end deep learning, I think neural networks will seal the deal on this task in the near future. This is a hard problem to tackle because, apart from visual feature extraction, it involves very complex temporal dynamics and the handling of long-term dependencies. Commentary is generally highly contextualized by the development of the current game, its significance in a broader perspective (friendly match vs. tournament), and the histories of the teams and players involved. A decontextualized explanation of what is happening appears to be an easier problem to solve, and I can think of an architecture that can be used for modelling it.

Drawing ideas from the recent emergence of spatio-temporal neural networks, I think a reasonable architecture should include a convolutional neural network to extract visual features from static frames, a recurrent neural network to model the complex non-linear temporal dynamics of these features, and an encoder-decoder architecture on top of them for end-to-end (video to commentary) learning. It seems manageable to build a decent amount of training data for this network, with cricket footage as the input and commentary as the supervision signal. I spy a very promising project idea here for interested people!
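To make the idea concrete, here is a minimal PyTorch sketch of such a pipeline. It is only an illustration under stated assumptions: the class names `FrameEncoder` and `VideoToCommentary` are hypothetical, the layer sizes are arbitrary, and the choice of LSTMs for both the video encoder and the word decoder is one possible reading of "recurrent neural network" and "encoder-decoder", not a tested design.

```python
import torch
import torch.nn as nn


class FrameEncoder(nn.Module):
    """Small CNN that maps one RGB frame to a feature vector (sizes are arbitrary)."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pool -> (N, 32, 1, 1)
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, frames):  # frames: (N, 3, H, W)
        return self.fc(self.conv(frames).flatten(1))  # (N, feat_dim)


class VideoToCommentary(nn.Module):
    """CNN per frame -> LSTM over frames (encoder) -> LSTM over words (decoder)."""

    def __init__(self, vocab_size, feat_dim=128, hidden=256):
        super().__init__()
        self.cnn = FrameEncoder(feat_dim)
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, video, tokens):
        # video:  (B, T, 3, H, W) clip of T frames
        # tokens: (B, L) ids of the previous commentary words (teacher forcing)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1)).view(b, t, -1)  # (B, T, feat_dim)
        _, state = self.encoder(feats)        # final (h, c) summarizes the clip
        dec_out, _ = self.decoder(self.embed(tokens), state)
        return self.out(dec_out)              # (B, L, vocab_size) word logits


if __name__ == "__main__":
    model = VideoToCommentary(vocab_size=50)
    clip = torch.randn(2, 4, 3, 32, 32)       # 2 clips, 4 frames each
    words = torch.randint(0, 50, (2, 6))      # 6 commentary tokens per clip
    print(model(clip, words).shape)
```

Training would pair each clip with its commentary tokens and minimize cross-entropy between the logits and the next word. In practice a pretrained CNN backbone and an attention mechanism over the frame features would likely be needed, but those go beyond what the post describes.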

Cricket shot classification appears to be a vital component of this automatic commentary generation system. A very interesting recent work focuses on this problem and uses a CNN- and LSTM-based architecture to classify video clips into the relevant shots, with promising results. Player localization and pose estimation are very important for accurate shot classification. In the following sections, we will do a rudimentary exploration of the efficacy of human pose estimation for identifying cricket shots from static images.

I collected a small dataset of about 120 images in total for ‘cut shot’, ‘sweep’ and ‘drive’ from Google Images. This dataset is very small, and if you plan to engineer a meaningful project out of this idea, you should probably collect a much larger one. I use OpenPose for pose estimation; it offers real-time 2D multi-person keypoint detection. I used the lightest and fastest model, ‘mobilenet_thin’, for simplicity and fast execution; you can try a different model for a speed-accuracy trade-off of your choice. OpenPose gives the x and y location (within the 2D image) of 18 key points of the player. I then build a random forest classifier on top of this keypoint feature space to classify each instance as one of the three shots. With a 70/30 train-test split, I get F1-scores of around 0.8 on the test set. This is pretty good considering how little data I used. Some predictions on the test set are shown below.

Code is available on GitHub.
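Independently of that repository, the data-handling steps described above can be sketched in plain Python. This is not the author's code: the input format (a dict from keypoint index to an (x, y) pair, with undetected joints simply absent) and the helper names are assumptions made for illustration. Flattening x and y for 18 key points gives 36 values, so the author's exact encoding may differ. The classifier itself is left out; any random forest implementation could be trained on the resulting vectors.

```python
import random
from collections import defaultdict

NUM_KEYPOINTS = 18  # the post reports 18 key points per player


def keypoints_to_features(keypoints, num_keypoints=NUM_KEYPOINTS, missing=0.0):
    """Flatten {index: (x, y)} into [x0, y0, x1, y1, ...].

    Joints that were not detected are filled with `missing` (an assumption).
    """
    features = []
    for i in range(num_keypoints):
        x, y = keypoints.get(i, (missing, missing))
        features.extend([float(x), float(y)])
    return features


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), holding out ~test_fraction of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label][:]
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_fraction))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


def per_class_f1(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for every class."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    labels = ["cut"] * 40 + ["sweep"] * 40 + ["drive"] * 40
    train_idx, test_idx = stratified_split(labels)
    print(len(train_idx), len(test_idx))
    print(per_class_f1(["cut", "cut", "sweep", "drive"],
                       ["cut", "sweep", "sweep", "drive"]))
```

Splitting per class keeps the three shots equally represented in the test set, which matters with only about 40 images per class.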

Things that can improve accuracy:

  • A larger dataset (no-brainer).
  • Use a larger and more accurate pose estimation model (refer to their GitHub).
  • Use images from the same camera angle, or alternatively build a multi-view feature space using all the available camera angles.
  • Use data augmentation. Horizontal flips will help generalize across right-handed and left-handed batsmen. Don’t use large rotations if you plan to use fixed camera angles. Jittering and random crops would only help if you are also fine-tuning the pose estimation model on your dataset.
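The horizontal flip can be applied directly to the extracted key points rather than to the images. The sketch below assumes the COCO-style 18-keypoint ordering commonly used with OpenPose (0 nose, 1 neck, 2-4 right arm, 5-7 left arm, 8-10 right leg, 11-13 left leg, 14/15 right/left eye, 16/17 right/left ear) and x coordinates normalized to the image width. Both are assumptions, so check them against the model you actually use. The function names are hypothetical.

```python
# (right, left) index pairs under the assumed COCO-style 18-keypoint ordering.
LEFT_RIGHT_PAIRS = [
    (2, 5),    # shoulders
    (3, 6),    # elbows
    (4, 7),    # wrists
    (8, 11),   # hips
    (9, 12),   # knees
    (10, 13),  # ankles
    (14, 15),  # eyes
    (16, 17),  # ears
]


def build_swap_map(pairs, num_keypoints=18):
    """Map each keypoint index to its mirrored counterpart (itself if unpaired)."""
    swap = list(range(num_keypoints))
    for right, left in pairs:
        swap[right], swap[left] = left, right
    return swap


def flip_keypoints(keypoints, image_width=1.0, pairs=LEFT_RIGHT_PAIRS,
                   num_keypoints=18):
    """Mirror a pose horizontally.

    keypoints: {index: (x, y)}. The x coordinate is reflected about the
    vertical centre line and left/right joint labels are exchanged, so a
    mirrored right-hander looks like a genuine left-hander to the classifier.
    """
    swap = build_swap_map(pairs, num_keypoints)
    return {swap[i]: (image_width - x, y) for i, (x, y) in keypoints.items()}


if __name__ == "__main__":
    pose = {0: (0.4, 0.1), 4: (0.2, 0.5), 7: (0.9, 0.6)}
    print(flip_keypoints(pose))
```

Swapping the labels matters: reflecting x alone would leave the "right wrist" feature on the wrong side of the body. For pixel coordinates, pass the image width in pixels as `image_width`.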

Have fun!

Machine Learning
Cricket
Computer Vision
Deep Learning