Max Griffiths


How Deep Learning Can Track Animals in the Wild

And help protect endangered species

Photo by Edewaa Foster on Unsplash

There are a little over 500 Amur tigers currently living in eastern Russia.

Despite major efforts to protect the species, it remains on the brink of extinction.

However, there is hope on the horizon. 2000 miles away in Shanghai, a team of computer scientists has applied deep learning techniques to reimagine the conservation effort for this endangered species.

How exactly have they done this? Let’s take a closer look.

Identifying Tigers in the Wild

The task of re-identification (re-ID) involves determining whether a given person, object, or animal in an image or video clip (a ‘query image’) matches a person, object, or animal in images or video clips from a particular dataset.

So, in essence: you find a tiger, look at its stripes, and then work out whether you’ve seen it before.
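In code, that “have we seen it before?” step amounts to nearest-neighbor search over feature vectors. Here is a minimal sketch with made-up four-dimensional “stripe features” — the real model’s features are learned and far higher-dimensional:

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Rank gallery images by Euclidean distance to the query feature.

    The closest gallery entries are the most likely matches, i.e. the
    tigers whose stripe features most resemble the query's.
    """
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    return np.argsort(dists)  # gallery indices, best match first

# Toy 4-dimensional "stripe features" for a query and three gallery tigers.
query = np.array([1.0, 0.0, 0.5, 0.2])
gallery = np.array([
    [0.9, 0.1, 0.4, 0.3],   # near-duplicate of the query
    [0.0, 1.0, 0.0, 0.9],   # very different stripes
    [1.0, 0.0, 0.5, 0.2],   # identical to the query
])
print(rank_gallery(query, gallery))  # → [2 0 1]
```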

The re-ID of endangered species is extremely important for wildlife and ecology conservation since it garners useful data on population estimates and the location and trajectory of wildlife. As such, it helps monitor dwindling animal populations and provides a “snapshot” into ecosystem health.

Existing methods for animal re-ID, such as tagging, are reliable but time-consuming and costly, and their effects on the animals are not well understood. A second method uses camera trap images. Unlike trapping and tagging animals, this approach is cheaper and less invasive. Its disadvantage, however, is that it is less reliable and subject to human judgment bias.

Applying a deep neural network to the second approach (camera trap images) makes sense because deep learning is capable of quantifying, localizing, and re-identifying animals at a comparatively low cost and in a scalable way.

This task of image retrieval presents a real challenge for computer vision techniques, however. Re-ID in the wild is particularly hard for tigers: their unrestricted movement on four limbs produces extensive pose variation, which makes matching individuals difficult.

In addition to movement, a wider variety of ‘noisy’ backgrounds and the increased occurrence of occlusion make identifying individual tigers especially difficult.

Photo by Austin Neill on Unsplash

The method for identifying Amur tigers relies on tiger stripe patterns. This is because these patterns are the best marker for identifying and discerning between individual tigers. Since the stripes on the left and right sides of tigers differ, the authors consider images of the left and right sides of their bodies as different identities.
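As a toy illustration of that convention, a hypothetical labeling scheme (not the dataset’s actual naming) might give each flank of a tiger its own identity:

```python
def entity_id(tiger_id, side):
    """Treat the left and right flank of the same tiger as separate
    identities, since the stripe patterns on the two sides differ.
    (Hypothetical naming scheme, for illustration only.)"""
    assert side in ("left", "right")
    return f"tiger-{tiger_id:03d}-{side}"

print(entity_id(7, "left"))   # → tiger-007-left
print(entity_id(7, "right"))  # → tiger-007-right
```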

Let’s now have a quick look at the dataset. This will be a quick overview, covering the key elements of the research paper. To read the full paper, ‘ATRW: A Benchmark for Amur Tiger Re-identification in the Wild’¹, please click here.

Amur Tigers

The Amur Tiger Re-identification in the Wild (ATRW) dataset by Li et al., 2019 is a large-scale, novel collection that captures tigers across a very wide range of poses and lighting conditions.

The dataset consists of 8,076 high-resolution video clips of 92 Amur tigers from multiple wild zoos.

The re-ID dataset contains 182 entities of 92 tigers, with a total of 3,649 bounding boxes, that is, the coordinates of the rectangle that serves as a point of reference for an object or individual in a given image or video clip.

Compared to person or vehicle re-ID datasets, this Amur tiger dataset presents several novel challenges. The main one is that its video clips cover a wide range of pose variations and different lighting and background environments.

Photo by Hans Veth on Unsplash

Before looking at the model, let’s take a look at how a deep neural network works.

Take the network structure in the image below. We allow the model to learn feature extractions (parts of the tiger) based on example data from input images (the red neurons) to output classifications (the blue neurons).

The image depicts a basic fully-connected deep neural network. Source: https://dvl.in.tum.de/teaching/i2dl-ss19/ under a CC license https://creativecommons.org/licenses/by/4.0/

In re-ID, we pass pixel values into the network through the input layer (the red neurons on the left of the network). Those values are then fed through the hidden layers (the yellow neurons), whose weighted connections and biases produce activation values. The outputs of the hidden layers are passed to the output layer (the blue neurons), where they are classified.

Each neuron within this network is modeled on how biological neurons fire and transmit information. The strength of each connection is weighted, varying from a maximum excitation of +1.0 to a maximum inhibition of –1.0.

Information passes through a neuron as follows. Input in the form of pixel values between 0 and 255 is fed into the network through the neurons in the input layer. This input is then passed to neurons in the hidden layers through weighted connections (corresponding to excitation or inhibition). The weighted inputs are summed, a bias is added, and the result is passed to an activation function. The resultant output is called an activation value, which represents the size of the neuron’s activation. Whether the neuron is activated depends on whether its activation value is above a threshold: if it is, the neuron will ‘fire’; if it isn’t, it won’t.

This output is analogous to the axon of a biological neuron, which takes information in the form of electrical impulses away from the main body of the neuron cell to other neurons or cells. The size of these activations represents their state of excitation or inhibition.
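Putting the steps above together, a single artificial neuron can be sketched in a few lines. The tanh activation and the 0.0 firing threshold here are illustrative choices, not the paper’s:

```python
import numpy as np

def neuron(inputs, weights, bias, threshold=0.0):
    """One artificial neuron: sum of weighted inputs plus a bias,
    squashed by an activation function, then compared to a threshold."""
    z = np.dot(inputs, weights) + bias  # weighted sum + bias
    activation = np.tanh(z)             # maps z into (-1, +1), like excitation/inhibition
    fires = activation > threshold      # the neuron 'fires' only above threshold
    return activation, fires

# Pixel intensities scaled from [0, 255] into [0, 1] before entering the network.
pixels = np.array([200, 30, 90]) / 255.0
weights = np.array([0.8, -0.5, 0.3])    # positive = excitatory, negative = inhibitory
act, fired = neuron(pixels, weights, bias=0.1)
print(act, fired)
```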

The model

As the authors of the paper point out, since tigers have much larger pose variation due to motion that does not preserve the shape of the individual (i.e., non-rigid motion), the local feature representations created by pooling do not provide precise modeling of the tiger body.

One type of model that is good at capturing large pose variation is part-based models, which focus on parts of an image for detection and identification.

These models have demonstrated competitive performance in tasks such as object detection and the recognition of objects made up of “deformable parts”: a low-resolution ‘root’ template plus part templates of the object in question, where each part represents a location property of the image and deformations are captured by the links connecting those parts.

This type of method represents local parts with rectangular patches and adopts a structured SVM to learn part structures. Recently, pose key-point estimation techniques such as OpenPose have provided even more precise body-part and skeleton modeling, which offers new opportunities for part-based models.

The authors propose a novel pose part-based model (PPbM) that achieves impressive performance. As a baseline, they use a classification-based re-ID method built on an ImageNet-pretrained ResNet-50 backbone, followed by two fully-connected layers with 1024 and 107 neurons respectively. The ResNet allows for a deeper neural network capable of extracting local and global features from the dataset.
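A hedged sketch of that baseline head, with random weights standing in for training and the ImageNet-pretrained backbone reduced to an assumed 2048-dimensional pooled feature (ResNet-50’s usual final feature width):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-in for the pooled output of a ResNet-50 backbone (2048-d).
backbone_feature = rng.standard_normal(2048)

# The two fully-connected layers from the baseline: 2048 -> 1024 -> 107.
W1, b1 = rng.standard_normal((1024, 2048)) * 0.01, np.zeros(1024)
W2, b2 = rng.standard_normal((107, 1024)) * 0.01, np.zeros(107)

hidden = relu(W1 @ backbone_feature + b1)
logits = W2 @ hidden + b2
probs = softmax(logits)      # one score per training identity

print(probs.shape)           # (107,) — probabilities summing to 1
```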

Images of detection subset. Photo from Li et al 2019 https://arxiv.org/pdf/1906.05586.pdf under a CC BY-NC-SA 4.0 License

In the above images, you can see sample bounding boxes of Amur tigers from multiple wild zoos. Below you can see images of the tiger skeleton key-points.

Images of pose subset. The different colored lines represent the tiger key-points including the ears, nose, shoulders, hips, etc. Photo from Li et al 2019 https://arxiv.org/pdf/1906.05586.pdf under a CC BY-NC-SA 4.0 License

The different colored lines mapping the different points on the tiger represent the tiger’s key-points which include the ears, nose, shoulders, hips, and so on.

Now, let’s turn to the architecture of the model. Below is a visual representation of the structure of the model used to re-ID the tigers.

Photo from Li et al 2019 https://arxiv.org/pdf/1906.05586.pdf under a CC BY-NC-SA 4.0 License

I won’t go into too much detail here, but the model is a ‘pose part-based model’ (PPbM): it integrates the result of pose key-point estimation into a deep neural network, representing the tiger with a 7-part star model whose parts include the trunk, front legs, hind thighs, and hind shanks.

Below are the two local head structures, PPbM-a (top) and PPbM-b (bottom). PPbM-a concatenates the features from the 7 parts into one vector, while PPbM-b uses a “soft-attention” strategy to combine the 7 parts.
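The difference between the two heads can be sketched in a few lines of NumPy. The 256-dimensional part features and random attention logits here are stand-ins, not the paper’s actual sizes or learned values:

```python
import numpy as np

rng = np.random.default_rng(1)
part_feats = rng.standard_normal((7, 256))  # one 256-d feature per body part (assumed size)

# PPbM-a: concatenate the 7 part features into one long vector.
ppbm_a = np.concatenate(part_feats)         # shape (7 * 256,) = (1792,)

# PPbM-b: soft attention -- a weight per part, then a weighted sum.
attn_logits = rng.standard_normal(7)        # would come from a small subnetwork in practice
attn = np.exp(attn_logits) / np.exp(attn_logits).sum()  # softmax over the 7 parts
ppbm_b = (attn[:, None] * part_feats).sum(axis=0)       # shape (256,)

print(ppbm_a.shape, ppbm_b.shape)           # → (1792,) (256,)
```

Note the trade-off: concatenation keeps every part’s features intact but grows the vector, while soft attention keeps a fixed size and lets the network down-weight occluded or unreliable parts.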

Photo from Li et al 2019 https://arxiv.org/pdf/1906.05586.pdf under a CC BY-NC-SA 4.0 License

For each part, the authors calculate the Axes Aligned Bounding Box (AABB) according to the pose skeleton. This can be seen below.

The red points define a part; the yellow box is the bounding box with annotated dimensions, height, and length; the blue box is AABB which contains the yellow box. Photo from Li et al 2019 https://arxiv.org/pdf/1906.05586.pdf under a CC BY-NC-SA 4.0 License
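Computing an AABB from a part’s key-points is straightforward: take the minimum and maximum coordinates of the points that define the part. A minimal sketch with toy coordinates:

```python
import numpy as np

def aabb(points):
    """Axis-aligned bounding box of a set of (x, y) key-points:
    the smallest upright rectangle containing them all."""
    pts = np.asarray(points, dtype=float)
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    return x_min, y_min, x_max, y_max

# Toy key-points for one body part (e.g. a front leg).
part_points = [(12, 40), (30, 18), (25, 55)]
print(aabb(part_points))  # → (12.0, 18.0, 30.0, 55.0)
```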

For each AABB area, the authors use the ResNet-50 backbone. They extract the local feature representation with regional average pooling on the res3d feature map, using the intermediate res3d layer rather than the final residual layer.
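Regional average pooling over an AABB can be sketched as follows. The 512×28×28 feature map is an assumption, mirroring the channel width and resolution typical of ResNet-50’s third stage:

```python
import numpy as np

def regional_avg_pool(feature_map, box):
    """Average the feature map over a rectangular region (the part's
    AABB mapped into feature-map coordinates), one value per channel."""
    x0, y0, x1, y1 = box
    region = feature_map[:, y0:y1, x0:x1]  # (channels, region_h, region_w)
    return region.mean(axis=(1, 2))        # (channels,)

rng = np.random.default_rng(2)
fmap = rng.standard_normal((512, 28, 28))  # stand-in for a res3d-like feature map
vec = regional_avg_pool(fmap, (4, 6, 12, 20))
print(vec.shape)  # → (512,) — one local feature vector per body part
```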

This model achieves high levels of accuracy, with competitive performance when compared to state-of-the-art models. What will be its impact?

Concluding Thoughts

The re-ID of endangered animals is a growing field of research in computer vision. The authors in the discussed paper present a new large-scale wildlife re-ID dataset which contains bounding box, pose key-point, and ID annotations of Amur tigers from multiple wild zoos.

As we have seen, large pose variations demand precise target modeling, which is less studied in current re-ID datasets and research. It therefore challenges existing algorithms and offers new avenues for future research with applications in wildlife and ecology conservation.

Through systematic benchmarking, the authors show not only that state-of-the-art models are challenged by their large-scale dataset but also that a novel pose part-based model (PPbM) can re-ID the dataset with high levels of accuracy. This research is a valuable and exciting application of computer vision techniques which will, importantly, continue to develop and improve in the years to come.
