Image Processing – Image Tracking

November 27, 2024November 27, 2024

ETTrack: enhanced temporal motion predictor for multi-object tracking

Han, X., Oishi, N., Tian, Y. et al. ETTrack: enhanced temporal motion predictor for multi-object tracking. Appl Intell 55, 33 (2025). https://doi.org/10.1007/s10489-024-05866-4

Our new paper on tracking multiple people using a transformer to learn complex motion patterns.

August 7, 2018

NEC unveils facial recognition system for 2020 Tokyo Olympics – The Verge

But will it work…..

Source: NEC unveils facial recognition system for 2020 Tokyo Olympics – The Verge

May 17, 2018May 17, 2018

Deep-Learning-for-Tracking-and-Detection: Collection of papers and other resources for object tracking and detection using deep learning

A useful Github collection

Source: abhineet123/Deep-Learning-for-Tracking-and-Detection: Collection of papers and other resources for object tracking and detection using deep learning

December 1, 2017June 13, 2018

Finding webcams

List all devices on windows using ffmpeg

ffmpeg -list_devices true -f dshow -i dummy

and on Linux

v4l2-ctl –list-devices

To get the device capabilites

ffmpeg -f dshow -list_options true -i video="Mobius"

where “Mobius” is the name of the camera.

On the Mac use

ffmpeg -f avfoundation -list_devices true -i ""

December 1, 2017December 1, 2017

How to write lossless video in Python

OpenCV does a reasonable job of reading videos from file or webcams. It’s simple and mostly works. When it comes to writing videos,
it however leaves a lot to be desired. There is little control over the codecs and it is almost impossible to know which codecs are installed. It also wants to know things like the frame size at intailisation. This isn’t always a problem, but if you don’t know it yet it means you have to set up the video writer inside your main processing loop.

To make something as cross-platform compatible as possible it would be nice to use FFmpeg. There are a few python wrappers around, but as far as I can tell they are mainly used for transcoding type applications. One solution is run FFmpeg as a subprocess and set its input to accept a pipe. Then every video frame is passed through the pipe. You write this yourself, in fact it’s only a few lines of code. However, the scikit-video package will do this for us, with some nice boilerplate to make life easier.

The steps are:

install FFmpeg — if you running on Linux use your system’s package manager if it’s not already installed. If you’re unlucky enough to be using Windows you need to download the zip file from here, and add the bin directory to your system’s path.
install scikit-video –I tried installing scikit-video via pip on my Anaconda distro but the version was too old. Instead, I cloned the github version and installed that. Instructions are provided on github.

Below is a simple example the grabs from your webcam and records lossless video.

#test recording of video
import cv2
import skvideo.io


capture=cv2.VideoCapture(0) #open the default webcam
outputfile = "test.mp4"   #our output filename
writer = skvideo.io.FFmpegWriter(outputfile, outputdict={
  '-vcodec': 'libx264',  #use the h.264 codec
  '-crf': '0',           #set the constant rate factor to 0, which is lossless
  '-preset':'veryslow'   #the slower the better compression, in princple, try 
                         #other options see https://trac.ffmpeg.org/wiki/Encode/H.264
}) 
while True:
    ret,frame=capture.read()
    if ret==False:
        print("Bad frame")
        break
    cv2.imshow('display',frame)
    writer.writeFrame(frame[:,:,::-1])  #write the frame as RGB not BGR
    ret=cv2.waitKey(10)
    if ret==27: #esc
        break

writer.close() #close the writer
capture.release()
cv2.destroyAllWindows()

view raw

pythonVideoCapture.md

hosted with

by GitHub

November 29, 2017

A curated list of awesome computer vision resources

Some useful Computer Vision links

Source: jbhuang0604/awesome-computer-vision: A curated list of awesome computer vision resources

November 29, 2017

A curated list of deep learning resources for computer vision

A really useful list of key papers using Deep learning in computer vision

Source: kjw0612/awesome-deep-vision: A curated list of deep learning resources for computer vision

January 3, 2017

dlib C++ Library: Python Stuff and Real-Time Video Object Tracking

Source: dlib C++ Library: Python Stuff and Real-Time Video Object Tracking

December 21, 2015December 21, 2015

Axis IP camera

I use axis IP camera a lot for capturing images and video. The image quality is great and they are highly customisable. I use a P1344 camera and it supports still images, MJPEG and H264. The still image are fine for capturing a one off, but too slow for video work. For this I need either MJPEG or H264. Both have there pros and cons.

MJPEG is existentially a sequence of JPEG images. It’s easy to use and the quality is good, depending on the compression settings. The downside is the bitrate over the network is larger than H264.

H264 is a lossy video compression format that has become ubiquitous on the internet these days for compressed video and blue-ray videos.

The camera can be controlled using the url. To get a H264 stream (using VLC in this case, but ffplay works perfectly well), at the command prompt type

vlc rtsp://192.168.0.103:554/axis-media/media.amp

changing the IP address to that of your camera. You can also add in the camera’s username and password using:

vlc "rtsp://192.168.0.103:554/axis-media/media.amp?user=XXX&password=XXXX"

The image resolution can be changed with

vlc rtsp://192.168.0.103:554/axis-media/media.amp?resolution=640x480

The resolution is camera dependent. There are a bunch of different settings, such as bit rate, compression, you can apply see the AXIS VAPIX documentation for the whole list. An easy way to do this is by using the cameras settings page to create a Stream Profile. There are a number built in and you can select them like so

 vlc rtsp://192.168.0.103:554/axis-media/media.amp?streamprofile=Quality

Still images can be captured, by using

http://192.168.0.103/axis-cgi/jpg/image.cgi?resolution=320x240&compression=25

here I’ve selected the resolution and compression factor. You can grab the image by placing the above url into a browser.

Image Quality

Looking at the above shows subregions of example images captured at full resolution using JPG. We can see that there is significant compression artifacts in the image even at low compression ratios. Setting the compression ratio less than 40 appears to have little effect on image quality.

H264 streams appear to be similarly affected. The bitmap image shows some improvement however the data rate to transmit this is considerably larger.

Note that the RMS errors are calculated from the JPEG image with a compression factor of 0.

Benchmarking

 ffplay "http://192.168.0.103/axis-cgi/mjpg/video.cgi?resolution=640x480&fps=15"

uses 60% of one core of my Odroid XU4 and 17% on my 2.7 GHz iMac

ffplay "rtsp://192.168.0.103:554/axis-media/media.amp?resolution=640x480&fps=15"

uses 88% on the Odroid and 14% on the iMac.

November 19, 2015November 30, 2015

Mathematical representation of images and optics.

The way we represent an image mathematically can have a big impact on our ability to mathematically manipulate it. Conceptually it would be simplest to represent an image (let’s assume it’s grey-level) as a 2D array. If my image is a 2D array $\mathbf{X}$ I could implement the effect a linear shift invariant blurring function $\mathbf{H}$ and produce an output image $\mathbf{F}$ via the convolution operator:

$\mathbf{F}=\mathbf{H} \ast \mathbf{X}$

I could do other things with this notation such as introduce a shift operator to move my image by one pixel

$\mathbf{F}=\mathbf{\acute {H}} \ast \mathbf{X}$

where $\mathbf{\acute {H}}=[0, 0 ,1]$ .

The problem is this is all shift invariant, the same blur or shift is applied to all the pixels in an image. What if the amount of blurring and shifting changes from pixel to pixel as it does in a real image due to imperfections in the camera’s lens? I would need a separate $\mathbf{\acute {H}}=[0, 0 ,1]$ for every pixel. A more convenient way is to drop the 2D convolution and implement our system using matrix multiplications. To do this we lexicography rearrange the 2D image matrix into a 1D vector

$\left[ \begin{array}{ccc} a & b & c \\ d & e & f \\ g & h & i \end{array}\right] \longrightarrow \left[ \begin{array}{c} a\\d\\g\\b\\e\\h\\c\\f\\i \end{array} \right]$

In Matlab this would be implemented with X1d=X(:) and we can transform it back to 2d with knowledge of the original number of rows and columns X=reshape(X1d,rows,cols).

For simplicity sake I shall reduce the number pixels in my image to 3. But what can we do with this? Well let’s look at a matrix multiply operation

$\left[ \begin{array}{ccc} a & b & c \\ d & e & f \\ g & h & i \end{array}\right] \left[ \begin{array}{c}x\\y\\z\end{array}\right] = \left[ \begin{array}{c} ax+by+cz\\dx+ey+fz\\gx+hy+iz\end{array}\right]$

each row in the matrix is like an operator on each pixel. I’ve effectively got a shift variant convolution. For example I could blur the first pixel and leave the rest the same

$\left[ \begin{array}{ccc} 0.33& 0.33 & 0.33 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array}\right] \left[ \begin{array}{c}x\\y\\z\end{array}\right] = \left[ \begin{array}{c} 0.33x+0.33y+0.33z\\y\\z\end{array}\right]$

Note that the 1s go down the diagonal.

I could implement a shift on the second pixel

$\left[ \begin{array}{ccc} 1& 0 & 0 \\ 0 & 0& 1 \\ 0 & 0 & 1 \end{array}\right] \left[ \begin{array}{c}x\\y\\z\end{array}\right] = \left[ \begin{array}{c} x\\z\\z\end{array}\right]$

by changing the values I could implement rotations and warps.

And we can combine several matrices together to define our system. If $\mathbf{S}$ is a shift matrix and $\mathbf{B}$ is a blurring matrix with can simply combine the results together

$\mathbf{F}=\mathbf{SBX}$

to describe our shift variant optical system.

An additional step we may wish to introduce the effect of sensor pixel size. We can implement this by making our original image have a much higher resolution and them use a decimation filter to reduce this to a low resolution camera image. To this we create a matrix with $N$ rows, which equals the number of pixels in the decimated image, and $M$ columns, which equals the number of pixels in the high resolution image.

$\left[ \begin{array}{ccc} 0.5& 0.5 & 0 & 0 \\ 0 & 0 & 0.5 & 0.5 \end{array}\right] \left[ \begin{array}{c}w\\x\\y\\z\end{array}\right] = \left[ \begin{array}{c} 0.5w+0.5x \\0.5y +0.5z\end{array}\right]$

shows how we can reduce the resolution by 1/2 in one dimension and we can easily extend this to 2D.

Update

It’s worth noting if the blurring is shift invariant (which is a lot easier to deal with) the matrix is block circulant. This means it is of the form

$\left[ \begin{array}{ccccc} d(0) & d(M-1) & d(M-2)& \ldots &d(1) \\ d(1) & d(0) & d(M-1)& \ldots &d(2)\\ d(2) & d(1) & d(0)& \ldots &d(3)\\ \vdots&\vdots&\vdots&\vdots&\vdots&\\ d(M-1) & d(M-2) & d(M-3)& \ldots &d(0) \end{array}\right]$

Note, each row is a shifted version of the one above it. The reason, this is important is that it is easy to invert.