The Mystery Staring Us In The Face: Our Ability To Perceive Fluid Motion From Video Images
The Key To Unlocking The Mystery That Is Staring Us In The Face And What It Can Tell Us About Reality


Have you ever thought about what is happening when you watch video images flashing on a screen, yet all you see is fluid motion? Forget about how it happens for the moment. We are far too locked into explaining everything as the result of mechanisms moved by hypothetical entities like dark matter and dark energy, and by abstruse mathematical incantations, to reach escape velocity in our rational efforts to think through this mystery.
Think instead about why you can do it. Not everything has a purpose, it is said, but everything has a cause within our paradigms of physical causality and natural selection, which together purport to encompass all the possible ways things can interact and come to be. Do they encompass our ability to experience fluid motion in streams of flashing images? Not at all. Let’s see why:
Stare at a video of random images flashing at 20 to 77¹ images per second and you can see each individual image clearly. Change to a video whose images are consistent in content, but in which at least one identifiable object differs to some degree from one image to the next, and you experience fluid motion. You do not see flashing images. You see things in motion against a background, as fluidly as you see the real motion of the world around you. Only you should know that the motion isn’t in the video you are watching; it’s in you. Somehow your brain realizes that the video images have a coherent continuity between them, and you experience that continuity rather than the individual images, as you do when the images are random.
But here is the problem: in order for your brain to accomplish this feat of perceptual sleight of hand, however it does it, it has to know what the next image contains, so that it knows to construct the fluid motion and substitute it for the first image, which is already gone. And it has to do this for every image you see. So how does the brain know to hold off on perceiving the individual image before the next image appears? Barring the impulse to invoke Whovian time travel that shoots us back a frame at a time as each new frame appears, like an ancient French mitrailleuse (bang bang bang), and staying grounded in the mechanisms that are all the rage today, what would our brain need to do to know whether to show each frame or, instead, the constructed motion between two frames?
And more problematically, why would your brain have had such an ability at the ready for tens of thousands of years before humans invented video technology? After all, we invented it only after discovering that flashing images lead to an experience of fluid motion, which means we already had the ability. There are other possible ways to explain this capability in mechanistic terms. One could posit, for example, that the motion is the result of a prediction based upon the prior image alone, without reference to the next image, and that the same predictive processing then refines the motion using the contents of both the first and second images, and so on. But before running down every possibility, I think it prudent to first answer this question: what possible survival advantage would lead natural selection to favor the ability to construct fluid motion from streams of flashing images, in a world with neither movie theaters nor video screens? And more to the point, why would human brains do all this extra processing over tens of thousands of years for no benefit, since there are no naturally occurring movies in the wild? Not then, and not now!
Even more disconcerting is that other species of animals also have this ability, parrots among them!
The time delay and energy wastage² of it boggle the mind and strain credulity: modern smartphones perform trillions of calculations just to parse the contents of a single camera image, so the difficulty of the process is known and shouldn’t be ignored. Luckily for us, our visual system sees motion without having to create it from still images, because our eyes aren’t cameras taking still pictures. We see fluidly! Not click click click staccato.
And the final insult to our intelligence is the ‘scientific’ explanation for how the motion is constructed: images persist in the eye, and when a new image comes in, it overlays the previous image, causing apparent motion. An image that persists because the film didn’t advance, with another image overlaid upon it, was once known as a ‘double exposure’. Today it is called apparent motion. But even if you have never heard of a double exposure before, I am sure you can imagine what overlaid images would look like, and they look nothing at all like the fluid motion our brains construct from streams of flashing images. Take this painting, which uses overlaid images to depict a nude descending a staircase.

To suggest that the persistence (i.e., stillness) of an image somehow creates motion is sheer idiocy. There is no deus ex machina that is going to swoop in and save us from this impossibility.
One last point that it will profit you to consider: if the resulting scene, filled with fluid motion, is constructed by our brain, why does it have the visceral feeling that it does for us, since it is only a constructed perception? And if it is our brain that is doing it, why can’t our brain give our memories that same visceral feel? Perhaps there is more than the brain active here.
As we will see, these questions yield a bounty of insights as we proceed through this book.


Footnotes:
¹ “The results of both experiments show that conceptual understanding can be achieved when a novel picture is presented as briefly as 13 ms and masked by other pictures.” Note: 13 ms per picture is approximately 77 pictures per second (1 ÷ 0.013 s ≈ 76.9). Quoted from: Potter, M. C., et al. (Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology). Detecting meaning in RSVP at 13 ms per picture. Attention, Perception, & Psychophysics 76, 270–279 (2014). https://doi.org/10.3758/s13414-013-0605-z
² There is a salient difference between the approaches of neuroscientists and computer scientists regarding how much time the requisite processing takes and how much energy it requires. Neuroscientists do not seem, on the whole, to concern themselves with these factors: since the brain is evidently already doing the processing, all that remains is to understand how the brain does it, with no need to weigh proposed solutions against the time and energy they would require. Computer scientists are not so constrained in their reasoning, and they are universally concerned with how much time a computer takes to do the processing and how much energy is required to support it. For example, in the introduction to a recent paper dealing with ways of implementing fluid motion for various machine deployments, time and energy requirements are treated as an expected aspect of any solution: “Dynamic machine vision (DMV) technology has numerous significant applications in video analysis, robotic vision, self-driving technology, and intelligent transport. The ability to use present vision to recognize past motion and predict future trajectories is crucial in DMV. Current imaging systems utilize multiple modules, including sensors, signal converters, memory, and processors, to recognize and predict motion by analyzing massive frame-by-frame image sequences and using complex algorithms, engendering redundant data flows and high-energy consumption.” Found in: Tan, H., van Dijken, S. Dynamic machine vision with retinomorphic photomemristor-reservoir computing. Nat Commun 14, 2169 (2023). https://doi.org/10.1038/s41467-023-37886-y





