In the 1970s, it was found that visual input is processed in different bands of spatial resolution and that these bands are, to some extent, treated separately. They were then called spatial frequency channels; they are now called spatial scales.
Later, in the 1980s, I and others added two further findings. First, we found that although the scales were acquired independently, they were combined quite inflexibly at a very early stage. Observers can’t make judgements on the information in the individual scales, only on their combination (which does preserve most of the information). We were able to go as far as to see that the coarser scales were used to group together details from finer scales. The mechanism for this appeared to be very simple, involving nothing more than half-wave rectification and addition.
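To make "half-wave rectification and addition" concrete, here is a minimal sketch in Python of one possible reading of that mechanism, applied to a one-dimensional luminance profile. The details are assumptions for illustration and do not come from the original experiments: each scale is modelled as a difference of two Gaussian blurs, the sigmas and the 1.6 ratio are arbitrary choices, and the function names are hypothetical.

```python
import math

def gaussian_kernel(sigma):
    """Normalised 1-D Gaussian weights, truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur(signal, sigma):
    """Blur a 1-D signal, clamping indices at the ends."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

def band_responses(signal, sigmas):
    """One band-pass response per scale (assumed: difference of Gaussians)."""
    return [[a - b for a, b in zip(blur(signal, s), blur(signal, 1.6 * s))]
            for s in sigmas]

def combine_scales(bands):
    """Half-wave rectify each band, then add the halves across scales."""
    n = len(bands[0])
    positive = [0.0] * n
    negative = [0.0] * n
    for band in bands:
        for i, value in enumerate(band):
            positive[i] += max(value, 0.0)  # keep only the positive half
            negative[i] += min(value, 0.0)  # keep only the negative half
    return positive, negative

# A step edge: dark on the left, bright on the right.
signal = [0.0] * 32 + [1.0] * 32
bands = band_responses(signal, [1.0, 2.0, 4.0])  # fine, medium, coarse
positive, negative = combine_scales(bands)
```

In this sketch, only `positive` and `negative` are passed on. The three individual bands cannot be recovered from those two sums, which is in the spirit of the finding that observers can judge the combination but not the separate scales.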
The second finding was that although all the scales are present to be seen at the start of the visual process, they are there only as texture, not as a spatial layout. The visual process that creates a representation of spatial layout scans from coarse to fine scales over time, potentially taking as much as a second to reach the fine scales.
When you look at a tree, initially you see the shape of the tree and the layout of its major limbs, with all of the tree's finer detail as a texture overlaid on, or alongside, the coarse spatial structure of the tree. Texture can be thought of as a representation that notes there are leaves and some general characteristics of the leaves, but not what their spatial layout is.
Over time, your vision replaces the texture representation of the leaves and other fine details with a spatial representation at those finer scales. Eventually, the location of the individual leaves is represented.