Swarm Vision and The New Everyday

Contributed by George Legrady, UC Santa Barbara
February 15, 2014


The “New Everyday” is the mapping, measuring, and data collection of everything local and global. Sensing systems measure heat, sound, and light: in essence, waves and vibrations from the subatomic to deep space. Imaging systems function to capture presence, to record change, to see beyond human range, to create evidence, and to stimulate action. Extensive research and development, stimulated mostly by work in medicine, surveillance, and the space program, has gone into automating sensing systems, and the applications have spread to many other fields, in particular industrial fabrication, the collection of bio-data, commercial storage systems, and architectural design, among others.

Computational systems allow for automated sensing through the sampling of natural phenomena and physical processes, resulting in a representation of the real. Based on the sampled data and the analysis of behavioral processes, such systems produce outcomes that mimic nature: artificial constructs that are not accurate to the way the world is structured but reflect the way we imagine the world.

In my evolution as an artist focused on the photographic image, I was invested in the study of why the flat 2D photographic representation is so engaging to humans. This led to many projects that explored how photographic images are constructed, how they produce coherence, and how they convey meaning, or how cultural meaning is encoded and decoded in images. Whereas anticipation and excitement are part of the arrival of the photograph one has taken, as it implies discovery, a kind of “what can I see through capture by a machine that I would not have perceived otherwise”, the image-taking process is situated in an uncertain space, one of “let’s see what happens if I try this”.

Image framing is a process of organizing visual forms in front of the camera lens through the act of composition, based on the distance relation to the scene and on the selection of what to keep in the image and where things need to be positioned. Professional image-makers are good at recognizing high-level visual characteristics and relationships in images that may go unnoticed by an untrained eye. While much of this analysis involves intuition, there are syntactic rules that each image-maker will implement to suit their specific approach to image construction. Some of these include subdivision of spaces, complementary forms, contrast, use of texture, spatial balancing, depth, articulation of texture, patterns, repetitions and rhythm, motion, clarity and blur, and challenging conventional standards of how to implement these principles.

A set of standard methods has developed in computer image processing to analyze images based on some of these attributes of how images are structured. The field, called “Computer Vision”, consists of computational methods for analyzing the elements within an image, such as identifying objects through shape (face recognition, for instance), complex textures, patterns, etc. In most cases, the effort of the field is to computationally and automatically interpret 3D scenes in 2D images given the properties of the image. This research takes place at different levels, from extracting basic details such as edges and corners, to recognizing forms and objects, to interpreting the activity or behavior that may be going on in the image. A related field, called “Machine Vision”, applies such processes to automated inspection and to implementations in medicine, industrial manufacturing, and other areas.

The “Swarm Vision” installation [i] was initially inspired by studies of how to automatically generate a novel, dynamically changing type of image composite, continuously transformed through the movement of three camera views. The project eventually morphed into an automated visual system with multiple cameras, in which the cameras would be trained to look around the space they inhabit, to feature their behavior over time, and to record what they see and how they perform. A measuring process built into the system lets each camera rate its performance according to some rule and inform the others of how it is performing. We invested our efforts in studying the image from a syntactic perspective, listing the set of possible visual constructions and aiming for aesthetic qualities rather than the cultural meaning of what the visual results represent.

To arrive at a computational system that can translate aesthetically defined perception and image construction, some fundamental questions had to be addressed. To what degree can such complex human behavior be formally described in mathematical language? Is it possible to identify systematic elements or rules that artists look for when analyzing works of art? Artists also arrange these elements in interesting ways to create new compositions. These rules may serve as insights for automating such behavior using computer vision and graphics techniques.

What steps would be required to replicate automatically, using a computer, how a human comprehends a photograph, and then to create one based on the rules that have been identified? Computer Vision provides some readily implementable techniques, such as: 1) object detection, where the analysis involves detecting several objects in the image, such as faces, arms, cameras, cars, and pedestrians, or at least basic forms; 2) object pose estimation, where the “directions of movement” are given by lines, forms, face and arm orientations, and the motion directions of moving objects; 3) segmentation and edge detection to identify forms.
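As an illustration of the third technique, edge detection can be sketched in a few lines. The essay does not say which edge detector the project used; the Sobel operator below is a common textbook choice swapped in for illustration, and the function names `sobel_magnitude` and `edge_mask` are hypothetical.

```python
# Minimal Sobel edge detector on a grayscale image stored as a list of rows.
# Illustrative only: the choice of Sobel is an assumption, not the project's method.

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to vertical edges
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to horizontal edges


def sobel_magnitude(image):
    """Return the gradient magnitude of each interior pixel (borders stay 0)."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0.0
            # Weight the 3x3 neighborhood around (x, y) by both kernels.
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


def edge_mask(image, threshold):
    """Mark pixels whose gradient magnitude reaches the threshold with 1."""
    return [[1 if value >= threshold else 0 for value in row]
            for row in sobel_magnitude(image)]
```

A search for straight lines or complex textures could plausibly start from a mask like this one, for example by counting edge pixels in a region, though that step is not described in the text.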

A basic description of the “Swarm Vision” system is as follows: The system consists of three cameras on rails. Each camera continuously searches the scene to fulfill its individual computer vision task. Camera 1 searches for brightest color, camera 2 for straight lines, camera 3 for complex textures. Each camera’s level of efficiency is measured, rated, and stored in memory. When a camera achieves a high rating (marked by a yellow frame) it forces the other two cameras to abandon their searches and look where the dominant camera is looking, thereby disrupting the other two cameras’ activities to concentrate on one location.
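The coordination rule described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the installation's actual code: the `Camera` class, the rating scale of 0 to 1, the 2D target coordinates, and the `DOMINANCE_THRESHOLD` value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Target = Tuple[float, float]   # assumed (x, y) location in the scene

DOMINANCE_THRESHOLD = 0.8      # assumed cutoff for a "high rating"


@dataclass
class Camera:
    name: str
    task: str                         # e.g. "brightest color"
    rating: float = 0.0               # latest score on its own task, assumed in [0, 1]
    own_target: Target = (0.0, 0.0)   # where its own search wants to look
    look_at: Target = (0.0, 0.0)      # where it is actually pointed
    dominant: bool = False            # would correspond to the yellow frame


def find_dominant(cameras: List[Camera]) -> Optional[Camera]:
    """Return the highest-rated camera if it reaches the threshold, else None."""
    best = max(cameras, key=lambda cam: cam.rating)
    return best if best.rating >= DOMINANCE_THRESHOLD else None


def coordinate(cameras: List[Camera]) -> None:
    """Apply the swarm rule: a dominant camera redirects the other cameras."""
    leader = find_dominant(cameras)
    for cam in cameras:
        cam.dominant = cam is leader
        # With a leader, every camera looks where the leader looks;
        # otherwise each camera continues its own search.
        cam.look_at = leader.own_target if leader is not None else cam.own_target
```

Calling `coordinate` once per update cycle would keep the two following cameras locked on the leader's location only while its rating stays above the threshold; how long the real system holds that state is not specified in the text.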

Screen I features 3 camera views and performance rating. Camera 1: best color, camera 2: straight lines, camera 3: high textures

“Swarm Vision” outputs to two screens to convey its performance. The first screen shows the three individual camera views; the second shows a 3D virtual overview of the scene that features the space, the three cameras, and the images they produce. This overview consists of real-time visualizations of the images the cameras capture, positioned in a spatially reconstructed representation of their three-dimensional visual environment. Each camera’s images are placed at the distance of focus in the virtual space, generating emergent sculptural forms out of the overlaid flat images, which are positioned in relation to each other within the virtual space. In the exhibition setting, visual segments of spectators who enter the viewing space populate the images, leaving an imprint of their presence that is later erased as the images sequentially fade away.

The cameras’ overall visual range is continuously rated, and the ratings are stored in a matrix of cells in which each cell corresponds to a part of the scene. The stored values give the system a “memory”, which is used to control how often a camera will revisit a specific location. The memory works as a feedback system in which 1) the image-processing system determines what is interesting in the scene and computes a target location to look at; 2) the movement calculation step determines how the robot must orient itself in order to look directly at that location, and its output is the pan, tilt, zoom, and rail position the robot should adopt; 3) lastly, the camera changes its orientation and its location on the rail, giving it a new view of the scene. This new view is seen by the lens and fed into the image-processing algorithm, and the result is referenced in the future when the camera has to decide where to look.
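The memory grid and the movement calculation can be sketched roughly as follows. The essay does not specify how stored ratings translate into revisit frequency, so the priority heuristic here (rating divided by visit count) is an assumption, as are the `SceneMemory` class, the `pan_tilt` function, and the coordinate convention. Zoom and rail position are left out.

```python
import math


class SceneMemory:
    """Grid of cells over the scene; each cell keeps a rating and a visit count."""

    def __init__(self, rows, cols):
        self.rating = [[0.0] * cols for _ in range(rows)]
        self.visits = [[0] * cols for _ in range(rows)]

    def record(self, row, col, score):
        """Fold a new score into the running average stored for this cell."""
        self.visits[row][col] += 1
        count = self.visits[row][col]
        self.rating[row][col] += (score - self.rating[row][col]) / count

    def next_cell(self):
        """Choose where to look next: unvisited cells first, then best priority."""
        best, best_priority = None, -1.0
        for r, row in enumerate(self.rating):
            for c, rating in enumerate(row):
                count = self.visits[r][c]
                # Assumed heuristic: priority drops as a cell is revisited.
                priority = float("inf") if count == 0 else rating / count
                if priority > best_priority:
                    best, best_priority = (r, c), priority
        return best


def pan_tilt(camera_pos, target_pos):
    """Angles in degrees that aim a camera at a target.

    Assumed convention: x to the right, y forward, z up; pan 0 faces +y.
    """
    dx = target_pos[0] - camera_pos[0]
    dy = target_pos[1] - camera_pos[1]
    dz = target_pos[2] - camera_pos[2]
    pan = math.degrees(math.atan2(dx, dy))
    tilt = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return pan, tilt
```

In a loop, `next_cell` would supply the target location of step 1, `pan_tilt` would stand in for the movement calculation of step 2, and `record` would store the rating of the resulting view so that it can inform later decisions.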

Screen II: 3D virtual overview of exhibition space, camera locations colored red, green, blue, and the images they produce.

The project builds conceptually on the history of artistic explorations that address the time-space-motion image, beginning with the 19th-century scientist, photographer, and inventor of motion studies Étienne-Jules Marey; the Bauhaus artist and photographer Moholy-Nagy; the closed-circuit video studies of the 1960s by Bruce Nauman and Dan Graham; Michael Snow’s machine-determined aesthetics in “La Région Centrale” (1971); and the Vasulkas’ projects “Allvision” (1976) and “Machine Vision” (1978). Many new media artworks have been realized in Europe and Japan in the past 20 years. The Berlin collective ART+COM’s 1995 “The Invisible Shape of Things Past” is an early reference for how to place video footage in a 3D virtual spatial representation.

The work embraces aesthetic directions in photography, conceptual art, interactive installation, computational aesthetics, performance, and digital visual studies. It is both an artistic and an engineering-focused work, a first step for the three collaborators toward describing aesthetics formally as a rule-based, mathematically scripted sequence. Its creation has required a collaboration among experts in art, controls engineering, and computer vision to address the question of how we visually study a space. The complexity of the effort it takes for a human to make sense of a visual scene is explicitly explored.

[i] The project was developed at the Experimental Visualization Lab in the Media Arts & Technology program, UC Santa Barbara. Funding support included a grant from the National Science Foundation, Information and Intelligent Systems, and the Robert W. Deutsch Foundation. The three collaborators are George Legrady (concept, management, and aesthetic direction); Danny Bazo (platform development, aesthetics, and robotic behavior development); and Marco Pinter (robotic and artistic behavior development focused on kinetic visualization). A prototype version was presented at the SIGGRAPH 2013 Art Gallery, Anaheim, July 2013.

The exhibition premiered at Vox Gallery Montreal in “Drone: The Automated Image”, Le Mois de la Photo à Montréal, September 2013. Other venues have included a solo exhibition at the Run Run Shaw Creative Media Centre, Hong Kong, November-December 2013, and a single camera version at the Miami Art Fair, December 2013.
