14 Image Signal Processing
This chapter reviews basic image signal processing algorithms. The output of an image sensor is what we usually call the raw pixels. The raw pixels are not the usual RGB images we are used to seeing. For starters, if we use the common CFA approach for color sensing, each raw pixel has only one color channel response; the two missing channel responses must be recovered. We have also so far ignored noise, which is introduced at every step along the signal transduction chain from incident light to raw values. Therefore, the raw pixels usually go through a post-processing pipeline before the sensor output can be consumed, whether by a human observer or, increasingly, by machine vision algorithms.
In modern cameras, that pipeline is implemented by a special hardware accelerator called the Image Signal Processor (ISP), which is an Intellectual Property (IP) block in a mobile System-on-a-Chip (SoC). Implementing the post-processing algorithms in dedicated hardware makes a lot of sense from an efficiency perspective: when you press a button to capture an image, you certainly do not want to wait a long time or burn a lot of energy before the image is shown to you. As many mobile vendors do not actually control the optics and the sensor, the ISP has increasingly become the key product differentiator. As a result, many companies have their own custom ISP designs; for instance, Qualcomm’s Snapdragon SoCs have their own Spectra ISP.
Many texts exist on general ISP algorithms (Ramanath et al. 2005; Karaimer and Brown 2016) and on the hardware design (Hegarty et al. 2014), which we refer you to. We also have a pedagogical ISP written in Python (Zhu 2022a) that is a good reference, too. The goal of this chapter is to walk through the general pipeline and point out the main ideas. One thing worth emphasizing here is that the ISP design is strongly influenced by the downstream task that consumes the output of the ISP. The two main consumers are human vision and machine vision. The former cares about visually pleasing images, whereas the latter does not, as long as the key semantic information is retained and can be extracted.
14.1 General Pipeline
Figure 14.1 shows a general ISP pipeline and how it fits into the entire imaging pipeline. The ISP takes the raw pixels generated by the sensor and generates two types of output: the finished image, usually encoded in sRGB color space and compressed, and statistics of the image that are used to drive the so-called “3A” algorithms, i.e., auto white balance (AWB), auto exposure (AE), and auto focus (AF). We will not have much time to discuss the 3A algorithms, but they can be thought of as feedback controls over the rest of the imaging system: AWB controls the white balancing stage in the ISP, AE controls the exposure time of the image sensor, and AF controls the lens movement in the optics. The 3A algorithms usually run on the host CPU or an MCU because they are relatively simple computationally.
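To make the feedback-control view concrete, here is a minimal sketch of an AE loop. It is only an illustration under stated assumptions: the sensor is a toy linear model that clips at 1.0, the ISP statistic is the mean raw level, and the update is a damped multiplicative correction. The function names, the target mean of 0.18, and the damping factor are all hypothetical and do not describe any production AE algorithm.

```python
def simulated_mean_level(scene_radiance, exposure_time):
    """Toy sensor model: the mean raw level grows linearly with exposure
    time and clips at 1.0 (full well). Purely illustrative."""
    return min(1.0, scene_radiance * exposure_time)


def auto_exposure_step(exposure_time, measured_mean, target_mean=0.18, damping=0.5):
    """One feedback update: scale the exposure so the mean moves toward the target.

    damping < 1 makes the controller take partial steps, which avoids
    oscillation when the measurement is clipped or noisy.
    """
    # Guard against a fully dark frame to avoid dividing by zero.
    ratio = target_mean / max(measured_mean, 1e-6)
    return exposure_time * ratio ** damping


def run_auto_exposure(scene_radiance, exposure_time=0.1, num_frames=30):
    """Iterate: capture a frame, read the ISP statistic, adjust the sensor."""
    for _ in range(num_frames):
        mean = simulated_mean_level(scene_radiance, exposure_time)
        exposure_time = auto_exposure_step(exposure_time, mean)
    return exposure_time
```

In this toy model the first few frames are overexposed and clipped, so the controller can only tell that the exposure is too long; once the mean falls below the clipping level, the error shrinks geometrically from frame to frame.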
The ISP pipeline shown here is a general architecture that covers roughly what an ISP has to do. Keep in mind that the exact stages and their arrangement are proprietary and vary by vendor. Regardless, all ISPs operate on a set of basic principles.
Recall that the raw pixel values output by the sensor should ideally be proportional to the scene luminance, but this is hardly the case in reality. The first thing an ISP does is to recover luminance-proportional values from the raw sensor output; this includes three main steps.
- The first step is called dark signal subtraction (DSS) or black-level subtraction. This is necessary because even when pixels receive no light at all, their raw values are usually not zero. The cause is “dark current”, formed by thermally dislodged electrons even in the absence of incident photons. Measuring the raw values of “optical black” pixels (Kameda 2012) in each frame and subtracting those values allows us to eliminate the effect of the dark current [1].
- The second step is called flat-field correction or lens-shading correction (LSC). It accounts for the fact that the raw pixel values (with dark current subtracted) are spatially non-uniform, a phenomenon called “vignetting”, where even under uniform (flat-field) illumination peripheral pixels receive fewer photons. It has a variety of causes: the mechanical design (including microlenses) of the camera blocks more light toward the edge of the pixel array, the radiance falls off (the \(\cos\theta\) term) when rays are incident at oblique angles, etc. We can pre-calibrate this non-uniformity, store it as an image, and compensate for the non-uniformity in each frame.
- The final step is denoising. Many excellent discussions of noise sources exist, such as Boukhayma (2018), Nakamura (2006, Chpt. 3.3), and Rowlands (2020b, Chpt. 3.8-3.9), which we refer you to. Regardless of the noise source, the general strategy of denoising is low-pass filtering: blur is subjectively less objectionable than noise.
It is important that these steps are taken at the very beginning of an ISP: if the pixel values are noisy, any subsequent manipulation of the pixels also manipulates, and sometimes amplifies, the noise.
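The three linearization steps above can be sketched in a few lines of NumPy. This is a minimal sketch under several assumptions that are not from the text: the black level is taken as the mean of the optical-black pixels, the flat-field calibration is stored as a per-pixel gain map, and denoising is a plain box filter applied separately to each of the four same-color planes of a Bayer mosaic so that color channels are not mixed. All function names are hypothetical, and real ISPs use considerably more sophisticated filters.

```python
import numpy as np


def dark_signal_subtraction(raw, optical_black):
    """Subtract the black level estimated from this frame's optical-black pixels."""
    black_level = float(np.mean(optical_black))
    # Clamp negatives, which can appear because of noise.
    return np.clip(raw.astype(np.float64) - black_level, 0.0, None)


def flat_field_correction(img, gain_map):
    """Multiply by a pre-calibrated per-pixel gain map to undo vignetting."""
    return img * gain_map


def box_denoise(plane, radius=1):
    """Low-pass filter one plane with a (2r+1)x(2r+1) box kernel.
    Borders are handled by replicating edge pixels."""
    k = 2 * radius + 1
    h, w = plane.shape
    padded = np.pad(plane, radius, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)


def denoise_bayer(mosaic, radius=1):
    """Filter each of the four same-color planes of a Bayer mosaic separately."""
    out = np.empty_like(mosaic, dtype=np.float64)
    for r in (0, 1):
        for c in (0, 1):
            out[r::2, c::2] = box_denoise(mosaic[r::2, c::2], radius)
    return out
```

Under these assumptions, the gain map would be obtained offline by capturing a uniformly lit target, subtracting the black level, and taking the reciprocal of the normalized response.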
After that, we can assume that the raw pixels carry physical meaning: they are proportional to luminance, although, because of the CFA, the color information is spatially sampled. The raw pixels before demosaicing are usually said to be in the Bayer domain. The next stage is demosaicing, which essentially reconstructs the color information (all three channels) from the single-channel samples. While many reconstruction filters/kernels exist, the simplest, and one of the most commonly used, is the bilinear filter.
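As an illustration, bilinear demosaicing can be written as a normalized convolution: each channel keeps its own samples and fills in the missing sites with the average of the nearest same-color neighbors. The sketch below assumes an RGGB Bayer layout (red at even rows and even columns), which the text does not specify, and uses hypothetical function names; the zero-padded, normalized form also gives sensible averages at the image borders.

```python
import numpy as np


def conv3x3(img, kernel):
    """3x3 correlation with zero padding (the kernels used here are symmetric)."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="constant")
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def demosaic_bilinear(mosaic):
    """Bilinear demosaicing of an RGGB Bayer mosaic into an (H, W, 3) image."""
    mosaic = mosaic.astype(np.float64)
    h, w = mosaic.shape

    # Binary masks marking where each channel was actually sampled (RGGB assumed).
    masks = np.zeros((3, h, w))
    masks[0, 0::2, 0::2] = 1  # red
    masks[1, 0::2, 1::2] = 1  # green on red rows
    masks[1, 1::2, 0::2] = 1  # green on blue rows
    masks[2, 1::2, 1::2] = 1  # blue

    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=np.float64)
    k_g = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]], dtype=np.float64)

    rgb = np.empty((h, w, 3))
    for ch, kernel in enumerate((k_rb, k_g, k_rb)):
        # Weighted sum of available samples divided by the sum of their weights.
        weighted = conv3x3(mosaic * masks[ch], kernel)
        weights = conv3x3(masks[ch], kernel)
        rgb[..., ch] = weighted / weights
    return rgb
```

Because the kernel weights vanish wherever a channel was not sampled, the original samples pass through unchanged and only the missing values are interpolated.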
The demosaiced color information is encoded in the sensor’s native color space, because the sensor’s SSFs almost certainly do not match the cone fundamentals. So the next stage is to transform colors from the sensor’s native color space to a standard color space such as the CIE XYZ space. We have built an interactive tutorial that walks you through this correction process, which you are invited to play with (Zhu 2022b). White balancing is usually implemented along with color correction, because both involve linear transformations of colors (Rowlands 2020a; Zhu 2021).
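Because both operations are linear, they can be folded into a single 3x3 matrix applied to every pixel. The sketch below shows this under stated assumptions: white-balance gains are derived from a patch known to be neutral and normalized to the green channel, and the color correction matrix is a made-up placeholder rather than a calibrated sensor-to-XYZ matrix. All names and numbers are hypothetical.

```python
import numpy as np


def gains_from_neutral(neutral_rgb):
    """White-balance gains that equalize the channels of a known-neutral patch."""
    neutral_rgb = np.asarray(neutral_rgb, dtype=np.float64)
    return neutral_rgb[1] / neutral_rgb  # normalize to the green channel


def apply_matrix(image, matrix):
    """Apply a 3x3 matrix to every pixel of an (H, W, 3) image."""
    return image @ matrix.T


# Hypothetical calibration data, for illustration only.
NEUTRAL_PATCH = np.array([0.40, 0.80, 0.50])  # sensor response to a gray patch
CCM = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.0, 0.1, 0.9],
])  # placeholder, not a real sensor-to-XYZ matrix

wb = np.diag(gains_from_neutral(NEUTRAL_PATCH))
combined = CCM @ wb  # white balance first, then color correction
```

Applying `combined` once per pixel gives the same result as applying `wb` and then `CCM`, which is why the two stages are natural to implement together.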
Usually there is a tone mapping stage in the ISP. The dynamic range of the raw pixels is usually (but not always) larger than that of a typical output medium (e.g., a display or a print). Tone mapping operators map signals between the two dynamic ranges so that the output image is visually appealing.
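As one concrete example of a tone mapping operator, the sketch below uses the simple global Reinhard curve \(x / (1 + x)\). This is a well-known operator chosen here only for illustration; it is an assumption and not necessarily what any particular ISP implements. The `exposure` scale factor and the function name are hypothetical.

```python
import numpy as np


def reinhard_tonemap(linear, exposure=1.0):
    """Compress linear values in [0, inf) into [0, 1) using x / (1 + x).

    Small values are left nearly unchanged, while large values are
    compressed smoothly toward 1 instead of being clipped.
    """
    scaled = exposure * np.asarray(linear, dtype=np.float64)
    return scaled / (1.0 + scaled)
```

For instance, linear inputs of 0, 1, and 3 map to 0, 0.5, and 0.75, so highlights are compressed far more than shadows.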
The final output is usually compressed, either through an image compression algorithm (e.g., JPEG) or, in the case of video capture, a video compression algorithm (e.g., H.264).
14.2 Two Trends
First, it has become increasingly common to co-design ISP algorithms, along with optics and image sensor design, with the downstream tasks. This is particularly important for machine vision, which is not concerned with the traditional goal of an ISP, i.e., generating visually pleasing images. A co-design between the ISP and the machine vision algorithms could potentially improve both task quality and efficiency.
Second, a great deal of recent effort has gone into exploring the notion of a “neural ISP”, which simply replaces part, or the entirety, of the ISP pipeline with deep neural networks (DNNs). The learning paradigm has two main advantages: it replaces some of the heuristics in traditional ISP designs, and it allows the algorithm to be updated more easily without having to wait for the next generation of the product. The latter is possible because a neural ISP pipeline can run on the DNN accelerator that almost all modern mobile SoCs have, so updating the algorithm amounts to updating the model weights.
The key issue with neural ISPs is speed and efficiency: a neural ISP model executed on a generic DNN accelerator is likely much slower and more energy-hungry than a traditional ISP. So it is more likely that neural ISPs will find their main uses in offline image processing and photo finishing rather than in the real-time imaging pipeline.
[1] Note, however, that DSS does not eliminate the effect of dark current shot noise, which results from fluctuations in dark current during the exposure time (so the dark current is in theory different for different pixels even if they are at exactly the same temperature), or the effect of dark current non-uniformity, which results from spatial differences in dark current across pixels (because of, for instance, spatial temperature differences).