12 Image Sensor Architecture
This chapter discusses image sensors, the devices that transform optical signals to electrical signals. We start from the basic principle that governs this signal transduction inside a pixel and then discuss how pixels are architected together to form an image sensor. We then turn to various in-sensor optics, which are not necessarily important for forming images but are important for forming visually pleasing images that, for instance, have realistic colors and are free of aliasing effects.
12.1 Overview
The main job of the sensor is to turn optical signals, i.e., the optical image impinging on the sensor plane, into electrical signals, i.e., digital images. This conversion is broken down into two steps, first by converting photons to charges followed by turning charges into digital numbers.
Figure 12.1 (a) shows a cross-sectional view of the sensor hardware, which has three main components.
- First, there is a set of optical elements sitting on the sensor. These optical elements are not the imaging optics we discussed in the previous chapter because their main goal is not to form an image.
- Second, under these optical elements are the photodiodes, which turn optical signals carried in photons to electrical signals in the form of electric charges.
- Third, interleaved with the photodiodes is the circuitry that processes the output of the photodiodes, turning charges into digital values.
From a computational perspective, we can model an image sensor as a signal processing chain, a transfer function \(f\), that transfers the optical signal to the electrical signal. Figure 12.2 visualizes this chain of signal processing. This chain of processing is best understood as computing on random variables. The input optical signal can be seen as a random variable \(R_o(\mu_o, \sigma_o)\) with a mean and standard deviation of \(\mu_o\) and \(\sigma_o\), respectively. Every step in the signal processing chain not only manipulates the signal itself but also introduces/affects the noise. As a result, the output electrical signal is another random variable \(R_e(\mu_e, \sigma_e)\). So the transfer function, viewed this way, is:
\[ \begin{align} f: (\mu_o, \sigma_o) \mapsto (\mu_e, \sigma_e), \end{align} \]
Any imaging session can be seen as drawing a concrete value from the distribution of \(R_o\), and its output (raw pixel values) can be seen as drawing a value from the distribution of \(R_e\). An important goal of our study is to build an analytical model for this transfer function \(f\). For simplicity, we will first ignore noise as if \(f\) operates only on the mean signal. We will then discuss the sources of noise and how to model them.
There are two ways the pixels and the wires that read out the pixel outputs are physically arranged, shown in Figure 12.1 (b). In the back-side illumination (BSI) arrangement, the wiring of the circuitries is behind the photodiodes, which directly interface with the lights. In the front-side illumination (FSI) arrangement, the metal wiring sits between the light and the photodiodes. This means light could be absorbed and scattered through the metal layer before reaching the photodiodes, reducing the chance of a photon being properly captured. While earlier image sensors used FSI because it is easier to manufacture, almost all commercial image sensors use BSI now (Swain and Cheskis 2008).
FSI is actually quite similar to the structure of human eyes, where, if you recall, the photoreceptors are “hiding” behind other retinal neurons such as the retinal ganglion cells, which are functionally the last layer of retinal processing but anatomically sitting at the first layer on the retina. Different from the FSI sensor, however, the non-photoreceptor neurons on the retina do very little to light: they do not absorb or scatter light much and can be generally thought of as transparent. Metal wires, of course, disrupt incident photons significantly.
12.2 From Photons to Charges and Digital Numbers
We will talk about how optical signals are first converted to electrical signals in the form of charges, and then talk about how the charges are detected, at which point the electrical signals are manifested as voltage potentials. The voltage potentials are then quantized as digital numbers, which are the raw pixel values. We will focus on the basic building blocks that enable these conversions and leave it to Section 12.3 to discuss how these building blocks are connected in a global sensor architecture. The discussion here assumes monochromatic sensing without noise. We will talk about color sensing and the noise issue later.
12.2.1 Photons to Charges
What turns optical signals to electrical signals is the light-sensitive photodiode in a pixel. A photodiode is a p-n junction made of silicon, a semiconductor material. When a photon hits silicon and is absorbed, an electron from the silicon might be freed/emitted, transforming optical signals to electrical signals. This is called the photoelectric effect (Einstein 1905b, 1905a), the discovery of which won Albert Einstein his Nobel Prize.
In particular, when a photon is absorbed, if its energy is greater than or equal to the work function \(\phi\) of the material, which is the minimum energy needed to free an electron from the surface of the material, the photon can transfer its energy to an electron and free the electron. A photon’s energy is given by Planck’s relation:
\[ \begin{align} \mathcal{E} = h f = \frac{hc}{\lambda}, \end{align} \tag{12.1}\]
where \(h\) is the Planck constant, \(f\) is the photon frequency, \(c\) is the speed of light, and \(\lambda\) is the photon wavelength. So if \(h f \geq \phi\), an absorbed photon can free an electron. Interestingly, the residual energy \(hf - \phi\) becomes the kinetic energy of the electron, so a photon with a shorter wavelength (i.e., higher frequency) would allow the emitted electron to move faster.
It is clear that there is a frequency threshold \(\phi/h\), lower than which a photon would never be able to free an electron. Higher than the threshold, there is generally a one-to-one mapping between an absorbed photon and an emitted electron: an absorbed photon always frees an electron. In a silicon photodiode, electrons are freed inside the material rather than ejected from its surface, so the relevant energy threshold is the bandgap of silicon, which is about 1.1 eV (electron volt); absorption of photons with wavelengths longer than about 1,100 nm would therefore not free any electron.
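As a quick numeric check of Equation 12.1 and the threshold above, the following minimal Python sketch converts wavelengths to photon energies and computes the cutoff wavelength for a 1.1 eV threshold. The function names are illustrative choices, not part of any library; the constants are the standard SI values.

```python
# Physical constants (exact SI values).
H = 6.62607015e-34     # Planck constant, J*s
C = 2.99792458e8       # speed of light, m/s
EV = 1.602176634e-19   # joules per electron volt

def photon_energy_ev(wavelength_nm):
    """Photon energy E = h*c/lambda (Equation 12.1), expressed in eV."""
    return H * C / (wavelength_nm * 1e-9) / EV

def cutoff_wavelength_nm(threshold_ev):
    """Longest wavelength whose photon energy still reaches the threshold."""
    return H * C / (threshold_ev * EV) * 1e9

for wl in (450, 550, 650, 1200):
    print(f"{wl} nm -> {photon_energy_ev(wl):.2f} eV")
print(f"cutoff for a 1.1 eV threshold: {cutoff_wavelength_nm(1.1):.0f} nm")
```

The computed cutoff is slightly above 1,100 nm, consistent with the rounded figure quoted in the text.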
Quantum Efficiency
A key figure of merit in image sensing is the notion of quantum efficiency (QE), which is the ratio between the number of electrons collected and the number of incident photons:
\[ \begin{align} QE = \frac{\#\text{~of electrons collected}}{\#\text{~of incident photons}}. \end{align} \tag{12.2}\]
Figure 12.3 (a) shows the QE spectrum of an image sensor in the Hubble Space Telescope. It might come as a surprise that QE is lower than 1 (even for wavelengths well within the 1,100 nm threshold) and is actually wavelength dependent: shouldn’t every absorbed photon (within the wavelength threshold) always free an electron? There are two reasons.
First, the denominator in the QE definition is the number of incident photons, not the number of absorbed photons. Not all photons that hit the photodetector will be absorbed. Figure 12.3 (b) shows the spectral absorption coefficient \(\sigma\) (unit 1/cm) of silicon on the left \(y\)-axis, and the right \(y\)-axis shows the corresponding mean free path \(l\) (i.e., the expected length a photon can travel within silicon before being absorbed) at different wavelengths; recall that \(l = 1/\sigma\). We can see that absorption is strongest for bluish light but decays very rapidly toward longer wavelengths. This definition of QE is different from how QE is defined in human vision (Section 3.1): there, QE is the probability of pigment excitation once the pigment actually absorbs a photon, and the QE of photopigment is roughly two-thirds and is not wavelength-sensitive.
Second, the numerator in the QE definition is the number of collected, not emitted, electrons: even if an electron is freed by an absorbed photon, that electron might not actually be collected and contribute to the electrical signal. Depending on where the electrons are freed, some of them need to go through a random walk (think of it as Brownian motion) before being collected, and you can imagine some electrons can be recombined with the holes during the walk.
Given QE, the total number of collected electrons after an exposure time \(T\) is given by:
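To make the first reason concrete, the sketch below assumes simple exponential attenuation, so that the fraction of photons absorbed within a silicon layer of thickness \(d\) is \(1 - e^{-\sigma d}\) (the same model under which the mean free path is \(l = 1/\sigma\)). The absorption coefficients and the 3 µm layer thickness are placeholder values chosen only for illustration; they are not read from Figure 12.3 (b).

```python
import math

def absorbed_fraction(sigma_per_cm, depth_cm):
    """Fraction of photons absorbed within depth_cm, assuming exponential attenuation."""
    return 1.0 - math.exp(-sigma_per_cm * depth_cm)

# Placeholder absorption coefficients (1/cm); order-of-magnitude illustration only.
ASSUMED_SIGMA = {
    "blue (450 nm)": 2.5e4,
    "red (650 nm)": 3.0e3,
    "near-IR (900 nm)": 3.0e2,
}
DEPTH_CM = 3e-4  # assumed 3 micrometer thick absorbing layer

for label, sigma in ASSUMED_SIGMA.items():
    mean_free_path_um = 1.0 / sigma * 1e4  # l = 1/sigma, converted from cm to micrometers
    frac = absorbed_fraction(sigma, DEPTH_CM)
    print(f"{label}: mean free path {mean_free_path_um:.2f} um, absorbed {frac:.1%}")
```

Under these assumptions, most short-wavelength photons are absorbed within the layer while most long-wavelength photons pass through it, which is one way the wavelength dependence of QE arises.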
\[ \begin{align} N = \int_\lambda QE(\lambda) Y(\lambda) \text{d}\lambda, \end{align} \tag{12.3}\]
where \(Y(\lambda)\) is the number of photons incident on a photodiode at a particular wavelength \(\lambda\) during the exposure time \(T\) (assuming \(Y\) is invariant during \(T\) here). According to Planck’s relation (Equation 12.1), \(Y(\lambda)\) is related to the spectral power distribution (SPD) of the incident light \(\Phi(\lambda)\) by: \(Y(\lambda) = \frac{\Phi(\lambda) T \lambda}{hc}\), where \(\Phi(\lambda) T\) is the spectral energy distribution. Therefore, we have:
\[ \begin{align} N = \int_\lambda QE(\lambda) \frac{\Phi(\lambda) T \lambda}{hc} \text{d}\lambda. \end{align} \tag{12.4}\]
Note that we define QE for the photodiode itself: the denominator in Equation 12.2 refers to the number of photons incident on the photodiode, not those that enter the camera system. This is an important distinction, because many photons that enter the camera would not even make their way to the photodiode; some of them are reflected at the lens surfaces, and others are absorbed by the various filters (Section 12.4). In many contexts, the QE is reported with respect to the entire camera system, where the denominator is the number of photons entering the camera. Always ask what the precise definition of a QE is when reading the literature.
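Equation 12.4 can be evaluated numerically once QE and the SPD are sampled at discrete wavelengths. The sketch below uses a trapezoidal approximation; the QE curve, the flat SPD (in W/nm incident on the photodiode), and the exposure time are all hypothetical inputs made up for illustration.

```python
H = 6.62607015e-34   # Planck constant, J*s
C = 2.99792458e8     # speed of light, m/s

def collected_electrons(wavelengths_nm, qe, spd_w_per_nm, exposure_s):
    """Trapezoidal approximation of Equation 12.4."""
    # Integrand QE(lambda) * Phi(lambda) * T * lambda / (h*c), in electrons per nm.
    integrand = [
        q * phi * exposure_s * (wl * 1e-9) / (H * C)
        for wl, q, phi in zip(wavelengths_nm, qe, spd_w_per_nm)
    ]
    total = 0.0
    for i in range(1, len(wavelengths_nm)):
        step_nm = wavelengths_nm[i] - wavelengths_nm[i - 1]
        total += 0.5 * (integrand[i] + integrand[i - 1]) * step_nm
    return total

# Hypothetical inputs: a QE curve peaking at 550 nm and a flat SPD.
wavelengths = list(range(400, 701, 10))                         # nm
qe_curve = [0.6 - 0.001 * abs(wl - 550) for wl in wavelengths]  # unitless
flat_spd = [1e-15] * len(wavelengths)                           # W/nm

n = collected_electrons(wavelengths, qe_curve, flat_spd, exposure_s=0.01)
print(f"collected electrons: {n:.0f}")
```

Because the integrand is linear in both \(\Phi(\lambda)\) and \(T\), doubling either the incident power or the exposure time doubles the electron count in this noise-free model.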
12.2.2 Measuring Charges
Basic Principle
Now that we have turned photons to charges — the freed electrons move to the n region and the holes move to the p region of the p-n junction — the next step is to measure the charges. The basic principle of doing so is using a capacitor: we use the electrons to discharge a capacitor with a known capacitance; by measuring the voltage difference before and after the discharge, we can then estimate the number of emitted electrons:
\[ \begin{align} \Delta V = \frac{\mathcal{Q}_{sig}}{C_{FD}} \times g = \frac{N q}{C_{FD}} \times g, \end{align} \tag{12.5}\]
where \(\mathcal{Q}_{sig}\) is the charge in the signal used to discharge the capacitor, which is usually a floating diffusion (see later) that has a capacitance of \(C_{FD}\), and \(g\) is the voltage gain of whatever device is used to read out the voltage, usually a source follower (see later). \(\mathcal{Q}_{sig}\) itself is the product of \(N\), the number of charges in the signal, and \(q\), the elementary charge.
\(\frac{q}{C_{FD}}\) is also called the conversion gain (CG) of the pixel. CG has a unit of \(\text{Volt}/\text{e}^-\) and can be interpreted as the amount of voltage change per charge. CG is a very important quantity. A high CG means the output voltage is very sensitive to small changes in the input light, which is good for improving the signal-to-noise ratio (SNR). In contrast, a low CG means the output voltage change is small given the same amount of light change, and that small voltage change becomes very difficult to detect in the presence of noise, resulting in a low SNR. While desirable from a noise perspective, a high CG necessarily means a smaller capacitor, which is easier to fill up (saturate). We will get back to this point when discussing dynamic range (Section 12.2.4).
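The trade-off between conversion gain and saturation can be seen with a few lines of arithmetic based on Equation 12.5. In the sketch below, the capacitance values and the 1 V usable voltage swing are assumptions picked for illustration, and the function names are hypothetical.

```python
Q_E = 1.602176634e-19  # elementary charge, C

def conversion_gain_uv_per_e(c_fd_farads):
    """Conversion gain q / C_FD, in microvolts per electron."""
    return Q_E / c_fd_farads * 1e6

def delta_v(n_electrons, c_fd_farads, sf_gain=0.9):
    """Voltage difference of Equation 12.5 for N electrons."""
    return n_electrons * Q_E / c_fd_farads * sf_gain

def max_electrons(c_fd_farads, v_swing=1.0):
    """Electrons that fit before the node uses up an assumed voltage swing."""
    return v_swing * c_fd_farads / Q_E

for c_fd in (1e-15, 5e-15):  # assumed capacitances: 1 fF and 5 fF
    print(
        f"C_FD = {c_fd * 1e15:.0f} fF: "
        f"CG = {conversion_gain_uv_per_e(c_fd):.1f} uV/e-, "
        f"dV(1000 e-) = {delta_v(1000, c_fd) * 1e3:.1f} mV, "
        f"capacity = {max_electrons(c_fd):.0f} e-"
    )
```

The smaller capacitor yields a five times larger voltage per electron but holds five times fewer electrons for the same voltage swing.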
We can see that once we can measure \(\Delta V\), we can get an estimate of \(N\). Why do we care about \(N\)? Intuitively, the incident light luminance is positively related to \(N\): more incident photons means higher luminance. Luminance \(L\), if we are interested in only grayscale, monochromatic imaging, is ultimately what we want to estimate.
It is important to realize that the actual relationship between \(L\) and \(N\) is not linear. We know that luminance is defined as:
\[ \begin{align} L = \int_\lambda V(\lambda) \Phi(\lambda) \text{d}\lambda, \end{align} \tag{12.6}\]
where \(V(\lambda)\) is the luminance efficiency function (LEF) and \(\Phi(\lambda)\) is the SPD of the incident light. Taking Equation 12.6 and Equation 12.4 together, we can see that given \(N\), we cannot quite estimate \(L\), because \(L\) depends on \(\Phi(\lambda)\), but estimating \(\Phi(\lambda)\) from \(N\) is an under-determined problem, as Equation 12.4 shows. To be exact, \(L\) does not necessarily scale linearly with \(N\) — it does not even necessarily scale positively with \(N\), but it is perhaps not terribly wrong to informally say a higher charge count means a higher luminance in the scene. We will return to this problem when we discuss color sensing, too.
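A two-wavelength toy example shows why \(N\) does not determine \(L\). The sketch below replaces the integrals in Equation 12.4 and Equation 12.6 with sums over two narrow bands; the QE values are hypothetical, the \(V(\lambda)\) values are approximate photopic values, and luminance is computed only up to a constant factor (for a fixed exposure time).

```python
H = 6.62607015e-34   # Planck constant, J*s
C = 2.99792458e8     # speed of light, m/s

# Per-band parameters: hypothetical QE and approximate V(lambda).
BANDS = {
    450: {"qe": 0.5, "v": 0.038},
    550: {"qe": 0.5, "v": 0.995},
}

def electrons(energy_j_by_nm):
    """Discrete version of Equation 12.4: sum of QE * energy * lambda / (h*c)."""
    return sum(
        BANDS[wl]["qe"] * energy * (wl * 1e-9) / (H * C)
        for wl, energy in energy_j_by_nm.items()
    )

def relative_luminance(energy_j_by_nm):
    """Discrete version of Equation 12.6, up to a constant factor."""
    return sum(BANDS[wl]["v"] * energy for wl, energy in energy_j_by_nm.items())

blue_light = {450: 2e-15}    # joules delivered during the exposure (assumed)
green_light = {550: 1e-15}

for name, light in (("blue", blue_light), ("green", green_light)):
    print(f"{name}: N = {electrons(light):.0f}, relative L = {relative_luminance(light):.2e}")
```

In this example the blue light produces more electrons than the green light yet has a much lower luminance, so a higher charge count does not always mean a higher luminance.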
4T Design
The photodiode (PD) technically acts as a capacitor itself (the n-side neutral region holds electrons and the p-side neutral region holds holes), so we could simply use the PD for that purpose. This is indeed how an earlier pixel design works, which we will return to shortly. Modern pixels actually transfer the charges from the PD to a separate measurement node, which we focus on here.
Figure 12.4 (a) shows the circuit diagram of a typical pixel design that detects and measures the charges. The design has a PD and four transistors, so it is usually called the 4T design. The M-TX switch controls the transfer of the charges accumulated in the PD to the Floating Diffusion (FD), another capacitive area that is sometimes called the measurement node, the sense node, or the conversion node, because that is where the charges are actually measured. The FD is connected to the NMOS Source Follower (SF) transistor M-SF, whose gate terminal is its input and is connected to the FD voltage, whose drain is connected to the supply voltage, and whose source is the output that faithfully follows/transfers the input with a gain of about 0.9 (\(g\) in Equation 12.5).
The sequence of operation goes roughly like the following, and Figure 12.4 (b) shows the corresponding timing diagram:
- Before the exposure, we turn on the M-RST switch and the M-TX switch to drain the charges (electrons) at the PD, which will also, as a byproduct, drain the charges in the FD, resetting both of their voltage potentials to \(V_{rst}\). Resetting the FD voltage at this step is of no functional use, as we will shortly see.
- We then turn off M-RST and M-TX, and the exposure begins, during which the charges are collected inside the PD. We can see from Equation 12.5 that in order to measure the charges we need to measure the voltage difference at the FD node before and after the charges are transferred. So toward the end of the exposure, we turn on the M-RST switch again while, importantly, keeping the M-TX switch off. This allows us to reset the FD voltage to \(V_{rst}\), which will be measured through M-SF as \(V_1\) in Figure 12.4 (b).
- We then turn on the M-TX switch, which transfers the charges from the PD to the FD. After that, we turn off M-TX and read the voltage from M-SF for the second time, this time for the voltage at FD after the charge transfer. This is the \(V_2\) in Figure 12.4 (b). The difference between \(V_1\) and \(V_2\) is the \(\Delta V\) in Equation 12.5.
As we can see, we read the voltage of the FD twice to obtain the voltage difference caused by the charges collected during the exposure. This is called Correlated Double Sampling (CDS), which turns out to also be very important to mitigate many noise sources, which we will discuss later.
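The read-out sequence and CDS can be mimicked with a small noise-free toy model. The sketch below is not a circuit simulation: the class, its parameter values (FD capacitance, reset voltage, a fixed read-out offset, and the amount of FD leakage) are all assumptions introduced for illustration. It shows that subtracting the two samples removes both the FD leakage accumulated during exposure and any fixed offset.

```python
Q_E = 1.602176634e-19  # elementary charge, C

class Pixel4T:
    """Noise-free toy model of the PD, FD, and source follower in a 4T pixel."""

    def __init__(self, c_fd=2e-15, v_rst=2.8, sf_gain=0.9, sf_offset=0.05):
        self.c_fd = c_fd            # assumed FD capacitance, F
        self.v_rst = v_rst          # assumed reset voltage, V
        self.sf_gain = sf_gain      # source-follower gain (g in Equation 12.5)
        self.sf_offset = sf_offset  # assumed fixed read-out offset, V
        self.pd_electrons = 0.0
        self.v_fd = 0.0

    def reset(self, tx_on):
        """M-RST on: FD goes to V_rst; with M-TX also on, the PD is drained."""
        self.v_fd = self.v_rst
        if tx_on:
            self.pd_electrons = 0.0

    def expose(self, electrons, fd_leak_v):
        """PD collects electrons while the FD voltage leaks away from V_rst."""
        self.pd_electrons += electrons
        self.v_fd -= fd_leak_v

    def transfer(self):
        """M-TX on: move the PD charge to the FD, lowering the FD voltage."""
        self.v_fd -= self.pd_electrons * Q_E / self.c_fd
        self.pd_electrons = 0.0

    def read(self):
        """Voltage seen at the output through M-SF (M-SEL assumed on)."""
        return self.sf_gain * self.v_fd + self.sf_offset


def capture(pixel, electrons, fd_leak_v=0.1):
    """Run the three steps above and return the CDS voltage difference."""
    pixel.reset(tx_on=True)             # step 1: drain PD (and FD)
    pixel.expose(electrons, fd_leak_v)  # exposure; FD leaks meanwhile
    pixel.reset(tx_on=False)            # step 2: second reset, FD only
    v1 = pixel.read()                   # first sample (reset level)
    pixel.transfer()                    # step 3: PD -> FD
    v2 = pixel.read()                   # second sample (signal level)
    return v1 - v2


pixel = Pixel4T()
dv = capture(pixel, electrons=5000)
n_estimate = dv * pixel.c_fd / (Q_E * pixel.sf_gain)
print(f"delta V = {dv * 1e3:.1f} mV, estimated electrons = {n_estimate:.0f}")
```

Changing `sf_offset` or `fd_leak_v` leaves the result unchanged in this model, which is the point of sampling twice.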
To read out the voltage from the SF, the M-SEL switch needs to be turned on; this switch is omitted from Figure 12.4 (b) for simplicity. As we will shortly see in Section 12.3, in most cases (although not all), pixels are read out row by row, so the M-SEL switches of all pixels in the same row are connected to the same signal, usually called the row select signal.
The timing diagram in Figure 12.4 (b) is illustrative of the major operations (omitting M-SF) but not drawn to scale. The exposure time is usually at the tens of milliseconds scale (e.g., at 30 FPS the exposure time can be at most roughly 33.3 ms), but the timescale to operate the transistors/switches is at the microsecond level. Also observe, in Figure 12.4 (b), that during the exposure the voltage at the FD (\(V_{FD}\)) slowly drops from \(V_{rst}\) after the first reset because of charge leakage in the FD, just like how DRAM cells leak. This is why we need the second reset to bring the voltage at FD back to \(V_{rst}\) before charge transfer. This is also why we say the first reset is of no functional use to the FD (but of course very important to the PD because we want the PD to collect only electrons emitted from the current exposure).
4T APS vs. 3T APS vs. PPS
The (4T) pixel design described above is called an Active Pixel Sensor (APS) design, first conceived by Noble (1968) (see Fossum (1993) for a more modern perspective). An APS has a per-pixel SF (a common-drain amplifier) that “actively” reads out the signal for each pixel by turning its charges to voltage. We briefly discuss the other, older pixel designs that are less commonly used now. See El Gamal and Eltoukhy (2005) for a more detailed discussion and visual comparisons.
A simpler and earlier version of the APS design uses only three transistors (3T), without the transfer gate. Figure 12.5 (left) compares the 4T APS with the 3T APS. Without the transfer gate, the PD itself is used as the measurement/sense node, so the \(C_{FD}\) in Equation 12.5 is effectively the capacitance of the PD. The 3T APS simplifies the pixel design and, thus, increases the fill factor (without the microlenses). It, however, generally suffers from a lower signal-to-noise ratio (SNR) for a variety of reasons. For instance, the PD has a large inherent capacitance, so the signal (\(\Delta V\) in Equation 12.5) read from the PD is low, making it more vulnerable to noise. In contrast, we get to control the FD in the 4T APS, which can be made to have a much lower capacitance, leading to a higher SNR. The CDS for 3T APS is also much less effective in suppressing noise, as we will discuss later.
A precursor to APS was the Passive Pixel Sensor (PPS), first suggested in Weckler (1967) and Dyck and Weckler (1968). A PPS has only one transistor, as shown in the top-right panel in Figure 12.5. The PPS has no SF that reads out voltage from the PD charges. Instead, the charges (not voltage) in the PD “passively” flow through a column bus and are turned to voltage there through a charge amplifier (Aoki et al. 1982). The PPS design is simpler (as only one transistor is needed) but leads to a much worse noise profile because of the large (parasitic) capacitance of the column bus. The SF in APS acts as an active amplifier, which isolates the sense node (whether it is the PD or the FD) from the large column bus capacitance, providing a much higher output current and lower output impedance than a PD does and, thus, improving the SNR (Kozlowski et al. 1998; Ohta 2020, Chpt. 2.5).
Electronic Shutter
Ideally, when we are not capturing light, the photodiodes should not be exposed to light. This is achieved by a shutter. A mechanical shutter does so by physically blocking light: the shutter normally covers the sensor and then mechanically opens to expose the sensor to light. There are many types of mechanical shutters, of which the most popular one is the focal plane shutter shown in Figure 12.6 (a). The shutter has two curtains that move in sync with a gap that lets light in. The size of the shutter opening and the speed of the movement dictate the exposure time: a larger opening and slower speed mean longer exposure time. This is called a focal plane shutter because the shutter is located right in front of the focal plane (sensor). There is also the leaf shutter, which is usually located at the aperture plane with the lenses.
The 4T pixel design above essentially implements an electronic shutter (ES). With an ES, we expose photodiodes to light all the time. The way we mark the start of the exposure is through the M-RST switch, which resets the PDs, and the way we mark the end of the exposure is through the M-TX switch, which transfers the PD charges for measurement. The time difference between these two steps dictates the exposure time. As you can imagine, the shutter speed (inverse of the exposure time) of an electronic shutter can be much faster than that of a mechanical shutter, since there are no mechanical moving parts.
12.2.3 Read-out Circuitry
Following the pixel circuitry is the read-out circuitry, which usually has two main components: the programmable-gain amplifier and the analog-to-digital converter (ADC). Figure 12.7 illustrates the common, simplified designs of the two components.
The amplifier is there to amplify the voltage read from the pixel, and the gain of the amplifier is programmable. A programmable gain is useful in imaging and photography to artificially shorten or extend the exposure time (e.g., through the ISO setting in a digital camera). The particular design shown in Figure 12.7 (a) combines CDS with a classical amplifier design with two capacitors. Specifically, the two voltages read out from the FD (one right after the reset and the other right after the charge transfer) are sampled by the \(C_{in}\) capacitor sequentially, which essentially performs an analog-domain subtraction that is required by CDS. The voltage difference is then amplified with a gain \(\frac{C_{in}}{C_{feedback}}\). \(C_{feedback}\) is usually programmable, allowing us to control the gain.
The amplified voltage difference then goes through an ADC to obtain the digital value. There are a huge number of ADC designs (Murmann 2014). The design that is commonly used in image sensors is the single-slope (SS) design, whose simplified diagram is shown in Figure 12.7 (b). An SS ADC consists of a comparator, a ramp signal generator, and a counter. The ramp generator provides a monotonically increasing or decreasing ramp signal, which is compared with the to-be-quantized analog signal (output of the amplifier). At every clock cycle, the comparator compares the two inputs while the counter increments. When the two input signals cross, the counter value is recorded and represents the quantized digital value of the analog signal.
The designs in Figure 12.7 perform CDS in the analog domain (through \(C_{in}\)). In many image sensors today, the CDS is performed in the digital domain after the ADC (Nitta et al. 2006). You would think that such a design might require twice the ADC overhead plus the additional digital subtraction overhead. In reality, the design is quite clever. The ADC would first quantize the first sample (taken right after the FD reset), and the resulting counter value represents the digital value of the first sample. For the second sample, instead of counting from scratch, we would simply turn the counter around so that it counts backward. At the end, the counter value is naturally the digital difference of the two samples.
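The single-slope conversion described above can be sketched in a few lines. This is a behavioral model under simplifying assumptions (an ideal, increasing ramp from 0 to a hypothetical reference voltage `v_ref`, one ramp step per clock, no noise), not a description of any particular sensor’s ADC.

```python
def single_slope_adc(v_in, v_ref=1.0, bits=10):
    """Count clock cycles until an increasing ramp passes v_in."""
    max_code = 2 ** bits - 1
    lsb = v_ref / 2 ** bits  # ramp increment per clock cycle
    count = 0
    # The ramp level after the next clock is (count + 1) * lsb; keep counting
    # while that level has not yet passed the input (comparator not flipped).
    while count < max_code and (count + 1) * lsb <= v_in:
        count += 1
    return count

for v in (0.0, 0.25, 0.5, 2.0):
    print(f"{v:.2f} V -> code {single_slope_adc(v)}")
```

Note that the conversion time grows with the number of codes (up to \(2^{\text{bits}}\) clock cycles per sample), which is one reason a simple design like this is typically replicated per column rather than shared by the whole array.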
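The up/down counting trick can be illustrated with the same kind of behavioral ramp model. The sketch below assumes an ideal increasing ramp and a hypothetical reference voltage; the function names are illustrative. The counter runs forward while the first sample is converted and backward while the second is converted, so the final counter value is the difference without a separate subtraction.

```python
def ramp_phase(counter, v_in, direction, lsb, max_steps):
    """Run one ramp; move the counter by `direction` on every clock until the ramp passes v_in."""
    steps = 0
    while steps < max_steps and (steps + 1) * lsb <= v_in:
        steps += 1
        counter += direction
    return counter

def digital_cds(v_first, v_second, v_ref=1.0, bits=10):
    """Digital CDS with a single up/down counter shared by both conversions."""
    max_steps = 2 ** bits - 1
    lsb = v_ref / 2 ** bits
    counter = 0
    counter = ramp_phase(counter, v_first, +1, lsb, max_steps)   # first sample: count up
    counter = ramp_phase(counter, v_second, -1, lsb, max_steps)  # second sample: count down
    return counter

print(digital_cds(0.75, 0.25))
```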
12.2.4 Dynamic Range
We can intuitively think of each pixel as a well (a pixel well) that collects electrons. Equation 12.4 indicates that there are two main factors that determine the number of electrons going into a particular pixel well: the incident light power and the exposure time. A pixel cannot indefinitely collect electrons. The full-well capacity (FWC) is the maximum number of electrons that can be held by a pixel’s photodiode. More electrons than the FWC would saturate the well, at which point no additional charges will be stored by the pixel. When a pixel well is saturated, photographers call that pixel “over-exposed”. This is illustrated in Figure 12.8, where, ordinarily, the number of charges collected is proportional to the incident light luminance until the pixel well is full.
A larger FWC leads to a higher sensor dynamic range, which, informally, refers to the range of scene luminance that a sensor can capture. Formally, the dynamic range is defined as the ratio between the highest and the lowest luminance level that can be faithfully captured. The highest level is the FWC, but what about the lowest level? Wouldn’t that simply be 0 and, if so, wouldn’t the dynamic range of any image sensor be infinity?
The answer is that at very low light levels the charges collected by a pixel are dominated by noise. We call the charges collected when there is no incident light the “noise floor”, which can be measured by taking an image when the camera is in the dark. The dynamic range is thus the ratio between the FWC and the noise floor (Nakamura 2006, Chpt. 3.4.2.1):
\[ \begin{align} \text{DR} = \frac{\text{FWC}}{\text{Noise Floor}} \end{align} \]
We discuss noise in detail in Chapter 13 and will not get into it too much here, but briefly, the noise floor is dominated by “dark noise”, which is caused by the thermally dislodged electrons, and the “read noise”, which is the noise introduced by the read-out circuitry.
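Dynamic range is often quoted in decibels or in photographic stops rather than as a raw ratio; those conversions are common conventions rather than something defined in this chapter. The sketch below applies them to a hypothetical pixel with an FWC of 10,000 electrons and a noise floor of 2 electrons (both numbers are made up).

```python
import math

def dynamic_range(fwc_electrons, noise_floor_electrons):
    """Return the dynamic range as a ratio, in decibels, and in stops."""
    ratio = fwc_electrons / noise_floor_electrons
    return ratio, 20.0 * math.log10(ratio), math.log2(ratio)

ratio, db, stops = dynamic_range(10_000, 2)
print(f"DR = {ratio:.0f}:1 = {db:.1f} dB = {stops:.1f} stops")
```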
Not only can saturation occur at a PD’s well, it can also occur when transferring the charges from the PD to the FD during read-out. As we have briefly alluded to when discussing the conversion gain (CG) in Section 12.2.2, when the CG is high, the SNR is high but we need to use a small FD, whose capacity could sometimes be smaller than that of the PD, in which case the charge transfer might saturate the FD. Alternatively, a large FD will not saturate (during charge transfer) but will lead to a low CG and, thus, lower SNR.
A technique that many image sensors use is called dual conversion gain (DCG), where a pixel’s charges can be read out twice, once with a high conversion gain (HCG) and the second time with a low conversion gain (LCG) (Solhusvik et al. 2019; Willassen et al. 2015; Huggett et al. 2009; Miyauchi et al. 2020; Takayanagi et al. 2018). To support the LCG read-out, we need one (or sometimes multiple) extra capacitive node, e.g., an additional FD (let’s call that \(FD_2\)), that is connected in parallel with the original FD (let’s call that \(FD_1\)) so as to increase the effective \(C_{FD}\) in Equation 12.5.
- In the first HCG read-out, we use only \(FD_1\) but not \(FD_2\). This reading has a high CG and high SNR, which is especially important for dark parts of the scene. For the bright areas, however, \(FD_1\) saturates and the readings are useless. Importantly, the left-over charges are not discarded; they stay in the PD.
- Then in the subsequent LCG read-out, the extra \(FD_2\) is switched in. All the charges, including the left-over ones in the PD and the charges in \(FD_1\), are re-distributed across \(FD_1\) and \(FD_2\), which collectively will not saturate, so highlights are captured at the cost of a lower SNR.
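The two read-outs above can be sketched as a noise-free model. Everything numeric here is an assumption made for illustration (the two capacitances, the 1 V usable swing, the 95% saturation threshold), and the rule for picking between the two readings is one simple possibility rather than how any specific sensor merges them.

```python
Q_E = 1.602176634e-19  # elementary charge, C

def dcg_readout(n_electrons, c_fd1=1e-15, c_fd2=4e-15, v_swing=1.0, sf_gain=0.9):
    """Return (mode, estimated electrons) for a dual-conversion-gain read-out."""
    c_lcg = c_fd1 + c_fd2               # FD_1 and FD_2 in parallel
    cap_hcg = v_swing * c_fd1 / Q_E     # electrons FD_1 alone can hold
    cap_lcg = v_swing * c_lcg / Q_E     # electrons FD_1 + FD_2 can hold

    # HCG read-out: only FD_1 is used; extra charges stay in the PD.
    dv_hcg = min(n_electrons, cap_hcg) * Q_E / c_fd1 * sf_gain
    # LCG read-out: all charges are redistributed over FD_1 and FD_2.
    dv_lcg = min(n_electrons, cap_lcg) * Q_E / c_lcg * sf_gain

    # Assumed selection rule: trust HCG unless it is close to full swing.
    if dv_hcg < 0.95 * v_swing * sf_gain:
        return "HCG", dv_hcg * c_fd1 / (Q_E * sf_gain)
    return "LCG", dv_lcg * c_lcg / (Q_E * sf_gain)

for n in (1_000, 20_000, 100_000):
    mode, estimate = dcg_readout(n)
    print(f"true {n} e- -> {mode} read-out, estimate {estimate:.0f} e-")
```

With these assumed values, the dark pixel is read with HCG, the bright pixel falls back to LCG, and the brightest one saturates even the combined node.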
High Dynamic Range Imaging
The goal of high-dynamic-range (HDR) imaging is to design imaging systems such that the scene luminance can be faithfully reconstructed from pixel values. Two things are in the way: noise at low-luminance regions in the scene and saturation at high-luminance regions in the scene. A common strategy for HDR is called exposure bracketing, which can be implemented in two ways, both involving taking multiple shots of the scene and then fusing them.
- Each shot has the same, short exposure time so no pixel is over-exposed, but pixels for low-luminance regions are noisy. We then average multiple shots; averaging is a form of denoising (Chapter 13). This is the approach that Google’s HDR+ system takes (Hasinoff et al. 2016).
- Each shot has a different exposure time. Long-exposure shots are used to capture details in low-luminance regions, and short-exposure shots capture details in high-luminance regions.
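The second flavor of exposure bracketing can be sketched as follows. This is a deliberately simplified, noise-free fusion rule (divide each unsaturated pixel value by its exposure time and average), not the algorithm of any particular system; the pixel values, exposure times, and the 10-bit full scale are hypothetical.

```python
def shoot(scene_rates, exposure_s, full_scale=1023):
    """Simulate one shot: pixel value grows with exposure time and clips at full scale."""
    return [min(rate * exposure_s, full_scale) for rate in scene_rates]

def fuse_bracketed(shots, full_scale=1023):
    """Fuse (exposure_s, pixel_values) shots into per-pixel signal rates."""
    n_pixels = len(shots[0][1])
    shortest = min(t for t, _ in shots)
    fused = []
    for i in range(n_pixels):
        # Use only shots in which this pixel is not saturated.
        rates = [values[i] / t for t, values in shots if values[i] < full_scale]
        if rates:
            fused.append(sum(rates) / len(rates))
        else:
            # Saturated everywhere: only a lower bound is available.
            fused.append(full_scale / shortest)
    return fused

scene = [100.0, 10_000.0, 200_000.0]  # assumed signal rates, digital numbers per second
shots = [(t, shoot(scene, t)) for t in (0.001, 0.01, 0.1)]
print(fuse_bracketed(shots))
```

A practical implementation would also weight the shots by their noise levels, favoring long exposures for dark pixels; that refinement is omitted here.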
Either way, the issue with exposure bracketing is the longer capturing time, which makes the resulting image more susceptible to motion blur. We ideally would like “single-shot” HDR. There are multiple methods, and they usually require co-designing the image sensor/pixel with the post-processing algorithms (aside from modern deep learning approaches that rely on semantic information, which we will not discuss).
One strategy is to use split pixels or dual PDs, an emerging technology that sensor companies are exploring. The idea is to split a pixel into two PDs, each with a different “sensitivity” to light (Iida et al. 2018; Solhusvik et al. 2019; Willassen et al. 2015; J. Xu et al. 2022). The sensitivity is usually controlled by PD size (and the corresponding microlens size): the larger PD (LPD) can collect more charges at the same light intensity (quantified by photons/area) than the small PD (SPD, not to be confused with the spectral power distribution), simply because of its larger photon collection area, and thus saturates faster. The two groups of PDs are interleaved on the sensor plane, so they each perform a uniform sampling of the scene (preceded by a spatial integration over the pixel area of course).
The way that split pixels extend dynamic range is illustrated in Figure 12.9. The LPD, with a FWC of \(S_3\), saturates at a low luminance level \(L_1\), so only those (large) pixels that image low-luminance regions in the scene do not saturate; as a result, LPDs provide a good sampling of the low-luminance information. In contrast, the SPD, with a lower intrinsic FWC of \(S_2\), saturates at a high luminance level \(L_2\), so SPDs provide a good sampling for high-luminance information in the scene. Note that even though the SPD has a smaller intrinsic FWC than that of the LPD, the SPD’s sensitivity to light is lower still, so the SPD still saturates at a higher intensity level.
If we increase the FWC of the small pixels, they take even longer/higher luminance to saturate. The way we increase the FWC is by adding a lateral overflow integration capacitor (LOFIC), which holds the overflow charges from the PD during exposure (Sugawa et al. 2005; Akahane et al. 2006; Takayanagi et al. 2019; Ikeno et al. 2022). In almost all cases, the FD itself participates in collecting the overflow charges, too. In this way, the FWC of the small pixels, \(S_4\), is effectively the total capacity of the photodiode, the LOFIC, and the FD. This further extends the small pixel’s saturation level to \(L_3\), shown in Figure 12.9.
LOFIC can be used in conjunction with DCG. For instance, in the HCG read-out we would use only the FD as the measurement node, and in the LCG read-out we would use both the FD and LOFIC (Takayanagi et al. 2019). Of course we can also add additional FDs to lower the conversion gain even more (Iida et al. 2018).
We could also combine the split-pixel architecture with DCG (Iida et al. 2018; Solhusvik et al. 2019; Willassen et al. 2015), where usually the large pixels are read out twice with DCG and the small pixels are read out with only LCG; this is because large pixels are meant to sample low-luminance information so they benefit more from HCG. This is shown in Figure 12.9, where \(S_1\) is the capacity of the LPD’s FD node, which saturates at a lower intensity than \(L_1\) and is the measurement node in the HCG read-out. The LCG read-out can read all the charges in the FWC (with the help of an additional FD) at the cost of a lower conversion gain.
Another approach is the time-to-saturation (TTS) technology (Stoppa et al. 2002), which uses a counter to measure the time it takes for each pixel to saturate and uses that time to extrapolate the charge count over the actual exposure time:
\[ \begin{align} Q_{\text{act}} = Q_{\text{sat}} \frac{T_{\text{exp}}}{T_{\text{sat}}}, \end{align} \]
where \(Q_{\text{act}}\) is the actual number of charges a pixel would have collected without saturation, \(Q_{\text{sat}}\) is the FWC, \(T_{\text{exp}}\) is the exposure time, and \(T_{\text{sat}}\) is the saturation time. One could combine TTS with DCG and LOFIC (Ikeno et al. 2022; Liu et al. 2020, 2022).
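The TTS extrapolation is a one-line computation. The sketch below wraps it in a hypothetical helper that also handles pixels that never saturate (in which case the measured charge is used directly); the numeric values are made up.

```python
def tts_estimate(q_sat, t_exp, t_sat=None, q_measured=None):
    """Estimate the charge a pixel would have collected without saturation.

    t_sat is the measured time to saturation, or None if the pixel did not
    saturate during the exposure, in which case q_measured is returned.
    """
    if t_sat is None:
        return q_measured
    return q_sat * t_exp / t_sat

# A pixel with an FWC of 10,000 e- that saturates 2 ms into a 10 ms exposure.
print(tts_estimate(q_sat=10_000, t_exp=0.010, t_sat=0.002))
# A pixel that never saturates keeps its measured value.
print(tts_estimate(q_sat=10_000, t_exp=0.010, q_measured=4_200))
```

The extrapolation assumes the incident light is constant over the exposure, in line with the assumption made for Equation 12.3.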
12.3 Global Architecture
We have discussed the individual building blocks that are needed for a pixel to turn light into digital values, but how are they put together in an actual image sensor supporting tens of millions of pixels? This section talks about the global architecture of an image sensor. We will start with a common architecture followed by other variants.
Figure 12.10: … RST signal (connecting to the M-RST switches) and the SEL signal (connecting to the M-SEL switches) (for simplicity, we omit the per-row TX signal, which connects to all the M-TX switches in the same row); (b): timing diagram operating the image sensor in (a) with a rolling shutter; technically the FD reset should be overlapped with the exposure time but is lumped into the readout box for simplicity. (c): comparison of column-level ADC used in (a) with pixel-level ADC and array/chip-level ADC. (d): timing diagram operating the image sensor in (a) with a global shutter.
12.3.1 Column-Parallel Readout
Figure 12.10 (a) shows a typical arrangement, where pixels are organized as a 2D array, just like a (DRAM/SRAM) memory array, and each column has an amplifier and ADC shared by all the pixels in that column. That is, the Output pins in Figure 12.4 of all the pixels in the same column are connected to the same amplifier and ADC. The read-out circuit is then connected to digital processing circuitry, which could potentially perform simple image-space operations such as downsampling, scaling, rotation, etc. There is also an I/O unit that transfers the pixels to the host processor, usually through the MIPI-CSI interface, and transfers commands/configuration data from the host processor, usually through the I2C interface, which has a much lower bandwidth than MIPI (Kb/s vs. Gb/s).
The pixels in the pixel array are addressed row by row through a row scanner logic, shown on the left of Figure 12.10 (a). Pixels in the same row share three external signals: a reset signal RST, which is connected to all the M-RST transistors in the row, a row-select signal SEL, which is connected to all the M-SEL transistors of the same row, and a transfer signal TX (omitted in the figure) connected to all the M-TX switches in the same row.
The operating sequence of the pixel rows is shown in Figure 12.10 (b); the times are not drawn to scale. Each row of pixels goes through the PD reset, exposure, and readout phases under the control of the three external signals (RST, SEL, and TX). Importantly, the three phases are pipelined across rows. That is, while the first row is being exposed, we can start resetting the PDs for the subsequent rows and preparing them for exposure. For instance, in the concrete example of Figure 12.10 (a), the first row is starting the read-out sequence, the nth row is starting the exposure, while all other rows in-between are currently under exposure. While the exposure times of different rows can overlap, their readout sequences cannot, because pixels in the same column but different rows share the same read-out circuitry.
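To make the pipelining concrete, here is a minimal Python sketch of a rolling-shutter schedule. It assumes (for illustration only) that every row has the same exposure time, that reading out one row takes a fixed time, and that the PD reset duration is negligible; the function name `rolling_shutter_schedule` and the microsecond values are hypothetical.

```python
def rolling_shutter_schedule(n_rows, t_exp_us, t_read_us):
    """Return (row, exposure_start, readout_start, readout_end) per row, in microseconds.

    Rows are staggered by exactly one row readout time, so exposures overlap
    across rows while readouts occupy the shared column circuitry back to back.
    """
    schedule = []
    for row in range(n_rows):
        exposure_start = row * t_read_us           # PD reset ends, exposure begins
        readout_start = exposure_start + t_exp_us  # exposure ends, row is read out
        readout_end = readout_start + t_read_us
        schedule.append((row, exposure_start, readout_start, readout_end))
    return schedule


schedule = rolling_shutter_schedule(n_rows=4, t_exp_us=10_000, t_read_us=10)
for row, exp_start, read_start, read_end in schedule:
    print(f"row {row}: expose from {exp_start} us, read out {read_start}-{read_end} us")
```

Under these assumptions the exposure start of the last row lags that of the first row by `(n_rows - 1) * t_read_us`, which is the time skew behind rolling-shutter artifacts.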
We can see that the way the pixel array is addressed and operated is similar to how a memory array (e.g., SRAM/DRAM) is, where the data in an entire row is accessed at once. However, since the pixel rows are operated strictly sequentially (unless random sampling is needed (Feng et al. 2024)), the row scanner logic does not need a decoder, which supports random accesses that a typical memory array would need. Instead, one can usually use parallel shift registers to generate the three external signals row by row.
12.3.2 Rolling vs. Global Shutter
The timing diagram suggests that pixels in different rows technically have slightly shifted exposure times, inherently using a rolling shutter. The mechanical focal-plane shutter shown in Figure 12.6 (a) is inherently a rolling shutter. Rolling shutters introduce noticeable artifacts; one such example is shown in Figure 12.6 (b), where the photo was taken by a camera traveling in a car driving at about 50 mph. As a result, the fence and gate appear slanted because vertical parts of these objects are taken at different times. Such an artifact is much less visible for more distant objects, such as the cliff (can you reason about why?).
Global shutters address the rolling shutter artifacts by exposing all pixels at the same time. Figure 12.10 (d) shows the timing diagram of a global shutter sensor; compare that with that of the rolling shutter sensor in Figure 12.10 (b). All the PDs are reset at the same time and have the same exposure duration.
The pixels are still read out row by row due to the column-level design of the read-out circuitry. This means the pixel values have to be temporarily held in some form of analog buffer after exposure and before they are read out. One could certainly use the FD for this analog buffer, with the caveat that this prevents the PD from starting a new exposure cycle. This is because starting a new exposure requires resetting the PD, which would also reset the corresponding FD, as shown in Figure 12.4 (a). For that reason, it is common to implement an additional analog buffer inside each pixel. The buffer can be implemented either in the charge domain before the FD (Yasutomi, Itoh, and Kawahito 2011; Sakakibara et al. 2012; Tournier et al. 2018; Y. Kumagai et al. 2018; Yokoyama et al. 2018; Kobayashi et al. 2017) or in the voltage domain after the FD (Kondo et al. 2015; Stark et al. 2018; Miyauchi et al. 2020).
12.3.3 Pixel-Parallel and Chip-Level Readout
We can also arrange the read-out circuitry differently, as illustrated in Figure 12.10 (c). For instance, we could have a per-pixel (gain-controllable) amplifier and ADC and, consequently, a per-pixel digital memory. This essentially allows each pixel to directly output digital values, giving rise to the so-called Digital Pixel Sensor (DPS) design, which was first reported in Fowler, El Gamal, and Yang (1994) and has recently been gaining traction (Liu et al. 2019). The bottom-right panel in Figure 12.5 shows the pixel design diagram of a DPS, where the in-pixel memory can be, for instance, a 6T SRAM cell. In this case, the entire pixel array acts almost like an SRAM array.
DPS increases the pixel design complexity and pixel sizes, which, without microlenses, reduces the fill factor. This can, however, be alleviated with a stacked design, which we will get to in Section 12.3.5. The main advantage of the DPS is that it massively increases the readout bandwidth due to pixel-parallel ADCs, which could shorten the frame latency when using a global shutter (see Figure 12.10 (d)), especially when short exposure time is desirable (e.g., high frame rate or “snap-shot” photography).
Yet another read-out arrangement is to have a single gain-controllable amplifier and ADC for the entire pixel array. This is shown in Figure 12.11 (b). In this case, we not only need logic to scan rows one by one but also to scan columns one by one (e.g., through shift registers). This arrangement is not common (thus omitted in Figure 12.10 (c)) due to its slow read-out speed but is the only option for sensors based on the Charge-coupled Devices (CCD), a design that is different from all the designs we have discussed so far and is our focus next.
12.3.4 CMOS vs. CCD Sensor
All the sensor designs we have covered so far are called Complementary Metal-Oxide-Semiconductor (CMOS) sensors, because they heavily rely on circuitries implemented using the CMOS technologies. The CCD sensor is the other major category of sensor design, first reported in Boyle and Smith (1970). Both CCD and CMOS sensors use silicon to implement the PDs, although the specific implementations can differ (Nakamura 2006, Chpt. 3.1.2). The main difference lies in how the charges generated by the PDs are read out. See Fossum (1993), Fossum (1997), El Gamal and Eltoukhy (2005), and more recently, Fossum, Teranishi, and Theuwissen (2024) for the historical background and comparisons.
A CCD sensor directly reads out charges from pixels by shifting the collected charges row by row. When a row reaches the bottom of the pixel array, we then shift the charges column by column to a single, array-level SF amplifier (and potentially a gain-controllable amplifier and ADC afterwards). This architecture is shown in Figure 12.11 (a). In CMOS sensors, in contrast, the charges are converted to voltages within the pixels, and it is the voltage potentials that are being read out from the pixel array by addressing, rather than shifting across, individual rows. The CMOS architecture is shown in Figure 12.11 (b).
The key to a CCD sensor is the charge-coupled devices themselves. A CCD is a set of connected MOS capacitors that store and transfer, between them, charges (Hu 2009, Chpt. 5), invented by Willard Boyle and George E. Smith (Boyle and Smith 1970)4. In a CCD image sensor, the CCDs are connected to the PDs. After the exposure, all the PDs simultaneously transfer their charges to the corresponding vertical CCDs. The vertical CCDs in the same column then act as a shift register, transferring the charges downward to the horizontal CCD at the bottom of the chip. When a row of charges reaches the horizontal CCDs, the charges are then transferred horizontally (again, in a shift-register fashion) to the SF amplifier, which turns charges to voltage.
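The two-level shift-register read-out described above can be mimicked with a small NumPy simulation. This is a toy sketch under stated assumptions: charge transfer is lossless, the horizontal CCD sits below the bottom row, and the output node is at its right end. The function name `ccd_readout` and the shift directions are illustrative choices, not a description of any specific device.

```python
import numpy as np


def ccd_readout(charges):
    """Return charge packets in the order they reach the single output amplifier."""
    vertical = charges.astype(float).copy()   # vertical CCDs after the PD-to-CCD transfer
    n_rows, n_cols = vertical.shape
    samples = []
    for _ in range(n_rows):
        horizontal = vertical[-1].copy()      # bottom row moves into the horizontal CCD
        vertical[1:] = vertical[:-1].copy()   # all remaining rows shift down by one
        vertical[0] = 0.0
        for _ in range(n_cols):
            samples.append(horizontal[-1])    # packet at the output node is sensed
            horizontal[1:] = horizontal[:-1].copy()  # remaining packets shift right
            horizontal[0] = 0.0
    return np.array(samples)


image = np.arange(12).reshape(3, 4)
stream = ccd_readout(image)
print(stream.reshape(3, 4))
```

With these shift directions, the serial stream is the image read bottom row first and right to left within each row; every packet passes through the same amplifier, which is why this read-out is slow compared with column-parallel designs.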
Given this signal read-out architecture, it is perhaps unsurprising to see that CCD sensors inherently support global shutters: the CCDs used for shifting charges naturally store the charges temporarily during the read-out.
CCDs are fabricated using process technologies that are optimized for charge transfer and that are incompatible with the CMOS technologies. In contrast, the read-out architecture of the CMOS sensors can be fabricated using CMOS technologies. This is a huge advantage because non-imaging logics such as control (e.g., clock generation) and analog/digital processing (e.g., ADC, image processing, computer vision tasks) are also based on CMOS technologies. Such logics, in CCD sensors, need to be implemented on a separate chip that interfaces with the CCD chip, rather than integrated with the pixel array on the same chip in a CMOS image sensor.
As modern CMOS technologies mature and gradually take over the semiconductor industry, CMOS image sensors have become more appealing. The main advantage of the CCD sensors is their high SNRs. CCD sensors do not have active devices during read-out and, thus, avoid/minimize many sources of noise that CMOS sensors are vulnerable to, a point we will return to when discussing noise modeling5. Because of that, while consumer cameras today mostly use CMOS sensors, CCD sensors are still widely used in many scenarios where imaging quality is critical, e.g., scientific imaging. For instance, many telescopes for astrophysics (e.g., Sloan Digital Sky Survey) still use CCD sensors.
12.3.5 Computational and Stacked CMOS Image Sensors
Because the imaging circuitries and the logic processing circuitries both use the CMOS process technologies, a clear trend in CMOS Image Sensor (CIS) design is to move into the sensor computations that are traditionally carried out outside the sensor, which gives rise to the notion of Computational CIS.
CIS Scaling Trends
Figure 12.12 (a) shows the percentage of computational CIS papers in the International Solid-State Circuits Conference (ISSCC) and the International Electron Devices Meeting (IEDM), two premier venues for semiconductor circuits and devices, from 2000 to 2022 with respect to all the CIS papers during the same time range. The trend is clear: increasingly more CIS designs integrate compute capabilities.
A key reason we can integrate processing/computational capabilities into the CIS chip is the advancement in CMOS technologies, which, for instance, has significantly shrunk the feature size, i.e., the smallest physical dimension that can be reliably fabricated on a semiconductor chip, which is proportional to the transistor size. At the same time, however, the PD size itself has not shrunk proportionally, meaning adding CMOS logic to the sensor increases the total chip area minimally in the grand scheme of things.
This is shown in Figure 12.12 (b), where triangle markers show the pixel sizes in CIS designs from all ISSCC papers that appeared between 2000 and 2022, which include leading industry CIS designs at different times. We overlay a trend line regressed from these CIS designs to better illustrate the pixel size scaling trend. As a comparison, the blue line at the bottom represents the standard CMOS technology node scaling laid out by the International Roadmap for Devices and Systems (IRDS) (IRDS 2024). We can see that the gap between the pixel size and the standard CMOS feature size steadily increases. In fact, the pixel size scaling stagnates at around 5 \(\mu m\), which has long been seen as the practical pixel size limit (Fossum 1997). As semiconductor manufacturers keep pulling rabbits out of a hat, the CMOS feature size is still, miraculously, shrinking (TSMC/Samsung are shipping products with a 2 nm process node in 2025), so the gap would still exist, at least for quite a while.
Computational CIS Architectures
The computations inside a CIS could take place in both the analog and the digital domain. Figure 12.13 (b) illustrates one example where analog computing is integrated into a CIS chip before the ADC. Analog operations usually implement primitives for feature extraction (Bong, Choi, Kim, Kang, et al. 2017; Bong, Choi, Kim, Han, et al. 2017), object detection (Young et al. 2019), and DNN inference (Hsu et al. 2020; H. Xu et al. 2021). Figure 12.13 (c) illustrates another example that integrates digital processing, such as ISP (Murakami et al. 2022), image filtering (Kim et al. 2005), and DNN (Bong, Choi, Kim, Han, et al. 2017).
As the processing capabilities become more complex, CIS design has embraced 3D stacking technologies, as is evident from the increasing number of stacked CIS in Figure 12.12. Figure 12.13 (d) illustrates a typical stacked design, where the processing logic is separated from, and stacked with, the pixel array layer. The different layers communicate through the hybrid bond or the micro Through-Silicon Via (\(\mu\)TSV) (Liu et al. 2022; Tsugawa et al. 2017). The processing layer typically integrates digital processors, such as ISP (Kwon et al. 2020), image processing (Hirata et al. 2021; O. Kumagai et al. 2018), and DNN accelerators (Eki et al. 2021; Liu et al. 2022).
Three-layer stacked designs have also been proposed. Sony IMX 400 (Haruta et al. 2017) is a 3-layer design that integrates a pixel layer, a DRAM layer (1 Gbit), and a logic layer with an Image Signal Processor (ISP). The DRAM layer buffers high-rate frames before streaming them out to the host. This enables super slow motion (960 FPS); otherwise, the bandwidth of the MIPI CSI-2 interface limits the capturing rate of the sensor. Meta conceptualizes a three-layer design (Liu et al. 2022) with a pixel array layer, a per-pixel ADC layer, and a digital processing layer that integrates a DNN accelerator, using DPS. Stacking makes it easier to implement DPS: the main disadvantage of DPS is the complexity of the pixel design, but with stacking, the additional pixel processing circuitry (gain amplifier, ADC, etc.) can be “hidden” on a layer separate from the pixel array layer (Liu et al. 2022, 2020).
Challenges of CIS
Moving computation inside a CIS, however, is not without challenges. Most importantly, processing inside the sensor is far less efficient than processing outside the sensor. This is because, while the CIS is implemented using the CMOS technologies, it uses significantly older process nodes than conventional CMOS chips do.
This is shown in Figure 12.12 (b), where the square markers show the process node used in each CIS paper surveyed. As a reference, the IRDS standard CMOS process node scaling line is also shown. At around the year 2000, the CIS process node started lagging behind that of the conventional CMOS node, and the gap is increasing. CIS designs today commonly use 65 nm and older process nodes. This gap is not an artifact of the CIS designs we pick; it is fundamental: there is simply no need to aggressively scale down the process node because the pixel size does not, and cannot, shrink much. In fact, from Figure 12.12 (b) we can see that the slope of CIS process node scaling almost exactly follows that of the pixel size scaling. The reason that pixel size does not shrink much is to ensure light sensitivity: a small pixel reduces the number of photons it can collect, which directly reduces the dynamic range and the SNR6.
Inefficient in-sensor processing can be mitigated through 3D stacking technologies, which allow for heterogeneous integration: the pixel layer and the computing layer(s) can use their respective, optimal process node. Stacking, however, could increase power density, especially when future CIS integrates more processing capabilities. Therefore, harnessing the power of (stacked) computational CIS requires exploring a large design space and is still an active area of research (Ma 2024; Feng et al. 2024; Ma et al. 2023).
12.4 In-Sensor Optics
The on-chip optics serve a few purposes: blocking lights in the IR/UV ranges, boosting photon collection efficiency, anti-aliasing, and filtering for color reproduction.
12.4.1 IR/UV Cut-Off Filters
Many cameras have cut-off filters for infrared (IR) and ultraviolet (UV) lights. Their goals are to remove/block IR or UV lights, as much as possible, from the incident light. These filters are transparent in that they predominantly absorb light while scattering very little light. So their optical behaviors can be adequately captured by their transmittance spectra. Figure 12.14 (left) shows the transmittance spectrum of the cut-off filter on the Nikon D200, where light below 400 nm and above 700 nm is essentially blocked from hitting the sensor.
Most photographic cameras want to remove IR and UV light because the human visual system is not sensitive to it (recall our earlier discussions about the spectra of the cone fundamentals, which drop to 0 beyond roughly the 380 \(\text{nm}\) to 780 \(\text{nm}\) range). So for a camera to accurately reproduce the color of an object as if the object were directly viewed by the human eyes, the sensor’s sensitivity ideally needs to mimic that of the human eyes. Cutting IR and UV light, to which our photoreceptors are not sensitive, is just the first step. We will discuss in detail in Section 12.6 what other mechanisms are in place for accurate color reproduction in image sensors.
Interestingly, thermographic cameras detect optical power in the IR range to estimate object temperature. Any object above absolute zero radiates, and this is called blackbody radiation. Planck’s law governs the electromagnetic power emitted at a particular wavelength at a particular temperature. It turns out that at room temperature (about 300 K), most of the radiation power is in the IR range; very little radiation comes from the visible range. That is why thermal cameras use IR radiation for temperature estimation. Figure 12.14 (right) shows an example of an IR image visualized as a heatmap, a real heatmap.
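The claim about room-temperature radiation can be checked numerically with Planck's law for an ideal blackbody, \(B(\lambda, T) = \frac{2hc^2}{\lambda^5}\frac{1}{e^{hc/(\lambda k_B T)} - 1}\). The following is a minimal sketch, assuming an ideal blackbody at 300 K, the 380 nm to 780 nm range as "visible", and a simple trapezoidal integration; the helper names and wavelength grids are choices made for this illustration.

```python
import numpy as np

H = 6.62607015e-34   # Planck constant, J*s
C = 299792458.0      # speed of light, m/s
K_B = 1.380649e-23   # Boltzmann constant, J/K


def planck_radiance(wavelength, temperature):
    """Blackbody spectral radiance per unit wavelength (W / m^2 / sr / m)."""
    x = H * C / (wavelength * K_B * temperature)
    return 2 * H * C**2 / wavelength**5 / np.expm1(x)


def integrate(y, x):
    """Trapezoidal rule, written out to avoid depending on a NumPy version."""
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))


T = 300.0
all_wl = np.geomspace(0.2e-6, 1e-3, 20_000)   # 0.2 um to 1 mm
vis_wl = np.linspace(380e-9, 780e-9, 2_000)   # visible range

total = integrate(planck_radiance(all_wl, T), all_wl)
visible = integrate(planck_radiance(vis_wl, T), vis_wl)
visible_fraction = visible / total
peak_wl = all_wl[np.argmax(planck_radiance(all_wl, T))]

print(f"peak wavelength: {peak_wl * 1e6:.2f} um")
print(f"visible fraction of radiance: {visible_fraction:.3e}")
```

Under these assumptions the emission peaks at roughly 10 µm, deep in the IR, and the visible fraction is vanishingly small, consistent with the discussion above.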
12.4.2 Microlenses
An important figure of merit of image sensors is the fill factor (FF), which is defined as the ratio of the photosensitive area of a pixel to the actual pixel area. Usually the photosensitive area is much smaller than the pixel area. This is because in addition to the actual photodiode, a pixel contains many other electrical components (capacitors, transistors, and other complex logic gates) that take up the area. This is illustrated in Figure 12.15 (a), where many incident rays will not reach the PD, leading to a low FF. Given a fixed pixel area, a low FF means the pixel collects fewer photons during exposure, which translates to a lower signal-to-noise ratio, so it is almost always desirable to have a higher FF.
One common way to increase the FF that is prevalent in almost all image sensors is through microlenses. This is illustrated in Figure 12.15 (b). Every pixel has a convex lens, which we call a microlens, sitting on top of it. The job of the microlens is to, ideally, direct all the photons hitting the pixel to the photodiode, in which case the FF would effectively be 100%, which contemporary image sensors are very close to.
12.4.3 Anti-Aliasing Filters
Many image sensors also have anti-aliasing (AA) filters, especially photographic sensors. Recall that pixels perform spatial sampling of the optical image, which is continuous, thus introducing aliasing. The classic anti-aliasing method is to pre-filter the continuous signal using a low-pass filter, essentially blurring the signal and reducing its peak frequency. Pharr, Jakob, and Humphreys (2023, Chpt. 8) and Glassner (1995, Unit II) provide great technical discussions of signal sampling and reconstruction, which we will omit here.
In some sense, the photodiodes themselves and the microlenses act as pre-filters already: they inherently perform spatial 2D box convolutions over the continuous signal impinging upon them. Take the photodiode as an example: each photodiode integrates all the incident photons, as we have seen in Section 12.2, and integration is equivalent to convolving/filtering the signal with a 2D box filter.
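A one-dimensional numerical check illustrates why integration over the pixel aperture acts as a box filter. Averaging \(\cos(2\pi f x)\) over an aperture of width \(w\) should give the same sinusoid evaluated at the aperture center, attenuated by \(\text{sinc}(fw) = \sin(\pi f w)/(\pi f w)\), which is the frequency response of a box filter. The sketch below assumes an ideal, uniformly sensitive aperture; the function names, the aperture position, and the test frequencies are illustrative.

```python
import numpy as np


def pixel_average(freq, start, width, n_samples=20_000):
    """Midpoint-rule average of cos(2*pi*freq*x) over [start, start + width]."""
    x = start + (np.arange(n_samples) + 0.5) * width / n_samples
    return np.mean(np.cos(2 * np.pi * freq * x))


def box_filter_prediction(freq, start, width):
    """Signal at the aperture center, scaled by the box filter's sinc response."""
    center = start + width / 2
    # np.sinc(x) is sin(pi*x) / (pi*x)
    return np.sinc(freq * width) * np.cos(2 * np.pi * freq * center)


width = 1.0  # aperture width, in units of the pixel pitch
results = []
for freq in (0.1, 0.5, 0.9):
    measured = pixel_average(freq, start=0.3, width=width)
    predicted = box_filter_prediction(freq, start=0.3, width=width)
    results.append((freq, measured, predicted))
    print(f"f={freq}: measured={measured:.6f}, predicted={predicted:.6f}")
```

The attenuation grows with frequency but only reaches zero at \(f = 1/w\), which is one way to see why the pixel aperture alone is a fairly weak pre-filter.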
However, the support of the filter carried by the microlens and the photodiode is small: the microlens filter has a size of the pixel area, and the photodiode filter support is even more compact. To more aggressively pre-filter the signal, we need a filter with a wide support. To that end, AA filters use birefringent material, as shown in Figure 12.16 (a), which essentially splits a ray into two rays, each with a different polarization and thus taking a slightly different path (recall that the refractive index depends on the polarization of light). If we cascade two such materials, a ray gets split into four rays; this is called a 4-dot beam splitting. This is done by, e.g., the Nikon D800, as shown in Figure 12.16 (b).
The birefringent material acts as a low-pass filter. The intuition is that if an incident ray is spread over, say, 4 sensor-plane points, then each sensor-plane point, equivalently, integrates information from 4 incident rays, each coming from a distinct scene point (assuming a pinhole aperture). We know integration is essentially low-pass filtering.
The way to understand the effect of the AA filter is to analyze its Point Spread Function (PSF) and Modulation Transfer Function (MTF), which we have seen in Section 11.4.3. Assuming a pinhole aperture, a 4-dot beam-splitting AA filter essentially imposes a PSF where a scene point is spread over 4 sensor-plane points. The PSF is the sum of 4 Dirac Delta functions placed on a regular grid with an offset \(d\) between adjacent grid points (which depends on the difference in refractive indices and the relative positions between the two splitting planes):
\[ \begin{align} f(x, y) = \frac{1}{4}[\delta(x, y) + \delta(x-d, y) + \delta(x, y-d) + \delta(x-d, y-d)]. \end{align} \]
With a little math, which we omit here, we can show that the MTF of this PSF is:
\[ \begin{align} MTF(f_x, f_y) = |\cos(\pi d f_x)||\cos(\pi d f_y)|. \end{align} \]
An example of this MTF is shown in Figure 12.16 (c), where the \(x\)-axis and \(y\)-axis are the two spatial frequencies \(f_x\) and \(f_y\), and the \(z\)-axis is the MTF. We can see that this particular MTF passes low frequencies and has its first zero at a frequency of \(\frac{1}{2d}\), i.e., 0.5 in the case where \(d=1\). Interestingly, the MTF also passes high frequencies, which is generally not a huge concern because power at high frequencies is usually already attenuated by the PSFs of other optical elements (e.g., the main imaging lens). Of course, in reality the aperture is not a pinhole, so the PSF is not simply a sum of four Delta functions but can nevertheless still be similarly analyzed.
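The omitted math can be checked numerically. The Fourier transform of a Dirac delta at \((x_0, y_0)\) is \(e^{-i 2\pi (f_x x_0 + f_y y_0)}\), so the transfer function of the four-delta PSF is the average of four such phasors, and the MTF is its magnitude. The sketch below compares that magnitude against the closed form \(|\cos(\pi d f_x)||\cos(\pi d f_y)|\); the function names and the frequency grid are assumptions made for this illustration.

```python
import numpy as np


def mtf_from_psf(fx, fy, d):
    """MTF as the magnitude of the Fourier transform of the four-delta PSF."""
    offsets = [(0.0, 0.0), (d, 0.0), (0.0, d), (d, d)]
    otf = sum(np.exp(-2j * np.pi * (fx * x0 + fy * y0)) for x0, y0 in offsets) / 4
    return np.abs(otf)


def mtf_closed_form(fx, fy, d):
    """Closed-form MTF stated in the text."""
    return np.abs(np.cos(np.pi * d * fx)) * np.abs(np.cos(np.pi * d * fy))


d = 1.0
f = np.linspace(-1.5, 1.5, 61)
fx, fy = np.meshgrid(f, f)
max_diff = np.max(np.abs(mtf_from_psf(fx, fy, d) - mtf_closed_form(fx, fy, d)))

print(f"max |numeric - closed form| = {max_diff:.2e}")
print(f"MTF at fx = 1/(2d): {mtf_closed_form(0.5 / d, 0.0, d):.2e}")  # first zero
print(f"MTF at fx = 1/d:    {mtf_closed_form(1.0 / d, 0.0, d):.2f}")  # passes again
```

The last two lines make the periodic nature of the response explicit: it vanishes at \(f_x = 1/(2d)\) and returns to unity at \(f_x = 1/d\).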
Figure 12.17 (a) and Figure 12.17 (b) compare the images taken of the same scene by Nikon D800e, which lacks an AA filter, and Nikon D800, which has a 4-dot AA filter. Look at the AC’s condenser coil; the AA image is more blurred but has much less objectionable aliasing effect.
12.5 Monochromatic, Noise-Free Sensor Model
Each in-sensor optical element adds its own spectral transmittance, so the overall transmittance of the in-sensor optics is the product of them. We will simply use \(T(\lambda)\) to represent the overall transmittance. Given what we have discussed so far, we can build an analytical model for a monochromatic, noise-free image sensor. The raw pixel value \(n\) of a pixel of size \(u \times v\) whose top-left corner is \((x, y)\) and is exposed for a duration of \(t_{exp}\) is given by:
\[ \begin{align} \mathcal{Q}_{px} &= \int_{\lambda} \int_{t}^{t+t_{exp}} \int_y^{y+v} \int_x^{x+u} Y(x', y', \lambda, t') T(\lambda) QE(\lambda) \text{d}x' \text{d}y' \text{d}t' \text{d}\lambda, \label{eq:mono_model_1} \\ \Delta V &= \frac{\mathcal{Q}_{px}q}{C_{FD}} \times g, \label{eq:mono_model_2} \\ n &= \lfloor \frac{\Delta V}{V_{max}} (2^{N} - 1) \rfloor, \label{eq:mono_model_3} \end{align} \tag{12.7}\]
where \(Y(x', y', \lambda, t')\) is the number of photons incident on position \((x', y')\) at a particular wavelength \(\lambda\) at a particular time \(t'\), so it is a quantal counterpart of the spectral irradiance; \(T(\lambda)\) is the overall spectral transmittance of the in-sensor optics, \(QE(\lambda)\) is the quantum efficiency, and \(q\) is the elementary charge.
The first equation in Equation 12.7 models \(\mathcal{Q}_{px}\), the total amount of charges collected at the particular pixel, where we integrate spatially, temporally, and spectrally. The second equation in Equation 12.7 is essentially Equation 12.5, and models the voltage difference sensed before and after the exposure. The last equation in Equation 12.7 is a crude ADC model, assuming that the voltage range \([0, V_{max}]\) is quantized into \(N\) bits, and the output of the ADC model is the digital number, a.k.a., the raw pixel value.
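The last two steps of Equation 12.7 are easy to sketch in code once \(\mathcal{Q}_{px}\) is known. The example below is a minimal illustration: the default values (a 1.6 fF floating diffusion, unity gain, a 1 V ADC range, 10 bits) are hypothetical, and the clipping of the output code to \([0, 2^N - 1]\) is an added assumption that Equation 12.7 does not spell out.

```python
import math

Q_E = 1.602176634e-19  # elementary charge q, in coulombs


def raw_pixel_value(n_electrons, c_fd=1.6e-15, gain=1.0, v_max=1.0, n_bits=10):
    """Map collected charges to a digital number following Equation 12.7."""
    # Charge-to-voltage conversion at the FD, followed by the analog gain g.
    delta_v = n_electrons * Q_E / c_fd * gain
    # Crude ADC: quantize [0, v_max] into 2^N levels.
    code = math.floor(delta_v / v_max * (2**n_bits - 1))
    # Assumption for this sketch: out-of-range voltages clip to the code range.
    return max(0, min(code, 2**n_bits - 1))


print(raw_pixel_value(5_000))      # mid-range signal
print(raw_pixel_value(1_000_000))  # far beyond v_max, clips to the top code
```

With these hypothetical parameters, the conversion gain is about 0.1 mV per electron, so roughly 10,000 electrons fill the ADC range.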
How do we express \(Y(x', y', \lambda, t')\), the quantal counterpart of irradiance? The spectral irradiance at position \((x', y')\) and time \(t'\) is:
\[ \begin{align} E(x', y', \lambda, t') = \int^{\Omega(p, V)} L(p, \omega, \lambda, t') \cos\theta~\text{d}\omega, \end{align} \tag{12.8}\]
where \(p = (x', y')\), \(V\) is the aperture, \(\Omega(p, V)\) is the solid angle subtended by \(p\) and \(V\); \(L(p, \omega, \lambda, t')\) is the radiance with a wavelength \(\lambda\) incident on \(p\) from the direction \(\omega\) at time \(t'\), and \(\theta\) is the polar angle subtended by \(\omega\) and the pixel normal vector.
Given Planck’s equation (Equation 12.1), we can turn irradiance \(E\) (energy per unit area per unit time) into the quantity \(Y\) (photon quantity per unit area per unit time):
\[ \begin{align} Y(x', y', \lambda, t') = \frac{E(x', y', \lambda, t') \lambda}{hc}. \end{align} \tag{12.9}\]
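As a quick sanity check of Equation 12.9, the snippet below converts a spectral irradiance into a photon flux. The function name and the example input (1 W/m² concentrated at 550 nm) are illustrative assumptions.

```python
H = 6.62607015e-34   # Planck constant, J*s
C = 299792458.0      # speed of light, m/s


def photon_flux(irradiance, wavelength):
    """Y = E * lambda / (h * c): photons per unit area per unit time."""
    return irradiance * wavelength / (H * C)


# Hypothetical input: 1 W/m^2 of light at 550 nm.
y_green = photon_flux(1.0, 550e-9)
print(f"{y_green:.3e} photons / (m^2 s)")
```

Because each photon carries energy \(hc/\lambda\), the same irradiance corresponds to more photons at longer wavelengths.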
Plugging Equation 12.8 and Equation 12.9 into the \(\mathcal{Q}_{px}\) expression in Equation 12.7, we have:
\[ \begin{align} \mathcal{Q}_{px} = \int_{\lambda} \int_{t}^{t+t_{exp}} \int_y^{y+v} \int_x^{x+u} \int^{\Omega(p, V)} \frac{L(p, \omega, \lambda, t') \cos\theta \text{d}\omega T(\lambda) QE(\lambda) \lambda}{hc} \text{d}x' \text{d}y' \text{d}t' \text{d}\lambda. \end{align} \tag{12.10}\]
Rearranging the terms a bit we get:
\[ \begin{align} \mathcal{Q}_{px} = \int_{\lambda} \Big( \int_{t}^{t+t_{exp}} \int_y^{y+v} \int_x^{x+u} \int^{\Omega(p, V)} L(p, \omega, \lambda, t') \cos\theta \text{d}\omega \text{d}x' \text{d}y' \text{d}t' \Big) T(\lambda) QE(\lambda) \frac{\lambda}{hc} \text{d}\lambda \label{eq:cam_measurement_2}. \end{align} \tag{12.11}\]
Recall from Section 8.5, the inner four integrals in Equation 12.11 collectively form the so-called camera measurement equation, which calculates \(Q(\lambda)\), the energy at wavelength \(\lambda\) collected by the pixel during the exposure7. Therefore, we get:
\[ \begin{align} \mathcal{Q}_{px} = \int_{\lambda} Q(\lambda) T(\lambda) QE(\lambda) \frac{\lambda}{hc} \text{d}\lambda. \end{align} \tag{12.12}\]
We have implicitly assumed here that the effects of the in-sensor optics can simply be modeled by the spectral transmittance \(T(\lambda)\). This is largely reasonable because 1) in-sensor optics are mostly transparent and 2) they are very close to the pixels, so we can ignore rays that are incident on the edge of the optics and, after refractions, miss the pixels.
Spectral Sensitivity Function
We can make a few assumptions to simplify our discussion. First, we assume the ADC quantization error is negligible. Second, we assume that the irradiance within a pixel is spatially and temporally uniform during a short exposure time. The raw pixel value \(n\) in Equation 12.7 is then simplified to:
\[ \begin{align} n \approx k \int_{\lambda} Y(x, y, \lambda, t) T(\lambda) QE(\lambda) \text{d}\lambda, \end{align} \tag{12.13}\]
where \(Y(x, y, \lambda, t)\) is the (average) number of incident photons at wavelength \(\lambda\) hitting position \((x, y)\) at time \(t\); \(k = uvt_{exp}\frac{qg}{C_{FD}}\frac{2^N-1}{V_{max}}\) is a constant.
Let’s define a convenient term: Spectral Sensitivity Function (SSF), which is the product of \(T(\lambda)\) and \(QE(\lambda)\). Therefore, we can rewrite \(n\) as:
\[ \begin{align} n \approx k \int_{\lambda} Y(x, y, \lambda, t) SSF_{quantal}(\lambda) \text{d}\lambda. \end{align} \tag{12.14}\]
SSF is the only spectral (wavelength-dependent) term in Equation 12.14 other than the incident light itself; it represents the phenomenological light sensitivity of the sensor over wavelength. SSF is sometimes also called the camera response function.
The SSF defined in Equation 12.14 is an “equal-quantal” function because it tells us the relative responses between different wavelengths under the same amount of incident photons. We can turn it into an “equal-energy” or “equal-power” function that operates on energy or power. We first express the raw pixel value \(n\) in terms of the spectral power distribution \(\Phi(\lambda)\) rather than the spectral quantity distribution \(Y(\lambda)\) and rewrite Equation 12.14 as:
\[ \begin{align} n \approx k \int_{\lambda} \frac{\Phi(x, y, \lambda, t)}{t_{exp}\frac{hc}{\lambda}} SSF_{quantal}(\lambda) \text{d}\lambda, \label{eq:mono_model_pow_2} \end{align} \]
where \(\Phi(x, y, \lambda, t)\) denotes the spectral power distribution of the light hitting position \((x, y)\) at time \(t\). Now let’s absorb \(t_{exp}hc\) into \(k\) by defining \(k' = uv\frac{qg}{C_{FD}}\frac{2^N-1}{V_{max}}\frac{1}{hc}\) and \(SSF_{power}(\lambda) = SSF_{quantal}(\lambda)\lambda\). We then get:
\[ \begin{align} n \approx k' \int_{\lambda} \Phi(x, y, \lambda, t)SSF_{power}(\lambda) \text{d}\lambda. \end{align} \tag{12.15}\]
\(SSF_{power}(\lambda)\) is the equal-power SSF. The subscript is usually omitted in the literature because it is usually clear what SSF is being used (e.g., from the quantity that is being multiplied with the SSF). Also note that in some literature, the SSF is used interchangeably with QE, so be very careful.
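The equivalence between Equation 12.14 and Equation 12.15 can be verified numerically. The sketch below uses a made-up Gaussian \(SSF_{quantal}\) and a flat photon spectrum, both purely hypothetical, and lumps the wavelength-independent factor \(uv\frac{qg}{C_{FD}}\frac{2^N-1}{V_{max}}\) into a single placeholder constant `a`. The power spectrum \(\Phi\) is obtained by inverting the substitution \(Y = \Phi / (t_{exp}\frac{hc}{\lambda})\) used in the text.

```python
import numpy as np

H = 6.62607015e-34   # Planck constant, J*s
C = 299792458.0      # speed of light, m/s


def integrate(y, x):
    """Trapezoidal rule over the wavelength grid."""
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))


lam = np.linspace(400e-9, 700e-9, 301)                        # wavelength grid, m
ssf_quantal = np.exp(-0.5 * ((lam - 550e-9) / 60e-9) ** 2)    # hypothetical SSF
photons = np.full_like(lam, 1e18)                             # hypothetical flat Y
t_exp = 0.01
a = 1.0  # placeholder for u*v*(q*g/C_FD)*(2^N - 1)/V_max

# Phi from inverting Y = Phi / (t_exp * h * c / lambda).
power = photons * t_exp * H * C / lam
ssf_power = ssf_quantal * lam

k = a * t_exp
k_prime = a / (H * C)
n_quantal = k * integrate(photons * ssf_quantal, lam)         # Equation 12.14
n_power = k_prime * integrate(power * ssf_power, lam)         # Equation 12.15

print(n_quantal, n_power)
print(lam[np.argmax(ssf_quantal)], lam[np.argmax(ssf_power)])
```

The two raw values agree up to floating-point error, while the peak of \(SSF_{power}\) sits at a slightly longer wavelength than that of \(SSF_{quantal}\) because of the extra factor \(\lambda\).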
12.6 Color Sensing
There is one main piece of the on-chip optics we have not discussed: the color filters, which are critical for color sensing and deserve their own section.
12.6.1 Goal of Color Sensing
What does it mean for an image sensor to capture color? We know that colors are subjective sensations caused by cone photoreceptor responses to light; a color can be expressed as a point in a 3D space formed by the L, M, and S cone responses, i.e., the LMS cone space. Ideally, if we can build an image sensor in such a way that it also possesses three kinds of pixels, each of which has a spectral sensitivity matching exactly that of a cone class (i.e., cone fundamental), the sensor would be able to accurately capture and reproduce the color information.
In fact, it is even sufficient for the sensor responses to be just a (linear) transformation away from the cone responses, as long as we can pre-calibrate the transformation matrix offline. This idea is illustrated in Figure 12.19. We emphasize linear transformation here simply because it is computationally cheaper; nothing prevents you from designing a sensor sensitivity profile that requires a sophisticated transformation from the cone space.
Where do the three classes of spectral sensitivities come from? Examine our monochromatic sensing model in Equation 12.14; it appears that all the pixels share the same response function and, thus, have the same spectral sensitivity: every pixel has the same quantum efficiency and the same optical elements sitting above it (so the same spectral transmittance of the optics).
There are a variety of ways to introduce sensitivity differences across pixels, which we will discuss shortly in Section 12.6.2. Assume, for now, that we have somehow introduced the three classes of SSFs, denoted \(SSF_R(\lambda)\), \(SSF_G(\lambda)\), and \(SSF_B(\lambda)\). Given an incident light with an SPD \(\Phi(\lambda)\), the camera responses are:
\[ \begin{align} [\int_{\lambda} \Phi(\lambda)SSF_{R}(\lambda) \text{d}\lambda, \int_{\lambda} \Phi(\lambda)SSF_{G}(\lambda) \text{d}\lambda, \int_{\lambda} \Phi(\lambda)SSF_{B}(\lambda) \text{d}\lambda]. \end{align} \]
This is a direct invocation of Equation 12.15 with the constant omitted. The color of the light expressed in the LMS cone space is:
\[ \begin{align} [\int_{\lambda} \Phi(\lambda)L(\lambda) \text{d}\lambda, \int_{\lambda} \Phi(\lambda)M(\lambda) \text{d}\lambda, \int_{\lambda} \Phi(\lambda)S(\lambda) \text{d}\lambda]. \end{align} \]
If the cone responses form a 3D cone space, the camera raw responses also form a color space, which is sometimes called the camera’s native color space. We provide an interactive tutorial that allows you to explore and compare the native color spaces of various cameras and the LMS cone space. Figure 12.20 (left) shows the SSFs of iPhone 11 (solid lines) and the cone fundamentals. The SSFs are normalized so that \(SSF_G\) peaks at unity, and the cone fundamentals are each normalized to peak at unity, so you can compare the relative sensitivity between the three SSFs in iPhone 11 but not between the cone classes. The SSF of a camera usually depends on a variety of factors such as the materials of the optical elements and the photodiodes as well as the pixel design, so it is almost impossible for the three SSFs to match the cone fundamentals exactly. Figure 12.20 (right) shows the spectral locus in iPhone 11’s native color space and in the cone space; they evidently do not overlap.
A major task in sensor calibration is to identify a \(3\times 3\) transformation matrix \(M\) (not to be confused with the M cone fundamental \(M(\lambda)\)) such that the following (approximately) holds:
\[ \begin{align} M \times \begin{bmatrix} \int_{\lambda} \Phi(\lambda)SSF_{R}(\lambda) \text{d}\lambda\\ \int_{\lambda} \Phi(\lambda)SSF_{G}(\lambda) \text{d}\lambda\\ \int_{\lambda} \Phi(\lambda)SSF_{B}(\lambda) \text{d}\lambda \end{bmatrix} = \begin{bmatrix} \int_{\lambda} \Phi(\lambda)L(\lambda) \text{d}\lambda\\ \int_{\lambda} \Phi(\lambda)M(\lambda) \text{d}\lambda\\ \int_{\lambda} \Phi(\lambda)S(\lambda) \text{d}\lambda \end{bmatrix} \end{align} \]
The transformation matrix is then applied in the post-processing pipeline of the raw pixels to turn raw pixel responses into a color value. We will discuss the calibration and the post-processing pipeline in greater detail in Chapter 14.
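As a rough illustration of how such a matrix could be estimated, the sketch below fits a \(3\times 3\) matrix by least squares over a set of training spectra. Everything specific here is an assumption made for the example: the Gaussian curves merely stand in for real SSFs and cone fundamentals, the training SPDs are random, and the helper names `gaussian` and `fit_color_matrix` are hypothetical. Responses are treated as column vectors, so the matrix multiplies them from the left. This is not the calibration procedure of Chapter 14, only a sketch of the underlying idea.

```python
import numpy as np

def gaussian(wl, peak, width):
    """Bell-shaped curve used as a stand-in spectral sensitivity (hypothetical)."""
    return np.exp(-0.5 * ((wl - peak) / width) ** 2)

def fit_color_matrix(cam, lms):
    """Least-squares M minimizing ||M @ cam - lms||; cam and lms are 3 x K."""
    # lstsq solves cam.T @ X = lms.T for X, and X is M transposed.
    m_t, *_ = np.linalg.lstsq(cam.T, lms.T, rcond=None)
    return m_t.T

wl = np.arange(400.0, 701.0, 5.0)   # wavelength samples in nm
dl = wl[1] - wl[0]                  # integration step for the Riemann sum

# Hypothetical camera SSFs (rows: R, G, B) and cone-like curves (rows: L, M, S).
ssf = np.stack([gaussian(wl, 600, 40), gaussian(wl, 540, 40), gaussian(wl, 460, 30)])
cones = np.stack([gaussian(wl, 570, 50), gaussian(wl, 545, 45), gaussian(wl, 445, 30)])

# 50 random training SPDs (one per row); real calibration would use measured spectra.
rng = np.random.default_rng(0)
spds = rng.uniform(0.0, 1.0, size=(50, wl.size))

cam = ssf @ spds.T * dl     # 3 x 50 camera raw responses
lms = cones @ spds.T * dl   # 3 x 50 cone responses

M = fit_color_matrix(cam, lms)
rel_err = np.linalg.norm(M @ cam - lms) / np.linalg.norm(lms)
```

The residual `rel_err` is generally non-zero because the stand-in SSFs are not exact linear combinations of the stand-in cone curves, which mirrors why the relation above holds only approximately for real cameras.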
12.6.2 Implementing Three “Classes of Pixels”
Perhaps the most straightforward method to introduce varying SSF is to apply a spectral filter to different pixels. A spectral filter is just a transparent optical element with a wavelength-selective transmittance. We need only three filters to emulate the three cone classes, but ideally each pixel should get all three simultaneously, which is difficult if you think about it, since at any given time you can physically have only one filter sitting on a pixel.
Three-Shot and Three-Chip Methods
There are two ways to address this issue. We can take three images of the same scene, each with a different filter, and then combine them. This approach is believed to have been pioneered by Sergey Prokudin-Gorsky, who conducted a breathtaking “photographic survey” of early 20th-century Russia using this method (Prokudin-Gorsky 1948). This is called the “three-shot” approach. Alternatively, one could split the incident light into three beams and send each to a different sensor, each with a different filter. This approach obviously increases the form factor of the camera but avoids having to register and align the three separate shots, which is susceptible to object motion. These cameras are called “three-chip” or “three-CCD/CMOS” cameras, and they are still widely used today in broadcasting, film studios, etc.
Color Filter Array (CFA)
Both the three-shot and the three-chip approach allow each incident light to be transformed to three responses needed for color reproduction — at the cost of capturing overhead or bulky system design. A much simpler approach, and the most commonly used approach today, is called Color Filter Array (CFA), which assigns each pixel only one filter.
Figure 12.21 shows the most commonly used CFA, where the three classes of filters are tiled in what is called the Bayer filter mosaic, named after Bryce Bayer, who invented this pattern while working for Eastman Kodak in Rochester, NY (Bayer 1976). Each of the three filters has a transmittance spectrum that peaks at, roughly, red-ish, green-ish, and blue-ish wavelengths, similar to the spectra shown in Figure 12.20 (left).
The three filter classes are organized in \(2\times 2\) tiles, where each tile has two green filters. Bayer did so because he wanted to mimic human vision, where the photopic Luminous Efficiency Function (LEF) is most sensitive to green-ish lights (Sharpe et al. 2005, 2011) (see Figure 4.9). We can see that the CFA approach is actually more similar to human color vision than the three-shot or three-chip approach. In human vision, each cone photoreceptor has a particular sensitivity spectrum, and generates one of the three responses needed to form color vision.
A necessary consequence of using the CFA is that each pixel gets information for only one color channel. Figure 12.21 (middle) shows a raw image captured using a CFA, where each pixel evidently has only one color channel. The overall image looks overwhelmingly green because of the sheer number of green filters. An important step in the post-processing pipeline is to reconstruct the two missing channels at each pixel, a process called demosaicing, i.e., removing the Bayer mosaic artifacts. An example of the reconstructed image is shown in Figure 12.21 (right).
We will have more to say about the demosaicing process when we get to Chapter 14, but for now, let’s just observe that demosaicing is nothing more than a signal sampling and reconstruction problem. The CFA allows each pixel to sample only one of the three channels of response. So the green-filter response, for instance, is sampled by half of the pixels⁸, and the other two responses are sampled by one quarter of the pixels each. The job of demosaicing is then to reconstruct the full signal responses from the samples, a well-established problem in signal processing.
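The sampling-and-reconstruction view can be sketched in a few lines of Python. The code below simulates a Bayer mosaic and then fills in each channel by averaging whatever samples of that channel fall inside a \(3\times 3\) window. The RGGB tile layout, the helper names (`bayer_masks`, `mosaic`, `box3`, `demosaic_average`), and the averaging scheme itself are assumptions chosen for brevity; this is a deliberately crude reconstruction, not the demosaicing algorithm discussed in Chapter 14 or one used in real pipelines.

```python
import numpy as np

def bayer_masks(h, w):
    """Boolean sampling masks for R, G, B under an assumed RGGB tile layout."""
    r = np.zeros((h, w), dtype=bool)
    g = np.zeros((h, w), dtype=bool)
    b = np.zeros((h, w), dtype=bool)
    r[0::2, 0::2] = True
    g[0::2, 1::2] = True
    g[1::2, 0::2] = True
    b[1::2, 1::2] = True
    return r, g, b

def mosaic(rgb):
    """Keep only the channel that each pixel's filter lets through."""
    h, w, _ = rgb.shape
    masks = bayer_masks(h, w)
    raw = np.zeros((h, w))
    for c, m in enumerate(masks):
        raw[m] = rgb[..., c][m]
    return raw, masks

def box3(x):
    """Sum over each pixel's 3x3 neighborhood, with zero padding at the borders."""
    h, w = x.shape
    p = np.pad(x, 1)
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3))

def demosaic_average(raw, masks):
    """Per channel: mean of the samples of that channel inside the 3x3 window."""
    out = np.empty(raw.shape + (3,))
    for c, m in enumerate(masks):
        out[..., c] = box3(raw * m) / box3(m.astype(float))
    return out

# A constant-color test image (hypothetical values), 4 x 6 pixels.
rgb = np.ones((4, 6, 3)) * np.array([0.2, 0.5, 0.8])
raw, masks = mosaic(rgb)
recon = demosaic_average(raw, masks)
fractions = [m.mean() for m in masks]   # share of pixels sampling R, G, B
```

For a constant-color image the reconstruction is exact, since there is no spatial variation to lose. Images with fine detail would show the aliasing and blurring that more careful demosaicing methods try to suppress.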
Foveon Approach
The final approach does away with optical color filters altogether. Instead, we will use three photodiodes vertically stacked for each pixel. Figure 12.22 illustrates a pixel in the Foveon X3 sensor, which is perhaps the most famous sensor that uses this architecture.
The idea is that the silicon absorption spectrum is wavelength dependent, as shown in the right panel of Figure 12.3. Blue-ish lights have a much shorter mean free path than do green-ish lights, which have a shorter mean free path than do red-ish lights. This means most short-wavelength lights will be absorbed within the first photodiode, leaving mostly medium- to long-wavelength lights. Those lights will go through the second photodiode, which absorbs mostly the medium-wavelength lights, leaving mostly long-wavelength lights to the third photodiode. As a result, each PD actually receives a different light spectrum, effectively creating three different responses for the same light incident on the pixel.
Let’s assume that the three PDs have a depth of \(d_B\), \(d_G\), and \(d_R\), respectively. The incident light impinging on the pixel (i.e., the first PD surface) has an SPD \(\Phi(\lambda)\). The light impinging on the second PD then has a spectrum \(\Phi(\lambda)e^{-\sigma(\lambda)d_B}\), where \(\sigma(\lambda)\) is the silicon’s absorption coefficient spectrum. This is easily derived from the fact that pure absorption (no scattering and emission) leads to an exponential decay of the input signal (Equation 10.4). Similarly, the light impinging on the third PD then has a spectrum \(\Phi(\lambda)e^{-\sigma(\lambda)(d_B+d_G)}\). The responses produced by the three PDs are thus (in the order of R, G, and B):
\[ \begin{align} [\int_\lambda\Phi(\lambda)\eta_R(\lambda)e^{-\sigma(\lambda)(d_B+d_G)} \text{d}\lambda, \int_\lambda\Phi(\lambda)\eta_G(\lambda)e^{-\sigma(\lambda)d_B} \text{d}\lambda, \int_\lambda\Phi(\lambda)\eta_B(\lambda) \text{d}\lambda], \end{align} \]
where \(\eta_R(\lambda)\), \(\eta_G(\lambda)\), and \(\eta_B(\lambda)\) are QE spectra of the three PDs (where we consider only photons that reach a PD as the denominator in Equation 12.2 while ignoring photons that are reflected/absorbed before the photons hit the PD), respectively, and \(\Phi(\lambda)\) is the SPD of the light incident on the pixel surface. The three PDs use identical material (so they share the same silicon absorption spectrum) but can still have different \(\eta(\lambda)\)s because of the thickness differences — due to the differences in the lengths of the depletion and neutral regions in the PD p-n junctions. Can you guess why the thickness tends to increase for deeper PDs in Figure 12.22 (right)?
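The three stacked-PD responses can be evaluated numerically with a short Python sketch. All numbers below are assumptions for illustration: the absorption coefficients are rough, order-of-magnitude values for silicon rather than data from this chapter, the PD depths are hypothetical, and the QE of each PD is modeled simply as the fraction of arriving photons absorbed within that PD's depth (i.e., every absorbed photon is assumed to be collected). The integrals are replaced by sums over a few discrete wavelengths.

```python
import math

# Illustrative absorption coefficients in 1/um, keyed by wavelength in nm (assumed values).
SIGMA = {450.0: 2.5, 550.0: 0.7, 650.0: 0.3}
# Hypothetical depths (um) of the top (B), middle (G), and bottom (R) photodiodes.
D_B, D_G, D_R = 0.4, 0.6, 2.0

def layer_qe(sigma, depth):
    """Assumed QE: fraction of photons reaching a PD that are absorbed inside it."""
    return 1.0 - math.exp(-sigma * depth)

def stacked_responses(spd):
    """spd maps wavelength (nm) to power; returns (R, G, B) as discrete sums."""
    r = g = b = 0.0
    for lam, phi in spd.items():
        s = SIGMA[lam]
        b += phi * layer_qe(s, D_B)                               # top PD sees the full light
        g += phi * layer_qe(s, D_G) * math.exp(-s * D_B)          # attenuated by the top PD
        r += phi * layer_qe(s, D_R) * math.exp(-s * (D_B + D_G))  # attenuated by top and middle PDs
    return r, g, b

blue_light = stacked_responses({450.0: 1.0})   # narrow-band short-wavelength light
red_light = stacked_responses({650.0: 1.0})    # narrow-band long-wavelength light
```

Under these assumed numbers, short-wavelength light produces its largest response in the top PD and long-wavelength light in the bottom PD, which is the wavelength-dependent separation that the stacked design relies on.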
Compared to using the CFA, the vertical PD stacking approach is much more complicated to fabricate and more costly, so it is much less commonly used. It avoids color sampling (and the resulting aliasing) and the need for demosaicing, and in theory could also have a higher overall quantum efficiency (and signal-to-noise ratio) since there are no color filters, so it might find uses in scientific imaging (Chen et al. 2023).
For the charges collected in PD to be transferable to the FD, the photodiode needs to be “pinned”, which means there is another layer of p+ implant above the p-n junction pinned to the ground (0 V). Such a PD is also called the Pinned Photodiode, or PPD (Teranishi et al. 1982; Teranishi 2015; Fossum and Hondongwa 2014).↩︎
\(V_1\) and \(V_{rst}\) technically are ever so slightly different because the charges might be leaking between resetting and read out.↩︎
For instance in Solhusvik et al. (2019), the sensitivity ratio between the LPD and SPD is over 100\(\times\), but the FWC of the SPD is less than three times smaller than that of the LPD.↩︎
They shared the Nobel Prize in Physics in 2009.↩︎
It is worth noting, however, that it is difficult for the CCD sensor to perform CDS because of its read-out architecture (shifting charges to a single SF amplifier).↩︎
It is interesting to note that the existence of a fundamental pixel size limit negates one advantage of CCD sensors: their pixel design is simpler, so one can theoretically make the pixels smaller, but that is countered by the limit to which the PDs can shrink (Fossum 1997).↩︎
Don’t be confused by the two similar notations that represent different quantities: \(\mathcal{Q}_{px}\) for the number of charges at a pixel and \(Q\) for the energy at a pixel.↩︎
If we want to be pedantic, each green pixel has a small, but non-infinitesimal, area, so it first performs a low-pass filtering using a box filter whose extent is the pixel area, followed by sampling at the center of the pixel.↩︎