June 30, 2022
The development of CMOS image sensors and the prospect of using advanced imaging technologies promise to improve the quality of life. With the rapid emergence of parallel analog-to-digital converter (ADC) and backside-illuminated (BI) technologies, CMOS image sensors currently dominate the digital camera market, while stacked CMOS image sensors continue to provide enhanced functionality and user experience. This paper reviews recent achievements of stacked image sensors in the evolution of image sensor architectures to accelerate performance improvements, expand sensing capabilities, and combine edge computing with various stacked device technologies.
Image sensors are currently used in a variety of applications. Since the invention of the charge-coupled device (CCD) in 1969, solid-state image sensors have spread to a variety of consumer markets, such as compact video cameras and digital cameras. The CMOS image sensor, which has been the mainstream solid-state image sensor since 2005, builds on the technology developed for CCDs. In addition to smartphones, currently the largest image sensor market, demand for image sensors is rapidly expanding to include network cameras for security, machine vision for factory automation, and automotive cameras for assisted driving and autonomous driving systems.
A major turning point in CMOS image sensor technology was the successful development of backside-illuminated (BI) image sensors, which enabled the development of stacked image sensor structures, as shown in Figure 1. In the original front-illuminated (FI) structure, it was difficult to reduce the pixel size because the incident light had to reach the photodiode through a gap surrounded by metal lines. BI structures greatly improve sensitivity and allow flexibility in metal routing, and they have become popular in image sensor products thanks to wafer bonding and extremely uniform wafer thinning techniques. Image sensors are gradually developing towards stacked structures, in which logic circuits are directly integrated on the base wafer. The stacking process allows highly parallel analog-to-digital converters (ADCs) and signal processing elements to be integrated at a higher level in more advanced CMOS processes, independent of the sensor process customized for the pixel photodiodes. Stacked device structures continue to dramatically change image sensor architectures.
Figure 1. Structure of a CMOS image sensor. (a) FI structure, (b) BI structure, and (c) stacked structure with vias.
This paper reviews trends in image sensor architectures with stacked devices that significantly accelerate performance improvements, expand sensing capabilities, and integrate edge computing capabilities connected to the sensor layer. Section II presents different sensor architectures for stacked device configurations that enable high pixel resolution and high frame rate imaging through highly parallel column-parallel ADCs. Section III presents some advanced pixel circuits implemented using pixel-pitch Cu-Cu connections, which are critical for better pixel performance at practical pixel resolutions. Pixel-pitch Cu-Cu connections are also enabling sensor architectures to move toward pixel-parallel digitization. Section IV presents some advances in sensor architectures that extend sensing capabilities, such as spatial depth, temporal contrast sensing, and invisible light imaging. Section V introduces vision sensors that integrate artificial intelligence (AI) accelerators at the edge. Finally, Section VI gives some conclusions.
II. Movie Recording with over Megapixels
Movie recording requires a frame rate of at least 30 or 60 frames per second (fps), even though the number of pixels is increasing from the 2-megapixel high-definition (HD) format to the 8-megapixel 4K format. Additionally, higher frame rate operation, such as 120, 240, or 1000 fps, can provide slow-motion playback. Since the column-parallel ADC architecture was proposed in 1997, frame rates have improved by increasing the number of parallel ADCs and speeding up the ADC operation itself. Stacked structures help maximize frame rates because the best process technology can be applied separately to the sensor pixels and the peripheral circuits. Sensor fabrication requires several ion implantation processes to form photodiodes and transistors with low junction leakage. The logic process, by contrast, requires low-resistance, high-speed transistors. For pixels, three or four layers of wiring are usually sufficient, but about ten layers of wiring are required for logic circuits. Stacking alleviates the conflicting constraints that a non-stacked image sensor faces when sensor pixels and logic circuits share the same chip.
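As a rough illustration of why more parallel ADCs and faster conversions both raise the frame rate, the sketch below assumes a simplified readout in which each column ADC digitizes one row per conversion and several rows can be converted at once. The function name, the 4-µs conversion time, and the 2160-row (4K) example are hypothetical and are not taken from the reviewed sensors.

```python
def max_frame_rate(rows: int, conversion_time_s: float, parallel_rows: int = 1) -> float:
    """Upper bound on frames per second for a column-parallel readout.

    rows              -- number of pixel rows in the frame
    conversion_time_s -- time one ADC needs per row (hypothetical value)
    parallel_rows     -- rows digitized at the same time (ADC banks per column)
    """
    # Ceiling division: number of sequential conversion slots per frame.
    slots = -(-rows // parallel_rows)
    return 1.0 / (slots * conversion_time_s)


# Hypothetical 4K sensor (2160 rows) with a 4-microsecond conversion.
single_bank = max_frame_rate(2160, 4e-6, parallel_rows=1)
double_bank = max_frame_rate(2160, 4e-6, parallel_rows=2)
print(f"one ADC bank:  {single_bank:.1f} fps")
print(f"two ADC banks: {double_bank:.1f} fps")
```

In this simplified model, doubling the number of ADC banks or halving the conversion time each doubles the achievable frame rate; real sensors are also limited by output bandwidth and settling times, which the sketch ignores.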
A. Dual Rank ADC Architecture
Currently, most CMOS image sensors include an array of pixels, thousands of ADCs, and logic circuits organized in a column-parallel structure. As shown in Figure 2(a), through-silicon vias (TSVs) located outside the pixel array connect the pixel columns to the ADCs in a highly parallel fashion. In the first stacked CMOS image sensor, introduced in 2013, the analog and digital parts of the column ADC were split between the top and bottom chips, respectively, as shown in Figure 2(b). In 2015, a dual-rank ADC architecture was proposed that achieved a frame rate of 120 fps at 16 megapixels; here the column ADC was moved completely to the bottom chip, as shown in Figure 2(c). The sensor chip is fabricated in a 90-nm custom sensor process for photodiodes, using only NMOS logic. The logic chip is fabricated in a standard 65-nm CMOS process. Since the column ADC can be implemented independently of the sensor chip, the ADC can be highly integrated. In addition to increasing the frame rate, redundant parallel ADCs are used to reduce noise by averaging multiple analog-to-digital (AD) conversions, as shown in Figure 3. The output of one pixel is distributed to two ADCs simultaneously, and the two digital outputs are summed to reproduce the image frame. The timing phases of the two ADCs are slightly different, which reduces the correlation between their noise and thereby improves the noise reduction.
Figure 2. Implementation of a stacked CMOS image sensor. (a) TSV connection between photodiode and logic circuit. (b) The first stacked CMOS image sensor. (c) Dual-rank ADC architecture.
Figure 3. Simplified block diagram (left) and improved noise characteristics (right) of a dual-rank ADC architecture.
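The benefit of reducing the correlation between the two conversions can be sketched numerically. The example below is an illustrative model only: it assumes both ADCs add Gaussian noise of equal RMS value sigma with correlation coefficient rho (between 0 and 1), and it compares the closed-form RMS noise of the averaged output with a small Monte Carlo simulation. The function names and parameter values are hypothetical.

```python
import math
import random
import statistics


def averaged_noise(sigma: float, rho: float) -> float:
    """RMS noise of the mean of two readings with equal noise and correlation rho."""
    return sigma * math.sqrt((1.0 + rho) / 2.0)


def simulate_averaged_noise(sigma: float, rho: float, n: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (requires 0 <= rho <= 1)."""
    rng = random.Random(seed)
    averaged = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)  # noise component common to both ADCs
        a = sigma * (math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0))
        b = sigma * (math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0))
        averaged.append((a + b) / 2.0)
    return statistics.pstdev(averaged)


for rho in (0.0, 0.5, 1.0):
    print(rho, round(averaged_noise(1.0, rho), 3), round(simulate_averaged_noise(1.0, rho), 3))
```

Under these assumptions, fully uncorrelated noise is reduced by a factor of 1/sqrt(2), whereas fully correlated noise is not reduced at all, which is why shifting the timing phases of the two ADCs helps.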
B. Three-layer stacked CMOS image sensor with dynamic random access memory (DRAM)
As the number of pixels and parallel ADCs increases, image sensors output large amounts of data. In 2017, a three-layer stacked CMOS image sensor was proposed to record slow-motion video at 960 fps, as shown in Figure 4. The three layers are connected by through-silicon vias (TSVs), and the data obtained from the parallel ADCs is buffered in the second-layer DRAM to achieve slow-motion capture. For super slow-motion recording, the sensor can run at 960 fps at full HD resolution while the digital data from the ADCs is temporarily buffered in the DRAM over a 102-Gbit/s bus. When the sensor detects a user trigger or fast motion in the scene during 30-fps movie shooting, the readout speed switches to 960 fps. Up to 63 frames at full HD resolution can be stored in the DRAM at a time, and the buffered data is output during subsequent movie capture.
Figure 4. Three-layer stacked CMOS image sensor with DRAM
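The figures quoted for the DRAM buffer can be put into perspective with a short calculation. The sketch below uses the 63-frame buffer, the 960-fps capture rate, and the 30-fps movie rate from the text; the 10 bits per pixel used for the raw data rate is an assumption made only for illustration, as the bit depth is not stated here.

```python
FULL_HD_PIXELS = 1920 * 1080  # pixels in one full HD frame


def burst_seconds(frames: int, capture_fps: float) -> float:
    """Real-time duration covered by a burst of buffered frames."""
    return frames / capture_fps


def playback_seconds(frames: int, playback_fps: float) -> float:
    """Duration of the same frames when played back at the movie rate."""
    return frames / playback_fps


def raw_rate_gbit_s(pixels: int, fps: float, bits_per_pixel: int) -> float:
    """Raw pixel data rate in Gbit/s (bits_per_pixel is an assumption)."""
    return pixels * fps * bits_per_pixel / 1e9


burst = burst_seconds(63, 960)        # 0.065625 s of real time
playback = playback_seconds(63, 30)   # 2.1 s of slow-motion playback
rate = raw_rate_gbit_s(FULL_HD_PIXELS, 960, 10)  # assumed 10-bit pixels
print(f"burst {burst * 1e3:.1f} ms, playback {playback:.1f} s, slowdown {960 / 30:.0f}x")
print(f"raw data rate at 960 fps: {rate:.1f} Gbit/s")
```

With these inputs, a 63-frame burst covers about 66 ms of real time and plays back over 2.1 s, a 32-fold slowdown.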
C. Chip-on-Wafer Technology for Large Optical Format
The stacked CMOS image sensors introduced so far are fabricated in a wafer-on-wafer (WoW) bonding process. However, since the dimensions of the sensor and logic chips must be the same, this process is not always the best choice, especially for a large optical format. Another stacking method uses chip-on-wafer (CoW) bonding, as shown in Figure 5. Area efficiency is best with WoW bonding when a logic chip of the same size as the optical format is completely filled with highly parallel ADCs and digital building blocks. However, if the logic circuit is smaller than the optical format, the CoW configuration has the best area efficiency, while the WoW configuration has cost issues.
Figure 5. Area efficiency of WoW and CoW bonding processes for large optical format image sensors.
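The area-efficiency argument can be expressed as a toy model. The sketch below assumes that in WoW bonding the logic die must have the same area as the sensor die, whereas in CoW bonding the logic die can be cut to the area the circuits actually need. The function name and the example areas are hypothetical, and the model ignores chip placement throughput and other cost factors.

```python
def logic_area_utilization(sensor_area_mm2: float, logic_needed_mm2: float, bonding: str) -> float:
    """Fraction of the bonded logic silicon occupied by useful circuits."""
    if bonding == "WoW":
        # Wafer-on-wafer: the logic die must match the sensor die size.
        logic_die_mm2 = sensor_area_mm2
    elif bonding == "CoW":
        # Chip-on-wafer: the logic die can be sized to what the circuits need.
        logic_die_mm2 = min(logic_needed_mm2, sensor_area_mm2)
    else:
        raise ValueError("bonding must be 'WoW' or 'CoW'")
    return min(logic_needed_mm2, logic_die_mm2) / logic_die_mm2


# Hypothetical large-format sensor of 400 mm^2 needing only 100 mm^2 of logic.
print(logic_area_utilization(400.0, 100.0, "WoW"))  # 0.25
print(logic_area_utilization(400.0, 100.0, "CoW"))  # 1.0
```

When the required logic area equals the sensor area, both options reach full utilization in this model, which matches the case in which WoW bonding is the most area-efficient choice.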
A stacked CMOS image sensor using the CoW bonding process was reported in 2016, realizing a global shutter image sensor for broadcast cameras with a super-35-mm optical format. Here, two diced logic chips, designed in a 65-nm CMOS process with parallel ADCs and microbumps, are stacked on a large sensor chip custom-designed for global shutter pixels, as shown in Figure 6. Each diced logic chip has a high aspect ratio and is connected to the sensor via microbumps with a pitch of 40 µm. The total number of connections is about 38 000. The sensor also allows for super slow-motion playback at 480 fps with 8 megapixels.
Figure 6. Stacked CMOS image sensor using CoW bonding process.
Figure 7 shows performance trends for large optical-format image sensors, with 50 megapixels and 250 fps for full-35-mm-format image sensors in 2021. To increase the number of parallel ADCs and incrementally increase the static random access memory (SRAM) frame buffer, the WoW process is used to achieve high performance. On the other hand, the CoW process is used to balance cost efficiency with the performance of large optical-format sensors. Also introduced in 2021 is a 3.6-inch image sensor with 127 million pixels and four logic chips stacked using a CoW process. The next challenge for the CoW process is to increase the throughput of chip placement on the wafer to increase productivity.
Figure 7. Performance trends for large optical format image sensors.
III. Pixel Parallel Architecture
In the previous section, the sensor architecture using stacked devices was mainly used to increase the frame rate of column-parallel ADC based architectures. This section presents some advances based on pixel-parallel architectures using fine-pitch Cu-Cu connections. Here, the connections between the sensor and logic layers have been changed from TSVs to hybrid-bonded Cu-Cu connections, as shown in Figure 8(a). In a TSV configuration, the signal lines are routed to the logic layer at the periphery of the pixel array. In contrast, Cu-Cu connections can be integrated directly under the pixels, which allows the number of connections to be increased. The latest trends in Cu-Cu connection pitch are shown in Figure 8(b). The hybrid bonding process for image sensors requires millions of Cu-Cu connections without connection defects, and the contact pitch has gradually decreased as large numbers of contacts have been connected stably; moreover, a 1-µm Cu-Cu hybrid bonding pitch has recently been reported. These fine-pitch connections will enable pixel-parallel circuit architectures to be fabricated at practical pixel dimensions.
Figure 8. Cu-Cu junction spacing trends (a) simplified device structure and (b) cross-section.
A. Stacked pixel circuit expansion
Numerous techniques and implementations have been proposed in the literature to improve pixel performance, such as full well capacity (FWC), through pixel circuit expansion and to implement additional functions, such as global shutter. Figure 9(a) and (b) show the pixel configurations for single conversion gain and dual conversion gain, respectively. A smaller capacitance CFD produces a high voltage swing from photoelectrons for low-noise readout, but it is easily saturated by a large number of signal electrons. A pixel with dual conversion gain, however, switches sequentially between the two conversion gains, enabling low-noise readout with CFD and high dynamic range (HDR) readout with CDCG. On the other hand, the area overhead of the additional transistors and capacitors limits how far the pixel size can be reduced for high pixel resolution. In 2018, a stacked pixel circuit extension with dual conversion gain was proposed; the additional circuits were implemented on the bottom chip through pixel-parallel Cu-Cu connections, as shown in Figure 9(c). By switching between conversion gains of 20 and 200 µV/e-, a 1.5-µm pixel was successfully demonstrated with a dynamic range of 83.8 dB and low noise of 0.8 e-rms. As shown in Figure 10, the pixel-level stacked circuit configuration has been applied to a voltage-domain global shutter function as well as to pixels with dual conversion gain. In 2019, a 2.2-µm global shutter pixel with a shutter efficiency of over 100 dB was demonstrated. State-of-the-art pixels with dual conversion gain and voltage-domain global shutter achieve pixel sizes of 0.8 µm and 2.3 µm, respectively, without pixel-level stacked circuit expansion; however, stacked pixel configurations are still expected to enhance pixel performance for smaller pixels.
Figure 9. Pixel circuit configurations (a) with single conversion gain, (b) with dual conversion gain, and (c) with dual conversion gain and stacked pixels with pixel-parallel Cu-Cu connections.
Figure 10. Pixel circuit configuration of a stacked voltage-domain global shutter via pixel-parallel Cu-Cu connections.
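To make the conversion-gain trade-off concrete, the sketch below converts the quoted conversion gains into equivalent capacitances and the quoted dynamic range into a maximum signal level. It assumes the idealized relation conversion gain = q / C (ignoring source-follower gain and parasitics) and the usual definition of dynamic range as 20·log10(maximum signal / noise floor); the function names are hypothetical.

```python
Q_E = 1.602176634e-19  # elementary charge in coulombs


def equivalent_capacitance_ff(conversion_gain_uv_per_e: float) -> float:
    """Capacitance in femtofarads that yields the given conversion gain (ideal q/C model)."""
    farads = Q_E / (conversion_gain_uv_per_e * 1e-6)
    return farads * 1e15


def max_signal_electrons(dynamic_range_db: float, noise_e_rms: float) -> float:
    """Largest signal implied by a dynamic range and a noise floor."""
    return noise_e_rms * 10 ** (dynamic_range_db / 20.0)


print(f"200 uV/e- -> {equivalent_capacitance_ff(200):.2f} fF")  # high gain, small capacitance
print(f" 20 uV/e- -> {equivalent_capacitance_ff(20):.2f} fF")   # low gain, large capacitance
print(f"83.8 dB with 0.8 e-rms -> about {max_signal_electrons(83.8, 0.8):.0f} e-")
```

Under these assumptions, the high-gain mode corresponds to roughly 0.8 fF and the low-gain mode to roughly 8 fF, and the quoted dynamic range implies a maximum signal on the order of 12 000 electrons.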
B. Pixel Parallel ADC
Since the concept of pixel-parallel digitization was proposed in 2001, pixel-parallel Cu-Cu-connected stacked image sensors with hybrid bonding processes have also been proposed. The in-pixel area overhead of complex circuits definitely limits pixel resolution, but in 2017 a 4.1-megapixel stacked image sensor with an array-parallel ADC architecture was proposed, followed in 2018 by a 1.46-megapixel stacked image sensor with pixel-parallel ADCs. The pixel-parallel ADC architecture has achieved megapixel resolution thanks to the fine-pitch Cu-Cu connections of the hybrid bonding process. As shown in Figure 11, single-slope ADCs are used in both the pixel-parallel and the traditional column-parallel architectures, but the pixel-parallel version has no source follower circuit. The in-pixel transistor amplifier is integrated directly into the comparator, and each pixel is connected to the bottom chip via two Cu-Cu connections. Because of the area limitation for counters, Gray code is distributed to in-pixel latches, and the digital readout pipelines are implemented with the ADCs under the pixel array.
Figure 11. Circuit configuration of pixel-parallel ADC.
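The single-slope conversion with a distributed Gray code can be sketched behaviorally. The model below is a simplified illustration, not the circuit of the reported sensor: a shared counter value is broadcast as Gray code while a ramp rises, and each pixel latches the code at the moment its comparator detects that the ramp has reached the pixel voltage. The function names, the 10-bit resolution, and the 1.0-V full scale are hypothetical.

```python
def binary_to_gray(n: int) -> int:
    """Convert a binary count to reflected Gray code."""
    return n ^ (n >> 1)


def gray_to_binary(g: int) -> int:
    """Convert reflected Gray code back to a binary count."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def single_slope_convert(v_pixel: float, v_full_scale: float = 1.0, bits: int = 10) -> int:
    """Return the Gray code latched by one pixel during a ramp sweep."""
    steps = 1 << bits
    latched = binary_to_gray(steps - 1)  # stays at full scale if the ramp never crosses
    for count in range(steps):
        v_ramp = v_full_scale * count / steps
        if v_ramp >= v_pixel:  # comparator flips: latch the broadcast code
            latched = binary_to_gray(count)
            break
    return latched


code = single_slope_convert(0.5)
print(f"latched Gray code {code:010b} -> count {gray_to_binary(code)}")
```

A general property of Gray code is that consecutive codes differ in exactly one bit, so a latch that captures the code during a transition is off by at most one count.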
Figure 12(a) shows a prototype chip with a pixel-parallel ADC architecture. Although each ADC is implemented within a pixel pitch of only 6.9 µm, where the quiescent current of the comparator is limited to 7.74 nA, the noise floor is suppressed to 8.77 e−rms owing to effective bandwidth control. All pixel-parallel ADCs operate simultaneously as a global shutter; therefore, as shown in Figure 12(c), the rolling shutter focal plane distortion seen in Figure 12(b) is not observed in images captured using the prototype. Pixel-parallel ADC architectures continue to be developed. Recent works in 2020 report a pixel pitch of 4.6 µm with a dynamic range of 127 dB and noise of 4.2 e−rms, and a pixel pitch of 4.95 µm with noise of 2.6 e−rms.
Figure 12. On-chip implementation of a pixel-parallel ADC. (a) Micrograph of the chip. (b) Images captured using rolling shutter operation and (c) using global shutter operation.
C. Pixel Parallel Photon Counter
Photon counting imaging, also known as quantum imaging, is a promising technique for enabling image capture with noise-free readout and high dynamic range (HDR) imaging. Photon-counting image sensors using single-photon avalanche diodes (SPADs) are one of the challenges for pixel-parallel digitization through stacking techniques. The avalanche current is triggered by a single photoelectron, and in the absence of any noise from analog front-end circuitry, the event can be treated digitally as a photon count. This requires complex circuits for each SPAD, and stacked device structures with pixel connections offer the potential for highly integrated photon counting imaging.
A SPAD photon-counting image sensor with a dynamic range of 124 dB that uses a subframe extrapolation architecture was reported in 2021. A backside-illuminated (BI) SPAD pixel array is stacked on the bottom chip, and the readout circuitry is connected via pixel-parallel Cu-Cu connections, as shown in Figure 13(a). Figure 13(b) is a schematic diagram of a pixel unit. Each pixel has a 9-b digital ripple counter (CN) that counts the number of incident photons. The overflow carry (OF) from the counter is fed back to the quench circuit to control SPAD activation and to latch the timing code (TC). A 14-b TC is distributed to all pixels and overwrites the counter when the OF flag changes, as shown in the timing diagram in Figure 14. Either the 9-b photon count or the latched 14-b TC is then read out. Under low-light conditions, all photon counts are obtained accurately without counter overflow. Under bright-light conditions, when the counter overflows, the overflowing pixel records the overflow time, and the actual number of incident photons over the whole exposure is extrapolated from it.
Figure 13. Photon counting image sensor. (a) Chip configuration. (b) Simplified pixel circuit diagram.
Figure 14. Timing diagram for photon counting and subframe extrapolation.
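The extrapolation step can be sketched as follows. This is a simplified model rather than the reported implementation: it assumes a constant photon rate within a subframe and a timing code that increases linearly from 1 to 2^14 over the subframe, so that a pixel whose 9-b counter overflows at timing code TC is estimated to have received 512 × 2^14 / TC photons. The function and constant names are hypothetical.

```python
from typing import Optional

COUNTER_BITS = 9
TC_BITS = 14
OVERFLOW_COUNT = 1 << COUNTER_BITS  # 512 photons fill the 9-b counter
TC_STEPS = 1 << TC_BITS             # 16384 timing steps per subframe (assumed linear)


def estimate_photons(count: int, overflow_tc: Optional[int]) -> float:
    """Estimate the photons received in one subframe.

    count       -- value read from the counter when no overflow occurred
    overflow_tc -- latched timing code if the counter overflowed, else None
    """
    if overflow_tc is None:
        return float(count)  # low light: the count is exact
    if not 0 < overflow_tc <= TC_STEPS:
        raise ValueError("timing code out of range")
    # Bright light: 512 photons arrived in overflow_tc / TC_STEPS of the subframe.
    return OVERFLOW_COUNT * TC_STEPS / overflow_tc


print(estimate_photons(300, None))  # 300.0
print(estimate_photons(0, 4096))    # 2048.0
```

In the second call, the counter filled after a quarter of the subframe, so the estimate is four times the counter capacity.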
As shown in Figure 15(a), a dynamic range of 124 dB has been demonstrated without any degradation in signal-to-noise ratio (SNR). The SNR after counter overflow under bright light conditions remains at 40 dB over the extended dynamic range, since true photon counting can count up to 10 240 photons, that is, the 9-b counter range (512 counts) × 20 subframes. Figure 15(b) shows an HDR image captured at 250 fps; due to global shutter and 20-subframe HDR operation, no motion artifacts were observed even with a fan rotating at 225 rpm. The 20-subframe extrapolation effectively suppresses motion artifacts, as shown in Figure 15(c). SPADs require a high bias voltage of about 20 V together with pixel-parallel detector circuits that operate at a low supply voltage. SPAD pixels with small pitches are often difficult to achieve because of the device isolation needed between the different supply voltages. However, the stacked device structure effectively separates the SPAD and CMOS logic layers, thereby accelerating the development of small pixel configurations with SPADs and extended functionality.
Figure 15. Measurement results of photon counting. (a) Dynamic range and signal-to-noise ratio. (b) Captured HDR image. (c) Captured image with motion artifact suppression.
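The link between the 10 240-photon counting limit and the 40-dB SNR can be checked with a short calculation. The sketch assumes shot-noise-limited detection, in which a count of N photons has a standard deviation of sqrt(N), so that the SNR in decibels is 20·log10(sqrt(N)) = 10·log10(N). The function names are hypothetical.

```python
import math


def max_true_count(counter_bits: int, subframes: int) -> int:
    """Largest photon count obtainable without extrapolation."""
    return (1 << counter_bits) * subframes


def shot_noise_snr_db(photons: float) -> float:
    """SNR in dB for a Poisson-distributed photon count (assumed shot-noise limit)."""
    return 10.0 * math.log10(photons)


limit = max_true_count(9, 20)
print(limit)                               # 10240
print(round(shot_noise_snr_db(limit), 1))  # 40.1
```

Under this assumption, counting up to 10 240 photons corresponds to an SNR of about 40 dB, which agrees with the value quoted in the text.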
IV. Expansion of Sensing Capability
In addition to the previously introduced dynamic range and global shutter capabilities, stacked device technology not only enhances the image quality of the sensor architecture, but also enhances sensing capabilities such as spatial depth, temporal contrast sensing, and invisible light imaging.
A. Spatial depth
As described in Section III-C, the stacked device structure with Cu-Cu hybrid bonding is a promising approach for practical SPAD technology in a wide range of applications and reduces the SPAD pixel pitch to less than 10 µm. To improve photon detection efficiency (PDE) and reduce optical crosstalk with small pixel pitch, a BI SPAD pixel array including full trench isolation (FTI) and Cu-Cu bonding was reported in 2020. As shown in Figure 16, in the BI stacked SPAD structure, the SPAD pixel array is completely open to incident light, and all pixel transistors are implemented on the bottom chip. Metal buried FTI helps suppress crosstalk with adjacent pixels. The 10-µm pitch SPAD pixels feature a 7-µm-thick silicon layer to improve the sensitivity of near-infrared (NIR) spectroscopy measurements and achieve high PDEs of over 31.4% and 14.2% at 850 nm and 940 nm, respectively.
Figure 16. SPAD device structure. (a) FI SPAD. (b) BI-stacked SPAD.
In 2021, a 189 × 600 SPAD direct time-of-flight (ToF) sensor using a BI-stacked SPAD was reported for automotive LiDAR systems. All pixel front-end circuits are implemented in the bottom chip under the SPAD array, as shown in Figure 17. In a LiDAR system, when a reflected laser pulse is received, the SPAD generates a trigger pulse with a dead time of 6 ns and transmits it to a time-to-digital converter (TDC). The top and bottom chips use a 90-nm SPAD process and a 40-nm CMOS process with 10 copper layers, respectively. Thanks to the stacked structure, the sensor includes a coincidence detection circuit, TDC, and digital signal processor (DSP) as the building blocks for depth sensing. The direct ToF sensor exhibits a distance accuracy of 30 cm over an extended range of up to 200 m, enabling it to detect objects with 95% reflectivity in sunlight at 117k lux.
Figure 17. BI stacked SPAD with direct ToF depth sensor.
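The timing scales involved in direct ToF ranging follow from the round-trip relation distance = c · t / 2. The sketch below applies it to the 200-m range, the 30-cm accuracy, and the 6-ns dead time quoted in the text; the function names are hypothetical, and the calculation says nothing about how the reported sensor reaches its accuracy.

```python
C_M_PER_S = 299_792_458.0  # speed of light in vacuum


def round_trip_time_s(distance_m: float) -> float:
    """Time for a laser pulse to reach a target and return."""
    return 2.0 * distance_m / C_M_PER_S


def range_from_time_m(time_s: float) -> float:
    """Target distance corresponding to a measured round-trip time."""
    return C_M_PER_S * time_s / 2.0


print(f"200 m target: {round_trip_time_s(200.0) * 1e6:.3f} us round trip")
print(f"30 cm of range: {round_trip_time_s(0.3) * 1e9:.2f} ns of timing")
print(f"6 ns dead time: {range_from_time_m(6e-9):.2f} m of range")
```

A 200-m target therefore returns its echo after about 1.33 µs, and resolving 30 cm corresponds to timing differences of about 2 ns.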
The BI stacked SPAD structure is a breakthrough in SPAD-based imaging and depth sensing with improved properties. The BI stack structure improves quantum efficiency and separates the SPADs and circuits into optimal silicon layers compared to conventional pixels that place the circuits next to each SPAD. Therefore, the stacked implementation overcomes the traditional limitations of SPAD sensors and is suitable for a wider range of applications.
B. Temporal Contrast Sensing
Event-based vision sensors (EVS) detect single-pixel temporal contrast above preset relative thresholds to track the temporal evolution of relative light changes and define sampling points for frameless pixel-level measurements of absolute intensity. Since EVS was first reported in 2006, many applications using EVS have been proposed, such as high-speed and low-power machine vision, owing to the temporal precision of the recorded data, the inherent suppression of temporal redundancy that reduces post-processing costs, and wide intra-scene dynamic range (DR) operation. Although the pixel size was reduced to a 9-µm pitch in 2019 through BI structures, EVS suffers from large pixel sizes and often low resolution because of extensive pixel-level analog signal processing. Therefore, EVSs particularly benefit from advances in stacked device structures with pixel-scale Cu-Cu connections.
A 1280 × 720 BI-stacked EVS with a 4.86-µm pixel pitch was reported in 2020. Figure 18 shows the pixel block diagram of the contrast detection (CD) function and a schematic diagram of the in-pixel asynchronous readout interface and state logic blocks. The photocurrent is converted to a voltage signal, Vlog, and the contrast change is obtained by asynchronous delta modulation (ADM), detected using a level-crossing comparator. The BI-stacked EVS in Figure 19(a) achieves 1-µs row-level timestamps, a maximum event rate of 1.066 billion events per second (eps), and a data formatting pipeline with 35 nW/pixel and 137 pJ/event for high-speed, low-power machine vision applications. Figure 19(b) shows sensor operation for some example applications. Traffic scene recordings at around 1 lux demonstrate low-light contrast sensitivity. The high temporal accuracy obtained from low-latency pixels and high-speed readout allows the sensor to decode time-encoded structured light patterns in 3D depth sensing applications. Figure 20 shows the trend of pixel pitch in EVS. Thanks to stacked device technology, the pixel size of EVS is now below a 5-µm pitch for practical megapixel use cases.
Figure 18. Pixel block diagram of EVS
Figure 19. BI-stacked EVS and its application example. (a) Micrograph of the chip. (b) Application Examples.
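The contrast detection principle can be illustrated with a behavioral model of a single pixel. The sketch below is a simplified software analogue, not the reported circuit: it takes sampled intensities, forms their logarithm (standing in for Vlog), and emits an ON or OFF event each time the log signal moves one threshold step away from a stored reference level. The function name, the threshold of 0.2, and the sample values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def contrast_events(intensities: Sequence[float], threshold: float = 0.2) -> List[Tuple[int, int]]:
    """Return (sample_index, polarity) events for one pixel.

    polarity is +1 for an ON event (brighter) and -1 for an OFF event (darker).
    Intensities must be positive because the model works on their logarithm.
    """
    events = []
    reference = math.log(intensities[0])
    for index, value in enumerate(intensities[1:], start=1):
        v_log = math.log(value)
        # Level crossing upward: step the reference up and emit ON events.
        while v_log - reference >= threshold:
            reference += threshold
            events.append((index, +1))
        # Level crossing downward: step the reference down and emit OFF events.
        while reference - v_log >= threshold:
            reference -= threshold
            events.append((index, -1))
    return events


samples = [100.0, 100.0, 150.0, 150.0, 60.0]
print(contrast_events(samples))
```

A constant intensity produces no events at all, which reflects the suppression of temporal redundancy described in the text; because the model operates on the logarithm, the same relative change triggers the same number of events at any absolute light level.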
C. Invisible light imaging
Stacked device technology also facilitates invisible light imaging using non-silicon photodetectors in hybrid integration. Examples of non-silicon photodetectors with hybrid integration include InGaAs photodetectors, Ge-on-Si photodetectors, and organic photoconductive films. In this section, recent results of InGaAs sensors using Cu-Cu hybrid bonding are summarized.
The demand for imaging in the short-wave infrared (SWIR) range (i.e., wavelengths between 1000 and 2000 nm) has been increasing for industrial, scientific, medical, and security applications. InGaAs devices have been used in SWIR sensors because they absorb in the SWIR range, which silicon-based devices cannot cover. In conventional InGaAs sensors, each pixel of the photodiode array (PDA) is connected to a readout integrated circuit (ROIC) via flip-chip hybridization using bumps. This structure typically complicates the fabrication of fine-pitch pixel arrays due to the limited scalability of bumps. In 2019, an InGaAs image sensor was introduced in which each 5-µm pixel of the PDA was connected to the ROIC using Cu-Cu bonding. InGaAs/InP heterostructures were epitaxially grown on small commercially available InP substrates with diameters less than 4. As shown in Figure 21, the epitaxial InGaAs/InP wafers are diced into chips and transferred to large silicon wafers using a III-V die-to-silicon process. After fabrication of the Cu pads, the III-V/Si heterowafer is hybridized with the ROIC using Cu-Cu bonding, which connects each III-V pixel to the ROIC. Figure 22 shows the contact pitch trend for flip-chip bumps and Cu-Cu bonding for InGaAs sensors. Flip-chip hybridization using bumps, the traditional method of fabricating InGaAs sensors, is not suitable for scaling down the pixel pitch due to narrow process margins and poor repeatability. However, Cu-Cu hybridization has been used for mass production of CMOS image sensors with high yields since 2016 and is a key technology for scaling interconnects to InGaAs sensors. Figure 22 also shows an example of an application involving inspection and security monitoring in a foggy scenario. Thus, InGaAs image sensors enable HD SWIR imaging through pixel-level Cu-Cu connections.
Figure 21. Process flow diagram for InGaAs image sensor fabrication.
Figure 22. Flip-chip bump contact pitch trends and application examples for Cu-Cu bonding and InGaAs sensors.
V. Smart Vision Sensors
Demand for camera products with AI processing capabilities is growing in the Internet of Things (IoT) market, retail, smart cities, and similar applications. AI processing power on such edge devices can address some of the issues associated with pure cloud computing systems, such as latency, cloud communications, processing costs, and privacy concerns. Market demands for smart cameras with AI processing capabilities include small size, low cost, low power consumption, and ease of installation. However, conventional CMOS image sensors only output the raw data of the captured image. Therefore, when developing a smart camera with AI processing capabilities, it is necessary to use ICs that include image signal processor (ISP), convolutional neural network (CNN) processing, DRAM, and other capabilities.
A stacked CMOS image sensor consisting of 12.3 megapixels and a DSP dedicated to CNN computation was reported in 2021. As shown in Figure 23, the sensor provides an integrated solution that covers everything from image capture to transfer of the full image to the CNN inference processor, and it can process at 120 fps, including image capture and on-chip CNN processing with a 4.97-TOPS/W DSP. The processing block has an ISP for CNN input preprocessing, a DSP subsystem optimized for CNN processing, and an 8-MB L2 SRAM for storing CNN weights and runtime memory. Figure 24 shows some examples of CNN inference results using MobileNet v1. The DSP subsystem demonstrated inference results similar to those of TensorFlow. Smart vision sensors are able to run the complete CNN inference process on the sensor, and can output the captured images as raw data and CNN inference results in the same frame through the MIPI interface. The sensor also supports output of CNN inference results only from the SPI interface to enable small cameras and reduce system power consumption and cost. The CNN inference processor on the sensor allows users to program their favorite AI models into embedded memory and reprogram them according to the requirements or conditions of where the system is used. For example, when installed at the entrance of a facility, it can be used to count the number of visitors entering the facility; when installed on a store shelf, it can be used to detect out-of-stock situations; when installed on the ceiling, it can be used for heat mapping store visitors. Smart vision sensors are expected to provide low-cost edge AI systems for various applications using flexible AI models.
VI. Conclusion
This paper reviews recent achievements in image sensor architectures with stacked device structures. The stacked device structure greatly improves image sensor performance, especially at high frame rates and high pixel resolutions, through highly parallel ADCs implemented with process technologies optimized separately for the sensor pixels and the CMOS circuits. In recent work, several proposals have been made, with some results, using pixel-parallel stacked circuits and/or smarter processing units. These new challenges require higher scalability, more optimization of process technology for each function, and higher area efficiency. Photodetectors, pixel front-end circuits, analog mixed-signal and digital processors, and memories can be integrated more efficiently, as shown in Figure 25, and future image sensor architectures will develop further, expanding their capabilities through device stacking techniques.