Computer vision typically involves two steps: processing the image to extract the features required by the particular application, and analyzing the extracted features to produce the required results. Parallel processing technology is extremely well suited to the first stage of computer vision applications, as is evident from the number of current solutions that pair a DSP with an FPGA. The FPGA, which is the ultimate parallel processing engine, is used where DSPs lack the throughput for the pixel-level processing needed to extract the features, and the DSP is then used as a general-purpose processor to analyze the extracted data.
The CV220X provides the ideal processing engines and architecture for image processing in a single package. The APEX core has 96 Computational Units (CUs) running in parallel with an ARM processor. APEX performs all the “heavy” parallel processing that would typically require an FPGA, while the ARM processor concurrently analyzes the extracted features. Just as importantly, this parallel operation is non-blocking (unlike a traditional multi-core approach) because the CUs work from their own local memory, leaving the main external SDRAM free for the ARM's use.

The APEX Advantage
In pure DSP solutions, filters are used to extract features using the DSP’s parallel MAC engine (4-8 MACs running in parallel). Nearly all DSPs today support caching, but the cache is well suited only to neighborhood processing, which is what the filters require, rather than to processing an image as a whole. These filters typically use 3x3 to 9x9 regions around the output pixel of interest. Caching works well within a single filter because adjacent pixels are reused and come from the cache. The difficulty arises when processing a VGA or larger image: the cache “follows the output data”, so by the time processing reaches the lower right of the image, the cache no longer holds any valid data for the next filter, which starts again at the top left. In effect, the cache is flushed for every image processed, hence the demand for larger caches in DSP solutions. With APEX, data dependencies are resolved before the kernels are executed, similar to a cache pre-fetch, which means the ICP will never stall due to data unavailability.
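The cache behavior described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a model of any particular DSP: the cache is tracked at whole-row granularity with LRU replacement, output rows are write-allocated, and the 16-row capacity and the names `RowCache` and `filter_pass` are hypothetical.

```python
from collections import OrderedDict


class RowCache:
    """LRU cache tracking which (buffer, row) pairs are resident (assumed model)."""

    def __init__(self, capacity_rows):
        self.capacity = capacity_rows
        self.rows = OrderedDict()

    def access(self, key):
        """Touch a row; return True on a hit, False on a miss."""
        hit = key in self.rows
        if hit:
            self.rows.move_to_end(key)
        else:
            self.rows[key] = True
            if len(self.rows) > self.capacity:
                self.rows.popitem(last=False)  # evict the least recently used row
        return hit


def filter_pass(cache, src, dst, height, radius=1):
    """Simulate one neighborhood filter pass; return the number of read misses."""
    misses = 0
    for y in range(height):
        # Read the source rows in the filter window (radius=1 models a 3x3 filter).
        for r in range(max(0, y - radius), min(height, y + radius + 1)):
            if not cache.access((src, r)):
                misses += 1
        cache.access((dst, y))  # write-allocate the output row
    return misses


HEIGHT = 480                          # rows in a VGA image
cache = RowCache(capacity_rows=16)    # hypothetical cache capacity

misses_filter1 = filter_pass(cache, "input", "temp1", HEIGHT)
top_of_temp1_cached = ("temp1", 0) in cache.rows
misses_filter2 = filter_pass(cache, "temp1", "temp2", HEIGHT)
```

In this model the first filter misses once per input row and then reuses each row from the cache. When it finishes, only the bottom rows of `temp1` are resident, so the second filter misses on every row again, as if the cache had been flushed.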
A typical DSP filter implementation has the following flow:
| Input Read | Filter1 | Temp1 Write |
| Temp1 Read | Filter2 | Temp2 Write |
| Temp2 Read | Filter3 | Temp3 Write |
| Etc. |
As filtering primitives are cascaded, the number of full-image memory accesses grows to twice the number of primitives, because each primitive needs one image read and one image write. This has the following side effects for DSP-only systems:
- System clock speed must be increased to keep up
- External memory speed must be increased to keep up
- Power consumption goes up
For FPGA implementations, filter operations are pipelined in the device without the need for intermediate memory storage (except for the required line buffers). As a consequence, there is no need for larger caches or for multiple memory reads and writes of intermediate image results. The Array Processor Unit (APU) in APEX works in the same fashion as an FPGA; however, instead of RTL code for the filter and block memory for the line buffers, the APU uses the CUs (computational units) and local, dedicated Computational Memory (CMEM). The equivalent image filtering flow becomes:
| Input Read | Filter1 | Filter2 | Filter3 | Etc. | Result Write |
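The difference between the two flows can be put in numbers with a short calculation. The following is a minimal sketch: the function names are illustrative, and the 8-bit VGA image and three-filter cascade are assumed example values, not figures from a specific product.

```python
def cascaded_traffic_bytes(width, height, bytes_per_pixel, num_filters):
    """External memory traffic when every filter reads and writes a full image."""
    image_bytes = width * height * bytes_per_pixel
    return 2 * num_filters * image_bytes  # one read + one write per filter


def pipelined_traffic_bytes(width, height, bytes_per_pixel):
    """External memory traffic when filters are chained without temp images."""
    image_bytes = width * height * bytes_per_pixel
    return 2 * image_bytes  # one input read + one result write in total


# Assumed example: 640x480, 8 bits per pixel, three cascaded filters.
cascaded = cascaded_traffic_bytes(640, 480, 1, num_filters=3)   # 1,843,200 bytes
pipelined = pipelined_traffic_bytes(640, 480, 1)                # 614,400 bytes
ratio = cascaded / pipelined                                    # 3.0
```

The ratio equals the number of cascaded filters, so the saving grows with the length of the filter chain. Line-buffer or halo overhead in the pipelined case is ignored here.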
The APEX Core Framework (ACF) software understands the filter dependencies and transfers image data in/out of the CMEM in horizontal slices for processing. This has the following effects:
- System clock speeds do not need to increase, only the APEX processor clock speed
- External memory speeds do not need to increase as only one image read/write for many cascaded operations is required
- Larger caches are not required: memory transfers are pre-programmed, so the processor does not stall, and the CMEM stores the results of the processing
- Transactions to system memory are highly optimized contributing to lower power consumption
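Slice-based processing of a filter cascade can be sketched in software as follows. This is an illustration of the general technique only and makes no claim about how ACF is implemented: the 3x3 box filter, the edge clamping, the halo handling (one extra row per filter above and below each slice) and all function names are assumptions. The dictionary `local` stands in for local memory such as the CMEM.

```python
def box3x3_rows(src, y0, y1, height, width):
    """3x3 box filter (integer mean, edges clamped) for output rows y0..y1-1.

    `src` maps an absolute row index to a list of pixel values.
    """
    out = {}
    for y in range(y0, y1):
        rows = [src[min(max(y + dy, 0), height - 1)] for dy in (-1, 0, 1)]
        out[y] = [
            sum(row[min(max(x + dx, 0), width - 1)]
                for row in rows for dx in (-1, 0, 1)) // 9
            for x in range(width)
        ]
    return out


def cascade_full(image, num_filters):
    """Reference flow: every filter produces a full temporary image."""
    height, width = len(image), len(image[0])
    cur = dict(enumerate(image))
    for _ in range(num_filters):
        cur = box3x3_rows(cur, 0, height, height, width)
    return [cur[y] for y in range(height)]


def cascade_sliced(image, num_filters, slice_rows):
    """Sliced flow: read a slice plus halo once, run all filters locally,
    then write only the final rows of the slice."""
    height, width = len(image), len(image[0])
    result = [None] * height
    for s0 in range(0, height, slice_rows):
        s1 = min(s0 + slice_rows, height)
        lo, hi = max(0, s0 - num_filters), min(height, s1 + num_filters)
        local = {y: image[y] for y in range(lo, hi)}  # single input read
        for k in range(1, num_filters + 1):
            margin = num_filters - k  # the halo shrinks by one row per filter
            local = box3x3_rows(local, max(0, s0 - margin),
                                min(height, s1 + margin), height, width)
        for y in range(s0, s1):
            result[y] = local[y]  # single result write
    return result


# Small deterministic test image (20 rows x 12 columns).
image = [[(7 * x + 13 * y) % 256 for x in range(12)] for y in range(20)]
same = cascade_sliced(image, num_filters=3, slice_rows=4) == cascade_full(image, 3)
```

Both flows produce identical output, but the sliced version never materializes the intermediate images. In this simple version the halo rows are read again for neighboring slices; keeping line buffers between slices would avoid that.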
The memory bandwidth between the 96 CUs and the CMEM is 17.2 Gbytes/sec, yet this typically consumes under 250 mW because the clock does not have to run fast and the memory is co-located with the processing elements on chip.
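As a rough back-of-the-envelope check, the aggregate figure can be broken down per CU. This assumes that 1 Gbyte means 10^9 bytes and that the bandwidth is divided evenly among the CUs; neither is stated explicitly above.

```python
TOTAL_BANDWIDTH_BYTES_PER_SEC = 17.2e9   # aggregate CU-to-CMEM bandwidth
NUM_CUS = 96                             # computational units in the APEX core

# Share of the aggregate bandwidth available to each CU (assumed even split).
per_cu_bytes_per_sec = TOTAL_BANDWIDTH_BYTES_PER_SEC / NUM_CUS
per_cu_mbytes_per_sec = per_cu_bytes_per_sec / 1e6   # roughly 179 Mbytes/sec
```

Each CU therefore needs only a modest local transfer rate, which is consistent with the point that the clock does not have to run fast.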
ICPs and APEX provide customers with the following competitive advantages:
- Comprehensive image primitive library with equivalent C functions
- Custom development on APEX of new primitives and algorithms
- Support of different hardware platforms through a common ACF API which enables reuse of software across platforms
- Scalability: by increasing or decreasing the number of CUs (each with dedicated CMEM), processing capability is scaled without any increased burden on system memory bandwidth
- Instruction set simulator and profiler allow developers to measure the performance of their primitives and algorithms

