Technology



Computer vision typically involves two steps: processing the image to extract the features the application needs, and analyzing those features to produce the required results. Parallel processing technology is extremely well suited to the first step, as is evident from the number of current solutions that pair a DSP with an FPGA. The FPGA, the ultimate parallel processing engine, is used where a DSP does not have enough “juice” for the pixel processing needed to extract the features; the DSP is then used as a general-purpose processor to analyze the extracted data.

The CV220X provides the ideal processing engines and architecture for image processing in a single package. The APEX core has 96 parallel Computational Units (CUs) running alongside an ARM processor. APEX performs all the “heavy” parallel processing that would typically require an FPGA, while the ARM processor analyzes the extracted features in parallel. Just as importantly, this parallel operation is non-blocking (unlike a traditional multi-core approach) because the CUs work out of their own local memory, leaving the main external SDRAM free for the ARM's use.

APEX Core

The APEX Advantage

In pure DSP solutions, filters extract features using the DSP’s parallel MAC engine (4-8 MACs running in parallel). Nearly all DSPs today support caching, but the cache is not well optimized for image processing as a whole, only for the neighborhood processing that the filters require. These filters typically use 3x3 to 9x9 regions around the output pixel of interest. Caching works well for this because adjacent pixels are reused and come from the cache. The difficulty arises when processing a VGA or larger image: because the cache “follows the output data”, by the time processing reaches the lower right of the image, the cache no longer holds any valid data for the next filter required (which starts again from the top of the image). In effect, the cache is flushed for every image processed, hence the demand for larger caches in DSP solutions. With APEX, data dependencies are resolved before the kernels are executed, similar to a cache pre-fetch, which means the ICP will never stall due to data unavailability.
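
To put rough numbers on the neighborhood reuse and on why the cache ends up flushed, the sketch below compares the rows a k x k filter needs at one time with the size of a full frame. It assumes a 640x480 VGA frame and counts pixels rather than bytes; the function names are illustrative and not part of any APEX software.

```python
from fractions import Fraction

VGA_WIDTH, VGA_HEIGHT = 640, 480  # assumed frame size for illustration


def window_reuse(k):
    """Fraction of a k x k window shared with the window one pixel to the right."""
    # Sliding right by one pixel drops one column and adds one, so k - 1 columns are reused.
    return Fraction(k * (k - 1), k * k)


def live_rows_pixels(width, k):
    """Pixels in the k image rows a k x k filter needs to produce one output row."""
    return width * k


def frame_pixels(width, height):
    """Pixels in one full frame."""
    return width * height


for k in (3, 9):
    print(k, window_reuse(k), live_rows_pixels(VGA_WIDTH, k))
print(frame_pixels(VGA_WIDTH, VGA_HEIGHT))
```

For a 9x9 filter, 8/9 of each window is shared with its neighbor and only 5,760 pixels are live at a time, against 307,200 pixels in the frame. A cache smaller than a frame therefore holds the bottom rows just when the next filter starts asking for the top ones.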

A typical DSP filter implementation has the following flow:

Input Read -> Filter1 -> Temp1 Write
Temp1 Read -> Filter2 -> Temp2 Write
Temp2 Read -> Filter3 -> Temp3 Write
Etc.

As filtering primitives are cascaded, the number of full-image memory accesses grows to twice the number of primitives, since each primitive performs one read and one write. This has the following side effect for DSP-only systems:

For FPGA implementations, filter operations are pipelined in the device without the need for intermediate memory storage (except for the required line buffers). As a consequence, there is no need for larger caches or for multiple memory reads and writes of intermediate image results. The Array Processor Unit (APU) in APEX works in the same fashion as an FPGA; however, instead of RTL code for the filter and block memory for the line buffers, the APU uses the Computational Units (CUs) and local dedicated Computational Memory (CMEM). The equivalent image filtering flow becomes:

Input Read -> Filter1 -> Filter2 -> Filter3 -> Etc. -> Result Write
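
The difference between the two flows can be made concrete with a toy model that counts accesses to external memory. This is a minimal sketch, not the ACF implementation: `ExternalMemory`, `box3`, `run_cascaded`, and `run_pipelined` are hypothetical names, and a single 16-pixel row with a 3-tap filter stands in for a full image and a 2D primitive.

```python
class ExternalMemory:
    """Toy external memory that counts pixel reads and writes (hypothetical model)."""

    def __init__(self, input_pixels):
        self.buffers = {"input": list(input_pixels)}
        self.reads = 0
        self.writes = 0

    def load(self, name):
        data = self.buffers[name]
        self.reads += len(data)
        return list(data)

    def store(self, name, data):
        self.buffers[name] = list(data)
        self.writes += len(data)


def box3(row):
    """3-tap box filter with edge replication, standing in for any primitive."""
    last = len(row) - 1
    return [(row[max(i - 1, 0)] + row[i] + row[min(i + 1, last)]) // 3
            for i in range(len(row))]


def run_cascaded(mem, filters):
    """Cascaded flow: every filter reads its input from memory and writes its output back."""
    src = "input"
    for k, f in enumerate(filters, start=1):
        dst = f"temp{k}"
        mem.store(dst, f(mem.load(src)))
        src = dst
    return mem.buffers[src]


def run_pipelined(mem, filters):
    """Pipelined flow: one input read, intermediates stay local, one result write."""
    data = mem.load("input")
    for f in filters:
        data = f(data)
    mem.store("result", data)
    return mem.buffers["result"]


pixels = list(range(0, 160, 10))  # 16 sample pixels
filters = [box3, box3, box3]

cascaded = ExternalMemory(pixels)
pipelined = ExternalMemory(pixels)
same = run_cascaded(cascaded, filters) == run_pipelined(pipelined, filters)

print(cascaded.reads + cascaded.writes)    # 2 * 3 * 16 accesses
print(pipelined.reads + pipelined.writes)  # 2 * 16 accesses
print(same)
```

Both flows produce the same result, but the cascaded version touches external memory twice per pixel for every filter in the chain, while the pipelined version does so twice per pixel in total, however many filters are chained.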

The APEX Core Framework (ACF) software understands the filter dependencies and transfers image data in/out of the CMEM in horizontal slices for processing. This has the following effects:

The memory bandwidth between the 96 CUs and the CMEM is 17.2 Gbytes/sec, yet it typically consumes under 250 mW, because the clock does not have to run fast and the memory is co-located with the processing elements on chip.
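
As a back-of-the-envelope check on the figure above, the aggregate bandwidth can be divided across the CUs. The sketch assumes the bandwidth is shared evenly among all 96 CUs and that "Gbytes" means decimal gigabytes; neither assumption is stated in the text.

```python
TOTAL_BANDWIDTH_BYTES_PER_S = 17.2e9  # 17.2 Gbytes/sec, read as decimal gigabytes (assumption)
NUM_CUS = 96                          # CUs in the APEX core

# Even split across CUs is an assumption made for illustration only.
per_cu_mbytes_per_s = TOTAL_BANDWIDTH_BYTES_PER_S / NUM_CUS / 1e6
print(round(per_cu_mbytes_per_s, 1))
```

Under these assumptions each CU sees roughly 179 Mbytes/sec, a modest per-unit rate, in keeping with the point that the aggregate bandwidth comes from many units working side by side rather than from a fast clock.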

ICPs and APEX provide customers with the following competitive advantages: