Computer Vision at the Edge Can Enable AI Apps

By Reese Grimsley

Systems Applications Engineer

Texas Instruments

October 11, 2023


Example of a barcode scanner using edge AI processing to decode barcodes.

Computer vision refers to the goal of bringing human vision – an information-rich and intuitive sensor – to computers, enabling applications such as assembly line inspection, security systems, driver assistance and robotics.

Unfortunately, computers cannot interpret images and scenes as intuitively as humans do. Instead, we must give them algorithms that solve domain-specific tasks.

We often take for granted our vision and how that biological ability interprets our surroundings, from looking in the refrigerator to check food expiration dates to watching intently for a traffic light to turn green.

Computer vision dates to the 1960s and was initially used for tasks like reading text from a page (optical character recognition) and recognizing simple shapes such as circles or rectangles. Computer vision has since become one of the core domains of artificial intelligence (AI), which encompasses any computer system attempting to perceive, synthesize or infer some deeper meaning from data. There are three types of computer vision: conventional or “rules-based”, classical machine learning, and deep learning.

In this article, I’ll consider AI from the perspective of making computers use vision to perceive the world more like humans. I’ll also describe the trade-offs of each type of computer vision, especially in embedded systems that collect, process and act upon data locally, rather than relying on cloud-based resources.

Conventional Computer Vision

Conventional computer vision refers to programmed algorithms that solve tasks such as motion estimation, panoramic image stitching or line detection.

Conventional computer vision uses standard signal processing and logic to solve tasks. Algorithms such as Canny edge detection or optical flow can find contours or motion vectors, respectively, which is useful for isolating objects in an image or tracking motion between subsequent frames. These types of algorithms rely on filters, transforms, heuristics and thresholds to extract meaningful information from an image or video. They are often a precursor to an application-specific algorithm such as decoding a 1-D barcode, where a series of rules decodes the barcode once the individual bars have been detected.
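
As a concrete illustration, here is a minimal sketch of such a pipeline using OpenCV; the file name and Canny thresholds are placeholder assumptions that a real application would tune for its camera and lighting.

```python
# A minimal sketch of a conventional computer vision pipeline with OpenCV.
# "parts.png" and the Canny thresholds are illustrative assumptions.
import cv2

image = cv2.imread("parts.png")                  # load a test image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # edge detection works on grayscale
blurred = cv2.GaussianBlur(gray, (5, 5), 0)      # suppress noise before edge detection

# Canny edge detection: the two thresholds decide which gradients count as edges.
edges = cv2.Canny(blurred, 50, 150)

# Contours extracted from the edge map can isolate candidate objects.
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
print(f"Found {len(contours)} candidate contours")
```

Every step is an explicit rule, which is what makes this style of algorithm easy to audit but sensitive to parameter choices.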

Conventional computer vision is beneficial for its straightforwardness and explainability: developers can analyze the algorithm at each step and explain why it behaved as it did. This can be useful in software auditing or safety-critical applications. However, conventional computer vision often requires more expertise to implement properly.

The algorithms often have a small set of parameters that require tuning to achieve optimal performance in different environments. Implementation can be difficult, especially for optimized, high-throughput applications. Some rules, algorithmic decisions or parameter values may behave unexpectedly on images that do not fit the original assumptions, making it possible to trick the algorithm. Such vulnerabilities and edge cases can be difficult to fix without exposing new edge cases or increasing the algorithm’s complexity.

Classical Machine Learning for Computer Vision

Machine learning emerged as a class of algorithms that use data to set parameters within an algorithm, rather than direct programming or calibration. These algorithms, such as support vector machines, multilayer perceptrons (a precursor to deep neural networks) and k-nearest neighbors, saw use in applications that were too challenging to solve with conventional computer vision. For example, “recognizing a dog” is a difficult task to program as a conventional computer vision algorithm, especially when complex scenery and other objects are also present. Training a machine learning algorithm to learn parameters from hundreds or thousands of sample images is more tractable. Edge cases are handled by using a dataset that contains examples of those edge cases.

Training is computationally intensive, but running the trained algorithm on new data requires far fewer computing resources, making real-time operation possible. Trained models generally have less explainability but are more resilient to small, unplanned variations in data, such as the orientation of an object or background noise. Variations that are not handled well can be fixed by retraining with more data. Larger models with more parameters often boast higher accuracy, but they take longer to train and require more computation at run time, which has historically kept very large models out of real-time applications on embedded processors.

Classical machine learning-based approaches to computer vision still require an expert to “craft” the feature set on which the machine learning model is trained. Many of these features are common to conventional computer vision applications. Not all features are useful, thus requiring analysis to prune uninformative features. Implementing these algorithms effectively requires expertise in image processing as well as machine learning.
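
To make the feature-crafting step concrete, the following minimal sketch extracts histogram-of-oriented-gradients (HOG) features and trains a support vector machine on them; the scikit-learn digits set stands in for a real image dataset purely to keep the example self-contained, and the HOG parameters are illustrative choices.

```python
# A minimal sketch of classical machine learning for vision: hand-crafted HOG
# features feeding a support vector machine.
import numpy as np
from skimage.feature import hog
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()  # small 8x8 grayscale images, used as a stand-in dataset

def extract_features(images):
    # Histogram-of-oriented-gradients: the hand-"crafted" feature set.
    return np.array([hog(img, orientations=9, pixels_per_cell=(4, 4),
                         cells_per_block=(1, 1)) for img in images])

features = extract_features(digits.images)
X_train, X_test, y_train, y_test = train_test_split(
    features, digits.target, test_size=0.2, random_state=0)

# The SVM learns its parameters from the feature vectors rather than from explicit rules.
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))
```

The choice of features, and which of them to keep, is where the image-processing expertise comes in; the learning step itself is comparatively routine.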

Deep Learning

Deep learning refers to very large neural network models operating on largely unprocessed or “raw” data. Deep learning has made a large impact on computer vision by pulling feature extraction operations into the model itself, such that the algorithm learns the most informative features as needed. The following figure shows the data flow in each computer vision approach.

Data flow for each computer vision approach. Feature extraction in deep learning happens automatically within the algorithm itself. The input is identical but the output may differ depending on the effectiveness of the selected approach for the application.

Deep learning has the most generality among the types of computer vision; neural networks are universal function approximators, meaning they have the capability of learning any relation between input and output (to the extent that the relation exists). Deep learning excels at finding both subtle and obvious patterns in data, and is the most tolerant to input variations. Applications such as object recognition, human pose estimation and pixel-level scene segmentation are common use cases.

Deep learning requires the least hand-tuning and image-processing expertise. The algorithms rely on large, high-quality data sets to help the general-purpose model learn patterns by gradually finding parameters that optimize a loss or error metric during training. Novice developers can make effective use of deep learning because the focus shifts from the algorithm’s implementation toward data-set curation. Furthermore, many deep learning models are publicly available and can be retrained for specific use cases. Using these publicly available models is straightforward; developing fully custom architectures does, however, require more expertise.
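
As an illustration of reusing a publicly available model, the sketch below retrains only a small classification head on top of a frozen, pretrained backbone; the choice of MobileNetV2, the three-class head and the "my_dataset/" folder are all illustrative assumptions.

```python
# A minimal sketch of retraining (fine-tuning) a publicly available model.
import tensorflow as tf

# Pretrained ImageNet backbone; the architecture choice is purely illustrative.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # keep the learned features frozen; only the head is trained

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1, input_shape=(224, 224, 3)),
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),  # three hypothetical classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# "my_dataset/" is a placeholder folder with one subdirectory per class.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "my_dataset/", image_size=(224, 224), batch_size=32)
model.fit(train_ds, epochs=5)
```

Note that most of the developer effort here goes into assembling and labeling the dataset folder, not into the algorithm itself.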

Compared to conventional computer vision and classical machine learning, deep learning has consistently higher accuracy and is rapidly improving due to immense popularity in the research (and increasingly, commercial) community. However, deep learning typically has poor explainability since the algorithms are very large and complex; images that are completely unlike the training data set can cause unexpected, unpredictable behavior. Because of their size, deep learning models are so computationally intensive that special hardware is necessary to accelerate them for real-time operation. Training large models on large data sets can be costly, and curating a large data set is often time-consuming and tedious.

However, improvements in processing power and speed, accelerators such as neural processing units and graphics processing units, and better software support for matrix and vector operations have made the increase in computation requirements less consequential, even on embedded systems. Embedded microprocessors like those in the AM6xA portfolio leverage hardware accelerators to run deep learning algorithms at high frame rates.

Comparing the Different Types of Computer Vision

So which type of computer vision is best?

That ultimately depends on its application, as shown in Figure 2.

Comparison across computer vision technologies

In short, computer vision with classical machine learning sits between the other two methods on most attributes; the set of applications where it outperforms both alternatives is small. Conventional computer vision can be sufficiently accurate and highly efficient in straightforward, high-throughput or safety-critical applications. Deep learning is the most general, the easiest to develop for, and the most accurate in complex applications and environments, such as identifying a tiny missing component during PCB assembly verification for high-density designs.

Some applications benefit from using multiple types of computer vision algorithms in tandem so that they cover each other’s weak points. This approach is common in safety-critical applications dealing with highly variable environments, such as driver assistance systems. For example, you could run optical flow using conventional computer vision methods alongside a deep learning model for tracking nearby vehicles, then fuse the results with an algorithm that checks whether the two approaches agree. If they do not, the system could warn the driver or start a graceful safety maneuver. Alternatively, it is possible to use multiple types of computer vision sequentially. A barcode reader can use deep learning to locate regions of interest, crop those regions, and then use a conventional computer vision algorithm to decode them.
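
A rough sketch of that sequential combination might look like the following; the detector and decoder are deliberately left as placeholder callables, since the deep learning model and the rules-based decoding routine are application-specific rather than real library functions.

```python
# A minimal sketch of combining deep learning and conventional computer vision
# sequentially. detect_regions (a learned detector returning bounding boxes) and
# decode (a rules-based decoder) are hypothetical placeholders supplied by the caller.
import cv2

def read_barcodes(frame, detect_regions, decode):
    """Locate barcodes with a learned detector, then decode each crop with rules."""
    results = []
    for (x, y, w, h) in detect_regions(frame):    # deep learning stage: find regions of interest
        roi = frame[y:y + h, x:x + w]             # crop the candidate region
        gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
        value = decode(gray)                      # conventional, rules-based decoding stage
        if value is not None:
            results.append(value)
    return results
```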

Computer Vision in Practice

The barrier to entry for computer vision is progressively lowering. Open-source libraries like OpenCV provide efficient implementations of common functions like edge detection and color conversion. Deep learning runtimes like TensorFlow Lite and ONNX Runtime enable deep learning models to run efficiently on embedded processors. These runtimes also provide interfaces that custom hardware accelerators can implement, simplifying the developer’s experience when moving an algorithm from a training environment on a PC or in the cloud to inference on an embedded processor. Many deep learning architectures are also openly published such that they can be reused for a variety of tasks.
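
For example, running a trained model with ONNX Runtime can be as simple as the sketch below; the model file name, input shape and dummy input are placeholders, and only the default CPU execution provider is shown (hardware vendors typically supply execution providers that offload the same calls to their accelerators).

```python
# A minimal sketch of inference with ONNX Runtime. "model.onnx" is a placeholder;
# real preprocessing depends on how the model was trained.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

# Dummy input standing in for a preprocessed camera frame (NCHW, float32).
frame = np.random.rand(1, 3, 224, 224).astype(np.float32)

outputs = session.run(None, {input_name: frame})
print("Top class index:", int(np.argmax(outputs[0])))
```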

Processors in the Texas Instruments (TI) AM6xA portfolio, such as the AM62A7, contain deep learning acceleration hardware as well as software support for a variety of conventional and deep learning computer vision tasks. Digital signal processor cores like the C66x and hardware accelerators for optical flow and stereo depth estimation also enable high-performance conventional computer vision.

With processors capable of both conventional and deep learning computer vision, it becomes possible to build tools that rival sci-fi dreams. Automated shopping carts will streamline shopping; surgical and medical robots will guide doctors to early signs of disease; mobile robots will mow the lawn and deliver packages. If you can envision it, so can the application you’ll build. See TI’s edge AI vision page to explore how embedded computer vision is changing the world.

Reese Grimsley is a Systems Applications Engineer with the Sitara MPU product line within TI’s Processors organization. At TI, Reese works on image processing, machine learning, and analytics for a variety of camera-based end-equipment in industrial markets. One of his focal areas is demystifying Edge AI to help both new and experienced customers understand how they can quickly and easily bring complex deep learning algorithms to their products and improve accuracy, performance, and robustness.
