This post examines the difference between deep learning training and deep learning inference. Both training and inference are essential steps in developing and deploying artificial intelligence (AI).
What is the Difference Between Deep Learning Training and Inference?
Deep learning training refers to the process of teaching (training) a deep neural network (DNN) using data sets to accomplish an AI task, such as image recognition or voice recognition. Deep learning inference, on the other hand, refers to the process of feeding a trained DNN novel (new) data, such as images the DNN has never seen before, so it can make a prediction as to what the data represents.
What is Deep Learning Training?
Deep learning training is the most challenging and time-consuming part of creating artificial intelligence (AI), but it is necessary for a deep neural network to accomplish a given task. Deep neural networks (DNNs) are composed of many layers of interconnected artificial neurons, and those neurons must be taught how to perform a specific AI task, such as image classification, video classification, speech-to-text, or recommendation. Teaching, or training, a deep neural network is accomplished by feeding it data and allowing it to make a prediction as to what the data represents.
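To make this concrete, here is a minimal sketch of such a network in PyTorch (one of several deep learning frameworks). The layer sizes, the 64x64 input resolution, and the three-class output are hypothetical choices for illustration, not a prescribed architecture.

```python
import torch.nn as nn

# A minimal, hypothetical image classifier: layers of artificial "neurons"
# (linear units) connected by weights, with non-linear activations between them.
class SmallClassifier(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),                  # flatten a 3x64x64 image into a vector
            nn.Linear(3 * 64 * 64, 256),   # first layer of artificial neurons
            nn.ReLU(),
            nn.Linear(256, 64),            # hidden layer
            nn.ReLU(),
            nn.Linear(64, num_classes),    # one output per class (e.g., dog, car, bicycle)
        )

    def forward(self, x):
        return self.layers(x)              # raw class scores (logits)
```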
For example, suppose a DNN is being taught to differentiate between three different objects: a dog, a car, and a bicycle. The first step is to put together a data set consisting of thousands of labeled images of dogs, cars, and bicycles.
The second step is to feed the images to the deep neural network and allow it to make a prediction (inference) about what each image represents. If the DNN makes an inaccurate prediction, the weights connecting the artificial neurons are adjusted to correct for the error, making it more likely that the DNN will predict accurately the next time the same image is presented to it.
The training process repeats until the DNN makes predictions that achieve the accuracy desired by the data scientist or organization training it. Once the desired accuracy is achieved, training is complete, and the trained model is ready to make predictions on novel (new) images that the DNN has never seen before.
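The loop below sketches this train-predict-correct cycle in PyTorch, reusing the hypothetical SmallClassifier from above. The optimizer, learning rate, accuracy target, and train_loader (a DataLoader assumed to yield labeled images) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

model = SmallClassifier(num_classes=3)          # the hypothetical model sketched above
criterion = nn.CrossEntropyLoss()               # measures how wrong each prediction is
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

target_accuracy = 0.95                          # accuracy desired by the data scientist

for epoch in range(100):                        # repeated passes over the data set
    correct, total = 0, 0
    for images, labels in train_loader:         # train_loader is an assumed DataLoader
        outputs = model(images)                 # step 1: the DNN makes a prediction
        loss = criterion(outputs, labels)       # step 2: measure the prediction error

        optimizer.zero_grad()
        loss.backward()                         # step 3: backpropagate the error
        optimizer.step()                        # step 4: adjust the weights to reduce it

        correct += (outputs.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)

    if correct / total >= target_accuracy:      # stop once the desired accuracy is reached
        break
```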
Is Deep Learning Training Compute Intensive?
Yes, deep learning training can be very compute-intensive. Billions, or even trillions, of calculations must be performed to train a DNN, so training requires significant computing power to finish in a reasonable amount of time. Training a model on a home desktop computer may be possible, but it can take hours or days to complete. As such, deep neural network training is typically performed in data centers with massive compute power, using multi-core processors, GPUs, VPUs, and other performance accelerators to speed up AI workloads.
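As a rough, purely illustrative back-of-the-envelope estimate (the model size, data set size, epoch count, and the roughly 6-FLOPs-per-parameter-per-example rule of thumb are all assumptions):

```python
# Rough, illustrative estimate of training compute (all numbers hypothetical).
parameters = 25_000_000              # a mid-sized image model
flops_per_image = 6 * parameters     # ~2 FLOPs/param forward + ~4 FLOPs/param backward (rule of thumb)
images = 1_000_000                   # size of the training data set
epochs = 50                          # passes over the data set

total_flops = flops_per_image * images * epochs
print(f"{total_flops:.2e} FLOPs")    # ~7.5e+15 FLOPs for this toy estimate
```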
What is Deep Learning Inference?
Deep learning inference refers to the use of a fully trained deep neural network (DNN) to make inferences (predictions) on novel (new) data that the model has never seen before. Inference is performed by feeding new data, such as new images, to the network and letting the DNN classify them. Returning to our previous example, the DNN can be fed new images of bikes, dogs, cars, and other objects, and a fully trained DNN should make accurate predictions as to what each image represents.
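A minimal inference sketch, again assuming the hypothetical SmallClassifier from above; the saved weights file (trained_model.pt) and the input image (new_photo.jpg) are placeholders used only for illustration.

```python
import torch
from PIL import Image
from torchvision import transforms

class_names = ["dog", "car", "bicycle"]           # hypothetical classes from the example

preprocess = transforms.Compose([
    transforms.Resize((64, 64)),                  # match the input size the model was trained on
    transforms.ToTensor(),
])

model = SmallClassifier(num_classes=3)
model.load_state_dict(torch.load("trained_model.pt"))  # hypothetical trained weights
model.eval()                                      # switch from training to inference mode

image = preprocess(Image.open("new_photo.jpg")).unsqueeze(0)  # an image the DNN has never seen

with torch.no_grad():                             # no weight updates during inference
    scores = model(image)
    prediction = class_names[scores.argmax(dim=1).item()]
print(prediction)
```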
After a DNN is fully trained, it can be copied to other devices. Usually, however, before a DNN is deployed, it is simplified and modified to require less computing power, energy, and memory. This is because fully trained DNNs can be very large, with hundreds of layers of artificial neurons and billions of weights connecting them. The larger the DNN, the more compute power it needs, the more storage space it requires, the more energy it consumes, and the higher the latency of its responses.
As such, fully trained DNNs are modified and simplified so that they can run on simpler hardware, use less power, and respond with as little latency as possible. Of course, simplifying a DNN results in a slightly less accurate model, but that small reduction in accuracy is usually outweighed by the benefits of simplification.
Generally, DNNs are simplified and modified using two methods: pruning and quantization.
The pruning method involves a data scientist feeding the DNN data and observing it, looking for neurons, or groups of neurons, that never fire or rarely fire. Once identified, those neurons are removed, reducing the size of the deep neural network and improving its latency without significantly decreasing its prediction accuracy.
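The sketch below illustrates this activity-based pruning idea on the hypothetical SmallClassifier: it records how strongly each neuron in one hidden layer fires over some sample data (observation_loader is an assumed DataLoader), then zeroes out the neurons that rarely fire. The firing threshold is arbitrary, and a real workflow would physically remove the pruned rows or use a framework's pruning utilities to shrink the model.

```python
import torch

model = SmallClassifier(num_classes=3)
model.eval()

hidden = model.layers[1]                       # the 256-neuron hidden layer
relu = model.layers[2]                         # its activation function
activations = []

def record(module, inputs, output):
    activations.append(output.mean(dim=0))     # average firing strength per neuron

hook = relu.register_forward_hook(record)
with torch.no_grad():
    for images, _ in observation_loader:       # hypothetical DataLoader of sample images
        model(images)
hook.remove()

mean_activity = torch.stack(activations).mean(dim=0)
rarely_firing = mean_activity < 0.01           # hypothetical "rarely fires" threshold

# Zero the weights and biases of rarely-firing neurons; in practice those rows are
# physically removed (or a framework pruning utility is used) to shrink the model.
with torch.no_grad():
    hidden.weight[rarely_firing, :] = 0.0
    hidden.bias[rarely_firing] = 0.0
```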
The second method that can be used to simplify deep neural networks is quantization, which involves reducing the precision of the weights from, for example, 32-bit floating point down to 8-bit integers. This produces a smaller model that consumes fewer compute resources. The impact on accuracy is usually small, often negligible, while the model becomes much faster and smaller, using less energy and fewer compute resources.
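The following sketch shows the basic idea on a single weight tensor of the hypothetical SmallClassifier; it is an illustration of the fp32-to-int8 mapping, not a production quantization flow.

```python
import torch

# Illustrative 8-bit quantization of one weight tensor (not a full deployment flow).
# Each fp32 weight w is mapped to an 8-bit integer q = round(w / scale), where the
# scale factor is chosen so the largest weight still fits in the int8 range [-127, 127].
layer = SmallClassifier(num_classes=3).layers[1]
w = layer.weight.detach()

scale = w.abs().max() / 127.0
q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)   # stored as 1 byte each
w_restored = q.float() * scale                                      # dequantized for comparison

print("max quantization error:", (w - w_restored).abs().max().item())
# Frameworks automate this; for example, PyTorch offers torch.quantization.quantize_dynamic.
```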
Deep Learning Inference at the Edge
Reducing the size and compute requirements of a DNN is extremely important as the DNN is moved from the data center to the edge, because edge computers have significantly less compute power than data centers and the cloud.
Furthermore, many edge computing devices are subject to energy constraints, so DNNs must be simplified and modified to use less power. Deep learning's high accuracy normally demands a lot of compute power and memory; simplifying the DNN allows it to be deployed on edge devices to perform deep learning inference at the edge.
For deep learning inference at the edge, a common approach is a hybrid model in which an edge computer gathers information from a camera or sensor and sends that information to the cloud, where the inference analysis is performed on the data. However, moving the data to the cloud for inference analysis presents a couple of challenges.
First, sending data to the cloud introduces latency. Data often takes a few seconds to be sent to the cloud, analyzed, and returned to the device of origin. Applications that require real-time inference analysis cannot be built on such a model.
Take, for example, an autonomous vehicle. A vehicle moving down the road at 60 MPH covers roughly 88 feet per second. Imagine having to send images and video to the cloud for inference analysis and then wait for the results to be sent back to the vehicle. This could take a few seconds, during which the vehicle may have traveled more than 100 feet without guidance.
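The arithmetic behind that example, with illustrative round-trip and on-device timings:

```python
# Back-of-the-envelope arithmetic for the example above (timings are illustrative).
speed_mph = 60
speed_fps = speed_mph * 5280 / 3600      # 60 MPH = 88 feet per second

cloud_round_trip_s = 2.0                 # hypothetical send-analyze-return time
edge_inference_s = 0.010                 # hypothetical on-device inference (~10 ms)

print(speed_fps * cloud_round_trip_s)    # ~176 ft traveled while waiting on the cloud
print(speed_fps * edge_inference_s)      # ~0.88 ft traveled while waiting on the edge device
```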
In order for an autonomous vehicle to avoid colliding with other vehicles, pedestrians, or objects, inference analysis must be performed in real time on an edge computing device. Edge computers armed with GPUs and other accelerators can process data in as little as a few milliseconds, which is essential for applications, such as autonomous vehicles, that depend on real-time analysis and decision making.
Furthermore, sending raw data, especially raw video feeds, to the cloud requires huge amounts of internet bandwidth. For organizations on metered internet plans, this can be enormously expensive. Internet bandwidth can also bottleneck the entire system, especially if the edge computer is deployed in a remote environment where reliable internet connectivity is not always available. Additionally, uploading all of the data gathered by an edge device is an inefficient use of resources, especially when that data is video, because raw video is very large.
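A rough sense of the scale, assuming a single uncompressed 1080p camera feed (the resolution and frame rate are illustrative):

```python
# Rough estimate of why raw video overwhelms upload bandwidth (numbers illustrative).
width, height = 1920, 1080       # a single 1080p camera
bytes_per_pixel = 3              # 24-bit color, uncompressed
fps = 30

bytes_per_second = width * height * bytes_per_pixel * fps
print(bytes_per_second / 1e6, "MB/s")                   # ~187 MB/s, ~1.5 Gbps of raw video
print(bytes_per_second * 3600 * 24 / 1e12, "TB/day")    # ~16 TB per camera per day
```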
So, what is the solution?
The solution is to invest in a powerful AI inference computer that is capable of running deep learning algorithms locally on an edge computing device. Running the algorithms locally eliminates many of the challenges associated with running deep learning inference in the cloud.
Performing inference analysis locally on an edge device, close to the source of data generation, eliminates most latency issues because the data is processed on site instead of traveling thousands of miles to a data center for analysis. This allows real-time analysis and decision making in a few milliseconds versus the few seconds it takes to send data to the cloud and back. Applications that require real-time analysis and decision making, such as autonomous vehicles, benefit greatly from the reduced latency of performing deep learning inference locally on powerful AI inference PCs.
Performing inference analysis locally also solves the internet bandwidth issue. Because the data is stored and analyzed locally, there is no need to transmit enormous amounts of raw data over the internet to a data center or the cloud. This is especially valuable for organizations on metered data plans, which can save a great deal of money on bandwidth costs since far less data has to be sent to the cloud for analysis.
Furthermore, AI edge inference computers can be configured with performance accelerators such as multi-core processors, GPUs, VPUs, FPGAs, and NVMe computational storage devices. The two most popular options for deep learning inference analysis are GPUs (graphics processing units) and VPUs (vision processing units).
GPUs and VPUs are often used to accelerate deep learning inference because they excel at the large volumes of linear algebra computations involved, operations that can easily be parallelized. So, instead of having the CPU perform the inference computations, the workload is offloaded to the GPU or VPU. Both are far better suited to this kind of math and will significantly speed up inference analysis, freeing the CPU to run the rest of the application and the operating system (OS).
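In practice, offloading is often a one-line change in the application code. The sketch below, again using the hypothetical SmallClassifier, moves the model and a batch of data to a GPU when one is available and falls back to the CPU otherwise.

```python
import torch

# Offloading inference from the CPU to a GPU (falls back to CPU if no GPU is present).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = SmallClassifier(num_classes=3).to(device)   # copy the model's weights to the GPU
model.eval()

batch = torch.rand(8, 3, 64, 64).to(device)         # a hypothetical batch of camera frames
with torch.no_grad():
    scores = model(batch)                            # the parallel math runs on the GPU
print(scores.argmax(dim=1).cpu())                    # bring the predictions back to the CPU
```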
Premio AI Edge Inference Computing Solutions
Premio offers a variety of AI edge inference computers that are capable of running machine learning and deep learning inference analysis at the edge, thanks to the powerful processors and accelerators these systems can be configured with, including GPUs to accelerate AI workloads. If you need assistance choosing an AI inference computing solution, please contact one of our AI computing professionals; they will help you choose a solution that meets your specific requirements. Premio is a trusted source of AI computers and has been designing and building embedded systems in the United States for over 30 years. So, if you have any questions or comments, please get in touch with us.