As the number of IoT devices coming online continues to increase, compute power is shifting to the edge to relieve the strain these devices place on the cloud and data centers. Edge computing has advanced rapidly thanks to improvements in compute power, storage capacity, and storage speed. By processing and analyzing information locally, edge computers eliminate the need to send data to the cloud or a data center for processing and analysis.
Before long, edge computing will be all around us, processing data from a multitude of devices found in homes, public infrastructure, factories, and enterprises. Because of this massive growth, many companies are looking for the best way to improve their existing IT infrastructure to capitalize on the trend, developing AI solutions, especially for prediction and inference analysis, that can run on edge computers.
However, machine learning models are not only becoming more advanced; they are also taking up more space. Compared with models from just a few years ago, today's image recognition and speech recognition models have grown significantly in size. This creates speed challenges and bottlenecks when processing huge amounts of data at the edge. To achieve powerful computing, real-time decision making, and high-speed processing, companies are configuring their systems with performance accelerators for the best performance at the edge.
What are Performance Accelerators?
Performance accelerators, also known as hardware accelerators, are microprocessors capable of accelerating certain workloads. Workloads that can be accelerated, such as AI, machine vision, and deep learning, are offloaded to the performance accelerators, which perform them far more efficiently. Performance acceleration pairs general-purpose processors with special-purpose processors that work together on a task. This is feasible because performance accelerators can perform computations in parallel rather than serially.
Parallel computing is a model in which multiple tasks are processed simultaneously on different processors. It delivers higher performance, a large increase in processing speed, and a lower workload per processor, making it an optimal method for machine learning inference, which is common in AI embedded system applications such as autonomous vehicles, e-gates, traffic management, and many other computer vision applications.
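For illustration, here is a minimal Python sketch of this serial-versus-parallel split; score_frame is a hypothetical stand-in for real per-frame inference work:

```python
# Minimal sketch: the same workload run serially, then split across
# four worker processes the way parallel accelerators split work.
from multiprocessing import Pool

def score_frame(frame):
    # Placeholder for per-frame work (e.g., one inference pass on an image).
    return sum(frame) / len(frame)

if __name__ == "__main__":
    frames = [[(i * j) % 256 for j in range(10_000)] for i in range(64)]

    # Serial: one core processes every frame in sequence.
    serial_results = [score_frame(f) for f in frames]

    # Parallel: the same frames are divided among four worker processes.
    with Pool(processes=4) as pool:
        parallel_results = pool.map(score_frame, frames)

    assert serial_results == parallel_results  # same answers, less wall time
```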
Processor Accelerators
1. Multi-core CPUs
2. GPUs (graphics processing units)
3. VPUs (vision processing units)
4. FPGAs (field-programmable gate arrays) & ASICs (application-specific integrated circuits)
Storage & Connectivity Accelerators
5. NVMe Storage
6. Computational Storage
7. PCIe Protocol
Why are they Beneficial?
The end of Moore's Law has been predicted for the past decade. Thanks to Moore's Law, CPUs today are roughly 300 times faster than in 1990, but the law no longer holds in full: single-thread performance (SpecINT) and clock frequency (MHz) have plateaued in recent years, while transistor and core counts continue to grow. This creates an imbalance in overall performance growth. By implementing parallel processing, however, you can still take advantage of the specifications that keep growing and mitigate the gaps left by those that have stalled.
Source: Chuck Moore, "Data Processing in Exascale-Class Computer Systems," The Salishan Conference on High Speed Computing, April 27, 2011
Embedded system applications typically demand high compute power and storage capacity when performing machine vision or deep learning. When applying performance acceleration, be sure to consider the three main factors that affect machine learning performance: computation capacity, memory bandwidth, and communication speed. By equipping your edge computer with domain-specific acceleration processors, you can tackle growing data flow challenges and create a powerful edge computer. Below, we explain the essential hardware accelerators, including multi-core CPUs, GPUs, VPUs, FPGAs, and ASICs, as well as high-speed storage and connectivity protocols, including NVMe, computational storage, and PCIe 4.0.
1. CPU - Central Processing Unit
Rather than spending time and money shrinking components to make them more efficient, the premise is to get more done in parallel. With a design called multiple instruction, multiple data (MIMD), each processor or core executes its own instruction stream in every clock cycle, so a task can be split into multiple sub-tasks that run simultaneously. To take advantage of this technique, choose a CPU specification that supports it; here are some factors to look into:
- Clock Speed
- Number of Cores
- Multithreading
- Memory Support and Channels
Clock Speed
Clock speed is essentially the number of cycles a processor performs every second; for instance, a 3.4GHz processor performs 3.4 billion cycles per second. This matters because of the huge number of tasks and operations an edge computer performs, especially when running a machine learning algorithm. A higher clock speed lets the processor complete more tasks, resulting in faster general computing speed.
It's important not to compare clock speeds across different manufacturers or generations, because a higher clock speed does not always mean a faster CPU. For example, the 7th-gen Intel Core i5-7500 at 3.4GHz is not more powerful than the 10th-gen Intel Core i5-10500 at 3.1GHz, despite the latter's lower clock speed. There are other factors to consider when choosing which CPU to configure your system with.
Number of Cores
The rule of thumb for core count is that the more cores a processor has, the faster it can process and the more computations it can perform in parallel. Multi-core processing is excellent for heavy workloads that can utilize multiple cores: the cores simultaneously run the sub-tasks of a bigger task that has been broken down for parallel processing. Therefore, the more cores running a task, the faster the processing speed and the lower the power consumption. For example, a dual-core processor can process instructions faster than a single-core processor while consuming less energy.
A dual-core processor is also more efficient, and consumes less energy, than a single core running at twice the clock speed: a multi-core processor packs more transistors, shorter connections, and higher capacity into a smaller circuit working at a faster speed. For AI applications you'll want at least 8 cores, or a strong CPU like the 10th Generation Intel Core i9 processors with 10 cores and 20 threads.
Multithreading
There is a strong relationship between CPU cores and threads when determining a CPU's computing power. Cores are the actual physical hardware components, while threads are the virtual cores that manage tasks. A thread is a unit of execution in concurrent programming that can be scheduled in parallel. This is how multithreading executes multiple tasks at the same time: each thread executes individually while sharing resources with the others to complete the work.
Intel brought this parallelism to end-user computers with hyper-threading, in which each CPU core presents itself to the operating system as two logical CPUs. The OS therefore sees two CPUs per core even though the hardware has one core with a single set of execution resources. On AMD chips this is called simultaneous multithreading (SMT), essentially the same technology as Intel's hyper-threading. The number of cores and threads is positively correlated with the efficiency and multitasking capability of the CPU.
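A quick way to see hyper-threading or SMT on your own machine is to compare logical and physical core counts. A small sketch, assuming the third-party psutil package is installed:

```python
# Compare physical cores with logical processors (threads).
import os
import psutil  # third-party: pip install psutil

physical = psutil.cpu_count(logical=False)  # actual hardware cores
logical = os.cpu_count()                    # logical processors seen by the OS

print(f"Physical cores: {physical}")
print(f"Logical processors: {logical}")
# On a hyper-threaded part such as a 10-core/20-thread Core i9,
# this reports 10 physical cores and 20 logical processors.
```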
Memory Support and Channels
Modern CPUs have the memory controller built directly into the chip, which is why different CPUs support different memory speeds. Memory speed indicates the data transfer rate; the higher the speed, the faster data can be transferred. Pay attention to the supported memory size and memory type: a CPU that supports DDR4 speeds up to 2133MT/s will not necessarily support those speeds on DDR3 memory.
A CPU's memory channels are the lanes that carry communication between the processor and memory; more channels mean faster data exchange. You'll therefore want to take advantage of modern CPUs that support dual-channel memory or more by installing additional RAM modules into the DIMM slots on your motherboard. It's crucial to understand CPU specifications in detail rather than focusing on a single feature; with more knowledge, you can choose the most compatible CPU for your application needs.
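As a rough illustration of why channels matter, peak theoretical bandwidth scales with both the transfer rate and the channel count. A simplified calculation (real-world throughput is lower):

```python
# Peak memory bandwidth ~= transfers/second x 8 bytes per 64-bit channel
# x number of channels. This is a theoretical ceiling, not measured speed.
def peak_bandwidth_gb_s(transfer_rate_mt_s: int, channels: int) -> float:
    bytes_per_transfer = 8  # one 64-bit DDR channel moves 8 bytes per transfer
    return transfer_rate_mt_s * 1e6 * bytes_per_transfer * channels / 1e9

print(peak_bandwidth_gb_s(2133, channels=1))  # DDR4-2133 single-channel: ~17.1 GB/s
print(peak_bandwidth_gb_s(2133, channels=2))  # DDR4-2133 dual-channel:   ~34.1 GB/s
```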
2. GPU - Graphics Processing Unit
To illustrate the difference between a CPU and a GPU with an animal-kingdom analogy: CPUs are like elephants, powerful but few, while GPUs are like ants, small but massive in number. That is roughly how their core counts compare. Originally designed for video processing and rendering, today's GPUs are used in a far broader range of applications, especially calculations involving massive amounts of data.
The colossal parallelism of modern GPUs allows a computer to process billions of records quickly, accelerating the heavy portions of an application while the rest continues to run on other components like the CPU. GPUs particularly excel at the vector calculations over large data sets, unstructured data, and sophisticated statistical analyses that are common in data science and machine learning. For instance, GPUs can complete genome sequencing workloads in minutes where CPUs require days. GPUs are therefore well suited not only to heavy graphics calculations but also to AI, machine learning, and deep learning applications.
When choosing the best GPU for embedded computing, select GPUs that can last for the long run and scale through integration and clustering. A production-grade or data center GPU is essential for your device's durability and reliability. There are two main players in the GPU market today, Nvidia and AMD. Both manufacturers offer various GPU lines and implement many similar technologies under different names.
Nvidia's GPU cores are called CUDA cores and AMD's are called stream processors; both refer to the same thing, the cores inside a GPU. Nvidia also introduced Tensor Cores, which are specially designed for machine learning computing and can run a mixed-precision operation in a single clock cycle.
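As a hedged sketch of the kind of mixed-precision work Tensor Cores accelerate, the snippet below uses PyTorch's automatic mixed precision on a CUDA GPU; inside the autocast region, eligible operations such as this matrix multiply run in float16:

```python
# Mixed-precision matrix multiply with PyTorch autocast (requires a CUDA GPU).
import torch

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

# Eligible ops inside autocast execute in float16, the data type
# Tensor Cores are built to process.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    c = a @ b

print(c.dtype)  # torch.float16
```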
Unlike with CPUs, comparing core counts across GPU manufacturers does not accurately indicate graphical power; it's better to compare two GPU cards with the same architecture for an accurate comparison. Choosing the right GPU as a performance accelerator for your edge computer can pay off in deployment. Here are several important factors to make sure your GPU is the right fit for your embedded system model.
Learn More About Machine Inference Analysis Case Study: One Fish, Two Fish
Clock Speed
As with CPUs, a GPU's clock speed is measured in MHz and indicates the number of computations per second performed by a GPU core: the faster and more powerful the performance you want from your GPU, the higher the clock speed it will need. Check both the base clock speed and the boost clock speed for a fuller picture of the power a GPU offers. For instance, the NVIDIA GeForce GTX 1050 Ti has a base clock of 1290MHz and a boost clock of up to 1392MHz under heavy workloads. Embedded in an industrial computer such as Premio's VCO 6000 series, this GPU can quickly run image recognition machine learning algorithms for vision tasks such as inference in a fish processing plant, facial recognition, and security ID checks in airports.
Learn More About Performance Accelerator Application in Airport Security ID Checks Case Study
Memory: Type, Size, and Bandwidth
Memory is the most important factor to consider when choosing your GPU specifications. GPU memory (VRAM) is built on SDRAM (synchronous dynamic random-access memory) in the form of graphics double data rate (GDDR) memory, which is what manufacturers use for VRAM such as GDDR5, GDDR5X, and GDDR6. VRAM capacity is measured in GB, and a larger size lets you store more graphical data; if your GPU doesn't have enough memory, a larger machine learning model simply won't fit on it.
Alongside a large memory, you also need high memory bandwidth. Memory bandwidth is the product of memory clock speed, memory bus width, and transfers per clock. The memory bus width is like a set of lanes that allow more data to be transferred, and it is measured in bits. The higher these three factors, the more powerful your GPU and the better it can tackle complex machine learning workloads. For example, the NVIDIA GeForce GTX 1050 Ti has 4GB of GDDR5 memory, 112GB/s of memory bandwidth, and a 128-bit memory interface width. This GPU is used in Premio's smart automation plant case studies.
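That bandwidth figure can be reproduced from the other specs: multiply the effective transfer rate by the bus width in bytes. A simplified worked example:

```python
# Memory bandwidth = effective transfer rate (GT/s) x bus width in bytes.
def memory_bandwidth_gb_s(effective_gt_s: float, bus_width_bits: int) -> float:
    return effective_gt_s * (bus_width_bits / 8)

# GTX 1050 Ti: 7 GT/s effective GDDR5 on a 128-bit bus.
print(memory_bandwidth_gb_s(7.0, 128))  # 112.0 GB/s, matching the spec above
```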
3. VPUs – Vision Processing Units
While GPUs can be used for graphics or deep learning, engineers have developed a smaller, more compact processing unit that is perfect for deployment at the edge: the vision processing unit (VPU). A VPU is a type of microprocessor specifically designed to accelerate machine learning and artificial intelligence workloads, specialized for the image processing tasks of computer vision. Intel, for example, has created a processor that is powerful yet small in form factor with low power usage: the Movidius Myriad X chip, whose high-level architecture is specifically designed around neural network, imaging, and vision accelerators.
Applying a VPU therefore provides more privacy, lower latency, faster performance, and lower power usage, because everything happens at the edge without any interaction with the cloud, and you can stack multiple USB-stick-shaped VPUs to easily multiply the computing power. This is very exciting for performance acceleration, especially for vision workloads, where VPUs can take over vision tasks and leave the CPU and GPU free to run other programs.
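As a hedged sketch of how inference can be dispatched to a Movidius VPU, the snippet below uses Intel's OpenVINO runtime; model.xml is a placeholder for your own converted model, and MYRIAD device support varies by OpenVINO release:

```python
# Sketch: run one inference request on a Myriad X VPU via OpenVINO.
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")            # placeholder IR model file
compiled = core.compile_model(model, "MYRIAD")  # target the Myriad X VPU

# Feed a dummy input shaped like the model's first input.
request = compiled.create_infer_request()
dummy = np.zeros(list(compiled.input(0).shape), dtype=np.float32)
results = request.infer([dummy])
```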
4. FPGAs and ASICs
Recently growing in popularity are FPGAs and ASICs. An FPGA (field-programmable gate array) is a flexible integrated circuit consisting of logic blocks, I/O cells, and interconnection resources that are fully customizable and programmable for the desired application's functions. The chip can be reconfigured repeatedly by changing the I/O connections and combining the logic blocks differently. FPGAs appear in various machine vision equipment, often embedded with a camera on the back end.
FPGAs are very convenient due to their flexibility: you don't need to change the hardware to reconfigure the chip, you simply update its functions in software. Once the optimal configuration is found, the FPGA design is sometimes implemented in an ASIC architecture. An ASIC (application-specific integrated circuit) is a much more efficient chip that is hardened to perform one specific task.
Unlike FPGAs, ASICs are permanent: once produced, a chip can only run the task it was originally designed for. Developers therefore usually start with FPGAs to figure out the optimal chip design for the AI algorithm, then configure the ASIC design once the optimal combination is found. Some developers stick with FPGAs, because producing ASIC hardware is very expensive, especially given that it's a one-way route. One of the most famous revolutionary ASIC innovations is Google's custom-developed Google Cloud TPU, an ASIC designed very specifically to accelerate machine learning workloads with great efficiency.
Storage and Fast Connectivity
The next step in getting the most out of your performance accelerators is to ensure you have the fastest storage and connections possible. By improving as many components throughout the system as you can, you increase performance and reduce bottlenecks, and removing bottlenecks creates an optimally performing system. As mentioned, the key factors here are NVMe storage technology, computational storage, and the PCIe connection protocol.
5. NVMe SSD
NVMe, which stands for non-volatile memory express, is the standardized interface for PCIe SSDs: a protocol designed for accessing high-speed storage media that focuses on parallelism to unleash the true potential of SSDs. Non-volatile storage is a data storage device that retains its data even after the entire system is powered off, which is why it is also referred to as persistent storage. Both HDDs and SSDs are persistent storage, but SSDs have far faster read and write speeds and no moving mechanical parts, which is why SSDs are great for edge computer deployments.
Combining SSDs with the NVMe protocol provides major advantages over legacy protocols such as SATA, which was conceived in the HDD era. Exploiting parallelism, the PCIe electrical bus, and the nature of flash storage, NVMe supports up to 64,000 queues of 64,000 commands each, compared to the single 32-command queue offered by SATA, a massive upgrade. Finally, NVMe storage comes in various form factors such as M.2 NVMe SSDs. This is the kind of storage device you need for high-speed edge computing that reduces your system's bottlenecks.
6. Computational Storage
Another important upcoming technology to watch is computational storage. New technologies are constantly trying to eliminate speed bottlenecks, raise computing performance, and increase storage capacity, and computational storage is one of the solutions engineers have devised to meet these challenges: it brings the processing power to the storage device itself.
Keeping the processor and the data storage separate is inefficient, because the bottleneck grows as huge data volumes must be transferred between the CPU and the storage device. Computational storage instead builds a storage subsystem containing a number of special-purpose or general-purpose processors located right on the storage media and molded into a single drive, known as a computational storage drive (CSD). A typical design is based around an Arm Cortex processor embedded in NVMe-based storage and can even include additional accelerators such as FPGAs or ASICs, depending on application needs. Moving the processing right to the data source results in ultra-fast computing speed.
7. PCIe Protocol
The Peripheral Component Interconnect Express (PCIe) protocol is a high-speed interface standard that reduces bottlenecks between components with a very fast connection. Making sure your motherboard has the right PCIe configuration is important for all of the above-mentioned accelerators to be fully utilized. There are six PCIe generations, from PCIe 1.0 through 6.0, with bandwidth ranging from 8GB/s up to 256GB/s and transfer rates from 2.5GT/s up to 64GT/s, roughly doubling every generation. The standard currently used in devices on the market is PCIe 4.0, which debuted in 2017 and offers 64GB/s of bandwidth at 16GT/s, an extremely fast connection. Even so, PCIe devices still can't come close to the maximum potential of PCIe 4.0 speeds.
These performance accelerators will benefit greatly from the PCIe 4.0 protocol. On a motherboard, PCIe slots come in different physical sizes, from x1, x4, x8, and x16 up to x32, indicating how many lanes each slot has. PCIe is backward compatible, meaning you can insert PCIe 3.0 devices into PCIe 4.0 slots. The serial nature of the connection also lets a smaller card run in a bigger slot, for instance a PCIe x8 card in an x16 slot, and some open-ended slots even accept larger cards; the only difference is the bandwidth available from the smaller link. Future devices supporting PCIe 5.0 or even 6.0 will add massive speed and further reduce bottlenecks for faster processors. Even with PCIe 4.0, you can really take advantage of your NVMe SSD storage, with bandwidth up to 8GB/s.
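Those headline numbers can be approximated from the per-lane transfer rate and the link's encoding overhead (PCIe 3.0 and later use 128b/130b encoding). A simplified sketch:

```python
# Approximate usable PCIe bandwidth per direction:
# transfer rate (GT/s) x encoding efficiency x lane count / 8 bits per byte.
def pcie_bandwidth_gb_s(gt_s: float, lanes: int, encoding: float = 128 / 130) -> float:
    return gt_s * encoding * lanes / 8

print(pcie_bandwidth_gb_s(8, 4))    # PCIe 3.0 x4: ~3.9 GB/s
print(pcie_bandwidth_gb_s(16, 4))   # PCIe 4.0 x4: ~7.9 GB/s, the ~8GB/s cited above
print(pcie_bandwidth_gb_s(16, 16))  # PCIe 4.0 x16: ~31.5 GB/s per direction
```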
Frequently Asked Questions:
What are the ways to accelerate a computer’s performance?
Running heavy applications, especially ones involving AI, machine learning, and deep learning, can create several challenges for a computer. Accelerating a computer's performance can remove bottlenecks, improve latency, and reduce power consumption. There are two main ways to accelerate your computer's performance: software accelerators and hardware accelerators.
What are software accelerators and hardware accelerators in machine learning computing?
There are steps you can take to accelerate your computer system from both a software and a hardware point of view. On the software side, you accelerate performance by increasing the algorithm's efficiency while maintaining the machine learning accuracy.
These steps include pruning, weight sharing, quantization, low-rank approximation, binary/ternary nets, and Winograd transformation for inference algorithms, and parallelization, mixed precision, model distillation, and dense-sparse-dense methods for training algorithms. From a hardware acceleration perspective, the common goals are to minimize memory access, reduce performance bottlenecks, and increase computing speed. These can be achieved through parallel computing on specialized processors like GPUs and VPUs, applying powerful CPUs, and reducing speed bottlenecks with NVMe and the PCIe protocol.
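As a hedged sketch of one software-side technique from this list, the following applies PyTorch's post-training dynamic quantization, converting the Linear layers of a toy model to int8:

```python
# Post-training dynamic quantization: weights stored as int8,
# shrinking the model and speeding up Linear layers on CPU.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same output shape from a smaller, faster model
```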
What is the difference between training and inferencing in deep learning? And why should I care?
Basically speaking, training is the process of creating a machine learning model from scratch by feeding it a massive amount of data. Inference, on the other hand, is the process of deploying the trained model to make predictions on previously unseen data. Training usually takes place in a huge data center; inference, in contrast, is ready to be applied at the edge, giving a computer vision. Inference is commonly used in autonomous vehicles, smart manufacturing, and image recognition systems.
What is the difference between GPU and VPU?
Both GPUs and VPUs are great for running machine learning algorithms. However, GPUs tend to be bigger and are made for more general use cases, meaning they are not specifically designed for machine learning; GPUs are also used for gaming, processing the heavy graphics rendering of demanding games. In contrast, a VPU is specifically designed for machine learning inference and is much smaller. A VPU is therefore very powerful while consuming less power when providing vision for a computer.
When do I need to apply performance acceleration to my computer?
Performance acceleration is usually applied to computers running heavyweight tasks, most often when a computer is performing data analytics, artificial intelligence, machine learning, or deep learning. However, performance acceleration can also be very useful in industries that constantly need fast, reliable computing power in their infrastructure, such as the automation industry.