How to Build AI or ML Farms – Without Breaking the Bank

WRITTEN BY

erik.hoeboer@netgear.com

The cluster is where costs can spiral, where AI performance can be throttled, or both. Let’s run through the typical hardware components and determine where you can optimise the performance-to-cost ratio.

AI Servers

These are the workhorses of an AI server farm. And you need a lot of them. These servers typically have powerful CPUs and are often equipped with GPUs or specialised accelerators like TPUs (Tensor Processing Units) explicitly designed for the parallel processing tasks common in machine learning and deep learning. There is no compromising here. The servers’ raw computing power will make or break your AI cluster.

AI Storage Systems

AI applications often require access to large datasets. Storage solutions in a server farm can include SSDs for fast access, HDDs for larger, less frequently accessed data, and network-attached storage (NAS) or storage area networks (SAN) for shared storage solutions. Fortunately, these systems have been commoditised, and there are a lot of choices for any budget.

AI Networking Hardware

High-bandwidth, low-latency switches are crucial for handling the intense traffic demands of a server farm. They are often the bottleneck of all data transport in an AI setup. Since you can’t compromise on throughput or latency, this is an area where you may find NETGEAR’s new M4350 series of 10GbE/100GbE network switches to be a particular life-saver.

These switches run on modern, low-latency, high-performance silicon and are designed for simple management and cost-conscious customers. Typical enterprise data centre switches, by contrast, have become unaffordable for many buyers, which makes them a poor bet.
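To see why link speed matters so much, here is a rough back-of-the-envelope sketch of how long it takes to move a dataset over a single 10GbE versus a single 100GbE link. The 1,000 GB dataset size and the 90% link efficiency are illustrative assumptions, not measurements, and the function name is made up for this example.

```python
def transfer_seconds(dataset_gigabytes: float, link_gbps: float,
                     efficiency: float = 0.9) -> float:
    """Estimate the seconds needed to move a dataset over one network link.

    efficiency is an assumed fraction of the nominal line rate that is
    actually usable for payload (protocol overhead, congestion and so on).
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    dataset_gigabits = dataset_gigabytes * 8  # 1 byte = 8 bits
    return dataset_gigabits / (link_gbps * efficiency)


# Hypothetical 1,000 GB training dataset over the two link speeds.
for speed in (10, 100):
    print(f"{speed} GbE: about {transfer_seconds(1000, speed):.0f} s")
```

Under these assumptions the nominal difference is a factor of ten. Real clusters will deviate depending on storage speed, protocol and traffic patterns.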

Routers manage traffic between the server farm and the broader Internet or other networks. The same applies here: they can become the bottleneck for traffic to and from the Internet. However, cost-effective alternatives are available, such as NETGEAR’s PR60X Professional Router with multi-gig/10gig WAN/LAN performance.

Network interface cards (NICs), possibly with 10GbE/100GbE throughput, are essential for fast server communication. They need to be optimised for the switches you choose.
NETGEAR’s engineering team can help you design an optimal network setup with these components.
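The point that NICs, switches and routers must be matched can be illustrated with a small sketch: the usable throughput of a path is capped by its slowest component. The component names and speeds below are hypothetical placeholders, not a recommended design.

```python
def find_bottleneck(path_gbps: dict[str, float]) -> tuple[str, float]:
    """Return the (component, speed) pair with the lowest nominal speed.

    path_gbps maps each component on a data path to its nominal
    throughput in Gbit/s. The slowest entry limits the whole path.
    """
    if not path_gbps:
        raise ValueError("path_gbps must contain at least one component")
    return min(path_gbps.items(), key=lambda item: item[1])


# Hypothetical path: a server with a 100GbE NIC behind a 100GbE switch port,
# reaching storage that is attached with only a 10GbE NIC.
path = {"server_nic": 100.0, "switch_port": 100.0, "storage_nic": 10.0}
component, speed = find_bottleneck(path)
print(f"Bottleneck: {component} at {speed:g} Gbit/s")
```

In this made-up example, upgrading the switch alone would not help. The 10GbE storage NIC would still cap the path.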

Network management software is often forgotten, but it is crucial to managing network configurations, monitoring network performance, troubleshooting, and ensuring network security. NETGEAR offers a free controller, called NETGEAR Engage, to manage and monitor small or large numbers of NETGEAR Fully Managed Switches.

AI & ML Farm Examples

NETGEAR switches are used in world-class AI/ML applications. Two examples:

  • An AI/ML cluster run by a third-party data analysis company to analyse thousands of concurrent camera feeds from self-driving cars.
  • An AI/ML cluster to gather, scan, analyse and combine drone camera footage for the military of a NATO member state.

With this overview, we hope to have given you a high-level idea of the prime considerations in designing an AI/ML setup and where we can help you optimise performance and cost.

When you have a draft proposal for your cluster’s architecture, please contact us to discuss the network design. We will design your network for free and guarantee that it functions correctly. Read more about our M4350 series of switches, suitable for AI & ML deployments.