How to Build AI or ML Farms – Without Breaking the Bank


The cluster is where things can get really expensive, limit AI performance, or both. Let’s run through the typical hardware components and see where you can optimise for the best performance-to-cost ratio.

AI Servers

These are the workhorses of an AI server farm. And you need a lot of them. These servers typically have powerful CPUs and are often equipped with GPUs or specialised accelerators like TPUs (Tensor Processing Units) explicitly designed for the parallel processing tasks common in machine learning and deep learning. There is no compromising here. The servers’ raw computing power will make or break your AI cluster.

AI Storage Systems

AI applications often require access to large datasets. Storage solutions in a server farm can include SSDs for fast access, HDDs for larger, less frequently accessed data, and network-attached storage (NAS) or storage area networks (SAN) for shared storage solutions. Fortunately, these systems have been commoditised, and there are a lot of choices for any budget.

AI Networking Hardware

High-bandwidth, low-latency switches are crucial for handling the intense traffic demands of a server farm. They are often the bottleneck for all data transport in an AI setup. Since you can't compromise on throughput or latency, this is an area where you need modern 10GbE/100GbE network switches.

These switches should run on the most modern, low-latency, high-performance silicon, and they should be designed for simple management and for cost-conscious customers.
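To get a rough feel for why link speed matters, the sketch below estimates how long it takes to move a dataset over a single 10GbE link versus a 100GbE link. The 1,000 GB dataset size and the 90% efficiency factor are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
def transfer_seconds(dataset_gigabytes: float, link_gbps: float,
                     efficiency: float = 0.9) -> float:
    """Estimate the seconds needed to move a dataset over one network link.

    efficiency is an assumed fraction of line rate left for payload after
    protocol overhead and congestion; 0.9 is a placeholder, not a measurement.
    """
    if dataset_gigabytes < 0:
        raise ValueError("dataset_gigabytes must not be negative")
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    dataset_gigabits = dataset_gigabytes * 8  # 1 byte = 8 bits
    return dataset_gigabits / (link_gbps * efficiency)


if __name__ == "__main__":
    # Hypothetical 1,000 GB training dataset on each link speed.
    for speed_gbps in (10, 100):
        minutes = transfer_seconds(1000, speed_gbps) / 60
        print(f"{speed_gbps} GbE: about {minutes:.1f} minutes")
```

Under these assumptions the faster link shortens the transfer tenfold. Real-world figures depend on storage speed, protocol and traffic from other servers.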

Routers manage traffic between the server farm and the broader Internet or other networks. The same applies here: they can become the bottleneck for traffic to and from the Internet.

Network interface cards (NICs), possibly with 10GbE/100GbE throughput, are essential for fast server communication. They need to be optimised for the switches you choose.
Engineering teams can help you design an optimal network setup with these components.
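
One common sanity check when matching NICs to switches is the oversubscription ratio: the total bandwidth of the server-facing ports divided by the total bandwidth of the uplinks. The sketch below computes it for a hypothetical switch. The port counts and speeds are assumptions chosen for illustration, not a recommended design.

```python
def oversubscription_ratio(server_ports: int, server_port_gbps: float,
                           uplink_ports: int, uplink_port_gbps: float) -> float:
    """Return server-facing bandwidth divided by uplink bandwidth.

    A ratio of 1.0 or less means the uplinks can carry all server traffic
    at once; a higher ratio means the uplinks can saturate under full load.
    """
    downlink_gbps = server_ports * server_port_gbps
    uplink_gbps = uplink_ports * uplink_port_gbps
    if uplink_gbps <= 0:
        raise ValueError("total uplink bandwidth must be positive")
    return downlink_gbps / uplink_gbps


if __name__ == "__main__":
    # Hypothetical switch: 48 servers on 10GbE NICs, 4 x 100GbE uplinks.
    ratio = oversubscription_ratio(48, 10, 4, 100)
    print(f"Oversubscription ratio: {ratio:.2f}:1")
```

How much oversubscription is acceptable depends on the workload, so treat the result as a starting point for a design discussion and not as a pass or fail test.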

Network management software is often forgotten, but it is crucial for managing network configurations, monitoring network performance, troubleshooting, and ensuring network security. Free and commercial controllers are available to manage and monitor anything from a handful to a large number of fully managed switches.

AI & ML Farm Examples

NETGEAR switches are used in world-class AI/ML applications. Two examples:

  • An AI/ML cluster used by a third-party data analysis company to analyse thousands of concurrent camera feeds from self-driving cars.
  • An AI/ML cluster used to gather, scan, analyse and combine drone camera footage for the military of a NATO member state.

With this overview, we hope to have given you a high-level idea of the prime considerations in designing an AI/ML setup and where we can help you optimise performance and cost.

