Monitoring AI Infrastructure: Network Visibility for Security and Operations Teams

Organizations deploying GPU infrastructure for AI training and inference are facing a new set of networking challenges. Unlike traditional enterprise applications, AI workloads generate unique traffic patterns, including large-scale data ingestion before training, high-bandwidth model checkpoint transfers to distributed storage, continuous inference requests from thousands of users, and ongoing management traffic that crosses cluster boundaries.

For network and security teams, the issue is not whether this traffic exists—it travels across standard IP networks and generates NetFlow records on routers and switches at the cluster boundary. The real question is whether that traffic is being monitored, baselined, and analyzed for anomalies that could signal performance issues or security incidents.

This article explains where NetFlow Optimizer (NFO) provides measurable value for AI infrastructure monitoring and where its visibility naturally ends.

Understanding What NetFlow Can See in an AI Environment

AI cluster networks typically consist of two separate traffic planes. Knowing which one NetFlow can observe is critical for setting realistic expectations.

The East-West Compute Fabric: Outside NetFlow Visibility

During model training, GPU-to-GPU communication occurs across a dedicated compute fabric. By 2026, RoCEv2 over Ethernet has become the dominant technology for enterprise AI clusters, while InfiniBand remains common in the largest hyperscale environments. Both technologies rely on RDMA.

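Because RDMA moves data directly between network adapters and bypasses the host's kernel networking stack, and because the compute fabric is typically a dedicated network kept separate from the routers and switches that export flow records, this traffic is generally not visible to NetFlow or sFlow collectors, regardless of whether the underlying transport uses Ethernet or InfiniBand.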

This distinction is important: NFO cannot observe east-west GPU training traffic in most enterprise AI deployments. Visibility into the compute fabric requires specialized RDMA monitoring solutions.

The North-South Front-End Network: Fully Visible to NetFlow

Every AI cluster connects to the wider enterprise environment through a standard IP-based front-end network. This network carries:

Data ingestion from storage platforms and data lakes
Model checkpoint transfers
Inference traffic between users, applications, and AI services
Cluster management and orchestration communications

Because this traffic runs over standard TCP/IP, it generates NetFlow records on boundary routers and switches.

While NFO cannot see inside the GPU compute fabric, it provides complete visibility into traffic crossing the cluster boundary, including inbound and outbound data transfers, inference requests, management communications, and unauthorized external connections.

Three Areas Where NFO Delivers Clear Value

1. Monitoring Data Ingestion and Storage Traffic

AI training workloads depend on moving large datasets from object storage, data lakes, and NFS repositories to GPU nodes before and during training operations. These transfers generate sustained, high-bandwidth flows that are fully visible through NetFlow records collected at the cluster boundary.

NFO provides per-flow and per-application bandwidth visibility, allowing teams to understand:

Which storage systems are serving specific GPU nodes
The volume of data being transferred
When transfers occur

This information is valuable for capacity planning and performance troubleshooting. Teams can determine whether storage network congestion is contributing to training slowdowns, identify heavily utilized storage tiers, and detect preprocessing pipelines generating unexpected traffic that competes with training workloads for bandwidth.
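
The kind of per-pair, per-hour bandwidth view described above can be sketched in a few lines of Python. This is a minimal illustration, not NFO's implementation: the record fields (src, dst, bytes, start), the addresses, and the function names are all assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical, already-decoded flow records. Field names and addresses are
# illustrative and are not NFO's output schema. "start" is a Unix timestamp.
flows = [
    {"src": "10.20.0.5", "dst": "10.30.1.11", "bytes": 40_000_000_000, "start": 1767225600},
    {"src": "10.20.0.5", "dst": "10.30.1.11", "bytes": 35_000_000_000, "start": 1767227400},
    {"src": "10.20.0.5", "dst": "10.30.1.12", "bytes": 20_000_000_000, "start": 1767229200},
    {"src": "10.20.0.9", "dst": "10.30.1.11", "bytes": 5_000_000_000, "start": 1767229200},
]

def bytes_per_pair_per_hour(records):
    """Sum bytes for each (storage source, GPU destination, hour start) bucket."""
    totals = defaultdict(int)
    for r in records:
        # Round the flow start time down to the beginning of its hour.
        hour_start = r["start"] - r["start"] % 3600
        totals[(r["src"], r["dst"], hour_start)] += r["bytes"]
    return dict(totals)

def top_talkers(totals, n=3):
    """Return the n buckets that moved the most bytes, largest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

totals = bytes_per_pair_per_hour(flows)
busiest = top_talkers(totals, n=1)
```

Grouping by source, destination, and hour answers the three questions in the list at once: which storage system served which node, how much data moved, and when.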

2. Baselining Inference Traffic and Detecting Anomalies

Inference workloads create a very different traffic profile from training workloads. Instead of large internal data exchanges, inference environments handle high volumes of concurrent requests from users and downstream applications accessing AI services.

Many organizations now operate hybrid AI architectures, using dedicated RDMA fabrics for training while serving inference traffic over standard Ethernet networks. The inference side remains fully visible through NetFlow telemetry.

NFO enriches flow records with application intelligence, GeoIP information, and, where applicable, user identity data from Active Directory, Entra ID, Okta, and VPN authentication logs before forwarding the information to SIEM and monitoring platforms.
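
As a rough illustration of what enrichment means in practice, the sketch below attaches a country code and a user name to a flow record. The lookup tables, field names, and the enrich function are hypothetical stand-ins; NFO's actual enrichment sources and output schema are not shown in this article.

```python
import ipaddress

# Hypothetical lookup tables. A real deployment would draw these from a GeoIP
# database and from identity or VPN authentication logs.
GEO_BY_PREFIX = {
    ipaddress.ip_network("198.51.100.0/24"): "SG",
    ipaddress.ip_network("203.0.113.0/24"): "US",
}
USER_BY_IP = {"198.51.100.7": "alice@example.com"}

def enrich(flow, geo=GEO_BY_PREFIX, users=USER_BY_IP):
    """Return a copy of `flow` with source country and user added when known."""
    addr = ipaddress.ip_address(flow["src"])
    # First matching prefix wins; None means the source is not in the table.
    country = next((c for net, c in geo.items() if addr in net), None)
    return {**flow, "src_country": country, "src_user": users.get(flow["src"])}

enriched = enrich({"src": "198.51.100.7", "dst": "10.30.1.11", "bytes": 1200})
unknown = enrich({"src": "192.0.2.1", "dst": "10.30.1.11", "bytes": 800})
```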

This enriched telemetry enables organizations to establish normal traffic baselines and identify anomalies such as:

Unexpected spikes in inference request volumes
Connections originating from unauthorized source IP addresses
Access attempts outside approved operational windows
Requests from unusual geographic locations
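
To make the first item concrete, here is one simple way a downstream system might flag a spike in inference request volume: compare the current interval against the mean and standard deviation of recent intervals. The threshold rule, the multiplier k, and the sample counts are assumptions for illustration; production baselining usually also accounts for time of day and seasonality.

```python
from statistics import mean, stdev

def is_spike(history, current, k=3.0):
    """Flag `current` if it exceeds the historical mean by more than k standard deviations."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    threshold = mean(history) + k * stdev(history)
    return current > threshold

# Hypothetical flow counts to an inference endpoint per 5-minute interval.
baseline = [1000, 1100, 950, 1050, 1020, 980]

normal_interval = is_spike(baseline, 1150)
burst_interval = is_spike(baseline, 2500)
```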

For organizations operating AI services in regulated industries, this visibility also supports ongoing monitoring and audit requirements associated with applicable compliance frameworks.

3. Strengthening AI Infrastructure Security Monitoring

AI infrastructure has become a high-value target for attackers. Proprietary model weights, training datasets, and expensive compute resources represent attractive assets for cybercriminals and nation-state actors alike.

NFO delivers enriched flow telemetry from the cluster boundary, providing upstream security platforms with the information needed to detect several key threat scenarios.

Model and Dataset Exfiltration

Large outbound transfers from storage systems or AI cluster nodes to unexpected external destinations are often the primary indicator of data theft.

By analyzing flow duration and cumulative transfer volumes over extended periods, security teams can identify both immediate and low-and-slow exfiltration attempts. This approach aligns directly with the detection methodology discussed in "Defeating the Low and Slow."
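
The idea of accumulating volume over a long window can be sketched as follows. Each individual flow in the example is small, but the total to one destination over 24 hours is not. The record fields, addresses, window, and byte limit are hypothetical, and this is not a description of any specific product's detection logic.

```python
from collections import defaultdict

def cumulative_outbound(records, window_start, window_end):
    """Total bytes, flow count, and active seconds per (source, destination) pair."""
    totals = defaultdict(lambda: {"bytes": 0, "flows": 0, "seconds": 0})
    for r in records:
        # Only count flows that started inside the analysis window.
        if window_start <= r["start"] < window_end:
            t = totals[(r["src"], r["dst"])]
            t["bytes"] += r["bytes"]
            t["flows"] += 1
            t["seconds"] += r["end"] - r["start"]
    return dict(totals)

def flag_pairs(totals, byte_limit):
    """Return pairs whose cumulative bytes exceed the limit, in sorted order."""
    return sorted(pair for pair, t in totals.items() if t["bytes"] > byte_limit)

# Hypothetical day of traffic: one 50 MB upload every 30 minutes to one host,
# plus a single unrelated 200 MB transfer to a different host.
records = [
    {"src": "10.30.1.11", "dst": "203.0.113.9", "bytes": 50_000_000,
     "start": i * 1800, "end": i * 1800 + 60}
    for i in range(48)
]
records.append({"src": "10.30.1.12", "dst": "198.51.100.20",
                "bytes": 200_000_000, "start": 3600, "end": 3700})

totals = cumulative_outbound(records, 0, 86_400)
suspects = flag_pairs(totals, byte_limit=1_000_000_000)
```

A per-flow size threshold would miss the first pair entirely, since no single transfer exceeds 50 MB; the cumulative view is what surfaces it.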

Unauthorized Access to Inference Services

Connections originating from source IP addresses outside approved access lists can be identified immediately through NFO-generated flow telemetry.

When enriched with GeoIP information and threat intelligence context, these events become significantly easier to investigate and prioritize.
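
An approved-source check of this kind is straightforward to express with Python's standard ipaddress module. The approved networks and flow records below are made-up examples, and the function names are illustrative.

```python
import ipaddress

# Hypothetical approved source ranges for an inference service.
APPROVED = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "192.0.2.0/24")]

def is_approved(src_ip, approved=APPROVED):
    """True if the source address falls inside any approved network."""
    addr = ipaddress.ip_address(src_ip)
    return any(addr in net for net in approved)

def unauthorized_sources(records):
    """Return the distinct source addresses that are outside the approved list."""
    return sorted({r["src"] for r in records if not is_approved(r["src"])})

records = [
    {"src": "10.1.2.3", "dst": "10.30.1.11"},
    {"src": "198.51.100.7", "dst": "10.30.1.11"},
    {"src": "192.0.2.44", "dst": "10.30.1.11"},
    {"src": "198.51.100.7", "dst": "10.30.1.12"},
]

offenders = unauthorized_sources(records)
```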

Indicators of Supply Chain Compromise

Unexpected outbound communications from AI infrastructure during or after model deployment may indicate compromise.

Examples include connections to unfamiliar package repositories, unknown internet destinations, or external hosts flagged by threat intelligence feeds. Because these communications traverse the north-south network boundary, they are visible within NetFlow data and can be investigated before they develop into larger incidents.
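
One simple way to surface such communications is to report the first time AI infrastructure contacts a destination that has not been seen before. The sketch below assumes a set of previously known destinations and flow records with start, src, and dst fields; all names and addresses are hypothetical.

```python
def first_contacts(records, known_destinations):
    """Return (start, src, dst) for the first flow to each previously unseen destination."""
    seen = set(known_destinations)
    new = []
    # Process flows in time order so the earliest contact is the one reported.
    for r in sorted(records, key=lambda rec: rec["start"]):
        if r["dst"] not in seen:
            seen.add(r["dst"])
            new.append((r["start"], r["src"], r["dst"]))
    return new

# Hypothetical history: the cluster normally talks only to one package mirror.
known = {"192.0.2.10"}
records = [
    {"start": 100, "src": "10.30.1.11", "dst": "192.0.2.10"},
    {"start": 300, "src": "10.30.1.11", "dst": "203.0.113.50"},
    {"start": 200, "src": "10.30.1.12", "dst": "198.51.100.99"},
    {"start": 400, "src": "10.30.1.12", "dst": "203.0.113.50"},
]

alerts = first_contacts(records, known)
```

Each new destination could then be checked against threat intelligence before an analyst decides whether it warrants investigation.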

NFO Visibility Summary for AI Infrastructure

Traffic Type	NFO Visibility	Value Delivered
Data ingestion from storage to GPU nodes	Full	Bandwidth monitoring, storage capacity planning, training bottleneck identification
Model checkpointing to distributed storage	Full	Checkpoint frequency and volume tracking, storage utilization visibility
Inference serving (users and applications to GPU nodes)	Full	Enriched data foundation for upstream baselining, anomaly detection, unauthorized access detection, and audit trail for regulated environments
Management and orchestration traffic	Full	Unexpected management connections, configuration change indicators, first-contact external destinations
Outbound connections from AI infrastructure (potential exfiltration)	Full	Enriched outbound flow data enabling exfiltration detection, supply chain compromise identification, and threat intelligence screening in upstream security systems
East-west GPU training traffic (RoCEv2 or InfiniBand compute fabric)	Not visible	RDMA bypasses the IP stack; dedicated RDMA monitoring tools required for compute fabric visibility

Deploying NFO for AI Infrastructure Visibility

NFO is software-only and deploys on standard Linux or Windows Server with no hardware changes to AI infrastructure. The deployment model is straightforward: NFO collects NetFlow or IPFIX from the boundary switches and routers connecting the AI cluster to the storage network and enterprise network, enriches the data, and delivers it to your SIEM or monitoring platform in under one hour.
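
For readers unfamiliar with what a collector receives from boundary devices, the sketch below decodes a NetFlow v5 export datagram using only the Python standard library. It is an illustration of the wire format, not NFO's implementation, and it covers only the fixed-layout v5 format; NetFlow v9 and IPFIX are template-based and need considerably more handling. The sample addresses and counters are invented.

```python
import socket
import struct

HEADER = struct.Struct("!HHIIIIBBH")                 # 24-byte NetFlow v5 header
RECORD = struct.Struct("!4s4s4sHHIIIIHHBBBBHHBBH")   # 48-byte NetFlow v5 flow record

def parse_v5(datagram):
    """Decode a NetFlow v5 datagram into a list of flow dictionaries."""
    version, count, *_ = HEADER.unpack_from(datagram, 0)
    if version != 5:
        raise ValueError(f"not a NetFlow v5 datagram (version={version})")
    flows = []
    for i in range(count):
        f = RECORD.unpack_from(datagram, HEADER.size + i * RECORD.size)
        flows.append({
            "src": socket.inet_ntoa(f[0]),
            "dst": socket.inet_ntoa(f[1]),
            "packets": f[5],
            "bytes": f[6],
            "src_port": f[9],
            "dst_port": f[10],
            "protocol": f[13],
        })
    return flows

# Build a one-record sample datagram so the parser can be exercised offline.
sample = HEADER.pack(5, 1, 0, 0, 0, 0, 0, 0, 0) + RECORD.pack(
    socket.inet_aton("10.20.0.5"), socket.inet_aton("10.30.1.11"),
    socket.inet_aton("0.0.0.0"),
    1, 2,                 # input and output interface indexes
    100, 150000,          # packets and bytes
    0, 1000,              # first and last timestamps (sysUptime, ms)
    2049, 51000,          # source and destination ports
    0, 0x18, 6, 0,        # pad, TCP flags, protocol (6 = TCP), ToS
    0, 0, 24, 24, 0,      # AS numbers, prefix masks, pad
)
parsed = parse_v5(sample)
```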

For organizations delivering AI services in regulated environments, NFO’s on-premises, air-gap-compatible architecture ensures that AI infrastructure telemetry stays inside the security boundary. See the NFO Government Solution Brief for deployment architecture details relevant to classified and sensitive environments.

The Bottom Line

NFO does not provide visibility into the GPU compute fabric. That requires dedicated RDMA monitoring tools. What it provides is continuous, enriched network telemetry for everything that crosses the AI cluster boundary: the storage traffic that feeds training runs, the inference traffic that serves users, the management traffic that operates the cluster, and the outbound connections that could indicate a security incident.

For most enterprises deploying AI infrastructure, the cluster boundary is where the security and operational visibility gaps are largest and least addressed. The RDMA fabric has specialized monitoring tooling built around it. The front-end network is frequently less instrumented.

The GPU cluster is the new crown jewel of enterprise infrastructure. The network around it deserves the same visibility as any other critical asset.

About DT Asia

DT Asia was founded in 2007 with a clear mission: to establish market entry for pioneering IT security solutions from the US, Europe and Israel.

Today, DT Asia is a regional, value-added distributor of cybersecurity solutions providing cutting-edge technologies to key government organisations and top private sector clients including global banks and Fortune 500 companies. We have offices and partners around the Asia Pacific to better understand the markets and deliver localised solutions.

How we help

If you would like to know more about monitoring AI infrastructure and network visibility for security and operations teams, you are in the right place, and we are here to help. DT Asia (DTA) is a distributor for NetFlow Logic, particularly in Singapore and the wider Asia region. Our technicians have deep experience with the product and its related technologies, and we provide turnkey solutions for it, including consultation, deployment, and maintenance services.

To learn more, visit: https://dtasiagroup.com/netflowlogic/