Tuesday 20 June 2023

Security automation with Cisco XDR

Cisco XDR, Cisco Career, Cisco Skills, Cisco Jobs, Cisco Prep, Cisco Preparation, Cisco Learning, Cisco Certification, Cisco Guides

Security Operations Centers (SOC) continue to face new and emerging threats that test the limits of their tooling and staff. Attackers have simple, affordable access to a plethora of cloud-based computing resources and can move quicker than ever. Keeping up with threats is no longer about adding more people to the SOC to watch logs and queues. It’s about leveraging automation to match the speed of your attackers. This past April, at the RSA Conference in San Francisco, Cisco announced our new eXtended Detection and Response (XDR) product: Cisco XDR. Cisco XDR combines telemetry and enrichment from a wide variety of products, both Cisco and third party, to give you a single place to correlate events, investigate, and respond to automatically enriched incidents. No modern XDR product is complete without automation, and Cisco XDR has multiple automation features built in to accelerate how your SOC battles their enemies.

Response Playbooks


Having visibility from an incident is step one, but being able to quickly take meaningful response actions is vital. In Cisco XDR, the new incident manager has what we’re calling the response playbook. The response playbook is a series of suggested tasks and actions broken down into four phases (based on SANS PICERL):

  • Identification – Review the incident details and confirm that a breach of policy has occurred.
  • Containment – Prevent malicious resources from continuing to impact the environment.
  • Eradication – Remove the malicious artifacts from the environment.
  • Recovery – Validate eradication and recover or restore impacted systems.

Each of these four phases has its own tasks that guide the analyst through completing relevant steps, but the one to focus on from an automation perspective is containment. Let’s say you have a few endpoints you want to isolate but they’re managed by different endpoint detection and response (EDR) products. Two are managed by Cisco Secure Endpoint and another is managed by CrowdStrike. With both of these products integrated into Cisco XDR, all you need to do is click “Select” on the “Contain Incident: Assets” task, select the endpoints to contain, and click “Execute.” We’ll handle the rest from there using an automated workflow in Cisco XDR Automation (explained in more detail in the next section). The workflow will check which endpoints are in which EDR and take the corresponding actions in each product. Improving the analyst’s ability to identify and execute a response action from within an incident is one of the many ways Cisco XDR helps your SOC accelerate its operations.
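The containment workflow described above (check which EDR manages each endpoint, then take the matching action in each product) can be sketched roughly as follows. This is only an illustration of the dispatch idea, not how Cisco XDR implements it: the hostnames, the inventory dictionary, and the isolate functions are all hypothetical stand-ins for real integrations.

```python
from collections import defaultdict

# Hypothetical inventory: which EDR product manages each endpoint.
# Hostnames and product keys are invented for this sketch.
ENDPOINT_TO_EDR = {
    "laptop-01": "secure_endpoint",
    "laptop-02": "secure_endpoint",
    "server-07": "crowdstrike",
}


def isolate_secure_endpoint(hosts):
    # Placeholder for a call into a Cisco Secure Endpoint integration.
    return [f"secure_endpoint: isolated {h}" for h in hosts]


def isolate_crowdstrike(hosts):
    # Placeholder for a call into a CrowdStrike integration.
    return [f"crowdstrike: isolated {h}" for h in hosts]


# Map each EDR key to the (hypothetical) action that isolates its hosts.
ISOLATION_ACTIONS = {
    "secure_endpoint": isolate_secure_endpoint,
    "crowdstrike": isolate_crowdstrike,
}


def contain_assets(selected_hosts):
    """Group the selected hosts by managing EDR and isolate each group."""
    by_edr = defaultdict(list)
    for host in selected_hosts:
        by_edr[ENDPOINT_TO_EDR.get(host, "unmanaged")].append(host)

    results = []
    for edr, hosts in by_edr.items():
        action = ISOLATION_ACTIONS.get(edr)
        if action is None:
            # No integrated EDR knows this host, so report it and move on.
            results.extend(f"skipped {h}: no integrated EDR" for h in hosts)
        else:
            results.extend(action(hosts))
    return results


if __name__ == "__main__":
    for line in contain_assets(["laptop-01", "server-07", "printer-03"]):
        print(line)
```

The analyst only selects hosts; the routing table decides which product acts on each one, which is the point of the single “Execute” click.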


Automated Workflows


With automation being a core component of how we achieve XDR outcomes, it should come as no surprise that Cisco XDR has a fully featured automation engine built in. Cisco XDR Automation is a no-to-low code, drag-and-drop workflow editor that allows your SOC to accelerate how it investigates and responds, among other things. You can do this by importing workflows from Cisco or by writing your own. To take automation to the next level in Cisco XDR, we have a new concept called Automation Rules. These rules allow you to define criteria that determine when a workflow is executed. Here are some example rule types and when you might use them:

  • Approval Task – Take response actions after an approval task is approved, or notify the team if a request is denied.
  • Email – Investigate suspicious or user-reported emails as they arrive in a spam or phishing investigation mailbox.
  • Incident – Enrich incidents with additional context, take automated response actions, assign to an analyst, push data to other systems like ServiceNow, and more.
  • Schedule – Automate repetitive tasks like auditing configurations, collecting data, or generating reports.
  • Webhook – Integrate with other systems that can call a webhook when something interesting happens. A message being sent to a bot in Webex, for example.

Cisco XDR Automation allows you to move data between systems that don’t know how to communicate with each other, use custom or third party tools to enrich incidents as they’re generated, or tailor how your analysts respond to threats based on your standard operating procedures.
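The idea behind Automation Rules, criteria that decide when a workflow runs, can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the product’s rule engine: the `AutomationRule` class, the event fields (`type`, `priority`, `mailbox`), the threshold, and the workflow names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AutomationRule:
    name: str
    rule_type: str                      # e.g. "incident", "email", "webhook"
    condition: Callable[[dict], bool]   # criteria evaluated against the event
    workflow: str                       # name of the workflow to execute


def workflows_to_run(rules, event):
    """Return the workflows whose rule type and criteria match the event."""
    return [
        rule.workflow
        for rule in rules
        if rule.rule_type == event["type"] and rule.condition(event)
    ]


# Hypothetical rules; field names and the threshold are invented.
RULES = [
    AutomationRule(
        name="enrich-high-priority",
        rule_type="incident",
        condition=lambda e: e.get("priority", 0) >= 800,
        workflow="Enrich incident and open ticket",
    ),
    AutomationRule(
        name="phish-triage",
        rule_type="email",
        condition=lambda e: "phishing" in e.get("mailbox", ""),
        workflow="Investigate reported email",
    ),
]

if __name__ == "__main__":
    print(workflows_to_run(RULES, {"type": "incident", "priority": 900}))
```

Keeping the criteria separate from the workflow means the same workflow can be reused under several rules, for example once on a schedule and once on an incoming webhook.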


APIs


Finally, the core of what powers much of Cisco XDR is its APIs. Cisco XDR has a robust set of APIs that allow you to extend most of the functionality you see in the product out to other systems. You can use Cisco XDR APIs to scrape observables from a block of text (shown below in Postman), gather intelligence from integrated products, conduct an investigation, take response actions using integrated products, and more. The flexibility to use Cisco XDR via APIs allows your SOC to customize your processes at a granular level. Want to enrich tickets in your ticketing platform with intelligence from your security products? We have APIs for that. Want to allow analysts to approve remediation actions by messaging a bot in Webex? We can do that too. Cisco XDR has a full suite of APIs that can help you take your security operations to the next level.
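To make “scraping observables from a block of text” concrete, here is a small local sketch of the concept. It does not call the Cisco XDR API and does not reflect its request or response format; the regular expressions, the `extract_observables` function, and the output shape are assumptions chosen for illustration, and they are deliberately simplistic.

```python
import re

# Deliberately simple patterns; real observable parsing handles many
# more types and edge cases (defanged indicators, IPv6, URLs, and so on).
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DOMAIN = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)
SHA256 = re.compile(r"\b[a-fA-F0-9]{64}\b")


def extract_observables(text):
    """Return a list of {"type", "value"} dicts found in free-form text."""
    found = []
    for kind, pattern in (("ip", IPV4), ("domain", DOMAIN), ("sha256", SHA256)):
        for match in pattern.finditer(text):
            found.append({"type": kind, "value": match.group()})
    return found


if __name__ == "__main__":
    note = "Host 10.0.0.5 contacted evil.example.com and dropped " + "a" * 64
    for observable in extract_observables(note):
        print(observable)
```

Once a ticket comment or chat message has been turned into typed observables like these, the enrichment and response steps the article describes have something structured to work with.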


Conclusion

The crucial takeaway from this blog is that automation is a key component of modern security operations. The threats we face evolve constantly and move quickly, and many security teams lack enough skilled staff to monitor all of their tools. We need to use automation to keep up and get ahead of bad actors. From an industry perspective, we also recognize that many teams are trying to do more work with fewer people. Automation can help with that too. We want to enable your SOC to automate the things they don’t want to do and accelerate the tasks that truly matter. All of this and more can be done with Cisco XDR.

Source: cisco.com

Saturday 17 June 2023

The Power of 5G for the Connected Future

Cisco Career, Cisco Tutorial and Materials, Cisco Career, Cisco Skills, Cisco Jobs, Cisco Guides, Cisco Learning

Trains, planes, automobiles… they all move fast — their network should, too. Transportation systems depend on a strong, secure network – drivers, passengers, employees, and even autonomous operations all rely on it.

5G connectivity is key to unlocking next-gen transportation networks and applications. Given the critical importance of safety in the transportation ecosystem, in addition to ensuring a seamless user experience, having ubiquitous and extremely reliable connectivity is mission critical. Managing multi-access technologies such as public 5G, private 5G, and Wi-Fi will play a pivotal role in ensuring reliable and secure connectivity across transportation use cases.

Traditional networks forced data back to centralized nodes, which increased latency because those nodes were farther away from where the data originated. With 5G, these nodes can now be decentralized and distributed in cloud deployments, bringing applications and the internet closer to the vehicle and allowing unprecedented low-latency connectivity. Additionally, 5G provides improved security to help car manufacturers and fleet managers meet connected vehicle application security requirements.

Next-gen experiences in the connected car


The connected car has evolved since the early days of sending a signal once the vehicle was in an accident. Today’s connected vehicle has become a bidirectional communication channel. It needs to be able to communicate with the internet, other vehicles, roadways, intersections, and more for traffic, safety, and even entertainment use cases. Automotive OEMs must navigate how to seamlessly move a vehicle between environments, using multiple access technologies, and maintain network visibility, control, and reporting.

Connected cars are the most sophisticated Internet of Things (IoT) devices today with use-cases (onboard applications or services) ranging from notifying drivers of upcoming road hazards, emergency vehicles, or pedestrians in intersections, to telematics services that enable predictive maintenance of vehicle components, infotainment services to enable audio and video streaming apps (Netflix, Spotify), on-board Wi-Fi, high-definition maps, and a marketplace for retail use-cases.

In addition to these use-cases, OEMs are looking at 5G as a critical enabler for autonomous driving with V2X services – where the car communicates with neighboring vehicles, roadway infrastructure, and an edge cloud – which requires periodic mapping updates and predictive intelligence with automated assurance to detect service anomalies and drive corrective actions. Additionally, software defined vehicles require frequent software updates (FOTA/SOTA) which require reliable, high-bandwidth connectivity.

Webex integration is another application that OEMs are choosing to enable as a new service for their customers by making their vehicle a mobile connected office. Ford and Mercedes-Benz AG’s recent partnerships with Cisco to enable Webex conferencing in their vehicles pave the way for mainstream adoption by other OEMs.

Commercial Vehicle (CV) OEMs are also leading adoption of autonomous trucking (AT) technologies and building homegrown fleet management solutions. Pervasive connectivity with edge deployments supporting mission-critical V2X communications is a prerequisite for CV OEMs to embrace autonomous trucking. Platooning, considered to be the first commercial AT application, is expected to generate TCO savings of ~45% by the end of this decade. Fleet management solutions for electrified, autonomous trucks will subsequently leverage 5G connectivity for predictive diagnostics and maintenance of vehicle components and powertrain. Figure 1 has an overview of connected vehicle 5G-enhanced use-cases.

Figure 1. Connected vehicle 5G-enhanced use cases

Cisco’s vision for a 5G connected transportation future

To achieve this vision of a 5G connected future in transportation, we are enabling vehicle OEMs to take the control needed to deliver a safer and more sustainable fleet. That requires deep integration with networks and a deep understanding of the quality of service (QoS) that comes from it.

QoS becomes critical for services that depend on specific characteristics or SLAs like safety or autonomous driving. OEMs need to know how vehicles are performing, and to be able to address issues as they arise, not open a ticket with their communications service provider (CSP) and wait for a response. They need a framework where CSPs allow them certain control and configuration privileges, like applying a slice to a network service or deploying additional edge nodes when capacity dictates they are needed.

This level of control will allow OEMs to provide unique customer experiences, with a reliable QoS to deliver their services. The car becomes a digital extension of the passenger’s journey, whether it’s a privately owned vehicle, or a shared mobility service. And this goes beyond the connected car.

OEMs and municipalities must work together to build intelligent systems that will power the connected roads and corridors.  They must learn how to bring disparate sources of data together, process them into intelligent decisions and then feed that information back to drivers or infrastructure that can act upon it.

The next generation of both cars and networks will change transportation and mobile networks in ways we can’t even fathom yet. But unless you have a strategy for how to bring these two together, you will struggle to unlock the power that is just at our fingertips.

Source: cisco.com

Tuesday 13 June 2023

Announcing Cisco ISE 3.3

Cisco ISE 3.3, Cisco Certification, Cisco Career, Cisco Skills, Cisco Jobs, Cisco Tutorial and Materials, Cisco Prep, Cisco Tutorial and Materials

If you were at Cisco Live 2023 in Las Vegas, you surely saw that Cisco announced a lot of new products. One of these new products was the update to Cisco Identity Services Engine (ISE 3.3).

Every network admin or security operator has the same issue: you’re trying to enhance your network’s security, while adding visibility and boosting efficiency, all without sacrificing flexibility. In other words, you want more features without the complications. Cisco ISE 3.3 has that.

Split Upgrade and Multi-Factor Classification adds flexibility


When it comes to flexibility, Cisco ISE 3.3’s Split Upgrade feature will change the way you look at ISE upgrades. Customers can be hesitant to update to the newest version of Cisco ISE, because it can take a long time for ISE nodes with large databases to complete the upgrade. Split Upgrade is a new, less complex process in which files are downloaded and prechecks are done before the upgrade begins. Split Upgrade gives you better control over which ISE nodes to upgrade at any given time, without any downtime.

Cisco Identity Services Engine (ISE) Dashboard

Another feature in Cisco ISE 3.3 provides a way to easily identify clusters of unidentified endpoints found on the network. These endpoints are unidentified because a variety of endpoints that are not directly provisioned by IT often connect to the network. This feature uses AI/ML Profiling and multi-factor classification (MFC) to quickly identify clusters of identical unknown endpoints via a cloud-based ML engine. From there, the profiling policies proposed by the ML engine can be reviewed, and the devices can be labeled by MFC Hardware Manufacturer, MFC Hardware Model, MFC Operating System, and MFC Endpoint Type.

By placing the unidentified device into one of these four buckets, Cisco ISE has taken a big chunk of guessing what goes where out of the equation. From there it’s easier for the customer to determine what the endpoints are and what policies should govern them when on the network.

Unique to Cisco: Wi-Fi Edge Analytics


A Cisco-only feature called Wi-Fi Edge Analytics will allow network admins to mine data from Apple, Intel, and Samsung devices to improve profiling. Cisco Catalyst 9800 wireless controllers will pass along endpoint-specific attributes, such as model, OS version, and firmware, among others, to ISE via RADIUS. From there, this information will be used to profile common endpoints found on the network. Network admins will now have more data, allowing them to create more defined profiles. The more information that is at the fingertips of the admin, the more precise the profile.

Even More Flexibility with Controlled Application Restart


To increase efficiency and predictability and to reduce downtime, Cisco ISE 3.3 offers Controlled Application Restart. It benefits customers by saving them time and eliminating a lot of the headaches that come with managing ISE admin certificates. Customers can now control the replacement of the ISE administrative certificate, allowing them to plan for maintenance before their current certificate expires. Prior to this new feature, a certificate replacement required a complete reboot of all the PSNs in the deployment without the ability to know or control the order of the reboot, which could cause some admins to allow the certificate to lapse.

Changes to certificates require a restart because they affect systemwide configuration, and the restart cannot be done during operational hours because it requires significant downtime. However, Cisco ISE 3.3 now provides the flexibility to schedule the restart at the network admin’s convenience, such as during the middle of the night or on a weekend when network usage is low. This avoids that downtime during operational hours and helps security updates proceed without disruption.

Controlled Application Restart is a response to an industry trend where customers are moving to short-term certificates for added security. This new feature is beneficial because the maintenance needed to update a certificate, which can take upwards of 30 minutes per certificate, can be scheduled for the middle of the night, when network use is low, saving both time and resources.

Improved Insights with pxGrid Direct Visibility


pxGrid Direct Visibility has improved since the last iteration of Cisco ISE (ISE 3.2): customers now get richer endpoint attributes from external databases such as ServiceNow, and these attributes can be shown in Context Visibility. Whether the data describes endpoints, users, devices, or the apps running over the network, it provides a lot of information, such as the device type, the device owner, and whether the device is operational.

Getting this endpoint data in an easily accessible fashion allows you to make better network decisions based on facts. This data can then be used to run the network more efficiently, allowing for a safer network and less time spent on translating information.

Tougher Security with the TPM Chip


The new TPM Chip (for supported hardware) is a response to the need for increased security. Found on the new SNS-3700 models and in some virtual environments (in the form of a virtual TPM), the TPM chip is a dedicated chip where sensitive information can be stored. Previously, if Cisco ISE used a password to connect to a database, it was stored in the file system, which is less secure. Now, with the information housed on the physical TPM Chip, and with the ability to create true random numbers for key generation, that information is more difficult to access, which makes the chip a more secure place to store it.

With the number of new features and functionality that comes to you with the latest Cisco ISE 3.3 update, your network’s security will be enhanced, and you will notice an increase in efficiency and visibility.


Source: cisco.com

Thursday 8 June 2023

Empowering an extensible observability ecosystem with the Cisco Full-Stack Observability Platform

Cisco Career, Cisco Skill, Cisco Job, Cisco Learning, Cisco Tutoria and Materials, Cisco Preparation

Businesses today are digitally led, providing experiences to their customers and consumers through applications. The environments these applications are built upon are complex and evolve rapidly, requiring that IT teams, security teams, and business leaders can observe all aspects of their applications’ performance and tie that performance to clear business outcomes. This calls for a new type of platform that can scale as a business scales and easily extend across an organization’s entire infrastructure and application lifecycle. It’s critical for leaders to have complete visibility, context, and control of their applications to ensure their stakeholders, from employees to business partners to customers, are empowered with the best experiences possible.

What is Cisco Full-Stack Observability (FSO) Platform?


The Cisco FSO Platform is an open and extensible, API-driven full stack observability (FSO) platform focused on OpenTelemetry and anchored on metrics, events, logs, and traces (MELT), providing AI/ML driven analytics as well as a new observability ecosystem delivering relevant and impactful business insights through new use-cases and extensions.

Benefits of The Cisco FSO Platform


Cisco’s FSO Platform is future-proof and vendor agnostic, bringing data together from multiple domains — including application, networking, infrastructure, security, cloud, sustainability — and business sources. It is a unified observability platform enabling extensibility from queries, data ingestion pipelines and entity models all the way to APIs and a composable UI framework.

This provides Cisco customers with in-context, correlated, and predictive insights which enables them to reduce time to resolve issues, optimize their own users’ experiences, and minimize business risk — all with the additional flexibility to extend the Cisco FSO Platform’s capabilities with the creation of new or customized business use cases. This extensibility unleashes a diverse ecosystem of developers who can create new solutions or build upon existing ones to rapidly add value with observability, telemetry, and actionable insights.

Cisco FSO Platform Diagram

First Application on the Cisco FSO Platform – Cloud Native Application Observability


Cloud Native Application Observability is a premier solution delivered on the Cisco FSO Platform. It is Cisco’s extensible application performance management (APM) solution for cloud native architectures. With business context, and now running on the Cisco FSO Platform, it helps customers achieve business outcomes, make the right decisions about digital experiences, ensure performance aligns with end-user expectations, and prioritize and reduce risk while securing workloads.

The following are some of the modules built on Cisco FSO Platform that work with Cloud Native Application Observability.

Modules built by Cisco

Cost Insights: This module provides visibility and insights into application-level costs alongside performance metrics, helping businesses understand the fiscal impact of their cloud applications. It leverages advanced analytics and automation to identify and eliminate unnecessary costs, while also supporting sustainability efforts.

Application Resource Optimizer: This module provides deeper insights into a Kubernetes workload and provides visibility into the workload’s resource utilization. It helps to identify the best candidates for optimization—and reduce your resource utilization. Running continuous AI/ML experiments on workloads, the Application Resource Optimizer creates a utilization baseline, and offers specific recommendations to help you improve. It analyzes and optimizes application workloads to maximize resource usage and reduce excessive cloud spending.

Security Insights: This module provides Business Risk Observability for cloud-native applications. It provides cloud native infrastructure insights to locate threats and vulnerabilities; runtime data security to detect and protect against leakage of sensitive data; and business risk prioritization for cloud security. By integrating features from our market-leading portfolio of security solutions, security and application teams have expanded threat visibility, and the intelligent business risk insights to respond in real-time to revenue-impacting security risks and reduce overall organizational risk profiles.

Cisco AIOps: This module helps to visualize contextualized data relevant to infrastructure, network, incidents, and performance of a business application, all in one place. It simplifies and optimizes IT operations and accelerates time-to-market for customer-specific AIOps capabilities and requirements.

Modules built by Partners

Evolutio Fintech: This module helps to reduce revenue losses for financial customers resulting from credit card authorization failures. It monitors infrastructure health impact on hourly credit card authorizations aggregated based on metadata like region, schemas, infra components and merchants.

CloudFabrix vSphere Observability and Data Modernization: This module helps to observe vSphere through the FSO platform and enriches vSphere and vROps data with your environment’s Kubernetes and infrastructure data.

Kanari Capacity Planner and Forecaster: This module provides insights into infrastructure risk factors that have been determined through predictive ML algorithms (ARIMA, SARIMA, LSTM). It helps to derive capacity forecasts and plans using these insights and a baseline capacity forecast to analyze changing capacity needs over time.

Source: cisco.com

Tuesday 6 June 2023

Understanding Application Aware Routing (AAR) in Cisco SD-WAN

One of the main features used in Cisco SD-WAN is Application Aware Routing (AAR). It is often advertised as an intelligent mechanism that automatically changes the routing path of applications, thanks to its active monitoring of WAN circuits to detect anomalies and brownout conditions.


Customers and engineers alike love to wield the power to steer the application traffic away from unhealthy circuits and broken paths. However, many may overlook the complex processes that work in the background to provide such a flexible instrument.

In this blog, we will discuss the nuts and bolts that make the promises of AAR a reality and the conditions that must be met for it to work effectively.

Setting the stage


To understand what AAR can and cannot do, it’s important to understand how it works and the underlying mechanisms running in unison to deliver its promises.

To begin, let’s first define what AAR entails and its accomplices:

Application Aware Routing (AAR) allows the solution to recognize applications and/or traffic flows and set preferred paths throughout the network to serve them appropriately according to their application requirements. AAR relies on Bidirectional Forwarding Detection (BFD) probes to track data path characteristics and liveliness so that data plane tunnels between Cisco SD-WAN edge devices can be established, monitored, and their statistics logged. It uses the collected information to determine the optimal paths through which data plane traffic is sent inside IPsec tunnels. These characteristics encompass packet loss, latency, and jitter.

The information above describes the relationship between AAR and BFD, but it’s crucial to note that they are separate mechanisms. AAR relies on the BFD daemon by polling its results to determine the preferred path configured, based on the results of the BFD probes sent through each data plane tunnel.

It is a logical next step to explain how BFD works in SD-WAN as described in the Cisco SD-WAN Design Guide:

On Cisco WAN Edge routers, BFD is automatically started between peers and cannot be disabled. It runs between all WAN Edge routers in the topology encapsulated in the IPsec tunnels and across all transports. BFD operates in echo mode, which means when BFD packets are sent by a WAN Edge router, the receiving WAN Edge router returns them without processing them. Its purpose is to detect path liveliness and it can also perform quality measurements for application aware routing, like loss, latency, and jitter. BFD is used to detect both black-out and brown-out scenarios.
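As a rough illustration of how echoed probes can yield the three quality measurements named above, the sketch below derives loss, latency, and jitter from a list of round-trip times. The formulas are common textbook definitions chosen for this example (in particular, jitter as the mean absolute difference between consecutive round-trip times); they are assumptions, not a statement of how Cisco SD-WAN computes these values.

```python
from statistics import mean


def path_metrics(rtts_ms):
    """Summarize one batch of echo probes.

    rtts_ms holds one entry per probe sent: the round-trip time in
    milliseconds, or None if the probe was never returned.
    Returns (loss_pct, latency_ms, jitter_ms).
    """
    if not rtts_ms:
        raise ValueError("at least one probe is required")

    received = [rtt for rtt in rtts_ms if rtt is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(received)) / len(rtts_ms)

    # Latency: average round-trip time of the probes that came back.
    latency_ms = mean(received) if received else None

    # Jitter (assumed definition): mean absolute change between
    # consecutive round-trip times.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = mean(deltas) if deltas else 0.0

    return loss_pct, latency_ms, jitter_ms


if __name__ == "__main__":
    print(path_metrics([20, 22, None, 24, 20]))
```

A blackout shows up here as 100% loss, while a brownout shows up as elevated loss, latency, or jitter on a path that is still technically alive.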

Searching for ‘the why’


Understanding the mechanism behind AAR is essential to comprehend its creation and purpose. Why are these measurements taken, and what do we hope to achieve from them? As Uncle Ben once said to Spider-Man, “With great power comes great responsibility.”

Abstraction power and transport independence require significant control and management. Every tunnel built requires a reliable underlay, making your overlay only as good as the underlay it uses.

Service Level Agreements (SLAs) are crucial for ensuring your underlay stays healthy and peachy, and your contracted services (circuits) are performing as expected. While SLAs are a legal agreement, they may not always be effective in ensuring providers fulfill their part of the bargain. In the end, it boils down to what you can demonstrate to ensure that providers keep their i’s dotted and their t’s crossed.

In SD-WAN, you can configure SLAs within the AAR policies to match your application’s requirements or your providers’ agreements.

The averaged BFD measurements (covered in the next section) are compared against the thresholds (SLAs) configured in the AAR policy. Any tunnel that does not satisfy those SLAs will be flagged, logged, and excluded from AAR path selection.
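The comparison against configured thresholds can be sketched as below. The `SlaClass` structure, the tunnel names, and every number are hypothetical values chosen for the example; this is not SD-WAN configuration syntax or the actual policy engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaClass:
    name: str
    max_loss_pct: float
    max_latency_ms: float
    max_jitter_ms: float


def compliant_tunnels(sla, tunnel_stats):
    """Return the tunnels whose averaged stats satisfy every SLA threshold.

    tunnel_stats maps a tunnel name to (loss_pct, latency_ms, jitter_ms).
    """
    return [
        name
        for name, (loss, latency, jitter) in tunnel_stats.items()
        if loss <= sla.max_loss_pct
        and latency <= sla.max_latency_ms
        and jitter <= sla.max_jitter_ms
    ]


# Hypothetical SLA class and averaged measurements.
VOICE = SlaClass("voice", max_loss_pct=1.0, max_latency_ms=150.0, max_jitter_ms=30.0)
STATS = {
    "mpls": (0.2, 40.0, 5.0),
    "internet": (3.0, 90.0, 12.0),   # too much loss
    "lte": (0.5, 180.0, 20.0),       # too much latency
}

if __name__ == "__main__":
    print(compliant_tunnels(VOICE, STATS))
```

A tunnel must meet all three thresholds at once; failing any single one is enough to take it out of consideration for that SLA class.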

Measure, measure, measure!


Having covered the what, who, and the often-overlooked why, it’s time to turn our attention to the how!

As noted previously, BFD measures link liveliness and quality. In other words, it collects, registers, and logs the resulting data. Once logged, the next step is to normalize and compare the data by averaging the measurements.

Now, how does SD-WAN calculate these average values? By default, quality measurements are collected and represented in buckets. Those buckets are then averaged over time. The defaults are 6 buckets, each covering one poll interval of 10 minutes, with one hello sent every 1000 msec.

Cisco Career, Cisco Skills, Cisco Jobs, Cisco Prep, Cisco Preparation, Cisco Guides, Cisco Tutorial and Materials, Cisco

Putting it all together (by default):

◉ 6 buckets
◉ Each bucket is 10 minutes long
◉ One hello per second, or 1000 msec intervals
◉ 600 hellos are sent per bucket
◉ The average calculation is based on all buckets
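The arithmetic behind these defaults is easy to check. The helper below is a hypothetical convenience function written for this post (the name and the returned keys are not SD-WAN terms); it simply multiplies out the values listed above.

```python
def aar_window(poll_interval_s=600, hello_interval_ms=1000, buckets=6):
    """Derive per-bucket and whole-window figures from the three settings."""
    hellos_per_bucket = poll_interval_s * 1000 // hello_interval_ms
    return {
        "hellos_per_bucket": hellos_per_bucket,
        "hellos_in_window": hellos_per_bucket * buckets,
        "window_minutes": buckets * poll_interval_s // 60,
    }


if __name__ == "__main__":
    print(aar_window())
```

With the defaults, each bucket holds 600 hellos and the average spans 6 x 10 = 60 minutes of measurements, which is why a single bad minute barely moves the result.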

Finding the sweet spot


It’s important to remember that these calculations are meant to be compared against the configured SLAs. As the result is a moving average, brownouts or outages may not be considered by AAR immediately (but they might already be flagged by BFD). With default values, it takes around 3 poll intervals to trigger the removal of a given transport locator (TLOC) from the AAR calculation.
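A toy simulation shows why a moving average reacts with a delay. It assumes a plain arithmetic mean over the buckets and made-up loss figures (0% when healthy, 12% when degraded, a 5% SLA threshold); the real calculation may differ, so treat the output as an illustration of the lag rather than a prediction.

```python
from collections import deque


def intervals_until_violation(healthy_loss, degraded_loss, sla_loss, buckets=6):
    """Count poll intervals until the bucket average exceeds the SLA.

    Starts with every bucket at healthy_loss, then records degraded_loss
    once per poll interval. Returns None if the average never exceeds
    sla_loss even after all buckets are degraded.
    """
    window = deque([healthy_loss] * buckets, maxlen=buckets)
    for interval in range(1, buckets + 1):
        window.append(degraded_loss)   # oldest bucket is dropped automatically
        if sum(window) / buckets > sla_loss:
            return interval
    return None


if __name__ == "__main__":
    print(intervals_until_violation(healthy_loss=0.0, degraded_loss=12.0, sla_loss=5.0))
```

With these assumed numbers the average crosses the threshold on the third degraded bucket. A severe degradation trips it sooner, and a mild one that stays under the threshold never trips it at all.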


Can these values be tweaked for faster AAR decision making? Yes, but it will be a trade-off between stability and responsiveness. Modifying the buckets, multipliers (numbers of BFD hello packets), and frequency may be too aggressive for some circuits to meet their SLAs.

Let’s recall that these calculations are meant to be compared against SLAs configured.


Phew, who would have thought that magic can be so mathematically pleasing?

Source: cisco.com

Saturday 3 June 2023

The Future of Work is Here – and it’s Hybrid

We are excited to be announcing a new blog channel for Cisco – we don’t do this often but believed it was necessary to have a space to tell stories that cut across people, technology, and spaces in one place. In this “Future of Work” channel we’ll be highlighting trends, solutions, and any relevant and interesting topics with a goal of making your journey to great work experiences faster, easier, and more rewarding.

Next week will be our Cisco Live USA event, starting on June 4, 2023. Whether you are attending in person in Las Vegas or digitally, we’ll talk about the Future of Work in various sessions and showcase technology solutions live, both in our partner areas and the Cisco Solution Showcase.

One of the big questions we wondered about last year at Cisco Live was the extent to which “hybrid work” – as in the flexibility to work remotely or in the office – was truly here to stay, or if employees would all come back to the office?

The data indicates that, at least in the USA, about 30% of work days are being taken at home.

Cisco Certification, Cisco Career, Cisco Skills, Cisco Jobs, Cisco Prep, Cisco Preparation, Cisco Tutorial and Materials

Based on this we’re confident that there will be remote workers, at least for the foreseeable future. Employers are also now starting to make the connection between great hybrid work experiences, achieving corporate sustainability goals, reducing real estate space needs, and the role technology plays in it all going forward. This means investing in the right security, collaboration tools, and network to ensure that teams are empowered no matter where their members are located.

Source: cisco.com

Thursday 1 June 2023

Building AI/ML Networks with Cisco Silicon One

It’s evident from the amount of news coverage, articles, blogs, and water cooler stories that artificial intelligence (AI) and machine learning (ML) are changing our society in fundamental ways—and that the industry is evolving quickly to try to keep up with the explosive growth.

Unfortunately, the network that we’ve used in the past for high-performance computing (HPC) cannot scale to meet the demands of AI/ML. As an industry, we must evolve our thinking and build a scalable and sustainable network for AI/ML.

Today, the industry is fragmented between AI/ML networks built around four unique architectures: InfiniBand, Ethernet, telemetry assisted Ethernet, and fully scheduled fabrics.

Each technology has its pros and cons, and various tier 1 web scalers view the trade-offs differently. This is why we see the industry moving in many directions simultaneously to meet the rapid large-scale buildouts occurring now.

This reality is at the heart of the value proposition of Cisco Silicon One.

Customers can deploy Cisco Silicon One to power their AI/ML networks and configure the network to use standard Ethernet, telemetry assisted Ethernet, or fully scheduled fabrics. As workloads evolve, they can continue to evolve their thinking with Cisco Silicon One’s programmable architecture.

Figure 1. Flexibility of Cisco Silicon One

All other silicon architectures on the market lock organizations into a narrow deployment model, forcing customers to commit to an architecture at purchase time and limiting their flexibility to evolve. Cisco Silicon One, however, gives customers the flexibility to program their network into various operational modes and provides best-of-breed characteristics in each mode. Because Cisco Silicon One can enable multiple architectures, customers can focus on the reality of the data and then make data-driven decisions according to their own criteria.

Figure 2. AI/ML network solution space

To help understand the relative merits of each of these technologies, it’s important to understand the fundamentals of AI/ML. Like many buzzwords, AI/ML is an oversimplification of many unique technologies, use cases, traffic patterns, and requirements. To simplify the discussion, we’ll focus on two aspects: training clusters and inference clusters.

Training clusters are designed to create a model using known data. These clusters train the model. This is an incredibly complex iterative algorithm that is run across a massive number of GPUs and can run for many months to generate a new model.

Inference clusters, meanwhile, take a trained model to analyze unknown data and infer the answer. Simply put, these clusters infer what the unknown data is with an already trained model. Inference clusters are computationally much smaller. When we interact with OpenAI’s ChatGPT or Google Bard, we are interacting with the inference models. These models are the result of extensive training, with billions or even trillions of parameters, over a long period of time.

In this blog, we’ll focus on training clusters and analyze how Ethernet, telemetry assisted Ethernet, and fully scheduled fabrics perform.

AI/ML training networks are built as self-contained, massive back-end networks and have significantly different traffic patterns than traditional front-end networks. These back-end networks are used to carry specialized traffic between specialized endpoints. In the past, they were used for storage interconnect. However, with the advent of remote direct memory access (RDMA) and RDMA over Converged Ethernet (RoCE), a significant portion of storage networks are now built over generic Ethernet.

Today, these back-end networks are being used for HPC and massive AI/ML training clusters. As we saw with storage, we are witnessing a migration away from legacy protocols.

The AI/ML training clusters have unique traffic patterns compared to traditional front-end networks. The GPUs can fully saturate high-bandwidth links as they send the results of their computations to their peers in a data transfer known as the all-to-all collective. At the end of this transfer, a barrier operation ensures that all GPUs are up to date. This creates a synchronization event in the network that causes GPUs to be idled, waiting for the slowest path through the network to complete. The job completion time (JCT) measures the performance of the network to ensure all paths are performing well.

Figure 3. AI/ML computational and notification process
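
The barrier behaviour described above can be sketched in a few lines of Python. This is only an illustration of why the slowest path sets the job completion time; the flow sizes, link rates, and function names are hypothetical and are not taken from the study.

```python
# Illustrative sketch only: sizes, rates, and names are assumptions.

def flow_time_s(size_bytes: float, rate_gbps: float) -> float:
    """Seconds needed to move size_bytes at rate_gbps (gigabits per second)."""
    return size_bytes * 8 / (rate_gbps * 1e9)


def job_completion_time_s(flow_sizes_bytes, flow_rates_gbps) -> float:
    """The barrier releases only when every flow is done, so JCT is the maximum."""
    return max(
        flow_time_s(size, rate)
        for size, rate in zip(flow_sizes_bytes, flow_rates_gbps)
    )


# Hypothetical case: eight 64 MB flows; seven paths run at 400 Gbps,
# one congested path only achieves 100 Gbps.
sizes = [64e6] * 8
rates = [400.0] * 7 + [100.0]

ideal = job_completion_time_s(sizes, [400.0] * 8)
jct = job_completion_time_s(sizes, rates)
print(f"ideal JCT: {ideal * 1e3:.2f} ms, actual JCT: {jct * 1e3:.2f} ms")
```

In this made-up case a single slow path quadruples the completion time for every GPU in the job, even though seven of the eight paths finished on time.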
 
This traffic is non-blocking and results in synchronous, high-bandwidth, long-lived flows. It is vastly different from the data patterns in the front-end network, which are primarily built out of many asynchronous, small-bandwidth, and short-lived flows, with some larger asynchronous long-lived flows for storage. These differences along with the importance of the JCT mean network performance is critical.

To analyze how these networks perform, we created a model of a small training cluster with 256 GPUs, eight top of rack (TOR) switches, and four spine switches. We then used an all-to-all collective with a 64 MB collective size and varied the number of simultaneous jobs running on the network, as well as the amount of speedup in the network.
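
To get a feel for the scale of this model, the arithmetic can be sketched as follows. The cluster dimensions come from the text; everything else is an assumption made for illustration: GPUs spread evenly across TORs, jobs splitting the GPUs evenly, a hypothetical 400 Gbps GPU link, and "speedup" read as the ratio of fabric-facing to GPU-facing bandwidth on a TOR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    gpus: int = 256   # from the text
    tors: int = 8     # from the text
    spines: int = 4   # from the text

    @property
    def gpus_per_tor(self) -> int:
        # Assumption: GPUs are spread evenly across the TOR switches.
        return self.gpus // self.tors


def all_to_all_flows(gpus_in_job: int) -> int:
    """In an all-to-all, each GPU sends to every other GPU in its job."""
    return gpus_in_job * (gpus_in_job - 1)


def total_flows(cluster: Cluster, jobs: int) -> int:
    """Assumption: jobs split the GPUs evenly among themselves."""
    return jobs * all_to_all_flows(cluster.gpus // jobs)


def uplink_gbps_per_tor(cluster: Cluster, gpu_link_gbps: float, speedup: float) -> float:
    """Assumption: speedup = fabric-facing bandwidth / GPU-facing bandwidth."""
    return cluster.gpus_per_tor * gpu_link_gbps * speedup


cluster = Cluster()
print(cluster.gpus_per_tor)        # 32 GPUs per TOR
print(total_flows(cluster, 1))     # 65280 flows for one 256-GPU job
print(total_flows(cluster, 16))    # 3840 flows for sixteen 16-GPU jobs
print(uplink_gbps_per_tor(cluster, gpu_link_gbps=400.0, speedup=1.0))
```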

The results of the study are dramatic.

Unlike HPC, which was designed for a single job, large AI/ML training clusters are designed to run multiple simultaneous jobs, similar to what happens in web scale data centers today. As the number of jobs increases, the effects of the load balancing scheme used in the network become more apparent. With 16 jobs running across the 256 GPUs, a fully scheduled fabric results in a 1.9x quicker JCT than standard Ethernet.

Figure 4. Job completion time for Ethernet versus fully scheduled fabric
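
The load balancing effect mentioned above can be illustrated with a toy comparison. This is not the model used in the study: it assumes per-flow placement can be approximated by a random link choice (a stand-in for a hash), and it treats spraying as a perfectly even split. The flow count and link count are hypothetical.

```python
import random


def ecmp_link_loads(flow_sizes, num_links, rng):
    """Pin each flow to a single link, as a per-flow hash would.

    The hash is modelled here as a random pick, which is an assumption.
    """
    loads = [0.0] * num_links
    for size in flow_sizes:
        loads[rng.randrange(num_links)] += size
    return loads


def sprayed_link_loads(flow_sizes, num_links):
    """Idealized spray and re-order: every flow is split evenly over all links."""
    share = sum(flow_sizes) / num_links
    return [share] * num_links


rng = random.Random(0)        # fixed seed so the run is repeatable
flows = [1.0] * 32            # hypothetical: 32 equal-size flows leaving one TOR
ecmp = ecmp_link_loads(flows, num_links=4, rng=rng)
spray = sprayed_link_loads(flows, num_links=4)

print("per-flow placement, busiest link:", max(ecmp))
print("sprayed placement, busiest link: ", max(spray))
```

Because the barrier waits for the busiest link to drain, any imbalance in the per-flow case shows up directly in the job completion time, while the sprayed case never carries more than the average load on any link.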

Studying the data another way, if we monitor the amount of priority flow control (PFC) sent from the network to the GPU, we see that 5% of the GPUs slow down the remaining 95% of the GPUs. In comparison, a fully scheduled fabric provides fully non-blocking performance, and the network never pauses the GPU.

Figure 5. Network to GPU flow control for Ethernet versus fully scheduled fabric with 1.33x speedup
 
This means that, with a fully scheduled fabric, you can connect twice as many GPUs to the same size network. The goal of telemetry assisted Ethernet is to improve the performance of standard Ethernet by signaling congestion and improving load balancing decisions.

As I mentioned earlier, the relative merits of the various technologies differ for each customer and are likely not constant over time. I believe Ethernet and telemetry assisted Ethernet, although lower performing than fully scheduled fabrics, are incredibly valuable technologies and will be deployed widely in AI/ML networks.

So why would customers choose one technology over the other?

Customers who want to enjoy the heavy investment, open standards, and favorable cost-bandwidth dynamics of Ethernet should deploy Ethernet for AI/ML networks. They can improve the performance by investing in telemetry and minimizing network load through careful placement of AI jobs on the infrastructure.

Customers who want to enjoy the full non-blocking performance of an ingress virtual output queue (VOQ), fully scheduled, spray and re-order fabric, resulting in an impressive 1.9x better job completion time, should deploy fully scheduled fabrics for AI/ML networks. Fully scheduled fabrics are also great for customers who want to save cost and power by removing network elements, yet still achieve the same performance as Ethernet, with 2x more compute for the same network.

Cisco Silicon One is uniquely positioned to provide a solution for either of these customers with a converged architecture and industry-leading performance.

Figure 6. Evolve your network with Cisco Silicon One

Source: cisco.com