Showing posts with label Big Data. Show all posts
Showing posts with label Big Data. Show all posts

Wednesday, 1 September 2021

Accelerate Data Lake on Cisco Data Intelligence Platform with NVIDIA and Cloudera

Cisco Data Intelligence Platform, Cisco Prep, Cisco Learning, Cisco Guides, Cisco Tutorial and Materials, Cisco Preparation, Cisco Career

The Big Data (Hadoop) ecosystem has evolved over the years from batch processing (Hadoop 1.0) to streaming and near real-time analytics (Hadoop 2.0) to Hadoop meets AI (Hadoop 3.0). These technical capabilities continue to evolve, delivering the data lake as a private cloud with separation of storage and compute. Future enhancements include support for a hybrid cloud (and multi-cloud) enablement.

Cloudera and NVIDIA Partnerships

Cloudera released the following two software platforms in the second half of 2020, which, together, enables the data lake as a private cloud:

◉ Cloudera Data Platform Private Cloud Base – Provides storage and supports traditional data lake environments; introduced Apache Ozone, the next generation filesystem for data lake

◉ Cloudera Data Platform Private Cloud Experiences – Allows experience- or persona-based processing of workloads (such as data analyst, data scientist, data engineer) for data stored in the CDP Private Cloud Base.

Today we are excited to announce that our collaboration with NVIDIA has gone to the next level with Cloudera, as the Cloudera Data Platform Private Cloud Base 7.1.6. will bring in full support of Apache Spark 3.0 with NVIDIA GPU on Cisco CDIP.

Cisco Data Intelligence Platform (CDIP)

Cisco Data Intelligence Platform (CDIP) is a thoughtfully designed private cloud for data lake requirements, supporting data-intensive workloads with the Cloudera Data Platform (CDP) Private Cloud Base and compute-rich (AI/ML) and compute-intensive workloads with the Cloudera Data Platform Private Cloud Experiences — all the while providing storage consolidation with Apache Ozone on the Cisco UCS infrastructure. And it is all fully managed through Cisco Intersight. Cisco Intersight simplifies hybrid cloud management, and, among other things, moves the management of servers from the network into the cloud.

CDIP as a private cloud is based on the new Cisco UCS M6 family of servers that support NVIDIA GPUs and 3rd Gen Intel Xeon Scalable family processors with PCIe Gen 4 capabilities. These servers include the following:

◉ Cisco UCS C240 M6 Server for Storage (Apache Ozone and HDFS) with CDP Private Cloud Base — extends the capabilities of the Cisco UCS rack server portfolio with 3rd Gen Intel Xeon Scalable Processors, supporting more than 43% more cores per socket and 33% more memory than the previous generation.

◉ Cisco UCS® X-Series for CDP Private Cloud Experiences — a modular system managed from the cloud (Cisco Intersight). Its adaptable, future-ready, modular design meets the needs of modern applications and improves operational efficiency, agility, and scale.

Cisco Data Intelligence Platform, Cisco Prep, Cisco Learning, Cisco Guides, Cisco Tutorial and Materials, Cisco Preparation, Cisco Career

CDIP is designed for hybrid clouds to help customers address the needs of modern apps and extensible data platforms. They can further accelerate their AI/ML and ETL workloads on their data lake with GA of Apache Spark 3.0 enabling GPU-accelerated workloads powered by NVIDIA RAPIDS data science libraries in the CDP Private Cloud Base 7.1.6.

The NVIDIA RAPIDS suite of open-source software libraries gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. RAPIDS uses NVIDIA CUDA and exposes GPU parallelism to accelerate ETL and machine-learning workloads. NVIDIA RAPIDS Accelerator for Apache Spark leverages GPUs to accelerate data processing in Apache Spark 3.0 using the RAPIDS libraries. This allows users to run existing Apache Applications ten times faster with no code changes.

On the AI/ML side, NVIDIA GPUs integrates with libraries like TensorFlow and PyTorch to accelerate the training of Neural Networks for various use cases, such as Computer Vision and Natural Language processing, on a single GPU node or on multiple nodes, reducing the training time from weeks to days (or hours). This saves our customers valuable time.

The Cisco, NVIDIA, and Cloudera three-way partnership brings our joint customers a much richer data lake experience through solution technology advancements, validated designs, and it all comes with full product support.

Source: cisco.com

Thursday, 20 February 2020

Answering The Big Three Data Science Questions At Cisco

Data Science Applied In Business


In the past decade, there has been an explosion in the application of data science outside of academic realms. The use of general, statistical, predictive machine learning models has achieved high success rates across multiple occupations including finance, marketing, sales, and engineering, as well as multiple industries including entertainment, online and store front retail, transportation, service and hospitality, healthcare, insurance, manufacturing and many others. The applications of data science seem to be nearly endless in today’s modern landscape, with each company jockeying for position in the new data and insights economy. Yet, what if I told you that companies may be achieving only a third of the value they could be getting with the use of data science for their companies? I know, it sounds almost fantastical given how much success has already been achieved using data science. However, many opportunities for value generation may be getting over looked because data scientists and statisticians are not traditionally trained to answer some of the questions companies in industry care about.

Most of the technical data science analysis done today is either classification (labeling with discrete values), regression (labeling with a number), or pattern recognition. These forms of analysis answer the business questions ‘can I understand what is going on’ and ‘can I predict what will happen next’. Examples of questions are ‘can I predict which customers will churn?’, ‘can I forecast my next quarter revenue?’, ‘can I predict products customers are interested in?’, ‘are there important customer activity patterns?’, etc… These are extremely valuable questions to companies that can be answered by data science. In fact, answering these questions is what has caused the explosion in interest in applying data science in business applications. However, most companies have two other major categories of important questions that are being totally ignored. Namely, once a problem has been identified or predicted, can we determine what’s causing it? Furthermore, can we take action to resolve or prevent the problem?

I start this article discussing why most data driven companies aren’t as data driven as they think they are. I then introduce the idea of the 3 categories of questions companies care about the most (The Big 3), discuss why data scientists have been missing these opportunities. I then outline how data scientists and companies can partner to answer these questions.

Why Even Advanced Tech Companies Aren’t as Data Driven As They Think They Are.


Many companies want to become more ‘data driven’, and to generate more ‘prescriptive insights’. They want to use data to make effective decisions about their business plans, operations, products and services. The current idea of being ‘data driven’ and ‘prescriptive insights’ in the industry today seems to be defined as using trends or descriptive statistics about about data to try to make informed business decisions. This is the most basic form of being data driven. Some companies, particularly the more advanced technology companies go a step further and use predictive machine learning models and more advanced statistical inference and analysis methods to generate more advanced descriptive numbers. But that’s just it. These numbers, even those generated by predictive machine learning models, are just descriptive (those with a statistical background must forgive me for the overloaded use of the term ‘descriptive’). They may be descriptive in different ways, such as machine learning generating a predicted number about something that may happen in the future, while a descriptive statistic indicates what is happening in the present, but these methods ultimately focus on producing a number. To take action to bring about a desired change in an environment requires more than a number. It’s not enough to predict a metric of interest. Businesses want to use numbers to make decisions. In other words, businesses want causal stories. They want to know why a metric is the way it is, and how their actions can move that metric in a desired direction. The problem is that classic statistics and data science falls short in pursuit of answers to these questions.

Take the example diagram shown in figure 1 below. Figure 1 shows a very common business problem of predicting the risk of a customer churning. For this problem, a data scientist may gather many pieces of data (features) about a customer and then build a predictive model. Once a model is developed, it is deployed as a continually running insight service, and integrated into a business process. In this case, let’s say we have a renewal manager that wants to use these insights. The business process is as follows. First, the automated insight service that was deployed gathers data about the customer. It then passes that data to the predictive model. The predictive model then outputs a predicted risk of churn number. This number is then passed to the renewal manager. The renewal manager then uses their gut intuition to determine what action to take to reduce the risk of churn. This all seems straightforward enough. However, we’ve broken the chain of being data driven. How is that you ask? Well, our data driven business process stopped at the point of generating our churn risk number. We simply gave our churn risk number to a human, and they used their gut intuition to make a decision. This isn’t data driven decision making, this is gut driven decision making. It’s a subtle thing to notice, so don’t feel too bad if you didn’t see it at first. In fact, most people don’t recognize this subtlety. That’s because it’s so natural these days to think that getting a number to a human is how making ‘data driven decisions’ works. The subtlety exists because we are not using data and statistical methods to evaluate the impact of actions the human can take on the metric they care about. A human sees a number or a graph, and then *decides* to take *action*. This implies they have an idea about how their *action* will *effect* the number or graph that they see. Thus, they are making a cause and effect judgement about their decision making and their actions. Yet, they aren’t using any sort of mathematical methods for evaluating their options. They are simply using their personal judgement to make a decision. What can end up happening in this case is that a human may see a number, make a decision, and end up making that number worse.

Let’s take the churn risk example again. Let’s say the customer is 70% likely to churn and that they were likely to churn because their experience with the service was poor, but assume that the renewal manager doesn’t know this (this too is actually a cause and effect statement). Let’s also say that a renewal manager sends a specially crafted renewal email to this customer in an attempt to reduce the likelihood of churn. That seems like a reasonable action to take, right? However, this customer receives the email, and is reminded of how bad their experience was, and is now even more annoyed with our company. Suddenly the likelihood to churn increases to 90% for this customer. If we had taken no action, or possibly a different action (say connecting them with digital support resources) then we would have been better off. But without an analysis of cause and effect, and without systems that can analyze our actions and prescribe the best ones to take, we are gambling with the metrics we care about.

Cisco Prep, Cisco Tutorial and Material, Cisco Guides, Cisco Certifications, Ciscoa Learning
Figure 1

So how can we attempt to solve this problem? We need to incorporate mathematical models and measurement into the business process after the number is generated. We need to collect data on what actions are being taken, measure their relationship with the metrics we care about, and then optimize over our actions using causal inference models and AI systems. Figure 2 below shows how we can insert an AI system into the business process to help track, measure, and optimize the actions our company is taking. Using a combination of mathematical analysis methods, we can begin to optimize the entire process using data science end to end. The stages of this process can be abstracted and generalized as answering 3 categories of questions companies care about. Those 3 categories are described in the next section.

Cisco Prep, Cisco Tutorial and Material, Cisco Guides, Cisco Certifications, Ciscoa Learning

Comparing Machine Learning to Causal Analysis (Inference)


To get a better understanding of what machine learning does and where it falls short, we introduce figure 3 and figure 4 below. Figure 3 and Figure 4 both describe the problem space of understanding cancer. Machine learning can be used to do things like predict whether or not a patient will get cancer given characteristics that have been measured about them. Figure 3 shows this by assigning directed arrows from independent variables to the dependent variable (in this case cancer). These links are associative by their construction. The main point is that machine learning focuses on numbers and the accurate production of a number. This can in many cases be enough to gain a significant amount of value. For example, predicting the path of a hurricane has value on it’s own. There exists no confusion about what should be done given the prediction. If you are in the predicted path of the hurricane, the action is clearly to get out of the way. Sometimes however, we want to know why something is happening. Many times we want to play ‘what-if’ games. What if the patient stopped smoking? What if the patient had less peer pressure? To answer these questions, we need to perform a causal analysis.

Cisco Prep, Cisco Tutorial and Material, Cisco Guides, Cisco Certifications, Ciscoa Learning
Figure 3

Figure 4 below shows a visual example of what causal analysis provides. Causal analysis outputs stories, not just numbers. The diagram shows the directed causal links between all variables in an environment. For example, given this diagram anxiety causes smoking. Causal stories are important any time we or our business stakeholders want to take action to improve the environment. The causal story allows us to quantify cause and effect relationships, play what-if scenarios, and perform root-cause analysis. Machine learning falls short of being able to do this because these all require a modeling of cause and effect relationships.

What Are the Big 3?


Cisco Prep, Cisco Tutorial and Material, Cisco Guides, Cisco Certifications, Ciscoa Learning
Figure 5

Figure 5 above describes what ‘The Big 3’ questions companies care about are. The big 3 questions seem fairly obvious. In fact, these questions are at the foundation of most of problem solving in the real world. Yet, almost all data science in industry today revolves around answering only the first question. What most data scientists understand as supervised, unsupervised, and semi-supervised learning revolves around answering what is happening or what will happen. Even with something like a product recommendation system (which you might believe prescribes something because of the term ‘recommend’), we only know what products a customer is interested in (thus it’s only an indication of interest, not a reason for interest). We don’t know the most effective way to act on that information. Should we send an ad? Should we call them? Do certain engagements with them cause a decrease their chances of purchase? To answer what is *causing* something to happen, we need to rely on foundational work in the area of Causal Inference developed by researchers like Ronald Fisher, Jerzy Neyman, Judea Pearl, Donald B. Rubin, Paul Holland, and many others. Once we understand what is causing a metric we care about, we can at least begin to think intelligently about the actions we can take to change those metrics. This is where the third question mentioned in figure 3 above comes in. To answer this question we can rely on a wide variety of techniques that have been developed including causal inference for the cause and effect relationship between actions and the metrics they are supposed to affect, statistical decision theory, decision support systems, control systems, reinforcement learning, and game theory.

Cisco Prep, Cisco Tutorial and Material, Cisco Guides, Cisco Certifications, Ciscoa Learning
Figure 6

Figure 6 above breaks down some of the methods in a more technical way. The methodology column outlines the major methods, fields, and approaches that can be in general used each of the big 3 questions in turn. The algorithms column lists some specific algorithms that may be applied to answer each of the big 3 questions. While some of these algorithms should be familiar to the average data scientist (deep neural networks, random forests, etc.), others are maybe only known in passing (multi-armed bandits, reinforcement learning, etc). Still more algorithms are likely to be totally new to some data scientists (Difference in Differences, Propensity Score Matching, etc). The main paper delves into each question and the important technical details of the methods used to answer each question. It’s very important to understand these methods, particularly for performing causal analysis and optimizing actions. These methods are highly nuanced, with many difference kinds of assumptions. Naively applying these methods without understanding their limitations and assumptions will almost certainly lead to incorrect conclusions.

Example Use Case for Renewals.


A well known question, in which we have applied the big 3 methodology, is for understanding Cisco product and service renewals. Understanding and predicting renewals is a prime example of how many companies are attempting to get value through data science. The problem is also typically referred to as predicting churn, churn risk prediction, predicting attrition. Focusing on renewals is also useful for demonstration purposes because most of the data science applied to problems of this kind fall short of providing full value. That’s because renewals is a problem where providing a number is not the goal. Simply providing likelihood of a customer to renew is not enough. The company wants to **do** something about this. The company wants to take action to cause an increase in the likelihood to renew. For this, and any other time where the goal is to **do** something, we rely on causal inference and methods for optimizing actions.

Question 1: What is happening or will happen?

As we’ve already stated, the main question that is typically posed to a team of data scientists is ‘Can we accurately predict which customers will renew and which ones won’t?’ While this is a primary question asked by the business, there are many other questions that fall into the area of prediction and pattern mining including,

1. How much revenue can we expect from renewals? What does the distribution look like?
2. What’s the upper/lower bound on the expected revenue predicted by the models?
3. What are the similar attributes among customers likely to churn versus not churn?
4. What are the descriptive statistics for customers likely to churn vs not churn collectively, in each label grouping, and in each unsupervised grouping?

Each of the above questions can be answered systematically by framing them as problems either in prediction or pattern mining, and by using the wide variety of mathematical methods found in the referenced materials in the main paper here. These are the questions and methods data scientists are most familiar, and will most commonly be answered for a business.

Question 2: Why is this happening or going to happen?

Given this first question, the immediate next question is why. Why are customers likely or not likely to churn? For each question that we can build a model for, we can also perform a causal analysis. Thus, we can already potentially double the value that a data science project returns by simply adding on a causal analysis to each predictive model built. It’s important to bring up again that this question is so important that most data scientists are either answering it incorrectly, or are misrepresenting the information from statistical associations.

Specifically, when a data scientist is asked the question of why a customer is likely to churn, they almost exclusively turn to feature importance and local models such as LIME, SHAP, and others. These methods for describing the reason for a prediction are almost always incorrect because there is a disconnect between what the business stakeholder is asking for and what the data scientist is providing because of two different interpretations of the term ‘why’. Technically, one can argue that feature importance measures what features are important to ‘why’ a model makes a prediction, and this would be correct. However, a business stakeholder usually wants to know ‘what is causing the metric itself’ and not ‘what is causing the metric prediction’. The business stakeholder wants to know the causal mechanisms for why a metric is a particular number. This is something that feature importance absolutely does not answer. The stakeholder wants to use the understanding of the causal mechanisms to take an action to change the prediction to be more in their favor. This requires a causal analysis. However, most data scientists simply take the features with highest measured importance and present them to the stakeholder as though they are the answer to their causal question. This is objectively wrong, yet is time and again presented to stakeholders by seasoned statisticians and data scientists.

The issue is compounded by the further confusion added by discussions around ‘interpretable models’ and by the descriptions of feature importance analysis. LIME describes it’s package as ‘explaining what machine learning classifiers (or models) are doing’. While still a technically correct statement, these methods are being used to incorrectly answer causal questions, leading stakeholders to take actions that may have the opposite effect of what they intended.

While we’ve outlined the main causal question, there are a number of questions that can also be asked, and corresponding analysis that can be performed including,

1. How are variables correlated with each other and the churn label? (A non-causal question)

2. What are the important features for prediction in a model in general? (A non-causal question)

3. What are the most important features for prediction for an individual? Do groupings of customers with locally similar relationships exist? (A non-causal question)

4. What are the possible confounding variables? (A causal question)

5. After controlling for confounding variables, how do the predictions change? (A non-causal question benefiting from causal methods)

6. What does the causal bayes net structure look like? What are all of the reasonable structures? (A causal question)

7. What are the causal effect estimates between variables? What about between variables and the class label? (A causal question).

Many of these questions can be answered in whole or in part by a thorough causal analysis using the methods we outlined in the corresponding causal inference section of the main paper here, and further multiply the value returned by a particular data science project.

Question 3: How can we take action to make the metrics improve?

The third question to answer is ‘what actions can a stakeholder take to prevent churn?’ This is ultimately the most valueable of the three questions. The first two question set the context for who to focus on and where to focus efforts. Answering this question provides stakeholders with a directed and statistically valid means to improve the metrics they care about given complex environments. While still challenging given the methods available today (and presented in the section on intelligent action), it provides one of the greatest value opportunities. Some other questions that can be answered related to intelligent action that stakeholders may be interested in include,

1. What variables are likely to reduce churn risk if our actions could influence them?

2. What actions have the strongest impact on the variables that are likely to influence churn risk, or to reduce churn risk directly?

3. What are the important pieces of contextual information relevant for taking an action?

4. What are the new actions that should be developed and tested in an attempt to influence churn risk?

5. What actions are counter-productive or negatively impact churn risk?

6. What does the diminishing marginal utility of an action look like? At what point should an action no longer be taken?

The right method to use for prescribing intelligent action depends largely on the problem and the environment. If the environment is complex, the risks high, and there is not much chance for an automated system to be implemented, then methods from causal inference, decision theory, influence diagrams, and game theory based analysis are good candidates. However, if a problem and stakeholder are open to the use of an automated agent to learn and prescribe intelligent actions, then reinforcement learning may be a good choice. While possibly the most valuable of the big three questions to answer, it also exists as one of the most challenging. There still many open research questions related to answering this question, but the value proposition means that it’s likely an area that will see increased industry investment in the coming years.

How We Are Improving CX By Using Data Science to Answer the Big 3 at Cisco.


Like many other companies, Cisco has many models for answering the first of the big 3 questions. The digital lifecycle journey data science team has many predictive models for understanding Cisco’s customers. This includes analysis of customer purchasing behavior, digital activity, telemetry, support interaction, and renewal activity using a wide variety of machine learning based algorithms. We also apply the latest and greatest forms of advanced statistical and deep learning based supervised learning methods for understanding and predicting the expected behavior of our customers, their interactions with Cisco, and their interactions with Cisco products and services. We go a step further in this area by attempting to quantify and predict metrics valuable to both Cisco and Cisco’s customers. For example, we predict metrics like how a customer is going to keep progressing through the expected engagement with their product over the next several days to next several weeks. This is just one of the many metrics we are trying to understand about the Cisco customer experience. Others include customer satisfaction, customer health, customer ROI, renewal metrics, and many others. These metrics allow us to understand where there may be issues with our journey so that we can start trying to apply data science methods to answer the ‘why’ and ‘intelligent action’ questions we’ve previously mentioned.

We are also using causality to attempt to understand the Cisco’s customer experience, and what causes a good or bad customer experience. We go a step further by trying to complete the causal chain of reasoning to quantify how a customer experience causes Cisco’s business metrics to rise and fall. For example, we’ve used causal inference methods to measure the cause and effect aspects of customer behavior, product utilization, and digital engagements on a customer’s likelihood to renew Cisco services. Using causal inference, we are gaining deeper insights into what is causing our customers and Cisco to succeed or fail, and are using that information to guide our strategy for maximizing the customer experience.

Finally, to answer the third of the big three questions, we are employing causality, statistical decision theory, intelligent agent theory, and reinforcement learning to gain visibility to the impact our activities have on helping our customers and improving Cisco’s business metrics, and to learn to prescribe optimal actions over time to maximize these metrics. We have developed intelligent action systems that we working to integrate with our digital email engagements journeys to optimize our interactions with customers to help them achieve a return on investment. We are, in general, applying this advanced intelligent agent system to quantify the impact of our digital interactions, and to prescribe the right digital customer engagements to have, with the most effective content, at the right time, in the right order, personalized to each and every individual customer.

Why Many Data Scientists Don’t Know the Big 3, or How to Answer Them.


Those learned readers experienced with data science may be asking themselves, ‘is anything new being said here’? It’s true that no new technical algorithm, mathematical analysis, or in depth proof is being presented. I’m not presenting some new mathematical modeling method, or some novel comparison of existing methods. What is new is how I’m framing the problems for data science in industry, and the existing methodologies that can start to solve those problems. Causal inference has been used heavily in medicine for observational studies where controlled trials aren’t feasible, and for things like adverse drug effect discovery. However, Causal Inference hasn’t received wide spread application in areas outside of the medical, economic, or social science fields yet. The idea of prescribed actions is also something that isn’t totally new. Prescribed actions can be thought of as just a restatement of the field of control systems, decision analysis, and intelligent agent theory. However, the utilization of these methods for completing the end-to-end data driven methodology for business hasn’t received wide spread application in industry applications. Why is this? Why aren’t data scientists and businesses working together to frame all of their problems this way?

There could be a couple of reasons for this. The most obvious answer is that most data scientists are trained to answer the first of the big 3 questions. Most data scientists and statisticians are trained on statistical inference, classification, regression, and general unsupervised learning methods like clustering and anomaly detection. Statistical methods like causal inference aren’t widely known, and are therefore not widely taught. Register with any online course, university, or other platform for learning about data science and machine learning and you’ll be hard pressed to find discussions about identifying causal patterns in data sets. The same goes for the ideas of intelligent agents, control systems, and reinforcement learning to a lesser degree. These methods tend to be relegated to domains that have simulators and a tolerance for failure. Thankfully there is less of a gap for these methods. They typically are given their own special courses in either machine learning, electronics, and signals and systems processing curriculums.

Another possible explanation may be that many data scientists in industry tend to be enamored with the latest and greatest popular algorithm or methodology. As math and tech nerds we become enamored with the technical intricacies of how things work, particularly mathematical algorithms and methodologies. We tend to develop models and then go looking for solutions rather than the other way around, potentially blinding us to the methods in data science that can provide business value time and time again.

Yet another explanation may be that many data scientists are not well versed enough in statistics and the statistical literature. Many data scientists are asked questions about how a predictive model produced a number. For example, in our churn risk problem, renewal managers typically want to know why someone is at risk. The average data scientist hears this, and then uses methods like feature importance and more interpretable models to answer this question. However this doesn’t really answer the actual question being asked. The data scientist provides what might be important associations between model inputs and the predicted metric, but this doesn’t provide the information the renewal manager wants. They want to know about information they can act on, which requires cause and effect analysis. This is a classic case of ‘correlation is not causation’ that everyone seems to know but can still trip up even statistically minded data scientists. It’s such an issue that many companies I’ve talked with that claim to provide ‘next best actions’ are statistically invalid (mainly because they use feature importance and sensitivity analysis type methods instead of understanding basic counter factual analysis and confounding variables).

Moving forward the data science community operating in industry domains will become more aware of the big 3 questions and the analysis methods that can be performed to answer them. Companies that can quickly realize value from answering these questions using data science will be at the head of the pack in the emerging data science and insights economy. Companies that focus on answering all of the big 3 questions will have a distinct competitive advantage, and will have transformed themselves to be truly data driven.

Sunday, 20 January 2019

Performance, Scale, and Flexibility for Accelerating AI / ML

A famous line: “I feel the need–the need for speed!” When I talk to data scientists, they often express the same sentiment. Charged with the responsibility of mining value out of numerous data sources, they often feel that they need ever more speed to do their jobs quickly. To support the data scientists, IT teams are always looking for performance, scale, and flexibility. Clearly, the need for accelerating artificial intelligence and machine learning performance is critical to enterprise’s success.

Performance


In anticipation of Cisco Live Barcelona, I am excited to share some of the recent development on the UCS team to help customers to accelerate AI/ML deployment.

First, Cisco has published the performance characteristics of the UCS C480 ML. With 8 x NVIDIA V100 Tesla GPUs, this server is optimized for deep learning workloads as demonstrated with the various popular convolutional neural networks for image classification.

Cisco Study Material, Cisco Guides, Cisco Learning, Cisco Tutorial and Material
C480 ML TensorFlow Training Performance

Scale


While the performance of a single server is important, customers are demanding that a cluster of servers to scale to increasing demands. At Cisco Live Barcelona, we are looking forward to show 2 demos.

NGC on Hortonworks 3

Many customers already have a big data lake, and almost everyone is looking to mine additional value out of the data. With Hortonworks 3, the Hadoop cluster is able to support containers running on CPUs and GPUs. This unified cluster architecture enables customers to do traditional data extract, transform, and load (ETL) before feeding the data to a deep learning algorithm. Since Cisco announced intent to support NVIDIA NGC-Ready program, which provides GPU enabled software containers, Cisco Live Barcelona will be the first time where Cisco demonstrate NGC containers running on UCS, in this case, as part of the Hortonworks cluster. In short, the Hortonworks Hadoop cluster is able to support the complete data pipeline.

Cisco Study Material, Cisco Guides, Cisco Learning, Cisco Tutorial and Material
NGC on Hortonworks 3

NGC on Red Hat OpenShift

While some customers prefer Hadoop as the clustering tool, others are choosing Red Hat OpenShift as the Kubernetes container orchestrator with enterprise support. In Barcelona, we will also show NGC running on OpenShift enabling support for multiple data scientists with both interactive and batch workloads.

Cisco Study Material, Cisco Guides, Cisco Learning, Cisco Tutorial and Material
NGC on Red Hat OpenShift

Flexibility


Kubeflow Pipelines

While many customers have the bulk of their data on-premise, some data scientists would like to do machine learning experiments in the cloud. In November, 2018, Google announced Kubeflow Pipelines. At Cisco Live Barcelona, in partnership with Google, Cisco will demonstrate Kubeflow Pipelines running on both Google Cloud and UCS enabling a true hybrid cloud experience.

Cisco Study Material, Cisco Guides, Cisco Learning, Cisco Tutorial and Material

Inferencing on Hyperflex 4.0

Inferencing is also a critical part of machine learning. At Barcelona, we are also showcasing HyperFlex 4.0 as a edge computing system in a retail environment where video streams can be used for customer sentiment analysis.

Cisco Study Material, Cisco Guides, Cisco Learning, Cisco Tutorial and Material
HyperFlex 4.0 for Inferencing on the Edge

Cisco Live Barcelona 2019


I can’t wait to share with all of you the UCS solutions that can accelerate artificial intelligence and machine learning adoption. With the raw performance of the UCS C480 ML, scale with Hortonworks 3 and Red Hat OpenShift, and flexibility with Kubeflow and inferencing on HyperFlex 4.0, I am looking forward to help our customers to make a genuine difference with the business problems.

Friday, 5 October 2018

Enabling Enterprise-Grade Hybrid Cloud Data Processing with SAP and Cisco – Part 1

The journey towards the intelligent enterprise


When talking about modern data processing in the digital economy, data is often regarded as the new oil. Enterprise companies are already competing in a race for the best mining, extraction and processing technologies to gain better insights into their companies, deals and processes. Winning in this race will ultimately lead to a competitive advantage in the market, since companies with a deep understanding for their businesses will be able to take the most profitable decisions and establish the most beneficial optimizations. For this reason, the way companies are handling data and analytics is changing from pure transactional, ETL-like processing towards adopting modern technologies such as machine learning, intelligent analytics, stream processing on-premise and in the cloud. We regard to this transition as the move towards the ‘intelligent enterprise’.

However, the transition also leads to the fact that data processing landscapes are getting more complex and more heterogeneous. Data is processed in a growing collection of different system, it is distributed over different places; its volume is growing by the day and customers need to orchestrate and integrate cloud technologies with classical on-premise systems. Among all the challenges that come with the journey towards digitalization and the ‘intelligent enterprise’ the following three main challenges emerge as the most pressing points in our customer base:

Cisco Certifications, Cisco Guides, Cisco Learning, Cisco Tutorial and Material

Figure 1 Enterprise data landscapes are growing increasingly complex.

1. The tale about data governance and the lack of data knowledge, security and visibility 


One of the biggest challenges in complex modern enterprise data landscape is the distribution of the data over a growing number of stores and processing systems. This leads to a missing knowledge about data positioning, data characteristics and governance. “What data is available in which store?”, “What are the major characteristics of my data sets?”, “Who changed the data, in what way and who has the permissions to access it?” are typical questions that are hard to answer even within a single company. However, it is key to find a strategy that allows a holistic data governance and data management across the entire company.

2. The legend about enterprise readiness of big data technologies 


In the world of modern data processing technologies and big data management, we observe an incredible growth in tools and technologies a customer can choose from. While in the first place, choice seems to be an advantage, one quickly recognizes that this leads to a zoo of non-integrated system each exhibiting different characteristics, life cycles and environments. It is left to the customer to manage, organize and orchestrate those systems leading to a very high effort to arrive at an enterprise ready data landscape with a well-organized interplay of all components.

3. The story about easily processing enterprise data and big data together 


The adoption of modern big data technologies mainly comes from the fact that augmenting classical enterprise data such as sales figures and revenue data with big data such as sensor streams, social media data collections or mobile device data, allows to create deeper and advanced analyses. However, in most cases enterprise data and big data are kept in different silos and exhibit totally different characteristics. Enterprise data typically comes from classical transactional systems such as ERP systems or transactional data bases, it is well structured and adheres a standardized schema. On the other hand, big data often arrives in its pure form as data streams or data collections stored in data lakes (e.g. Hadoop, S3, GCS). It is often unstructured, misses clear data types and might not adhere to a clear data schema. Accordingly, creating an end-to-end data pipeline across the enterprise that combines business data with big data comes with a considerable effort.

SAP and Cisco jointly recognize that our mutual customers need innovative new solutions that would help them overcome these hurdles in order to fully leverage the value of their distributed data and turn them into actionable insights.

Monday, 12 February 2018

Five things that are bigger than the Internet: Findings from this year’s Global Cloud Index

The scale of the Internet is awe-inspiring. By 2021, there will be 4.6 billion people and 27 billion devices connected to the Internet, and Internet traffic will reach 2.8 trillion Gigabytes (or 2.8 Zettabytes) per year.

Even with multiple Zettabytes crossing the network each year, there are some things that dwarf even the Internet. We started the Global Cloud Index (GCI) seven years ago in order to capture the scale of data more generally, as it relates to the advent of the cloud. In our most recent report, there are five categories of data that meet or exceed Internet traffic volume by 2021.

Thursday, 11 January 2018

Bridging the gap between IT and Business through Big Data Analytics

The role of IT and networking specifically as we know it, is changing in the eyes of many Cisco stakeholders, primarily: (1) IT Experts and (2) Business Leaders. This combined with the changing expectations of the millennial generation will have a huge impact on the next market transition. Simplistically, network’s power to influence business outcomes cannot be underestimated. By combining network data with secondary sources of business critical information: cost, service management, user and compliance data and device diagnostics, companies like Cisco can empower businesses to optimize operations, profitability and decision-making. Concisely, using telemetry we can improve the productivity at any level of the organization.

Saturday, 16 September 2017

Will Driverless Cars Dream in Cisco Hues?

Google, Uber, Ford. They are the big names in driverless vehicles. But did you know Cisco is already empowering nearly every Connected Car on Earth? Through its cloud-based IoT platform, Jasper, Cisco has become a key player in Smart Car technology, helping Honda launch its connected car services across Europe and working with Hyundai to create a new concept for connected car data collection. Plus, Cisco is already partnering with communities like yours to deploy fog and speed sensors along public roads. So we can just imagine what the visionaries at Cisco have in store for driverless cars.