Saturday, 29 February 2020

An Introduction Into Kubernetes Networking – Part 1

Cisco Study Materials, Cisco Guides, Cisco Tutorial and Material, Cisco Learning, Cisco Kubernetes

Cisco Live Barcelona recently took place and there was a lot of focus on Kubernetes, including the launch of the Cisco Hyperflex Application Platform(HXAP). Cisco HXAP delivers an integrated container-as-a-service platform that simplifies provisioning and ongoing operations for Kubernetes across cloud, data center, and edge.

With every new technology comes a learning curve and Kubernetes is no exception.

1. Container to Container Communications


The smallest object we can deploy in Kubernetes is the pod, however within each pod you may want to run multiple containers. A common usecase for this is a helper where a secondary container helps a primary container with tasks such as pushing and pulling data.

Container to container communication within a K8s pod uses either the shared file system or the localhost network interface.

We can test this by using the K8s provided example, two-container-pod, and modifying it slightly.

https://k8s.io/examples/pods/two-container-pod.yaml

When we deploy this pod we can see two containers, “nginx-container” and “debian-container“. I’ve created two separate options to test, one with a shared volume, and one without a shared volume but using localhost instead.

Shared Volume Communication


Cisco Study Materials, Cisco Guides, Cisco Tutorial and Material, Cisco Learning, Cisco Kubernetes

Cisco Study Materials, Cisco Guides, Cisco Tutorial and Material, Cisco Learning, Cisco Kubernetes

When we use the shared volume, Kubernetes will create a volume in the pod which will be mapped to both containers. In the “nginx-container”, files from the shared volume will map to the “/usr/share/nginx/html” directory, while in the “debian-container” files will map to the “/pod-data” directory. When we update the “index.html” file from the Debian container, this change will also be reflected in our Nginx container, thereby providing a mechanism for our helper (Debian) to push and pull data to and from Nginx.

Localhost Communication


Cisco Study Materials, Cisco Guides, Cisco Tutorial and Material, Cisco Learning, Cisco Kubernetes

Cisco Study Materials, Cisco Guides, Cisco Tutorial and Material, Cisco Learning, Cisco Kubernetes

In the second scenario shared volume has been removed from the pod and a message has been written in the “index.html” file which only resides in the Nginx container. As previously mentioned, the other method for multiple containers to communicate within a pod is through the localhost interface and the port number to which they’re listening.

In this example Nginx is listening on port 80, therefore when we run the “curl https://localhost” command from the Debian container we can see that the “index.html“ page is served back to us from Nginx.

Here’s the “nginx-container” showing the contents of the “index.html” file.

Cisco Study Materials, Cisco Guides, Cisco Tutorial and Material, Cisco Learning, Cisco Kubernetes

Confirmation that we’re receiving the file when we Curl from the “debian-container”

Cisco Study Materials, Cisco Guides, Cisco Tutorial and Material, Cisco Learning, Cisco Kubernetes

Friday, 28 February 2020

Accelerate Your SMB Opportunity with High Velocity Managed Services

Cisco Exam Prep, Cisco Prep, Cisco Tutorial and Material, Cisco Guides, Cisco Collaboration

Small and medium business (SMB)* IT trends are consistent regardless of what you read or who you talk to.  IT budgets are growing, security is top of mind, and the shift to cloud in full force. For Cisco, these trends couldn’t align better. We have the buyer interest^, product leadership, and a world-class partner ecosystem to respond to the needs of these customers. 

And we are laser-focused on improving our traction in this market with a new small business-specific portfolio and a series of right-sized partner programs launched late last year.

In this blog, I will discuss the very exciting work we are doing to accelerate the SMB opportunity with our global Service Provider Partners, using a comprehensive new approach called High Velocity Managed Services. Simply put, High Velocity Managed Services does what it says: it accelerates the build-out and launch of managed services offerings targeting smaller customers, making it easy, scalable, and efficient to reach this segment.

Major Opportunity for SP Managed Services


When it comes to where and how to buy, there is no one size fits all in SMB. Sales cycles, typically one month, are much shorter than with enterprise customers (typically months or years for similar solutions) and SMBs want to purchase solutions on-demand and often, online.

However, when it comes to IT infrastructure services like network, security, and collaboration, new Cisco research suggests that many of the same companies are interested in using managed services if the provider can make it easy, affordable, and bundle it with other services – especially security (more on this later). And if the managed services provider can expand the bundle to deliver a complete IT package, including internet, the value proposition becomes extremely favorable.

Cisco Exam Prep, Cisco Prep, Cisco Tutorial and Material, Cisco Guides, Cisco Collaboration

Obviously, this means our global service provider ecosystem is in a great spot to better serve small business customers. They already market and deliver connectivity to SMBs and can use that scale to layer in new value-added services these buyers are looking for. No brainer, right?

Upping the Game with Managed Services


Yes, trends and business models suggest that service providers are poised to capture more managed services revenues. But this needs to be done with a few key tenets in mind:

◉ First and foremost, an acknowledgment that enterprises and SMB are very different. Sales and go-to-market processes need to be simplified to reach SMB buyers.

◉ For winning service providers, this comes with the recognition that SMBs can’t be served with enterprise offerings at a reduced price.

◉ Rather, solutions need to be tuned for this space by delivering the right set of features that solve the broadest set of customer pain points (with optional add-ons for vertical customers) messaged with a set of specific personas and business outcomes in mind, and set up for easy cross-sell.

Born in the cloud, Cisco Meraki delivers a simple value proposition for managed service providers – and is where the High Velocity journey starts. Providers can easily create a Meraki service template (i.e. offering) that suits a majority of customer needs, use platform APIs to connect to backend services, and then sell and provision with turnkey speed. This plug and play model can be used to deliver a secure WiFi offering all the way to a full managed network/LAN solution for rapid deployment and serviceability of desk phones across the WAN. All with a brandable end customer portal to provide important solution visibility that comes out of the box or customized through application development partners such as Encapto.

Security is King


71% of SMBs who are very interested in purchasing managed services from their service provider ranked security as the top value proposition of such a solution – higher than streamlined support and reporting tools.

Cisco Exam Prep, Cisco Prep, Cisco Tutorial and Material, Cisco Guides, Cisco Collaboration

With Meraki integrations with Cisco’s security products such as Cisco Advanced Malware Protection (AMP) and Umbrella, partners can lead with a value proposition centered on keeping out malware and ransomware that can cripple business productivity. Backed by the world’s most comprehensive threat intelligence research entity in Cisco Talos, providers can further showcase how well the solution covers the complex and evolving cyber threat landscape. For companies investing in collaboration, the Webex platform also delivers top-end security. Together, providers can market the complete suite of IT infrastructure services with security as the lead message.

Cisco Exam Prep, Cisco Prep, Cisco Tutorial and Material, Cisco Guides, Cisco Collaboration

More than Just a Product Pitch


As partners shift from niche managed services players to leading digital service providers, they need to adapt their own go-to-market programs and resources. It is not an overnight shift to start selling network security, SD-WAN, and complete IT infrastructure bundles. Nor is it sufficient to just start selling a managed service without thinking about the right product set, target segments, and packages as mentioned previously.

To help providers with the High Velocity Managed Services opportunity, we’ve built an arsenal of best-practice go-to-market resources to assist with sales and marketing enablement. We’ve developed these assets across the offers depicted above, and they can be used off the shelf or tailored to a provider’s specific campaign goals through a Cisco-led engagement.

Thursday, 27 February 2020

Threat hunting doesn’t have to be difficult—Taking a proactive position with your cybersecurity

Cisco Study Materials, Cisco Guides, Cisco Learning, Cisco Exam Prep

Your Endpoint Protection Platform (EPP) is up to date with the latest version. Your Endpoint Detection and Response (EDR) technology has all of the latest framework rules and automaton in place. Vulnerabilities and patches for hardware and software are all covered. Your Defense in Depth strategy appears to be keeping your organization secure. But, and there is always a “but”, some adversarial techniques are difficult to DETECT even on a good day. Exfiltration can be quite difficult to detect even if you are looking for it.

As advanced threats continue to proliferate throughout an organizations’ IT resources, threat hunting as a practice has appeared. For an elite security organization, threat hunting takes a more proactive stance to threat detection. Threat hunting was a natural, security progression saved for the most mature environments where skilled personnel leverage knowledge and tools to formulate and investigate hypotheses relating to their organization’s security across the landscape. Now with technology advancements and automation, threat hunting has now become within reach for every organization.

Threat hunting is an analyst-centric process that enables organizations to uncover hidden, advanced threats, missed by automated preventative and detective controls.

Security professionals are beginning to discover threat hunting practices to advance their detection and response monitoring. Threat hunting requires a highly skilled person as well as wide-ranging data forensics and live response across the IT environment. There are only a handful of companies in verticals such as financial services, high-tech manufacturing, and defense that can claim to have advanced threat hunting teams that deliver results.

Today’s threat actors are well-organized, highly intelligent, motivated and focused on their targets. These adversaries could be lurking on your network or threating to break into it, using increasingly sophisticated methods to reach their goal. In addition, the attacks can come from many different threat surfaces to exploit the many vulnerabilities that may be present across an organizations’ network and people. Worst of all, organizations do not know by whom, when, where or how a well-planned attack will occur. Today’s rule-based defenses and solutions have limitations, even advanced detection mechanisms struggle to anticipate how attack vectors will evolve. To mitigate threats more proactively, organizations must move quicker than the speed of the threat. The easiest way to put it, when the existing rules are undermined, it is time to start threat hunting.

Cisco Study Materials, Cisco Guides, Cisco Learning, Cisco Exam Prep
Pyramid of Pain

Threat Hunting also allows security teams to address the top most tiers of the Pyramid of Pain, making more difficult for adversaries to impact environments. At the “Tools” level, analysts are taking away one or more specific tools that an adversary would use in an attack. At the apex of the pyramid are the TTPs (Tactics,Techniques and Procedures), when analysts detect and respond at this level, they are operating directly on the adversary’s behaviors, not against their tools, forcing them to learn new behaviors.

There are three types of hunts.

◉ Intelligence-Driven (Atomic Indicators) – These are low-hanging fruit hunts. They are generally known threats that bypass traditional security controls

◉ TTP-Driven (Behavioral and Compound Indicators) – These are hunts looking for techniques used by advanced attackers, where analysts take a methodological approach for discovering unknowns. Generally attempting to interrupt the adversaries TTPs (Techniques, Tactics, and Procedures)

◉ Anomaly-Driven (Generic Behaviors) – These hunts are based on low-prevalence artifacts and outlier behaviors. These are unknown threat leads.

Benefits of Starting a Threat Hunting Practice


There are many benefits from starting a threat hunting practice. Obviously, discovering and thwarting an attack before it causes significant damage. However, what about a threat hunt that doesn’t find anything? Is that really a bad thing? Having stronger knowledge of vulnerabilities and risks on the network will allow a hardening of your security environment which in turn should equate to fewer breaches and breach attempts. Moreover, the insights gathered from threat hunts will aid in reducing the attack surface. Another key result from beginning a threat hunting practice is that security teams will realize increased speed and accuracy of threat responses. Ultimately, organizations should witness measurable improvements for key security indicators such as mean time to detect and mean time to respond.

In-House or Outsourced?


Through outsourcing, threat hunting can be accessible for organizations of all sizes, but especially for small and medium sized organization as they often do not have a Security Operations Center (SOC) as it often is too expensive to build and support. Many Mid-Market sized companies have a SOC and are considering the addition of threat hunting to their current environment. Enterprise and large organizations perhaps are looking for assurance by augmenting existing threat hunting efforts. And in many cases, these enterprise organizations simply want to empower and educate their staff.

Just in time for RSAC, Cisco is pleased to announce that it will be adding Threat Hunting as a feature to our Cisco AMP for Endpoints offering. Our new threat hunting by Cisco Talos uniquely identifies advanced threats, alerting our customers before they can cause any further damage by:

◉ Uncovering hidden threats faster across the attack surface using MITRE ATT&CK™ and other industry best practices

◉ Performing human-driven hunts based on playbooks producing high fidelity alerts

◉ Continually developing systematic playbooks, executing on broad, low-level telemetry on product backend

Our new threat hunting capability:

◉ Is provided by Cisco Talos, the largest non-governmental threat intelligence organization on the planet

◉ Is not limited to just one control point (i.e.: endpoint), instead, we hunt across multiple environments

◉ Uniquely combines our new Orbital Advanced Search technology with expertise from elite threat hunters to proactively find more sophisticated threats

Tuesday, 25 February 2020

Introducing SecureX

Making Security an Enabler, so Your Business Can Take an Exponential Leap


I joined the Cisco Security team the week after the RSA Conference in 2017. At that time there was a lot of discussion around the journey Cisco Security was on, particularly around our efforts to deliver an integrated architecture. For the previous years we had been integrating threat intelligence, context sharing and our anti-malware engine across our portfolio and were seeing dramatic improvements in key metrics such as time to detection.


But from the perspective of a security practitioner’s daily experience with our portfolio, we were failing. The user experience was siloed, it took too long to stitch our products (and third-party products) together, and even the navigation and look and feel of our products varied dramatically.

Shortly after that RSA we made the decision to focus our attention on the operational experience of our Security products, realizing that the usability component was equally as important as the underlying architecture. We stood up a team to lead us on that journey and began laying the foundation for what would become a huge leap forward for Cisco Security and for our customers.


Today we are introducing Cisco SecureX – a new way for users to experience Cisco’s Security portfolio.  Cisco SecureX streamlines our customers’ operations with increased visibility across their security portfolio and provides out-of-box integrations, powerful security analytics, and automated workflows to speed threat detection and response. SecureX is an open, cloud-native platform that connects Cisco’s integrated security portfolio and customers’ security portfolios for a simpler, more consistent experience across endpoints, cloud, network, and applications.

The foundational capabilities of SecureX


SecureX builds on the foundational work we’ve been doing over the past 2.5 years, including Cisco Threat Response, common user experience, single sign on, secure data sharing between on-prem and the cloud and more. But it does a whole lot more. The best way to experience SecureX is to visit us at the RSA conference. For those of you who can’t make it, here are some of the most important capabilities of the platform:

Unified visibility

SecureX provides unified visibility across all parts of your security portfolio – Cisco or third-party solutions – delivering metrics, activity feed and the latest threat intelligence.  I am particularly excited about the operational metrics capabilities of SecureX: Mean Time to Detection, Mean Time to Remediation, and Incident burndown times.  These metrics are derived from full case management capabilities native to the SecureX platform.  Case management enables SecureX customers to assign cases, track them to closure, and add relevant artifacts captured during investigation.

Automation

SecureX brings full multi-domain orchestration and automation capabilities to our customers using a no/low-code approach and intuitive drag-and-drop interface to deliver high-performance and scalable playbook capability.  The SecureX orchestration and automation capabilities use an adapter model that allows users to quickly and easily orchestrate across Security, Networking, IoT, Cloud, Collaboration, and Data Centers.  SecureX already has 50+ adapters across these domains and will continue to develop more.

Playbooks

SecureX will deliver pre-built playbooks, and customers can also develop their own playbooks tailored to their own environment of Cisco and non-Cisco products.  With our phishing playbook for example, end users can submit suspicious email to SecureX to get a recommendation of whether it is malicious or not.  If the submitted email is malicious, the end user will be notified of recommended next steps, and an event will be generated in SecureX alerting the security team.  To deliver this capability, the playbook pre-processes email to extract observables, determines the verdict for observables, hunts for targets involved and takes mitigation and/or preventative actions such as isolating the targets involved, blocking the malicious domain as necessary, etc.

Managed threat hunting

Only Cisco can bring multi-domain managed threat hunting capability across endpoint, cloud, email, etc. because of the breath and scope of our product portfolio.  Multi-domain managed threat hunting detects threats leveraging a combination of intel and data techniques to surface activity that might have slipped past traditional threat, behavioral, and ML-based techniques.  High fidelity threats confirmed by our Talos and Research teams are then communicated to customers through the SecureX activity panel as well as via emails with detail artifacts, targets involved, and remediation recommendations.

Fast time to value

Unlike other security platforms in the market, SecureX helps customers get value quickly.  Getting started is simple – if you have a CCO account, login and add products to SecureX by providing API keys and adding on-prem devices (for Firewall and on-prem Email solutions).  If you don’t have a CCO account, create a SecureX account on the homepage, add products to SecureX by providing an API key and adding on-prem devices (for Firewall and on-prem Email solutions).  You are ready to go in minutes vs. hours and days.

Saturday, 22 February 2020

Best way to Prepare Cisco CCNA Collaboration (CICD) 210-060 Certificatio...



Exam Name: Implementing Cisco Collaboration Devices

Exam Number: 210-060 CICD

Exam Price: $300 USD

Duration: 75 minutes

Number of Questions: 55-65

Passing Score Variable: (750-850 / 1000 Approx.)

Recommended Training:

Implementing Cisco Collaboration Devices (CICD)

Exam Registration: PEARSON VUE

Sample Questions: Cisco 210-060 Sample Questions

Practice Exam: Cisco Certified Network Associate Collaboration Practice Test

Friday, 21 February 2020

Is Your Firewall Permitting and Denying the Correct Flows?

Two days prior, a large US city fell victim to a ransomware attack that disabled a sizable portion of the municipal network. I found myself on an airplane a few hours later.

Cisco Prep, Cisco Exam Prep, Cisco Tutorial and Material, Cisco Certification

Our first order of business was to quarantine potentially infected systems away from known-clean systems. In the interest of time, we installed a large Cisco ASA firewall into the datacenter distribution layer as a crude segmentation barrier. Once management connectivity was established, I was tasked with firewall administration, a job that is generally monotonous, thankless, and easy to scapegoat when things go wrong.

Long story short, the network was changing constantly. Applications were being moved between subnets, systems were being torn down and rebuilt, and I was regularly tasked with updating the firewall rules to permit or deny specific flows. After about 30 minutes, I decided to apply some basic automation. Managing firewall configurations using Python scripts is powerful but not particularly new or interesting, so I won’t focus on that aspect today.

Instead, consider the more difficult question. How do you know that your firewall is permitting and denying the correct flows? Often times, we measure the effectiveness of our firewall by whether the applications are working. This is a good measure of whether the firewall is permitting the desirable flows, but NOT a good measure of whether the firewall is denying the undesirable flows. Answering this question was critical to our incident response efforts at the customer site.

How would you solve this problem?


I decided to use the `packet-tracer` command built into the ASA command line. The command supports native `xml` formatting, which makes it easy to parse into Python data structures. Using Python parlance, the output contains a list of dictionaries, each one describing a phase in the ASA processing pipeline. Stages might include an ACL check, route lookup, NAT, QoS, and more. After the phase list, a summary dictionary indicates the final result.

ASA1#packet-tracer input inside tcp 10.0.0.1 50000 192.168.0.1 80 xml
<Phase>
  <id>1</id>
  <type>ROUTE-LOOKUP</type>
  <subtype>Resolve Egress Interface</subtype>
  <result>ALLOW</result>
  <config/>
  <extra>found next-hop 192.0.2.1 using egress ifc management</extra>
</Phase>
<Phase>
  <id>2</id>
  <type>ACCESS-LIST</type>
  <subtype/>
  <result>DROP</result>
  <config>Implicit Rule</config>
  <extra>deny all</extra>
</Phase>
<result>
  <input-interface>UNKNOWN</input-interface>
  <input-status>up</input-status>
  <input-line-status>up</input-line-status>
  <output-interface>UNKNOWN</output-interface>
  <output-status>up</output-status>
  <output-line-status>up</output-line-status>
  <action>DROP</action>
  <drop-reason>(acl-drop) Flow is denied by configured rule</drop-reason>
</result>

Now, how to automate this? I decided to use a combination of Python packages:

1. Nornir, a task execution framework with concurrency support
2. Netmiko, a library for accessing network device command lines

The high-level logic is simple. On a per firewall basis, define a list of `checks` which contain all the inputs for a `packet-tracer` test. Each check also specifies a `should` key which helps answers our original business question; should this flow be allowed or dropped? This allows us to test both positive and negative cases explicitly. Checks can be TCP, UDP, ICMP, or any arbitrary IP protocol. The checks can also be defined in YAML or JSON format for each host. Here’s an example of a `checks` list for a specific firewall:

---
checks:
  - id: "DNS OUTBOUND"
    in_intf: "inside"
    proto: "udp"
    src_ip: "192.0.2.2"
    src_port: 5000
    dst_ip: "8.8.8.8"
    dst_port: 53
    should: "allow"
  - id: "HTTPS OUTBOUND"
    in_intf: "inside"
    proto: "tcp"
    src_ip: "192.0.2.2"
    src_port: 5000
    dst_ip: "20.0.0.1"
    dst_port: 443
    should: "allow"
  - id: "SSH INBOUND"
    in_intf: "management"
    proto: "tcp"
    src_ip: "fc00:192:0:2::2"
    src_port: 5000
    dst_ip: "fc00:8:8:8::8"
    dst_port: 22
    should: "drop"
  - id: "PING OUTBOUND"
    in_intf: "inside"
    proto: "icmp"
    src_ip: "192.0.2.2"
    icmp_type: 8
    icmp_code: 0
    dst_ip: "8.8.8.8"
    should: "allow"
  - id: "L2TP OUTBOUND"
    in_intf: "inside"
    proto: 115
    src_ip: "192.0.2.1"
    dst_ip: "20.0.0.1"
    should: "drop"
...

Inside of a Nornir task, the code iterates over the list of checks, assembling the proper `packet-tracer` command from the check data, and issuing the command to the device using Netmiko. Note that this task runs concurrently on all firewalls in the Nornir inventory, making it a good fit for networks with distributed firewalls. Below is a simplified version of the Python code to illustrate the high-level logic.

def run_checks(task):
    # Iterate over all supplied checks
    for chk in checks:
        # Build the string command from check details (not shown)
        cmd = get_cmd(chk)
        # Use netmiko to send the command and collect output
        task.run(
            task=netmiko_send_command,
            command_string=cmd
        )

Behind the scenes, the code transforms this XML data returned by the ASA into Python objects. Here’s what that dictionary might look like. It contains two keys: `Phase` is a list of dictionaries representing each processing phase, and `result` is the final summarized result/

{
  "Phase": [
    {
      "id": 1,
      "type": "ROUTE-LOOKUP",
      "subtype": "Resolve Egress Interface",
      "result": "ALLOW",
      "config": None,
      "extra": "found next-hop 192.0.2.1 using egress ifc management"
    },
    {
      "id": 2,
      "type": "ACCESS-LIST",
      "subtype": None,
      "result": "DROP",
      "config": "Implicit Rule",
      "extra": "deny all"
    }
  ],
  "result": {
    "input-interface": "UNKNOWN",
    "input-status": "up",
    "input-line-status": "up",
    "output-interface": "UNKNOWN",
    "output-status": "up",
    "output-line-status": "up",
    "action": "DROP",
    "drop-reason": "(acl-drop) Flow is denied by configured rule"
  }
}

In the interest of brevity, I won’t cover the extensive unit/system tests, minor CLI arguments, and dryrun process in this blog. Just know that the script will automatically output three different files. I used the new “processor” feature of Nornir to build this output. Rather than traversing the Nornir result structure after the tasks have completed, processors are event-based and will run at user-specified points in time, such as when a task starts, a task ends, a subtask starts, a subtask ends, etc.

One of the output formats is terse, human-readable text which contains the name of the check and the result. If a check should be allowed and it was allowed, or if a check should be dropped and it was dropped, is it considered successful. Any other combination of expected versus actual results indicates failure. Other formats include comma-separated value (CSV) files and JSON dumps that provide even more information from the `packet-tracer` result. Here’s the `terse` format when executed on two ASAs with hostnames `ASAV1` and `ASAV2`:

ASAV1 DNS OUTBOUND -> FAIL
ASAV1 HTTPS OUTBOUND -> PASS
ASAV1 SSH INBOUND -> PASS
ASAV1 PING OUTBOUND -> PASS
ASAV1 L2TP OUTBOUND -> PASS
ASAV2 DNS OUTBOUND -> PASS
ASAV2 HTTPS OUTBOUND -> FAIL
ASAV2 SSH INBOUND -> PASS
ASAV2 PING OUTBOUND -> PASS
ASAV2 L2TP OUTBOUND -> PASS

For the visual learners, here’s a high-level diagram that summarizes the project’s architecture. After Nornir is initialized with the standard `hosts.yaml` and `groups.yaml` inventory files, the host-specific checks are loaded for each device. Then, Nornir uses Netmiko to iteratively issue each `packet-tracer` command to each device. The results are recorded in three different output files which aggregate the results for easy viewing and archival.

Cisco Prep, Cisco Exam Prep, Cisco Tutorial and Material, Cisco Certification

If you’d like to learn more about the project or deploy it in your environment, check it out here on the Cisco DevNet Code Exchange. As a final point, I’ve built Cisco FTD support into this tool as well, but it is experimental and needs more in-depth testing. Happy coding!

Thursday, 20 February 2020

Answering The Big Three Data Science Questions At Cisco

Data Science Applied In Business


In the past decade, there has been an explosion in the application of data science outside of academic realms. The use of general, statistical, predictive machine learning models has achieved high success rates across multiple occupations including finance, marketing, sales, and engineering, as well as multiple industries including entertainment, online and store front retail, transportation, service and hospitality, healthcare, insurance, manufacturing and many others. The applications of data science seem to be nearly endless in today’s modern landscape, with each company jockeying for position in the new data and insights economy. Yet, what if I told you that companies may be achieving only a third of the value they could be getting with the use of data science for their companies? I know, it sounds almost fantastical given how much success has already been achieved using data science. However, many opportunities for value generation may be getting over looked because data scientists and statisticians are not traditionally trained to answer some of the questions companies in industry care about.

Most of the technical data science analysis done today is either classification (labeling with discrete values), regression (labeling with a number), or pattern recognition. These forms of analysis answer the business questions ‘can I understand what is going on’ and ‘can I predict what will happen next’. Examples of questions are ‘can I predict which customers will churn?’, ‘can I forecast my next quarter revenue?’, ‘can I predict products customers are interested in?’, ‘are there important customer activity patterns?’, etc… These are extremely valuable questions to companies that can be answered by data science. In fact, answering these questions is what has caused the explosion in interest in applying data science in business applications. However, most companies have two other major categories of important questions that are being totally ignored. Namely, once a problem has been identified or predicted, can we determine what’s causing it? Furthermore, can we take action to resolve or prevent the problem?

I start this article discussing why most data driven companies aren’t as data driven as they think they are. I then introduce the idea of the 3 categories of questions companies care about the most (The Big 3), discuss why data scientists have been missing these opportunities. I then outline how data scientists and companies can partner to answer these questions.

Why Even Advanced Tech Companies Aren’t as Data Driven As They Think They Are.


Many companies want to become more ‘data driven’, and to generate more ‘prescriptive insights’. They want to use data to make effective decisions about their business plans, operations, products and services. The current idea of being ‘data driven’ and ‘prescriptive insights’ in the industry today seems to be defined as using trends or descriptive statistics about about data to try to make informed business decisions. This is the most basic form of being data driven. Some companies, particularly the more advanced technology companies go a step further and use predictive machine learning models and more advanced statistical inference and analysis methods to generate more advanced descriptive numbers. But that’s just it. These numbers, even those generated by predictive machine learning models, are just descriptive (those with a statistical background must forgive me for the overloaded use of the term ‘descriptive’). They may be descriptive in different ways, such as machine learning generating a predicted number about something that may happen in the future, while a descriptive statistic indicates what is happening in the present, but these methods ultimately focus on producing a number. To take action to bring about a desired change in an environment requires more than a number. It’s not enough to predict a metric of interest. Businesses want to use numbers to make decisions. In other words, businesses want causal stories. They want to know why a metric is the way it is, and how their actions can move that metric in a desired direction. The problem is that classic statistics and data science falls short in pursuit of answers to these questions.

Take the example diagram shown in figure 1 below. Figure 1 shows a very common business problem of predicting the risk of a customer churning. For this problem, a data scientist may gather many pieces of data (features) about a customer and then build a predictive model. Once a model is developed, it is deployed as a continually running insight service, and integrated into a business process. In this case, let’s say we have a renewal manager that wants to use these insights. The business process is as follows. First, the automated insight service that was deployed gathers data about the customer. It then passes that data to the predictive model. The predictive model then outputs a predicted risk of churn number. This number is then passed to the renewal manager. The renewal manager then uses their gut intuition to determine what action to take to reduce the risk of churn. This all seems straightforward enough. However, we’ve broken the chain of being data driven. How is that you ask? Well, our data driven business process stopped at the point of generating our churn risk number. We simply gave our churn risk number to a human, and they used their gut intuition to make a decision. This isn’t data driven decision making, this is gut driven decision making. It’s a subtle thing to notice, so don’t feel too bad if you didn’t see it at first. In fact, most people don’t recognize this subtlety. That’s because it’s so natural these days to think that getting a number to a human is how making ‘data driven decisions’ works. The subtlety exists because we are not using data and statistical methods to evaluate the impact of actions the human can take on the metric they care about. A human sees a number or a graph, and then *decides* to take *action*. This implies they have an idea about how their *action* will *effect* the number or graph that they see. Thus, they are making a cause and effect judgement about their decision making and their actions. Yet, they aren’t using any sort of mathematical methods for evaluating their options. They are simply using their personal judgement to make a decision. What can end up happening in this case is that a human may see a number, make a decision, and end up making that number worse.

Let’s take the churn risk example again. Let’s say the customer is 70% likely to churn and that they were likely to churn because their experience with the service was poor, but assume that the renewal manager doesn’t know this (this too is actually a cause and effect statement). Let’s also say that a renewal manager sends a specially crafted renewal email to this customer in an attempt to reduce the likelihood of churn. That seems like a reasonable action to take, right? However, this customer receives the email, and is reminded of how bad their experience was, and is now even more annoyed with our company. Suddenly the likelihood to churn increases to 90% for this customer. If we had taken no action, or possibly a different action (say connecting them with digital support resources) then we would have been better off. But without an analysis of cause and effect, and without systems that can analyze our actions and prescribe the best ones to take, we are gambling with the metrics we care about.

Cisco Prep, Cisco Tutorial and Material, Cisco Guides, Cisco Certifications, Ciscoa Learning
Figure 1

So how can we attempt to solve this problem? We need to incorporate mathematical models and measurement into the business process after the number is generated. We need to collect data on what actions are being taken, measure their relationship with the metrics we care about, and then optimize over our actions using causal inference models and AI systems. Figure 2 below shows how we can insert an AI system into the business process to help track, measure, and optimize the actions our company is taking. Using a combination of mathematical analysis methods, we can begin to optimize the entire process using data science end to end. The stages of this process can be abstracted and generalized as answering 3 categories of questions companies care about. Those 3 categories are described in the next section.

Cisco Prep, Cisco Tutorial and Material, Cisco Guides, Cisco Certifications, Ciscoa Learning

Comparing Machine Learning to Causal Analysis (Inference)


To get a better understanding of what machine learning does and where it falls short, we introduce figure 3 and figure 4 below. Figure 3 and Figure 4 both describe the problem space of understanding cancer. Machine learning can be used to do things like predict whether or not a patient will get cancer given characteristics that have been measured about them. Figure 3 shows this by assigning directed arrows from independent variables to the dependent variable (in this case cancer). These links are associative by their construction. The main point is that machine learning focuses on numbers and the accurate production of a number. This can in many cases be enough to gain a significant amount of value. For example, predicting the path of a hurricane has value on it’s own. There exists no confusion about what should be done given the prediction. If you are in the predicted path of the hurricane, the action is clearly to get out of the way. Sometimes however, we want to know why something is happening. Many times we want to play ‘what-if’ games. What if the patient stopped smoking? What if the patient had less peer pressure? To answer these questions, we need to perform a causal analysis.

Cisco Prep, Cisco Tutorial and Material, Cisco Guides, Cisco Certifications, Ciscoa Learning
Figure 3

Figure 4 below shows a visual example of what causal analysis provides. Causal analysis outputs stories, not just numbers. The diagram shows the directed causal links between all variables in an environment. For example, given this diagram anxiety causes smoking. Causal stories are important any time we or our business stakeholders want to take action to improve the environment. The causal story allows us to quantify cause and effect relationships, play what-if scenarios, and perform root-cause analysis. Machine learning falls short of being able to do this because these all require a modeling of cause and effect relationships.

What Are the Big 3?


Cisco Prep, Cisco Tutorial and Material, Cisco Guides, Cisco Certifications, Ciscoa Learning
Figure 5

Figure 5 above describes what ‘The Big 3’ questions companies care about are. The big 3 questions seem fairly obvious. In fact, these questions are at the foundation of most of problem solving in the real world. Yet, almost all data science in industry today revolves around answering only the first question. What most data scientists understand as supervised, unsupervised, and semi-supervised learning revolves around answering what is happening or what will happen. Even with something like a product recommendation system (which you might believe prescribes something because of the term ‘recommend’), we only know what products a customer is interested in (thus it’s only an indication of interest, not a reason for interest). We don’t know the most effective way to act on that information. Should we send an ad? Should we call them? Do certain engagements with them cause a decrease their chances of purchase? To answer what is *causing* something to happen, we need to rely on foundational work in the area of Causal Inference developed by researchers like Ronald Fisher, Jerzy Neyman, Judea Pearl, Donald B. Rubin, Paul Holland, and many others. Once we understand what is causing a metric we care about, we can at least begin to think intelligently about the actions we can take to change those metrics. This is where the third question mentioned in figure 3 above comes in. To answer this question we can rely on a wide variety of techniques that have been developed including causal inference for the cause and effect relationship between actions and the metrics they are supposed to affect, statistical decision theory, decision support systems, control systems, reinforcement learning, and game theory.

Cisco Prep, Cisco Tutorial and Material, Cisco Guides, Cisco Certifications, Ciscoa Learning
Figure 6

Figure 6 above breaks down some of the methods in a more technical way. The methodology column outlines the major methods, fields, and approaches that can be in general used each of the big 3 questions in turn. The algorithms column lists some specific algorithms that may be applied to answer each of the big 3 questions. While some of these algorithms should be familiar to the average data scientist (deep neural networks, random forests, etc.), others are maybe only known in passing (multi-armed bandits, reinforcement learning, etc). Still more algorithms are likely to be totally new to some data scientists (Difference in Differences, Propensity Score Matching, etc). The main paper delves into each question and the important technical details of the methods used to answer each question. It’s very important to understand these methods, particularly for performing causal analysis and optimizing actions. These methods are highly nuanced, with many difference kinds of assumptions. Naively applying these methods without understanding their limitations and assumptions will almost certainly lead to incorrect conclusions.

Example Use Case for Renewals.


A well known question, in which we have applied the big 3 methodology, is for understanding Cisco product and service renewals. Understanding and predicting renewals is a prime example of how many companies are attempting to get value through data science. The problem is also typically referred to as predicting churn, churn risk prediction, predicting attrition. Focusing on renewals is also useful for demonstration purposes because most of the data science applied to problems of this kind fall short of providing full value. That’s because renewals is a problem where providing a number is not the goal. Simply providing likelihood of a customer to renew is not enough. The company wants to **do** something about this. The company wants to take action to cause an increase in the likelihood to renew. For this, and any other time where the goal is to **do** something, we rely on causal inference and methods for optimizing actions.

Question 1: What is happening or will happen?

As we’ve already stated, the main question that is typically posed to a team of data scientists is ‘Can we accurately predict which customers will renew and which ones won’t?’ While this is a primary question asked by the business, there are many other questions that fall into the area of prediction and pattern mining including,

1. How much revenue can we expect from renewals? What does the distribution look like?
2. What’s the upper/lower bound on the expected revenue predicted by the models?
3. What are the similar attributes among customers likely to churn versus not churn?
4. What are the descriptive statistics for customers likely to churn vs not churn collectively, in each label grouping, and in each unsupervised grouping?

Each of the above questions can be answered systematically by framing them as problems either in prediction or pattern mining, and by using the wide variety of mathematical methods found in the referenced materials in the main paper here. These are the questions and methods data scientists are most familiar, and will most commonly be answered for a business.

Question 2: Why is this happening or going to happen?

Given this first question, the immediate next question is why. Why are customers likely or not likely to churn? For each question that we can build a model for, we can also perform a causal analysis. Thus, we can already potentially double the value that a data science project returns by simply adding on a causal analysis to each predictive model built. It’s important to bring up again that this question is so important that most data scientists are either answering it incorrectly, or are misrepresenting the information from statistical associations.

Specifically, when a data scientist is asked the question of why a customer is likely to churn, they almost exclusively turn to feature importance and local models such as LIME, SHAP, and others. These methods for describing the reason for a prediction are almost always incorrect because there is a disconnect between what the business stakeholder is asking for and what the data scientist is providing because of two different interpretations of the term ‘why’. Technically, one can argue that feature importance measures what features are important to ‘why’ a model makes a prediction, and this would be correct. However, a business stakeholder usually wants to know ‘what is causing the metric itself’ and not ‘what is causing the metric prediction’. The business stakeholder wants to know the causal mechanisms for why a metric is a particular number. This is something that feature importance absolutely does not answer. The stakeholder wants to use the understanding of the causal mechanisms to take an action to change the prediction to be more in their favor. This requires a causal analysis. However, most data scientists simply take the features with highest measured importance and present them to the stakeholder as though they are the answer to their causal question. This is objectively wrong, yet is time and again presented to stakeholders by seasoned statisticians and data scientists.

The issue is compounded by the further confusion added by discussions around ‘interpretable models’ and by the descriptions of feature importance analysis. LIME describes it’s package as ‘explaining what machine learning classifiers (or models) are doing’. While still a technically correct statement, these methods are being used to incorrectly answer causal questions, leading stakeholders to take actions that may have the opposite effect of what they intended.

While we’ve outlined the main causal question, there are a number of questions that can also be asked, and corresponding analysis that can be performed including,

1. How are variables correlated with each other and the churn label? (A non-causal question)

2. What are the important features for prediction in a model in general? (A non-causal question)

3. What are the most important features for prediction for an individual? Do groupings of customers with locally similar relationships exist? (A non-causal question)

4. What are the possible confounding variables? (A causal question)

5. After controlling for confounding variables, how do the predictions change? (A non-causal question benefiting from causal methods)

6. What does the causal bayes net structure look like? What are all of the reasonable structures? (A causal question)

7. What are the causal effect estimates between variables? What about between variables and the class label? (A causal question).

Many of these questions can be answered in whole or in part by a thorough causal analysis using the methods we outlined in the corresponding causal inference section of the main paper here, and further multiply the value returned by a particular data science project.

Question 3: How can we take action to make the metrics improve?

The third question to answer is ‘what actions can a stakeholder take to prevent churn?’ This is ultimately the most valueable of the three questions. The first two question set the context for who to focus on and where to focus efforts. Answering this question provides stakeholders with a directed and statistically valid means to improve the metrics they care about given complex environments. While still challenging given the methods available today (and presented in the section on intelligent action), it provides one of the greatest value opportunities. Some other questions that can be answered related to intelligent action that stakeholders may be interested in include,

1. What variables are likely to reduce churn risk if our actions could influence them?

2. What actions have the strongest impact on the variables that are likely to influence churn risk, or to reduce churn risk directly?

3. What are the important pieces of contextual information relevant for taking an action?

4. What are the new actions that should be developed and tested in an attempt to influence churn risk?

5. What actions are counter-productive or negatively impact churn risk?

6. What does the diminishing marginal utility of an action look like? At what point should an action no longer be taken?

The right method to use for prescribing intelligent action depends largely on the problem and the environment. If the environment is complex, the risks high, and there is not much chance for an automated system to be implemented, then methods from causal inference, decision theory, influence diagrams, and game theory based analysis are good candidates. However, if a problem and stakeholder are open to the use of an automated agent to learn and prescribe intelligent actions, then reinforcement learning may be a good choice. While possibly the most valuable of the big three questions to answer, it also exists as one of the most challenging. There still many open research questions related to answering this question, but the value proposition means that it’s likely an area that will see increased industry investment in the coming years.

How We Are Improving CX By Using Data Science to Answer the Big 3 at Cisco.


Like many other companies, Cisco has many models for answering the first of the big 3 questions. The digital lifecycle journey data science team has many predictive models for understanding Cisco’s customers. This includes analysis of customer purchasing behavior, digital activity, telemetry, support interaction, and renewal activity using a wide variety of machine learning based algorithms. We also apply the latest and greatest forms of advanced statistical and deep learning based supervised learning methods for understanding and predicting the expected behavior of our customers, their interactions with Cisco, and their interactions with Cisco products and services. We go a step further in this area by attempting to quantify and predict metrics valuable to both Cisco and Cisco’s customers. For example, we predict metrics like how a customer is going to keep progressing through the expected engagement with their product over the next several days to next several weeks. This is just one of the many metrics we are trying to understand about the Cisco customer experience. Others include customer satisfaction, customer health, customer ROI, renewal metrics, and many others. These metrics allow us to understand where there may be issues with our journey so that we can start trying to apply data science methods to answer the ‘why’ and ‘intelligent action’ questions we’ve previously mentioned.

We are also using causality to attempt to understand the Cisco’s customer experience, and what causes a good or bad customer experience. We go a step further by trying to complete the causal chain of reasoning to quantify how a customer experience causes Cisco’s business metrics to rise and fall. For example, we’ve used causal inference methods to measure the cause and effect aspects of customer behavior, product utilization, and digital engagements on a customer’s likelihood to renew Cisco services. Using causal inference, we are gaining deeper insights into what is causing our customers and Cisco to succeed or fail, and are using that information to guide our strategy for maximizing the customer experience.

Finally, to answer the third of the big three questions, we are employing causality, statistical decision theory, intelligent agent theory, and reinforcement learning to gain visibility to the impact our activities have on helping our customers and improving Cisco’s business metrics, and to learn to prescribe optimal actions over time to maximize these metrics. We have developed intelligent action systems that we working to integrate with our digital email engagements journeys to optimize our interactions with customers to help them achieve a return on investment. We are, in general, applying this advanced intelligent agent system to quantify the impact of our digital interactions, and to prescribe the right digital customer engagements to have, with the most effective content, at the right time, in the right order, personalized to each and every individual customer.

Why Many Data Scientists Don’t Know the Big 3, or How to Answer Them.


Those learned readers experienced with data science may be asking themselves, ‘is anything new being said here’? It’s true that no new technical algorithm, mathematical analysis, or in depth proof is being presented. I’m not presenting some new mathematical modeling method, or some novel comparison of existing methods. What is new is how I’m framing the problems for data science in industry, and the existing methodologies that can start to solve those problems. Causal inference has been used heavily in medicine for observational studies where controlled trials aren’t feasible, and for things like adverse drug effect discovery. However, Causal Inference hasn’t received wide spread application in areas outside of the medical, economic, or social science fields yet. The idea of prescribed actions is also something that isn’t totally new. Prescribed actions can be thought of as just a restatement of the field of control systems, decision analysis, and intelligent agent theory. However, the utilization of these methods for completing the end-to-end data driven methodology for business hasn’t received wide spread application in industry applications. Why is this? Why aren’t data scientists and businesses working together to frame all of their problems this way?

There could be a couple of reasons for this. The most obvious answer is that most data scientists are trained to answer the first of the big 3 questions. Most data scientists and statisticians are trained on statistical inference, classification, regression, and general unsupervised learning methods like clustering and anomaly detection. Statistical methods like causal inference aren’t widely known, and are therefore not widely taught. Register with any online course, university, or other platform for learning about data science and machine learning and you’ll be hard pressed to find discussions about identifying causal patterns in data sets. The same goes for the ideas of intelligent agents, control systems, and reinforcement learning to a lesser degree. These methods tend to be relegated to domains that have simulators and a tolerance for failure. Thankfully there is less of a gap for these methods. They typically are given their own special courses in either machine learning, electronics, and signals and systems processing curriculums.

Another possible explanation may be that many data scientists in industry tend to be enamored with the latest and greatest popular algorithm or methodology. As math and tech nerds we become enamored with the technical intricacies of how things work, particularly mathematical algorithms and methodologies. We tend to develop models and then go looking for solutions rather than the other way around, potentially blinding us to the methods in data science that can provide business value time and time again.

Yet another explanation may be that many data scientists are not well versed enough in statistics and the statistical literature. Many data scientists are asked questions about how a predictive model produced a number. For example, in our churn risk problem, renewal managers typically want to know why someone is at risk. The average data scientist hears this, and then uses methods like feature importance and more interpretable models to answer this question. However this doesn’t really answer the actual question being asked. The data scientist provides what might be important associations between model inputs and the predicted metric, but this doesn’t provide the information the renewal manager wants. They want to know about information they can act on, which requires cause and effect analysis. This is a classic case of ‘correlation is not causation’ that everyone seems to know but can still trip up even statistically minded data scientists. It’s such an issue that many companies I’ve talked with that claim to provide ‘next best actions’ are statistically invalid (mainly because they use feature importance and sensitivity analysis type methods instead of understanding basic counter factual analysis and confounding variables).

Moving forward the data science community operating in industry domains will become more aware of the big 3 questions and the analysis methods that can be performed to answer them. Companies that can quickly realize value from answering these questions using data science will be at the head of the pack in the emerging data science and insights economy. Companies that focus on answering all of the big 3 questions will have a distinct competitive advantage, and will have transformed themselves to be truly data driven.