Resilience
Protecting your business and making it stronger with every threat.
There's a reason why the very forefront of technology is called the bleeding edge. These highly advanced and often experimental technologies are rarely ready for the harsh reality outside a research lab.
Bringing something that works inside a research lab to the daily operations of a business is risky. Many of these systems haven't yet been battle-tested, and they will most likely fail, in one way or another, sooner or later. And that's ok, if you are expecting and ready for it.
The business context is not a controlled environment like the ones you find in a research lab; it's a chaotic, complex environment. Besides all the internal variables that you don't even fully comprehend yet, businesses and markets also depend on external variables you have no control over.
Even when adequately tested, many data science and artificial intelligence systems will fail when put into production. No matter what people tell you, as with a rocket, there's no way to know whether it will succeed until you launch it.
That's the bad news. The good news is that you can prepare for it.
Stress testing
Software or systems testing is not something new. You create a batch of tests against your system, and the system should only be deemed satisfactory once it passes all (or at least the majority) of them.
For non-technical people, this is what it might look like: imagine you have a website with a simple email signup form. There are several tests you can run against it:
- Submit invalid text as email;
- Submit a fake email;
- Submit a very high number of emails per second;
- Make a raw request to submit an image instead of an email.
Some tests are more advanced than others, leading to the creation of stronger or weaker systems. When doing tests, engineers should not only think of the simple cases, but they should also assume not all users are well-intentioned, and some might be trying to attack your system. For this reason, you should also look into edge cases that will most likely cause errors and sometimes even vulnerabilities that allow others to take control over your software or data.
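As a rough illustration, the checks listed above could be sketched in Python along these lines. Everything here is an assumption made for the example: the SignupForm class, the deliberately simple email pattern, and the limit of five signups per second are hypothetical stand-ins, not a real framework or a recommended configuration.

```python
import re
import time

# Deliberately simple pattern: something@something.tld, with no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
MAX_SIGNUPS_PER_SECOND = 5  # hypothetical limit, chosen for illustration


class SignupForm:
    """Toy stand-in for the signup endpoint of a website."""

    def __init__(self):
        self.recent = []  # timestamps of recently accepted signups

    def submit(self, payload, now=None):
        now = time.monotonic() if now is None else now
        # Raw requests can carry anything, e.g. image bytes instead of text.
        if not isinstance(payload, str):
            return "rejected: not text"
        if not EMAIL_PATTERN.match(payload):
            return "rejected: invalid email"
        # Keep only signups from the last second, then apply the rate limit.
        self.recent = [t for t in self.recent if now - t < 1.0]
        if len(self.recent) >= MAX_SIGNUPS_PER_SECOND:
            return "rejected: too many requests"
        self.recent.append(now)
        return "accepted"


def run_tests():
    form = SignupForm()
    results = {
        "invalid text": form.submit("not an email", now=0.0),
        "image bytes": form.submit(b"\x89PNG\r\n", now=0.0),
        "valid email": form.submit("ana@example.com", now=0.0),
    }
    # Flood the form: 20 submissions at the same instant.
    flood = [form.submit(f"user{i}@example.com", now=10.0) for i in range(20)]
    results["flood accepted"] = flood.count("accepted")
    return results
```

Note that a fake but well-formed email passes a format check like this one; catching it usually takes an extra step, such as sending a confirmation email.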
The concept of antifragility
I was fascinated by the concept of antifragility even before I read the works of Nassim Nicholas Taleb. I didn't have a name for it; the closest one I knew at the time was "evolutionary pressure".
The concept of antifragility can be explained quite simply. If something fragile is exposed to stress, it is likely to break. If something robust is exposed to stress, it may break, but it may also stay the same. If something antifragile is exposed to stress, it improves and becomes stronger. Even though the concept is presented in opposition to resilience, I see it as a necessity for resilience, as I believe that resilience can't exist without some antifragility. Even if a resilient entity is no longer antifragile today, it probably was while it was developing. I also argue that without a constant application of antifragility, any strong system would most likely become fragile sooner or later.
This concept can be seen everywhere in nature, for example, in evolution. Think about bacteria that are exposed to antibiotics that are strong but not strong enough to kill them. In the right conditions, this exposure will not make the bacteria weaker. Instead, those same bacteria can develop resistance to those antibiotics. The same happens with people, both physically and mentally; if someone never exposes their body to even the slightest stress, it will most likely atrophy, and sometimes that might even lead to disease. Businesses are no exception. Businesses that have been exposed to stresses for a longer time and have survived are more likely to keep surviving than others. This is sometimes described as the Lindy Effect.
Applying the concept of antifragility
Ok, but how can you apply this concept to your business? Well, most likely, you're already applying it, whether you want to or not. Your business is exposed to stressors: bad clients, bad products, bad employees, recessions, etc. The more stresses it is exposed to and recovers from, the more likely it is to survive in the future. The same goes for the people you work with. Having said that, there is a more active approach: purposely exposing your business to stresses.
"What are you saying? That I should attack myself?" In one word: yes. In a sentence: "Yes, but not so much that you cause permanent damage, and just enough that you cause positive change." If you exercise, you know the difference between overtraining, undertraining, and training effectively. You want to reach a balance where you are imposing enough stress on your systems to create change but not enough to damage them.
No, I'm not saying people should be under constant stress; this is a book about data science and artificial intelligence, after all. What I am saying is that your systems should be exposed to these stresses. Doing so will help expose their weaknesses, so your team can then work on them. To minimize the strain on your team, I suggest you do it first during the development stage, even though others swear by aggressively attacking systems in production, arguing that it is the only "realistic way" to test and stress your system.
In either case, all companies are different, so try to understand what would best fit your context.
Before you attack yourself, you can always cheat and put protections in place first. But always test after you do. Think of the following cases:
- What would happen if someone spammed your social media?
- What would happen if someone spammed your website with fake comments?
- What would happen if your website went down today?
- What would happen if 5000 people visited your website in the next 5 minutes?
- and so on.
Now ask yourself the question: "Are we ready for any of this?" Answering it will help you understand what you can do to protect yourself.
Finally, do it. Expose your systems to a little bit more stress than they are used to. Trust me; it's much better when you control and schedule the attack than when it comes from a malicious person or company or, more commonly, from bugs in your system at a completely unexpected time.
If you were able to handle the situation successfully, good. If not, learn from that experience, plug all the leaks and weaknesses, and try again. With each round, your system will become more resilient and better prepared for whatever gets thrown at it.
Netflix is known for applying this concept very effectively. They even have automated programs that will shut off internal systems, just to make sure the whole Netflix system still works, even when some of its subsystems are down. These tests include shutting down entire cloud regions or services while keeping the system working. Now that's resiliency.
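To make the idea concrete, here is a toy simulation of that kind of experiment in Python. It is only a sketch and not Netflix's actual tooling: the subsystem names, the rule for what counts as "usable", and the helper functions are all hypothetical, and a real experiment would switch off real services instead of entries in a dictionary.

```python
import random

# Hypothetical subsystems; True marks the ones the product cannot run without.
SUBSYSTEMS = {
    "video_streaming": True,
    "recommendations": False,
    "search": False,
    "ratings": False,
}


def system_is_usable(down):
    """The system counts as usable while every critical subsystem is up."""
    return not any(SUBSYSTEMS[name] for name in down)


def chaos_round(rng, candidates, how_many=1):
    """Shut down `how_many` random subsystems and report what happened."""
    down = set(rng.sample(sorted(candidates), how_many))
    return down, system_is_usable(down)


def run_experiment(rounds=100, seed=42):
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    failures = []
    for _ in range(rounds):
        down, usable = chaos_round(rng, SUBSYSTEMS.keys(), how_many=2)
        if not usable:
            failures.append(down)
    return failures
```

Every entry returned by run_experiment() is a combination of outages that took the whole system down, which points directly at the subsystems that still need a backup path.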
Microservices
At the risk of oversimplifying, microservices are like micro business units, each usually responsible for a single, simple task in a business.
If you have a business, you could have a microservice for product recommendations, a microservice for churn prediction, a microservice delivering sales reports, and so on. Usually, they do one thing, and they do it well.
Even though microservices have their disadvantages, such as the additional complexity of a distributed system or the sometimes necessary additional resources, there are many advantages of microservices for creating resilience. Because they represent simple micro "business units", they are usually:
- Highly maintainable - usually small projects are easier to understand and modify;
- Highly testable - tests are focused on that single system, making it easy to explore all relevant possibilities;
- Independently developed - the system needs to do what other systems ask and expect of it, nothing more, nothing less;
- Independently deployable - deploying the system is mostly independent of other systems;
- Independent of other systems - good microservices assume other microservices will fail and create their own backup paths for when this happens;
- Capable of being developed by a small team - because they are small projects, allowing for higher speed and faster onboarding of new developers.
Because they can be designed to assume that other microservices may or may not be available, they also contribute to a more resilient system. When they are designed this way, even if 50% of your microservices go down, your whole system can still operate, maybe suboptimally, instead of going down completely.
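A backup path of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the recommendations service, the ServiceUnavailable error, and the best-sellers list are hypothetical, and the network call is simulated with a simple flag.

```python
class ServiceUnavailable(Exception):
    """Raised when a (hypothetical) downstream microservice cannot be reached."""


def fetch_recommendations(user_id, service_up=True):
    # Stand-in for a network call to a recommendations microservice.
    if not service_up:
        raise ServiceUnavailable("recommendations service is down")
    return [f"personalized-item-{user_id}-{n}" for n in range(3)]


BEST_SELLERS = ["best-seller-1", "best-seller-2", "best-seller-3"]


def product_page(user_id, service_up=True):
    """Build the page; degrade to best sellers if recommendations fail."""
    try:
        items = fetch_recommendations(user_id, service_up=service_up)
        degraded = False
    except ServiceUnavailable:
        items = BEST_SELLERS  # backup path: generic but still useful
        degraded = True
    return {"user": user_id, "items": items, "degraded": degraded}
```

When the recommendations service is down, the page still renders with generic best sellers. The customer gets a slightly worse experience instead of an error page, which is the suboptimal-but-running behavior described above.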
What about Security?
Following the concepts of testing and antifragility, I believe the best way to achieve security is through penetration testing and bug bounties.
A bug bounty is a reward offered to hackers who find critical vulnerabilities in a system. Penetration testing, in contrast, usually means hiring a security team of hackers who will try by any means to break into your systems, and sometimes even your physical infrastructure, to acquire sensitive data or take control of a critical system.
If it sounds scary, it's because it is, and also because you have probably never exposed your systems to this kind of test. The truth is, your systems are likely to have bugs and vulnerabilities right now, exposed to the outside. Hopefully, they haven't been attacked yet. The number of cyberattacks is increasing each year, and it's not expected to stop anytime soon as technology becomes more and more a part of our lives.
Privacy by design
Even some of the most secure systems get hacked sometimes, and information gets exposed. Think what you like, but your company is no different. It is hackable.
If you are hacked, you will be grateful that you applied privacy by design to your business.
If you use strong encryption schemes to encrypt your client, financial, and product data, you might have a second layer of security provided by that encryption even if you're hacked. If you manage to protect the keys, all the attackers will get is a bunch of encrypted data, useless without the right encryption keys. Keep in mind, though, that this encrypted data is only as safe as your encryption keys. If these are exposed somehow, the encryption is useless.
Moving forwards
Now that you know the importance of having resilient systems, we can talk about creating real change and improvement in your organization. Without resilience, you might have the most advanced, automated, AI-powered system, and it could still fail at the slightest error or cyberattack. You probably don't want that.
In the next chapter, we'll talk about using data and AI to improve your current business processes by identifying the best opportunities for their application and ensuring real impact. See you there.