Several cases of Agile adoption in big and mid-size companies are presented, along with the key benefits, challenges, and outcomes of an Agile adoption.
Today, over 50% of the Fortune 500 companies from the year 2000 no longer exist. GE is stumbling. BlackBerry (RIM) is gone, and so is most of Nokia, which had risen to a $150 billion corporation. (…) John Boyd developed a methodology for operating in such situations, called the OODA Loop. The speed of executing the loop is the essential element of survival. It involves testing one’s premises by actual Observation, Orienting your corporation with respect to the situation, then Deciding on a course of action, and then executing that plan by Acting. This is the meaning of being Agile. (…) Data is the new gold.
MIT – Cloud & DevOps course – 2020
Pros of Agile software development:
Customers have frequent and early opportunities to see the work being delivered and to make decisions and changes throughout the development of the project.
The customer gains a strong sense of ownership by working extensively and directly with the project team throughout the project.
If time to market is a greater concern than releasing a full feature set at initial launch, Agile is best. It will quickly produce a basic version of working software that can be built upon in successive iterations.
Development is often more user-focused, likely a result of more frequent direction from the customer.
Cons of Agile software development:
Agile requires a high degree of customer involvement in the project, which may be a problem for customers who simply do not have the time or interest for this type of participation.
Agile works best when the development team is completely dedicated to the project.
The close working relationships in an Agile project are easiest to manage when the team members are located in the same physical space, which is not always possible.
The iterative nature of Agile development may lead to frequent refactoring if the full system scope is not considered in the initial architecture and design. Without this refactoring, the system can suffer from a reduction in overall quality. This becomes more pronounced in larger-scale implementations, or with systems that include a high level of integration.
Managing Complexity of Organizations and Operations
As companies grow, their complexity grows, and they have to manage that complexity, otherwise it will turn into chaos. The problem is that they usually manage it by putting processes in place: you have to sign X documents, follow Y procedures, etc. This curtails employee freedom, and the side effect is that high-performing employees tend to leave the company.
Netflix’s solution to this scenario was different. They decided to let the smart workers manage the complexity instead of putting processes in place.
The problem with the traditional approach is that when the market shifts we are unable to move fast. We have so many processes and such a fixed culture that our teams won't adapt, and innovative people won't stay in these environments.
That leaves us with three bad options for managing our growing organizations:
Stay a creative, small company (less impact)
Avoid rules (and suffer the chaos)
Use processes (and cripple the flexibility and ability to thrive when the market changes)
Back to the Netflix case: they believed that high-performing people can contain the chaos. With the right people, instead of a culture of process adherence, you get a culture of creativity and self-discipline, freedom, and responsibility.
Comparing the waterfall and Agile software development models
Every cloud journey has an important decision point: which cloud are we going to? Below is a decision matrix recently developed as the first step of an important cloud journey about to start. There are three main areas in this article: the Goals of the journey, the Adopted Criteria, and the Final Decision.
1 – Goals for the cloud migration
Overall company speed – essential for keeping competitive time to market.
Team autonomy – one more important move to keep time-to-market as fast as possible and foster DevOps adoption.
Cost savings – use the cloud's pay-as-you-go benefit.
Security – improve security while handing over some of the key concerns to the cloud provider.
Ease to configure security – AWS gets the higher score according to specialists due to the granularity it allows; Google gets a higher score due to its ease of configuration and abstraction capacity. AWS is also the vendor that invests the most in security.
Security community support – AWS has the bigger community.
2.7. Machine Learning and Data Science features
Why this criterion is important: looking to the future, it is important to think about new services to be consumed. This feature received a low maximum score because it is not critical for this stage of the cloud adoption.
Professionals availability
Why this criterion is important: the ability to hire qualified professionals for the specific cloud vendor is crucial for the application lifecycle. This research was performed on LinkedIn with the query “certified cloud architect <vendor>”.
183k people found
90k people found
22k people found
2.10. Professionals cost
Why this criterion is important: just as important as professionals' availability, the cost of hiring each of these professionals is also something to keep in mind.
There was no difference found between the professionals of each vendor.
2.11. Companies already present in each cloud
Why this criterion is important: taking a look at other companies helps to understand where the biggest and most innovative companies are heading. If they are heading there, there must be a good reason for it.
Why this criterion is important: since this is intended to be a company-wide adoption, some areas will have more or less maturity to migrate to the new paradigm of cloud-native software development. The more the cloud provider can assist with simpler migration strategies, such as an “as is” migration, the better for this criterion.
Because the company has a large number of Windows-based services, Microsoft's native tools have an advantage
No resources were found to keep cloud and on-premises workloads working together
3 – The final result
The final result of this comparison is presented below. With it, I intend to help your own cloud journey decisions, but please do not stick only to the criteria presented here. Always look at what makes sense for your company and business cases.
This adoption must also come hand in hand with an internal plan to improve people's knowledge of the selected cloud. The cloud brings several benefits compared to on-premises services, but, like everything in life, there are trade-offs, and new challenges will appear.
Introduced the serverless paradigm: its pros and cons, its limits, and the evolution that led to it
Serverless computing is a cloud-based execution model in which the management of the server infrastructure and the application development are distinctly divided. The connection is frictionless: the application does not need to know what it is being run or provisioned on, just as the infrastructure does not need to know what is being run on it.
The journey that led us to serverless (image below).
A true microservice:
Does not share data structure and database schema
Does not share internal representation of objects
Can be updated without notifying other teams
Your functions become stateless: you have to assume your function may always run in a new, recently deployed container (see the sketch after the lists below).
Cold starts: since your function may run in a brand-new container, you have to expect some latency while the container is spun up. After the first execution, the container is kept around for a while, and subsequent calls become “warm starts”.
Pros of serverless:
Cloud provider takes care of most back-end services
Autoscaling of services
Pay as you go and for what you use
Many aspects of security provided by cloud provider
Patching and library updates
Software services, such as user identity, chatbots, storage, messaging, etc
Shorter lead times
Cons of serverless:
Managing state is difficult (leads to difficult debugging)
Complex message routing and event propagation (harder to track bugs)
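As referenced above, here is a minimal sketch of what statelessness and cold vs. warm starts look like in practice, assuming an AWS Lambda-style Node.js handler; the variable names and response shape are illustrative, not from the course material:

```javascript
// handler.js - hypothetical AWS Lambda-style function in Node.js
// Module scope runs once per container: a cold start pays this initialization
// latency; warm starts reuse the same container and skip it.
const containerStartedAt = new Date().toISOString();
let invocationsInThisContainer = 0; // survives only while the container is warm

exports.handler = async (event) => {
  invocationsInThisContainer += 1;
  // Never rely on this counter (or local files) as real state: a new container
  // resets it, so durable state must live in an external store (DB, S3, etc.).
  return {
    statusCode: 200,
    body: JSON.stringify({
      containerStartedAt,
      invocationsInThisContainer,
      coldStart: invocationsInThisContainer === 1,
    }),
  };
};
```

The first call in a fresh container reports coldStart: true; repeated calls while the container is kept warm reuse the module scope and report increasing invocation counts.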
Introduced the usual high-level phases of a Digital Transformation (and cases exploring them), which are:
1 – Initial Cloud Project
2 – Foundation
3 – Massive Migration
4 – Reinvention
Cloud computing services are divided into three main categories, plus the umbrella term XaaS:
IaaS – using the computational power of cloud data centers to run your previously on-prem workloads.
PaaS – using pre-built components to speed up your software development. Examples: Lambda, EKS, AKS, S3, etc.
SaaS – third-party applications allowing you to solve business problems. Examples: Salesforce, Gmail, etc.
XaaS – Anything as a service.
An abstraction of the overall phases of adoption:
1 – Initial Cloud Project – Decide and execute the first project
2 – Foundation – Building blocks: find the next steps to solve the pains of the organization. Provide an environment that makes going to the cloud more attractive to the business units. Examples: increase security, increase observability, reduce costs.
1st good practice: During this phase, you can create a “Cloud Center of Excellence” committee to start creating tools to make the cloud shift more appealing to the rest of the organization.
2nd good practice: Build reference architectures to guide people with less knowledge.
3rd good practice: Teach best practices to the other business units getting engaged.
3 – Migration – Move massively to the cloud
One possible strategy is to move as-is and then modernize the application in the future (the step below).
4 – Reinvention – modernize the apps (here you start converting private software to open source, Machine Learning, Data Science, etc).
See the picture below for an illustration of these 4 steps:
The pace of adoption is always calm, even for aggressive companies: Netflix, for example, took 7 years to become a cloud-first company.
“The best way to avoid failure is to fail constantly”
“The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most in the event of an unexpected outage”
The DevOps revolution: importance of continuous feedback, data-driven decisions, pillars of DevOps and metrics
Today, software development is no longer characterized by designers throwing their software “over the wall” to testers, who then repeat the process with software operations. These roles are now disappearing: today software engineers design, develop, test, and deploy their software by leveraging powerful Continuous Integration and Continuous Delivery (CI/CD) tools.
Docker, Containers Orchestration and Public Key Infrastructure (PKI)
How the stack of software components used to run an application has become more and more complex compared to past years.
In past years, a huge number of web applications ran on top of LAMP (Linux, Apache, MySQL, and PHP/Perl). Nowadays we have several different possible approaches for each of the layers in this acronym.
Containers are the most recent evolution we have for running our apps. They followed these steps:
The dark age: undergoing painful moments to run your app on a new machine (probably spending more time getting the app to run than actually writing it).
Virtualizing (using VMs) to run our apps, but having the trade-off of VMs’ slowness.
Containers – a lightweight solution that allows us to write our code on one operating system and then easily run it again on another operating system.
The difference between Virtual Machines and Docker:
The analogy between how humanity solved the problem of transporting goods across the globe using (real, physical) containers and how software developers use the container abstraction to make it far easier to run an application for the first time.
Kubernetes is introduced and its benefits covered:
Less work for DevOps teams.
Easy to collect metrics.
Automation of several tasks like metrics collection, scaling, monitoring, etc.
Public key infrastructure:
The growing need for machine-to-machine interaction requires more sophisticated methods of authentication than username and password.
Private and public keys are used to sign and encrypt/decrypt messages and communications.
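A small sketch of the idea using Node.js's built-in crypto module; the key size and the message are illustrative choices, not prescribed by the course:

```javascript
// pki-demo.js - public/private key encryption with Node.js's crypto module
const crypto = require('crypto');

// Generate an RSA key pair (in a real PKI the public key would be wrapped
// in a certificate signed by a Certificate Authority).
const { publicKey, privateKey } = crypto.generateKeyPairSync('rsa', {
  modulusLength: 2048,
});

// Anyone holding the public key can encrypt a message for the key owner...
const encrypted = crypto.publicEncrypt(publicKey, Buffer.from('machine-to-machine secret'));

// ...and only the private key holder can decrypt it.
const decrypted = crypto.privateDecrypt(privateKey, encrypted);
console.log(decrypted.toString()); // "machine-to-machine secret"
```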
Exercises and Assignments
Exercise 1: Running a simple node app with docker (building and running the app)
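A hedged sketch of what this exercise looks like in practice; the file names, base image, and port are assumptions of this sketch, not the course's actual assignment files:

```dockerfile
# Dockerfile - minimal image for a simple Node app
# (illustrative sketch; file names, base image, and port are assumptions)
FROM node:14-alpine
WORKDIR /app
# Install dependencies first so Docker can cache this layer
COPY package*.json ./
RUN npm install
# Copy the application source and declare the port the app listens on
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]
```

Building and running would then be something like `docker build -t simple-node-app .` followed by `docker run -p 3000:3000 simple-node-app`.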
The course has two big parts: (1) technical base and (2) business applications and strategies
The second module introduces the benefits, trade-offs, and new problems of developing applications for scaling. It also covers the complexity of asynchronous development.
To start with, they approached the whole web concept (since its creation by Tim Berners-Lee).
How Google changed the game by creating Chrome and the V8 engine.
The creation of Node.JS.
Implementing a simple webserver on DigitalOcean.
The evolution of complexity from the web's first steps to where we are today: open source, the JSON format, IoT, and, more recently, Big Data and Machine Learning.
Doing computation in an asynchronous world/architecture.
Exercises and Assignments
Exercise 1: forking a project at GitHub.com and sending a pull request back.
Assignment 1: Running a simple Node application locally (a Pac-Man game) to understand the communication between the client (browser) and the server (Node.JS), and also retrieving metrics through an endpoint using JSON as the communication format.
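A minimal sketch of the kind of metrics endpoint the assignment describes, assuming plain Node.js with no framework; the route name and the metrics themselves are illustrative assumptions:

```javascript
// metrics-server.js - tiny Node.js server exposing metrics as JSON (illustrative)
const http = require('http');

// Hypothetical in-memory counters the game/server would update while running.
const metrics = { gamesStarted: 0, requestsServed: 0 };

const server = http.createServer((req, res) => {
  metrics.requestsServed += 1;
  if (req.url === '/metrics') {
    // The browser (client) fetches this endpoint and parses the JSON payload.
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(metrics));
  } else {
    res.writeHead(404);
    res.end();
  }
});

server.listen(3000, () => console.log('Listening on http://localhost:3000'));
```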
The first module brings everybody's knowledge up to date about the internet and the evolution of software development practices.
The assignments of the first module are technically simple.
Disclaimer: I won’t post the course content and deeper details here for obvious reasons. Everything mentioned here is my learning and key takeaways from each class/content.
The first module is very introductory. Concepts like the creation of the internet and how the flow of information evolved from the first internet connection to the cloud are approached very briefly.
More than being introductory, it is very straightforward and hands-on (which I consider great). There are forum discussions for the participants to get to know each other, and an open Q&A about the exercises and assignments.
Exercises and Assignments
Exercise 2: examining a BIG JSON in the Chrome console to show how things can get complex eventually.
Exercise 3: running a simple Node app to analyze the BIG JSON file from Exercise 2.
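A small sketch of what such an analysis script could look like, assuming a local file named big.json; the file name and the fields inspected are assumptions:

```javascript
// analyze-json.js - load a large JSON file and print a few basic facts (illustrative)
const fs = require('fs');

// Read and parse the whole file in memory; fine for an exercise,
// though a streaming parser would be needed for truly huge files.
const raw = fs.readFileSync('big.json', 'utf8');
const data = JSON.parse(raw);

console.log('File size (bytes):', Buffer.byteLength(raw));
console.log('Top-level keys:', Object.keys(data));
if (Array.isArray(data.items)) { // "items" is an assumed field name
  console.log('Number of items:', data.items.length);
}
```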
Assignment 2: Creating a simple static personal website at GitHub.io. For this one, I went a bit further and added a small set of free pre-built static CSS and HTML to reach something better than just a “hello world”: https://guisesterheim.github.io/
Amid the COVID-19 news, there is one single thing driving business leaders nowadays: how can I do digital business? We are living through an evolution in our consumption patterns. The retail industry (using Via Varejo as the example here) has already changed: Via Varejo managed to keep 70% of its regular revenue even with its more than one thousand physical stores closed. And they kept that number through the single store that never closes: e-commerce.
Pros and cons highlighted by COVID-19
How did Via Varejo (and Zoom, to mention a non-retail example) sustain all of the infrastructure needed to maintain their results during this crisis? They certainly invested differently in technology and in their products' software architecture than a few bad examples. A friend of mine reported a 12-hour virtual queue to shop at a supermarket in Lisbon; even after making his purchase, he had to wait about a week to receive his groceries at home. Netflix and YouTube reduced their bandwidth consumption in Europe so as not to overload the bandwidth available in that region. And a worst-case scenario happened to Clear Corretora: its system went down a couple of times, and now its clients are asking for refunds for operations they could not execute because the software was unavailable.
Software is present in every organization that wants to grow at scale. Software helps a company's employees be more productive and reduces manual, repetitive work. Software can have the best possible interface, tuned entirely for people's productivity, but as a baseline it must work when it needs to work. When software that is vital to an organization fails, the difference between regular software and good software is discovered. And here I am talking about its architecture.
When the architecture makes the difference
To discuss this subject, I'll cite two real cases from two clients in the media industry. They are close to me and had their business growth held back for a while due to poor-quality software.
Please, no more load
The software in question was a portal through which the Brazilian population accessed a well-known TV show. Under normal circumstances, the portal handled everything. But when one of the TV presenters mentioned the portal while live on the national network, it was deadly: we just had to count the minutes until the portal went down. It could be something as simple as saying live “access our website for a chance to win a prize” or “access our website to talk to actress X”. The problem that repeatedly took the portal down was technical and related to its architecture: it was not prepared to receive as much traffic as it actually did. In this example, we see a powerful call to action to an entire population wasted because of a software malfunction. Customers who were already convinced to look for something automatically lost interest. The TV show's image was affected negatively; it failed to enhance its brand and increase its engaged audience, and it eventually lost some revenue.
The second example is a scenario where the software was a big aggregator of information coming from many different sources and was responsible for important transactions related to the company's invoices. It operated fine for many months after its first release, but when the database came under heavier load, it started to behave slower than it used to. At the same time, a big project to renew the entire UI was under way, and it is a big failure if something beautiful is released without actually working. The loading screens were too slow, forcing the user to wait many seconds for any feedback. The impact on the business was significant because the selling process for a few of their products was also affected.
Solutions and business relief
For both scenarios, a new architecture is live now. The first one focused heavily on caching; the second focused on shortening the time needed to retrieve information. But both went through a deep architecture remodeling.
Nowadays the first one has millions of users reaching the portal to interact with the brand. It also has a stable portal, which allows decision-makers to make wiser decisions than when they were under pressure. The second was able to release the new UI, which improved the relationship with demanding B2B customers, and no more transactions are lost. Now the roadmap is welcome again.
Costs. An application that is not prepared to scale according to its demand will cost more to keep than one that is. The cloud advantage must be used to the fullest in this scenario. An example: websites that sell concert tickets spend a huge portion of their time under low or regular demand, but when a well-known group announces a new concert, thousands of people rush to the site to get a ticket (find a similar reading here). A useful analogy: if you don't prepare your application to scale according to demand, it's like always driving an RV even when you are just going to the supermarket rather than on vacation. You don't need to carry 5 or 6 people with you to the supermarket, just as your app doesn't need to be fully armed at 3 a.m.
As we can see in the image below, people use the internet with different intensity at different times of the day.
Peaks of usage. Maintaining a portal with around 100 visits per day (like my own) is fine, but a different approach is needed for one with one million views in the same timeframe. More important than that, be prepared for peaks of usage to maintain the brand's reliability and the company's growth. Zoom is an excellent and successful example of application scaling, but it is the minority amid hundreds of bad examples that are impacting our lives (e.g., see New Jersey's government asking for help with a very, very old application).
How to prepare a fast and reliable architecture
Architecting for scalability
Use the advantages of the cloud's existing tools. All cloud players have efficient tools for load balancing the application's requests: Microsoft's Load Balancer, Google Cloud Load Balancing, and AWS Elastic Load Balancing are very easy to set up. Once the load-balancing rules are defined, auto-scaling groups improve the application's power to handle requests. Using auto-scaling groups you can set different behaviors for your app, based both on user demand and on patterns you already know exist (driving the RV at 3 a.m.). If all of this is new to you, keep in mind that new solutions bring new challenges. Listed below are a few things you have to look at when setting up auto-scaling behavior (a scheduled-scaling sketch follows the list):
Speed to start up a new server – When you need to scale, you will probably need to scale FAST. To help with that, have pre-built images (AWS AMIs) to speed up your new servers' boot time. Kubernetes orchestrating your app will also help with this.
Keeping database consistency – Luckily, the big cloud players have solutions to keep databases synchronized between your different availability zones almost seamlessly. But once you start working with multiple regions, this becomes one more thing to plan for and handle.
Keeping low latency between different regions – Multiple regions can solve latency for your users, but they bring latency to you. Whether you are building a disaster-recovery plan or just moving infrastructure closer to your users to reduce their latency, the latency between regions has to be mitigated both in your databases and in your app's communications.
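As referenced above, here is a hedged sketch of scheduled scaling (the "RV at 3 a.m." case) using the AWS SDK for JavaScript v3; the group name, capacities, region, and schedule are assumptions, and the same result can be achieved in the console or with infrastructure-as-code tools:

```javascript
// scheduled-scaling.js - shrink an Auto Scaling group overnight (illustrative sketch)
const {
  AutoScalingClient,
  PutScheduledUpdateGroupActionCommand,
} = require('@aws-sdk/client-auto-scaling');

const client = new AutoScalingClient({ region: 'us-east-1' }); // region assumed

async function scheduleNightScaleDown() {
  // Every day at 03:00 UTC, drop the group to a single instance;
  // a second action (not shown) would scale it back up in the morning.
  await client.send(new PutScheduledUpdateGroupActionCommand({
    AutoScalingGroupName: 'web-app-asg', // hypothetical group name
    ScheduledActionName: 'night-scale-down',
    Recurrence: '0 3 * * *',             // cron expression, UTC
    MinSize: 1,
    MaxSize: 2,
    DesiredCapacity: 1,
  }));
}

scheduleNightScaleDown().catch(console.error);
```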
The attention points above pay off. Once you have everything set up, the cloud can maintain itself: watching for alerts on CPU, memory, network, and other usage, and triggering self-healing actions, will be part of its day.
Architecting for reliability
To increase your app's reliability, I list a few good strategies to apply:
Testing at both the infrastructure and the app level. Adding several layers of tests and health checks is the most basic action for reliability (a minimal health-check sketch follows this list).
Architecting for multi-region. Failover and disaster-recovery strategies range from pilot light (slower), through warm standby, to active/active multi-region (faster). The fastest one (active/active) requires exactly the same infrastructure to be deployed in two regions, and an intelligent DNS routing rule has to be set up.
Reducing risk with infrastructure deployment automation. Examples are CloudFormation (AWS), Azure Resource Manager templates (Microsoft), and Cloud Deployment Manager (Google). These services give you a single source of infrastructure description to be used across multiple regions.
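As mentioned in the first item of this list, here is a minimal health-check sketch in Node.js; the dependency being checked is a stand-in, and real setups usually verify several dependencies (database, cache, downstream APIs):

```javascript
// health-server.js - basic health-check endpoint for load balancers (illustrative)
const http = require('http');

// Hypothetical check of a critical dependency; replace with a real DB ping,
// cache ping, downstream API call, etc.
async function dependencyIsHealthy() {
  return true;
}

const server = http.createServer(async (req, res) => {
  if (req.url === '/health') {
    const ok = await dependencyIsHealthy();
    // Load balancers and orchestrators route traffic away from instances
    // that answer anything other than 200 here.
    res.writeHead(ok ? 200 : 503, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ status: ok ? 'up' : 'degraded' }));
  } else {
    res.writeHead(404);
    res.end();
  }
});

server.listen(8080, () => console.log('Health check on http://localhost:8080/health'));
```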
Architecture is a living subject, just like digital products. Pursuing scalability and reliability in the same environment will get you a fast and reliable architecture.