Our old monolithic applications used to be very simple, and keeping everything in sight was easy: we had one or two databases, one or more app servers, and that was it (even then, everything ready to descend into chaos). Modern architecture patterns come with one big tradeoff: they require a plethora of components, and that makes it much harder to keep an eye on such a big environment.
For that reason, we need to plug in applications that help us with observability, and even to develop our own tools, so we can understand what is going on with our apps. Otherwise we can easily fall down a rabbit hole, inspecting every edge of the system to find the root cause that is slowing our clients down.
The Market Options
Speaking of tools, I compared Dynatrace, New Relic, Elastic, and Splunk. For now, New Relic is the chosen one, simply due to budget. Elastic still seems to lack some features, although it is catching up quickly. The AI-powered features in Splunk, Dynatrace, and New Relic are amazing.
Adding New Relic to your microservice
In this article I’m going to cover adding New Relic observability to Java microservices.
Create your free account and grab your account ID at https://newrelic.com/signup. The free account allows you to upload up to 100 GB of data per month.
Then you’ll need the YAML file with the configuration for your account and app:
Once you are logged in, click “APM” on the top bar, then click “Add more” in the top-right area. This opens a tab with several options. Click Java and you will see something like this:
Also use the command with the black background to download the New Relic jar you are going to need in a later step.
Once you have downloaded the zip file, unzip it and grab the newrelic.jar.
Finally, place the generated YAML file along with the jar in a folder that your project can access.
At last, change your Dockerfile to include the -javaagent flag (see the sketch below).
PS: in this example I’m adding a specific version of the New Relic jar. For production purposes, I recommend storing the jar in your own artifact repository, such as Nexus or JFrog.
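For reference, here is a minimal sketch of that Dockerfile change, written as a shell snippet that generates the file. The base image, jar names, and paths below are placeholders, not the exact ones from this setup:

# a sketch only: base image, file names, and paths are placeholders
cat <<'EOF' > Dockerfile
FROM openjdk:11-jre-slim
# application jar plus the New Relic agent and its config copied into the image
COPY target/app.jar /app/app.jar
COPY newrelic/newrelic.jar /app/newrelic.jar
COPY newrelic/newrelic.yml /app/newrelic.yml
# the -javaagent flag attaches the New Relic agent to the JVM at startup
ENTRYPOINT ["java", "-javaagent:/app/newrelic.jar", "-jar", "/app/app.jar"]
EOF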
Watching your app
A few minutes after deploying your pod, you’ll be able to log in to New Relic and start seeing the dashboards built for your application.
Important metrics you can see right away, without any additional configuration:
Throughput (requests per minute).
App server response time (the lower the better).
Most called URLs.
Time spent processing each request, broken down into layers: app server, database, and so on.
TL;DR: an example of Jenkins as code. At the end of the article there is a step-by-step guide to configuring your Jenkins as code using the Ansible tool and the Configuration as Code plug-in. The final OS image will have Docker, kubectl, Terraform, Liquibase, HAProxy (for TLS), Google SSO instructions, and Java installed for running the pipelines.
Why have our Jenkins coded?
One key benefit of having the infrastructure and OS level coded is the safety it gives to software administrators. Think with me: what happens if your Jenkins suddenly stops working? What if something happens and nobody can log into it anymore? If these questions give you chills, let’s code our Jenkins!
What we will cover
This article covers the tools presented in the image above:
Vagrant for local tests.
The Packer tool for creating your OS image with Jenkins ready to use.
Ansible for installing everything you need on your OS image (Jenkins, kubectl, Terraform, etc.).
JCasC (Jenkins Configuration as Code) to configure your Jenkins after it is installed.
You can also find some useful content for the Terraform part here and here.
Special thanks to the authors of the many Ansible roles I found on GitHub, and to geerlingguy for many of the playbooks we’re using here.
1. How to run it
Running locally with Vagrant to test your configuration
The Vagrantfile is used for local tests only; it is a pre-step before creating the image on your cloud with Packer.
Vagrant commands:
Prerequisites: (1) Vagrant installed (sudo apt install vagrant) and (2) Oracle’s VirtualBox.
How to run: navigate to the root of this repo and run sudo vagrant up. This creates a virtual machine and installs everything listed in the Vagrantfile. Once it finishes, Jenkins will be accessible from your host machine at localhost:5555 and localhost:6666.
How to SSH into the created machine: run sudo vagrant ssh.
How to destroy the VM: run sudo vagrant destroy.
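Putting those commands together, the local workflow looks like this (assuming Vagrant and VirtualBox are already installed):

# run from the root of the repo
sudo vagrant up        # creates the VM and provisions everything listed in the Vagrantfile
sudo vagrant ssh       # opens a shell inside the VM
sudo vagrant destroy   # tears the VM down when you are finished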
Using Packer to build your AMI or Azure VM image
Packer is a tool for creating an OS image (a VM image on Azure or an AMI on AWS).
Once you have your AMI or Azure VM image created, go to your cloud console and create a new machine pointing to the newly created image.
Check out the file packer_config.json to see how Packer will create your OS image, along with the Azure instructions for it.
PS: this specific packer_config.json file is configured to create an image on Azure. You can change it to run on AWS if you need to.
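As a sketch, the build itself is a single Packer command. The variable names below are placeholders and assume the template exposes the Azure service principal credentials as user variables:

# build the OS image on Azure; pass whatever user variables packer_config.json requires
packer build \
  -var "client_id=<your service principal client id>" \
  -var "client_secret=<your service principal secret>" \
  packer_config.json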
2. Let’s configure our Jenkins as Code!
I’m listing here a few key configurations among the several you will find in each of these Ansible playbooks:
Java version: on ansible_config/site.yml
Liquibase version: on ansible_config/roles/ansible-role-liquibase/defaults/main.yml
Docker edition and version
Terraform version
Kubectl packages (adding kubeadm or minikube, for example) in ansible_config/roles/ansible-role-kubectl/tasks/main.yml
Jenkins configs (covered in more detail below)
HAProxy for handling TLS/HTTPS (covered in more detail below)
3. Configuring your Jenkins
Jenkins pipelines and credentials files
This Jenkins is configured automatically using the Jenkins Configuration as Code plugin. All the configuration is listed in the jenkins.yaml file at the repo root. In that file, you can add your pipelines and the credentials for those pipelines to consume. Full documentation and possibilities can be found here: https://www.jenkins.io/projects/jcasc/
Below is the example you will find on the main repo:
Block 1 defines your credentials. There are a few possible credential types; check them all in the plugin’s docs.
Block 2 creates a folder to group the pipelines.
Block 3 creates one example pipeline job, fetching it from a private GitLab repo using the credentials defined in block 1.
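A trimmed-down, hypothetical sketch of what those three blocks can look like in jenkins.yaml; IDs, URLs, and names are placeholders, so check the JCasC and Job DSL docs for the exact syntax your plugin versions expect:

# a hypothetical jenkins.yaml fragment, printed only for illustration
cat <<'EOF'
# (1) credentials the pipelines can consume
credentials:
  system:
    domainCredentials:
      - credentials:
          - usernamePassword:
              scope: GLOBAL
              id: "gitlab-credentials"
              username: "deploy-user"
              password: "${GITLAB_TOKEN}"
# (2) a folder, plus (3) an example pipeline job pulled from a private GitLab repo
jobs:
  - script: |
      folder('pipelines')
      pipelineJob('pipelines/example-pipeline') {
        definition {
          cpsScm {
            scm {
              git {
                remote {
                  url('https://gitlab.com/my-org/example-repo.git')
                  credentials('gitlab-credentials')
                }
              }
              scriptPath('Jenkinsfile')
            }
          }
        }
      }
EOF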
Your hostname: change it to a permanent hostname instead of localhost once you configure TLS.
The list of plugins you want installed on your Jenkins.
You can change the default Jenkins admin password in ansible_config/roles/ansible-role-jenkins/defaults/main.yml via the attribute “jenkins_admin_password”. Check the image below:
You can change admin user and password
Another configuration you will change when activating TLS (https)
Jenkins’ configuration-as-code plug-in:
For JCasC to work properly, the jenkins.yaml file in the project root must be copied into Jenkins’ home (default /var/lib/jenkins/). This example has the keys (credentials) to be used by pipelines, and the pipelines themselves. There are a few more options in the JCasC docs.
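A small sketch of that step, assuming the default Jenkins home and that the file lives at the repo root:

# copy the JCasC file into Jenkins' home and restart so it gets picked up
sudo cp jenkins.yaml /var/lib/jenkins/jenkins.yaml
sudo chown jenkins:jenkins /var/lib/jenkins/jenkins.yaml
sudo systemctl restart jenkins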
Activating TLS (https) and Google SSO
As shown in the images of the “Jenkins Configuration” step: go to ansible_config/roles/ansible-role-jenkins/defaults/main.yml, uncomment line 15 and change it to your final URL, and comment out line 16.
Go to ansible_config/roles/ansible-role-haproxy/templates/haproxy.cfg and change line 33 to use your organization’s final URL.
Rebuild your image with Packer (IMPORTANT: your new image won’t work locally anymore, because you changed the Jenkins configuration).
Go to your cloud and deploy a new instance using the newly created image.
3.1 – TLS: once you have your machine up and running, connect through SSH to perform the last manual steps, TLS and Google SSO authentication:
Generate the .pem certificate file with the command cat STAR.mycompany.com.crt STAR.mycompany.com.key > fullkey.pem. Remember to remove the blank line left inside the generated fullkey.pem between the two certificates. To inspect the file, use cat fullkey.pem.
Move the generated file to your running instance’s folder /home/ubuntu/jenkins/
Restart HAProxy with sudo service haproxy restart
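Putting the TLS steps together on the instance (file names follow the example above):

cat STAR.mycompany.com.crt STAR.mycompany.com.key > fullkey.pem
sed -i '/^$/d' fullkey.pem        # one way to drop the blank line left between the two blocks
cat fullkey.pem                   # eyeball the result
sudo mv fullkey.pem /home/ubuntu/jenkins/
sudo service haproxy restart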
Done! Your Jenkins is ready to run under https with valid certificates. Just point your DNS to the running machine and you’re done.
3.2 – Google SSO:
Log in to Jenkins using regular admin credentials. Go to “Manage Jenkins” > “Global Security”. Under “Authentication” select “Login with Google” and fill in like below:
Client id = client_id generated on your G Suite account.
TL;DR: 7 resources will be added to your Azure account. 1 – Configure Terraform to save state lock files on Azure Blob Storage. 2 – Use Terraform to create and keep track of your Service Bus Queues
Azure Service Bus offers two ways of interacting with it: Queues and Topics (roughly the equivalents of SQS and SNS on AWS, respectively). Take a look at the docs on the difference between them and check which one fits your needs. This article covers Queues only.
What are we creating?
The GRAY area on the image above shows what this Terraform repo will create. The retry queue automation on item 4 is also created by this Terraform. Below is how the information should flow in this infrastructure:
Microservice 1 generates messages and posts them to the messagesQueue.
Microservice 2 listens for messages from the queue and processes them. If it fails to process a message, it posts it back to the same queue (up to 5 times).
If it fails more than 5 times, it posts the message to the Error Messages Queue.
The Error Messages Queue automatically posts the errored messages back to the regular queue after one hour (this parameter can be changed in modules/queue/variables.tf).
Whether it ends in error or success, Microservice 2 should always post log information to the Logging Microservice.
Starting Terraform locally
To keep track of your Infrastructure with Terraform, you will have to let Terraform store your tfstate file in a safe place. The command below will start Terraform and store your tfstate in Azure Blob Storage. Use the following command to start your Terraform repo:
terraform init \
-backend-config "container_name=<your folder inside Azure Blob Storage>" \
-backend-config "storage_account_name=<your Azure Storage Name>" \
-backend-config "key=<file name to be stored>" \
-backend-config "subscription_id=<subscription ID of your account>" \
-backend-config "client_id=<your username>" \
-backend-config "client_secret=<your password>" \
-backend-config "tenant_id=<tenant id>" \
-backend-config "resource_group_name=<resource group name to find your Blob Storage>"
If you don’t have the information for the variables above, take a look at this post to create your user for your Terraform+Azure interaction.
Should everything go well, you will get a screen similar to the one below, and we are ready to plan our infrastructure deployment!
Planning your Service Bus deploy
The next step is to plan your deployment. Use the following command so Terraform can prepare to deploy your resources:
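A sketch of that plan command, assuming your root module only needs the same service principal variables used in terraform init (add whatever extra variables your queue module declares):

terraform plan \
  -var 'client_id=<your username>' \
  -var 'client_secret=<your password>' \
  -var 'subscription_id=<subscription ID of your account>' \
  -var 'tenant_id=<tenant id>' \
  -out tfout.log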
TL;DR: 3 resources will be added to your Azure account. 1 – Configure Terraform to save state lock files on Azure Blob Storage. 2 – Use Terraform to create and keep track of your AKS. 3 – How to configure kubectl locally to set up your Kubernetes.
This article follows the best practices and benefits of infrastructure automation described here. Infrastructure as code, immutable infrastructure, more speed, reliability, auditing, and documentation are the concepts this article will help you achieve.
Terraform has a good how-to on authentication. In that link you’ll find how to retrieve the following required authentication data:
subscription_id, tenant_id, client_id, and client_secret.
To find the remaining values (container_name, storage_account_name, key and resource_group_name), create your own Blob Storage container in Azure and use the names as suggested below:
The top red mark is your storage_account_name
In the middle you have your container_name
The last one is your key (file name)
Starting Terraform locally
To keep track of your Infrastructure with Terraform, you will have to let Terraform store your tfstate file in a safe place. The command below will start Terraform and store your tfstate in Azure Blob Storage. So navigate to folder tf_infrastructure and use the following command to start your Terraform repo:
terraform init \
-backend-config "container_name=<your folder inside Azure Blob Storage>" \
-backend-config "storage_account_name=<your Azure Storage Name>" \
-backend-config "key=<file name to be stored>" \
-backend-config "subscription_id=<subscription ID of your account>" \
-backend-config "client_id=<your username>" \
-backend-config "client_secret=<your password>" \
-backend-config "tenant_id=<tenant id>" \
-backend-config "resource_group_name=<resource group name to find your Blob Storage>"
Should everything go well, you should see a screen similar to the one below, and we are ready to plan our infrastructure deployment!
Planning your deploy – Terraform plan
The next step is to plan your deployment. Use the following command so Terraform can prepare to deploy your resources:
terraform plan \
-var 'client_id=<client_id>' \
-var 'client_secret=<secret_id>' \
-var 'subscription_id=<subscription_id>' \
-var 'tenant_id=<tenant_id>' \
-var 'timestamp=<timestamp>' \
-var 'acr_reader_user_client_id=<User client ID to read ACR>' \
-var 'acr_reader_user_secret_key=<User secret to read ACR>' \
-var-file="<your additional vars file name. Suggestion: rootVars-dev.tfvars>" \
-out tfout.log
Some of the information above is the same as we used in terraform init, so go ahead and copy it. The rest is:
TIMESTAMP – the timestamp of when you are running this terraform plan. It is intended to help with a blue/green deployment strategy. The timestamp is a simple string appended to the end of your resource group name, which will have the following format: “fixedRadical-environment-timestamp”. You can check how it is built in tf_infrastructure/modules/common/variables.tf.
ACR_READER_USER_CLIENT_ID – the client_id used by your Kubernetes cluster to read the ACR (Azure Container Registry) and retrieve your Docker images for deployment. You should use a new one with fewer privileges than the main client_id we’re using.
ACR_READER_USER_SECRET_KEY – the client secret (password) of the client_id above.
-VAR-FILE – Terraform allows us to put variables in a file instead of on the command line like we’ve been doing. Do not store sensitive information inside this file. You have an example in the tf_infrastructure/rootVars-dev.tfvars file.
TFOUT.LOG – the name of the file in which Terraform will store the plan it will use to reach your desired configuration.
Should everything go well, you’ll get a screen close to the one below, and we’ll be ready to finally create your AKS!
Take a look at the “node_labels” tag on the AKS cluster and also on the additional node pool. We will use it in the Kubernetes config file below to tell Kubernetes in which node pool to deploy our Pods.
Deploying the infrastructure – Terraform apply
All the hard work is done. Just run the command below, wait about 10 minutes, and your AKS will be running:
terraform apply tfout.log
Once the deployment is done you should see a screen like this:
Configuring kubectl to work connected to AKS
The Azure CLI does the heavy lifting on this part. Run the command below to make your kubectl command-line tool point to the newly deployed AKS:
az aks get-credentials --name $(terraform output aks_name) --resource-group $(terraform output resource_group_name)
If you don’t have the Azure CLI configured yet, follow the instructions here.
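A quick sanity check that the credentials were merged into your kubeconfig:

# both commands should now point at the new AKS cluster
kubectl config current-context
kubectl get nodes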
Applying our configuration to Kubernetes
Now navigate back in your terminal to the folder kubernetes_deployment. Let’s apply the commands and then walk through the files to understand what’s going on:
PROFILE=dev – sets an environment variable in your terminal that is read by kubectl and applied to the Docker containers. I used a Spring application, so you can see it being used in k8s_deployment-dev.yaml here:
Kubernetes will grab our PROFILE=dev environment variable and pass it on to Spring Boot.
The path from which Kubernetes will pull our images, using the ACR credentials.
The liveness probe teaches Kubernetes how to tell whether the container is running properly.
nodeSelector tells Kubernetes in which node pool (using the node_labels we highlighted above) the Pods should run (see the sketch below).
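A hypothetical fragment illustrating the items above; the image name, pool label, secret name, port, and probe path are placeholders rather than the repo’s actual values:

# a hypothetical piece of k8s_deployment-dev.yaml, printed only for illustration
cat <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: company
spec:
  replicas: 2
  selector:
    matchLabels:
      app: company
  template:
    metadata:
      labels:
        app: company
    spec:
      nodeSelector:
        agentpool: apppool          # must match the node_labels set on the Terraform node pool
      imagePullSecrets:
        - name: acr-credentials     # secret created from the ACR reader client id/secret
      containers:
        - name: company
          image: myregistry.azurecr.io/company:latest
          env:
            - name: SPRING_PROFILES_ACTIVE
              value: "dev"          # the PROFILE value handed to Spring Boot
          livenessProbe:            # how Kubernetes decides the container is healthy
            httpGet:
              path: /actuator/health
              port: 8080
            initialDelaySeconds: 60
EOF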
Configure K8S
kubectl apply -f k8s_deployment-dev.yaml
Kubernetes allows us to store all our configuration in a single file, and k8s_deployment-dev.yaml is that file. You will see two deployments (Pod specifications), company and customer, and one service exposing each of them: company-service and customer-service.
The services (example below) use the ClusterIP type. It tells Kubernetes to create an internal load balancer to spread requests across your Pods. The port tells which port receives requests, and the targetPort tells which port on the Pods will handle them. More info here.
Services example
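A hypothetical sketch of one of those services (ports and names are placeholders):

cat <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: company-service
spec:
  type: ClusterIP        # internal-only virtual IP balancing traffic across the Pods
  selector:
    app: company
  ports:
    - port: 80           # port on which the service receives requests
      targetPort: 8080   # port on the Pods that actually handles them
EOF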
The Ingress strategy is the most important part:
nginx is the class for your ingress. It uses the NGINX implementation to load balance requests internally.
/$1$2$3 is what Kubernetes should forward to our Pods as the request URL. $1 means (api/company), highlighted in item 5; $2 means (/|$); and $3 means (.*).
/$1/swagger-ui.html – this is the default app root for our Pods.
Redirect from www – true – self-explanatory.
Path is the URL structure whose captured parts are passed as the variables used in item 2.
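A hypothetical ingress fragment tying those pieces together; the host, service name, and exact annotations may differ from the repo (the rewrite pattern follows the ingress-nginx docs):

cat <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: apps-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$1$2$3          # URL forwarded to the Pods
    nginx.ingress.kubernetes.io/app-root: /$1/swagger-ui.html    # default app root
    nginx.ingress.kubernetes.io/from-to-www-redirect: "true"     # redirect from www
spec:
  ingressClassName: nginx
  rules:
    - host: mycompany.com
      http:
        paths:
          - path: /(api/company)(/|$)(.*)                        # captured as $1, $2 and $3
            pathType: ImplementationSpecific
            backend:
              service:
                name: company-service
                port:
                  number: 80
EOF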
To add TLS to our Kubernetes you have to generate your certificate and paste the key and crt into the highlighted areas below in base64 format. An example on Linux is shown in the first image below (and in the commands right after this paragraph). When adding the info to the file, remember to paste it as a single row, without spaces or line breaks. The second image shows where to put the crt and key, respectively.
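One way to produce those single-line base64 values on Linux (file names are placeholders):

# -w 0 disables line wrapping, so the output comes out as a single row with no line breaks
base64 -w 0 mycompany.com.crt
base64 -w 0 mycompany.com.key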
The Cloud Native Computing Foundation is a good source for keeping up with new moves in the cloud industry
The content
Cloud Native Computing Foundation: for an application to be considered truly Cloud Native, it needs to be:
Built for fault tolerance
Horizontally scalable
Written in a manner that takes full advantage of what cloud providers have to offer.
Cloud Native Applications prioritize the following:
Speed
Short cycles
Microservices
Loosely coupled
DevOps
Pets vs. cattle way of handling our servers:
As a developer, you care about the application being hand-cared for: when it is sick, you take care of it, and if it dies, it is not easy to replace. It’s like when you name a pet and take care of it; if one day it goes missing, everyone will notice. In the case of cattle, however, you expect that there will always be sick and dead cows as part of daily business; in response, you build redundancies and fault tolerance into the system so that ‘sick cows’ do not affect your business. Basically, each server is identical, and if you need more, you create more, so that if any particular one becomes unavailable, no one will notice.
Cloud native action spectrum:
Cloud native roadmap of adoption (the majority of companies are on step 4):
The Cloud Native Computing Foundation maintains a landscape map listing tons of vendors for each specific need: http://landscape.cncf.io
Exercises and Assignments
Assignment: Create a presentation showing the push you are planning for your company. Think about steps, risks, mitigations, and how you plan to lead the journey. Think about the presentation as if you were presenting it to your CEO or a client.
Several cases of Agile adoption in big and mid-size companies were presented, along with the key benefits, challenges, and outputs of an Agile adoption
The content
Today, over 50% of the Fortune 500 companies from the year 2000 no longer exist. GE is stumbling. BlackBerry (RIM) is gone, and so is most of Nokia, which had risen to a $150 billion corporation. (…) John Boyd developed a methodology for operating in such situations, called the OODA Loop. The speed of executing the loop is the essential element of survival. It involves testing one’s premises by actual Observation, Orienting your corporation with respect to the situation, then Deciding on a course of action, and then executing that plan by Acting. This is the meaning of being Agile. (…) Data is the new gold.
MIT – Cloud & DevOps course – 2020
Agile Adoption
Pros of agile software development:
Customers have frequent and early opportunities to see the work being delivered and to make decisions and changes throughout the development of the project.
The customer gains a strong sense of ownership by working extensively and directly with the project team throughout the project.
If time to market is a greater concern than releasing a full feature set at initial launch, Agile is best. It will quickly produce a basic version of working software that can be built upon in successive iterations.
Development is often more user-focused, likely a result of more frequent direction from the customer
Cons of Agile Software Development:
Agile demands a high degree of customer involvement in the project, which may be a problem for customers who simply do not have the time or interest for this type of participation.
Agile works best when the development team is completely dedicated to the project.
The close working relationships in an Agile project are easiest to manage when the team members are located in the same physical space, which is not always possible.
The iterative nature of Agile development may lead to frequent refactoring if the full system scope is not considered in the initial architecture and design. Without this refactoring, the system can suffer from a reduction in overall quality. This becomes more pronounced in larger-scale implementations, or with systems that include a high level of integration.
Managing Complexity of Organizations and operations
As companies grow, their complexity grows, and they have to manage that complexity or it will turn into chaos. The problem is that they usually manage it by putting processes in place: you have to sign X documents, follow Y procedures, and so on. This curtails employee freedom, and the side effect is that high-performing employees tend to leave the company.
Netflix’s solution to this scenario was different. They decided to let smart workers manage the complexity instead of putting processes in place.
The problem with the traditional approach is that when the market shifts, we are unable to move fast. We have so many processes and such a fixed culture that our teams won’t adapt, and innovative people won’t stay in these environments.
That leaves us with three bad options for managing our growing organizations:
Stay a creative, small company (less impact)
Avoid rules (and suffer the chaos)
Use process (and cripple flexibility and the ability to thrive when the market changes)
Back to the Netflix case: they believed that high-performing people can contain the chaos. With the right people, instead of a culture of process adherence, you get a culture of creativity and self-discipline, freedom, and responsibility.
Comparing the waterfall and agile software development models
Assignment: Write a summary of two articles, suggested by MIT, that highlight the complexity some companies faced in becoming Agile and how they are thriving.
Recently I had the mission of deciding which cloud provider my company would adopt. After taking a deep look at the capabilities, benefits, and other key points of the three main clouds in the market (Azure, Amazon, and Google), I was very pleased with the result and the effort put in, and decided to share it so it could help more people. This is the decision matrix developed to make the decision in my company. I strongly encourage you to take inspiration from it and make the changes your own scenario needs. I hope it helps you make your own decision as well. There are four main areas in this article: 1) Define the goals of the journey, 2) The spreadsheet, 3) The adopted criteria, and 4) The final decision.
1 – Define the goals for the cloud journey
When we start a project as important as this one, the goals involved must be very clear to all senior management. These are the goals I defined together with my peers and validated with all of the company’s senior management.
Overall company speed – essential for keeping competitive time to market.
Team autonomy – another important move to keep time to market as fast as possible and to foster DevOps adoption.
Cost savings – use the cloud benefit of the pay-as-you-go.
Enable software scalability – some of the products still suffer from on-prem challenges to scale.
Security – improve security while handing over a few of the key concerns to the cloud provider.
2 – The spreadsheet
In this spreadsheet, you can find the summarized version of all the criteria presented in the next section. The values in it are the actual result of my analysis. Take a look for yourself and change the values according to your scenario.
3 – Define the criteria list
The following items are the ones that matter for this scenario’s migration. A total of seventeen criteria were analyzed to reach a better overall understanding.
Five is the highest possible score and one is the lowest; any number in between is a valid score.
Criteria – Weight
Cost – 5
Feature count – 1
Oracle migration ease – 2
Available SDKs – 1
DDoS protection – 1
Overall security – 5
Machine Learning and Data Science features – 1
Community support – 3
Professionals availability – 3
Professionals cost – 5
Companies that already are in each cloud (benchmark) – 1
Internal team knowledge – 5
Auditing capabilities – 5
Cloud transition supporting products – 5
Dedicated links with specific protocol availability* – 5
GDPR and LGPD compliance* – 3
Cloud support* – 3
* These items were not included in my initial analysis and were suggested by a couple of friends using the model. I’m bringing them here as a few more suggestions for you.
2.1. Cost
The values were converted from US dollars to Brazilian reais at an exchange rate of BRL 5.25 to USD 1.00. RI = Reserved Instance. OD = On-demand instance.
Why this criterion is important: since the move to the cloud has already been decided, the goal of this criterion is to evaluate which cloud is the cheapest for this specific scenario.
Cloud – Score given – Score comments
AWS – 5 – AWS has higher values in smaller machines and lower values in bigger machines
Azure – 5 – Azure has higher values in bigger machines and lower values for smaller machines
GCP – 3 – There are some lacking machine types
2.2. Feature count
Why this criterion is important: it reflects the innovation appetite of each cloud provider.
AWS gets the highest score, according to specialists, due to the granularity it allows.
2.6. Overall security
This criterion was broken down into sub-criteria, each scored per cloud:
Overall security – Azure – 1
Overall security – GCP – 1
Ease to configure security – AWS – 0.5
Ease to configure security – Azure – 0.75
Ease to configure security – GCP – 1.25 (Google gets a higher score due to its ease of configuration and abstraction capacity)
Security investment – AWS – 1.25 (AWS is the one that invests the most in security)
Security investment – Azure – 1
Security investment – GCP – 1
Security community support – AWS – 1.25 (AWS has a bigger community)
Security community support – Azure – 1
Security community support – GCP – 0.75
2.7. Machine Learning and Data Science features
Why this criterion is important: looking to the future, it is important to think about new services to be consumed. This criterion received a low weight because it is not critical at this stage of the cloud adoption.
2.9. Professionals availability
Why this criterion is important: the ability to hire qualified professionals for the specific cloud vendor is crucial for the application lifecycle. This research was performed on LinkedIn with the query “certified cloud architect <vendor>”.
Cloud – Source – Given score – Score comments
AWS – LinkedIn – 3 – 183k people found
Azure – LinkedIn – 2 – 90k people found
GCP – LinkedIn – 1 – 22k people found
2.10. Professionals cost
Why this criterion is important: as important as professionals availability, the cost involved in hiring each of these professionals is also something to keep in mind.
No difference in cost was found between the professionals of each cloud vendor (score given to GCP: 5).
2.11. Companies already present in each cloud
Why this criterion is important: looking at other companies helps us understand where the biggest and most innovative players are heading, and if they are heading there, there must be a good reason for it.
2.14. Cloud transition supporting products
Why this criterion is important: since this is intended to be a company-wide adoption, some areas will have more or less maturity to migrate to the new paradigm of cloud-native software development. The more the cloud provider can assist with simpler migration strategies, such as an “AS IS” move, the better for this criterion.
Because the company has a large number of Windows-based services, Microsoft’s native tooling gives Azure an advantage here.
GCP – 3 – No resources to keep cloud and on-premises workloads working together were found.
4 – Conclusion
Below is the final result of this comparison. I hope it helps with your own cloud adoption decisions, but please do not stick only to the criteria presented here; always look at what is important for your company and your business’s evolution.
This adoption must also come hand in hand with an internal plan to improve people’s knowledge of the selected cloud. The cloud brings several benefits compared to on-premises services, but like everything in life there are trade-offs, and new challenges will appear.
Introduced the serverless paradigm: pros and cons, limits, and the evolution that led to it
The content
Serverless computing is a cloud-based solution for a new computing execution model in which the manager of the server architecture and the application developers are distinctly separated. The connection is frictionless: the application does not need to know what it is being run or provisioned on, just as the architecture does not need to know what is being run on it.
The journey that led us to serverless (image below).
A true microservice:
Does not share data structure and database schema
Does not share internal representation of objects
You must be able to update it without notifying the team
Serverless implications:
Your functions become stateless: you have to assume your function will always run in a new, freshly deployed container.
Cold starts: since your function may run in a brand-new container, you have to expect some latency while the container is spun up. After the first execution the container is kept around for a while, and subsequent calls become “warm starts”.
Serverless pros:
Cloud provider takes care of most back-end services
Autoscaling of services
Pay as you go and for what you use
Many aspects of security provided by cloud provider
Patching and library updates
Software services, such as user identity, chatbots, storage, messaging, etc
Shorter lead times
Serverless cons:
Managing state is difficult (leads to difficult debug)
Complex message routing and event propagation (harder to track bugs)
Introduced the usual high-level phases of a Digital Transformation (and cases exploring them), which are:
1 – Initial Cloud Project
2 – Foundation
3 – Massive Migration
4 – Reinvention
The content
Cloud computing services are usually divided into three categories, plus a catch-all term:
IaaS – using the computational power of cloud computing data centers to run your previous on-prem workloads.
PaaS – using pre-built components to speed up your software development. Examples: Lambda, EKS, AKS, S3, etc.
SaaS – third-party applications allowing you to solve business problems. Examples: Salesforce, Gmail, etc.
XaaS – Anything as a service.
An abstraction of Overall Phases of adoption:
1 – Initial Cloud Project – Decide and execute the first project
2 – Foundation – Building blocks: find the next steps to solve the pains of the organization. Provide an environment that makes going to the cloud more attractive to the business units. Examples: increase security, increase observability, reduce costs.
1st good practice: During this phase, you can create a “Cloud Center of Excellence” committee to start creating tools to make the cloud shift more appealing to the rest of the organization.
2nd good practice: Build reference architectures to guide people with less knowledge.
3rd good practice: Teach best practices to other engaging business units.
3 – Migration – Move massively to the cloud
One possible strategy is to move As Is and then modernize the application in the future (the step below).
4 – Reinvention – modernize the apps (here you start converting private software to open source, Machine Learning, Data Science, etc).
See the picture below for an illustration of these 4 steps:
Phases of Digital Transformation, and time and value comparison
The pace of adoption is always calm, even for aggressive companies: Netflix, for example, took 7 years to become a cloud-first company.
Microsoft’s principles for “shifting left” the number and coverage of tests:
Tests should be written at the lowest level possible.
Write once, run anywhere including the production system.
The product is designed for testability.
Test code is product code, only reliable tests survive.
Testing infrastructure is a shared Service.
Test ownership follows product ownership.
See below two pictures of (1) how Microsoft evolved their testing process model and (2) the results they achieved.
1 – How Microsoft evolved its testing model
2 – Microsoft results
Layers of test (based on Microsoft example):
L0 – Broad class of rapid in-memory unit tests. An L0 test is a unit test to most people — that is, a test that depends on code in the assembly under test and nothing else.
L1 – An L1 test might require the assembly plus SQL or the file system.
L2 – Functional tests run against ‘testable’ service deployment. It is a functional test category that requires a service deployment but may have key service dependencies stubbed out in some way.
L3 – This is a restricted class of integration tests that run against production. They require full product deployment.
“The best way to avoid failure is to fail constantly”
“The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most in the event of an unexpected outage”
The DevOps revolution: importance of continuous feedback, data-driven decisions, pillars of DevOps and metrics
Main quote
Today, software development is no longer characterized by designers throwing their software ‘over the wall’ to testers, who then repeat the process with software operations. These roles are now disappearing: today software engineers design, develop, test and deploy their software by leveraging powerful Continuous Integration and Continuous Delivery (CI/CD) tools
Delivery lead time (measured in hours) – e.g., how much time passes between a task being registered in the management tool and it reaching production?
Deployment frequency – how many deploys to the Production environment we make weekly.
Time to restore service – how many minutes we take to put the service back to work when something breaks.
Change fail rate – how many of our deploys to the Production environment cause a failure.
Importance of information flow. Companies have to foster an environment of continuous feedback and empowerment. It allows everybody to solve problems and suggest innovation within their area of work.
Data-driven decision making
Pillars of well designed DevOps:
Security
Reliability
Performance Efficiency
Cost Optimization
Operational Excellence
A good example of a well-designed pipeline abstraction:
Version control – the step where we retrieve the most recent code from version control.
Build – building the optimized artifact that will be used to deploy.
Unit test – running the automated unit tests (created by the same developer who created the feature).
Deploy – deploying to an instance or environment where the build can receive a new round of tests.
Autotest – running the other layers of tests (stress, chaos, end-to-end, etc.).
Deploy to production – deploying to the final, real environment.
Measure & Validate – save the metrics of that deploy.
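As a reference for the assignment at the end, a minimal sketch of how the CI part of that abstraction could map onto a CircleCI config for a Node app; the image tag and npm scripts are placeholders:

# a hypothetical .circleci/config.yml, printed only for illustration
cat <<'EOF'
version: 2.1
jobs:
  build-and-test:
    docker:
      - image: cimg/node:lts       # CircleCI convenience image for Node
    steps:
      - checkout                   # version control step
      - run: npm ci                # install dependencies
      - run: npm run build         # build step
      - run: npm test              # unit tests
workflows:
  ci:
    jobs:
      - build-and-test
EOF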
There are companies that are up to 400 times faster than traditional organizations at going from an idea to deploying it in production.
Several analogies between the Toyota Production System (and the cases below) and DevOps:
Just in Time
Intelligent Automation
Continuous Improvement
Respect for People
Theory of Constraints:
You must focus on your constraint
It addresses the bottlenecks on your pipeline
Lean Engineering:
Identify the constraint
Exploit the constraint
Align and manage the systems around the constraint
Elevate the performance of the constraint
Repeat the process
DevOps is also about culture. Ron Westrum’s categories for culture evolution:
Typical CI Pipeline:
Exercises and Assignments
Assignment: Create a CircleCI automated pipeline for CI (Continuous Integration) that checks out the code, builds it, installs dependencies (Node app), and runs the tests.