IT Infrastructure Audit and Optimization: Best Practices and Key Benefits
What is an IT infrastructure?
IT infrastructure is the system of hardware, software, facilities and service components that support the delivery of business systems and IT-enabled processes.
Source: Gartner Glossary
IT infrastructure is optimal when it handles all the loads required, and is cost-efficient, highly available, and scalable. According to Amazon Web Services, the 6 pillars of well-architected framework include
- Operational Excellence
- Performance Efficiency
- Cost Optimization
What type of IT infrastructure is optimal?
There are two main types of IT infrastructure:
- On-premises infrastructure, configured on local servers which can be physically accessed by its owner. Scalability is limited to hardware capabilities. Infrastructure maintenance requires having an in-house team of software engineers who are well familiar with it. Otherwise, it gets harder, more time-consuming and costly to find support services qualified to work with it.
- Cloud infrastructure, configured in a cloud environment, with no possibility for the owner to access the servers. However, there’s unlimited scaling potential, and no support issues due to automation of IT infrastructure monitoring, and large pool of professionals.
That’s why there are more and more people who decide to migrate their on-premises app infrastructure to cloud.
Worldwide end-user spending on public cloud services is forecast to grow 20.4% in 2022 to total $494.7 billion, up from $410.9 billion in 2021, according to the latest forecast from Gartner, Inc. In 2023, end-user spending is expected to reach nearly $600 billion.
However, on-premises infrastructure turns out to be the optimal solution when cloud environments are not compliant with the governance or specific restrictions of the application. For example, we can face such constraints when
- the governance requires to store all documents only in on-premises storage
- only specific people can have access to the underlying hardware
- there are specific hardware requirements which ensure that all data on this particular hardware are encrypted.
Recently AWS have introduced Hybrid Cloud services as a way to make the most of the on-premises infrastructure and Cloud benefits. So there is no 100% optimal solution, everything needs to be based on the current situation.
Why is IT infrastructure optimization important?
Lack of optimization for the IT infrastructure can be compared to an old mansion without regular maintenance: it quickly gets shabby, yet still requires paying high taxes. Without the necessary reconstruction it’s soon to become a neglected monument to what has been glorious once upon a time.
Just like that, the unoptimized IT infrastructure can lead to the worst-case scenario when you end up with
- a low-performance infrastructure which can’t scale and recover from disasters
- a high number of provisioned servers and receive a high cloud bill by the end of a month.
What is an IT infrastructure audit?
Regular infrastructure audits are a proven IT infrastructure optimization strategy. They allow identifying the things that need to be optimized, help to prioritize what should be improved first, and thus prevent app downtimes.
To perform an infrastructure audit one needs to have a basic understanding of what an optimal server load is, as well as analytics skills to predict future growth in resources usage and costs. Infrastructure audit can be done by
- the operations team
- network operation center (NOC) engineers
- DevOps engineers
- basically, anyone involved in infrastructure configuration or monitoring.
Duration of the audit procedure usually takes 1–3 hours, and highly depends on the infrastructure complexity. Doing infrastructure audits on a weekly basis allows paying attention to details, e.g.:
- After a new service launch, monitor its behavior, check the amount of resources it requires, and if there are any errors, etc.
- Check CPU, RAM and Disk performance on each server
- Check the peak values of the server loads
- Analyze the incoming weekly amount of traffic, trying to spot any anomalies, etc.
- Check application logs for errors, and if they store any unneeded data; pay attention to the quantity of logs
- Analyze daily infrastructure cost throughout the week, notice any increases and their reasons, configure daily and monthly budget alarms
- Pay attention to security
- Check the number of users who are added to AWS and their rights for access (we use the least privilege principle, which means the users have only necessary access permissions)
- Check if the users have enabled multi-factor authentication (MFA)
- Check the number of open ports on the servers. It’s also necessary to make sure that the port SSH is accessible only to a few IP addresses, etc.
DevOps engineers summarize the obtained information in a document with some advice and recommendations about what can be improved next and how to do it.
An additional monthly audit provides better understanding and a generalized picture of how the infrastructure changes over the time, and allows paying attention to its condition in general, e.g.
- Analyze average server load to understand if you need to increase the number of servers, or if the cluster has enough resources for now
- Make a generalized IT infrastructure cost estimate to come up with the suggestions about how to reduce expenses. Usually this can be achieved by reserving Amazon servers for 1–3 years ahead (the 3-year reserve price is the lowest)
- Define the average and peak monthly traffic values, etc.
Prioritization: what needs to be optimized first?
Getting the audit summary doesn’t mean you have to instantly fix every identified fault and implement all the suggestions at once. It’s necessary to evaluate them, and define which ones are of primary importance. You can introduce the rest of the changes gradually. Just like it’s urgent to fix a broken window, but not so necessary to buy new curtains immediately.
What you really need to find and fix asap is the reason that causes lags or errors, e.g.
- lack of resources for the app to perform properly
- errors which can cause an application outage, etc.
Otherwise, the support team can suggest which updates to make in the following week, and discuss them with the client. A lot depends on what the primary goal is:
- cost optimization
- scalability and performance optimization
- or both options.
What are the ways of IT infrastructure cost optimization?
Applications which use the infrastructure on a part-time basis can benefit from on-demand cloud computing. Such an approach allows paying only for the time the infrastructure is running. However, it can’t be applied to the apps running full-time.
To reduce the costs spent on dev environments for one of our customers, the Pro.Con B2B sales management app, we used an approach which included the steps below.
- Made dev environments less performant. Not to affect the UX, the lower performance margin strongly depends on the application requirements, e.g., applications with active image processing need more CPU and RAM resources.
- Disabled autoscaling.
- Configured development environments only in one Availability Zone with no replications.
How to increase app scalability?
In case of Pro.con, our clients wanted to get a better scaling option with higher availability. The best way to achieve it for the app whose infrastructure was located on premises, was using the migration to cloud services. The process itself is risk-free, however it requires some precautions.
To optimize the app for new cloud infrastructure, so it’s fault-tolerant and scalable, you need to be sure that the application
- uses an independent storage service (e.g. S3 for static files), and does not keep any needed files in the local environment during workload
- is hosted across different Availability Zones (independent data centers), so if one data center is down, the application will be automatically restored in another one. This would ensure that the app can be quickly restored from failures.
Before actual data and app migration, it’s necessary to create a Proof of Concept (PoC) solution. The development and DevOps teams need to check how the app operates during the PoC stage and make sure that it is ready to be deployed to the cloud.
What has changed for Pro.Con app after its IT infrastructure migration to AWS?
Here is a real-life comparison of the B2B app infrastructure before and after it’s been moved to the cloud by Apiko DevOps engineers. Currently it’s too early to compare the cost effectiveness, as the static data hasn’t been migrated to Amazon Simple Storage Service for 100% of the app users yet.
Thus, migration to cloud has completely resolved the issues with IT infrastructure support and scalability. Although it caused an insignificant drop in app performance (we are working to minimize and overcome it), the app obtained high availability, and fast disaster recovery capabilities.
IT infrastructure audits provide valuable insights regarding the infrastructure condition. They help to spot any issues at early stages, and may signal which actions to take to avoid any unexpected disasters. Migration to cloud allows to automate IT infrastructure monitoring which not only prevents the black swan events, but sets the ground for reaching better KPIs.