Authored by Prasanna Raghavendra - Sr. Director R&D, JFrog India
The use of Generative AI has proven beneficial for basic tasks, but its effectiveness in providing technical guidance is a subject of interest. After the public release of ChatGPT, a comparison was made between its answers and regular web search results. While the AI offered some correct and relevant advice on Kubernetes best practices for production, it became evident that human input remains crucial.
JFrog has been running its platform on Kubernetes for over six years, utilizing managed Kubernetes services from various cloud providers. Kubernetes is primarily employed to manage workloads and runtime tasks, with JFrog using managed databases and object storage services provided by cloud providers. JFrog’s production environment consists of thousands of nodes and hundreds of thousands of pods running globally.
This article highlights six essential aspects of using Kubernetes in production that ChatGPT won't yet provide insights on, at least until OpenAI updates its data and algorithms:
Sizing Nodes:
Node sizing requires striking a balance between smaller nodes, which help minimize the impact of potential issues, and larger nodes, which enhance application performance. The essential aspect is to employ various node types according to the specific workload demands, whether it be for CPU or memory optimization.
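As a sketch of this practice, a memory-hungry workload can be steered to a pool of memory-optimized nodes via a node label; the service name, image, and label values below are illustrative, not JFrog's actual configuration:

```yaml
# Hypothetical Deployment pinning a memory-intensive service to a
# memory-optimized node pool (label and resource values are examples).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: artifact-indexer
spec:
  replicas: 3
  selector:
    matchLabels:
      app: artifact-indexer
  template:
    metadata:
      labels:
        app: artifact-indexer
    spec:
      nodeSelector:
        workload-type: memory-optimized   # label applied to the node pool
      containers:
      - name: indexer
        image: example.com/artifact-indexer:1.0
        resources:
          requests:
            memory: "8Gi"
            cpu: "1"
          limits:
            memory: "8Gi"
```

A CPU-bound service would carry a different label (e.g. `workload-type: cpu-optimized`), letting the scheduler place each workload on hardware that matches its profile.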
Protecting the Control Plane:
Properly monitoring the Kubernetes control plane is vital, especially for managed Kubernetes services. Cloud providers offer reliable control, but it's essential to be aware of their limitations. Monitoring and alerting systems should be in place to ensure optimal performance, as a slow control plane can significantly affect cluster behavior, including scheduling, upgrades, and scaling. Overusing the managed control plane can result in a catastrophic crash, underscoring the importance of diligent monitoring and management.
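Assuming the managed control plane exposes API server metrics to Prometheus (many managed offerings do), a latency alert along these lines can surface a slow control plane before scheduling and scaling degrade; the threshold and durations are illustrative:

```yaml
# Illustrative Prometheus alert on API server request latency.
# The 1s threshold and 10m window are examples, not recommendations.
groups:
- name: control-plane
  rules:
  - alert: APIServerSlowRequests
    expr: |
      histogram_quantile(0.99,
        sum(rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) by (le))
        > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "99th percentile API server request latency is above 1s"
```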
Maintaining Application Uptime:
To optimize application uptime, prioritize critical services. Pod priorities and quality of service help identify high-priority applications that should run continuously, enhancing stability and performance. Additionally, utilize pod anti-affinity to avoid deploying multiple replicas of the same service on a single node, preventing a single point of failure. This ensures that if one node experiences issues, the other replicas remain unaffected.
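A minimal sketch of both ideas together, assuming a hypothetical `gateway` service: a high PriorityClass marks it as must-run, and required pod anti-affinity on the hostname topology key ensures no two replicas land on the same node:

```yaml
# Sketch: PriorityClass plus pod anti-affinity (names and the priority
# value are illustrative).
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 1000000
globalDefault: false
description: "For services that must run continuously"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gateway
  template:
    metadata:
      labels:
        app: gateway
    spec:
      priorityClassName: critical-service
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: gateway
            topologyKey: kubernetes.io/hostname   # at most one replica per node
      containers:
      - name: gateway
        image: example.com/gateway:1.0
```

Setting resource requests equal to limits additionally gives such pods the Guaranteed QoS class, making them the last candidates for eviction under node pressure.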
Consider implementing dedicated node pools for mission-critical applications, such as a separate pool for ingress pods and essential services like Prometheus. This practice significantly improves service stability and enhances the end-user experience.
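One way to sketch a dedicated pool is a taint on the pool's nodes that only the intended pods tolerate, combined with a matching node selector; the taint key/value and image are hypothetical:

```yaml
# Sketch: an ingress Deployment confined to a dedicated node pool.
# The pool's nodes would carry a matching taint, applied e.g. with:
#   kubectl taint nodes <node> dedicated=ingress:NoSchedule
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-controller
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ingress-controller
  template:
    metadata:
      labels:
        app: ingress-controller
    spec:
      nodeSelector:
        dedicated: ingress        # only schedule onto the dedicated pool
      tolerations:
      - key: dedicated
        operator: Equal
        value: ingress
        effect: NoSchedule        # tolerate the taint that repels other pods
      containers:
      - name: controller
        image: example.com/ingress-controller:1.0
```

The taint keeps general workloads out of the pool, while the node selector keeps the ingress pods in it; the same pattern applies to a pool for Prometheus and other essential services.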
Planning for Scaling:
Is your organization ready to handle a doubling of deployments for necessary capacity growth without any adverse effects? Managed services with cluster auto-scaling can assist in this regard, but it's crucial to be aware of cluster size limitations. In our case, a typical cluster consists of approximately 100 nodes, and if that threshold is reached, we create a new cluster instead of forcing the existing one to expand.
Consider both vertical and horizontal application scaling, aiming for the right balance to optimize resource utilization without excessive consumption. Horizontal scaling and replicating workloads are generally preferred, but be cautious as it could impact database connections and storage.
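A HorizontalPodAutoscaler sketch along these lines captures the trade-off: it scales on CPU utilization, but the replica ceiling is set deliberately low so that fan-out never exhausts database connections or storage bandwidth (all numbers are illustrative):

```yaml
# Sketch: CPU-based horizontal autoscaling with a conservative ceiling.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 3
  maxReplicas: 12          # capped to protect downstream DB connection pools
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```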
Securing the Runtime:
Use admission controllers and role-based access control (RBAC) for enhanced runtime security. Whether machine learning can add value to ad hoc, transactional choices such as these remains to be seen, as does whether AI can improve the detection of improper access to infrastructure. Today, a tight manual gating system is what works: implement auditing tools and runtime protections, and maintain a strong incident response team.
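The RBAC side of such a gating system can be sketched with a least-privilege Role, here granting read-only access to pods in a single namespace and bound to a hypothetical on-call group:

```yaml
# Sketch: least-privilege RBAC (namespace and group name are examples).
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]   # read-only; no create/delete/exec
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: production
  name: read-pods
subjects:
- kind: Group
  name: oncall-engineers            # hypothetical group from the identity provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

Broader or mutating permissions would go through admission control and audit review rather than being granted by default.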
Continuously Learning:
It is critical that organizations continuously learn from evolving systems and processes by collecting historical performance data for evaluation and action. Prioritize small, continuous improvements, as what was relevant before may no longer hold true.
Proactively monitor performance data to identify memory or CPU leaks, and performance issues in third-party tools. Active evaluation of data for trends and anomalies enhances system understanding and performance, resulting in more effective outcomes compared to reacting to real-time alerts.
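One hedged sketch of trend-based detection, assuming cAdvisor and kube-state-metrics are scraped by Prometheus: extrapolate each container's memory growth and alert when it is projected to hit its limit, catching slow leaks hours before an OOM kill. The label join and windows are illustrative and would need tuning for a real environment:

```yaml
# Illustrative alert: memory working set projected (linear trend over 1h)
# to exceed the container's memory limit within 4 hours.
groups:
- name: leak-detection
  rules:
  - alert: PossibleMemoryLeak
    expr: |
      predict_linear(container_memory_working_set_bytes{container!=""}[1h], 4*3600)
        > on (namespace, pod, container)
      kube_pod_container_resource_limits{resource="memory"}
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "Container memory trending toward its limit (possible leak)"
```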
Chaos testing is one tool that has been effective in this area. Whether the ML layer can learn from chaos test data to provide large-scale insights remains to be seen; again, this requires learning tailored to specific use cases and usage patterns, which keeps the training data focused and makes the resulting decisions more accurate.
While AI-powered solutions hold promise for simplifying operations, human judgment and common sense should always be applied. Current AI engines rely on publicly available knowledge, which may not always be up to date or accurate, so use them cautiously and stay aware of their limitations. Layered training by use case may take current capabilities to the next level and reduce dependency on human judgment in complex scenarios such as Kubernetes in production.