Rajeev’s Substack

Cloud Native Data Infrastructure: Why Kubernetes Must Go Beyond Stateless Applications

Rajeev ranjan Sinha — Sat, 16 May 2026 11:26:20 GMT

Containers replaced virtual machines.
Microservices replaced monoliths.
Infrastructure became declarative.
Kubernetes emerged as the control plane for modern applications.

Yet despite all this progress, one critical layer often remained stuck in the past:

Data infrastructure.

Many organizations proudly describe themselves as “cloud native” while still running databases, analytics engines, and streaming systems outside the very platform orchestrating the rest of their applications.

This separation creates operational complexity, fragmented tooling, inconsistent automation, and slower innovation cycles.

The next phase of modernization is not simply about running applications on Kubernetes.

It is about bringing data into the same operational model.

Welcome to the era of cloud native data infrastructure.

The Incomplete Cloud Native Journey

Early Kubernetes adoption focused heavily on stateless workloads.

This made sense.

Stateless services are easier to orchestrate because they don’t maintain long-term data. Containers can be created, scaled, or destroyed with minimal coordination. APIs, frontend services, and business logic layers naturally fit Kubernetes’ original strengths.

Stateful workloads are different.

Figure: Stateless vs. stateful services

Databases, analytics systems, and streaming platforms must maintain consistency, durability, replication, and availability across failures. Storage persistence becomes critical. Capacity planning becomes far more complex. Performance unpredictability can create cascading failures.

For years, this led to a widely accepted belief:

“You should never run databases on Kubernetes.”

That statement once had practical merit.

But Kubernetes has evolved dramatically.

Today, the platform includes mature primitives for stateful infrastructure:

StatefulSets
Persistent Volumes
StorageClasses
CSI drivers
Operators
Advanced networking policies
Declarative scaling models
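To make a few of these primitives concrete, here is a minimal sketch of a StatefulSet manifest built as a Python dictionary and serialized to JSON. The names `pg`, `postgres:16`, and `fast-ssd` are placeholders I made up for illustration, and a real database deployment would normally go through an operator that adds replication and failover on top of this.

```python
import json

# Hypothetical names throughout: "pg", "postgres:16", and "fast-ssd" are
# placeholders for this sketch, not values taken from a real cluster.
statefulset = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "pg"},
    "spec": {
        "serviceName": "pg",  # headless Service that gives each pod a stable DNS name
        "replicas": 3,
        "selector": {"matchLabels": {"app": "pg"}},
        "template": {
            "metadata": {"labels": {"app": "pg"}},
            "spec": {
                "containers": [{
                    "name": "postgres",
                    "image": "postgres:16",
                    "volumeMounts": [
                        {"name": "data", "mountPath": "/var/lib/postgresql/data"}
                    ],
                }]
            },
        },
        # One PersistentVolumeClaim per pod, provisioned through the StorageClass
        "volumeClaimTemplates": [{
            "metadata": {"name": "data"},
            "spec": {
                "accessModes": ["ReadWriteOnce"],
                "storageClassName": "fast-ssd",
                "resources": {"requests": {"storage": "10Gi"}},
            },
        }],
    },
}

# Serialize the desired state; Kubernetes accepts manifests as JSON as well as YAML
manifest_json = json.dumps(statefulset, indent=2)
```

The part that matters for data is `volumeClaimTemplates`: each pod gets its own volume claim, and that claim survives pod restarts and rescheduling.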

The ecosystem surrounding Kubernetes has also matured alongside it.

Modern data platforms increasingly embrace Kubernetes as their operational foundation:

PostgreSQL
MySQL
MongoDB
Apache Cassandra
Apache Kafka
Apache Flink
Apache Spark

The industry is moving toward a world where data systems are no longer exceptions to cloud native architecture.

They become first-class citizens within it.

What “Cloud Native Data” Actually Means

Running a database inside a container is not enough.

Cloud native data is not a packaging exercise.

It is an operational philosophy.

A truly cloud native data platform must embody the same principles expected from modern applications:

Scalability
Elasticity
Self-healing
Observability
Declarative management
Automation
Portability

The goal is convergence.

Instead of separate operational silos for:

Compute
Networking
Storage
Security
Databases
Streaming systems
Analytics platforms

everything becomes part of a unified control plane.

This is where Kubernetes changes from being “just a container orchestrator” into something much larger:

A distributed operating model for infrastructure itself.

From Virtual Servers to Virtual Datacenters

Traditional infrastructure thinking revolved around servers.

Cloud native thinking revolves around systems.

This distinction matters.

In earlier eras, scaling meant provisioning more hardware manually. Entire teams spent months planning infrastructure growth, configuring environments, and coordinating deployments.

Now infrastructure can be defined declaratively:

Compute resources
Networking
Storage policies
Security rules
Service discovery
Observability pipelines

all expressed as code.

Instead of deploying individual servers, we are increasingly deploying virtual datacenters.

Kubernetes continuously reconciles desired state against actual state:

Failed containers restart automatically
Traffic reroutes dynamically
Storage attaches declaratively
Services scale horizontally
Infrastructure heals itself
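The reconciliation idea behind that list can be sketched in a few lines of Python. This is a toy model under my own assumptions, not the Kubernetes controller code: `Cluster` and `reconcile` are hypothetical names, and "starting" a pod just means adding its name to a set.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for actual cluster state; not a Kubernetes API."""
    running: set = field(default_factory=set)


def reconcile(desired_replicas: int, cluster: Cluster, name: str = "db") -> list:
    """Take one reconciliation step and return the actions performed."""
    desired = {f"{name}-{i}" for i in range(desired_replicas)}
    actions = []
    for pod in sorted(desired - cluster.running):  # missing pods: start them
        cluster.running.add(pod)
        actions.append(f"start {pod}")
    for pod in sorted(cluster.running - desired):  # surplus pods: stop them
        cluster.running.remove(pod)
        actions.append(f"stop {pod}")
    return actions


cluster = Cluster(running={"db-0", "db-1", "db-2"})
cluster.running.discard("db-1")   # simulate a crashed pod
first = reconcile(3, cluster)     # the loop notices the drift and repairs it
second = reconcile(3, cluster)    # already converged, so nothing to do
```

The first call returns `["start db-1"]` and the second returns an empty list. That is the whole point of the model: you declare the target, and the loop keeps closing the gap between target and reality.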

For stateless workloads, this model became mainstream years ago.

For data infrastructure, the transition is only beginning.

Why Stateful Infrastructure Was Hard

The hesitation around stateful workloads on Kubernetes did not come from nowhere.

Databases and analytics systems have historically required:

Stable storage
Predictable networking
High I/O throughput
Strict consistency guarantees
Controlled failover behavior
Resource isolation

Traditional infrastructure teams optimized for predictability by dedicating hardware to databases.

The fear was understandable:
What happens if noisy neighbors impact performance?
What if storage becomes ephemeral?
What if orchestration introduces instability?

Early Kubernetes versions lacked mature answers to many of these concerns.

But today’s environment is very different.

Storage orchestration has improved dramatically.

Operators now automate many database lifecycle tasks:

Failover
Backup management
Replication
Scaling
Recovery workflows

Infrastructure teams can increasingly define sophisticated storage and placement policies declaratively.

The result is something powerful:
The operational benefits of Kubernetes can finally extend to data systems.

The Four Pillars of Cloud Native Data

1. Scalability

Cloud native systems must scale without downtime.

Legacy systems often required maintenance windows for upgrades or expansion.

Modern infrastructure should allow:

Dynamic horizontal scaling
Intelligent data redistribution
Elastic resource growth
Near-continuous availability

Applications no longer tolerate downtime as a normal operational expectation.

Neither should data platforms.

Scalability: If a service can produce a unit of work for a unit of resources, adding more resources should increase the amount of work a service can perform. Scalability describes the service’s ability to apply additional resources to produce additional work. Ideally, services should scale infinitely given an infinite amount of compute, network, and storage resources. For data, this means scale without the need for downtime. Legacy systems required a maintenance period while adding new resources, during which all services had to be shut down. With the needs of cloud native applications, downtime is no longer acceptable.

2. Elasticity

Scalability is adding resources.

Elasticity is removing them when demand drops.

This distinction matters enormously in modern environments where infrastructure cost directly impacts business efficiency.

Cloud native data systems should intelligently:

Reclaim unused storage
Tier cold data automatically
Scale compute dynamically
Optimize resource utilization continuously

Infrastructure should expand and contract based on real demand.

Elasticity: Whereas scale is adding resources to meet demand, elasticity is the ability to free those resources when they are no longer needed. The difference between scalability and elasticity is highlighted in the figure below. Elasticity can also be called on-demand infrastructure. In a constrained environment such as a private datacenter, this is critical for sharing limited resources. For cloud infrastructure that charges for every resource used, this is a way to avoid paying for running services you don’t need. When it comes to managing data, this means that we need capabilities to reclaim storage space and optimize our usage, for example by moving older data to less expensive storage tiers.
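On the compute side, the scale-up and scale-down decision can be sketched with the ratio rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The function below is a simplified illustration with hypothetical bounds; the real autoscaler also applies a tolerance band and stabilization windows that are left out here.

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Ratio-based replica count, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Load above target: add replicas (scalability)
scale_up = desired_replicas(4, current_metric=90.0, target_metric=60.0)

# Load far below target: release replicas (elasticity)
scale_down = desired_replicas(4, current_metric=15.0, target_metric=60.0)
```

With four replicas and a target of 60, a reading of 90 asks for six replicas and a reading of 15 asks for one. The same formula covers both directions, which is exactly what separates an elastic system from one that only ever grows.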

3. Self-Healing

Failures are inevitable.

Disks fail. Nodes disappear. Networks partition. Services crash.

Modern systems must assume failure as a normal operating condition.

Self-healing infrastructure responds automatically:

Rebuilding replicas
Rerouting traffic
Recovering workloads
Maintaining service availability

For data systems, this extends beyond infrastructure:

Detecting data corruption
Monitoring data quality
Recovering consistency automatically

Self-healing: Bad things happen. When they do, how will your infrastructure respond? Self-healing infrastructure will reroute traffic, reallocate resources, and maintain service levels. With larger and more complex distributed applications being deployed, this is an increasingly important attribute of a cloud native application. This is what keeps you from getting that 3 A.M. wake-up call. For data, this means we need capabilities to detect issues such as missing data and poor data quality.
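Detecting missing data is the first step toward healing it. As a small sketch, assume records carry a sequence number (an offset) that should be contiguous; `find_gaps` is a hypothetical helper that reports the ranges that never arrived.

```python
def find_gaps(offsets):
    """Return (start, end) ranges missing from a sequence expected to be contiguous."""
    gaps = []
    ordered = sorted(set(offsets))  # tolerate duplicates and out-of-order arrival
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


gaps = find_gaps([100, 101, 102, 105, 106, 110])
```

Here `gaps` comes back as `[(103, 104), (107, 109)]`. A self-healing pipeline could use that result to trigger a replay or a re-sync from a replica instead of paging a human.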

4. Observability

Distributed systems create distributed failure modes.

Without visibility, debugging becomes guesswork.

Cloud native observability combines:

Logs
Metrics
Traces

to provide system-wide insight into behavior.

For data infrastructure, observability becomes even more critical:

Query latency
Replication lag
Throughput bottlenecks
Storage saturation
Pipeline delays
Distributed tracing

The larger the system becomes, the more important observability becomes.

Observability: If something fails and you aren’t monitoring it, did it happen? Unfortunately, not only is the answer yes, but that can be an even worse scenario. Distributed applications are highly dynamic, and visibility into every service is critical for maintaining service levels. Interdependencies can create complex failure scenarios, which is why observability is a key part of building cloud native applications. In data systems, the volumes that are commonplace need efficient ways of monitoring the flow and state of infrastructure. In most cases, early warnings for issues can help operators avoid costly downtime.
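An early warning can be as simple as a threshold on a rolling average. The sketch below classifies replication lag samples (in seconds); the thresholds and window size are assumptions picked for illustration, and in practice this logic would usually live in a monitoring system's alert rules.

```python
from statistics import mean


def lag_alert(lag_samples_s, warn_s=5.0, crit_s=30.0, window=3):
    """Classify replication lag from the average of the most recent samples."""
    recent = lag_samples_s[-window:]
    avg = mean(recent)
    if avg >= crit_s:
        return "critical"
    if avg >= warn_s:
        return "warning"
    return "ok"


healthy = lag_alert([0.4, 0.6, 0.5])
drifting = lag_alert([0.5, 4.0, 6.0, 9.0])    # last three samples average above 5 s
falling_behind = lag_alert([10, 40, 50, 60])  # last three samples average 50 s
```

Averaging over a short window keeps a single slow sample from raising an alert, while the "warning" level gives operators time to act before the replica falls far enough behind to threaten failover.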

With all the previous definitions in place, let’s try a definition that expresses these properties:

Cloud native data approaches empower organizations that have adopted the cloud native application methodology to incorporate data holistically rather than employ the legacy of people, process, technology, so that data can scale up and down elastically, and promote observability and self-healing. This is exemplified by containerized data, declarative data, data APIs, data meshes, and cloud native data infrastructure (that is, databases, streaming, and analytics technologies that are themselves architected as cloud native applications).

For data infrastructure to keep parity with the rest of our application, we need to incorporate each piece. This includes automation of scale, elasticity, and self-healing. APIs are needed to decouple services and increase developer velocity, as well as enable you to observe the entire stack of your application to make critical decisions. Taken as a whole, your application and data infrastructure should appear as one unit.

More Infrastructure, More Problems

Whether your infrastructure is in a cloud, on premises, or both (commonly referred to as hybrid), you could spend a lot of time doing manual configuration. Typing things into an editor and doing incredibly detailed configuration work requires deep knowledge of each technology. Over the past 20 years, significant advances have occurred in the DevOps community, both in how we write code and in how we deploy our infrastructure. This is a critical step in the evolution of modern infrastructure. DevOps has kept us ahead of the scale required for applications, but just barely. Arguably, the same amount of knowledge is still needed to fully script a single database server deployment. It’s just that now we can do it a million times over (if needed) with templates and scripts. What has been lacking is a connectedness between the components and a holistic view of the entire application stack. Let’s tackle this problem together. (Foreshadowing: this is a problem that needs to be solved.)

As with any good engineering problem, let’s break it into manageable parts. The first is resource management. Regardless of the many ways we have developed to work at scale, fundamentally, we are trying to manage three things as efficiently as possible: compute, network, and storage, as shown in the figure below. These are the critical resources that every application needs and the fuel that’s burned during growth. Not surprisingly, these are also the resources that drive the monetary cost of a running application. We get rewarded when we use the resources wisely and literally pay a high price if we don’t. Anywhere you run your application, these are the most primitive units. When on prem, everything is bought and owned. When using the cloud, we’re renting.

Figure: Fundamental resources of cloud applications: compute, network, and storage

The second part of the problem is having an entire stack act as a single entity. DevOps has provided many tools to manage individual components, but the connective tissue between them provides the potential for incredible efficiency—similarly to how applications are packaged for the desktop but working at datacenter scales. That potential has launched an entire community around cloud native applications. These applications are similar to what we’ve always deployed. The difference is that modern cloud applications aren’t a single process with business logic. They are a complex coordination of many containerized processes that need to communicate securely and reliably. Storage has to match the current needs of the application, but remain aware of how it contributes to the stability of the application. When we think of deploying stateless applications without data managed in the same control plane, it sounds incomplete because it is. Breaking your application components into different control planes creates more complexity and thus goes against the ideals of cloud native.

Kubernetes as the Universal Control Plane

The real power of Kubernetes is not containers.

It is consistency.

Kubernetes standardizes infrastructure operations across environments:

On-premises
Public cloud
Hybrid cloud
Edge deployments

This consistency creates enormous operational leverage.

Teams can deploy applications using the same patterns everywhere:

APIs
Declarative configuration
Automated reconciliation
Infrastructure abstraction

Data infrastructure benefits immensely from this model.

Imagine databases designed natively for Kubernetes:

Storage tiers managed declaratively
Automatic scaling policies
Intelligent workload placement
Self-healing replication
Built-in observability
Elastic compute utilization

That future is becoming increasingly realistic.

The Rise of the SRE Mindset

Cloud native infrastructure also changes engineering culture.

Traditional infrastructure roles focused heavily on managing individual systems.

Modern platform engineering requires understanding how entire ecosystems behave together.

This is where Site Reliability Engineering (SRE) becomes essential.

The focus shifts from:
“What are we deploying?”

to:
“How does the entire system behave under real-world conditions?”

That includes:

CI/CD pipelines
Distributed systems design
Automation
Reliability engineering
Capacity planning
Observability
Incident response

Infrastructure engineers, DBAs, and platform teams are increasingly converging into reliability-focused engineering disciplines.

The future belongs to engineers who can think holistically.

Distributed Systems Change Everything

Kubernetes forces engineers to confront the realities of distributed computing.

The lessons behind the classic fallacies of distributed computing still apply:

The network is not always reliable
Latency is never zero
Bandwidth is finite
Topology constantly changes
Security is never automatic
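In code, accepting that the network is unreliable usually means retrying with backoff instead of assuming a call succeeds. Here is a minimal sketch; `call_with_retry` and `flaky_query` are hypothetical names, and the failure is simulated rather than coming from a real database client.

```python
import random
import time


def call_with_retry(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run operation, retrying on ConnectionError with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Wait a random time up to base_delay * 2^attempt before retrying
            sleep(random.uniform(0, base_delay * 2 ** attempt))


calls = {"n": 0}


def flaky_query():
    """Simulated query that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network blip")
    return "row-42"


# Pass a no-op sleep so the example runs instantly
result = call_with_retry(flaky_query, sleep=lambda seconds: None)
```

The jitter matters at scale: if every client retries on the same schedule, the retries themselves arrive as a synchronized spike against a service that is already struggling.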

Distributed systems introduce complexity that monolithic infrastructure rarely exposed.

But they also unlock unprecedented scalability and flexibility.

The tradeoff is worth it.

Modern applications demand:

Global scalability
High availability
Elastic infrastructure
Rapid deployment cycles
Resilient architectures

These requirements are impossible to achieve consistently without embracing distributed systems thinking.

The Future of Cloud Native Data

We are still early in this transition.

But the direction is becoming increasingly clear.

The future will likely include:

Databases built specifically for Kubernetes
Intelligent autoscaling based on workload patterns
Native observability baked into infrastructure
Declarative data lifecycle management
Seamless hybrid-cloud portability
Automated storage optimization
Self-healing distributed architectures

Most importantly, data infrastructure will stop being treated as a separate operational domain.

It will become fully integrated into the cloud native ecosystem.

Final Thoughts

Cloud native transformation was never only about containers.

It was about rethinking how infrastructure itself should operate.

The organizations that succeed over the next decade will not merely modernize applications.

They will modernize the entire stack:

Compute
Networking
Storage
Security
Observability
Data infrastructure

Kubernetes is becoming the foundation for that convergence.

The question is no longer whether data belongs in the cloud native model.

The question is how quickly organizations can adapt their systems, processes, and engineering culture to fully embrace it.

Because the future of infrastructure is not partially cloud native.

It is entirely cloud native.

source: Cloud Native Data Infrastructure

Understanding Autoencoders: The Foundation of Generative AI

Rajeev ranjan Sinha — Sat, 24 Jan 2026 13:05:46 GMT

If you’ve ever wondered how AI generates fake news articles, creates deepfakes, or restores old photographs, you’re about to discover the secret: autoencoders.

These elegant neural networks are the unsung heroes of generative AI, quietly powering everything from medical imaging analysis to the latest AI art generators. Today, we’re going to demystify how they work.

What Are Autoencoders?

Think of an autoencoder as a neural network that learns to compress and decompress data. It’s like teaching a computer to take detailed notes about an image, then reconstruct that image from just those notes.

The beauty? The computer teaches itself what’s important to remember.

Autoencoders consist of two main parts:

Encoder: Compresses the input data into a compact representation (the “latent space”)
Decoder: Reconstructs the original data from this compressed form

This simple architecture unlocks powerful capabilities:

Generating realistic fake images
Translating poses between people
Animating static photographs
Restoring damaged images and videos
Analyzing medical scans
Creating AI-generated art

A Hands-On Example: MNIST Digit Generation

Let’s walk through a practical example using the classic MNIST dataset of handwritten digits. This 70,000-image collection of numbers 0-9 has become the “Hello World” of computer vision.

Setting Up the Data

First, we load and prepare our data. Each image is 28×28 pixels, where each pixel has a value from 0 to 255 representing its brightness (0 is the black background, higher values are the pen stroke).

import numpy as np
from tensorflow.keras.datasets import mnist

# Load 60,000 training and 10,000 test images (labels are unused below)
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize pixel values to 0-1 range
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.

# Flatten images from 28x28 to vectors of 784
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))

Why flatten? We’re converting each 28×28 grid into a single row of 784 values, which is the input shape a simple fully connected network expects.

The Magic of Self-Supervised Learning

Here’s where autoencoders get interesting. Unlike typical machine learning where you need labeled data (like “this is a 7” or “this is a 3”), autoencoders use self-supervised learning.

The trick? We train the network to recreate its own input. The image becomes both the input AND the target output. This means the network learns to:

Extract the essential features of each digit
Compress them into a smaller representation
Reconstruct the original image from this compressed form
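The post does not show how the model itself is built, so here is one minimal way to define it, assuming a single dense layer on each side and a 32-value latent space. Both choices (`latent_dim = 32` and the layer layout) are my assumptions for illustration, not necessarily the architecture used for the results described here.

```python
from tensorflow.keras import layers, models

latent_dim = 32  # assumed size of the compressed representation

# Encoder: 784 pixel values -> latent_dim values
inputs = layers.Input(shape=(784,))
encoded = layers.Dense(latent_dim, activation='relu')(inputs)

# Decoder: latent_dim values -> 784 pixel values in the 0-1 range
decoded = layers.Dense(784, activation='sigmoid')(encoded)

# Full autoencoder (input -> reconstruction) and a standalone encoder
autoencoder = models.Model(inputs, decoded)
encoder = models.Model(inputs, encoded)
```

The sigmoid on the output layer keeps reconstructed pixels between 0 and 1, matching the normalized inputs. The separate `encoder` model shares its weights with `autoencoder`, so after training it can be used on its own to produce the compressed representation.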

Training the Model

Training is straightforward:

# `autoencoder` is a Keras model that maps a 784-value input to a 784-value output
autoencoder.compile(optimizer='adam', loss='mse')

autoencoder.fit(x_train, x_train,  # Notice: input = output
                epochs=50,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))

We use Mean Squared Error (MSE) as our loss function because we’re essentially comparing pixels between the original and reconstructed images. The smaller the difference, the better our autoencoder has learned.
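To see what that loss measures, here is MSE written out by hand on a tiny four-pixel "image". This is a plain-Python sketch for intuition only; Keras computes the same quantity internally, averaged over each batch.

```python
def mse(original, reconstructed):
    """Mean of squared per-pixel differences between two equal-length sequences."""
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same length")
    squared_errors = [(o - r) ** 2 for o, r in zip(original, reconstructed)]
    return sum(squared_errors) / len(original)


original = [0.0, 0.5, 1.0, 1.0]

close_match = [0.0, 0.5, 1.0, 0.5]  # only one pixel is off, by 0.5
poor_match = [1.0, 0.0, 0.0, 0.0]   # every pixel is wrong

low_loss = mse(original, close_match)
high_loss = mse(original, poor_match)
```

The close reconstruction scores 0.0625 and the poor one 0.8125. Because the differences are squared, a few badly wrong pixels cost more than many slightly wrong ones.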

Seeing the Results

After training, we can generate reconstructions:

generated = autoencoder.predict(x_test)

The network takes messy, handwritten digits and recreates them, having learned the fundamental patterns that make each number unique.

Why This Matters

Autoencoders might seem simple, but they’re the foundation for cutting-edge AI technologies:

Variational Autoencoders (VAEs) extend this concept to generate entirely new images, not just reconstruct existing ones.

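Generative Adversarial Networks (GANs) take a different training route, pitting a generator against a discriminator, but the generator plays a decoder-like role: it turns a latent vector into photorealistic faces, artwork, and more.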

Transformers build on the same encoder-decoder idea to understand and generate human language: the original Transformer has both halves, while BERT keeps only the encoder and GPT only the decoder.

Sequence-to-sequence models power machine translation, video prediction, and more.

The Bigger Picture

What makes autoencoders fascinating isn’t just what they do—it’s how they do it. By learning to compress and reconstruct data, they discover meaningful patterns humans might never identify.

That compressed middle layer (the latent space) becomes a rich representation of the input’s essential features. This is why autoencoders are so versatile: once you’ve learned a good compression, you can use it for classification, generation, denoising, and countless other tasks.

Want to Learn More?

Autoencoders are just the beginning of generative AI. If you found this interesting, future posts will explore:

Variational autoencoders and how they generate new content
GANs and the adversarial training process
Real-world applications in art, medicine, and beyond

Have you experimented with autoencoders or generative AI? I’d love to hear about your experiences in the comments below.

This article is part of a series exploring the foundations of modern AI. Subscribe to get the next post delivered to your inbox.