Why Enterprise Data Platforms Are Still Insecure

By Hayato Shimizu - 8th July 2025

Despite all the advances in tooling, regulation, and architectural best practices, enterprise data platforms remain shockingly undersecured. I’ve come across this again and again - massive PostgreSQL, Cassandra, Kafka, or Elasticsearch clusters in production, holding sensitive data, with barely any meaningful access controls in place.

This isn’t because people don’t care about security. It’s because real-world engineering priorities push security to the side. There’s pressure to deliver features, data teams want quick access, infra teams are swamped, and security is too often the last team to be involved.

We don’t have to speculate about what can go wrong. The last decade has been full of real-world database security incidents that were completely avoidable.

In 2017, thousands of MongoDB instances were discovered exposed to the public internet, with no authentication and no encryption. Some were wiped and held for ransom. Others were simply harvested for data. These were default installs that no one ever locked down.

In 2020, Microsoft suffered a breach affecting 250 million support records after misconfiguring an Elasticsearch database. It was accidentally exposed to the internet, and again, no access control was in place.

In 2021, a Brazilian government COVID-19 system leaked data from an open PostgreSQL server. No firewall, no password required, just a public-facing IP that accepted logins.

In 2022, Revolut suffered a breach where attackers accessed production PostgreSQL databases containing customer data. Reports suggested lateral movement from compromised access and inadequate segmentation between services and sensitive data.

In 2024, Ticketmaster suffered a major breach after attackers accessed Snowflake-hosted customer databases. According to incident analysis, credentials were either stolen or reused, and there were no additional layers of defence, no IP restrictions, no client certs and no anomaly detection. Personal and financial data of over half a billion users was reportedly exfiltrated.

These weren’t zero-day exploits. They weren’t sophisticated attacks. They were the result of assuming internal infrastructure is private, or that someone else would secure the database “later.” The rest of this post breaks down how these failures happen and how to prevent them based on what I’ve seen in the field and what actually works.

Internal Traffic Still Needs Encryption

Many teams assume that enabling encryption at rest and toggling TLS means the job is done. But “encryption” is only meaningful if you manage your keys correctly, validate certificates, rotate them regularly, and understand what you’re protecting against.

More disturbingly, what I often hear is some version of: “It’s just internal traffic, we don’t need to encrypt it.” It usually comes from a place of misplaced trust: the assumption that if something lives inside a VPC or on a private subnet, it’s inherently secure. Teams often leave internal service-to-service communication unencrypted because they’re under pressure to deliver, and TLS feels like overhead: more certificates to manage, more complexity in setup, and more things that can break.

But this is a fundamental misunderstanding of modern attack surfaces. The internal network isn’t a clean, hermetically sealed zone. It’s full of build or monitoring agents, test systems, staging workloads, contractor VPNs, legacy boxes nobody wants to touch, and operators with broad SSH access. If you’re not encrypting internal communication, what you’re really saying is: “I trust every human and every workload inside this network not to make a mistake.” That’s not a security posture, that’s a gamble.

And let’s be honest, the weakest part of any infrastructure is the human. Secrets get copied into Slack. Engineers run ps and grab tokens. Someone pastes a private key into a Jira ticket. Contractors get temporary access that somehow becomes permanent. Someone connects to prod to debug a service and leaves a tunnel open. These things don’t happen because people are malicious, they happen because people are fallible. Even the best engineers are human. People are the single least reliable security control in your stack.

Even when TLS is deployed, it's often configured just enough to look compliant, not to be secure. I regularly find the following (a quick check follows the list):

  • Self-signed certificates with no internal CA and no chain of trust.
  • TLS verification disabled entirely to “make things work.”
  • Hostname checks skipped, or hardcoded CN fields reused across services.
  • Static certificates, sometimes shared across all services, valid for 5+ years, never rotated.
  • Outdated cipher suites, or support for broken TLS versions like 1.0 and 1.1 still enabled.
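
A quick way to catch several of these is to connect the same way a correctly configured client would. Below is a minimal sketch: the host, port, and CA bundle path are placeholders, and it assumes a listener that speaks TLS directly on the socket (Kafka, Elasticsearch); for PostgreSQL, which negotiates TLS after an initial protocol message, the equivalent is connecting with sslmode=verify-full.

```python
import socket
import ssl
from datetime import datetime, timezone

# Placeholders: point these at your own endpoint and internal CA bundle.
HOST, PORT = "kafka-0.internal.example", 9093
CA_BUNDLE = "/etc/ssl/internal-ca.pem"

# Verify the chain against the internal CA and check the hostname,
# exactly as a properly configured client should.
ctx = ssl.create_default_context(cafile=CA_BUNDLE)
ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse TLS 1.0/1.1 outright

with socket.create_connection((HOST, PORT), timeout=5) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()
        not_after = datetime.fromtimestamp(
            ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
        )
        days_left = (not_after - datetime.now(timezone.utc)).days
        print(f"Negotiated {tls.version()}, certificate expires in {days_left} days")
        if days_left > 90:
            print("Warning: long-lived certificate - shorten validity and rotate")
```

If the handshake fails here, it fails for the right reasons: an untrusted chain, a hostname mismatch, or an obsolete protocol version.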

And it’s not just developers. What’s genuinely shocking is how often I’ve had to explain the basics of PKI to dedicated IS or security teams. People whose job titles suggest they’re responsible for hardening infrastructure, and yet they treat internal certs as checkbox items. No concept of trust chains, no expiration policies, and no plan for revocation. It’s not a tooling problem - it’s a knowledge gap across the board.

So when I say “trust nobody,” I mean it:

  • Don’t trust developers to remember to rotate secrets or validate certs.
  • Don’t trust DevOps engineers to always follow the right process under pressure.
  • Don’t trust contractors with long-lived SSH keys or VPN access.
  • Don’t trust CI jobs, build agents, or bastion hosts to stay within their intended scope.
  • And yes, don’t blindly trust security teams just because they have “security” in their title — verify that the controls are actually enforced.

Security that relies on humans doing the right thing, every time, will eventually fail. The answer is automation, validation, and continuous policy enforcement.

Start with a real internal certificate authority such as HashiCorp Vault, cert-manager, or ACM PCA. Issue short-lived certificates, valid for a few days to a few weeks; one-year certs are far too long. Rotate them frequently and automatically, and don’t rely on someone remembering to renew before expiry. Wire certificate issuance and renewal into your deployment pipelines, and make failures loud and immediate. Sergio Rua from our team saw a gap and built vals-operator to make this simple.
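
As an illustration of how lightweight issuance becomes once a real CA is in place, here is a sketch against Vault’s HTTP API. The PKI mount path (pki), role name (internal-service), common name, and TTL are all assumptions for the example.

```python
import os
import requests

# Assumptions: a Vault PKI engine mounted at "pki/", a role named "internal-service",
# and VAULT_ADDR / VAULT_TOKEN provided by the environment (e.g. the CI runner).
VAULT_ADDR = os.environ["VAULT_ADDR"]
VAULT_TOKEN = os.environ["VAULT_TOKEN"]

resp = requests.post(
    f"{VAULT_ADDR}/v1/pki/issue/internal-service",
    headers={"X-Vault-Token": VAULT_TOKEN},
    json={"common_name": "orders.internal.example", "ttl": "72h"},  # three-day cert
    timeout=10,
)
resp.raise_for_status()
data = resp.json()["data"]

# Hand these to the workload at deploy time; never commit them or park them in a
# long-lived config store. Renewal is simply re-running this request.
certificate = data["certificate"]
private_key = data["private_key"]
ca_chain = data.get("ca_chain", [])
print(f"Issued certificate with serial {data['serial_number']}")
```

Because the certificate only lives for days, renewal has to be automated - which is exactly the point.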

Encryption, done properly, limits the blast radius. But to get there, you need education, automation, and a zero-trust mindset. Because if your infrastructure assumes anyone or anything on the network is automatically trustworthy, you’re building on sand.

Trust nobody. Encrypt everything. Rotate often. Automate it. Enforce it. Sleep better.

Authentication: Machines Deserve First-Class Identity Too

Most enterprises have SSO figured out for human users. MFA, identity federation, SCIM provisioning - it’s all there. But machine and microservice authentication? That’s still where things fall apart.

The pattern I still see, far too often, is this: static credentials created once and shared between services. A database user like app_user, granted broad access, with its password dumped into a Helm chart or CI/CD variable. AWS access keys issued years ago, still valid. Shared API keys floating around in Terraform files and GitHub repos.

This is a huge problem not just because credentials are long-lived and exposed, but because services aren’t treated as identities. And in a modern cloud-native environment, they should be.

If you’re on AWS and running Kubernetes, you should be using IRSA (IAM Roles for Service Accounts). This lets you assign IAM roles to pods via OIDC federation. No long-lived keys, no secrets. The pod authenticates using a projected service account token and gets temporary credentials scoped to exactly what it needs.
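
What makes IRSA attractive is that there is nothing secret to distribute: EKS injects AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE into pods whose service account is annotated with a role, and the SDK exchanges the projected token for temporary credentials on its own. A small sketch to confirm which identity a pod is actually running as (the bucket name is a placeholder):

```python
import boto3

# Inside a pod with IRSA configured, botocore picks up AWS_ROLE_ARN and
# AWS_WEB_IDENTITY_TOKEN_FILE and performs AssumeRoleWithWebIdentity itself.
# No access keys, no mounted secrets.
sts = boto3.client("sts")
identity = sts.get_caller_identity()
print("Running as:", identity["Arn"])  # should be the assumed IAM role, not an IAM user

# Subsequent clients reuse the same short-lived, narrowly scoped credentials.
s3 = boto3.client("s3")
s3.list_objects_v2(Bucket="example-app-bucket", MaxKeys=1)  # placeholder bucket
```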

GCP offers something similar through Workload Identity Federation, letting you bind Kubernetes service accounts to GCP service accounts securely. Azure has Managed Identities for workloads in AKS.

If you’re not on a cloud-managed K8s service, or you’re running bare metal or hybrid, use Vault’s AppRole, JWT, or Kubernetes auth backends. Your service authenticates to Vault with a short-lived token, gets scoped credentials for what it needs, and those credentials expire quickly. No hardcoded secrets, no permanent access.
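
As a rough sketch of that flow (the auth mount, Vault role, and database role names are assumptions), a pod can trade its projected service account token for dynamic database credentials that expire on their own:

```python
import os
import requests

VAULT_ADDR = os.environ["VAULT_ADDR"]

# The projected service account token Kubernetes mounts into the pod.
with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
    jwt = f.read()

# 1. Authenticate with the Kubernetes auth method (role name is an assumption).
login = requests.post(
    f"{VAULT_ADDR}/v1/auth/kubernetes/login",
    json={"role": "orders-service", "jwt": jwt},
    timeout=10,
)
login.raise_for_status()
client_token = login.json()["auth"]["client_token"]  # short-lived Vault token

# 2. Ask the database secrets engine for credentials that Vault revokes when the
#    lease expires (mount path and role name are assumptions).
creds = requests.get(
    f"{VAULT_ADDR}/v1/database/creds/orders-readonly",
    headers={"X-Vault-Token": client_token},
    timeout=10,
)
creds.raise_for_status()
body = creds.json()
username, password = body["data"]["username"], body["data"]["password"]
print(f"Credentials valid for {body['lease_duration']} seconds")
```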

Secrets should never be passed via environment variables. Use volumes with automatic TTLs, or better, inject secrets into memory at runtime through an init or sidecar process. Store nothing on disk if you can avoid it. If you must use Kubernetes Secrets, at least use Sealed Secrets or something like Mozilla SOPS so they’re encrypted at rest.
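
In practice that means the application reads its secret from a mounted, memory-backed file at the moment of use rather than from its environment. A tiny sketch, with the path as an assumption:

```python
from pathlib import Path

# Assumption: a tmpfs-backed volume (projected by an init container, sidecar, or
# CSI secrets driver) exposes the secret at this path; nothing sits in the
# environment where `ps e`, crash dumps, or child processes could leak it.
SECRET_PATH = Path("/run/secrets/db-password")

def read_db_password() -> str:
    return SECRET_PATH.read_text().strip()
```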

And finally, for services accessing databases: give each service its own database role. Don’t share roles between services, and certainly don’t use superuser access in production. Define what a service needs - read-only, insert-only, full access to a subset of tables - and restrict it at the database level. This also helps when debugging issues, since per-role access logs narrow the scope of any investigation.
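
In PostgreSQL that can be as simple as one narrowly granted login role per service. A hedged sketch (role, database, and schema names are placeholders, and in practice the password would come from Vault rather than being set here):

```python
import psycopg2

# Placeholder connection; needs a role allowed to create roles and grant privileges.
conn = psycopg2.connect("dbname=orders user=admin host=db.internal.example")
conn.autocommit = True

with conn.cursor() as cur:
    # One login role for the reporting service, read-only and nothing else.
    cur.execute(
        "CREATE ROLE orders_reader LOGIN PASSWORD %(pwd)s",
        {"pwd": "rotate-me-or-use-vault"},
    )
    for stmt in (
        "GRANT CONNECT ON DATABASE orders TO orders_reader",
        "GRANT USAGE ON SCHEMA public TO orders_reader",
        "GRANT SELECT ON ALL TABLES IN SCHEMA public TO orders_reader",
        # Tables this admin role creates later are covered as well.
        "ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO orders_reader",
    ):
        cur.execute(stmt)
```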

Security starts with identity. If machines aren’t first-class citizens in your IAM strategy, you’re doing it wrong.

Access Control: Too Broad, Too Stale

Once something is authenticated, what can it do? In most environments I audit, the answer is: far too much.

I've seen engineers granted admin access to production databases “temporarily” and left there for over a year. I've seen analyst accounts with write access to production tables, service accounts with ALL PRIVILEGES, and no audit trail of who granted what or why.

The worst part is how much of this gets grandfathered in. Early in a project, everything is wide open because you’re still figuring things out. Years later, nobody’s gone back to clean it up.

The fix is mostly discipline. Role-based access control for everything—users, services, jobs. Automate it with Terraform or Pulumi, and tie everything to groups or service identities, not individuals. Create narrowly scoped roles in your database and grant them only what’s required. In Cassandra, that means SELECT on specific keyspaces or tables. In PostgreSQL, that means splitting read, write, and admin roles by schema and purpose.
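
The same idea in Cassandra, sketched with the DataStax Python driver (names and addresses are placeholders; it assumes authentication and CassandraAuthorizer are enabled on the cluster):

```python
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

# Placeholder admin credentials - in practice sourced from a secrets manager.
auth = PlainTextAuthProvider(username="admin", password="from-a-secrets-manager")
cluster = Cluster(["cassandra-0.internal.example"], auth_provider=auth)
session = cluster.connect()

# One role per service, scoped to the keyspace it actually uses.
session.execute("CREATE ROLE IF NOT EXISTS analytics_ro WITH LOGIN = true AND PASSWORD = 'rotate-me'")
session.execute("GRANT SELECT ON KEYSPACE analytics TO analytics_ro")

# The ingest service gets write access to its own keyspace, nothing more.
session.execute("CREATE ROLE IF NOT EXISTS ingest_rw WITH LOGIN = true AND PASSWORD = 'rotate-me-too'")
session.execute("GRANT MODIFY ON KEYSPACE ingest TO ingest_rw")

cluster.shutdown()
```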

Then review it quarterly. Disable dormant accounts. Expire old credentials. And never allow shared secrets between services or users.

Network Architecture: Flat Networks and Open Outbound Access

Cloud networks often start with the best intentions but quickly become dumping grounds. I’ve lost count of the number of /16 VPCs I’ve seen where every environment - dev, test, staging and prod - shares the same address space, with flat subnetting and overly permissive security groups. Often it’s a symptom of Terraform code copied and pasted across environments: VPCs are so easy to create that it fosters a culture of provisioning networks without much thought.

Just last year I reviewed a Cassandra deployment where the entire cluster had a /16 range, on a public subnet, with SSH open globally “for debugging.” It wasn’t a small company. It was a well-funded enterprise with compliance obligations.

Even putting security groups aside, CIDR design alone is worth rethinking. Use smaller subnets that are meaningful: give databases and Kafka a /24 each (enough for most enterprises) and frontend nodes their own block; if you operate at a larger scale, carve up your network accordingly. Use consistent IP planning so that even without DNS, your engineers can tell what service they’re looking at from the IP alone.
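
Even a few lines of planning go a long way. A hypothetical carve-up of a /16 into purposeful /24s, using nothing but the standard library:

```python
import ipaddress

# Hypothetical prod VPC carved into small, purposeful /24s so that the IP alone
# tells an engineer roughly what they are looking at.
vpc = ipaddress.ip_network("10.40.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))

plan = {
    "postgres":        subnets[0],   # 10.40.0.0/24
    "cassandra":       subnets[1],   # 10.40.1.0/24
    "kafka":           subnets[2],   # 10.40.2.0/24
    "app-frontend":    subnets[16],  # 10.40.16.0/24
    "bastion-and-ops": subnets[32],  # 10.40.32.0/24
}
for name, net in plan.items():
    print(f"{name:16s} {net}  ({net.num_addresses} addresses)")
```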

Another huge problem I see, which doesn’t get talked about enough, is unrestricted outbound access from database servers. I’ve seen PostgreSQL and Cassandra nodes with full egress to the public internet: no proxy, no NAT gateway restrictions, no logging, and sometimes even public IPs assigned directly to the servers.

This is incredibly risky. Any misconfigured tool, compromised host, or rogue query with COPY TO PROGRAM (Postgres) can be used to exfiltrate data. If a server is compromised and has unrestricted outbound, an attacker has all the bandwidth they need to extract sensitive data in real-time, and you’ll never see it coming if you're not logging outbound traffic.

Database servers, particularly in production, should not have direct internet access. If they need to fetch updates or connect to telemetry endpoints, force that through a proxy with strict egress rules. Set up VPC Flow Logs, use Cloud NAT or firewall rules to limit destinations, and log everything. Outbound access should be the exception, not the default.
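
Finding the worst offenders is straightforward with the cloud APIs. A hedged boto3 sketch that flags any security group allowing unrestricted egress; deciding which of those sit in front of the data layer is still up to you:

```python
import boto3

ec2 = boto3.client("ec2")  # region and credentials come from the usual environment

paginator = ec2.get_paginator("describe_security_groups")
for page in paginator.paginate():
    for sg in page["SecurityGroups"]:
        for rule in sg.get("IpPermissionsEgress", []):
            open_v4 = any(r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", []))
            open_v6 = any(r.get("CidrIpv6") == "::/0" for r in rule.get("Ipv6Ranges", []))
            if open_v4 or open_v6:
                print(f"{sg['GroupId']} ({sg.get('GroupName', '')}): egress open to the internet")
                break
```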

If a developer or support script needs to access something external, give them a jump host or bastion with auditable sessions, but not a general-purpose pipe out of your data layer.

Visibility: Logging and Auditing Should Be the Default

Security isn’t just about prevention; it’s about detection and response. Yet far too often, critical systems are deployed with minimal logging, and what logs do exist can be easily deleted or tampered with by anyone who has access to the server.

If someone can SSH into a box and delete the logs that show what they did, you don’t have visibility; you have wishful thinking.

Start with OS-level logging. Every server should be forwarding system logs (syslog, journald, /var/log/auth.log, etc.) to a remote destination as soon as the machine boots. These logs must leave the server—ideally within seconds—so they can’t be wiped after an intrusion. That means centralised logging infrastructure: Fluent Bit, rsyslog, Vector, or an agent that ships to Loki, OpenSearch, or SaaS services like Splunk or Elastic Cloud.
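
The shipping itself belongs to an OS-level agent, but the principle is easy to show at the application layer too: a record goes to the remote collector the moment it is written, not to a local file an intruder can edit. The collector address below is a placeholder.

```python
import logging
import logging.handlers

# Placeholder collector; in practice Fluent Bit, rsyslog, or Vector forwards
# journald/syslog from the host, and applications log to stdout or syslog.
handler = logging.handlers.SysLogHandler(address=("logs.internal.example", 514))
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))

log = logging.getLogger("payments-api")
log.setLevel(logging.INFO)
log.addHandler(handler)

# This leaves the host immediately; wiping local files afterwards does not
# remove it from the central store.
log.warning("interactive shell opened on prod host by user=deploy")
```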

Database-level auditing should be on by default. PostgreSQL can log all DDL and auth events. Cassandra can write full audit logs for queries and login attempts. These should be treated as first-class security signals, not just operational noise.
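
In PostgreSQL the built-in settings already cover the basics (the pgAudit extension goes further). A sketch that turns on DDL and connection logging and reloads the configuration; it assumes a superuser connection and a placeholder host:

```python
import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres host=db.internal.example")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("ALTER SYSTEM SET log_statement = 'ddl'")      # every CREATE/ALTER/DROP
    cur.execute("ALTER SYSTEM SET log_connections = 'on'")     # who connected, from where
    cur.execute("ALTER SYSTEM SET log_disconnections = 'on'")
    cur.execute("SELECT pg_reload_conf()")                     # apply without a restart
```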

Logs should be tagged, searchable, and alertable. You should be notified when someone logs into a prod node, runs a shell, escalates a role, or changes a table definition. These aren’t rare events—they’re signals of change, and they should be monitored like everything else in production.

The goal here is tamper resistance. Nobody, not even root, should be able to erase what happened after the fact. If your logs live only on the server, you don’t have a trail - you have a liability.

Security by Checkbox

One of the most common patterns I encounter is what I’d call "security theatre through compliance". (FYI, I did not come up with this terminology or concept.) The company has a security policy. There’s an approval flow. Someone somewhere signs off on a risk register. There’s a PDF that says "data is encrypted" and "access is controlled" and yet, the reality on the ground looks nothing like what’s on paper.

I’ve seen this across industries: platform teams rush to pass audits by doing the bare minimum. TLS is enabled with a single shared cert. Role-based access control exists in theory, but the default role still has full table access. Secrets are stored in a "secure" location, until you realise it’s just a parameter in a Terraform variable shared between all environments.

Management gets their box ticked. The auditor signs off. But no one verifies the implementation beyond the existence of the config. There’s no threat modelling, no realistic attack simulation, and often no logging of the controls actually working.

This kind of superficial security is worse than no security at all. It gives everyone a false sense of protection, while quietly opening the door to a breach. And when something does go wrong, the postmortem reveals what we already knew: the policy wasn’t enforced, the control wasn’t monitored, and nobody was actually accountable.

If you’re spending more time on documentation than on hardening, you’re not doing security, you’re playing corporate theatre. And attackers don’t care how pretty your Confluence page looks.

Security as Culture, Not a Compliance Exercise

None of the above matters if the culture isn’t right. In many companies, security is seen as a blocker, not a responsibility. Engineers get told "we’ll secure it later", platform teams assume the security team is handling it, and security teams are too busy fighting fires to prevent the next one.

Security needs to be part of your delivery process. You should threat model during design. You should enforce IAM diffs in code review. You should treat privilege boundaries the same way you treat API boundaries - intentional, versioned and reviewed.

And yes, this takes time. But so do outages. So do breaches. We’ve helped companies secure infrastructure after the fact, and it’s always more painful than doing it upfront.

Final Word

Securing data platforms isn't about box-ticking or following trends. It's about operational maturity. It's about designing systems with failure in mind because authentication will fail, secrets will leak, and mistakes will inevitably happen. What matters is how far an attacker can get when they do. As a CIO, you want to be able to go to sleep at night! Are you chancing it?

Most of the breaches you hear about aren’t caused by zero-days. They’re caused by overly trusting defaults, forgotten credentials, flat networks, and nobody watching the logs.

So take the time now to fix it. Not all at once. Start with secrets. Then work on authentication. Then access controls. Then the network. But start. Because this isn’t going away, and the next incident could be yours.

If your team needs help untangling your platform security or building it right from day one, feel free to reach out. At Digitalis, we’ve been through this in cloud, on-prem, hybrid, bare metal - you name it.

It’s never too late to secure the foundations. But it’s better not to wait until you’re on fire.
