Thousands of companies use the Ray framework to scale and run highly complex, compute-intensive AI workloads — in fact, you’d be hard-pressed to find a large language model (LLM) that hasn’t been built on Ray.
Those workloads contain loads of sensitive data, which, researchers have found, could be exposed through a critical vulnerability in the open-source unified compute framework.
For the last seven months, this flaw has allowed attackers to exploit thousands of companies’ AI production workloads, computing power, credentials, passwords, keys, tokens and “a trove” of other sensitive information, according to new research from Oligo Security.
The vulnerability is under dispute, meaning its maintainer does not consider it a risk, so it has no patch. This makes it a “shadow vulnerability,” or one that doesn’t appear in scans. Fittingly, researchers have dubbed it “ShadowRay.”
This marks the “first known instance of AI workloads actively being exploited in the wild through vulnerabilities in modern AI infrastructure,” write researchers Avi Lumelsky, Guy Kaplan and Gal Elbaz.
“When attackers get their hands on a Ray production cluster, it is a jackpot,” they assert. “Valuable company data plus remote code execution makes it easy to monetize attacks — all while remaining in the shadows, totally undetected (and, with static security tools, undetectable).”
Creating a glaring blind spot
Many organizations rely on Ray to scale and run large and complex AI, data and SaaS workloads — including giants Amazon, Instacart, Shopify, LinkedIn and OpenAI, whose GPT-3 was trained on Ray.
This is because models comprising billions of parameters require intense computational power and can’t fit in the memory of a single machine. The framework, which is maintained by Anyscale, supports distributed workloads for training, serving and tuning AI models of all architectures. Users don’t have to be proficient in Python, installation is simple and there are few dependencies, the Oligo researchers point out.
They ultimately described Ray as the “Swiss Army knife for Pythonistas and AI practitioners.”
But this makes ShadowRay all the more concerning. The vulnerability, identified as CVE-2023-48022, is the result of a lack of authorization in the Ray Jobs API. This exposes the API to remote code execution attacks. Anyone with dashboard network access could invoke “arbitrary jobs” without needing permission, according to researchers.
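To make the mechanics concrete, here is a minimal, deliberately benign sketch using Ray’s documented job-submission client; the hostname is a placeholder, and the point is only that network reachability, with no credentials, is enough to run code on a cluster:

```python
# Illustrative sketch, not exploit tooling: the Ray Jobs API accepts
# any entrypoint from anyone who can reach the dashboard port.
# "dashboard-host" is a placeholder for an exposed dashboard address.
from ray.job_submission import JobSubmissionClient

# No token, no password: reachability is the only requirement.
client = JobSubmissionClient("http://dashboard-host:8265")

# The entrypoint is an arbitrary shell command executed on the cluster;
# this one is harmless, but it could just as easily read credentials.
job_id = client.submit_job(entrypoint="echo 'this ran on the cluster'")
print(f"Job submitted with no authentication: {job_id}")
```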
The vulnerability was disclosed to Anyscale along with four others in late 2023 — but while all the others were quickly addressed, CVE-2023-48022 was not. Anyscale ultimately disputed the vulnerability, calling it “an expected behavior and a product feature” that enables the “triggering of jobs and execution of dynamic code within a cluster.”
Anyscale contends that dashboards should either not be internet-facing, or only accessible to trusted parties. Ray doesn’t have authorization because it is assumed that it will run in a safe environment with “proper routing logic” via network isolation, Kubernetes namespaces, firewall rules or security groups, the company says.
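One practical way to check that guidance, sketched here under the assumption that clusters use the default dashboard port 8265 (the hostnames are hypothetical placeholders), is to probe the port from an untrusted network vantage point:

```python
# Minimal exposure check: if the dashboard port answers from outside the
# trusted network, the Jobs API is reachable and the cluster is exposed.
# The host below is a hypothetical placeholder for your own inventory.
import socket

HOSTS = ["ray-head.example.internal"]  # placeholder inventory
DASHBOARD_PORT = 8265                  # Ray dashboard default

for host in HOSTS:
    try:
        with socket.create_connection((host, DASHBOARD_PORT), timeout=3):
            print(f"{host}:{DASHBOARD_PORT} is reachable -- isolate it")
    except OSError:
        print(f"{host}:{DASHBOARD_PORT} is not reachable from here")
```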
This decision “underscores the complexity of balancing security and usability in software development,” the Oligo researchers write, “highlighting the importance of careful consideration in implementing changes to critical systems like Ray and other open-source components with network access.”
However, disputed tags make these kinds of attacks difficult to detect; many scanners simply ignore them. To this point, researchers report that ShadowRay did not appear in several databases, including Google’s Open Source Vulnerability Database (OSV). Disputed vulnerabilities are also invisible to static application security testing (SAST) and software composition analysis (SCA) tools.
“This created a blind spot: Security teams around the world had no idea that they could be at risk,” the researchers write. At the same time, “AI experts are not security experts — leaving them potentially dangerously unaware of the very real risks posed by AI frameworks.”
From production workloads to OpenAI and Hugging Face tokens
Researchers report that a “trove” of information was leaked from compromised servers, including:
- AI production workloads, allowing attackers to disrupt a model’s integrity or accuracy, or steal or infect models during the training phase.
- Access to the cloud environment (AWS, GCP, Azure, Lambda Labs) and sensitive cloud services. This could leak sensitive production data, including complete databases with customer data, codebases, artifacts and secrets.
- Kubernetes API access, which could enable attackers to infect cloud workloads or steal Kubernetes secrets.
- Passwords and OpenAI, Stripe and Slack credentials.
- Production DB credentials, potentially allowing threat actors to silently download complete databases. In some cases, attackers could also modify a database or encrypt it with ransomware.
- Private SSH keys that can be used to connect to more machines from the same VM image template. This would enable attackers to reach more compute for crypto-mining campaigns.
- OpenAI tokens, which could be used to access impacted accounts and drain their credits.
- Hugging Face tokens, which could provide access to private repositories and allow threat actors to add and override existing models. These could then be used for supply chain attacks.
- Stripe tokens, which could be used to drain payment accounts by signing transactions on the live platform.
- Slack tokens, which could be leveraged to read messages or send arbitrary messages.
Oligo has found “hundreds” of compromised clusters, most consisting of many nodes whose GPUs attackers are using for cryptocurrency mining. The researchers further report that most of the compromised GPU models are “currently out of stock and hard to get.”
“In other words, attackers choose to compromise these machines not only because they can obtain valuable sensitive information, but because GPUs are very expensive and difficult to obtain, especially these days,” the researchers write, pointing out that GPU on-demand prices on AWS can reach an annual cost of $858,480 per machine.
Attackers had seven months to leverage this hardware, and researchers estimate that the machines and compute power that could have been compromised are worth close to $1 billion in total.
They warn: “Attackers are doing the same math.”
Shining a light on shadow vulnerabilities
The Oligo researchers concede that “shadow vulnerabilities will always exist” and that signs of exploitation vary: data could be loaded from untrusted sources, firewall rules might be missing, or users may not account for the behavior of dependencies.
They advised organizations to take several actions, including:
- Always running Ray within a secure and trusted environment.
- Adding firewall rules or security groups to prevent unauthorized access.
- Continuously monitoring production environments and AI clusters for anomalies, even within Ray.
- If a Ray dashboard does need to be accessible, implementing a proxy that adds an authorization layer (see the sketch after this list).
- Never trusting the default.
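As a rough illustration of the proxy recommendation, here is a minimal sketch that puts a shared-token check in front of a dashboard assumed to be bound to localhost:8265; it handles GET requests only, and a production deployment would use a hardened reverse proxy with real authentication instead:

```python
# Minimal sketch of an authorizing proxy for a Ray dashboard.
# Assumptions (not from the article): the dashboard is bound to
# 127.0.0.1:8265 and clients present a shared bearer token.
import os
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

DASHBOARD = "http://127.0.0.1:8265"    # dashboard reachable only locally
TOKEN = os.environ["RAY_PROXY_TOKEN"]  # shared secret, set out of band

class AuthProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Reject requests without the expected bearer token. A real
        # deployment would use constant-time comparison and proper auth.
        if self.headers.get("Authorization") != f"Bearer {TOKEN}":
            self.send_error(401, "missing or invalid token")
            return
        # Forward the authorized request to the local dashboard.
        # Upstream error handling is elided in this sketch.
        with urllib.request.urlopen(DASHBOARD + self.path) as upstream:
            self.send_response(upstream.status)
            for name, value in upstream.getheaders():
                if name.lower() not in ("transfer-encoding", "connection"):
                    self.send_header(name, value)
            self.end_headers()
            self.wfile.write(upstream.read())

if __name__ == "__main__":
    # Expose only this authenticated proxy, never the dashboard itself.
    HTTPServer(("0.0.0.0", 8266), AuthProxy).serve_forever()
```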
Ultimately, they emphasize: “The technical burden of securing open source is yours. Don’t rely on the maintainers.”