The research team scanned 400 terabytes of Common Crawl’s December 2024 dataset, which included 2.67 billion web pages from 47.5 million hosts. Using their open-source tool TruffleHog, they identified thousands of exposed credentials, including AWS, Slack, and Mailchimp authentication tokens. The analysis found that 63% of these secrets were reused across multiple domains, with a single WalkScore API key appearing over 57,000 times across 1,871 subdomains. Even more concerning, some AWS root keys were discovered embedded in front-end HTML, while 17 unique Slack webhooks were hardcoded into a single chat feature.
Mailchimp API keys were among the most frequently exposed credentials, appearing in over 1,500 instances, often embedded directly in client-side JavaScript. Keys exposed this way can be abused to run phishing campaigns against a company's subscribers or to exfiltrate mailing-list data. The study also warned that LLMs cannot distinguish between functional and non-functional credentials during training, increasing the risk that models will reproduce these insecure patterns in generated code.
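To picture the pattern the researchers describe, the sketch below contrasts a Mailchimp-style key hardcoded in browser JavaScript with a server-side proxy that keeps the key in an environment variable. The key value, list ID, endpoint path, and `MAILCHIMP_API_KEY` variable name are illustrative assumptions, not details from the study.

```typescript
// ---- client.ts (insecure pattern of the kind found in the crawl) ---------
// Anything shipped to the browser is public: the key ends up in page source,
// in web archives like Common Crawl, and potentially in LLM training data.
// Key value, list ID, and data center are made up for illustration.
const MAILCHIMP_API_KEY = "0123456789abcdef0123456789abcdef-us21";

fetch("https://us21.api.mailchimp.com/3.0/lists/abc123/members", {
  headers: { Authorization: "Basic " + btoa("anystring:" + MAILCHIMP_API_KEY) },
});

// ---- server.ts (safer alternative) ----------------------------------------
// Keep the key in an environment variable on the server and expose only a
// narrow endpoint to the browser. Route and env-var name are assumptions.
import { createServer } from "node:http";

createServer((req, res) => {
  if (req.method === "POST" && req.url === "/api/subscribe") {
    const key = process.env.MAILCHIMP_API_KEY; // never reaches the client
    // ...validate the request body, then call the Mailchimp API with `key`...
    res.statusCode = 202;
    res.end();
  } else {
    res.statusCode = 404;
    res.end();
  }
}).listen(3000);
```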
Processing such a vast dataset posed challenges. Truffle Security deployed a 20-node AWS cluster to scan the 90,000 WARC files containing raw HTML, JavaScript, and server responses. Initial streaming inefficiencies slowed processing, but optimizations to the AWS setup improved download speeds by up to six times. Once scanning was complete, the researchers prioritized responsible disclosure, working with vendors like Mailchimp to revoke thousands of compromised keys rather than contacting each affected website owner directly.
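A highly simplified, single-machine sketch of this kind of scan is shown below: it streams an already-decompressed WARC or HTML dump line by line and applies regular expressions for a few well-known credential formats. The patterns and file handling are illustrative only; real WARC files are gzipped and record-structured, TruffleHog's detectors also verify whether a credential is live, and the study's pipeline ran across a 20-node cluster rather than one process.

```typescript
// Minimal streaming secret scan over a decompressed WARC/HTML dump.
// The patterns are examples of widely known credential formats, not
// TruffleHog's actual detector set.
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

const PATTERNS: Record<string, RegExp> = {
  "AWS access key ID": /AKIA[0-9A-Z]{16}/g,
  "Slack webhook": /https:\/\/hooks\.slack\.com\/services\/T\w+\/B\w+\/\w+/g,
  "Mailchimp API key": /[0-9a-f]{32}-us\d{1,2}/g,
};

async function scan(path: string): Promise<void> {
  const lines = createInterface({ input: createReadStream(path) });
  let lineNo = 0;
  for await (const line of lines) {
    lineNo++;
    for (const [name, pattern] of Object.entries(PATTERNS)) {
      for (const match of line.matchAll(pattern)) {
        // A production pipeline would verify the candidate against the
        // vendor's API before reporting it, as TruffleHog does.
        console.log(`${path}:${lineNo} possible ${name}: ${match[0]}`);
      }
    }
  }
}

scan(process.argv[2] ?? "sample.warc.txt").catch(console.error);
```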
The findings underscore a critical security dilemma: LLMs trained on publicly accessible data may inherit its vulnerabilities. While models like DeepSeek employ safeguards such as fine-tuning, alignment techniques, and prompt constraints, the widespread presence of hardcoded secrets in training corpora risks normalizing insecure practices. Placeholder tokens further complicate the issue, as LLMs cannot verify whether credentials are active or merely examples.
To address these concerns, Truffle Security recommends integrating AI security guardrails into development tools. For example, GitHub Copilot’s Custom Instructions can enforce policies against hardcoding secrets. Expanding secret-scanning initiatives to include archived web data would help detect historical leaks that resurface in training datasets. Additionally, adopting Constitutional AI techniques could better align AI models with security best practices, reducing the risk of inadvertent exposure.
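As one concrete illustration of the first recommendation, a repository-level Custom Instructions file for GitHub Copilot (commonly placed at .github/copilot-instructions.md) can state a no-hardcoded-secrets policy that the assistant takes into account when generating code. The wording below is a sketch of what such a policy might say, not text from the study or from GitHub's documentation.

```
<!-- .github/copilot-instructions.md (illustrative sketch) -->
- Never hardcode API keys, tokens, passwords, or other credentials in source
  code, committed configuration, or client-side JavaScript.
- Read secrets from environment variables or a secrets manager; in examples,
  use placeholders such as YOUR_API_KEY rather than realistic-looking values.
- When generating code that calls third-party APIs (e.g., AWS, Slack,
  Mailchimp), add a comment reminding the developer to store the credential
  securely and keep it out of the repository.
```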
As LLMs continue to influence software development, securing their training data is no longer optional—it is fundamental to building a safer digital future.
Source: Cyber Security News