
The U.S. Cybersecurity and Infrastructure Security Agency (CISA), in collaboration with the National Security Agency (NSA), the Federal Bureau of Investigation (FBI), and international partners, has released a joint Cybersecurity Information Sheet titled ‘AI Data Security: Best Practices for Securing Data Used to Train and Operate AI Systems.’ It focuses on the critical importance of securing data that underpins artificial intelligence (AI) systems, emphasizing that the accuracy, integrity, and trustworthiness of AI outcomes are only as strong as the data used to build and run them. It identifies risks related to data security and integrity across the AI lifecycle, from development and testing to deployment and operational use.
Building upon the NSA’s April 2024 joint guidance on ‘Deploying AI Systems Securely,’ the latest guidance delves deeper into securing the data used to train and operate AI-based systems. Owners and operators of National Security Systems, the Defense Industrial Base, federal agencies, and critical infrastructure sectors are urged to review the publication and implement its recommended best practices. These include adopting strong data protection protocols, proactively managing risks, and enhancing capabilities in monitoring, threat detection, and network defense to safeguard sensitive, proprietary, and mission-critical information within AI and machine learning environments.
As AI systems become deeply embedded in core operations, securing the data that fuels them is not optional; it is essential.
This guidance aims to achieve three core objectives. First, it seeks to increase awareness of the data security risks that can emerge throughout the AI lifecycle, particularly during development, testing, and deployment. Second, it offers actionable best practices for securing AI data at each stage, with detailed analysis of the most critical areas of vulnerability. Third, it promotes the adoption of strong data protection measures and encourages organizations to implement proactive risk mitigation strategies, helping to build a secure and resilient foundation for AI systems.
Data security is a core pillar across the entire AI system lifecycle. Since machine learning models learn directly from the data they are fed, any compromise to that data can distort the system’s logic and decision-making. If an attacker manipulates training or operational data, they can corrupt outputs, introduce bias, or even hijack system behavior for malicious ends.
To mitigate these risks, the National Institute of Standards and Technology (NIST) outlined six stages in its AI Risk Management Framework (RMF), beginning with Plan and Design and continuing through to Operate and Monitor. Each stage depends on strong data integrity. Without it, the reliability, security, and ethical foundation of AI systems cannot be assured.
Throughout the AI system lifecycle, securing data is paramount to maintaining information integrity and system reliability. In the initial Plan and Design phase, organizations should plan data protection measures that proactively mitigate potential risks. In the Collect and Process Data phase, data must be carefully analyzed, labeled, sanitized, and protected from breaches and tampering.
Securing data in the Build and Use Model phase helps ensure models are trained on reliably sourced, accurate, and representative information. In the Verify and Validate phase, comprehensive testing of AI models, derived from training data, can identify security flaws and enable their mitigation.
Verification and validation is also necessary each time new data or user feedback is introduced into the model; that data must therefore be handled to the same security standards as the original training data. Implementing strict access controls protects data from unauthorized access, especially in the Deploy and Use phase.
Lastly, continuous data risk assessments in the Operate and Monitor phase are necessary to adapt to evolving threats. Neglecting these practices can lead to data corruption, compromised models, data leaks, and non-compliance, emphasizing the critical importance of robust data security at every phase.
To protect data used in AI-based systems, the advisory calls upon system owners to source data from reliable, trusted providers. Tracking data provenance is essential. Maintaining logs of where data originated and how it flows through the system helps detect tampering and provides accountability. Using secure, cryptographically signed provenance databases can make it difficult for attackers to manipulate data without being detected.
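The advisory does not prescribe a particular implementation, but a provenance trail can be as simple as an append-only, keyed log in which each record is chained to the one before it. The Python sketch below illustrates the idea; the dataset names, fields, and key handling are illustrative assumptions rather than anything drawn from the guidance.

```python
import hashlib
import hmac
import json
import time

# Illustrative secret; in practice the key would come from a key management service.
SIGNING_KEY = b"replace-with-managed-secret"

def record_provenance(log, dataset_id, source, transform):
    """Append a tamper-evident provenance entry chained to the previous record."""
    prev_digest = log[-1]["digest"] if log else "0" * 64
    entry = {
        "dataset_id": dataset_id,
        "source": source,
        "transform": transform,
        "timestamp": time.time(),
        "prev_digest": prev_digest,
    }
    payload = json.dumps(entry, sort_keys=True).encode()
    # The keyed HMAC ties the record to the signing key; chaining digests makes
    # deletion or reordering of earlier entries detectable.
    entry["digest"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    log.append(entry)
    return entry

provenance_log = []
record_provenance(provenance_log, "train-v1", "vendor-feed", "deduplicated")
record_provenance(provenance_log, "train-v1", "vendor-feed", "pii-scrubbed")
```

Because each entry's digest covers the digest of the record before it, an attacker who alters or removes an earlier entry breaks the chain for everything that follows.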
Ensuring data integrity during storage and transit is also critical. Organizations should use cryptographic hashes and checksums to verify that data has not been altered. This approach protects the reliability and authenticity of the datasets that feed into AI models.
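As a rough illustration of this practice, the following sketch streams a dataset file through SHA-256 and compares the result with a digest recorded at ingest time; the file name and expected digest are placeholders, not values from the advisory.

```python
import hashlib

def sha256_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the digest recorded when the dataset was first ingested.
expected = "<digest-recorded-at-ingest>"          # placeholder value
actual = sha256_digest("training_data.parquet")   # hypothetical dataset file
if actual != expected:
    raise RuntimeError("Dataset checksum mismatch: possible tampering or corruption")
```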
Digital signatures can further enhance data trustworthiness by verifying the authenticity of datasets used in model training or post-training processes. Using quantum-resistant signature standards and trusted certificate authorities ensures that only verified, authorized changes are accepted.
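A minimal signing-and-verification sketch is shown below using the third-party Python cryptography package and Ed25519 keys. Ed25519 is used here only because its API is widely available; it is not quantum-resistant, and the advisory's recommendation points toward post-quantum standards as tooling matures. Generating keys inline is also an illustrative shortcut; real keys would be issued and held by a trusted certificate authority or hardware security module.

```python
# Requires the third-party "cryptography" package (pip install cryptography).
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Illustrative key generation; production keys come from a trusted CA or HSM.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

dataset_bytes = open("training_data.parquet", "rb").read()  # hypothetical dataset file
signature = private_key.sign(dataset_bytes)

# Consumers verify the signature before using the dataset for training or fine-tuning.
try:
    public_key.verify(signature, dataset_bytes)
    print("Dataset signature verified")
except InvalidSignature:
    raise RuntimeError("Dataset signature invalid: do not use for training")
```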
Trusted infrastructure is another vital component. Secure computing environments based on zero trust principles help isolate sensitive data operations. Trusted execution environments and secure enclaves ensure data remains protected during processing, reducing the risk of tampering and unauthorized access.
Data classification and access control are essential for applying appropriate security protections. Sensitive data should be clearly labeled, and access should be restricted based on classification levels. Encryption and other security controls must reflect the sensitivity of the input data, which often matches the required protection level of the AI system’s output.
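One hedged way to picture this is a simple label-based access check, in which a requester's clearance must meet or exceed a dataset's classification before it can be used. The labels and dataset names below are invented for illustration.

```python
from enum import IntEnum

class Classification(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3

# Hypothetical dataset labels; real systems would store these in a data catalog.
dataset_labels = {
    "marketing_corpus": Classification.PUBLIC,
    "customer_transcripts": Classification.RESTRICTED,
}

def can_access(user_clearance: Classification, dataset: str) -> bool:
    """Allow access only when clearance meets or exceeds the dataset's label."""
    return user_clearance >= dataset_labels[dataset]

assert can_access(Classification.CONFIDENTIAL, "marketing_corpus")
assert not can_access(Classification.CONFIDENTIAL, "customer_transcripts")
```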
Encryption remains a cornerstone of AI data protection. Data should be encrypted at rest, in transit, and during processing using strong standards such as AES-256. Transport layer protocols like TLS with post-quantum encryption support secure data movement between systems.
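For example, data written to disk can be protected with AES-256 in an authenticated mode such as GCM. The sketch below uses the Python cryptography package; in practice the key would live in a key management service rather than in application code, and the plaintext and labels are placeholders.

```python
# Requires the third-party "cryptography" package.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)   # 256-bit key; keep in a KMS, not in code
aesgcm = AESGCM(key)

plaintext = b"feature vectors or labels destined for disk"
nonce = os.urandom(12)                      # unique nonce per encryption operation
associated_data = b"dataset-id:train-v1"    # authenticated but not encrypted

ciphertext = aesgcm.encrypt(nonce, plaintext, associated_data)
recovered = aesgcm.decrypt(nonce, ciphertext, associated_data)
assert recovered == plaintext
```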
Secure data storage is equally important. Data should be stored in certified, compliant hardware that meets high cryptographic standards, such as those specified in NIST FIPS 140-3. Organizations should assess the appropriate level of storage security based on their risk profile and operational needs.
Privacy-preserving techniques like data masking, differential privacy, and federated learning can reduce exposure of sensitive information while still enabling effective AI model development. These methods help organizations balance utility and privacy, even though some come with computational trade-offs.
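As a small illustration of one such technique, the sketch below computes a differentially private mean of a sensitive attribute by adding Laplace noise calibrated to the query's sensitivity. The attribute, bounds, and epsilon value are illustrative assumptions; production deployments would rely on a vetted differential-privacy library.

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean via the Laplace mechanism (illustrative only)."""
    values = np.clip(np.asarray(values, dtype=float), lower, upper)
    true_mean = values.mean()
    # Sensitivity of the mean of n values bounded in [lower, upper] is (upper - lower) / n.
    sensitivity = (upper - lower) / len(values)
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_mean + noise

ages = [34, 29, 41, 52, 38, 45]            # hypothetical sensitive attribute
print(dp_mean(ages, lower=18, upper=90, epsilon=1.0))
```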
When AI-related data is no longer needed, it should be securely deleted. Techniques such as cryptographic erase or data overwrite can help ensure data is unrecoverable, reducing residual risk from decommissioned systems.
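Cryptographic erase, for instance, relies on data only ever being stored encrypted, so that destroying the key renders the stored ciphertext unrecoverable. The sketch below is a conceptual illustration of that idea, not a decommissioning procedure.

```python
# Conceptual cryptographic-erase sketch using the "cryptography" package.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)
ciphertext = AESGCM(key).encrypt(nonce, b"decommissioned training records", None)

# ... ciphertext persisted to storage ...

# In practice, erase the key from the KMS/HSM and every backup; without it,
# the persisted ciphertext can no longer be decrypted.
key = None
```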
Finally, organizations must conduct continuous data security risk assessments. Using frameworks such as NIST’s AI RMF and SP 800-37 helps identify vulnerabilities, track evolving threats, and guide ongoing improvements to security measures. Maintaining a proactive and adaptive security posture is key to protecting data integrity and supporting trustworthy AI systems.
The guidance highlighted that a strong data management strategy is essential for maintaining control over AI system inputs. By following best practices, organizations can more easily add and track new data used for training or model adaptation. This makes it possible to pinpoint data elements that contribute to model drift and take corrective action when needed.
Developers should apply data-quality testing throughout the AI lifecycle. Using assessment tools helps filter and validate the data chosen for training or updating models. A clear understanding of dataset quality and its effect on model performance is critical for identifying shifts that may compromise results.
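A lightweight example of such testing, assuming a tabular dataset handled with pandas, might flag duplicates, missing values, and constant columns before a batch is admitted for training; the file name and checks below are hypothetical.

```python
import pandas as pd

def basic_quality_report(df: pd.DataFrame) -> dict:
    """Lightweight data-quality checks before a batch is admitted for training."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_by_column": df.isna().sum().to_dict(),
        "constant_columns": [c for c in df.columns if df[c].nunique(dropna=True) <= 1],
    }

candidate = pd.read_csv("candidate_batch.csv")   # hypothetical incoming batch
report = basic_quality_report(candidate)
if report["duplicate_rows"] or any(v > 0 for v in report["missing_by_column"].values()):
    print("Batch flagged for review:", report)
```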
Monitoring the inputs and outputs of AI systems helps ensure they function as intended. Regularly updating models with fresh data and applying statistical analysis to compare training and test datasets can reveal whether data drift is occurring. This proactive approach helps maintain accuracy and reliability over time.
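One common statistical check is a two-sample Kolmogorov-Smirnov test comparing a feature's training distribution with recent production inputs. The sketch below uses synthetic data to stand in for both; the significance threshold is an illustrative choice rather than guidance from the advisory.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)    # stand-in for a training column
production_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)  # stand-in for recent inputs

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the distributions differ.
statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:
    print(f"Possible data drift detected (KS statistic={statistic:.3f}, p={p_value:.2e})")
```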
The advisory emphasized that data security is fundamental to the safe and trustworthy development and operation of AI systems. As organizations across industries increasingly depend on AI to drive decision-making, the integrity, accuracy, and reliability of these systems hinge on protecting the data that fuels them.
The Cybersecurity Information Sheet offers a comprehensive framework for securing AI data. It addresses critical risks such as compromised data supply chains, malicious data injection, and model degradation from data drift. These threats, if left unmitigated, can undermine the core functionality and trustworthiness of AI applications.
Data security is not static. As threats continue to evolve, so must the strategies used to defend against them. The guidance encourages organizations to adopt a proactive posture, applying the highest data protection standards throughout the AI lifecycle. By implementing these best practices and risk mitigation techniques, organizations can better safeguard sensitive, proprietary, and mission-critical data while reinforcing the performance and reliability of their AI systems.

Anna Ribeiro
Industrial Cyber News Editor. Anna Ribeiro is a freelance journalist with over 14 years of experience in the areas of security, data storage, virtualization and IoT.