GDPR for AI and Machine Learning Companies

AI and machine learning companies face GDPR challenges that most general compliance guidance does not address. The issues are not theoretical: supervisory authorities across the EU have already fined AI companies for training data violations, unlawful profiling, and failure to honour data subject rights requests. This article addresses the GDPR obligations that are specific to companies training, fine-tuning, or deploying AI/ML systems.

The Core Tension: AI Needs Data; GDPR Limits It

AI models are built on data, often large volumes of it. GDPR constrains how that data can be used through four principles:

Purpose limitation: Data collected for one purpose cannot be reused for another incompatible purpose
Data minimisation: Only data that is necessary should be collected and used
Storage limitation: Data should not be kept longer than necessary
Transparency: Data subjects should know how their data is used

Training an AI model on personal data implicates all four principles. This creates a genuine compliance tension that requires deliberate choices, not boilerplate privacy policies.

Training Data: The Foundational GDPR Question

If you train a model on personal data — user inputs, scraped web data containing personal information, purchased datasets, proprietary customer data — you need a lawful basis for that training activity.

Lawful bases for AI training:

Consent (Article 6(1)(a)): Valid but demanding. Consent must be specific to the training purpose, separate from consent for the primary service, and withdrawable. If a user withdraws consent, what happens to the model trained on their data? This is a known hard problem — consent withdrawal and model "unlearning" are not technically straightforward.
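
As a rough illustration of what "specific to the training purpose" and "withdrawable" can mean in a data pipeline, the sketch below selects training rows only for users with an active consent record for training. The `ConsentRecord` type, the `model_training` purpose label, and the `user_id` field are assumptions made for this example, not part of any standard. It only governs future training runs and does nothing about a model that was already trained on the data.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical purpose label. The point is that it is distinct from the
# consent collected for providing the primary service.
TRAINING_PURPOSE = "model_training"


@dataclass(frozen=True)
class ConsentRecord:
    user_id: str
    purpose: str
    granted_at: datetime
    withdrawn_at: Optional[datetime] = None

    def is_active(self, at: datetime) -> bool:
        """Consent counts only between the grant and any withdrawal."""
        if at < self.granted_at:
            return False
        return self.withdrawn_at is None or at < self.withdrawn_at


def eligible_user_ids(consents: list[ConsentRecord], at: datetime) -> set[str]:
    """Users with active consent for the training purpose specifically."""
    return {
        c.user_id
        for c in consents
        if c.purpose == TRAINING_PURPOSE and c.is_active(at)
    }


def select_training_rows(
    rows: list[dict], consents: list[ConsentRecord], at: datetime
) -> list[dict]:
    """Keep only rows whose owner has active training consent at time `at`."""
    allowed = eligible_user_ids(consents, at)
    return [row for row in rows if row["user_id"] in allowed]
```

Running the selection at the start of every training job, rather than once at collection time, means a withdrawal takes effect from the next run onward.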

Legitimate interests (Article 6(1)(f)): The most commonly used basis for AI training in practice. It requires a documented balancing test that weighs your interest in training the model against the data subject's interests and reasonable expectations, and the processing must be necessary for that interest. The balance varies significantly by context:

Training on user inputs from a product they voluntarily use: more defensible, particularly with clear notice and an easy way to object
Scraping public social media data: much harder to justify, since the individuals have no relationship with you and are unlikely to expect the use
Training on customer data provided for one service purpose to improve another: weak, because the reuse also strains purpose limitation

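Research provisions (Article 89): Article 89 is not a lawful basis in its own right, so an Article 6 basis is still needed. What it offers is reduced obligations for academic and genuine public interest research that processes personal data for training, subject to appropriate safeguards such as pseudonymisation and data minimisation. This is a narrow route, not available to most commercial AI companies.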

Special category data in training datasets: If your training data includes health information, political opinions, biometric data, or other special category data — from user inputs or scraped sources — you need an Article 9 justification in addition to the Article 6 basis. This is a common oversight.

Automated Decision-Making (Article 22)

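Article 22 applies to decisions based solely on automated processing, including profiling, that produce legal or similarly significant effects on individuals. A system that only contributes to a decision can still be caught if the human review is a formality rather than a meaningful check. Typical examples: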

Credit scoring, loan decisions
Insurance pricing
Recruitment decisions
Content moderation that affects access to platforms

Requirements when Article 22 applies:

Inform individuals that automated processing is occurring
Provide meaningful information about the logic involved
Give the individual the right to obtain human intervention and to express their point of view
Allow the individual to contest the decision
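
For a simple model, information about the logic involved can be generated directly from the model itself. The sketch below assumes a linear scoring model (score = bias + sum of weight × value) and reports which factors were considered, how they were weighted, and how each one moved this particular decision. The factor names, weights, and threshold are hypothetical. Non-linear models would need a different attribution method, which this example does not cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactorContribution:
    factor: str
    value: float
    weight: float

    @property
    def contribution(self) -> float:
        # In a linear model each factor's effect on the score is weight * value.
        return self.weight * self.value


def explain_decision(
    features: dict[str, float],
    weights: dict[str, float],
    bias: float,
    threshold: float,
) -> dict:
    """Score one applicant and list the factors, most influential first."""
    contributions = [
        FactorContribution(name, features[name], weights[name]) for name in weights
    ]
    score = bias + sum(c.contribution for c in contributions)
    ranked = sorted(contributions, key=lambda c: abs(c.contribution), reverse=True)
    return {
        "outcome": "approved" if score >= threshold else "declined",
        "score": score,
        "threshold": threshold,
        "factors": [
            {
                "factor": c.factor,
                "value": c.value,
                "weight": c.weight,
                "contribution": c.contribution,
            }
            for c in ranked
        ],
    }
```

The returned structure can feed both the notice sent to the individual and the file a human reviewer sees when the decision is contested.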

"Meaningful information" does not require explaining the full model, but it must go beyond vague descriptions. What factors were considered? How were they weighted? Why was this specific decision made?

Data Subject Rights and AI Systems

AI companies face specific challenges in responding to data subject rights:

Right of access: An individual can request a copy of their personal data. If their data was used to train a model, do you need to disclose that? The answer depends on whether you still hold the training data as personal data (if you deleted it after use, there may be nothing left to provide).

Right to erasure: An individual can request deletion of their data. For model training, erasure is technically complex. The EDPB has acknowledged that model "unlearning" (removing the effect of a specific data point from a trained model) is technically challenging and not always required — but the underlying training data must be deleted, and you must document why model retraining is not required.
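
A minimal sketch of that workflow follows, assuming training rows are dictionaries keyed by a hypothetical `subject_id` field. It removes the individual's rows from the training set and refuses to close the request without a recorded rationale when the model is not retrained. It does not attempt unlearning, and the log format is an illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ErasureLogEntry:
    # A real log might store a pseudonym here, since the log entry is
    # itself a record about the individual.
    subject_id: str
    handled_at: datetime
    records_deleted: int
    model_retrained: bool
    retraining_rationale: str


def erase_subject(
    rows: list[dict],
    subject_id: str,
    model_retrained: bool,
    retraining_rationale: str = "",
) -> tuple[list[dict], ErasureLogEntry]:
    """Drop the subject's training rows and record the retraining decision."""
    if not model_retrained and not retraining_rationale.strip():
        raise ValueError("document why model retraining is not required")
    kept = [row for row in rows if row["subject_id"] != subject_id]
    entry = ErasureLogEntry(
        subject_id=subject_id,
        handled_at=datetime.now(timezone.utc),
        records_deleted=len(rows) - len(kept),
        model_retrained=model_retrained,
        retraining_rationale=retraining_rationale.strip(),
    )
    return kept, entry
```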

Right to object to profiling: Individuals can object to processing for profiling purposes. AI systems that build user profiles must provide a mechanism to object and must stop the profiling if the objection is upheld.

Privacy by Design in AI Development

Article 25 requires data protection by design and by default. For AI companies, this means:

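Pseudonymisation before training: Where possible, remove direct identifiers from training datasets, or replace them with pseudonyms, before use. Pseudonymised data is still personal data under GDPR, so this reduces risk without taking the dataset out of scope.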
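
One common way to do this is a keyed hash, sketched below with Python's standard `hmac` module. The field names (`user_id`, `name`, `email`, `phone`) and the output field `subject_pseudonym` are assumptions for the example. A keyed hash is used rather than a plain hash so that someone without the key cannot rebuild the mapping by hashing guessed identifiers.

```python
import hashlib
import hmac

# Hypothetical direct identifiers to strip; adjust to the actual schema.
DIRECT_IDENTIFIERS = ("name", "email", "phone")


def pseudonym(user_id: str, key: bytes) -> str:
    """Stable keyed pseudonym: same user and key always give the same value."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymise(record: dict, key: bytes) -> dict:
    """Drop direct identifiers and replace user_id with a keyed pseudonym."""
    cleaned = {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIERS and field != "user_id"
    }
    cleaned["subject_pseudonym"] = pseudonym(record["user_id"], key)
    return cleaned
```

The key should be stored separately from the training data with tighter access controls, because whoever holds it can re-link pseudonyms to users. Free-text fields may still contain names or contact details and need their own treatment.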

Data minimisation in feature engineering: Only include the features that are necessary for the model's purpose. Don't include age, gender, location, or other personal attributes unless genuinely necessary.
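
In code, minimisation tends to work best as an allowlist: a feature reaches the model only if someone has written down why it is needed. The sketch below assumes a hypothetical churn-prediction model, and the feature names and reasons are invented for illustration.

```python
# Hypothetical allowlist for a churn-prediction model. Each feature carries
# the reason it is needed, so the justification lives next to the code.
ALLOWED_FEATURES = {
    "sessions_last_30d": "core usage signal",
    "plan_tier": "pricing plan is expected to affect churn",
    "support_tickets_90d": "proxy for product friction",
}


def minimise(record: dict) -> dict:
    """Keep only allowlisted features; everything else never reaches training."""
    return {name: record[name] for name in ALLOWED_FEATURES if name in record}


def dropped_fields(record: dict) -> list[str]:
    """Fields present in the raw record that the allowlist excludes."""
    return sorted(set(record) - set(ALLOWED_FEATURES))
```

An allowlist fails safe: a new column added upstream is excluded by default, whereas a blocklist would let it through until someone noticed.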

Differential privacy: Where the dataset is sensitive, consider differential privacy techniques, which add calibrated noise so that outputs reveal very little about whether any one individual's data was included.
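
The core idea can be shown on a simple counting query using the Laplace mechanism, sketched below with only the standard library. This is an illustration of the noise-calibration principle on an aggregate statistic, not a differentially private training procedure; applying it to model training is more involved. The `epsilon` values are arbitrary, and production systems should use a vetted library, since naive floating-point noise sampling has known weaknesses.

```python
import random
from typing import Callable, Iterable


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(
    values: Iterable,
    predicate: Callable[[object], bool],
    epsilon: float,
    rng: random.Random,
) -> float:
    """Noisy count of items matching `predicate` under epsilon-DP."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one person changes a count by at most 1, so the
    # sensitivity is 1 and the Laplace scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A smaller `epsilon` means more noise and stronger protection. Each released answer consumes privacy budget, so repeated queries over the same data need to be accounted for together.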

Model output review: AI models can leak training data through outputs — especially language models. Implement testing to detect and prevent training data memorisation in generative outputs.
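
A very basic version of such a test checks whether generated text reproduces long word sequences from the training set verbatim. The sketch below uses whitespace tokenisation and exact n-gram matching, both simplifying assumptions, and the window size `n` is a tuning choice. It will miss paraphrased or lightly altered leaks, so it is a floor for testing rather than a complete safeguard.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All runs of n consecutive lowercased whitespace-separated tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def build_index(training_texts: list[str], n: int) -> set[tuple[str, ...]]:
    """Collect every n-gram that appears anywhere in the training texts."""
    index: set[tuple[str, ...]] = set()
    for text in training_texts:
        index |= ngrams(text, n)
    return index


def memorised_spans(output: str, index: set[tuple[str, ...]], n: int) -> list[str]:
    """N-grams in a model output that also occur verbatim in training data."""
    return sorted(" ".join(gram) for gram in ngrams(output, n) & index)
```

In a test suite, a non-empty result for outputs generated from adversarial prompts would flag the output for review before release.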

DPIAs for AI Processing

Supervisory authorities consider AI-based profiling, large-scale training on personal data, and automated decision-making to be high-risk processing requiring a Data Protection Impact Assessment (DPIA) under Article 35. The EDPB has specifically listed automated decision-making and large-scale processing as triggers.

A DPIA for AI processing should cover:

The training data sources and legal basis
The automated decision-making logic (to the extent describable)
Risks to data subjects (profiling, discrimination, exposure)
Mitigations implemented
Consultation with the DPO if one is appointed
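
Teams that track DPIAs alongside their model code sometimes keep the assessment as a structured record so that gaps are visible before a release. The sketch below mirrors the five items above; the class and field names are hypothetical, and a populated record is a completeness check only, not evidence that the assessment itself is adequate.

```python
from dataclasses import dataclass, field


@dataclass
class AiDpia:
    training_data_sources: list[str] = field(default_factory=list)
    legal_basis: str = ""
    decision_logic_summary: str = ""
    risks_to_data_subjects: list[str] = field(default_factory=list)
    mitigations: list[str] = field(default_factory=list)
    dpo_appointed: bool = False
    dpo_consulted: bool = False

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty, in checklist order."""
        required = {
            "training_data_sources": self.training_data_sources,
            "legal_basis": self.legal_basis.strip(),
            "decision_logic_summary": self.decision_logic_summary.strip(),
            "risks_to_data_subjects": self.risks_to_data_subjects,
            "mitigations": self.mitigations,
        }
        missing = [name for name, value in required.items() if not value]
        # DPO consultation only applies where a DPO has been appointed.
        if self.dpo_appointed and not self.dpo_consulted:
            missing.append("dpo_consultation")
        return missing
```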