By virtue of the privacy-by-design principle — a key premise of the EU General Data Protection Regulation — any new initiative within a company that involves personal data must incorporate privacy protections throughout the entire engineering life cycle, from initial design to final deployment.
As AI expands across industries, ensuring the technology is developed under a privacy-by-design approach is crucial to safeguarding individual rights while leveraging the remarkable benefits it can offer.
Building on this premise, Spain's data protection authority, the Agencia Española de Protección de Datos, recently published an article questioning the common belief that more complex AI models are inherently superior. Instead, the AEPD demonstrates that well-designed, streamlined AI systems can not only strengthen compliance with data protection laws, but also deliver better outcomes, underscoring the importance of thoughtful, privacy-focused development.
A surprising experiment: One neuron vs. many
The AEPD suggests developing two different AI models — a single-neuron network and a multi-neuron network — for an example task: determining whether someone is overweight using only height and weight. Comparing their results highlights how the choice of model can significantly affect both accuracy and compliance with GDPR principles.
A single-neuron model takes a straightforward approach, processing height and weight through a simple formula, then classifying individuals as either "overweight" or "not overweight." Surprisingly, even a small training set of just three samples can be enough for this model to generalize across different height-weight scenarios.
Essentially, the neuron establishes a line in a two-dimensional space, grouping points on one side as overweight and those on the other as not overweight. While not flawless, this straightforward method can perform quite well, as long as the relationship between height and weight is relatively direct.
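The AEPD describes this model in prose rather than code, but its mechanics are easy to sketch. The short Python snippet below trains a single neuron — a weighted sum passed through a sigmoid — on three labeled height-weight pairs; the sample values and test cases are illustrative assumptions, not the AEPD's data.

```python
import numpy as np

# Illustrative training set (height in meters, weight in kilograms);
# these are assumed values, not the AEPD's actual samples.
# Label 1 = overweight, 0 = not overweight.
X = np.array([[1.60, 90.0],    # short and heavy   -> overweight
              [1.80, 70.0],    # tall and light    -> not overweight
              [1.70, 85.0]])   # mid-height, heavy -> overweight
y = np.array([1.0, 0.0, 1.0])

# Standardize the features so gradient descent behaves well.
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu) / sigma

# One neuron: a weighted sum plus bias, squashed by a sigmoid,
# trained by gradient descent on the logistic loss.
w, b = np.zeros(2), 0.0
lr = 0.5
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))  # predicted probabilities
    w -= lr * Xs.T @ (p - y) / len(y)        # gradient step on weights
    b -= lr * (p - y).mean()                 # gradient step on bias

def predict(height_m, weight_kg):
    xs = (np.array([height_m, weight_kg]) - mu) / sigma
    return "overweight" if xs @ w + b > 0 else "not overweight"

# The learned line w.x + b = 0 splits the plane into two regions.
print(predict(1.65, 95))  # expected: overweight
print(predict(1.85, 75))  # expected: not overweight
```

Because the two classes in this toy dataset are linearly separable, even three samples let gradient descent place a sensible boundary.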
Now suppose a single neuron doesn't seem sophisticated enough and a more complex network is pursued — one with multiple layers capable of mapping intricate decision boundaries. In theory, this bigger model seems "smarter," since it can adapt to more complex patterns. But if it is still trained on just a handful of examples, it may end up generating odd "hallucinations," such as classifying a healthy person as overweight. Why? Because the model is overfitting, forcing its decision boundary to match those few points at all costs.
This lack of sufficient data can lead to unexpected, inaccurate outcomes — ironically making the more advanced system less reliable than its simpler counterpart.
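To see this failure mode concretely, one can train a deliberately oversized network on the same three points. The sketch below uses scikit-learn's MLPClassifier as a stand-in for the AEPD's multi-neuron network; the probe points are assumed values chosen to sit away from the training samples.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# The same three illustrative samples as above (height m, weight kg).
X = np.array([[1.60, 90.0], [1.80, 70.0], [1.70, 85.0]])
y = np.array([1, 0, 1])  # 1 = overweight, 0 = not overweight

# A deliberately oversized network for three data points.
net = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=5000,
                    random_state=0)
net.fit(X, y)

# The network typically fits its own three samples perfectly ...
print(net.score(X, y))  # usually 1.0

# ... but between and beyond them its decision boundary is essentially
# unconstrained, so plausible new cases can come back mislabeled.
# Results vary with the random seed -- that arbitrariness is the point.
probes = np.array([[1.75, 72.0],   # likely a healthy combination
                   [1.55, 95.0]])  # likely overweight
print(net.predict(probes))
```

Rerun with different seeds, the probe predictions can flip: nothing in three samples pins the boundary down where it actually matters.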
One natural response might be to feed the larger network more training data — say 23 samples instead of three — carefully labeling which cases are overweight and which are not.
This extra data usually helps the model form a more sensible boundary. Yet even then, unusual data points or missing scenarios can lead to errors, such as incorrectly classifying someone who is 5 feet tall and 220 pounds as "not overweight." Addressing this may require gathering additional samples that cover unusual real-world cases, either by collecting them from real individuals or by generating carefully selected synthetic data.
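A rough sketch of that remedy, again with assumed rather than actual data: generate a synthetic stand-in for the 23 labeled samples, append a couple of deliberately unusual cases, and retrain. The standard body-mass-index rule — weight in kilograms divided by height in meters squared, with 25 as the overweight threshold — serves as ground truth for the labels.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# A synthetic stand-in for the article's 23 labeled samples: typical
# heights and weights, labeled with the standard BMI >= 25 rule.
heights = rng.uniform(1.55, 1.95, size=23)
weights = rng.uniform(55.0, 100.0, size=23)
X = np.column_stack([heights, weights])
y = (weights / heights**2 >= 25).astype(int)

# Deliberately unusual cases the random draw may have missed:
# short-and-heavy and tall-and-light combinations.
X_edge = np.array([[1.52, 100.0], [1.95, 60.0]])
y_edge = (X_edge[:, 1] / X_edge[:, 0]**2 >= 25).astype(int)

net = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=5000,
                    random_state=0)
net.fit(np.vstack([X, X_edge]), np.concatenate([y, y_edge]))

# The article's problem case: about 5 feet (1.52 m) and 220 lb (100 kg).
print(net.predict([[1.52, 100.0]]))  # should now be [1], i.e. overweight
```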
From design to deployment: The importance of starting smart
The AEPD concludes that a single-neuron network performs better in this example not simply because it is "simpler," but because its design implicitly encodes a near-linear relationship between height and weight when determining overweight status. By incorporating this contextual knowledge from the outset — that is, weight tends to increase with height in a mostly linear way — the network can achieve its goal with minimal data and reduced complexity.
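One way to make that contextual knowledge explicit — offered here as an illustration of the design philosophy, not as the AEPD's own construction — is to hand the model an engineered feature that already encodes the height-weight relationship, such as the body-mass index. The problem then collapses to a single threshold that even three samples can place sensibly.

```python
# An illustration of embedding domain knowledge up front (not the
# AEPD's own construction): feed the model the ratio weight/height^2
# -- the body-mass index -- so the height-weight relationship is
# built into the input itself.
def engineered_feature(height_m, weight_kg):
    return weight_kg / height_m**2  # BMI

# With the structure fixed, the "model" is a single threshold, and
# three labeled samples (the same illustrative ones as above) are
# enough to place it between the two classes.
samples = [(1.60, 90.0, 1), (1.80, 70.0, 0), (1.70, 85.0, 1)]
bmis = [(engineered_feature(h, w), label) for h, w, label in samples]
lowest_overweight = min(b for b, label in bmis if label == 1)
highest_healthy = max(b for b, label in bmis if label == 0)
threshold = (highest_healthy + lowest_overweight) / 2

def predict(height_m, weight_kg):
    return int(engineered_feature(height_m, weight_kg) >= threshold)

print(round(threshold, 1))   # 25.5 -- near the conventional BMI cutoff
print(predict(1.52, 100.0))  # 1: overweight, handled correctly even
                             # though no training sample resembles it
```

Because the prior does the heavy lifting, the threshold learned from three points lands within half a unit of the conventional BMI cutoff of 25.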
A more "intelligent" or sophisticated network lacks that built-in assumption, meaning it needs significantly more samples to learn the same near-linear relationship. Therefore, it is necessary to collect a significantly larger number of samples so the training process can infer the near-linear relationship. Without enough suitable examples, the model may struggle to infer this trend correctly, increasing the risk of "hallucinations."
The first model minimizes the data needed, lowers resource consumption and shortens training and testing time. It also offers a clearer, more objective way to assess the quality of outputs. However, it requires a mature development methodology and domain expertise — for example, statistical and algebraic analysis — to embed the appropriate structural assumptions from the start.
The second model, in turn, can potentially capture any kind of relationship in the data without heavy upfront design decisions. However, it demands an unknown amount of training and testing data, carries greater uncertainty in development timelines, and does not inherently guarantee high-quality outputs without extensive data collection and validation.
Which one is better? Neither. The question is not whether to pick a simpler or more complex machine learning structure, nor whether the classification is linear. Rather, it's about selecting the approach that best suits the context and purpose of the data processing.
By leveraging prior knowledge identified during the design phase and embedding it into the AI architecture, companies can save significant development resources — particularly in terms of how many and what types of samples are needed — and achieve stronger assurances of quality in the final output.
Data protection: Real-world implications
That's where privacy laws step into the picture. The AEPD points out that its conclusions have the following important data protection implications:
Data minimization. A carefully selected set of just three samples — whether real or synthetic — can be sufficient for training a basic model. In other words, data quality in machine learning does not necessarily stem from the volume of data but from its relevance to the training objective.
Accountability and privacy by design. Ensuring compliance with GDPR principles requires mature development methodologies. Incorporating data protection considerations at each stage of system design is crucial for achieving lawful and transparent AI solutions.
Adequacy of the processing. When deploying an AI system as part of a broader processing operation, it is essential to verify that its output maintains acceptable quality, even for scenarios not fully represented in the training or testing sets.
Also, some AI architectures may require large and diverse datasets that are impractical or even illegitimate to obtain. If gathering enough varied data is not feasible, the AI system might be unsuitable from the start.
The AEPD stresses that effective management of real-world AI challenges — and ensuring compliance with data protection regulations — requires combining proper data science techniques with robust development methodologies.
The bottom line
In a world where AI applications are proliferating daily, it is easy to be impressed by the promise of sophisticated models solving every problem. However, as this demonstration illustrates, sometimes a single neuron is sufficient — provided you have thoroughly analyzed your data and clearly defined your AI's requirements.
So, the next time you hear about an organization developing a massive AI system to tackle what appears to be a simple problem, remember: simplicity can be a powerful asset when it is well-suited to the task and implemented with careful attention to data protection.
Often, less really is more — both for system performance and for safeguarding individual privacy.
Joanna Rozanska, CIPP/E, CIPP/US, is an associate at Hogan Lovells.