AI beyond the hype: Focus on the data

As the hype around artificial intelligence continues to grow, the public narrative becomes muddier. Conversations focus on the dichotomy of AI as an existential threat or a modern-day savior. We read headlines about lifesaving applications of AI and then immediately hear stories about AI failures and fears. The confusion is only exacerbated by an alphabet soup of terms: AI, LLMs, GPTs, GANs, DL, RL, CNN, SVM, etc. Lack of clarity can lead to missed opportunities in many fields. It can also cause us to overlook critical flaws and limitations that may impact the performance of AI systems.

Given AI's potential, as well as its risks, it is crucial to critically analyze the performance and functionality of AI models in terms of their ability to achieve intended tasks reliably and effectively. Through technical excellence and intentionality, we can build robust AI models that perform as intended, reduce errors and minimize inadvertent consequences. This is especially true for functions that require precision and accuracy, such as applications within the health care, legal and financial fields. Mistakes made by conversation prompt generators are more tolerable than mistakes made by medical translation services used in treatment. This difference in tolerance for mistakes highlights the importance of evaluating an AI model's performance differently for specific-purpose and general-purpose applications.

Unfortunately, the mystery and mystique surrounding AI make discussions akin to throwing rocks in a clear mountain lake and then complaining about the cloudiness from the disturbed sediment. To make sense of the discussion, let's take a step back, put our rocks down and avoid further clouding the water. As a step toward clarity, let us review how AI interacts with data. Models are dependent on the data used to train and run them. Data is an attempt to reflect real-world conditions of varying sensitivity from low to high. Data is the food for AI. AI models depend on ingesting large amounts of training data to identify patterns and correlations and make predictions. Data determines the noise, bias, variance and other factors that impact the accuracy and fairness of a model.

Data impacts AI systems in the same way our diet affects our health. An unbalanced diet lacking in nutrients, e.g., junk food, can inflict physical and cognitive harm, causing health issues. Similarly, bad data can sustain an AI model and create the perception of effectiveness in the short term, but eventually, it may poison the model, causing it to collapse.

AI models also have expiration dates. Eating food past the expiration date increases the risk of food-borne diseases. Likewise, AI models that are past their expiration dates because of stale data produce less accurate predictions and insights. An AI system starved of good data and lacking data hygiene will actively increase the risk of harm and unintended consequences. In short, there is a pretty substantial garbage in, garbage out challenge. AI hygiene requires consuming fresh data and updating the existing training data to keep models current and relevant, i.e., preventing model drift.
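To make the expiration-date idea concrete, here is a minimal sketch of one way a team might check whether recent data has drifted away from the training data. The function names, the single numeric feature and the threshold of two standard deviations are all assumptions chosen for illustration, not a standard method; real drift monitoring typically looks at many features and uses more careful statistics.

```python
from statistics import mean, stdev


def drift_score(training_values, recent_values):
    """Distance between the recent mean and the training mean,
    measured in training standard deviations (hypothetical metric)."""
    spread = stdev(training_values)
    if spread == 0:
        raise ValueError("training data has no variation to compare against")
    return abs(mean(recent_values) - mean(training_values)) / spread


def is_stale(training_values, recent_values, threshold=2.0):
    """Flag the model's data as stale when the shift exceeds the threshold.
    The default threshold is an illustrative assumption."""
    return drift_score(training_values, recent_values) > threshold


# Illustrative values only: one feature as seen at training time and later.
training = [10, 12, 11, 13, 12, 11]
recent_similar = [11, 12, 12, 11]
recent_shifted = [20, 21, 19, 22]

print(is_stale(training, recent_similar))
print(is_stale(training, recent_shifted))
```

A check like this does not fix stale data on its own; it only signals when it may be time to refresh the training set.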

In addition, while food is mostly anonymous, data can potentially expose unintended private or sensitive information. Part of the challenge with data management is the need for high accuracy in certain use cases within domains like health care, law enforcement, financial services and education that rely on sensitive data. AI systems in the health care domain require health care and social determinants of health data. However, that data tends to carry information about individuals' personal health, conditions, diagnostics, treatments and, in some cases, genetic information. Think of it as eating a salad that has real social security numbers written on the lettuce leaves. That data needs to be heavily protected and guarded to ensure it is not misused or abused, especially if embedded in AI.

Many brilliant minds are researching answers to the data challenge of hygiene, use and privacy. One proposal is creating model cards. Inspired by nutritional labels, these model cards outline essential metadata components for datasets such as lineage, source, legal rights, privacy considerations, generation details, data type, and intended use and usage restrictions. Model cards help communicate the scope of a dataset, just like nutrition labels communicate ingredients and dietary profiles.
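As a rough illustration, the metadata components listed above could be captured in a simple record, with a check for anything left blank. This is a sketch under assumptions: the class name, the field names and the plain-text field types are hypothetical and do not follow any published card specification.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetCard:
    """Hypothetical label for a dataset, mirroring the components named above."""
    lineage: str = ""
    source: str = ""
    legal_rights: str = ""
    privacy_considerations: str = ""
    generation_details: str = ""
    data_type: str = ""
    intended_use: str = ""
    usage_restrictions: str = ""


def missing_fields(card):
    """Return the names of components that have not been filled in."""
    return [f.name for f in fields(card) if not getattr(card, f.name).strip()]


# Illustrative, made-up entries: a card with only two components completed.
card = DatasetCard(source="clinic intake forms", data_type="tabular")
print(missing_fields(card))
```

Just as a nutrition label with blank rows would raise questions, a card with missing components tells a reader what is not yet known about the dataset.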

It is also essential to recognize the iterative nature of AI modeling through its stages of data processing. The first models, champion models, will often have some mix of errors in data, modeling, computing and phrasing of the question. Data scientists and AI developers expect to iterate through solutions and models to assess performance. Comparing champion models against successive models, i.e., challenger models, can assist in developing and assessing relevant metrics as an AI model is deployed and used. These iterative efforts can help organizations promote AI systems that align with societal values.
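The champion and challenger comparison can be sketched in a few lines. This is a minimal example under assumptions: accuracy on a shared held-out set is used as the only metric, and the minimum gain of 0.02 required to replace the champion is an arbitrary illustrative value. In practice, teams would likely weigh several metrics, including fairness and error costs.

```python
def accuracy(predictions, labels):
    """Share of predictions that match the true labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def pick_model(champion_preds, challenger_preds, labels, min_gain=0.02):
    """Keep the champion unless the challenger beats it by at least min_gain.
    Both models are scored on the same held-out labels."""
    gain = accuracy(challenger_preds, labels) - accuracy(champion_preds, labels)
    return "challenger" if gain >= min_gain else "champion"


# Illustrative, made-up predictions on ten held-out examples.
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
champion_preds = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]    # 7 of 10 correct
challenger_preds = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]  # 9 of 10 correct

print(pick_model(champion_preds, challenger_preds, labels))
```

Requiring a minimum gain keeps a challenger from replacing the champion on a difference that may be nothing more than noise.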

As AI hype grows, let's avoid throwing rocks in the already murky waters of the public narrative. Recognizing data's pivotal role, like diet's impact on health, and the iterative nature of AI modeling can address challenges like model expiration and garbage in, garbage out. Data governance standards enable clear communication between developers, users and regulators. Evaluating systems in context and considering performance tradeoffs are key for robust, responsible AI, especially in precision-critical domains. Amid the complexities, let's prioritize technical excellence, acknowledge limitations and steer clear of overhyping AI. With clarity, wisdom and responsibility, we can more objectively consider AI's proper role to better guide us in cultivating AI as a valuable tool.

This content is eligible for Continuing Professional Education credits. Please self-submit according to CPE policy guidelines.
