Data-driven Artificial Intelligence (AI) models are increasingly being deployed in real-world settings and becoming part of our daily routines. Every time we use a web-based translation system, shop on an e-commerce portal, or are admitted to hospital, an AI system is working in the background to detect patterns of behavior and make predictions. These applications demonstrate the increasing effectiveness of AI models and point to a future in which they will become even more ubiquitous.

However, a series of high-profile and catastrophic failures suggests that some AI models are by no means the finished article. Flaws in decision-making technologies, for example, have caused autonomous vehicles to crash. More recently, British Prime Minister Boris Johnson blamed a "mutant algorithm" for the UK's exam fiasco. The vast discrepancies between teachers' predicted grades and the computer-generated final grades show how much damage imperfect data-driven models (although not, strictly speaking, AI models in this instance) can do.

Accordingly, one important question for policymakers is how best to make sense of the recent surge in AI models. From there, how should government agencies and similar organizations decide when these models are ready for deployment, and what quality checks should be put in place? There are at least four criteria that must be evaluated before AI models can operate in the wild: accuracy, stability, robustness, and interpretability.

Taming the Beast

It goes without saying that AI models must be accurate. However, accuracy comes in many forms, and the measures appropriate for cancer detection, for example, differ substantially from those required for detecting fraud. In the former case, it is important to control false negatives: an AI model should not miss any cases that are genuinely cancerous. False positives will eventually be caught in further downstream testing, even though the initial shock to a person wrongly diagnosed can be substantial. In fraud detection, by contrast, AI models should err on the side of caution and minimize the number of false positives. Incorrectly classifying an innocent person as a fraudster can cause undue harassment and reduce overall trust in the fraud detection system.
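As a minimal sketch of the distinction (using made-up labels and predictions rather than any real screening data), the two error types can be counted directly:

```python
# Toy example: counting the two error types for a binary classifier.
# The labels and predictions below are invented purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = cancerous / fraudulent, 0 = benign / legitimate
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical model output

false_negatives = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # missed true cases
false_positives = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false alarms

print(f"False negatives: {false_negatives}")  # the count to control in cancer screening
print(f"False positives: {false_positives}")  # the count to control in fraud detection
```

Which of the two counts must be driven toward zero depends entirely on the application, which is why a single headline accuracy number is rarely enough.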

The need to verify accuracy is complicated by the fact that state-of-the-art AI models tend to be black box machines with a large number of inner knobs (sometimes numbering in the billions) that have been tuned by data. Understanding how these knobs interact to produce an output can be difficult. That said, even black box models can be tested for stability: small changes in the input should not result in large changes in the output. For example, consider a credit-card scoring model that uses age as an input. The model should not give dramatically different results if the age is perturbed by one year while everything else remains the same.
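A rough sketch of such a stability check, assuming a hypothetical credit-scoring function standing in for the real black box, might look like this:

```python
# Sketch of a stability check; credit_score is a made-up stand-in for a black box model.
def credit_score(applicant):
    # Toy formula used only for illustration.
    return 0.3 * applicant["income"] / 1000 - 0.1 * applicant["age"] + 50

applicant = {"age": 40, "income": 60_000}
perturbed = {**applicant, "age": applicant["age"] + 1}  # same applicant, one year older

delta = abs(credit_score(applicant) - credit_score(perturbed))
print(f"Score change for a one-year age perturbation: {delta:.2f}")
assert delta < 1.0, "Unstable: a tiny change in input produced a large change in output"
```

The same test can be repeated over many inputs and many small perturbations; a model that fails it should not be trusted with consequential decisions.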

Closely related to stability is the notion of robustness. AI models are considered robust if they are not disproportionately affected by the presence of outliers in the data. The canonical example is that the median is more robust than the mean, which is why government ministries often highlight the median rather than the mean income in official reports and statements. Put another way, the median income is not inflated by the presence of a few "extra billionaires" in a country.
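The point can be seen in a few lines of code with some invented income figures:

```python
import statistics

# Invented incomes for illustration; the last entry is a single billionaire outlier.
incomes = [30_000, 35_000, 40_000, 45_000, 50_000, 1_000_000_000]

print(f"Mean income:   {statistics.mean(incomes):,.0f}")    # dragged far upward by the outlier
print(f"Median income: {statistics.median(incomes):,.0f}")  # barely affected
```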

Designing a robust model nevertheless remains an art form, often requiring years of practice to perfect. One of the major challenges is that outlier detection is itself very hard: if we cannot detect an outlier, how can we design a model that is robust against its presence? Game theory is emerging as an important tool for checking the robustness of AI models. An adversary deliberately generates bad samples to probe for model weaknesses, and the model designer changes the model architecture to respond to the attack. The interaction continues until the adversary and the model designer reach some form of equilibrium.
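The adversary's side of this game can be sketched as a search for a small input perturbation that flips the model's decision; the classifier, weights, and perturbation budget below are all invented for illustration:

```python
# Toy linear classifier; the weights, threshold, and budget are assumptions for illustration.
weights = {"income": 0.5, "debt": -0.8}
threshold = 10.0

def approve(applicant):
    score = sum(weights[k] * applicant[k] for k in weights)
    return score > threshold

def adversarial_probe(applicant, budget=1.0, steps=20):
    """Search for a perturbation within `budget` per feature that flips the decision."""
    original = approve(applicant)
    for name, w in weights.items():
        direction = -1 if (w > 0) == original else 1  # push against the current decision
        for step in range(1, steps + 1):
            perturbed = dict(applicant)
            perturbed[name] = applicant[name] + direction * budget * step / steps
            if approve(perturbed) != original:
                return perturbed  # weakness found: a tiny change flips the outcome
    return None

applicant = {"income": 30.0, "debt": 6.0}
attack = adversarial_probe(applicant)
print("Decision flipped by a small perturbation" if attack else "No weakness found within budget")
```

In the full game, the designer would respond to each discovered weakness by retraining or constraining the model, and the loop would repeat until neither side can improve.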

For black box models, interpretability (or transparency) is the hardest nut to crack. How can we make sense of a decision made by an AI model consisting of millions (if not billions) of parameters or adjustable knobs? The famous mathematician John von Neumann is quoted as saying, "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." Attempts to extract interpretable rules from complex models have seen only partial success, as the extracted rules can come at a cost in accuracy. Still, for a given task, users must insist that AI scientists provide guidelines for interpreting the output of AI models. If an AI model produces a surprising outcome that contradicts domain expertise, it needs to be thoroughly validated.
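One modest way to produce such guidelines, sketched below with a made-up stand-in for the black box, is to probe how much each input feature moves the output; this yields a rough, local explanation rather than full transparency:

```python
# Sensitivity-style probe; black_box and the applicant's features are hypothetical.
def black_box(features):
    # Stand-in for an opaque model; in practice this would be a trained network.
    return 0.6 * features["income"] - 0.3 * features["debt"] + 0.05 * features["age"]

def sensitivity(features, eps=1.0):
    """Estimate how much a unit increase in each feature shifts the output."""
    base = black_box(features)
    effects = {}
    for name in features:
        bumped = dict(features)
        bumped[name] += eps
        effects[name] = black_box(bumped) - base
    return effects

applicant = {"income": 50.0, "debt": 20.0, "age": 40.0}
for name, effect in sorted(sensitivity(applicant).items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>6}: {effect:+.2f} change in output per unit increase")
```

Ranking features by their local effect gives a user something concrete to check against domain expertise when a model's decision looks surprising.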

Staged Deployment

Calls for an effective vaccine against COVID-19 grow louder with every passing day of the pandemic. Vaccines typically go through rigorous testing and a formal staging process before they reach the general population. Testing begins in the laboratory with preclinical animal studies; Phase 1 trials then enroll a small group of volunteers who are monitored for adverse effects, Phase 2 expands to a larger group, and Phase 3 runs randomized controlled trials on an even larger population.

A similar protocol needs to be put in place for the deployment of AI models in the wild. At each stage, the metrics of accuracy, stability, robustness, and interpretability must be evaluated. Staged deployment of AI models will prevent unnecessary surprises and increase overall trust in AI systems.

Dr. Sanjay Chawla

Dr. Sanjay Chawla is the Research Director of the Qatar Center for Artificial Intelligence, QCRI, at Hamad Bin Khalifa University.

Exclusive to The Times, Kuwait


