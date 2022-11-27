The world is always changing

Photo by DeepMind on Unsplash

“So. What do you do?” “Oh, I’m a data scientist at ” “Oh wow! You’re going to write machines to take over our jobs, huh?”

I hear this conversation a lot at parties. And I chuckle.

It’s true. Artificial intelligence and machine learning have made considerable strides in the last few years. Advances in GPU processing power, cloud computing, and deep learning packages made this possible.

That being said, we are YEARS away from artificial intelligence taking over critical thinking jobs. Humans are still needed.

Why am I so sure? I came across this article recently.

This was published on July 30, 2021, more than a year into the COVID pandemic. If AI is so amazing, why did all these tools fail?

Well, when did the COVID outbreak start? Sometime in March 2020. Between the outbreak and that article, scientists had a year and four months’ worth of data.

So why were all these tools still unfit for clinical use?

Surprisingly, it’s not about a new math algorithm or a complex statistical approach. All the problems are found in basic statistics and analysis.

AI tools failed because they were trained on unknown data.

This isn’t the fault of researchers. COVID was a complete unknown when it broke out in March 2020. Pandemic experts had to rely on knowledge from prior pandemics to determine what this new disease does. In the course of world history, there have been 19 known pandemics. 13 of them started before the turn of the 20th century.

Were there internet data lakes collecting data when medieval peasants were suffering from the bubonic plague? Or when the Roman Empire collapsed through a smallpox pandemic? Archeologists struggle to find even 5 handwritten medical records from those eras.

Did we even have data to begin with? Was the data good and comparable to others? Could we normalize all medical data from different centuries into one consistent standard?

Based on how the modern world operates, I highly doubt it. If modern hospitals still can’t agree on a consistent data format for electronic health records, then why would we expect Revolutionary War doctors to use a standard data format for their parchment and quill records?

That being said, current technology did help a lot in studying COVID. A year and four months into the pandemic, scientists have a better idea of how to detect symptoms and which other diseases are worth comparing it to. Medical experts were amazed at how quickly COVID evidence accumulated and how fast models were created.

Despite hundreds of AI tools developed, only two were seen as promising. Why the major discrepancy?

People think that data analysis is not part of a data scientist’s job. They think all a data scientist should do is focus on intricate model building.

I’ve written an earlier post on why being a great data analyst is crucial to becoming a great data scientist. Post is below.

COVID data is relatively new. It comes from many different sources. 95% of the work is cleaning up the data: removing duplicates, normalizing variables, etc. The other 5% is writing four lines of sklearn code to train and build a model.

So why was it that these data scientists trained models on data sets with duplicates? If a dataset is split into a training set and a test set, the test set MUST contain only data the model hasn’t trained on. Otherwise, the model’s accuracy reflects not how well it predicts new data, but how well it memorized answers. Duplicates increase the chance of the test set having data the model has already seen in the training set.
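Here’s a minimal sketch of that leak, using only the Python standard library. The records, field layout, and function names are made up for illustration. They don’t come from any of the COVID datasets, and real cleaning would also need to catch near-duplicates, not just exact copies.

```python
import random

def dedupe(rows):
    """Drop exact duplicate rows while keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def leaked(train, test):
    """Return the test rows that also appear in the training set."""
    train_set = set(train)
    return [row for row in test if row in train_set]

# Hypothetical records: (age, temperature_c, label).
# The first record appears three times.
records = [
    (34, 38.5, "covid"), (34, 38.5, "covid"), (34, 38.5, "covid"),
    (61, 36.9, "healthy"), (45, 39.1, "covid"), (29, 36.6, "healthy"),
    (52, 37.0, "healthy"), (70, 38.8, "covid"),
]

# Splitting the raw data: two test rows are copies of a training row.
naive_train, naive_test = records[2:], records[:2]
print(len(leaked(naive_train, naive_test)))  # 2

# Deduplicating first: no test row can also be in the training set.
train, test = split(dedupe(records))
print(len(leaked(train, test)))  # 0
```

The point is the order of operations: remove duplicates first, split second, and then check the overlap anyway.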

If a child got 100% on an exam, you’d think he/she worked hard to study. Would you still feel the same way if the child stole answers for that exam and memorized them beforehand? No. So why do machine learning models get a pass?

Why wasn’t that dataset cleaned? You don’t need a master’s or a Ph.D. to clean data. You can do that with a couple of SQL or Python For Dummies books from the library. And all that costs you, worst case, is a couple of dollars in late fees.

That being said, dataset cleaning improved over time. You can find a clean COVID dataset on Kaggle.

So why did AI models STILL fail?

What’s a statistician’s favorite sentence?

Correlation does NOT equal causation.

There is so much truth behind that. We all love to find patterns to predict large outcomes. It can be as common as coughing multiple times a day indicating that you are sick. Or it can be as superstitious as your football team, the Philadelphia Eagles, winning every time you make macaroni and cheese for lunch.

Correlations are fun in sports, but dangerous in medical research. If your correlation is wrong in sports, no one dies. If your correlation is wrong in medicine, a person’s health is affected. Be ready to fork over millions of dollars in a malpractice lawsuit.

A model can pick up many correlations through feature selection that humans never noticed. In natural language processing, it can be a common word present in every page. In image processing, it can be a common object present in every image.

That does NOT mean that the word/object is a reliable feature.

In my prior post Want To Be A Valuable Data Scientist? Then Don’t Focus On Creating Intricate Models (referenced above), I brought up an example of a text classifier I helped build that picked up a wrong feature. To predict whether a page is talking about back injuries, the classifier looked at the presence of a patient’s full name. If the patient’s full name was on that page, that page was classified as a back injury. Else, it wasn’t.

This feature makes no sense. But the model doesn’t know what the term back injury means. It just trains on the data we feed it. Turns out all the medical forms for back injury we trained the model on were from the 1980s. Each page had the full name of that patient. Because one of these records was 40 pages long (and the other records were 2–5 pages long), the model assumed that the patient’s full name for that record was a feature for predicting back injury.
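One cheap sanity check for this kind of mistake is to count a candidate feature per record, not just per page. The sketch below is an assumption-heavy toy: the record IDs, page text, and function name are invented, and it is not the classifier I described. It only shows how a token can look common across pages while really coming from a single long record.

```python
def page_and_record_counts(pages, token):
    """Count how many pages and how many distinct records contain the token."""
    page_hits = 0
    record_hits = set()
    for record_id, text in pages:
        if token in text.lower().split():
            page_hits += 1
            record_hits.add(record_id)
    return page_hits, len(record_hits)

# Hypothetical training pages as (record_id, page_text) pairs.
# Record "A" is 40 pages long; the other records are much shorter.
pages = (
    [("A", "jane doe lumbar back injury")] * 40
    + [("B", "john roe back injury")] * 3
    + [("C", "ann poe back injury")] * 2
)

print(page_and_record_counts(pages, "doe"))   # (40, 1)
print(page_and_record_counts(pages, "back"))  # (45, 3)
```

A token that shows up on 40 pages but in only one record is a property of that record, not of back injuries.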

Sadly, this is a common mistake that radiologists and doctors noticed in these AI tools.

Convolutional neural networks (CNNs) predicted higher COVID risk for patients whose chest X-rays were taken while they were lying down. The correlation is that an upright chest X-ray indicates that the patient is healthy enough to stand. However, many hospitals also took X-rays of healthy patients who were lying down. The CNNs flagged these non-COVID patients as being at high risk of COVID.

CNNs predicted COVID risk based on text fonts from medical records from certain hospitals. These hospitals usually had more COVID patients, but the CNNs treated the unique text fonts for those hospitals as an appropriate feature. They misdiagnosed healthy patients just because they were from those hospitals.

Researchers trained their AI tools on a particular COVID chest X-ray dataset. Radiologists found these CNNs incorrect, and examined the dataset’s ground truth more closely. Turns out all the chest X-rays that were labeled “non-COVID” were from children. The CNNs could not predict COVID in an adult accurately, as adults have different biological structures than children. All the CNNs could do was identify kids from chest X-rays.
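A quick cross-tabulation of labels against metadata can catch this before any model is trained. This is only a sketch under assumptions: the four scans, the field names (`age_group`, `position`, `label`), and the helper functions are hypothetical, and a real dataset would rarely separate this cleanly.

```python
from collections import Counter

def crosstab(rows, attribute, label="label"):
    """Count rows for every (attribute value, label) pair."""
    return Counter((row[attribute], row[label]) for row in rows)

def labels_per_group(rows, attribute, label="label"):
    """Map each attribute value to the set of labels seen with it."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], set()).add(row[label])
    return groups

def is_confounded(rows, attribute, label="label"):
    """True when the attribute alone tells you the label for every row."""
    groups = labels_per_group(rows, attribute, label)
    all_labels = set().union(*groups.values())
    return len(all_labels) > 1 and all(len(seen) == 1 for seen in groups.values())

# Hypothetical scan metadata, not taken from any real dataset.
scans = [
    {"age_group": "child", "position": "upright", "label": "non-covid"},
    {"age_group": "child", "position": "lying", "label": "non-covid"},
    {"age_group": "adult", "position": "upright", "label": "covid"},
    {"age_group": "adult", "position": "lying", "label": "covid"},
]

print(crosstab(scans, "age_group"))
print(is_confounded(scans, "age_group"))  # True
print(is_confounded(scans, "position"))   # False
```

If a metadata column predicts the label perfectly, the model can learn that column instead of the disease.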

The first thing a data scientist learns isn’t model building. Or algorithms. It’s exploratory data analysis. A data scientist should understand exactly what data he/she is working with. Not pull something from the internet and blindly assume it’s accurate and cleaned. The world doesn’t hand out clean and processed data sets like Kaggle does.

It’s unfair to expect a data scientist to know radiology. That’s why we have these medical experts validate these datasets. If a data scientist ignores the advice that these experts spent years studying for, then their model will be inaccurate.

So, all of this ties back to AI tools being trained on terrible datasets. This could have been fixed by the scientists cleaning the datasets, exploring the data, removing features that make no sense, and validating with experts in the field. Why was this process rushed?

The world moves fast.

There’s pressure to work round the clock and get something out. The early bird gets the worm.

The problem is that healthcare is slow. For a very good reason. Despite the excessive charges the doctor itemizes on your annual visit, healthcare’s goal is not to optimize profits. Healthcare’s goal is to save lives. To achieve that goal, almost all distinguished experts in the community need to confirm the correctness of a certain finding.

There have been articles published in the 1990s and 2000s about stem cells, gene sequencing, and other methods for curing cancer. It’s 2022. Why hasn’t cancer been cured yet?

The answer is that validating these techniques takes time. Each medical researcher needs to conduct the experiments on his/her own and confirm they saw the same findings in ALL demographics. Age, race, sex, and disabilities can affect these techniques.

Tech entrepreneurs and Silicon Valley venture capitalists argue that AI can help these researchers validate faster. Yes, AI can predict things and pick up objects faster than humans can. These data scientists even wrote algorithms to detect multiple skeletal fractures in a single radiograph, a feat that hasn’t been done before (see article below).

However, medical experts tested this tool’s results in a sample of 600 patients (mean age ± standard deviation, 57 years ± 22; 358 women). For this sample, the AI aid:

improved the sensitivity (true positive rate) of physicians by 8.7% (95% CI: 3.1, 14.2; P = .003)

improved the specificity (true negative rate) by 4.1% (95% CI: 0.5, 7.7; P < .001)

reduced the average number of false-positive fractures per patient by 41.9% (95% CI: 12.8, 61.3; P = .02) in patients without fractures

reduced the mean reading time by 15.0% (95% CI: –30.4, 3.8; P = .12).

CI refers to the confidence interval, which is a range of plausible values for the true population effect, estimated from the sample.

While the aid improved each of the 4 metrics for skeletal fracture detection, the confidence intervals for each metric are huge.

This is problematic, as confidence intervals are used in studies that recruit a small sample of the overall population. We can infer that the true population effect plausibly lies between the lower limit and the upper limit of the confidence interval. For a difference like the ones above, an interval that includes 0 means the data are compatible with no difference between the arms of the study. (For a ratio, such as an odds ratio, the no-effect value is 1.)

Because confidence intervals indicate the level of uncertainty around the measure of effect, we want the range to be as small as possible. Here the ranges are wide: the sensitivity gain could be as small as 3.1% or as large as 14.2%, and the reduction in false positives could be anywhere from 12.8% to 61.3%. The interval for reading time (–30.4, 3.8) includes 0, which matches its P = .12, so the study cannot show that the aid makes physicians faster.

Even if the results look promising, they come from one sample of 600 patients and the estimates are imprecise. It’s unclear if the tool can correctly detect all skeletal fractures.
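To make the interval reading concrete, here is a small sketch that takes the four reported intervals and checks two things: how wide each one is, and whether it contains the no-effect value. Two assumptions are mine, not the study’s: I treat each metric as a difference whose no-effect value is 0, and I write the reading-time change as –15.0 so its sign matches the reported interval. The function and variable names are hypothetical.

```python
def describe_interval(name, estimate, low, high, null_value=0.0):
    """Summarize a confidence interval: its width and whether it contains the null."""
    return {
        "metric": name,
        "estimate": estimate,
        "width": round(high - low, 1),
        "includes_null": low <= null_value <= high,
    }

# (metric, point estimate, lower limit, upper limit), all in percent,
# copied from the figures quoted above.
reported = [
    ("sensitivity gain", 8.7, 3.1, 14.2),
    ("specificity gain", 4.1, 0.5, 7.7),
    ("false-positive reduction", 41.9, 12.8, 61.3),
    ("reading-time change", -15.0, -30.4, 3.8),  # sign is an assumption
]

summary = [describe_interval(*row) for row in reported]
for item in summary:
    print(item["metric"], item["width"], item["includes_null"])
```

Under those assumptions, only the reading-time interval contains 0, while the false-positive interval is the widest at 48.5 percentage points.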

Furthermore, medical experts lament the lack of clinical data in these AI tools’ predictions. Without knowledge of the findings from the patients’ physical examinations or their medical histories, these tools pick up external biases that influence their predictions, whether it’s a certain font a hospital uses, the position a patient was in for an X-ray, or a patient’s name used to identify a back injury.

There are multiple ways to get the correct answer. The way many AI tools learned is wrong. This could be resolved if data scientists spent more time understanding and validating their datasets instead of rushing to push out an inaccurate model first.

Even if they clean their data and improve their model features, will data scientists create the perfect model? No. Before 2020, no one knew what COVID was. It took a year and four months to get a general idea of how this disease was different from others, and how many variants it has. Medical researchers are still discovering things they didn’t know about both existing and new diseases.

Medicine isn’t static. It’s continuously evolving. New researchers contribute something new to the medical community. The models will have to be updated with new data to keep up. The models need to continuously learn to predict accurately.

Humans will still have to find new data for the models to work. If models need clean data to be effective, humans will need to find ways to create that data. It’s why some machine learning models have human review for some data points that need revision. Whether it’s an unreadable medical form or an empty field, human input is required to make the model perform accurately.
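A human review step like that can be sketched in a few lines. Everything here is hypothetical: the form IDs, the field names, and the 0.8 confidence threshold are placeholders I picked for illustration, and a real pipeline would choose its threshold from validation data.

```python
def needs_human_review(item, threshold=0.8):
    """Flag items with an empty field or a low-confidence prediction."""
    missing_field = any(value in (None, "") for value in item["fields"].values())
    return missing_field or item["confidence"] < threshold

def route(items, threshold=0.8):
    """Split item IDs into (handled automatically, sent to a human)."""
    auto, review = [], []
    for item in items:
        target = review if needs_human_review(item, threshold) else auto
        target.append(item["id"])
    return auto, review

# Hypothetical model outputs for three scanned medical forms.
forms = [
    {"id": "form-1", "fields": {"name": "A. Patient", "diagnosis": "fracture"}, "confidence": 0.97},
    {"id": "form-2", "fields": {"name": "B. Patient", "diagnosis": ""}, "confidence": 0.91},
    {"id": "form-3", "fields": {"name": "C. Patient", "diagnosis": "sprain"}, "confidence": 0.55},
]

auto, review = route(forms)
print(auto)    # ['form-1']
print(review)  # ['form-2', 'form-3']
```

The model handles the easy cases, and a person still decides the ones with an empty field or a shaky prediction.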

As long as the world is changing, people will still have jobs.