For this purpose, a single radiologist delineated the lesions within each scan and classified each one on a scale from 1 to 5, according to how confident they were that the lesion was malignant. The researchers found that a model consisting of three CNNs performed best, identifying 85% of the manually contoured lesions (923 of 1087) – the so-called true positive rate. At the same time, it falsely identified an average of four lesions per patient (the false positive rate). The time to evaluate a single scan was cut from 35 minutes with manual delineation to under two minutes with the model.
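These two detection metrics are straightforward to compute once the model's detections have been matched against the radiologist's contours. As a rough illustration only – the per-patient counts below are invented, and the matching step is assumed to have already happened – the aggregation might look like this:

```python
# Illustrative sketch: aggregate lesion-detection metrics across patients.
# Each tuple holds, for one patient: the number of manually contoured lesions,
# how many of those the model found, and how many spurious detections it made.
# These numbers are invented for the example.
results = [
    (12, 10, 3),
    (5, 5, 6),
    (20, 17, 4),
]

total_lesions = sum(r[0] for r in results)
total_detected = sum(r[1] for r in results)
total_false_pos = sum(r[2] for r in results)

true_positive_rate = total_detected / total_lesions     # fraction of real lesions found
false_pos_per_patient = total_false_pos / len(results)  # average spurious detections

print(f"TPR: {true_positive_rate:.0%}, FP/patient: {false_pos_per_patient:.1f}")
```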
Ten examples of ensemble convolutional neural network (CNN) output. Maximum intensity projections (MIPs) of PET images with overlaid MIPs of the 3CNN output show true-positive findings (green), false-positive findings (red) and false-negative findings (blue). A–H show patients of varying disease burden and varying performance. I–J show patients with an above-average number of false-positive findings, who had significant 18F-fluorodeoxyglucose uptake of brown fat in the neck, shoulders and ribs.
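The "3CNN" output shown in the figure is an ensemble: three independently trained networks each produce a lesion-probability map, and their outputs are combined. The exact fusion rule isn't described here, so the sketch below simply averages the three probability maps and thresholds the result – a common, but here assumed, way of building such an ensemble:

```python
import numpy as np

# Minimal ensemble sketch (assumed fusion rule: average, then threshold).
# Random arrays stand in for the three trained networks' voxelwise
# lesion probabilities so that the example runs on its own.
rng = np.random.default_rng(0)
prob_maps = [rng.random((64, 64, 64)) for _ in range(3)]  # three CNN outputs

ensemble_prob = np.mean(prob_maps, axis=0)  # average the three probability maps
lesion_mask = ensemble_prob > 0.5           # binarize into detected-lesion voxels

print(f"{lesion_mask.sum()} voxels flagged as lesion")
```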
It is extremely difficult to classify every lymph node in a scan as cancerous or not with 100% certainty. Because of this, if two radiologists delineate lesions for the same patient, they are unlikely to agree with each other completely. When a second radiologist evaluated 20 of the scans, their true positive rate relative to their colleague's delineations was 96%, while they marked on average 3.7 malignant nodes per patient that their colleague had not. In these same 20 patients, the deep-learning model had a true positive rate of 90% at 3.7 false positives per scan – putting its performance within the range of variation between two human observers.
Expected, and unexpected, challenges

Often, one of the biggest hurdles in creating this type of model is that training it requires a large number of carefully delineated scans. The study authors tested how well their model performed depending upon the number of patients used for training and, interestingly, found that a model trained on 40 patients performed just as well as one trained on 72 (a toy version of this training-size experiment is sketched below). According to Weisman, obtaining the detailed lesion delineations for training the models proved the more challenging task: "Physicians and radiologists don't need to carefully segment the tumours, and they don't need to label a lesion on a scale from 1 to 5, in their daily routine. So asking our physicians to sit down and make decisions like that was really awkward for them," she explains. The initial awkwardness was quickly overcome, though, says Weisman. "Because of this, Minnie (one of our physicians) and I got really close during the time she was segmenting for us – and I could just text her and say 'What was going on with this image/lesion?'. Having a relationship like that was super helpful."
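The training-size comparison amounts to a learning-curve experiment: train on progressively larger subsets of patients and check where performance plateaus. A minimal sketch, in which `train_model` and `evaluate` are hypothetical stand-ins for the study's actual pipeline:

```python
# Hypothetical stand-ins for the study's training and evaluation code.
def train_model(training_patients):
    ...  # train the three-CNN ensemble on this subset of patients

def evaluate(model, test_patients):
    ...  # return e.g. the true positive rate on held-out scans

def learning_curve(patients, test_patients, sizes=(10, 20, 40, 72)):
    """Train on progressively larger subsets and record performance."""
    return {n: evaluate(train_model(patients[:n]), test_patients) for n in sizes}
```

A flat stretch of this curve – for instance, the same score at 40 and 72 patients – would indicate that additional training cases of the same kind no longer help.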
Future research will focus on incorporating additional, and more diverse, data. "Acquiring more data is always the next step for improving a model and making sure it won't fail once it's being used," says Weisman. At the same time, the group is working on finding the best way for clinicians to use and interact with the model in their daily work.