TwoStream and TwoPlusOne: Two Convolutional Neural Networks for Cardiac Imaging Diagnosis

Ryan Liu, Yu Yang

Institute for Infocomm Research at the Agency for Science, Technology, and Research (A*STAR)

Abstract. When diagnosing heart disease, clinicians rely primarily on two types of images: late gadolinium enhancement (LGE) images and cine heart images. Together, these make up the majority of images used by human doctors to diagnose a patient with heart disease. The key difference between them lies in the contrast applied: LGE images are acquired with a contrast agent, while cine images are not. The contrast produces clearer images that doctors can use to diagnose diseases more easily; however, certain patients cannot take contrast due to allergies or critical conditions. In these cases, cine images must be used. To increase accuracy with cine images, many of them, around 30, are taken within a short span of time to capture the beating cycle of the heart. Despite the temporal information gained, the clarity of LGE images still makes for easier diagnoses.

Like human doctors, artificial intelligence has a more difficult time diagnosing disease from cine images than from LGE images. This paper introduces two novel artificial intelligence structures. The first aims to boost the accuracy of artificial intelligence methods for patients who are unable to take contrast. The second is designed for patients for whom LGE imaging is an option.

Both structures are based on the commonly used U-Net, an artificial intelligence method that is especially common in medical diagnosis because of its ability to produce high-resolution segmentations, or annotated images, with diseased spots highlighted.

The first structure modifies the U-Net by adding a one-dimensional convolution layer as the bridge between the final downsampling step and the beginning of the upsampling route. The second structure modifies the U-Net by adding a second encoder, so that the model can take in both a cine sequence composed of 30 frames and the LGE image and use them together to produce a more accurate segmentation. The encoder for the LGE image is a typical U-Net encoder; the cine encoder uses the first structure’s method of running multiple frames through the encoder and then compressing the result with a one-dimensional layer.

Ultimately, the TwoPlusOne model did not show an increase in performance. However, we believe that this reflects the limited memory available rather than the method itself. With only 5 of the 30 frames, we were able to achieve performance similar to that of the default U-Net that used all 30 frames.

On the other hand, the TwoStream model proved effective, achieving a performance increase of about 4% over the LGE U-Net. This supports our hypothesis that, by combining the advantages of cine sequences and LGE imaging, we can significantly improve model performance.

1. Introduction

The field of medical artificial intelligence holds vast potential for revolutionizing the healthcare space. If developed far enough, artificial intelligence could drastically improve many of the factors that go into ensuring excellent patient care. These include improving diagnostic accuracy, reducing the time between the identification of a patient’s condition and treatment, and even interacting with patients to maximize comfort. However, artificial intelligence in its current state is not accurate enough to effectively address healthcare’s problems.

Among all the fields in which artificial intelligence is applied, medicine carries one of the highest responsibilities for accurate predictions. A wrong prediction could easily lead to adverse effects for the patient, and a faulty model deployed at scale could produce thousands of incorrect predictions a day. Thus, it is of the utmost importance to improve medical artificial intelligence, both to realize its benefits and to limit the harm it can cause if used improperly.

Every year, 18 million people die from cardiovascular disease, making it one of the leading causes of death, especially in America. As such, cardiovascular disease puts heavy pressure on healthcare systems to diagnose heart issues more accurately and efficiently. Diagnosing heart disease early is especially important, as progression of the disease often leads to severe complications and, not uncommonly, death. However, current diagnostic methods rely on subjective human input, which can on occasion lead to errors. Recent developments in artificial intelligence (AI) have the potential to address these issues by improving the accuracy of diagnoses, leading to better patient outcomes.

When diagnosing heart disease, clinicians currently rely on two primary types of images: late gadolinium enhancement (LGE) images and cine heart images. LGE images are acquired with contrast applied, while cine images are not. The contrast produces clearer images that doctors can use to diagnose disease more easily; however, certain patients cannot take contrast due to allergies, renal impairment, or other complications. In these cases, cine images must be used. At the start of any diagnosis, regardless of whether the patient can tolerate contrast, a cine image is taken. Afterwards, if the patient is able to receive contrast, an additional LGE image is also acquired.

Artificial intelligence is a promising solution for addressing the accuracy disparity between LGE and cine images. Through image segmentation and feature extraction, AI can pick up on patterns that might be too subtle for a human observer to detect. However, current AI models have limitations as well. The most prominent is the difficulty of reaching the level of precision that an expert cardiologist can achieve. Even minor mistakes can have serious consequences, such as dangerous unnecessary treatments, wasted time, and failure to detect the real underlying issue. As such, research into highly accurate artificial intelligence models is crucial to ensure that AI diagnoses are as safe and effective as possible.

One of the most prevalent artificial intelligence structures in medical diagnosis is the U-Net, which is a type of convolutional neural network. Its widespread use is primarily due to its architecture, which features an encoder-decoder structure suited to accurate segmentation of biomedical images. An encoder consists of multiple convolutional layers that extract features, or important aspects, from the input image. In a U-Net, each subsequent layer applied to the input image reduces the spatial dimensions of the original image through convolution and max pooling. Each layer attempts to capture features that represent different biological structures. The decoder path then restores the image’s spatial dimensions through a process called upsampling while preserving the encoder’s captured features. Additionally, the model uses skip connections that connect the encoder and decoder. Each skip connection links a convolution layer’s output to a decoder step, allowing for the high-resolution segmentations that are necessary for accurate diagnosis, as certain diseases can be no more than a few pixels wide. Variations on the U-Net structure have proven extremely effective across a wide range of biomedical imaging modalities, including brain MRI, lung CT, and radiographs.

Despite its advantages, the U-Net is not well equipped to handle dynamic data such as cine sequences. A defining feature of cine data is that it records 30 different frames that together make up a full heartbeat cycle. In its standard form, the U-Net treats its input as a static 2D image and therefore performs poorly on cine images.

To address these concerns, this paper presents two alterations to the U-Net architecture. The first structure modifies the U-Net by adding a one-dimensional (1D) convolution layer between the final downsampling step and the start of the upsampling route. This 1D layer aims to process the temporal information contained within the cine images by compressing along the temporal axis. This 2D+1D U-Net is designed to be used solely on cine images, for patients who cannot tolerate the contrast necessary for LGE imaging.

The second structure introduces a two-stream architecture that combines LGE’s higher image quality with cine’s temporal information, similar to how a human doctor would approach a diagnosis. Two encoders are used, one for the LGE modality and one for the cine modality. The outputs of both encoders are then fused at the end of the downsampling route to combine the features from both image types. Because of the dual-stream nature of this modification, we propose that it be called the Two Stream U-Net.

For patients who cannot receive contrast, cine is the only imaging modality available to doctors. However, even for experienced doctors, diagnosing with only cine images is a challenge. By improving the accuracy of AI models on cine sequences, the 2D+1D U-Net aims to reduce this gap. For patients who can receive contrast, the Two Stream U-Net allows AI models to leverage both cine and LGE imaging.

Ultimately, this paper aims to refine artificial intelligence’s cardiac diagnostic capabilities by proposing two new modifications to the original U-Net structure. By augmenting the model’s ability to process the temporal information of cine sequences, the 2D+1D model has the potential to improve patient care for those unable to receive contrast. By incorporating information from both cine sequences and LGE imaging, the Two Stream model builds upon AI’s existing medical capabilities. Through additional research into these models, AI has the potential to transform cardiac diagnostics.

2. Methods

2a. U-Net Overview

AI has made vast leaps in medical imaging, but its application to diagnosing heart disease remains complex. One of the most relevant AI architectures in biomedical applications is the U-Net. Although it was originally developed for cell segmentation, the U-Net has proven applicable to a wide range of biomedical tasks because of its ability to produce high-resolution outputs from relatively small datasets.

Its structure has a symmetrical “U” shape, with the two halves of the U representing the encoder and the decoder. The encoder is responsible for downsampling the input image while also extracting features. It is made up of multiple blocks, and each block consists of three types of layers: convolutional layers (Conv2D), batch normalization layers (BatchNorm2D), and rectified linear unit activation layers (ReLU).

A convolutional layer is a fundamental building block of a convolutional neural network (CNN), commonly used for image and signal processing. It applies a set of learnable filters (also called kernels) that slide across the input (e.g., an image) to extract spatial features such as edges, textures, and shapes. Each filter is a small matrix (e.g., 3x3 or 5x5) that is multiplied element by element with the overlapping region of the input, and the resulting values are summed to produce a single value in the output feature map. The filter moves across the input with a certain stride (how many steps it shifts at a time) and optionally uses padding to control the spatial size of the output. Different filters learn to detect different features in the data. Multiple filters create multiple output channels, resulting in a 3D output (height × width × number of filters). Because filters are shared across the entire input, convolutional layers dramatically reduce the number of parameters compared to fully connected layers, making them efficient for high-dimensional inputs like images. These layers also preserve spatial relationships, which is essential for tasks like image recognition or segmentation.
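
The sliding-filter operation described above can be illustrated with a minimal sketch in plain Python. The function name `conv2d_single`, the toy image, and the hand-written edge kernel are assumptions made for illustration; in a trained network the kernel values are learned, and a framework implementation would be used instead.

```python
def conv2d_single(image, kernel, stride=1, padding=0):
    """Slide one square kernel over a 2D image (list of rows); return the feature map."""
    k = len(kernel)
    width = len(image[0])
    # Zero-pad the image on all four sides.
    padded = [[0.0] * (width + 2 * padding) for _ in range(padding)]
    padded += [[0.0] * padding + list(row) + [0.0] * padding for row in image]
    padded += [[0.0] * (width + 2 * padding) for _ in range(padding)]
    out_h = (len(padded) - k) // stride + 1
    out_w = (len(padded[0]) - k) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the overlapping region and sum the products.
            total = 0.0
            for a in range(k):
                for b in range(k):
                    total += kernel[a][b] * padded[i * stride + a][j * stride + b]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
# Hand-written vertical-edge kernel (illustrative only).
edge_kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

feature_map = conv2d_single(image, edge_kernel)
# feature_map == [[3.0, 3.0], [3.0, 3.0]]
```

With no padding, the 4x4 input shrinks to a 2x2 feature map; passing `padding=1` keeps the output at 4x4, and `stride=2` halves it again, which is how stride and padding control the spatial size of the output.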

Batch normalization normalizes the output of the previous convolutional layer. It is a technique used to improve training speed and stability in deep neural networks. It normalizes the input of each layer so that the mean is close to 0 and the standard deviation is close to 1, based on a batch of data. This helps reduce internal covariate shift, the change in the distribution of layer inputs during training, which can slow down learning. For each mini-batch, batch normalization computes the mean and variance of the inputs. It then normalizes the inputs using these statistics and rescales them with learned parameters (gamma and beta) to maintain the network's capacity to model complex functions. This allows the model to decide whether to preserve the normalized distribution or allow deviation. Applied after convolutional or fully connected layers (and usually before activation functions), batch normalization can also serve as a form of regularization, sometimes reducing the need for dropout. It helps the network train faster by allowing the use of higher learning rates and reducing sensitivity to initialization. During inference, fixed statistics (mean and variance from training) are used instead of batch statistics, ensuring consistency across different input samples.
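
As a rough illustration of the computation described above, the sketch below normalizes a single mini-batch of scalar activations. The function names and the `eps` constant are assumptions for this example. As a simplification, the inference step reuses the statistics of one batch, whereas real frameworks maintain running averages over many training batches.

```python
import math


def batch_norm_train(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one mini-batch, then rescale with the learned gamma and beta."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    normalized = [(v - mean) / math.sqrt(variance + eps) for v in values]
    return [gamma * n + beta for n in normalized], mean, variance


def batch_norm_inference(values, fixed_mean, fixed_var, gamma=1.0, beta=0.0, eps=1e-5):
    """At inference, use fixed statistics saved from training instead of batch statistics."""
    return [gamma * (v - fixed_mean) / math.sqrt(fixed_var + eps) + beta for v in values]


batch = [2.0, 4.0, 6.0, 8.0]
train_out, batch_mean, batch_var = batch_norm_train(batch)
# Simplification: reuse this single batch's statistics as the fixed statistics.
inference_out = batch_norm_inference([5.0], batch_mean, batch_var)
```

Here `train_out` has a mean of approximately 0 and a standard deviation of approximately 1, and setting `gamma` and `beta` to other values shifts and scales that distribution.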

Finally, the ReLU Activation function allows the model to learn more complex patterns. ReLU, or Rectified Linear Unit, is a simple yet powerful activation function used in neural networks, especially in CNNs. It is defined as: ReLU(x) = max(0, x). This means any negative input is set to zero, while positive input values are left unchanged. ReLU introduces non-linearity to the network, allowing it to learn complex functions beyond just linear mappings. Unlike older activation functions like sigmoid or tanh, ReLU doesn’t saturate for large inputs and avoids vanishing gradients in positive ranges. This makes it computationally efficient and accelerates convergence during training. ReLU also has sparsity benefits. Since it outputs zero for any negative input, many neurons remain inactive at any given time, which can improve generalization and reduce model complexity. However, ReLU is not without drawbacks. One issue is the “dying ReLU” problem, where some neurons can become stuck with zero output and stop learning if their weights cause them to produce only negative values. Variants like Leaky ReLU or ELU aim to address this. Despite its simplicity, ReLU remains one of the most commonly used activation functions in deep learning because of its effectiveness and ease of implementation. It’s typically applied right after convolutional or fully connected layers. These three layers come together to form one block.
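
The definition ReLU(x) = max(0, x) and the Leaky ReLU variant mentioned above can be written in a few lines. This is a minimal sketch; the negative slope of 0.01 is a commonly used default and is an assumption here, not a value taken from this work.

```python
def relu(x):
    """ReLU(x) = max(0, x): negative inputs become zero, positive inputs pass through."""
    return max(0.0, x)


def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU keeps a small slope for negative inputs to mitigate 'dying ReLU'."""
    return x if x > 0 else negative_slope * x


activations = [relu(v) for v in [-2.0, -0.5, 0.0, 1.5]]
# activations == [0.0, 0.0, 0.0, 1.5]
```

Three of the four outputs are zero, which illustrates the sparsity described above: neurons receiving negative inputs stay inactive.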

2b. Encoders

Each encoder uses multiple blocks that increase the number of channels while reducing spatial dimensions. For example, an input image of 224x224x3 (height, width, color channels) will be reduced to 112x112x12 after the first downsampling layer.
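
The shape change in this example can be traced with the standard convolution output-size formula. The sketch below assumes a 3x3 kernel with stride 2 and padding 1 for each downsampling step; the channel counts after the first block (12) are hypothetical values chosen only to show the pattern, since they are not specified here.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial size after a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_encoder(height, width, out_channels):
    """Return the (height, width, channels) shape after each downsampling block."""
    shapes = []
    for channels in out_channels:
        height = conv_output_size(height)
        width = conv_output_size(width)
        shapes.append((height, width, channels))
    return shapes


# 12 channels after the first block matches the example in the text;
# the later channel counts are hypothetical.
shapes = trace_encoder(224, 224, [12, 24, 48, 96, 192])
# shapes[0] == (112, 112, 12)
# shapes[-1] == (7, 7, 192)
```

Under these assumptions, each block halves the height and width while the number of feature channels grows, so spatial information is progressively traded for channel information.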

While spatial resolution (height and width) decreases after each layer, information is redistributed to the feature channels (at first the color channels). This redistribution allows the model to disregard noise while capturing the most important aspects of the image. This method has proven to be effective in capturing small differences between a normal image and a diseased one.

Each convolutional layer decreases the spatial resolution, while each batch normalization layer keeps the dimensions of the image but applies a normalization function that rescales the values to a standardized range. A typical encoder block could look like this: Conv2D, Batch Normalization, Conv2D, Batch Normalization, Conv2D, Batch Normalization, ReLU Activation.

Finally, the ReLU activation layer introduces nonlinearity to the model, allowing it to learn more complex patterns and reducing the vanishing gradient problem. In total, multiple encoder blocks are used in combination to transform the original image. However, after the encoder has finished processing the image, its spatial resolution is extremely limited, often around 4x4 with N feature channels. Thus, the encoder’s output must be reshaped and transformed once again to produce an image that is understandable to humans.

2c. Decoders

The decoder’s primary purpose is to restore the encoder’s reduced output to the image’s original dimensions, most commonly 224x224x3. In this process, it is extremely important to conserve the features extracted by the encoder while also producing a final output that closely resembles the original image but with the important features highlighted. This is where the U-Net’s skip connections shine. The decoder has the same number of blocks as the encoder, so each encoder block’s output is sent to the matching decoder block in addition to the output of the previous decoder step. By doing this, the model is able to combine fine details (the encoder’s outputs) with more global features. The earlier encoder blocks retain fine spatial detail, while the deeper blocks capture increasingly abstract, image-wide structures; by fusing the encoder’s outputs with the decoder’s outputs, the model can use the fine details without missing out on image-wide structures.
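
One decoder step with a skip connection can be sketched as follows. This example assumes nearest-neighbor upsampling and channel-wise concatenation, as in the original U-Net; the fusion operation used in the skip connections of the models in this paper is not specified, so treat both choices, and the function names, as illustrative.

```python
def upsample_nearest(channel, factor=2):
    """Enlarge one 2D feature map by repeating each value factor x factor times."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def skip_concat(decoder_maps, encoder_maps):
    """Upsample the decoder features, then stack the encoder features as extra channels."""
    upsampled = [upsample_nearest(c) for c in decoder_maps]
    # Both inputs must share the same height and width before concatenation.
    assert len(upsampled[0]) == len(encoder_maps[0])
    assert len(upsampled[0][0]) == len(encoder_maps[0][0])
    return upsampled + encoder_maps


# One coarse 2x2 decoder channel and two fine 4x4 encoder channels (toy values).
decoder_maps = [[[1, 2], [3, 4]]]
encoder_maps = [[[0] * 4 for _ in range(4)], [[9] * 4 for _ in range(4)]]

fused = skip_concat(decoder_maps, encoder_maps)
# len(fused) == 3 channels, each 4x4
```

The result carries both the coarse decoder information and the fine encoder detail at the same resolution, which the following convolutions can then combine.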

2d. Two Stream Model

With our two model modifications, we aim to improve model performance both for patients who can receive contrast and for those who cannot. For patients who can receive contrast, AI models typically use LGE imaging alone, and cine sequences are not used at all. However, we hypothesize that by also leveraging the temporal information from the cine sequences, it is possible to achieve higher performance than an AI trained on LGE alone. Thus, we present a two-stream U-Net structure that is equipped with two encoders. One encoder processes the LGE stream while the other processes the cine MRI sequences. This dual-encoder design allows the model to extract complementary features from each imaging type. Using both modalities mimics how human experts would diagnose cardiac disease: by looking at the LGE images for spatial features such as discoloration while looking at the cine sequences for temporal features such as a smaller maximum expansion than usual or irregular contractions.

After the encoders, at the bottleneck layer, the outputs of both encoders are fused into a shared space by adding the values at each pixel position. Multiple forms of fusion were tested, including subtraction and channel-wise concatenation of the features; however, fusion through addition displayed the best performance. Theoretically, pixel-wise addition should conserve information from both inputs. The shared decoder then works the same way as in the single-encoder U-Net, taking its skip connection inputs solely from the LGE encoder because of the larger number of frames handled by the cine encoder.
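
The three fusion options mentioned above can be compared on toy bottleneck features. In this sketch, feature maps are nested lists indexed as [channel][row][column]; the function name `fuse`, the mode names, and the direction of the subtraction (LGE minus cine) are assumptions for illustration.

```python
def fuse(lge_maps, cine_maps, mode="add"):
    """Fuse two sets of bottleneck feature maps of identical shape."""
    if mode == "concat":
        # Channel-wise fusion: keep both sets side by side (doubles the channel count).
        return lge_maps + cine_maps
    if mode == "add":
        combine = lambda a, b: a + b
    elif mode == "subtract":
        combine = lambda a, b: a - b  # assumed direction: LGE minus cine
    else:
        raise ValueError(f"unknown fusion mode: {mode}")
    return [
        [[combine(a, b) for a, b in zip(row_l, row_c)] for row_l, row_c in zip(ch_l, ch_c)]
        for ch_l, ch_c in zip(lge_maps, cine_maps)
    ]


lge_features = [[[1.0, 2.0], [3.0, 4.0]]]
cine_features = [[[0.5, 0.5], [0.5, 0.5]]]

added = fuse(lge_features, cine_features, mode="add")
# added == [[[1.5, 2.5], [3.5, 4.5]]]
```

Addition and subtraction keep the channel count unchanged, so the shared decoder needs no modification, whereas concatenation doubles the channels entering the decoder.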

There are multiple advantages to this type of two-stream approach. The first is its ability to integrate multimodal information, mimicking radiologists. The second is its robustness for patients in whom cine or LGE alone does not provide enough information for an accurate diagnosis. By training the model to analyze both types of images, the system draws on the individual strengths of each and becomes less prone to errors. This enhances diagnostic accuracy, as analyzed in the results and discussion sections.

2e. TwoPlusOne Model

However, LGE imaging is not always available to doctors. Thus, improving the performance of cine-only models is crucial for ensuring the best possible care for all patients.

Traditional 2D CNNs like the U-Net are optimized for processing static 2D images rather than the dynamic frames of cine imaging, and they cannot make full use of the information contained in the changes between frames.

To address this, we introduce the 2D+1D convolutional U-Net, which provides the U-Net with a method of analyzing the differences between frames. This is accomplished by adding a one-dimensional convolutional layer after each encoder block and running the encoder across each input frame in succession instead of simultaneously. Theoretically, this addition allows the model to analyze the temporal information more effectively.

The model structure is as follows. An input of a single cine sequence, composed of 30 individual frames, is submitted to the model for processing. The model contains one encoder and one decoder. The first encoder block processes one frame at a time and produces the corresponding 30 feature maps. Afterwards, the one-dimensional convolutional layer compresses the 30 feature maps into one, which is saved so that its dimensions are suitable for the skip connection to the decoder. The 30 uncompressed feature maps are then sent to the next encoder block, and so forth. The main advantage of this structure is the potential to analyze temporal information more effectively than previously possible, with even less data than before.
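
The temporal compression step can be sketched as a weighted sum along the frame axis. This example assumes a 1D kernel whose length equals the number of frames, so that the output along the temporal axis has length one; the actual kernel size is not stated here, and the function name and uniform weights are illustrative (in a trained model the weights would be learned).

```python
def temporal_compress(frame_maps, weights):
    """Collapse per-frame feature maps [frame][row][col] into one map.

    Equivalent to a 1D convolution along the temporal axis whose kernel
    spans every frame (an assumption made for this sketch).
    """
    if len(weights) != len(frame_maps):
        raise ValueError("need exactly one weight per frame")
    rows, cols = len(frame_maps[0]), len(frame_maps[0][0])
    return [
        [sum(w * fm[i][j] for w, fm in zip(weights, frame_maps)) for j in range(cols)]
        for i in range(rows)
    ]


# Five toy 2x2 feature maps; every value in frame t equals t.
frame_maps = [[[float(t)] * 2 for _ in range(2)] for t in range(5)]
uniform_weights = [0.2] * 5  # illustrative; learned in practice

skip_map = temporal_compress(frame_maps, uniform_weights)
```

The compressed `skip_map` has the same height and width as a single frame's feature map, which is what makes it usable as a skip connection input, while `frame_maps` itself is passed on unchanged to the next encoder block.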

3. Results

Below is a table of each method’s average Dice coefficient, which serves as the accuracy measure for segmentation models. The label 1 coefficient denotes the accuracy of the model in diagnosing the first of three heart diseases, microvascular obstruction (MVO). The label 2 coefficient corresponds to the second disease, and the label 3 coefficient to the third.
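
For reference, the Dice coefficient for one label is twice the overlap between the predicted and true regions divided by the sum of their sizes. The sketch below computes it per label on toy label maps; the masks are invented for illustration and are not results from this study, and returning 1.0 when a label is absent from both maps is an assumed convention.

```python
def dice_coefficient(prediction, target, label):
    """Dice = 2 * |P and T| / (|P| + |T|) for one label in two 2D label maps."""
    pred = [value == label for row in prediction for value in row]
    true = [value == label for row in target for value in row]
    intersection = sum(p and t for p, t in zip(pred, true))
    total = sum(pred) + sum(true)
    if total == 0:
        return 1.0  # assumed convention: label absent from both maps
    return 2 * intersection / total


# Toy 3x3 label maps (0 = background, 1 to 3 = disease labels); not real data.
prediction = [[0, 1, 1], [0, 2, 2], [0, 0, 3]]
target = [[0, 1, 0], [0, 2, 2], [0, 3, 3]]

scores = {label: dice_coefficient(prediction, target, label) for label in (1, 2, 3)}
average_dice = sum(scores.values()) / len(scores)
```

In this toy case, label 2 is segmented perfectly (Dice of 1.0), while labels 1 and 3 each overlap on one pixel out of three counted pixels (Dice of 2/3).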

Each model was trained with a ResNet-50 encoder backbone and the Adam optimizer for 50 epochs at a learning rate of 1e-3. ResNet-50 is a type of CNN that is used as the encoder within a U-Net to increase accuracy.

Notably, the two-stream approach achieved a higher average Dice coefficient than both the LGE baseline and the cine baseline. The improvement was largest for label 3, at a full 6%. For label 2, the two-stream model showed a 3% increase, while for label 1 the LGE baseline still performed best.

While the LGE baseline performed well, the results of the two-stream approach show that cine sequence data can be used alongside LGE imaging to increase overall accuracy. These results support our hypothesis that the temporal information in cine MRI is helpful when attempting to diagnose heart diseases. For patients who are able to receive contrast, these results are promising, offering higher accuracy and robustness than before.

4. Challenges and Considerations

Despite the performance increases of the two-stream model, it did consume more memory than the LGE baseline due to its increased inputs. While not unexpected, it still is a factor that should be considered. Additionally, its label 1 diagnosis was slightly worse than the LGE baseline.

More important, though, were the VRAM challenges we encountered with the 2D+1D model. Because the encoder must run up to 30 times, once for each frame, the memory required was enormous. Even with 40GB of VRAM on an A4500, the 2D+1D model could not process 30 frames. To solve this issue, we selected only a subset of the frames within each cine sequence to reduce memory use. Our first approach was to randomly select n frames, with n anywhere from 5 to 15 (not 30, because of memory). However, this approach did not prove successful. Thus, we pivoted to another method: choosing frames at fixed, evenly spaced positions. We took every nth frame, where n was equal to 30 (the maximum number of frames) divided by the number of frames we wanted. Thus, if we chose 5 frames, we would select frames 0, 6, 12, 18, and 24. Because memory use scales linearly with the number of frames, selecting k of the 30 frames reduces it to roughly k/30 of the original. At its most effective, this method was able to reduce memory by as much as 6x with only a slight decrease in performance.
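
Both frame selection strategies described above can be sketched in a few lines. The function names are hypothetical, and sorting the randomly chosen indices to preserve temporal order is an assumption, since the text does not say whether order was kept.

```python
import random


def select_evenly_spaced(total_frames, wanted):
    """Take every nth frame, where n = total_frames // wanted."""
    step = total_frames // wanted
    return [i * step for i in range(wanted)]


def select_random(total_frames, wanted, seed=None):
    """First approach: pick frames at random (sorted to keep temporal order, an assumption)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_frames), wanted))


even_indices = select_evenly_spaced(30, 5)
# even_indices == [0, 6, 12, 18, 24]

random_indices = select_random(30, 5, seed=0)

# Memory scales roughly linearly with the number of frames processed.
memory_fraction = len(even_indices) / 30
```

With 5 of 30 frames, `memory_fraction` is 1/6, which corresponds to the 6x reduction reported above.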

5. Conclusion

Artificial intelligence holds vast potential in the field of medical imaging. With additional research, AI has the ability to improve the accuracy of diagnostics, decrease the time it takes for a patient to be diagnosed, and reduce doctor burnout. However, because of the suspicion surrounding AI, it must be extremely reliable and accurate in order to see widespread use. AI must be close to perfect to reduce misdiagnosis as much as possible, as incorrect predictions can lead to deadly consequences.

Looking forward, the TwoStream model looks promising; however, further research must be completed in order to verify our findings. One such research objective should be fine-tuning hyperparameters for both the LGE and Two Stream models. Additionally, running both models with data augmentation would make the results more reliable.

For the 2D+1D model, extensive further research is needed. The cine baseline performed about 2% better than the 2D+1D model. However, the 2D+1D model was running on only 5 frames instead of 30. Additionally, because of memory constraints, the size of the one-dimensional layer had to be reduced.

The results of this study show that modifying the U-Net can be effective, especially in cases where patients are able to receive contrast, as demonstrated by the performance of the Two Stream model. On average and on two of the three labels, it outperformed the LGE baseline, indicating that a multimodal approach could offer significant advantages over a single-modality method. By improving AI’s ability to work on cardiac imaging, this research paves the way for more accurate and equitable diagnostics and brings AI closer to fulfilling its transformative potential in healthcare.
