Complementary Time-Frequency Domain Networks for Dynamic Parallel MR Image Reconstruction

Purpose To introduce a novel deep learning based approach for fast and high-quality dynamic multi-coil MR reconstruction by learning a complementary time-frequency domain network that exploits spatio-temporal correlations simultaneously from complementary domains

Purpose To introduce a novel deep learning based approach for fast and high-quality dynamic multi-coil MR reconstruction by learning a complementary time-frequency domain network that exploits spatio-temporal correlations simultaneously from complementary domains.
Theory and Methods Dynamic parallel MR image reconstruction is formulated as a multi-variable minimisation problem, where the data is regularised in combined temporal Fourier and spatial (x -f ) domain as well as in spatio-temporal image (x -t ) domain.An iterative algorithm based on variable splitting technique is derived, which alternates among signal de-aliasing steps in x -f and x -t spaces, a closed-form point-wise data consistency step and a weighted coupling step.The iterative model is embedded into a deep recurrent neural network which learns to recover the image via exploiting spatio-temporal redundancies in complementary domains.

Results
Experiments were performed on two datasets of highly undersampled multi-coil short-axis cardiac cine MRI scans.Results demonstrate that our proposed method outperforms the current state-of-the-art approaches both quantitatively and qualitatively.The proposed model can also

| INTRODUCTION
Magnetic Resonance Imaging (MRI) is a widely used diagnostic modality which generates images with high spatial and temporal resolution as well as excellent soft tissue contrast.Dynamic MRI is often used to monitor dynamic processes of anatomy such as cardiac motion by acquiring a series of images at a high frame rate.However, the physics of the image acquisition process as well as physiological constraints limit the speed of MRI acquisition, and long scan time also makes it difficult to acquire images of moving structures.Thus, acceleration of the MRI acquisition is crucial to enable these clinical applications.
Parallel imaging (PI) techniques [1][2][3] have been widely used to accelerate MR imaging.They speed up the scan time by sampling only a limited number of phase-encoding steps, and then exploiting the correlations to restore the missing information in the reconstruction phase.Compressed sensing (CS) techniques combined with PI have shown great potential in improving the image reconstruction quality and acquisition speed [4][5][6][7][8].CS-based methods exploit signal sparsity in some specific transform domain, and recover the original image from undersampled k-space data using nonlinear reconstructions.One effective mean to exploit spatio-temporal redundancies for signal recovery in dynamic MRI is to enforce the sparsity in combined spatial and temporal Fourier (x -f ) domain against consistency with the acquired undersampled k-space data, and this can be represented by methods such as k -t FOCUSS [4,5] and k -t SPARSE-SENSE [6].The combinations of CS with low-rank in matrix completion schemes and spatio-temporal partial separability [7,9,10] have also been proposed to exploit correlations between the temporal profiles of the voxels, e.g.k -t SLR [7].Some more recent approaches [11,12] also utilised patch-based regularisation frameworks to exploit geometric similarities in the spatio-temporal domain.However, these CS-based approaches often require careful selection of problem-specific regularisation schemes and the tuning of hyper-parameters is often non-trivial.
Furthermore, the reconstruction speed of these methods is often slow due to the iterative nature of the optimisation used, and in the context of dynamic imaging, the additional time domain further increases the computational demand.
In contrast, deep learning (DL) based reconstruction approaches have become extremely popular in recent years and have enabled progress beyond the limitations of traditional CS techniques [13][14][15][16][17][18].In DL methods, prior information and regularisation can be implicitly learnt from the acquired data without having to manually specify them beforehand.Additionally, image quality and reconstruction speed are improved substantially.These advances include applications in both PI [19][20][21][22][23][24][25][26] and dynamic MRI [27][28][29][30][31][32][33].Most current approaches in DL for accelerated PI are based on exploiting information in a single image either in image domain [19,20,24] or in k-space domain [34][35][36], where each image (or frame) is reconstructed independently.Examples of these include the variational network (VN) [19] and robust artificial-neural-networks for k-space interpolation (RAKI) etc.In accelerated dynamic MRI, one of the key ingredients is to exploit the temporal redundancies.To this end, 3D convolutional networks (Cascade CNN) [27] and bidirectional convolutional recurrent neural networks (CRNN) [29] have been proposed to exploit the temporal dependencies of dynamic sequences in spatio-temporal image domain.Most of these DL-based approaches so far have focused on either 2D static PI or single-coil dynamic MRI, whereas only a few methods exist for dynamic parallel MRI reconstruction [32,33,37].Thus more efficient and effective DL models for dynamic parallel MRI are highly desirable.
In this work, inspired by CS-based k -t methods, we formulate the dynamic parallel MR image reconstruction as a multi-variable minimisation problem considering regularisation in both spatio-temporal and temporal frequency domains.We propose a novel end-to-end trainable deep recurrent neural network to model the iterative process resulting from the multi-variable minimisation.Specifically, the proposed DL approach alternates among four steps: (1) a signal de-aliasing step in combined spatial and temporal frequency domain (x -f ) via an xf -CRNN; (2) a complementary de-aliasing step in spatio-temporal image domain (x -t ) with an x t -CRNN; (3) a closed-form point-wise data consistency (DC) step and (4) a closed-form weighted coupling step which are embedded as layers in the deep neural network (DNN).Each of these steps correspond to the iterative algorithm derived from a variable splitting technique (Section 2).As the proposed model exploits spatio-temporal redundancies from Complementary Time-Frequency domains for the effective image reconstruction, we term our model as CTFNet.
The main contributions of our work can be summarised as follows: Firstly, we propose a new regularisation method built on recurrent neural networks for data regularisation in complementary spatio-temporal and temporal frequency domains to fully exploit data redundancies.Though previous studies [15,38,39] have shown that MR reconstruction can be performed in both k-space and image domains, it is unclear how cross-domain knowledge can be effectively utilised by DNNs in the dynamic setting, with an extra temporal dimension.To the best of our knowledge, this is the first work that investigates how complementary domain knowledge can be exploited in learningbased dynamic reconstruction.Secondly, we propose a closed-form DC layer that does not require a complex matrix inversion, and operates together with a weighted coupling layer for multi-coil images.Compared to other works [19,20,40], our approach offers an exact update for DC, avoiding the expensive need of solving a linear system via gradient updates.This enables our approach to be computationally more efficient and simpler for implementation.
Finally, we demonstrate that our approach is able to further push the undersampling rates with improved image quality against state-of-the-art CS and DL methods, as well as with a good generalisation ability to unseen data.This indicates a great potential in achieving fast single-breath-hold 2D cardiac cine imaging.This work extends our preliminary conference work on single-coil dynamic MRI reconstruction [41] and 2D static parallel MRI reconstruction [42], where we explored dynamic MRI and static PI separately.In comparison to our previous work, this work presents a novel and unified end-to-end DL solution with a new formulation for dynamic parallel MRI reconstruction, which addresses a more common scenario in the use-case for clinical practice.It proposes a new way of exploiting complementary time-frequency domain information in DL.Significantly more thorough quantitative and qualitative evaluations of the proposed method including comparison, generalisation and ablation studies have been performed on multi-coil cardiac MR data with retrospective undersampling.

| Dynamic parallel MRI model
Assume that m ∈ ¼ N is a complex-valued MR image sequence in x -y -t space represented as a vector, and let v i ∈ ¼ M (M N ) denote the undersampled k-space data (in k x -k y -t space) measured from the i th MR receiver coil.The data acquired from each coil thus can be represented as where F s is the spatial Fourier transform matrix, D is the sampling matrix on a Cartesian grid that zeros out entries that are not acquired, and S i is the i th coil sensitivity map.The reconstruction of m from v i is an ill-posed inverse problem, where i ∈ {1, 2, ..., n c } and n c denotes the number of receiver coils.Similar to CS formulations [6,43,44] based on the SENSE model, we formulate dynamic parallel MRI reconstruction as the following optimisation problem: Here, R xt is defined as a regularisation term on the spatio-temporal domain (x -y -t space, also denoted as x -t ) of the image sequence m, similar to the spatio-temporal total variation in most CS-based approaches.To fully exploit the spatio-temporal correlations, we additionally add a regularisation term R xf to regularise the data in the combined spatial and temporal frequency domain (x -f space), in which F t denotes the temporal Fourier transform.This leverages the characteristic that the signal can be sparsely represented in the temporal Fourier domain, because of the periodic cardiac motion exhibited in dynamic imaging.Previous works [15,41,43,45] have shown that data regularisation in different domains is beneficial due to the complementary information they represent, and thus here we propose to combine the regularisation terms from the complementary time and frequency domains with µ to balance between R xf and R xt .The last term in Eq. 2 enforces the data fidelity in PI, and here we formulate it as a coil-wise DC term, which aims to avoid the need to solve a linear problem inside subsequent sub-problem and also makes it simple to be embedded in an end-to-end DL framework (see following Optimisation).The model weight λ balances between regularisation and data fidelity.
Optimisation: To optimise Eq. 2, we propose to employ the variable splitting technique [42,46] to decouple the data fidelity term and regularisation terms.Specifically, auxiliary splitting variables are introduced here, converting Eq. 2 into the following equivalent form: In detail, the introduction of the first constraint m = u decouples m in the regularisation term R xt from that in the data fidelity term, and the second constraint F t m = ρ enables the decoupling of R xf from the other terms.The introduction of the third constraint S i m = σ i is also crucial as it allows decomposition of S i m from DF s S i m in the data fidelity term, which avoids the difficult dense matrix inversion in subsequent calculations (see Eq. 6).Using the penalty function method, Eq. 3 can be reformulated to minimise the following single cost function: where α, β and γ are penalty weights.To minimise Eq. 4 which is a multi-variable optimisation problem, alternating minimisation over m, u, ρ and σ i is performed, resulting in iteratively solving the following sub-problems: Here, k ∈ {0, 1, 2, ..., n i t − 1} denotes the k th iteration and m 0 is the zero-filled reconstruction as an initialisation.
An optimal solution (m * ) can be found by iterating over ρ k +1 ,u k +1 , σ k +1 i and m k +1 until convergence or reaching the maximum number of iterations n i t .
Specifically, Eq. 5a and Eq.5b are the proximal operators of the combined temporal Fourier and spatial domain prior R xf and the spatio-temporal image domain prior R xt respectively.Eq. 5c is a coil-wise data consistency step in PI (pDC), which imposes the consistency between the acquired k-space measurements and the reconstructed data.A closed-form solution for Eq.5c can be derived as: in which F H s is the conjugate transpose of F s and I is the identity matrix.Similarly, by optimising Eq. 5d, we obtain the following solution: where S H i is the conjugate transpose of S i .This can be regarded as a weighted coupling (wCP) of the results obtained from Eq. 5a, Eq. 5b and Eq.5c.In particular, it can be seen that both Eq. 6 and Eq. 7 are closed-form solutions and can be computed in a point-wise manner due to the inversion of diagonal matrices.This avoids iterative gradient updates and thus enables fast reconstruction speed in comparison to conjugate gradient-based approaches [20,32,46].

| CTFNet for dynamic parallel MRI reconstruction
Based on the model formulation in Eq. 5, we propose to embed the iterative reconstruction process into a DL framework to further improve the reconstruction quality with faster reconstruction speed and higher acceleration rates.Specifically, we propose a complementary time-frequency domain network (CTFNet) for the dynamic parallel MRI reconstruction to exploit the spatio-temporal correlations in complementary spatio-temporal and temporal frequency domains.Our model consists of four core components: (1) an xf -CRNN to implicitly learn the regularisation from the training data itself and perform the iterative de-aliasing in x -f domain, corresponding to Eq. 5a; similarly as the learning-based proximal operator in the spatio-temporal image domain, corresponding to Eq. 5b; (3) a pDC layer that performs coil-wise DC in PI (Eq.5c); and (4) a wCP layer that is naturally derived from Eq. 5d and performs the weighted coupling.An illustrative diagram of the proposed model is shown in Fig. 1.Note that the iterative reconstruction process as stated in Eq. 5 is modelled via the convolutional recurrent neural networks (CRNN) with recurrence over iterations.Details of each component of our network is explained hereafter.

| xf -CRNN
Corresponding to Eq. 5a, we first propose to exploit the spatio-temporal correlations in the combined temporal Fourier and spatial domain.Instead of explicitly imposing the regularisation term on the data such as in conventional CS-based methods, here we propose to implicitly learn the regularisation from the training data itself by leveraging DNNs in the x -f domain.Specifically, motivated by some of the CS-based k -t methods such as k -t FOCUSS [4,5], where its solution to the underdetermined inverse problem can be expressed as the form that consists of a baseline signal ρ together with its residual encoding (ρ k − ρ) for the k + 1-th estimate of the x -f signal ρ k +1 , we propose to formulate our x -f domain reconstruction as Particularly, in our formulation of Eq. 8, different from model-based [47] or compressed sensing [4,6] algorithms, we employ a stack of convolutional layers to estimate the missing data based on other available points, typically within its vicinity in x -f space.To fully exploit the spatio-temporal redundancies, we use the temporal average of a sequence as the x -f baseline signal ρ, and thus xf -CRNN learns to reconstruct residuals of the temporal frequencies with respect to the temporal average (direct current) values.This makes the residual energy much sparser and enables the network to focus more on the dynamic patterns of the signals with less efforts in reconstructing static background regions.In contrast to k -t FOCUSS implementation where sparsity was exploited for each coil separately, the proposed approach exploits the joint information in the multi-signal ensemble that represents the combination from all coils.This has been shown to be effective in reducing the number of required samples per coil and providing increased acceleration capability [6].Furthermore, different from our previous work in [41], we propose to model the iterative reconstruction process in x -f domain with the recurrent neural network (CRNN-i [29]) where recurrence is evolving over iterations via hidden-to-hidden connections and the trainable network parameters are shared across sequential iteration steps.
The illustrative diagram of x -f reconstruction is shown in Fig. 2. Specifically, we formulate the k -t to x -f transformation process in PI as an x -f transform layer in the network.The x -f transform layer receives input from multi-coil k -t space data, and then transform it to x -f space as inputs to xf -CRNN.Details of the process are illustrated and explained in Fig. 2. Note that the value range of the direct current component of the undersampled data in x-f space is lower than that of the temporal average, therefore after subtraction, the direct current component still remains but with a different value range from the temporal average.This also means that the subtracted data in image space can look similar to the temporal average but with a lower intensity range and aliasing artefacts.After the signal de-aliasing in x -f domain, another inverse Fourier transform along f is adopted to transform the estimated x -f signal ρ k +1 back to dynamic image space for the subsequent weighted coupling with other predictions, as shown in Fig. 1.

| x t -CRNN
Corresponding to the formulation in Eq. 5b, we additionally propose to learn a regulariser in the spatio-temporal image domain complementary to the combined spatial and temporal frequency domain.Specifically, to effectively exploit the spatio-temporal redundancies in x -y -t space, we adopt a variation of our previous CRNN-MRI [29] network for image space de-aliasing which has been shown to be an effective technique in dynamic MRI reconstruction, termed as x t -CRNN.In detail, bidirectional CRNN layers [29] with recurrence evolving over both temporal and iteration dimensions via hidden-to-hidden connections are employed.This allows us to embed the iterative reconstruction process in a learning setting as well as to propagate information along temporal axis bidirectionally.Similar to the x -f space reconstruction, the proposed x t -CRNN also learns to reconstruct the combined data from all coils, and learns the residuals of the temporal average baseline m (Eq.12) in spatio-temporal domain with m k − m as input to the network.This can require fewer k -t samples for residual encoding and similarly enables the x t -CRNN to focus more on the dynamics of the reconstruction.The x -t domain and x -f domain reconstructions are complementary, which further enables the network to maximally explore cross-domain knowledge for the signal recovery.
The x -f transform and reconstruction diagram for a single iteration in the combined spatial and temporal frequency space.In detail, the x -f transform layer receives input from multi-coil k -t space data.The acquired multi-coil k-space data is firstly averaged along t to yield a temporal average for each coil separately.At iteration k , the temporally averaged data is subtracted from corresponding coil data at each time frame, and the subtracted data and temporally averaged data from multi-coils are then inverse Fourier transformed and sensitivity-combined back to image space.This yields a sequence of aliased images and a temporally averaged sequence (Eq.12).Each frequency-encoding position of the coil-combined images is then processed separately hereafter.The image rows from aliased images or baseline images are gathered and temporal Fourier transformed along t to yield an x -f image, corresponding to ρ k − ρ and ρ respectively.These signals are then fed as inputs to xf -CRNN for x -f space reconstruction (Eq.8 and Eq.11a).

| Data consistency layer
As discussed in Section 2, Eq.5c and Eq. 6 give a closed-form solution with no dense matrix inversion, so that we can exactly embed it as a PI data consistency (pDC) layer in the DNN.To make it concise, we reformulate Eq. 6 as: where i ∈ {1, 2, ..., n c } and λ 0 = γ/(λ + γ).The DC in PI is performed coil-wise and point-wise, which makes it simple and appealing for implementation in DNNs.Here λ 0 is a hyperparameter that allows the adjustment of data fidelity based on the noise level of the acquired measurements.

| Weighted coupling layer
Similarly, Eq. 5d can be formulated as a weighted coupling (wCP) layer in DNNs given estimations from Eq. 5a, Eq. 5b and Eq.5c, as represented in the closed-form solution Eq. 7. The coil sensitivity maps can be normalised to one along coil dimension, and thus we can simplify Eq. 7 as in which α 0 = α α+β +γ and β 0 = β α+β +γ control the weighted coupling of predictions from x -t domain and x -f domain respectively.

| CTFNet
Based on the proposed four modules, our CTFNet can thus be compactly represented as follows: Here m denotes the temporally averaged sensitivity-combined image of a sequence that is used as the baseline signal, and it can be mathematically expressed as in which max operation is performed element-wise, t indicates summation along the temporal dimension, and [•] T represents the repetition operation along the temporal dimension for T times (the number of frames in a sequence).
Given the proposed framework, our CTFNet can iteratively learn to reconstruct the true images from both spatiotemporal and temporal frequency spaces, so that the spatio-temporal redundancies can be jointly exploited from complementary domains for better reconstructions.

| Network Architecture and Learning
The detailed network architecture of the proposed CTFNet is shown in Fig. 1.The CTFNet architecture consists of four components, where each component is corresponding to each subequation in Eq. 11, respectively.In detail, xf -CRNN is composed of 4 layers of CRNN-i and 1 layer of 2D CNN with a residual connection from the baseline estimate.The 2D convolutions are applied over the x and f dimensions, and the y dimension is viewed as the batch dimension.For the x t -CRNN model, a variation of architecture [29] is employed which consists of 4 layers of BCRNN evolving over both temporal and iteration dimensions, 1 layer of 2D CNN and a residual connection.Here the 2D convolutions are applied on the spatial dimensions with recurrence over time dimension in BCRNN, whereas in 2D CNN layer the time dimension is viewed as the batch dimension.We used dilated 2D convolutions with kernel size 3 × 3 and dilation factor 3 × 3 to increase of the receptive field sizes.The number of input and output channels of the network was 2, representing the real and imaginary part of the complex-valued data.
Given the training set Ω with undersampled data m 0 as input and fully sampled data as target, the network is trained end-to-end by minimising the pixel-wise L1 norm between the reconstructed data and the sensitivity-weighted ground truth data m gt : where m n i t denotes the predicted image at iteration n i t , i.e., the final output of the proposed network, θ is the set of network parameters, and n Ω is the number of training samples.In our setting, we have the iteration step n i t set to 5. Specifically, all the components in CTFNet including the pDC and wCP layers are needed for training and involved in gradient backpropagation, where the backward pass of the pDC operation can be similarly derived as in [27] and backward pass for wCP operation with respect to input layers are merely their corresponding coefficients due to the weighted coupling operation.For stability of training, values of λ 0 , α 0 and β 0 were all set to 0.1 based on our preliminary works [25,42].

| Data
We used two datasets for the experimental evaluations.The first dataset (Dataset A) includes 38 sets of complexvalued multi-slice short-axis cardiac MRI scans acquired on a 1.5T Siemens scanner.2D bSSFP cine acquisition with retrospective gating and 2× GRAPPA acceleration was performed for 14 healthy subjects and 24 patients with suspected cardiovascular diseases for left ventricular coverage.In the patient population, we encountered myocarditis, arrhythmogenic right and left ventricular cardiomyopathy, restrictive cardiomyopathy, dilated cardiomyopathy, hypertensive cardiomyopathy, non-ischaemic cardiomyopathy, embolic myocardial infarction and eosinophilic granulomatosis with polyangiitis (EGPA) with cardiac involvement [37].

| Experiments
We firstly performed the comparison study where we compared our CTFNet against other competing approaches on Dataset A with mixed healthy subjects and patients for reconstructions from undersampling rates of 8, 16 and

24.
Here the models were trained on Dataset A with a 2-fold cross-validation, where each fold contained 7 healthy subjects and 12 patients with six slices for each subject.In the second step, we explored the generalisation potential of the proposed method.Specifically, we first investigated the robustness of the models when applied to data that were acquired with different scanners and acquisition settings from the training data.We employed models trained on Dataset A and directly tested them on Dataset B. Dataset B differs from Dataset A on the aspects of scanners, acquisition parameters, temporal resolutions, number of acquisition coils and sampling matrix size.In addition, we further investigated the generalisation performance of the proposed method from healthy subjects to patients that were not represented in the training set.In detail, we trained another model with only healthy subjects (14 subjects, 84 slices), and directly tested it on patients in Dataset A. To better understand the proposed method and its performance, an ablation study was also conducted on both datasets to gain more insights on the effects of regularisation in different domains for the dynamic parallel reconstruction problem.Specifically, we investigated and compared between the single domain regularisation (R xt or R xf ) and the complementary time-frequency domain regularisation.Lastly, a clinical evaluation of the proposed method was performed to investigate the clinical utility of the reconstructed data.In this regard, we measured the left ventricle end-diastolic volume (LVEDV), left ventricle end-systolic volume (LVESV) and left ventricle ejection fraction (LVEF) from the reference standard and the reconstructed accelerated data respectively.The segmentation masks were extracted by using a state-of-the-art DL-based segmentation algorithm [51,52], where the model was previously trained on a different set of cardiac MR data without acceleration.

| Evaluation Method
We compared our proposed approach (CTFNet) with representative MR reconstruction methods, including state-ofthe-art CS and low-rank based method k -t SLR [7], and two variants of DL methods, dynamic VN [33] and Cascade CNN [24,27], which have been substantially enhanced to adapt to dynamic parallel image reconstruction.Dynamic VN [33] learns the complex spatio-temporal convolutions in contrast to the original VN [19], and for strong comparisons with our method, we propose to improve it by incorporating the temporal average baseline as an initialisation.
Similarly, as to Cascade CNN with the D-POCSENSE framework [24] originally designed for static PI, we also refined it to learn the residual of the temporal average, and adjusted it with the same convolutional recurrent architecture as CTFNet to equip it with the ability to exploit spatio-temporal correlations.Thus we term it as CascadeCRNN.The network architecture for CascadeCRNN was the same as x t -CRNN and the number of iteration steps n i t was set to 5 for all DL methods.k -t SLR formulation has also been extended to be used with multi-coil data based on SENSE model in contrast to its original implementation [7].
Quantitative results were evaluated in terms of normalised mean-squared-error (NMSE) and peak-to-noise-ratio (PSNR) on complex-valued images, as well as structural similarity index (SSIM) and high frequency error norm (HFEN) on magnitude images.These metrics were made to evaluate the reconstruction results with complimentary emphasis.All quantitative results were computed only around cropped dynamic regions for better evaluation.Lower NMSE/HFEN and higher PSNR/SSIM indicate better results.
Statistical tests were performed for results of each metric to ensure that the differences between methods were significant.For multi-model comparisons, we first performed a Friedman test [53] to see if there was a significant difference in the population statistics (metric results).Then if the null hypothesis of the Friedman Test was rejected, we performed one-versus-all one-way Wilcoxon signed-rank test [54] with Bonferroni correction to find out if the results from our model significantly outperformed the others.

| Implementation details
The CTFNet approach as well as the compared DL methods were all implemented in PyTorch, and trained with the setting as described in Section 3.1.Experiments were performed on a 12GB Nvidia Titan Xp Graphics Processing Unit (GPU).For k -t SLR, we used the Matlab implementation provided by [7] with an extension to multi-coil data.
Experiments were conducted on a 16GB RAM, 3.60GHz Central Processing Unit (CPU).

| Comparison study
Quantitative comparison results of different methods on dynamic multi-coil cardiac data with various high acceleration rates (8×, 16× and 24×) are presented in Table 1.The results reported were on the entire 228 2D+t slices on Dataset A. It can be seen that our proposed CTFNet outperforms k -t SLR by a large margin in terms of all these measures at different undersampling rates.It also offers a much faster (∼1000×) reconstruction speed with 2.8s for the entire sequence of one slice compared with k -t SLR with 2444.8s for the same reconstruction.In comparison to other DL-based methods which have been carefully enhanced to incorporate temporal information, our proposed approach can still achieve better performance on all acceleration rates, with an improvement of around 1dB PSNR and 1.5% SSIM increase over the most competing method (CascadeCRNN).The performance gap of the improvement is also increasing as acceleration rate increases.All results were statistically significant with p 10 −5 .Additionally, we also compared the qualitative results on 16× and 24× undersampled data (equivalent scan time: 15s and 10s respectively within a single breath-hold) in Fig. 3 and Supporting Information Fig. S2, which shows the reconstructed images along both spatial and temporal dimensions as well as their corresponding error maps on a patient and a healthy subject.
Compared to other competing methods, it can be observed that our proposed model can faithfully recover the images with smaller errors especially around dynamic regions, and can also produce sharper reconstructions along temporal profiles.

| Generalisation study
In this study, we explored the generalisation potential of the proposed method.The generalisation test results of different DL models from Dataset A to Dataset B are presented.Particularly, to better understand how the methods perform on the key frames, here we show their comparisons on the end-systolic (ES) frame (Table 2) and end-diastolic (ED) frame respectively (Supporting Information Table S1).It can be seen that the proposed method achieves high performance on the unseen test dataset and also consistently outperforms against other competing methods, indicating its capability in effectively learning the inverse dynamic reconstruction problem.All results on both ES and ED frames were also statistically significant with p < 0.005.Besides, we also visualised the generalisation results of Dataset B under different acceleration rates, as presented in Fig. 4 and Supporting Information Fig. S3.It can be observed that our approach can recover the fine details and the temporal traces of the image very well on data from unseen domain even with extreme undersampling rate (24×), though it is anticipated that the reconstruction gets more challenging as acceleration rate increases.
In addition, we further investigated the generalisation performance of the proposed method from healthy subjects to patients that were not represented in the training set.In detail, we trained another model with only healthy subjects (14 subjects, 84 slices), and directly tested it on patients in Dataset A. In addition, the generalisation results from healthy subjects to patients were compared with models trained with mixed healthy subjects and patients (19 subjects, 114 slices), as shown in Table 3.Though the pathological conditions were not included in the training data, the generalisation results from healthy data to patients were very competitive to the mixed training models with an average of only 0.2dB PSNR and 0.2% SSIM drop of performance.This can also be observed from the qualitative comparison as shown in Fig. 5, where only subtle differences can be detected from these two training settings.

| Ablation study
In addition, we performed the ablation study to gain more insights on the effects of different regularisations for the problem.Particularly, we investigated on the effects of different regularisations (R xt and R xf ) on the dynamic parallel reconstruction problem.Specifically, The ablation study compared results from the spatio-temporal image space re-    4, where reconstruction models were trained on data with R=8 from datasets A and B respectively.A qualitative result is also given in Fig. 6 on data with R=16.

| Clinical Evaluation
Our experiments indicate a good reconstruction quality of the proposed method with respect to the reference standard, however, the question of the clinical validity of such reconstructions remains open.To understand the translational utility, we performed the left ventricular function assessment (LVEDV, LVESV and LVEF) on our reconstructed data in Dataset A. Note that the DL-based segmentation tool was trained on another group of cardiac MR data of healthy subjects without acceleration (UK Biobank data [55]) and DL-based approaches are also known to be sensitive to variations of image distributions and perturbations, so there were some inevitable failure cases on our test data.
Therefore, we performed manual quality control of the segmentation in our cases, where we have excluded severe failure cases from the cohort.The resulting dataset includes 12 healthy subjects and 13 patients, and we show their comparisons with the reference standard in Bland-Altman plots in Supporting Information Fig. S4 for all acceleration rates.It can be observed that our reconstructions have achieved reasonably good results on all three measures, and on average a bias for LVEDV, LVESV and LVEF of -0.04ml, 1.15ml and -0.97% was observed with all observations lying inside the 96% confidence interval of ±2.21%, ±2.53 and ±2.36% on 8× accelerated data.For 16× and 24× accelerated data, an average bias of -1.81ml, 1.98ml and -2.32% for LVEDV, LVESV and LVEF was observed with a slightly higher variance on 24× accelerated data than on 16× accelerated data.Both the biases and variances of these measurements are within an acceptable range, indicating that our reconstruction results can potentially have the clinical benefit.

| DISCUSSION
In this work, we have demonstrated that the proposed method is capable of recovering high quality images from highly undersampled dynamic multi-coil data.Different from existing DL-based approaches, we incorporated the combined  4), the proposed x -f space reconstruction (Proposed (R xf )) has shown to be more effective in exploiting the spatio-temporal correlations, with higher reconstruction accuracy and a smaller number of network parameters (Table 4).This is mainly due to the inherent nature of the periodic dynamic cardiac MRI data itself, where strong correlations exist in k-space and time and signal in temporal Fourier space is sparse.This has been represented in many traditional CS-based methods, and here our results have demonstrated that the learned implicit DNN-prior in the temporal Fourier domain can further increase the acceleration capability and achieve even better performance.In addition, combination of time-frequency cross-domain knowledge (Proposed (R xf + R xt ), Table 4 and Fig. 6) further enhances the reconstruction capability of the proposed method with better reconstruction quality.In comparison to three fully independent constraints which regularise the spatial, temporal and temporal-frequency dimensions independently, our approach offers the ability to efficiently exploit correlations in the joint spatial and temporal/temporal-frequency space, which enables the estimation of missing data to be based on other available points within its spatial vicinity at neighboring time points or temporal-frequency.
Additionally, the use of two constraints is also more efficient than three independent constraints, as adding one extra constraint will lead to another subproblem for minimisation, which thereby will increase complexity in network architecture design.The improved performance of CTFNet over other competing methods indicates that learning jointly from both spatio-temporal and temporal frequency domains can capture complementary useful information that can be effectively utilised by the proposed framework.
Furthermore, the proposed CTFNet builds on a multi-variable minimisation problem and embeds it into an efficient DL framework.The employed variable splitting technique effectively decouples data regularisation terms on various domains from the data fidelity term, which enables the natural derivation of pDC layer and wCP layer in PI with closedform point-wise solutions.Though the derived pDC layer shares similar form as the one proposed in D-POCSENSE [24] which is a simple extension from single-coil application [27], our solution (pDC with wCP layers) for the multicoil setting has the mathematical support based on variable splitting and alternating minimisation, and thus reasons the particular formulation and structure of our network.In contrast to [20,32] where data fidelity step is solved via conjugate gradient algorithm due to the difficult matrix inversion in their DC terms, our CTFNet offers a much simpler and more efficient solution with exact steps and avoids iterative gradient updates, allowing for faster reconstruction speed and easier embedding into DNNs.Besides, our approach also offers the flexibility of incorporating additional regularisation terms in the framework, whereas this will not be very straightforward for the other approaches.
It is worth mentioning that with the recurrent design of the architecture, it is natural that the parameters for each iteration are shared, whereas this is not the case for other approaches such as VNs [19].Similar to other DLbased unrolled network architecture, our proposed approach is a learning-based method which is designed to achieve optimal performance with the pre-set fixed number of iterations during the training, though the recurrent design enables it to be employed for arbitrary number of iterations at inference time.In our work, we determined heuristically that 5 iterations represents a compromise between GPU memory requirements and performance.Therefore, we fixed n i t = 5 to train our network, and thus it is expected that the optimal inference performance also comes from n i t = 5.
If n i t is increased correspondingly at training stage, we can expect that the performance of the network will be further improved and converge after a few iterations.However, due to the multi-coil setting and the limitation of GPU memory, we were not able to train the model with more iterations.More efficient training schemes will be investigated in the future to improve the performance of the proposed method.In addition, with respect to the hyper-parameters, our DL-based method appears to be more robust at test time while CS-based approaches may require case-level tuning for the optimal results.
In our work, we also propose to learn the residual of a temporally averaged frame, which is not necessarily required to be a fully-sampled image.This can be seen from our undersampling masks of R=24 (Supporting Information Fig. S1), where the averaged k-space is not fully-sampled but good reconstruction results can still be achieved.Besides, VISTA sampling pattern was designed in a way to spread the sampled lines out as much as possible to high frequencies, though it also appears to omit some of the highest k-space locations which is a common phenomenon for high accelerations of undersampling strategies.VISTA has been shown to be able to generate more consistent results with superior noise-sharpness trade-off compared to other commonly employed sampling patterns [48], and thus it is also expected to benefit the DL-based reconstruction.The fundamental acceleration limit for which the undersampling does not affect the effective reconstructed resolution was not investigated in [48], and this is also not the focus in our work.Nevertheless, the proposed approach is a learning-based method, i.e. the network is able to reconstruct the missing high frequency samples from the trained information and by that recover reasonably good reconstructions.Additionally, the impact of motion would also be crucial for understanding the robustness of the model.Theoretically, since the model allows high acceleration rates, it is likely that the model can accommodate even if some data are rejected or omitted.However, it would require comprehensive evaluation to study various degrees and types of motion and artifacts, and this will be an important direction of future work.
Moreover, the proposed method can generalise well to unseen cardiac MR data with different acquisition parameters and with pathology that were not seen in the training set.The method can achieve satisfactory performance on these scenarios even with highly aggressive undersampling strategies, which indicates that the proposed method is robust to unseen and unusual image features or temporal behaviours present in our currently used dataset.In addition, our clinical evaluation of the reconstructed images shows acceptable biases and variances on the left ventricular function assessment, indicating the great potential to deploy DL models for clinical practice.Also to note that the differences between the reference and reconstructed data could also be accumulated from errors induced by the segmentation performances especially on accelerated data, thus this shall also be taken into consideration when comparing them with the reference standard.However, the correction and improvement of the segmentation on accelerated data are out of scope of this work, and should be considered as an important avenue for future research.
Particularly, by exploiting spatio-temporal redundancies in the proposed DL framework, our approach can outperform the state-of-the-art CS and DL-based methods and can further push the acceleration capability with fast reconstruction speed for the dynamic parallel MR imaging.In our work, Dataset A was a multi-breath-hold acquisition of 8 consecutive breath-holds with 15s for each (2× GRAPPA accelerated).Hence, an acceleration rate of 16 or higher could result in the possibility of achieving the same acquisition in a single breath-hold, but this may also require further investigation with respect to transient state imaging, achievable spatial coverage and/or temporal resolution for single breath-hold cine imaging.Despite this being a retrospective undersampling study, our results indicate a great potential in facilitating fast single-breath-hold clinical 2D cardiac cine imaging.
For the future work, we will explore the dynamic parallel image reconstruction with other types of undersampling strategies, such as radial sampling which is also commonly used in acceleration of 2D cardiac MR imaging in practice.
In addition, we could also consider incorporating some other regularisation terms into the framework, such as regularisation on some other transform domains, to exploit the data redundancy for effective reconstruction.Besides, generalisation capability of the model can be further validated on more data from different domains and with various acquisition parameters and pathologies, as well as on more aspects such as generalisation to different orientations (e.g.chamber views) to investigate its potential application for clinical use.

| CONCLUSION
In this paper, we have proposed a novel DL-based approach, termed CTFNet, for highly undersampled dynamic parallel MR image reconstruction.The proposed method exploits spatio-temporal correlations in both the combined (2) an x t -CRNN F I G U R E 1 An illustrative diagram of the proposed CTFNet at a single iteration.Each component is corresponding to each subequation in Eq. 11 respectively.(a) Network architecture for xf -CRNN; (b) network architecture for x t -CRNN; (c) PI data consistency (pDC) layer; (d) weighted coupling (wCP) layer.Numbers inside CNN, CRNN-i and BCRNN layers indicate {kernel size, dilation factor, number of filters}.Note that features learned at each iteration is propagated along iteration steps via the hidden-to-hidden connections in CRNN and BCRNN units.For mathematical notations, please refer to Eq. 11.

For
training details, gradients were hard-clipped to[-5, 5]  to avoid the gradient explosion problem in training recurrent neural networks.The ADAM optimiser was employed with a learning rate of 10 −4 .The intensities of input images were normalised by the maximum value of their corresponding undersampled temporal average frame.During training, we extracted training patches along the frequency-encoding direction and used the entire sequence of the data.Networks for different undersampling factors were first trained jointly and then finetuned separately.A plateau of the performance can be observed with 10 5 backpropagations.Patch extraction and data augmentation were performed on-the-fly on the individual coil images, with random rotation and scaling, and the minibatch size during the training was set to 1.
The data was acquired with Cartesian sampling and with acquisition parameters including in-plane resolution of 1.9×1.9mm,slice thickness of 8mm, number of phase-encoding lines of 156, repetition time (TR) of 2.12ms, echo time (TE) of 1.06ms and temporal resolution of around 40ms.Images were reconstructed from the 2× acceleration to a fully sampled k-space by GRAPPA.Written informed consent was obtained from all subjects and the study was approved by the local ethics committee (healthy subjects: London Bridge Research Ethics Committee, patients: North of Scotland Research Ethics Committee).In experiments, six slices from each subject that cover the dynamic anatomy were extracted, resulting in a total number of 228 slices for the experiments.Each acquisition in this cohort consists of 25 frames with 30/34/38-channel multi-coil data.The second dataset (Dataset B) used in our experiments consists of 10 fully sampled complex-valued short-axis cardiac cine MRI acquired on a 1.5T Philips scanner.The data were acquired on healthy volunteers following written informed consent under approved research ethics (08/H0711/82).Each scan contains a single slice SSFP acquisition with 30 temporal frames and 32-channel multi-coil raw data.The acquisition has an in-plane resolution of 1.7 × 1.7mm, slice thickness of 10mm, 190 phase-encoding lines, TE=1.66ms,TR=3.32m and an average temporal resolution of 33.70ms.A variable density incoherent spatiotemporal acquisition (VISTA) sampling scheme[48] was employed to undersample the k-space data in our experiments, which has been shown to be an effective Cartesian sampling strategy for dynamic data.The scheme is based on a constrained minimisation of Riesz energy on a spatiotemporal grid.It allows uniform coverage of the acquisition domain with regular gaps between samples and guarantees a fully-sampled, time-averaged k-space to facilitate GRAPPA or ESPIRiT kernel estimation.In experiments, we undersampled the data at acceleration rates of 8, 16 and 24, and examples of them are shown in Supporting Information Fig.S1.Coil sensitivity maps were pre-computed from the fully-sampled, time-averaged k-space center with the ESPIRiT algorithm[49] by using the BART toolbox[50].

F I G U R E 3
Qualitative comparison results of different methods on spatial and temporal dimensions with their error maps.Results are shown for undersampling rates 16× of a patient (top) and 24× of a healthy subject (bottom) on Dataset A at systolic frames.The scan time for these two acquisitions are 15s and 10s within a single breath-hold respectively.The proposed method can well recover the fine details and preserve the temporal traces, though this gets more challenging on aggressively undersampled data.An example of the results at diastolic frames is shown in Supporting Information Fig.S2.
Generalisation reconstructions of the proposed method on the unseen domain Dataset B along spatial and temporal dimensions with various acceleration rates as well as their error maps.(a) Fully sampled image (b) Example of undersampling image with R=24 (c) (d) Reconstruction from R=8 (e) (f) Reconstruction from R=16 (g) (h) Reconstruction from R=24.The proposed method can well reconstruct the images with good preservation of temporal trace on various undersampling rates.Though reconstruction is more challenging as R increases, the reconstructed results can still be useful.construction (Proposed (R xt )), the combined temporal Fourier and spatial space reconstruction (Proposed (R xf )) as well as the complementary time-frequency domain reconstruction (Proposed (R xf +R xt )).All these ablated approaches with varying domain regularisations were conducted under the same variable splitting framework as in Section 2, where spatial and temporal frequency domain regularisation into the formulation of the dynamic parallel MRI reconstruc-F I G U R E 6 Qualitative comparisons of the ablated different domain reconstructions on spatial and temporal dimensions with their error maps.Results are shown for R=16 (scan time 15s) on Dataset A. Highlighted regions indicate improvement of the complementary time-frequency domain reconstruction.tion problem and exploited spatio-temporal redundancies from both x -t and x -f spaces with DNNs.Compared with spatio-temporal image (x -t ) domain reconstruction (Proposed (R xt ), Table spatial and temporal frequency domain and the spatio-temporal image domain based on a variable splitting and alternating minimisation formulation.The network is able to learn to iteratively reconstruct the images by jointly and effectively exploiting information from the complementary time-frequency domains.Our proposed CTFNet outperforms state-of-the-art dynamic MR reconstruction methods in terms of both quantitative and qualitative performance, with excellent recovery of fine details and preservation of temporal traces.It also enables increased accelerations of data acquisition with favorable generalisation ability, which is promising for realising single-breath-hold clinical 2D cardiac cine MR imaging.SUPPORTING INFORMATION S U P P O R T I N G I N F O R M AT I O N TA B L E S1 Generalisation results of different DL methods trained on Dataset A and deployed to Dataset B for different acceleration rates.Results (mean (standard deviation)) were computed and compared only around dynamic regions on the ED frame.NMSE is scaled to 10 −2 P O R T I N G I N F O R M AT I O N F I G U R E S1 Examples of the VISTA undersampling patterns for acceleration rates 8, 16, and 24.Top figures show the undersampling patterns in k -space, and the bottom figures show the undersampling patters in k -t space.S U P P O R T I N G I N F O R M AT I O N F I G U R E S2 Qualitative comparison results of different methods on spatial and temporal dimensions with their error maps.Results are shown for undersampling rates R=16 of a patient (top) and R=24 of a healthy subject (bottom) on Dataset A at diastolic frames.
TA B L E 1 Comparison results of different methods on Dataset A of dynamic multi-coil cardiac cine MRI with high acceleration rates (R).Results (mean (standard deviation)) were computed and compared only around dynamic regions.NMSE is scaled to 10 −2 .Best results are shown in bold.
TA B L E 2 Generalisation results of different DL methods trained on Dataset A and deployed to Dataset B for different acceleration rates.Results (mean (standard deviation)) were computed and compared only around dynamic regions on ES frame.NMSE is scaled to 10 −2 .Best results are shown in bold.
TA B L E 3 Generalisation results of the proposed method trained on healthy subjects only (84 slices) and tested on patients in Dataset A for different acceleration rates.Results (mean (standard deviation)) were computed only around dynamic regions and compared with models trained with mixed healthy subjects and patients (114 slices) also in Dataset A. NMSE is scaled to 10 −2 .Better results are shown in bold.Comparison of the proposed method between mixed training results (from mixed healthy subjects/patients to patients) and generalisation results (from healthy subjects to patients).Results shown are on one patient with hypertensive cardiomyopathy in Dataset A on R=8.The generalisation result is almost as well as the one from standard mixed training.TA B L E 4 Ablation study of effects of different regularisations on dynamic cardiac cine MRI reconstruction.Experiments were performed on two different datasets (A and B) with undersampling rate 8×.NMSE is scaled to 10 −2 .Results are presented in mean (standard deviation).Best results are indicated in bold.
. Best results are shown in bold.