Imagined speech is a mental task in which individuals internally simulate the articulation of a prompt without actual vocalization. It has recently gained widespread attention as a Brain-Computer Interface (BCI) paradigm due to its simplicity and intuitiveness. Decoding imagined speech from brain signals has therefore emerged as a pivotal challenge, addressed in the literature with a variety of signal processing and machine learning techniques. The most commonly employed neuroimaging method is Electroencephalography (EEG) because of its non-invasive nature, low cost, and high temporal resolution. Recent attempts at deciphering imagined speech from EEG signals deploy Convolutional Neural Network (CNN) architectures such as Shallow ConvNet, Deep ConvNet, and EEGNet, while others use Cross-Covariance (CCV) matrices as an alternative form of signal representation. Our novel architecture combines EEGNet with CCV matrices, extracting discriminative features from the latter via bilinear transformations, as proposed in the SPDNet architecture. Our method is validated on two publicly available datasets, exhibiting performance on par with the state of the art while substantially surpassing EEGNet on both datasets.
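To make the core idea concrete, the sketch below illustrates a CCV branch of the kind described above: a cross-covariance matrix is computed per EEG trial and passed through an SPDNet-style bilinear (BiMap) layer. This is an illustrative PyTorch sketch, not the authors' implementation; all layer sizes and names are assumptions, the Stiefel-manifold constraint on the bilinear weights is omitted, and SPDNet's ReEig/LogEig layers are left out for brevity.

import torch
import torch.nn as nn


def cross_covariance(x):
    """Channel-wise cross-covariance of an EEG trial.

    x: (batch, channels, time) -> (batch, channels, channels)
    """
    x = x - x.mean(dim=-1, keepdim=True)   # zero-mean each channel
    return x @ x.transpose(-1, -2) / (x.shape[-1] - 1)


class BiMap(nn.Module):
    """SPDNet-style bilinear layer: X -> W^T X W, shrinking the SPD matrix."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        # In full SPDNet, W lies on a Stiefel manifold (semi-orthogonal
        # columns); an unconstrained weight is a simplification here.
        self.W = nn.Parameter(torch.randn(in_dim, out_dim) * in_dim ** -0.5)

    def forward(self, X):
        return self.W.transpose(0, 1) @ X @ self.W


class CCVBranch(nn.Module):
    """Extracts discriminative features from cross-covariance matrices."""

    def __init__(self, n_channels, reduced_dim=16):
        super().__init__()
        self.bimap = BiMap(n_channels, reduced_dim)

    def forward(self, x):
        C = cross_covariance(x)            # (batch, ch, ch)
        Z = self.bimap(C)                  # (batch, r, r)
        return Z.flatten(start_dim=1)      # vectorize for a classifier head


# Hypothetical usage: 64-channel, 2-second trials sampled at 128 Hz.
if __name__ == "__main__":
    trials = torch.randn(8, 64, 256)
    branch = CCVBranch(n_channels=64, reduced_dim=16)
    feats = branch(trials)
    print(feats.shape)                     # torch.Size([8, 256])

In the combined architecture, features of this kind would be fused with EEGNet's temporal-convolutional features before classification; the fusion strategy shown here (simple vectorization feeding a shared head) is one plausible reading, not a claim about the paper's exact design.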