Gas chromatography-ion mobility spectrometry (GC-IMS) is a powerful technique for the separation and sensitive detection of volatile organic compounds (VOCs). It is a rapid, robust and easy-to-handle technique, which has recently gained attention for targeted as well as non-targeted screening (NTS) approaches. In this article, the general working principles of GC-IMS are presented.
Quality control and early detection of hazard chemicals, allergens, or biological contaminants are a critical measure to ensure product safety. Environmental pollutants, pesticides, or toxins, among others, can compromise food safety and pose a public health risk . Furthermore, food adulteration and food fraud, accelerated by globalization, continue to cause economic losses and customer dissatisfaction and emphasize the need for robust, inexpensive, and fast analytical methods . Due to the inherent diversity of biogenic samples, as observed in food analysis, and the chemical complexity of the sample matrices, analytical approaches covering a multitude of parameters in parallel paired with strong discrimination power are required . Analysis of the volatile organic compounds (VOCs) of samples, also known as VOC profiling, allows for the detection of compounds in complex sample matrices without the need for detailed a priori knowledge of the molecular composition. Furthermore, VOC profiling can be performed without advanced sample preparation and without the need for detailed knowledge of the molecular composition. Gas chromatography-mass spectrometry (GC-MS) is commonly used for VOC analysis, however, typically requires specific laboratory infrastructure, such as helium gas supply and vacuum technology. As an alternative with significantly less demand on infrastructure, ion mobility spectrometry (IMS) has gained more and more popularity over the last years. Due to its high sensitivity and resolving power on the one hand and its simplicity and robustness on the other, IMS has become a powerful tool for VOC based trace analyses . Moreover, gas chromatography coupled to ion mobility spectroscopy (GC-IMS) has been shown to be an easy-to-handle and yet highly effective tool for VOC profiling . The complexity of biological samples results from the presence of a variety of compounds, which in their entirety provide a characteristic GC-IMS spectrum, often referred to as the VOC profile or “fingerprint” . Due to the large amount of data obtained by VOC profiling based on GC-IMS, machine learning tools are required for data analysis. In literature, these are often differentiated into targeted screening and non-targeted screening (NTS) approaches. In short, targeted analysis is based on defining specific markers before analysis, while NTS of GC-IMS data usually does not require prior knowledge and the entire spectral fingerprint is subject to data analysis.
Since the 1970s, when IMS was first known as ‘plasma chromatography’, IMS has developed into a highly sensitive technique for the analysis of VOCs at ultra trace concentration levels, which accounts for additional information regarding the ion’s mobility . Due to the robust and easy-to-handle instrumentation, a wide range of application fields have been found for IMS today, such as food flavor analysis , process monitoring , and quality control , as well as detection and quantification of warfare agents  and explosives .
With IMS, analytes are first ionized in the ionization region of the instrument. The most common ionization method is the atmospheric pressure chemical ionization  by beta emitters, such as nickel-63  or the less hazardous beta-emitting tritium , or alpha-emitting americium-241 . Other ionization methods are atmospheric pressure photo ionization (APPI) , corona discharge (CD) atmospheric pressure chemical ionization , or laser desorption/ionization technique (LDI) .
According to the European Union directive, the exemption limit for the total activity of ionization sources was set to 1 GBq . Therefore, the usage of low-radiation tritium ion sources with an activity of 300 MBq or less is not subject to authorization, hence leading to a broad adoption of tritium ion sources in a number of commercially available systems on the market . Beta particles, which are emitted by the tritium source, initiate a gas-phase reaction cascade of the drift gas (nitrogen or air), resulting in predominant proton-water clusters H+[H2O]n, which are commonly referred to as ‘reactant ions’ . The number of water molecules (n) depends on the gas temperature and the moisture content of the gas atmosphere . Depending on the proton affinity, molecules entering the ionization region react with the reactant ions to protonated monomers MH+[H2O]n‑x, which leads to an intensity decrease of the reactant ion peak (RIP). At higher analyte concentrations, proton-bound dimers M2H+[H2O]m‑x are formed by the attachment of additional analyte molecules. When the concentration further increases, the formation of higher molecular cluster ions, such as trimers or tetramers, is possible; however, due to their low stability and short lifetime, higher molecular cluster ions are rarely observed . In general, nonlinear behaviors are observed for the ratio of the RIP and the distribution between the protonated monomer and the proton-bound dimer . The principles of a drift-time IMS including a tritium ionization source are shown in Figure 1 .
After ionization, the analyte ions are transferred into the drift region through a gating mechanism based on a charged electrode. For precise control of the ion pulse width admitted into the drift tube, complex gating systems, such as Bradbury-Nielsen or field switching shutters are employed . In the drift tube, ions are accelerated towards the detector, a Faraday plate, and subsequently separated by their drift time (or mobility) in an electrical field at ambient pressure. The ions are slowed down by collision with counter flowing drift gas molecules in the collision cross section (CCS). The equilibrium of the acceleration by an electric field and deceleration by collision with the drift gas molecules results in ions to move with a constant velocity towards the detector. Depending on mass, charge and spatial structure, the ions are separated in the drift tube and reach the detector at different drift times . The drift time may be used to calculate the reduced ion mobility K0 via the Mason-Schamb equation .
To avoid clustering in the ionization or drift region, IMS devices are commonly hyphenated to chromatographic techniques, such as liquid chromatography (LC) or gas chromatography (GC). Column separation coupled to drift time IMS separates analytes into two orthogonal “features”, the retention time through chromatography and following, drift time or mobility through IMS, resulting in a highly resolved, two dimensional (2D) GC-IMS spectrum . While hyphenation to LC is still more challenging, mainly due to the high dependence on sample preparation as a critical step for the data quality , GC-based IMS is typically easier to realize, mainly due to fact that the mobile phase is nearly inert. In particular, headspace (HS)-based techniques allow for the analysis of nearly untreated samples, which reduces the need for time-consuming sample pre-treatment steps  and furthermore, minimizes formation of handling-associated artifacts.
To address the large amount of data obtained by GC-IMS measurements, specific markers may be selected prior to data analysis. As prior knowledge is often included in the selection process, this strategy is commonly referred to as targeted screening. The markers used for targeted screening can be either handpicked or mathematically determined . One-way analysis of variance (ANOVA), using for example a Tukey’s test, is often applied to identify volatile compounds which exhibit significant differences, commonly quantified at a 5% significance level (p ≤ 0.05) . Various other methods, such as Gabor filters, local binary pattern, Haar, and histograms of oriented gradients (HOG), have been proposed for feature extraction . Chen and coworkers applied MPCA and HOG for feature extraction and data reduction of multicapillary column (MCC)-IMS data, with subsequent canonical discriminant analysis for the generation of nonlinear boundaries, for the successful quantification of the adulteration degree of canola oil. A predictive accuracy of more than 95.2% was reported for a partial least squares model, which was obtained using a train–test split of 70 to 30 . Using targeted approaches and applying principal component analysis together with a k-nearest neighbors classifier (PCA-kNN), the same authors reported a successful classification of rapeseed oils according to their quality (grade 1–4) and a successful determination of vegetable oil according to its botanical origin (sesame oil, canola oil, and camellia oil). For the classification of the canola oil quality, the colorized differences method was applied to capillary column (CC)-IMS data, resulting in 34 peaks of interest and a predictive accuracy of 100% . Furthermore, Otsu’s method and colorized differences method was used for automatic peak detection, resulting in 88 peaks of interest and a predictive accuracy of 98.3% for the classification of vegetable oil using MCC-IMS data . The advantage of preselecting markers with significant differences is the simultaneous reduction of noise in the data, which, however, includes the risk of overlooking valuable information. Still, a targeted approach might be preferable in cases where the identification and quantitation of few marker compounds is sufficient to characterize a product in its quality or state.
Targeted screening approaches, which focus on the detection, identification and quantitation of a particular compound or class of compounds, may lack the ability to detect anomalies in form of new or unknown compounds. However, new scientific findings continuously identify potential hazardous or allergenic compounds . Therefore, for systematic monitoring of product quality, it is therefore desirable to develop analytical methods capable of highlighting unknown or non-targeted compounds from the complex sample matrices. This approach, also referred to as NTS, requires comprehensive extraction and analysis of potential compounds of interest.
NTS aims to identify the compounds of unknown molecular composition. Similar to targeted approaches, the workflow for NTS generally consists of sample preparation, instrumental analysis, and post acquisition data processing . A detailed overview about the NTS-Workflow using GC-IMS is described by Capitain and Weller (2021) . The first step, data acquisition, involves sample preparation and subsequent extraction and separation of VOCs. Since little or no a priori knowledge of the chemical structures and behavior of compounds is required, NTS approaches benefit from gentle sample preparation, robust instrumental analysis, and standardized data processing. The collected data are then preprocessed and analyzed in the data-processing step. Since no pre-existing knowledge is used, the entire spectral fingerprint obtained by HS-GC-IMS analysis is subject to data analysis and classification or quantification models being built using machine learning tools. Exploratory methods, such as principal component analysis (PCA) are often used for data reduction and pattern recognition . The information extracted from a data matrix is explained by principal components (PCs), which are orthogonal (i.e. mathematically independent) to each other. Since PCA models are predicted without labels or validation by test data, they are generally considered unsupervised. Unsupervised statistical methods are exploratory methods that can be used to study data structures and search for clusters of samples . Hierarchical cluster analysis (HCA) of PCA models in a tree-like diagram (dendrogram) is, e.g. used for the visualization of multivariate association and sample similarities . An extension of PCA for processing three-dimensional (3D) data is provided by multiway principal component analysis (MPCA) , which has been applied for the feature extraction of GC-IMS matrices, without prior transformation of the 2D data .
Compared to unsupervised techniques, which provide predictions without labels or target variables, supervised techniques aim to build models able to predict target variables. In supervised learning, several data points or samples are described using predictor variables or features and target variables. For classification tasks, the scores obtained by the unsupervised exploratory analysis are combined with subsequent supervised pattern recognition techniques to distinguish samples according to defined categories. Among PCA-based qualitative methods are linear discriminant analysis (LDA) and k-nearest neighbors (kNN). Whereas PCA-LDA maximizes the interclass variance, kNN assigns the category most common among the k-nearest neighbors. The downside of using PCA-based methods is that the correlation between dependent and independent variables are not considered during PCA analysis, which can result in the loss of information included in higher PCs . An alternative is provided by partial least squares (PLS), where the scores are calculated by considering the relationship between the independent and dependent variables.
For quantification tasks, partial least squares regression (PLSR) has become the standard method used in chemometrics, including the fields of sensorial analysis in food chemistry . PLSR is used to describe the relationship between two data matrices, X (experimental data) and Y (actual concentrations), which are decomposed into X = TPT + E und Y = UQT + F, by finding the maximum covariance and linear relationship between the score matrices T and U. One limitation to PLSR is the typically lower performance on nonlinear and heteroscedastic data, as is partially the case for IMS data. Several studies have analyzed nonlinear IMS data . Nonlinear relationships between the matrices T and U can be described by kernel PLSR (also known as nonlinear PLSR), where the data are transformed into higher-dimensional spaces using the kernel trick .
In the final step of NTS, model interpretation, key compounds are identified, which can be extracted through back projection of loadings. Due to the complexity of biological matrices, the substance identification with GC-IMS data alone can be challenging, which is why stand-alone IMS are rarely used to investigate the sample composition. Complementary techniques, such as GC-MS  or 1H NMR , are often used to provide further insight into sample composition and to identify decisive marker substances . Lastly, the model coherency is evaluated and finally applied for benchtop profiling.
A plethora of studies have shown the potential of HS-GC-IMS in combination with NTS for monitoring food quality or confirmation of geographical or botanical origin, despite the complexity of the samples. For example, HS-GC-IMS with NTS has been widely applied for the classification of olive oil between high-priced type 1 extra-virgin olive oil (EVOO), medium-priced type 2 virgin olive oil (VOO or OO), and non-edible type 3 olive oil, also known as pomace olive oil (POO) or lampante (virgin) olive oil (L(V)OO) . Furthermore, HS-GC-IMS with NTS was successfully used for reliable classification of geographical origins for both olive oil (EVOO)  and wine . Moreover, HS-GC-IMS with NTS was applied for the classification of honey according to botanical origin , as well as for the detection and quantification of honey adulterated with sugar cane or corn syrups . Recently, HS-GC-IMS with NTS has been applied to assess the freshness of food  and for the detection of mold formation on milled rice , peanut kernels , and wheat kernels .
NTS approaches (spectral fingerprinting) have been directly compared to targeted approaches (extraction of specific markers) for the analysis of IMS data. Garrido-Delgado and coworkers compared targeted and NTS approaches for the classification of olive oil into EVOO, OO, and LVOO, using data obtained by MCC coupled to IMS . A PCA-LDA model was used for data reduction and data clustering, followed by kNN (k = 3) for classification, obtaining a prediction percentage of 79% for the targeted strategy and 85% for NTS strategy. For the classification of olive oil harvested in 2014–15, Contreras and coworkers obtained a prediction percentage of 56.9% for the targeted strategy and 67.8% for the NTS strategy . An improved prediction success was achieved for models built with olive oil samples from 2014–15 and 2015–16, obtaining 74.3% for the targeted strategy and 79.4% for the NTS, hence suggesting superior abilities of the NTS approach versus the selection of specific markers. By contrast, the authors also reported that a targeted model built with samples from the years 2014–15 (prediction success of 51.6%) was superior to the NTS approach (prediction success of 36.0%) when applied to the years 2015–16. Both models built with the data from the years 2014–15 show weak prediction abilities for the prediction of samples from the following year, revealing some fundamental challenges in data science: the predictive ability of a model is highly dependent on the number of samples as well as on the sample diversity. Both approaches include the risk of overfitting to a specific problem .
Arroyo-Manzanares and coworkers likewise obtained superior classification accuracy using HS-GC-IMS for a model based on a targeted marker selection (100%) compared to a model based on the whole spectral fingerprint (90%) for the distinction between dry-cured Iberian ham from pigs fattened on acorns and pasture or on feed . However, the model based on marker selection was built using orthogonal PLS-DA, while kNN (k = 3) and PCA-LDA were used for the model based on spectral fingerprints; hence, the differences in the predictability of the models may result from the use of different mathematical tools and do not provide inferences about targeted and non-targeted approaches. Gu and coworkers by contrast obtained better classification results with the NTS versus a targeted approach for distinguishing between fungal infections of wheat kernels, as well as for the quantification of fungal colony counts . In conclusion, NTS and targeted screening approaches are both effective tools for data analysis, with different challenges and application areas.