Machine Learning in Molecular Design for Fragrance Molecules

Please note this is a comparison between Version 2 by Jason Zhu and Version 1 by Nishanth Gopalakrishnan Chemmangattuvalappil.

The demand for novel flavour and fragrance (F&F) molecules has boosted the need for a systematic approach to designing them. However, the F&F industry still relies heavily on experimental approaches or on existing databases, without considering the effect of changes in concentration, which could cause potential fragrances to be overlooked. Computer-aided molecular design (CAMD) has great potential to identify novel molecular structures to be used as fragrances.

fragrance molecules
computer-aided molecular design
rough sets

1. Introduction

Fragrances are applied extensively as an attractive attribute in the formulation of many consumer products. The global flavours and fragrances (F&F) market size is expected to expand from the original value of USD 26.54 billion (2022) to USD 36.49 billion (2029) at a compounded annual growth rate of 4.7% ^[1]. The demand for novel fragrance molecules in the industry is greater than ever due to the stricter safety and environmental (e.g., biodegradability) regulations, which have led to the obsolescence of some existing products ^[2]. Unlike other senses, olfaction is poorly understood ^[2]. The design of fragrance molecules still heavily depends on empirical methods, either referring to the knowledge from experts or through experiments. This trial-and-error approach is too tedious to allow the exploration of all potential candidates, as fragrance molecules have complex structures. Thus, there is a risk of missing better fragrance molecules that have the potential to be incorporated into consumer products ^[3]. The conventional method is a resource-intensive process, which makes launching a new fragrance molecule costly and time-consuming ^[2]. Moreover, most of the fragrances’ odour descriptions in established databases are reported without the indication of concentration ^[2]. This could be another hurdle as the concentration of fragrance required in various products might be different.

To address the challenges involved in the design of fragrance molecules, a systematic framework should be developed for designing and screening suitable fragrances that fulfil the product’s requirements before experimental verification. The computer-aided molecular design (CAMD) approach is a potential tool for the screening and/or design of fragrance molecules, as it predicts molecular structures from a set of desired sensorial and technical properties. However, a prerequisite for CAMD modelling is the availability of property prediction models. Perceived odours are determined by the structure of a fragrance molecule, which can be described using structural, geometrical, topological, physicochemical, and electronic descriptors ^[4]. Hence, machine learning (ML) tools have the potential to develop prediction models that link molecular structure to properties, using topological indices as the numerical representation of the structure.

2. Computer-Aided Molecular Design (CAMD)

CAMD is a reverse engineering approach to screening novel chemicals by combining structural groups systematically to yield high-performance molecules ^[5]. CAMD requires property prediction models, such as group contribution (GC) models. GC methods assume that the properties of a molecule can be estimated from the number of occurrences of different sub-structures, known as “groups”. In addition to GC methods, topological indices (TIs), a class of structural descriptors, have been employed in quantitative structure-property relationship (QSPR) models for property estimation. Common TIs, such as connectivity indices and shape indices, can be used to differentiate very similar structures like isomers ^[6].

CAMD is applied widely in applications related to solvent design ^[7] and integrated process and product design problems ^[8]. In recent years, there have been several developments in the application of these tools in the field of product development as well. Liu et al. ^[9] coupled an ML-based atom contribution (MLAC) method with CAMD to predict surface charge density profiles and design a solvent for ibuprofen with improved economic, safety, health, and environmental performance. An artificial neural network model was utilised to generate a structure-odour relationship (SOR) model for aroma mixtures, using molecular surface charge density profiles (σ-profiles) as the descriptors ^[10]. CAMD was also employed for the identification of potential solvent candidates that allow bio-oil to satisfy targeted properties with minimal solvent addition ^[11]. Moreover, Yee et al. ^[12] developed a framework for personal care product design by incorporating safety, health, and performance aspects in CAMD. By imposing constraints on safety and health hazards in CAMD, the molecules generated were less harmful while still delivering excellent product performance. Some recent works in the CAMD field relate to fragrance products. MILP/MINLP models for the design and screening of fragrances in shampoo were developed by Zhang et al. ^[3]. The CAMD model was utilised to eliminate molecules that fall outside the constraints and property ranges specified for fragrance molecule design. In addition, fragrances in body lotion were modelled using rules generated with an enhanced hyperbox ML approach coupled with CAMD ^[13]. The hybrid CAMD framework was able to produce a variety of viable compounds that met all structural and physical property requirements. In both works, CAMD proved effective in developing potential fragrance molecules for consumer products. Comprehensive reviews of the latest developments in this field can be found in the articles by Chemmangattuvalappil ^[14] and Zhang et al. ^[15].

A recent contribution demonstrated that rough set-based machine learning (RSML) can be used to develop a model that predicts the fragrance of molecules, and applied the resulting model to identify novel fragrant molecules ^[16]. In this previous work, a single type of molecular descriptor, the molecular signature, was used to build the predictive model for fragrance. However, a single descriptor cannot capture all the relevant molecular characteristics. Moreover, the predictive model was built on the presence or absence of certain molecular signatures. The shortcoming of such an approach is that typical databases contain many different types of molecules, with very few signatures shared among them. Therefore, the model had to be developed using a very small subset of the database. While this approach can develop models with a low number of false positives, it leads to a high percentage of false negatives. Finally, the dilution of the fragrance molecule was not incorporated in the development of the model, even though fragrance molecule databases show that the same molecule can possess different fragrance characteristics at different concentrations. To address the limitations of the previous RSML approach, there is a need for a model that uses multiple molecular descriptors covering a variety of structural characteristics and that makes fuller use of the available data. The approach developed in this work attempts to address these research gaps.

To conclude, CAMD is an important approach to expanding the portfolio of chemical product design. Prediction models for scent and physical properties must be available so that the desired attributes can be incorporated as constraints in CAMD. However, due to the lack of established mechanistic odour predictive models, it is necessary to develop an empirical model for aroma using ML. This approach can generate models from data by detecting and summarising the underlying patterns. The potential of ML to generate odour predictive models can address the inherent lack of understanding of the olfaction process.

3. Topological Indices (TIs)

In general, group contribution (GC) models are extensively applied to describe pure component properties based on molecular structure. However, additive group contribution methods cannot distinguish the position of a group within a molecule. Even a small difference in group position between isomers might affect the odour characteristics of the molecules ^[17]. Since fragrances are made up of multiple building blocks, there should be other structural attributes that contribute to fragrance in addition to the groups ^[3]. Thus, topological indices, among the most widely used descriptors of chemical structure, have been used in this study to relate molecular structure to fragrance.
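
The limitation of additive group counting can be illustrated with a minimal sketch. The contribution values and the constant term below are hypothetical placeholders chosen only for illustration and are not taken from any published GC model; the two molecules are positional isomers that share the same first-order groups.

```python
from collections import Counter

# Hypothetical group contributions (illustrative numbers only, not taken
# from any published GC model).
GROUP_CONTRIBUTIONS = {"CH3": 1.5, "CH2": 1.0, "CH": 0.5, "OH": 4.0}
BASE_VALUE = 10.0  # hypothetical constant term


def gc_estimate(groups):
    """Estimate a property as base + sum(count_i * contribution_i)."""
    counts = Counter(groups)
    return BASE_VALUE + sum(
        n * GROUP_CONTRIBUTIONS[g] for g, n in counts.items()
    )


# First-order groups of two positional isomers (hydroxyl on C2 vs. C3).
pentan_2_ol = ["CH3", "CH", "OH", "CH2", "CH2", "CH3"]
pentan_3_ol = ["CH3", "CH2", "CH", "OH", "CH2", "CH3"]

estimate_2 = gc_estimate(pentan_2_ol)
estimate_3 = gc_estimate(pentan_3_ol)
print(estimate_2, estimate_3)  # same value: group position is not encoded
```

Because both isomers contain the same multiset of groups, any purely additive scheme of this kind returns the same estimate for them, whatever the contribution values are.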

Topological indices (TIs) are molecular structure descriptors generated from the molecular graph of a chemical, which characterises its topology. A large number of topological indices exist, and they can be categorised into a few groups, such as degree-based, spectrum-based and distance-based indices ^[18]. Representing chemical species using TIs is convenient because they encode the topological structure in a mathematical form. TIs are applied extensively in developing QSPRs, which are mathematical correlations between molecular structures and molecular properties ^[19]. For instance, TIs were utilised in QSPR modelling to predict the biodegradability of molecules for the development of safer fragrance molecules ^[20]. The results showed that two TIs in particular contribute to the biodegradability of the molecules studied.
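
As a rough sketch of how a TI encodes topology, the example below computes two classical indices on hydrogen-suppressed molecular graphs: the first-order (Randić) connectivity index, which is degree-based, and the Wiener index, which is distance-based. The atom labels and helper names are assumptions made for this illustration, and these simple forms ignore atom types, so the oxygen atom is treated like any other vertex.

```python
from collections import deque
from math import sqrt


def adjacency(edges):
    """Build an adjacency map from a list of bonds (vertex pairs)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def randic_index(edges):
    """First-order connectivity index: sum over bonds of 1/sqrt(d_u * d_v)."""
    adj = adjacency(edges)
    return sum(1.0 / sqrt(len(adj[u]) * len(adj[v])) for u, v in edges)


def wiener_index(edges):
    """Sum of shortest-path distances over all unordered vertex pairs."""
    adj = adjacency(edges)
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from each vertex
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
    return total // 2  # every pair was counted twice


# Hydrogen-suppressed graphs of two positional isomers.
pentan_2_ol = [("C1", "C2"), ("C2", "O"), ("C2", "C3"), ("C3", "C4"), ("C4", "C5")]
pentan_3_ol = [("C1", "C2"), ("C2", "C3"), ("C3", "O"), ("C3", "C4"), ("C4", "C5")]

for name, graph in [("pentan-2-ol", pentan_2_ol), ("pentan-3-ol", pentan_3_ol)]:
    print(name, round(randic_index(graph), 4), wiener_index(graph))
```

The two isomers receive different values for both indices, which is the kind of positional information that a plain group count does not carry.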

In a related study, De Mello Castanho Amboni et al. ^[21] explained that structural parameters, including TIs, are related to the odour of aliphatic esters. From the QSAR study, it is notable that TIs such as the electrotopological state index and the second-order shape index, Kappa 2, are relevant molecular descriptors for odour prediction. By contrast, the study conducted by Chacko et al. ^[22] has shown that the third-order shape index, Kappa 3, is one of the most crucial TIs for the categorisation of distinct odours. According to the study by Ham and Jurs ^[23], the first-order chi connectivity index and molar refractivity are the characteristics that distinguish musks from non-musks. Therefore, several TIs are used in this study for the development of odour-predictive models, as they can shed light on the structure-odour relationship of fragrance molecules. Since there are no comprehensive predictive models for fragrance prediction, machine learning approaches have been explored to relate topological indices to olfaction.

4. Rough Set-Based Machine Learning (RSML)

ML is a subset of artificial intelligence (AI) and consists of techniques to discover patterns in data, which can then be used for prediction or other related tasks ^[24]. Artificial neural networks (ANNs) and support-vector machines (SVMs) are particularly versatile and popular supervised ML techniques ^[25]. Despite the extensive applications of SVMs and ANNs in QSPR, QSAR, and GC modelling, their black-box nature is a crucial weakness. The outputs of ANNs and SVMs cannot be translated into insights easily, making it difficult to justify the decisions provided by the algorithms ^[26]. This lack of inherent interpretability can only be addressed using additional algorithms ^[27]. One alternative approach is the utilisation of inherently interpretable models ^[28]. For example, hyperbox and RSML techniques can generate rule-based predictive models that are directly interpretable because they readily map to human thought processes. Because of this feature, they are better alternatives for the prediction of olfaction characteristics. Hyperbox ML has significant potential due to its ability to identify disjoint data regions accurately and intuitively ^[29]. However, it faces computational challenges with large datasets containing imperfections (e.g., non-deterministic patterns). On the other hand, RSML has advantages for the determination of more odour characteristics. RSML has proven to be especially robust in dealing with vagueness, imprecision, inconsistency and uncertainty in datasets ^[30].

Rough set theory (RST), first introduced by Pawlak ^[31], is built on the key concept of rough equality of sets in a given approximation space. An approximation space is a pair $(U, R)$, where $U$ is a set known as the universe and $R \subset U^{2}$ is an indiscernibility relation ^[31]. In RST, any vague concept is replaced by a pair of precise concepts, known as the lower and upper approximations of the vague concept ^[32]. The major advantage of RST is that no preliminary or additional information about the data is required ^[33]. RST has been applied in the areas of decision making, pattern recognition and knowledge acquisition. The early applications of rough set theory were mainly in the medical field, including clinical data reduction and decision-making ^[34], rough classification of highly selective vagotomy (HSV) patients ^[35], and reduction of information systems for medical diagnosis ^[36]. Recently, RSML has been employed to identify secure geological reservoirs, minimising the unintended release of CO₂, by analysing data from secure and insecure CO₂ storage sites. The results showed that the prediction models generated by RSML are comparable with site selection rules constructed from expert knowledge ^[37]. In addition, RST was utilised as the front-end processor for deep learning to remove redundant influencing factors and to identify the critical factors of building energy consumption ^[38].
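
The lower and upper approximations mentioned above can be sketched on a toy example. The molecule names, the discretised descriptor values and the "fruity" label below are hypothetical and serve only to show the mechanics: objects with identical attribute values are indiscernible, the lower approximation collects the blocks that lie entirely inside the target concept, and the upper approximation collects the blocks that overlap it.

```python
from collections import defaultdict

# Hypothetical, discretised descriptor values for six molecules.
TABLE = {
    "m1": {"chi1": "low", "kappa2": "low"},
    "m2": {"chi1": "low", "kappa2": "low"},
    "m3": {"chi1": "low", "kappa2": "high"},
    "m4": {"chi1": "low", "kappa2": "high"},
    "m5": {"chi1": "high", "kappa2": "high"},
    "m6": {"chi1": "high", "kappa2": "low"},
}
FRUITY = {"m1", "m2", "m3"}  # hypothetical target concept


def indiscernibility_classes(table, attributes):
    """Group objects that share the same values on the given attributes."""
    blocks = defaultdict(set)
    for obj, row in table.items():
        blocks[tuple(row[a] for a in attributes)].add(obj)
    return list(blocks.values())


def approximations(table, attributes, target):
    """Return (lower, upper) approximations of the target set."""
    lower, upper = set(), set()
    for block in indiscernibility_classes(table, attributes):
        if block <= target:   # block certainly belongs to the concept
            lower |= block
        if block & target:    # block possibly belongs to the concept
            upper |= block
    return lower, upper


lower, upper = approximations(TABLE, ["chi1", "kappa2"], FRUITY)
boundary = upper - lower
print(sorted(lower), sorted(upper), sorted(boundary))
```

In this toy table, m3 and m4 have identical descriptor values but different labels, so they fall into the boundary region: the available attributes cannot decide whether such a molecule is fruity.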

The key concept of RST is the indiscernibility relation, which can be represented using an information table. The table is also known as an information system or attribute-value table, and consists of objects and their corresponding attributes ^[32]. The attributes comprise conditional attributes (inputs) and decision attributes or classes of the object (outputs). An information system is defined by a pair $(U, A)$, where $U$ is the finite nonempty set of objects (universe) and $A$ is the set of the objects’ attributes. Every attribute $a \in A$ has a value set $V_{a}$, as shown in Equations (1) and (2) ^[39], where $C$ is the set of conditional attributes, and $D$ is the decision attribute.
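
For reference, the definitions in this paragraph are usually written in the standard rough set notation as follows. This is a sketch of the conventional form, and it is an assumption that the source equations take exactly this shape.

```latex
\begin{align}
a &: U \rightarrow V_{a}, \qquad \forall a \in A \\
A &= C \cup D, \qquad C \cap D = \emptyset
\end{align}
```

The first line states that each attribute maps every object to a value in its value set, and the second that the attribute set of a decision table splits into conditional and decision attributes.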

Furthermore, RST also enables the identification of reducts, defined as a minimal subset of attributes that preserve the indiscernibility relation. In the context of RSML, a reduct is a reduced set of attributes that can be used to generate a rule-based model. It should be noted that there may be more than one reduct set in a single dataset. Therefore, further analysis is required to determine which reduct can generate more feasible rules. Another important concept in RST is the intersection of all reducts, which is known as the core. It is the most important subset of attributes that contribute to classification accuracy ^[32].
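
The notions of reduct and core can be illustrated with a brute-force sketch on a small table. The descriptor names and their discretised values are hypothetical, and exhaustive enumeration of attribute subsets is used here only for clarity; it would not scale to realistic numbers of descriptors.

```python
from itertools import combinations

# Hypothetical, discretised descriptor table (kappa3 duplicates kappa2).
TABLE = {
    "x1": {"chi1": 0, "kappa2": 0, "kappa3": 0},
    "x2": {"chi1": 0, "kappa2": 1, "kappa3": 1},
    "x3": {"chi1": 1, "kappa2": 0, "kappa3": 0},
    "x4": {"chi1": 1, "kappa2": 1, "kappa3": 1},
}
ATTRIBUTES = ("chi1", "kappa2", "kappa3")


def partition(table, attributes):
    """Partition of the objects induced by indiscernibility on attributes."""
    blocks = {}
    for obj, row in table.items():
        blocks.setdefault(tuple(row[a] for a in attributes), set()).add(obj)
    return {frozenset(block) for block in blocks.values()}


def find_reducts(table, attributes):
    """Minimal attribute subsets that induce the same partition as all attributes."""
    full = partition(table, attributes)
    found = []
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            if any(set(r) <= set(subset) for r in found):
                continue  # contains a smaller reduct, so it is not minimal
            if partition(table, subset) == full:
                found.append(subset)
    return found


def find_core(reducts):
    """Attributes shared by every reduct."""
    return set.intersection(*(set(r) for r in reducts)) if reducts else set()


reducts = find_reducts(TABLE, ATTRIBUTES)
core = find_core(reducts)
print(reducts, core)
```

Here two reducts exist because kappa2 and kappa3 carry the same information, while chi1 appears in both and therefore forms the core.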

For every information system, there is a set of decision rules known as a decision algorithm. Each decision algorithm reveals certain properties that fulfil both the total probability theorem and Bayes’ theorem ^[40]. Hence, these properties provide a new method for drawing conclusions from the data by using three terms, namely strength $(\sigma_{x})$, certainty $(cer_{x})$ and coverage $(cov_{x})$, as presented in Equations (3)–(5). Let $S = (U, C, D)$ be a decision table, where $C$ is the set of conditions and $D$ is the set of decisions ^[33].
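
In Pawlak's formulation of decision rules, these three quantities are commonly written as below. The symbols $C(x)$ and $D(x)$, denoting the sets of objects that share the condition values and the decision value of object $x$, are assumptions about the intended notation, and the equation numbers are matched to the references in the text.

```latex
\begin{align}
\sigma_{x}(C, D) &= \frac{|C(x) \cap D(x)|}{|U|} \tag{3} \\
cer_{x}(C, D) &= \frac{|C(x) \cap D(x)|}{|C(x)|} \tag{4} \\
cov_{x}(C, D) &= \frac{|C(x) \cap D(x)|}{|D(x)|} \tag{5}
\end{align}
```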

The strength is the number of samples that follow the generated rule divided by the total number of samples. The certainty factor is the frequency of samples having the decision, $D$, among the samples that fulfil the conditions, $C$. Lastly, the coverage factor is the frequency of samples possessing the conditions, $C$, within the decision class. The certainty factor measures the predictive reliability of a rule, whilst the coverage factor measures its generalisation power. A higher certainty indicates a lower chance of a molecule being misclassified, whereas a high coverage suggests that a rule is a good approximation of an underlying general principle. These three parameters provide quantitative evidence to help select the most useful rule-based models.
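
The three measures described above can be computed directly by counting rows of a decision table. The sketch below uses a hypothetical table of eight molecules with one discretised descriptor and one odour label, and evaluates the illustrative rule "IF chi1 is high THEN odour is musk"; the data and the rule are invented for this example.

```python
# Hypothetical decision table: one condition attribute, one decision attribute.
TABLE = (
    [{"chi1": "high", "odour": "musk"}] * 3
    + [{"chi1": "high", "odour": "other"}] * 1
    + [{"chi1": "low", "odour": "musk"}] * 1
    + [{"chi1": "low", "odour": "other"}] * 3
)


def matches(row, pattern):
    """True if the row has every attribute value required by the pattern."""
    return all(row[key] == value for key, value in pattern.items())


def rule_metrics(table, conditions, decision):
    """Return (strength, certainty, coverage) of the rule conditions -> decision."""
    n_conditions = sum(1 for row in table if matches(row, conditions))
    n_decision = sum(1 for row in table if matches(row, decision))
    support = sum(
        1 for row in table if matches(row, conditions) and matches(row, decision)
    )
    strength = support / len(table)
    certainty = support / n_conditions if n_conditions else 0.0
    coverage = support / n_decision if n_decision else 0.0
    return strength, certainty, coverage


strength, certainty, coverage = rule_metrics(
    TABLE, {"chi1": "high"}, {"odour": "musk"}
)
print(strength, certainty, coverage)
```

For this table, three of the eight molecules follow the rule, three of the four molecules with a high index are musks, and three of the four musks have a high index, so a rule can be compared with alternatives on all three counts.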

References

Fortune Business Insights. Flavors and Fragrances Market Size, Share Report (2021–2028). 2021. Available online: https://www.fortunebusinessinsights.com/flavors-and-fragrances-market-102329 (accessed on 3 April 2022).
Sell, C.S. Chemistry and the Sense of Smell; John Wiley & Sons, Incorporated: Somerset, CA, USA, 2014.
Zhang, L.; Mao, H.; Liu, L.; Du, J.; Gani, R. A machine learning based computer-aided molecular design/screening methodology for fragrance molecules. Comput. Chem. Eng. 2018, 115, 295–308.
Korichi, M.; Gerbaud, V.; Floquet, P.; Meniai, A.H.; Nacef, S.; Joulia, X. Quantitative structure-Odor relationship: Using of multidimensional data analysis and neural network approaches. Comput. Aided Chem. Eng. 2006, 21, 895–900.
Linke, P.; Kokossis, A. Simultaneous Synthesis and Design of Novel Chemicals and Chemical Process Flowsheets. In European Symposium on Computer Aided Process Engineering-12; Grievink, J., van Schijndel, Eds.; Elsevier: Amsterdam, The Netherlands, 2002; Volume 10, pp. 115–120.
Austin, N.D.; Sahinidis, N.V.; Trahan, D.W. Computer-aided molecular design: An introduction and review of tools, applications, and solution techniques. Chem. Eng. Res. Des. 2016, 116, 2–26.
Zhou, T.; Zhou, Y.; Sundmacher, K. A hybrid stochastic–deterministic optimization approach for integrated solvent and process design. Chem. Eng. Sci. 2017, 159, 207–216.
Chemmangattuvalappil, N.G.; Ng, D.K.S.; Ng, L.Y.; Ooi, J.; Chong, J.W.; Eden, M.R. A review of process systems engineering (PSE) tools for the design of ionic liquids and integrated biorefineries. Processes 2020, 8, 1678.
Liu, Q.; Zhang, L.; Tang, K.; Liu, L.; Du, J.; Meng, Q.; Gani, R. Machine learning-based atom contribution method for the prediction of surface charge density profiles and solvent design. AIChE J. 2021, 67, e17110.
Zhang, L.; Mao, H.; Zhuang, Y.; Wang, L.; Liu, L.; Dong, Y.; Du, J.; Xie, W.; Yuan, Z. Odor prediction and aroma mixture design using machine learning model and molecular surface charge density profiles. Chem. Eng. Sci. 2021, 245, 116947.
Mah, A.X.Y.; Chin, H.H.; Neoh, J.Q.; Aboagwa, O.A.; Thangalazhy-Gopakumar, S.; Chemmangattuvalappil, N.G. Design of bio-oil additives via computer-aided molecular design tools and phase stability analysis on final blends. Comput. Chem. Eng. 2019, 123, 257–271.
Yee, Q.Y.; Hassim, M.H.; Chemmangattuvalappil, N.G.; Ten, J.Y.; Raslan, R. Optimization of quality, safety and health aspects in personal care product preservative design. Process Saf. Environ. Prot. 2022, 157, 246–253.
Ooi, Y.J.; Aung, K.N.G.; Chong, J.W.; Tan, R.R.; Aviso, K.B.; Chemmangattuvalappil, N.G. Design of fragrance molecules using computer-aided molecular design with machine learning. Comput. Chem. Eng. 2022, 157, 107585.
Chemmangattuvalappil, N.G. Development of solvent design methodologies using computer-aided molecular design tools. Curr. Opin. Chem. Eng. 2020, 27, 51–59.
Zhang, L.; Mao, H.; Liu, Q.; Gani, R. Chemical product design—Recent advances and perspectives. Curr. Opin. Chem. Eng. 2020, 27, 22–34.
Radhakrishnapany, K.T.; Wong, C.Y.; Tan, F.K.; Chong, J.W.; Tan, R.R.; Aviso, K.B.; Janairo, J.I.B.; Chemmangattuvalappil, N.G. Design of fragrant molecules through the incorporation of rough sets into computer-aided molecular design. Mol. Syst. Des. Eng. 2020, 5, 1391–1416.
Brookes, J.C.; Horsfield, A.P.; Stoneham, A.M. Odour character differences for enantiomers correlate with molecular flexibility. J. R. Soc. Interface 2009, 6, 75–86.
Islam, T.U.; Mufti, Z.S.; Ameen, A.; Aslam, M.N.; Tabraiz, A. On Certain Aspects of Topological Indices. J. Math. 2021, 2021, 9913529.
Dearden, J.C. The use of topological indices in QSAR and QSPR modeling. Chall. Adv. Comput. Chem. Phys. 2017, 24, 57–88.
Blay, V.; Gullón-Soleto, J.; Gálvez-Llompart, M.; Gálvez, J.; García-Domenech, R. Biodegradability Prediction of Fragrant Molecules by Molecular Topology. ACS Sustain. Chem. Eng. 2016, 4, 4224–4231.
Amboni, R.D.D.C.; Junkes, B.D.; Yunes, R.A.; Heinzen, V.E.F. Quantitative structure—Odor relationships of aliphatic esters using topological indices. J. Agric. Food Chem. 2000, 48, 3517–3521.
Chacko, R.; Jain, D.; Patwardhan, M.; Puri, A.; Karande, S.; Rai, B. Data based predictive models for odor perception. Sci. Rep. 2020, 10, 17136.
Ham, C.L.; Jurs, P.C. Structure-activity studies of musk odorants using pattern recognition: Monocyclic nitrobenzenes. Chem. Senses 1985, 10, 491–505.
Belyadi, H.; Haghighat, A. Introduction to machine learning and Python. In Machine Learning Guide for Oil and Gas Using Python; Elsevier: Amsterdam, The Netherlands, 2021; pp. 1–55.
Dey, A. Machine Learning Algorithms: A Review. Int. J. Comput. Sci. Inf. Technol. 2016, 7, 1174–1179.
Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215.
Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 4768–4777.
Dobbelaere, M.R.; Plehiers, P.P.; van de Vijver, R.; Stevens, C.V.; van Geem, K.M. Machine Learning in Chemical Engineering: Strengths, Weaknesses, Opportunities, and Threats. Engineering 2021, 7, 1201–1211.
Xu, G.; Papageorgiou, L.G. A mixed integer optimisation model for data classification. Comput. Ind. Eng. 2009, 56, 1205–1215.
Zhang, Q.; Xie, Q.; Wang, G. A survey on rough set theory and its application. CAAI Trans. Intell. Technol. 2016, 1, 323–333.
Pawlak, Z. Rough sets. Int. J. Comput. Inf. Sci. 1982, 11, 341–356.
Pawlak, Z. Rough set approach to knowledge-based decision support. Eur. J. Oper. Res. 1997, 99, 48–57.
Pawlak, Z. Some issues on rough sets. Lect. Notes Comput. Sci. 2004, 3100, 1–58.
Mohamed, A.S.A. Application of rough set theory for clinical data analysis: A case study. Math. Comput. Model. 1991, 15, 19–37.
Słowiński, K. Rough Classification of HSV Patients. Intell. Decis. Support. 1992, 11, 77–93.
Tanaka, H.; Ishibuchi, H.; Matsuda, N. Fuzzy Expert System Based on Rough Sets and Its Application to Medical Diagnosis. Int. J. Gen. Syst. 1992, 21, 83–97.
Aviso, K.B.; Janairo, J.I.B.; Promentilla, M.A.B.; Tan, R.R. Prediction of CO2 storage site integrity with rough set-based machine learning. Clean Technol. Environ. Policy 2019, 21, 1655–1664.
Lei, L.; Chen, W.; Wu, B.; Chen, C.; Liu, W. A building energy consumption prediction model based on rough set theory and deep learning algorithms. Energy Build. 2021, 240, 110886.
Raza, M.S.; Qamar, U. Rough Set Theory. In Understanding and Using Rough Set Based Feature Selection: Concepts, Techniques and Applications; Springer: Singapore, 2017; pp. 53–79.
Pawlak, Z. Rough sets, decision algorithms and Bayes’ theorem. Eur. J. Oper. Res. 2002, 136, 181–189.