Technique for Searching Data | Encyclopedia MDPI

Technique for Searching Data: Comparison

Please note this is a comparison between Version 3 by Vitalii Yesin and Version 4 by Fanny Huang.

The growing popularity of data outsourcing to third-party cloud servers has a downside, related to the serious concerns of data owners about their security due to possible leakage. The desire to reduce the risk of loss of data confidentiality has become a motivating start to developing mechanisms that provide the ability to effectively use encryption to protect data. However, the use of traditional encryption methods faces a problem. Namely, traditional encryption, by making it impossible for insiders and outsiders to access data without knowing the keys, excludes the possibility of searching.

database
security
database management system (DBMS)
confidentiality
encryption

1. Introduction

Today, storing and processing data on third-party remote cloud servers is widely used, showing explosive growth ^[1][ 1 ]. However, as the scale, value, and centralization of data increases, the reverse side of this process is revealed — the problems of ensuring the security and privacy of data are aggravated, which causes serious concern for owners and users of data. There is an identified risk that data stored in databases may be compromised [ 2 ^[2]], and this, in accordance with various international laws and standards such as: General Data Protection Regulation (GDPR ^[3][ 3 ], Payment Card Industry Data Security Standard (PCI DSS) [ 4 ^[4]], the Health Insurance Portability and Accountability Act (HIPAA) [ ^[5],5 ] and some others, cannot be allowed. The owner of the data must be sure that the data stored on the third party remote servers of the service provider are protected from theft by outsiders. Moreover, these data must be protected even from the service provider itself (a valid user, known as an insider), if the respective provider cannot be trusted.

As you know, one of the fundamental solutions to this problem is the use of relevant cryptographic methods and primitives. Encryption is the standard approach to providing data confidentiality that is outsourced to so-called honest-but-curious cloud servers. Encryption makes it impossible for both insiders and outsiders to access data without knowing the keys. However, encryption also has a downside. The direct use of traditional data encryption/decryption approaches in most cases makes it difficult to perform search operations in encrypted data [ ^[6][7][8]6 , 7 , 8 ]. A simple solution to this problem is to download the entire dataset of the corresponding storage, then decrypt it locally and search for the required data. This approach creates serious performance issues that negate the benefits of outsourcing, making it unacceptable for most applications. The other method allows the server to decrypt the data, execute the query on the server side, and send only the results back to the user. However, in this case, the level of security is reduced, since data protected by encryption can potentially become available to the service provider (privileged user). Therefore, it is desirable to support the fullest possible server-side search functionality with the least possible loss of data confidentiality. In particular, a secure search system should aim to ensure that the service provider does not learn anything about the data stored in the secure database or about the queries, and the requester of the relevant data (querier) learns nothing, except for the query results [ ^[2]2 ].

The problem of searching data in encrypted databases has aroused great interest both in the scientific community and in industry ^[9][ 9 ]. To solve the problem of providing a search in cryptographically protected databases, relevant studies were carried out related to the development of new cryptographic primitives, new data structures for searchable encryption, and the development of views on security ^[6][10][6 , 10 ] . The solutions available today for searching encrypted data combine non-trivial ideas from cryptography, from the main provisions of the theory of algorithms and data structures, information search, and databases [2 , ^[2][6][11]6 , 11 ] . However, despite the wide variety of options offered, there is no dominant solution for all use cases. The goal of a security plan, according to Andress ^[12][ 12 ], is to find the balance between protection, usability, and cost. Similar views are held by Fuller et al. [ 2 ^[2]], who believe that designing a protected search system is a balance between security, functionality, performance, and usability. Therefore, it is important for data owners and users to understand how a fairly wide range of secure database systems are offered for their various applications and what compromises are acceptable for their respective use case. All this has stimulated research in the field of secure data management and increased its relevance.

2. A Bbrief Ssurvey of Ttechnique for Ssearching Ddata in a Ccryptographically Pprotected Ddatabase

Security, as is known, is associated with information that, during the operation of searchable encryption schemes, is revealed or leaked to an attacker who has access to the database server. Bösch et al. ^[6][ 6 ] believe that information leakage is possible in such schemes, which can be divided into three groups:

(a): index information (refer

index information (refers to the information about the keywords contained in the index);

search pattern (information that can be obtained by knowing whether two search results refer to the same keyword);

an access pattern (refers to information that is implied by the query (search) results, namely which documents contain the requested keyword for each of the queries [ 13 ] or which document identifiers match the query [ 14 ]).

Bös to the information about the keywords contained in the index);

(b)

search path etern (information that can be obtained by knowing whether two search results refer to the same keyword);

(c)

an access pattern (refers to information that is implied by the query (search) results, namely which documents contain the requested keyword for each of the queries ^[13] or which document identifiers match the query ^[14])al.

Bösch et[ al.6 ^[6]] note that in many schemes, there is leakage of at least the search pattern and the access pattern. At that, identifying the search pattern may not be a problem in some scenarios, whereas for others it is unacceptable. For example, in a medical database, disclosing a search pattern through statistical analysis (which allows an attacker to get full information about the plaintext keywords) can lead to the leakage of a large amount of information. This information can be used to match it with other (anonymous) public databases.

Fuller et al. [ 2 ^[2]] distinguish two types of entities that can pose a security threat to a database: a valid user known as an insider who performs one or more roles and an outsider. The latter can monitor and potentially modify network interactions between valid users, separating attackers into those that persist for the lifetime of the database and those that obtain a snapshot at a single point in time. At that, attackers are divided into those that persist for the lifetime of the database and those that obtain a snapshot at one a single point in time ^[15][ 15 ]. In addition, Fuller et al. [ ^[2]2 ] differentiate attackers into those who are: semi-honest (or honest-but-curious), i.e., those who follow the prescribed protocols, but may try to get additional information from data that they observe; and malicious, that is, those that actively perform actions aimed at obtaining additional information or influencing the operation of the system. They also note that much of the active research in protected search technology considers semi-honest security against a persistent insider adversary. At that, special attention is paid to such types of objects within a protected search system that are vulnerable to leaks, such as: (a) data items and any indexing data structures; (b) queries; (c) records returned in response to queries or other relationships between data items and queries; (d) access control rules and the results of their application.

The cryptographic community has developed several common primitives:

−

fully homomorphic encryption ^{[16][17][18][19]}[ 16 , 17 , 18 , 19 ],

−

functional encryption [ 20 , ^[20][21]21 ] with its subclasses and earlier representatives:

−

predicate encryption [ 22 , 23 ^[22][23]],

−

identity-based encryption [ ^[24]24 ],

−

attribute-based encryption ^[25][ 25 ]

and some others that completely or partially solve the problem of searching in a secure database. Protected search techniques are often based on these primitives, but rarely rely solely on one of them. Instead, they tend to use specialized protocols, often with some leakage in order to improve performance [ ^[2]2 ].

One possible approach to reduce the damage caused by a server compromise is to encrypt sensitive data and run all computation (application logic) on the clients. However, as noted by Popa et al. ^[26],[ 26 ] some important applications are not suitable for this approach. For example, database-backed websites that process queries to generate data for the user, and applications that compute large amounts of data. Another possible approach is the use of such theoretical solutions as fully homomorphic encryption (FHE) ^{[16][17][18][19]}[ 16 , 17 , 18 , 19 ]. Its use allows servers to compute arbitrary functions over encrypted data while only clients see the decrypted data. However, one of the problems of schemes with fully homomorphic encryption is performance, since current schemes require large computational resources and large storage overheads [6 , ^[6][26]26 ]. For some applications, so-called somewhat homomorphic encryption schemes may be used. These schemes are more efficient than FHE, but only allow a certain number of additions and multiplications [ 16 ^[16][18], 18 ]. The main problem when using somewhat or fully homomorphic encryption is that the resulting search schemes require a linear search time in the length of the dataset and this is too slow for practical use in modern applications.

As noted earlier, the problem of searching over encrypted data is of great interest from both theoretical and practical points of view. This is explained by the importance of ensuring the security and privacy of data stored and processed on third-party remote cloud servers of the service provider. However, as noted by some experts in this field [ 9 , ^[9][27]27 ], research on this topic is more focused on the scenario of a user who outsources an encrypted set of documents (such as e-mails or medical records) and would like to continue keyword search in this encrypted dataset. However, in practice, many companies, organizations, and institutions store data in databases that use the relational data model. Users are accustomed to using widely accepted SQL, which allows them to store, query, and update their data in a convenient way. Databases that support SQL (this applies in general to both NewSQL and some NoSQL databases that also allow you to work in the SQL query paradigm) provide fast search and retrieval of records, provided that the database can read out the data contents. However, encryption makes it difficult to search encrypted databases. Therefore, the direct application of solutions to search for the required information in the encrypted data of traditional databases is not an easy task.

In order to solve certain issues, Hacigümüş et al. ^[28][ 28 ] have developed techniques by which the bulk of the work of executing SQL queries can be performed by the service provider without the need to decrypt the stored data. The paper explores an algebraic structure for query splitting to minimize client-side computations. Using a so-called “coarse index” allows you to partially execute the SQL query on the provider side. The result of this query is sent to the client. The final correct result of the query is found by decrypting the data and executing a compensation query on the client side.

Popa et al. ^[26][ 26 ] proposed a system called CryptDB that supports SQL queries over encrypted data. This solution is based on various types of encryption, such as random (RND), deterministic (DET), and order-preserving encryption (OPE), applied to a SQL table column. To request data from an encrypted database, CryptDB converts an unencrypted SQL query into its encrypted equivalent and decrypts the appropriate encryption layers. CryptDB achieves its goals using three ideas: running queries over encrypted data using a new encryption strategy with SQL support, dynamically adjusting the encryption level using encryption onions to minimize the information disclosed to the untrusted DBMS server, and chaining encryption keys to user passwords in a way that only authorized users can access to encrypted data. At that, although CryptDB protects data confidentiality, it does not guarantee the integrity, actuality, or completeness of the results returned to the application. However, the main disadvantage of CryptDB, as noted by Azraoui et al. [ 9 ^[9]] is that whenever one layer is removed, the encryption scheme becomes weak. In light of this, the main problem is to provide a practical solution for searching over encrypted databases that does not suffer from the leakage occurring in CryptDB and that provides transparent processing of complex queries over encrypted SQL databases. In their paper ^[9][ 9 ] the authors attempt to solve this problem by proposing a practical construct for searching data in an encrypted SQL databases that limits information leakage. Their solution is based on the searchable encryption technique developed by Curtmola et al. [ ^[29]29 ] and applied to unstructured documents. This mechanism creates an inverted search index of keywords in the database to enable keyword search queries over encrypted data. The practicality of this solution is achieved through the use of the cuckoo hashing technique, which makes the search in the index efficient. The proposed solution supports Boolean and range queries.

Pilyankevich et al. ^[27][ 27 ] propose a system (called Acra) which allows, among other things, to provide a search for encrypted data in SQL databases. The proposed Acra Searchable Encryption (Acra SE) solution is based on a blind indexing approach that develops the original idea of the CipherSweet project [ ^[30]30 ]. The main component of the Acra SE scheme is the so-called Acra Server, which works as a reverse proxy (transparent encryption/decryption proxy server). It sits between the application and the database. The application does not know that the data are encrypted before it gets into the database, the database does not know that someone encrypted the data. It is worth noting that the encryption and secure search functions of Acra Server can be configured for each column. This means that every table in the database can be fully encrypted (every column), partially encrypted (some columns are encrypted, some not), or fully unencrypted. All Acra’s searchable encryption security properties are very similar to the security properties of CipherSweet, which poses the risk of partially known plaintext attacks. In this connection, Pilyankevich et al. [ ^[27]27 ] provide practical recommendations to ensure security. However, despite certain solutions aimed at ensuring the security of storing and searching for sensitive data, Acra, like CipherSweet, which was taken as a prototype of a searchable encryption scheme, supports the minimum functionality of queries, namely, only for equality.

Various DBMSs are characterized by the so-called technology of “transparent data encryption” (TDI) ^[31][ 31 ], which allows you to selectively encrypt sensitive data stored in database files, as well as in files related to data recovery, such as redo logs, archive logs, backup tapes. The essence of transparent encryption is that a combination of two keys is used: a key for each database table, which is unique, a master key that is stored outside the database in the so-called “wallet”. Data stored on disk are encrypted; however, they are automatically decrypted for the legitimate user to process queries. That is, when the user selects encrypted columns, the DBMS quietly extracts the key from the “wallet”, decrypts the columns and shows them to the user. As a result, the server must have access to the decryption keys, and an attacker who has compromised the DBMS software can gain access to all data. Therefore, the main goal of TDE is to protect sensitive data located in the corresponding files of the operating system. TDE is not a full blown encryption system and it should not be used in this capacity.

In addition, attention should be paid to the fact that the ability to perform search operations over encrypted databases leads to the complexity of systems and an increase in the amount of memory required and query execution time. At that, some searchable encryption schemes when performing certain queries do not provide sufficient data confidentiality. That is, with long-term observation, an attacker can obtain a significant part of the information about sensitive data. All this testifies to the openness of the secure search problem and the need for further research in this direction to ensure secure work with remote databases and data storages.