De-identification and Privacy Issues on Bigdata Transformation

As the volume of data in various industries and government sectors grows exponentially, the ‘7V’ concept of big data aims to create new value by indiscriminately collecting and analyzing information from various fields. With the arrival of the ICT industry ecosystem, however, big data utilization is threatened by privacy attacks, such as privacy infringement made possible by the sheer amount of data. To maintain a controllable level of privacy, recommended de-identification techniques are needed. This paper explores those de-identification processes and three commonly used privacy models. Furthermore, this paper presents use cases in which these technologies can be adopted, along with future development directions.


Over the last 20 years, data has been generated across many industries and sectors, and its volume has increased exponentially. According to the “Data Age 2025” white paper published by the International Data Corporation (IDC) in 2018, the amount of data worldwide will increase about fivefold, from 33 ZB (zettabytes; 1 ZB is 1 trillion GB) as of 2018 to 175 ZB by 2025 [1]. Big data refers not merely to large data but to data of enormous scale, including text and video data, compared with the data generated in the analog environments of the past. In recent years, various industries have been fascinated by the potential of big data and have analyzed it to create value in many forms, and the individuals, companies, and countries that utilize big data effectively have opened a new paradigm. Since big data is intended to advance the interests of society, its beneficial use needs to be promoted at a national level. However, big data is problematic precisely because it presupposes the inclusion of a large amount of data. Most data worth using contains attributes that identify an individual directly, attributes that are not identifiers by themselves but can be combined with other data to infer a particular person indirectly, and attributes that can reveal a person’s private information (e.g., card information or salary). In other words, because big data is produced, collected, and analyzed indiscriminately, there is a risk that personal privacy may be infringed upon depending on the information collection process or the results of the analysis.
In addition to the explosion of data resulting from the advent of the ICT industry ecosystem, the scope of what counts as personal information continues to expand, and the potential for various categories of crime, including cybercrime and invasion of privacy, can increase when data checks miss too much information. Building and utilizing appropriately de-identified data is therefore expected to ensure stability and reliability in making proper use of big data.
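The three attribute categories described in the introduction (direct identifiers, attributes that identify a person only in combination, and sensitive attributes) can be illustrated with a minimal sketch. The records, the field names (`name`, `zip`, `age`, `sex`, `salary`), and the helper functions `drop_columns` and `singled_out` are all hypothetical and chosen only for illustration; they are not taken from any guideline or dataset discussed in this paper.

```python
from collections import Counter

# Hypothetical toy records; all names and values are invented for illustration.
RECORDS = [
    {"name": "Kim",  "zip": "13053", "age": 28, "sex": "F", "salary": 4200},
    {"name": "Lee",  "zip": "13053", "age": 28, "sex": "F", "salary": 3900},
    {"name": "Park", "zip": "13053", "age": 35, "sex": "M", "salary": 5100},
    {"name": "Choi", "zip": "13068", "age": 35, "sex": "F", "salary": 6100},
    {"name": "Jung", "zip": "13068", "age": 28, "sex": "M", "salary": 4700},
]

# Assumed classification of the columns into the three attribute categories.
IDENTIFIERS = ["name"]                     # identifies an individual directly
QUASI_IDENTIFIERS = ["zip", "age", "sex"]  # identifying only in combination
SENSITIVE = ["salary"]                     # reveals private information


def drop_columns(rows, columns):
    """Return copies of the rows without the given columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


def singled_out(rows, columns):
    """Return the rows whose combination of values in `columns` is unique."""
    def key(row):
        return tuple(row[c] for c in columns)

    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] == 1]


# Removing the direct identifier alone does not prevent indirect identification.
released = drop_columns(RECORDS, IDENTIFIERS)

for column in QUASI_IDENTIFIERS:
    print(column, "alone singles out", len(singled_out(released, [column])), "records")
print("combined, they single out",
      len(singled_out(released, QUASI_IDENTIFIERS)), "records")
```

In this toy table no single attribute isolates anyone, yet the combination of the three does, which is why removing direct identifiers alone is not regarded as sufficient de-identification.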

This paper is organized as follows. Section II briefly expands on the big data concepts mentioned in the introduction and introduces the current guidelines for de-identification processing, together with the technologies and models they apply. Section III introduces three cases of big data analysis based on de-identified data and examines what value can be created. Finally, Section IV concludes by suggesting how the de-identification guidelines can be applied in future research on privacy and security.


To realize the benefits that big data provides to individuals and society, it is essential that individuals can trust the privacy protection applied when big data is used. Finding meaningful patterns in unstructured forms of big data is becoming increasingly important for creating value. The personal information that we commonly use can also yield significant value as big data. As discussed in this paper, such data must pass through a basic de-identification process before it can be used. Rather than introducing new privacy models or de-identification methods, this study examined the foundations of the existing privacy models and how value was produced from data to which de-identification measures had been applied. These findings are expected to be used to define new assumptions or to conduct analyses based on de-identified data in the future. Meanwhile, the parameters of the three privacy models presented, k-anonymity, l-diversity, and t-closeness, are currently determined by experts. In future research, we will adopt deep-learning models to find the optimal parameters for k-anonymity, l-diversity, and t-closeness.
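To make the role of these parameters concrete, the following minimal sketch measures the k, l, and t values that an already generalized table achieves. The table, the column names, and the function names are hypothetical. l-diversity is taken in its simplest "distinct values" form, and t-closeness uses the variational distance, which matches the Earth Mover's Distance only under the assumption that all sensitive values are equally far apart; other variants of both models exist.

```python
from collections import Counter, defaultdict

# Hypothetical generalized table; columns and values are invented for illustration.
TABLE = [
    {"zip": "130**", "age": "20-29", "disease": "flu"},
    {"zip": "130**", "age": "20-29", "disease": "cancer"},
    {"zip": "130**", "age": "20-29", "disease": "flu"},
    {"zip": "148**", "age": "50-59", "disease": "flu"},
    {"zip": "148**", "age": "50-59", "disease": "flu"},
    {"zip": "148**", "age": "50-59", "disease": "flu"},
]
QUASI_IDENTIFIERS = ["zip", "age"]
SENSITIVE = "disease"


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row)
    return list(groups.values())


def achieved_k(rows, quasi_identifiers):
    """Size of the smallest equivalence class (k-anonymity)."""
    return min(len(g) for g in equivalence_classes(rows, quasi_identifiers))


def achieved_l(rows, quasi_identifiers, sensitive):
    """Fewest distinct sensitive values in any class (distinct l-diversity)."""
    return min(len({r[sensitive] for r in g})
               for g in equivalence_classes(rows, quasi_identifiers))


def achieved_t(rows, quasi_identifiers, sensitive):
    """Largest variational distance between a class and the whole table."""
    overall = Counter(r[sensitive] for r in rows)
    worst = 0.0
    for g in equivalence_classes(rows, quasi_identifiers):
        local = Counter(r[sensitive] for r in g)
        distance = 0.5 * sum(abs(local[v] / len(g) - overall[v] / len(rows))
                             for v in overall)
        worst = max(worst, distance)
    return worst


k = achieved_k(TABLE, QUASI_IDENTIFIERS)
l = achieved_l(TABLE, QUASI_IDENTIFIERS, SENSITIVE)
t = achieved_t(TABLE, QUASI_IDENTIFIERS, SENSITIVE)
print(f"k = {k}, l = {l}, t = {t:.3f}")
```

The toy table is 3-anonymous, yet one class contains a single disease value, so it is only 1-diverse. This is the kind of trade-off an expert, or a learned model as proposed for future work, has to weigh when selecting the parameters.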

Hyo-jun Lee; Si-heon Cho; Ji-won Seong; Suan Lee; Wookey Lee




