Privacy-Preserving Deep Learning NLP Models for Cancer Registries


Population cancer registries can benefit from deep learning (DL) to automatically extract cancer characteristics from pathology reports. The success of DL is proportional to the availability of large labeled datasets for model training. Although collaboration among cancer registries is essential to fully exploit the promise of DL, privacy and confidentiality concerns are the main obstacles to data sharing across cancer registries. Moreover, DL for natural language processing (NLP) requires sharing a vocabulary dictionary for the embedding layer, which may contain patient identifiers. Thus, even distributing the trained models across cancer registries can violate patient privacy. In this paper, we propose distributing a DL-NLP model via privacy-preserving transfer learning (TL) approaches without sharing sensitive data. These approaches are used to distribute a multi-task CNN-NLP model among cancer registries. The model is trained to extract six cancer characteristics from pathology reports. We compare our proposed approach to conventional TL without privacy preservation, single-registry models, and a model trained on centrally hosted data. The results show that TL approaches, including data sharing and model distribution, significantly outperform the single-registry model. In addition, the best-performing privacy-preserving model distribution approach achieves average micro- and macro-F scores across all extraction tasks (0.823, 0.580) that are statistically indistinguishable from those of the centralized model (0.827, 0.585).


  • Author Keywords

    • Privacy-preserving
    • multi-task CNN
    • transfer learning
    • NLP
    • information extraction
    • cancer pathology reports
  • IEEE Keywords

    • Cancer
    • Pathology
    • Data models
    • Natural language processing
    • Training
    • Vocabulary
    • Tumors



Accurate, timely, and comprehensive cancer monitoring is critical not only for assessing the population-level impact of cancer but also for informing population-based cancer control policies. Population cancer registries annually process large volumes of unstructured pathology reports, extracting cancer characteristics such as tumor anatomic location site, histological type, tumor grade, and stage at diagnosis for reporting to national cancer surveillance programs. This critical information resides in narrative text full of typos, abbreviations, and linguistic variation. Natural language processing (NLP) has been explored extensively in oncology to semi-automate this time-consuming and laborious manual effort [1], [2]. Scalable NLP can have a dramatic impact on cancer surveillance by helping cancer registries provide near real-time, detailed measurements of cancer incidence, progression, survival, and mortality.

However, existing clinical NLP methods are mainly rule-based, requiring human experts to manually engineer input features. This is unsustainable because of the prohibitively large number of rules that domain experts must carefully curate to capture all possible linguistic expressions. Artificial intelligence (AI) could therefore address clinical NLP challenges [3] and facilitate effective translation of NLP tools across cancer registries. Among AI approaches, deep learning (DL) has been successfully applied to classify and recognize complex features in image, speech, and text data. Recent studies have shown the potential of DL models to automatically extract key cancer characteristics from cancer pathology reports [4], [5], [6], [7], achieving accuracy superior to that of traditional machine learning NLP methods.

Successfully applying DL in a specific domain requires a large training corpus with characteristics similar to those of the prospective testing data. Furthermore, this success is proportional to the size of the training corpus. Obtaining a large enough corpus from a single cancer registry is challenging, particularly for rare cancer anatomic location sites (i.e., body organs where cancer develops) and histologies (i.e., different cell types). This challenge can be overcome by aggregating cancer pathology reports from multiple cancer registries in a centralized hub, which can serve as a neutral entity to train a generalized model on all the data. Upon completion of training, the trained model can be shared with the registries. However, data privacy and confidentiality concerns prevent cancer registries from sharing patient data and from benefiting from each other's knowledge through DL.

Transfer learning can be exploited to avoid data sharing by distributing learned models across cancer registries instead of distributing pathology reports. In transfer learning, a model can be developed at one clinical site and then reused as a starting point at another clinical site. A cancer registry can therefore benefit from other registries' labeled datasets to obtain a more generalized model and reach better performance with fewer training samples of its own. Although transfer learning has been widely and successfully used in many computer vision applications [8], applying the same approach to text applications and sharing the whole model across data holders still requires access to the source data dictionary, which includes sensitive information such as patient names and residential addresses. Without a universally accepted de-identification algorithm, large-scale de-identification is not currently a viable option across cancer registries. Image-based DL models do not contain any individually identifiable patient information; however, text-based DL models contain such information as part of the word embeddings.
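The model hand-off described above, in which a model developed at one site becomes the starting point at the next, can be illustrated with a minimal sketch. It is not the authors' implementation: a toy linear model stands in for a DL model, and the function names, the example data, and the `rounds` parameter (one round is a single pass over the sites; larger values repeat the pass) are all assumptions made for illustration.

```python
from typing import List, Tuple

Weights = Tuple[float, float]        # (slope, intercept) of a toy linear model
Dataset = List[Tuple[float, float]]  # local (x, y) pairs that never leave a site


def train_local(weights: Weights, data: Dataset,
                epochs: int = 50, lr: float = 0.05) -> Weights:
    """Continue training from the received weights using only local data."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y   # prediction error on one local sample
            w -= lr * err * x       # plain SGD step on squared error
            b -= lr * err
    return w, b


def hand_off(sites: List[Dataset], rounds: int = 1) -> Weights:
    """Pass the model from site to site; only weights cross site boundaries."""
    weights: Weights = (0.0, 0.0)
    for _ in range(rounds):
        for data in sites:
            weights = train_local(weights, data)
    return weights


def mse(weights: Weights, data: Dataset) -> float:
    """Mean squared error of the toy model on a dataset."""
    w, b = weights
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


# Hypothetical sites whose data follow y = 2x + 1 over different x ranges.
site_a = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)]
site_b = [(1.5, 4.0), (2.0, 5.0)]
site_c = [(-1.0, -1.0), (-0.5, 0.0)]
sites = [site_a, site_b, site_c]
pooled = site_a + site_b + site_c   # used here only to evaluate the result

final = hand_off(sites, rounds=3)
print("pooled MSE after hand-off:", mse(final, pooled))
```

Each call to `train_local` sees only one site's records, so the only artifact that moves between sites is the weight tuple.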
To distribute a trained text-based DL model across cancer registries, the vocabulary dictionary, which contains individually identifiable patient information, must be distributed too. Therefore, distributing DL NLP models across cancer registries poses privacy concerns.

This work builds upon our previous work [9], in which we implemented a conventional transfer learning (TL) approach among cancer registries and applied it to a single-task CNN model for cancer subsite extraction from pathology reports. We also compared the model trained via TL with a model trained on centrally hosted data. The main contributions of this work are as follows:

  • We develop a multi-task CNN (MT-CNN) model for information extraction from cancer pathology reports. It differs from the previous work [7] by extracting information at the pathology report level instead of the tumor level. We also consider all available classes of cancer characteristics without condensing low-prevalence classes. The model is used to extract six key cancer characteristics: tumor anatomic location site (i.e., site) (70 classes), subsite (313 classes), laterality (7 classes), behavior (4 classes), histology (543 classes), and grade (9 classes).
  • We propose a new privacy-preserving approach that protects any protected health information (PHI) in the word embedding vocabulary dictionary by restricting which word tokens are included in the vocabulary. To prevent PHI such as patient names and residential addresses from being included, we limit the vocabulary to words from a publicly available corpus that has been prescreened for PHI, such as the MIMIC-III dataset and the PubMed abstracts dataset. Thus, a trained model can be shared with other registries without data restrictions.
  • We evaluate the effectiveness of collaboration across cancer registries on the performance of the MT-CNN using different TL methods with and without our privacy-preserving vocabulary. These methods are necessary for scenarios where cancer registries are unable to directly share their patient data for training. We compare the conventional TL approach, acyclic TL, and the state-of-the-art model distribution approach, cyclic transfer learning [10]. Cyclic transfer learning has been used in medical imaging applications, but to our knowledge, this is the first time it has been applied to medical text. We compare these approaches against the baselines of training the MT-CNN on data from only a single registry and training the MT-CNN on data from all available registries without any restrictions.
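The vocabulary restriction in the second contribution can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the simplified tokenizer, the reserved `<pad>` and `<unk>` tokens, the function names, the choice to map excluded tokens to `<unk>`, and the tiny `public_words` set standing in for a prescreened corpus such as MIMIC-III or PubMed abstracts are all hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Set

PAD, UNK = "<pad>", "<unk>"  # reserved tokens (names are an assumption)


def tokenize(text: str) -> List[str]:
    """Lowercase and keep alphanumeric runs (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_restricted_vocab(local_reports: Iterable[str],
                           public_words: Set[str],
                           min_count: int = 1) -> Dict[str, int]:
    """Keep only local tokens that also occur in the prescreened public word list."""
    counts = Counter(tok for report in local_reports for tok in tokenize(report))
    kept = sorted(tok for tok, n in counts.items()
                  if n >= min_count and tok in public_words)
    vocab = {PAD: 0, UNK: 1}
    for tok in kept:
        vocab[tok] = len(vocab)
    return vocab


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Map tokens to ids; anything outside the vocabulary becomes <unk>."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokenize(text)]


# Hypothetical stand-in for the word list of a prescreened public corpus.
public_words = {"invasive", "ductal", "carcinoma", "left", "breast",
                "grade", "2", "patient", "of", "the"}

# Made-up report containing a name and an address.
reports = ["Patient Jane Doe, 12 Elm Street: invasive ductal carcinoma "
           "of the left breast, grade 2."]

vocab = build_restricted_vocab(reports, public_words)
print(sorted(vocab))                            # no name or address tokens
print(encode("Jane Doe carcinoma", vocab))      # names collapse to the <unk> id
```

Because the shared dictionary is a subset of the public word list, the name and address tokens in the local report never enter it; the embedding layer only ever sees them as the single `<unk>` id.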


In this paper, we propose a privacy-preserving technique for sharing a DL NLP model across cancer registries while excluding any data that may compromise patient privacy. We demonstrate the value of our technique with an MT-CNN model for abstracting cancer characteristics from cancer pathology reports, a time-consuming manual activity across cancer registries. In addition, we study different model distribution and data sharing approaches among cancer registries. The experiments demonstrate that model distribution and data sharing approaches achieve higher micro- and macro-F1 scores across all information extraction tasks than the single-registry model. The performance improvement is especially noticeable for macro-F1 scores, suggesting that these approaches are better at classifying low-prevalence cases, which is an important advantage. Finally, our proposed transfer learning with privacy-preserving models achieves performance comparable to that of the conventional transfer learning approach without privacy preservation and that of the centralized model. This opens the possibility of sharing knowledge through NLP models across cancer registries without violating data privacy rules.
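A small worked example shows why macro-F1 is the more sensitive of the two scores for low-prevalence classes. The sketch below assumes single-label classification and averages macro-F1 over every label that appears in either the true or the predicted labels; the function names and the made-up labels are illustrative only and are not taken from the paper.

```python
from collections import Counter
from typing import Sequence, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when the class never appears."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    # Micro: pool the counts over classes, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: average per-class F1, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Made-up example: a model that always predicts the common class.
y_true = ["breast"] * 8 + ["thymus"] * 2
y_pred = ["breast"] * 10

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

Here the pooled counts give a micro-F1 of 0.8, while the missed rare class pulls the macro-F1 down to 4/9, so gains on rare classes show up mainly in the macro score.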


This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health. This work was performed under the auspices of the U.S. Department of Energy by Argonne National Laboratory under Contract DE-AC02-06-CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC52-06NA25396, and Oak Ridge National Laboratory under Contract DE-AC05-00OR22725. This work has also been supported by the National Cancer Institute under Contract No. HHSN261201800013I/HHSN26100001 and NCI Cancer Center Support Grant (P30CA177558).


Source paper: M. Alawad et al., "Privacy-Preserving Deep Learning NLP Models for Cancer Registries," in IEEE Transactions on Emerging Topics in Computing.
