A Two-Phase Dynamic Throughput Optimization Model for Big Data Transfers


The amount of data transferred over dedicated and non-dedicated network links has been increasing much faster than network capacity. Meanwhile, current data transfer solutions fail to guarantee even the promised achievable transfer throughput. In this paper, we propose a novel two-phase dynamic throughput optimization model based on mathematical modeling with offline knowledge discovery/analysis and adaptive online decision making. In the offline analysis, we mine historical transfer logs to perform knowledge discovery about the transfer characteristics. The online phase uses the discovered knowledge from the offline analysis along with the real-time investigation of the network condition to optimize the protocol parameters. As the real-time investigation is expensive and provides only partial knowledge about the current network status, our model uses historical knowledge about the network and data characteristics to reduce the real-time investigation overhead while ensuring near-optimal throughput for each transfer. Our novel approach is tested over different networks with different datasets, and it has outperformed its closest competitor by 1.7x and the default case by 5x. It also achieved up to 93% accuracy compared to the optimal achievable throughput possible on those networks.

  • Author Keywords

    • Throughput optimization,
    • big data transfers,
    • offline analysis,
    • dynamic learning,
    • protocol tuning
  • IEEE Keywords

    • Throughput,
    • Protocols,
    • Data transfer,
    • Data models,
    • Bandwidth,
    • Optimization,
    • Real-time systems


Extremely large-scale data in numerous fields, such as scientific research, industrial applications, e-commerce, social networks, and the Internet of Things (IoT), are ubiquitous nowadays. As a result, global data traffic has been increasing exponentially over the years, and it is expected to reach an annual rate of 4.8 zettabytes by 2022, which corresponds to nearly 1 billion DVDs of data transfer per day for the entire year [1]. This plethora of data generated from diverse sources makes data management tasks (i.e., accessing, sharing, and disseminating data) tremendously challenging [2], [3]. In recent years, Managed File Transfer (MFT) services, such as Mover.IO [4], Globus [5], B2SHARE [6], and CloudFuze [7], have been developed to ease data management, sharing, and replication. However, these services suffer from under-utilization of network resources, especially in diverse long-haul Wide-Area Networks (WAN), which hurts the overall data sharing performance. This performance degradation can be attributed to the “one protocol fits all” approach used in the existing solutions.

Changing the protocol stack is expensive and requires low-level updates, and its adoption at large scale takes considerable time and effort. Therefore, a user-space tuning mechanism for the existing protocols is more attractive for adoption and fast deployment. Intelligently and dynamically tuning the protocol parameters that are accessible from the user-space (such as buffer size, pipelining, parallelism, and concurrency levels) can significantly improve the end-to-end data transfer performance [8], [9], [10]. Most of the existing MFTs use default or statically assigned values for these parameters, which cannot adapt to the dynamic nature of the network and therefore perform sub-optimally during the end-to-end data transfer. Moreover, an aggressive assignment of these parameters can lead to link congestion, queuing delay, and packet-loss events in a shared network. Therefore, we aim for a solution that can provide personalized parameter settings for different data transfer tasks and can adapt the parameters in real-time to cope with the dynamics of the shared networks without hurting the performance of other contending transfers.
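To make the tunable parameters concrete, the following is a minimal sketch of how a user-space tool might represent them. The class name, field names, the stream-budget helper, and the use of the bandwidth-delay product as a buffer-size reference are illustrative assumptions, not part of the paper's model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferParams:
    """User-space protocol parameters named in the text (field names are hypothetical)."""
    concurrency: int        # files transferred at the same time
    parallelism: int        # TCP streams opened per file
    pipelining: int         # transfer commands queued back-to-back per channel
    buffer_size_bytes: int  # TCP buffer size requested per stream

    def total_streams(self) -> int:
        # Assumption: every concurrent file transfer opens `parallelism` streams.
        return self.concurrency * self.parallelism


def bdp_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> int:
    """Bandwidth-delay product in bytes: data in flight needed to fill the link."""
    return round(bandwidth_bits_per_s * rtt_s / 8)


def clamp_to_budget(params: TransferParams, max_streams: int) -> TransferParams:
    """Lower concurrency until the total stream count fits a hypothetical budget."""
    cc = params.concurrency
    while cc > 1 and cc * params.parallelism > max_streams:
        cc -= 1
    return TransferParams(cc, params.parallelism, params.pipelining,
                          params.buffer_size_bytes)
```

An overly aggressive setting such as `TransferParams(8, 4, 2, 1 << 20)` opens 32 streams; `clamp_to_budget(..., max_streams=16)` would reduce concurrency to 4, which illustrates the kind of restraint a shared network calls for.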

In this paper, we present a novel two-phase application-layer throughput optimization model for big data transfers based on offline knowledge discovery and adaptive real-time tuning to ensure continuous performance guarantee and fairness among the contending transfers in the shared network. To provide network-specific optimal parameter settings, we mine the historical data transfer logs offline. During the online phase, we monitor the health of the data transfer and use the feedback from both the network and offline module to tune the parameters dynamically. Real-time investigations (probing for bandwidth) are expensive and provide only a partial view of the dynamics of the shared network. Therefore, our model uses historical knowledge and online network feedback to decide on near-optimal parameter settings. We have tested our network- and data-agnostic solution over different networks and observed up to 93% accuracy compared to the optimal achievable throughput possible on those networks. Extensive experimentation and comparison with the best-known existing solutions in this area revealed that our model outperforms existing solutions in terms of accuracy, convergence speed, and achieved end-to-end data transfer throughput.
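The two-phase idea can be illustrated with a deliberately simplified sketch. The log record layout, the function names, the 20% tolerance, and the one-step adjustment rule below are all assumptions made for illustration; the paper's actual mathematical model is not reproduced here.

```python
from statistics import mean


def offline_best_params(logs, network, size_class):
    """Offline phase (sketch): pick the parameter tuple with the best mean
    throughput among historical transfers on the same network and size class."""
    by_params = {}
    for rec in logs:
        if rec["network"] == network and rec["size_class"] == size_class:
            by_params.setdefault(rec["params"], []).append(rec["throughput"])
    if not by_params:
        return None  # no history for this network and size class
    return max(by_params, key=lambda p: mean(by_params[p]))


def online_adjust(params, predicted, observed, tolerance=0.2):
    """Online phase (sketch): one feedback step on (concurrency, parallelism)."""
    cc, p = params
    if observed < (1 - tolerance) * predicted:
        # Well below the historical expectation: assume contention and back off.
        return (max(1, cc - 1), p)
    if observed > (1 + tolerance) * predicted:
        # Better than history suggests: probe one step higher.
        return (cc + 1, p)
    return params  # within tolerance: keep the current setting
```

Starting from the historically best setting means the online phase only has to correct for current conditions rather than search from scratch, which is the intuition behind the reduced probing overhead described above.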

In summary, the contributions of this paper include: (1) our proposed model performs end-to-end big data transfer optimization completely at the application-layer, without any need to change the existing infrastructure, networking stack, or the operating system kernel; (2) it combines offline knowledge discovery with adaptive real-time sampling to achieve close-to-optimal end-to-end data transfer throughput with very low sampling overhead; (3) we define a detailed mathematical model that uses historical data to find near-optimal parameters for different data transfer tasks in diverse shared networks; (4) in real-time, it combines the current network condition and knowledge from offline historical analysis to provide faster convergence towards maximally achievable throughput; (5) to guarantee fairness among the concurrent data transfer tasks, we propose a fair parameter scheduler that can converge faster, distribute bandwidth fairly, and increase bandwidth utilization by 56%.

The rest of the paper is organized as follows: Section II provides background information on the application-layer protocol parameters we tune; Section III presents the basic adaptive sampling model; Section IV discusses the fair parameter scheduling model; Section V elaborates the evaluation of the model; Section VI describes the related work in this field; and Section VII concludes the paper with a discussion on future work.


In this study, we have explored a novel big-data transfer throughput optimization model that relies upon offline mathematical modeling and online adaptive sampling. Existing literature contains different types of throughput optimization models that range from static parameter based systems to dynamic probing based solutions. Our model eliminates the online optimization cost by performing offline analysis, which can be done periodically. It also provides accurate modeling of throughput, which helps the online phases to reach a near-optimal solution very quickly. For large-scale transfers, when external background traffic can change during transfer, our model can detect the harsh changes and can act accordingly. The FPS module can converge faster than existing solutions. The overall model is resilient to harsh network traffic changes. We performed extensive experimentation and compared our results with the best known existing solutions in this area. Our model outperforms its closest competitor by 1.7x and the default case by 5x in terms of the achieved throughput. It also provides high utilization and achieves up to 93% accuracy compared with the optimal achievable throughput possible on the tested networks. As future work, we are planning to increase the achievable throughput further by reducing the impact of the TCP slow-start phase. Another exciting path is to reduce the overhead introduced by real-time parameter changes. We are also planning to investigate other application-layer protocol parameter sets that can be optimized to achieve better performance improvement.


This project is in part sponsored by the National Science Foundation (NSF) under award number OAC-1724898.

About KSRA

The Kavian Scientific Research Association (KSRA) is a non-profit research organization founded in December 2013 to provide research and educational services. The members of the community had formed a virtual group on the Viber social network. The core of the Kavian Scientific Association was formed with these members as founders. These individuals, led by Professor Siavosh Kaviani, decided to launch a scientific and research association with an emphasis on education.

KSRA research association, as a non-profit research firm, is committed to providing research services in the field of knowledge. The main beneficiaries of this association are public or private knowledge-based companies, students, researchers, professors, universities, and industrial and semi-industrial centers around the world.

Our main services are based on education for people across the whole spectrum, worldwide. We want to integrate research and education. We believe education is a basic human right, so our services concentrate on inclusive education.

The KSRA team partners with local under-served communities around the world to improve the access to and quality of knowledge based on education, amplify and augment learning programs where they exist, and create new opportunities for e-learning where traditional education systems are lacking or non-existent.

Full paper:

M. S. Q. Z. Nine and T. Kosar, "A Two-Phase Dynamic Throughput Optimization Model for Big Data Transfers," in IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 2, pp. 269-280, 1 Feb. 2021.





Somayeh Nosrati was born in 1982 in Tehran. She holds a Master's degree in artificial intelligence from Khatam University of Tehran.


Professor Siavosh Kaviani was born in 1961 in Tehran. He had a professorship. He holds a Ph.D. in Software Engineering from the QL University of Software Development Methodology and an honorary Ph.D. from the University of Chelsea.


Nasim Gazerani was born in 1983 in Arak. She holds a Master's degree in Software Engineering from UM University of Malaysia.