2010 12th IEEE International Conference on High Performance Computing and Communications

Fault-Tolerant Scheduling with Dynamic Number of Replicas in Heterogeneous Systems

Laiping Zhao∗, Yizhi Ren∗‡, Yang Xiang†, and Kouichi Sakurai∗

∗ Department of Informatics, Kyushu University, Fukuoka, Japan. Email: {zlp,ren}@itslab.csce.kyushu-u.ac.jp, sakurai@inf.kyushu-u.ac.jp
† School of Information Technology, Deakin University, Australia. Email: yang.xiang@deakin.edu.au
‡ School of Software, Dalian University of Technology, China

Abstract—In existing studies on fault-tolerant scheduling, the active replication scheme uses ε + 1 replicas for each task to tolerate ε failures. In this paper, however, we show that more replicas do not always lead to higher reliability; moreover, more replicas imply more resource consumption and higher economic cost. To address this problem, targeting satisfaction of the user’s reliability requirement with minimum resources, this paper proposes a new fault-tolerant scheduling algorithm: MaxRe. The algorithm incorporates reliability analysis into the active replication scheme and exploits a dynamic number of replicas for different tasks. Both the theoretical analysis and the experiments show that the MaxRe algorithm’s schedules certainly satisfy the user’s reliability requirements, and that MaxRe can achieve the corresponding reliability with up to 70% fewer resources than the FTSA algorithm.
Index Terms—Resource scheduling; Fault-tolerance; Reliability; Heterogeneous system

978-0-7695-4214-0/10 $26.00 © 2010 IEEE  DOI 10.1109/HPCC.2010.72

I. INTRODUCTION

A. Background

Cloud computing is becoming increasingly popular, and more and more services are continuously emerging on the Internet. To provide high reliability, cloud providers generally schedule tasks with redundancy. In general, resource redundancy and time redundancy correspond to the active replication scheme and the backup/restart scheme respectively [1]. In the active replication scheme, several processors execute a task simultaneously, and the task succeeds if at least one of them does not encounter a failure. In the backup/restart scheme, when a processor encounters a failure, the task is rescheduled on a backup processor [2].

Amazon, for example, claims that its S3 service stores three replicas of each file. That is, to store x gigabytes of data, Amazon has to supply 3x gigabytes of storage located on three different drives, with x gigabytes on each drive. Assuming the economic cost of each drive is y, the three drives cost 3y, including an extra cost of 2y, and this extra cost is eventually passed on to the customers. The same holds beyond storage: the active replication scheme in computing services also consumes considerable extra resources and economic cost. How to achieve high reliability with minimum resources is therefore a challenge for the scheduling algorithm. In this study, reliability is interpreted as the probability of the successful completion of a job. Our objective is to design a fault-tolerant scheduling algorithm that satisfies the user’s reliability requirement with minimum resources.

B. Motivation

To ensure that the job finishes in time, the active replication scheme is a good choice for providing high reliability; it is exploited in the algorithms of [3] [5] [6] [16] and [17]. In order to tolerate ε failures, these algorithms have to schedule ε + 1 replicas for each task in the workflow. As discussed in the background, this leads to a large resource redundancy, which has an adverse impact on system performance, especially when resources are limited. Moreover, given the economic attributes of cloud services, more resource consumption comes with higher economic cost.

C. Previous work

Reliability-analysis-based scheduling algorithms have been addressed by many works. J.J. Dongarra et al. [8] design two algorithms that optimize both makespan and reliability. The first scheduling algorithm in [8] maximizes reliability subject to makespan minimization, and the second uses the product failure rate × unitary instruction execution time to trade off reliability maximization against makespan minimization. Accounting for both the execution time and the failure probability, A. Dogan et al. [9] develop a genetic-algorithm-based scheduling algorithm to trade off execution time and reliability. In [10], MCMS and PRMS are developed to achieve the maximum system reliability while satisfying a given time constraint. S. Swaminathan et al. [11] propose a reliability-aware value-based dynamic scheduling algorithm, which aims to maximize the overall PI of the system. M. Hakem et al. [14] present the BSA scheduling algorithm, which takes into account both the makespan and the failure probability of the application. The reliability achieved by these works is limited, and to achieve higher reliability, special schemes such as active replication are necessary. The primary and backup scheduling approach can tolerate one failure in the system. Q. Zheng et al. [3] [4] consider the response time and replication cost in the scheduling process


using the primary and backup scheduling algorithm. In [12], although a dynamic number of replicas is scheduled for each task, only one failure can be tolerated by the scheduling result. In order to reduce the schedule length, X. Qin et al. [13] put the emphasis on the conditions under which backup copies can safely overlap with each other, and propose the eFRD scheduling algorithm. However, these works can only tolerate one failure, which is far from enough for the resource scheduling problem. Considering crash failures, the active replication scheme is incorporated into the scheduling algorithms in [5] and [6], and the CAFT and FTSA scheduling algorithms are proposed respectively. FTSA is an extended version of the classic HEFT algorithm [15], and CAFT puts more emphasis on the practical one-port communication model. In [16], active replication and standby parallel replication strategies are exploited to manage the redundancy of each task, and based on an analysis of price, reliability and response time, algorithms are developed to meet user-specified QoS requirements. In [17], A. Girault et al. propose the FTBAR scheduling algorithm, which automatically produces a static distributed fault-tolerant schedule of a given algorithm on a distributed architecture. However, the high resource waste of the active replication scheme is not considered by these algorithms.

D. Challenging issues

To satisfy the user’s reliability requirement with the minimum resources, the number of replicas for each task should be as few as possible. How to decide the number of replicas for each task is one challenge. Even with the minimum resources, the scheduling algorithm must still guarantee that the user’s required reliability is satisfied; this is a second challenge. Moreover, while meeting the user’s reliability requirement, the scheduling algorithm’s performance on execution time should also be acceptable, which is the third challenge.

E. Our contribution with comparison to related works

The basic idea for minimizing the resource consumption is to incorporate reliability analysis into the active replication scheme. There have been some research achievements on reliability analysis [19], where the Exponential distribution and the Weibull distribution are two common assumptions for the failure distribution. Being aware of the failure distribution of the processors, a minimum, dynamic number of replicas can be fixed for each task. Based on this idea, we design the MaxRe scheduling algorithm. The analysis of the MaxRe scheduling algorithm shows that, without exceeding the system capacity, the reliability achieved by MaxRe certainly satisfies the user’s reliability requirement. Four typical workflow examples are used to verify the algorithm. Experiment results show that:
1) If the required number of replicas for each task does not exceed the total number of processors, the reliability achieved by the MaxRe algorithm certainly satisfies the user’s requirement.
2) To exhibit the same reliability degree, the MaxRe algorithm saves up to 70% of the resources in comparison with the FTSA algorithm [6].
3) The earliest finish time of MaxRe is at most 20% worse than that of the FTSA algorithm, while for the latest finish time, MaxRe even performs better than FTSA.

II. SYSTEM MODEL AND PROBLEM STATEMENT

The processor model, job model and system model are given in this section.

A. Processor model

Generally, faults can be categorized into crash faults (or fail-stop faults) and byzantine faults. Crash faults usually come with hardware failures, power failures etc., which may result in complete data loss. Only crash faults are considered in this paper. Suppose the heterogeneous system consists of m processors: P = {p0, p1, p2, ..., pm−1}. It is reasonable to believe that a processor is fault-free while it is idle. Based on the common Exponential distribution assumption in reliability research [7] [20], for each processor pi (0 ≤ i ≤ m−1), the arrival of failures follows a Poisson distribution with parameter λi, a positive real number equal to the expected number of failures in unit time t. So the failure distribution in unit time t can be represented as:

f(k, λ) = (λ^k e^{−λ}) / k!   (1)

where k is the number of occurrences of failures in unit time t.

B. Job model

A job is represented as a weighted directed acyclic graph (DAG): G = (V, E), where V is the set of nodes corresponding to the tasks, and E is the set of edges corresponding to the precedence relations between the tasks. Suppose V = {τ0, τ1, τ2, ..., τn−1}; n = |V| is the number of tasks and e = |E| is the number of edges. A node without any predecessor is called an entry node, and a node without any successor is called an exit node. Task τi cannot start executing before it has received the output from all of its predecessors, and the result of task τi can be sent to its successor tasks after the task has finished.

C. The system model and some of its properties

A heterogeneous system contains many different kinds of hardware and software working cooperatively to solve problems. The processors have different λ values: Λ = {λ0, λ1, λ2, ..., λm−1}. Therefore, for processor pi, the failure distribution is F(k, λi) = (λi^k e^{−λi}) / k!. We also assume all processors are fully connected, so each processor can communicate with every other processor. We do not consider failures of the communication devices in the current work.
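Under the assumptions above, the probability that a processor with failure rate λ stays failure-free for t time units is the k = 0 term of Formula 1 over an interval of length t, i.e. e^{−λt}. The following Python sketch illustrates the model; the λ values and execution time below are illustrative, not taken from the paper:

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k failures in unit time (Formula 1)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def reliability(lam, t):
    """Probability of zero failures during t time units: e^{-lam * t}."""
    return math.exp(-lam * t)

# Two illustrative processors with different failure rates.
lam1, lam2 = 0.001, 0.01
t = 50.0  # execution time of one task replica
r1, r2 = reliability(lam1, t), reliability(lam2, t)
# With active replication on both processors, the task fails only
# if both replicas fail: R = 1 - (1 - r1) * (1 - r2).
r_both = 1 - (1 - r1) * (1 - r2)
```

Note that replication on independent processors always helps here (r_both exceeds either single-processor reliability); the subtlety exploited by Proposition 2 below arises only when replicas share a fail-stop processor.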


Proposition 1 When submitting a DAG-based workflow to a heterogeneous system, if each task in the workflow can be replicated and scheduled on multiple processors, the total number of scheduling methods is (2^m − 1)^n.

Proof: Each task in the workflow, say τi, can either be scheduled on a processor pj or not. Over all m processors, there are 2^m possibilities for scheduling τi. Excluding the one possibility that τi is not scheduled on any processor, there are 2^m − 1 ways to schedule this task. Therefore, for all n tasks in the workflow, the total number of scheduling methods is (2^m − 1)^n.

Definition 1 (Full-schedule) Every task in the workflow has m replicas, which means every task is replicated and scheduled on all the m processors.

The probability of no failures occurring in a time period T is f(k = 0, λi) = e^{−λi T}, so the reliability of the full-schedule is 1 − ∏_{i=0}^{m−1} (1 − e^{−λi Ti}), where Ti is the total execution time of processor i. To tolerate more failures, traditional research has to schedule more replicas for each task. However, more replicas do not always provide higher reliability. In the extreme, a full-schedule may be less reliable than a non-full schedule.

Proposition 2 When submitting a DAG-based workflow to a heterogeneous system, more replicas for each task do not always lead to a higher reliability.

Proof: This proof is mainly based on the fail-stop property of crash failures. Suppose the system consists of two processors p1, p2 with computing speeds c1 and c2 respectively, and the workflow consists of two tasks t1, t2 with task loads l1 and l2 respectively. The arrival of failures on each processor follows a Poisson distribution f(k, λ) = (λ^k e^{−λ})/k!, with parameters λ1 and λ2 respectively.

Letting k = 0, the probability of no failures during time t is e^{−λt}, so the reliability values of the two tasks scheduled on the two processors are: e^{−λ1 l1/c1}, e^{−λ1 l2/c1}, e^{−λ2 l1/c2} and e^{−λ2 l2/c2}.

1) In situation A (Fig. 1(a)), task 1 has one replica (on p1) and task 2 has two replicas (on p1 and p2); by the fail-stop property, the reliability is:

e^{−λ1 l1/c1} × (1 − (1 − e^{−λ1 l2/c1})(1 − e^{−λ2 l2/c2}))
= e^{−λ1 (l1+l2)/c1} + e^{−λ1 l1/c1 − λ2 l2/c2} − e^{−λ1 (l1+l2)/c1 − λ2 l2/c2}   (2)

2) In situation B (Fig. 1(b), the full-schedule), the reliability is:

1 − (1 − e^{−λ1 (l1+l2)/c1})(1 − e^{−λ2 (l1+l2)/c2})
= e^{−λ1 (l1+l2)/c1} + e^{−λ2 (l1+l2)/c2} − e^{−λ1 (l1+l2)/c1 − λ2 (l1+l2)/c2}   (3)

We want to know whether there exist examples such that Formula 2 is greater than Formula 3, i.e. whether in some situations (2) > (3). Cancelling the common first term and dividing both sides by the common factor e^{−λ2 l2/c2}:

(2) > (3) ⇔ e^{−λ1 l1/c1} − e^{−λ1 (l1+l2)/c1} > e^{−λ2 l1/c2} − e^{−λ1 (l1+l2)/c1 − λ2 l1/c2}   (4)

Let l1 = l2 = 1, c1 = c2 = 1, λ1 = 0.1 and λ2 = 10. Formula 4 becomes: e^{−0.1} − e^{−0.2} > e^{−10} − e^{−10.2}, which holds. So we have (2) > (3) in certain situations. In conclusion, more replicas may not lead to a higher reliability.

[Fig. 1. More replicas do not always lead to a higher reliability: (a) scheduling with 1 replica for task 1, and 2 replicas for task 2; (b) full-schedule.]

Proposition 2 can be interpreted as: the more tasks one processor executes, the greater the probability that this processor encounters a failure. Therefore, a full-schedule may not give a higher reliability.

D. Problem statement

Given the job, processor and system models, we seek a resource scheduling algorithm whose target is to satisfy the user’s reliability requirement with the minimum resources.

III. THE MaxRe SCHEDULING ALGORITHM

This section analyzes the priority of all tasks in a workflow, and gives the description of the MaxRe scheduling algorithm.

A. Task priority

Based on the job model described in Section II(B), we determine the scheduling order of the tasks using their upward rank values. The upward rank value of task τi is computed by Formula 5, which comes from a classic study on workflow scheduling [15]. In future work, we would also like to explore the impact of different scheduling orders.

rank(τi) = ET(τi, p) + max_{τj ∈ succ(τi)} (c̄i,j + rank(τj))   (5)

where rank(τi) is the priority value of task τi, ET(τi, p) is the average execution time of task τi, succ(τi) is the set of successor tasks of τi, and c̄i,j is the average communication time from task τi to task τj. rank(τi) can be computed by traversing the task graph upward recursively.
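Formula 5 can be evaluated by a recursive upward traversal of the DAG (exit tasks get their average execution time only). The following Python sketch illustrates the computation on a hypothetical 3-task DAG; the task names, execution times and communication times are invented for illustration:

```python
# Upward rank (Formula 5): rank(t) = avg_ET(t) + max over successors s
# of (avg_comm(t, s) + rank(s)); exit tasks have no successor term.
def upward_rank(succ, avg_et, avg_comm):
    """succ: task -> list of successor tasks; avg_et: task -> average
    execution time; avg_comm: (task, task) -> average communication time."""
    memo = {}

    def rank(t):
        if t not in memo:
            tails = [avg_comm[(t, s)] + rank(s) for s in succ[t]]
            memo[t] = avg_et[t] + (max(tails) if tails else 0.0)
        return memo[t]

    return {t: rank(t) for t in succ}

# Illustrative DAG with one entry and two exits: t0 -> t1, t0 -> t2.
succ = {"t0": ["t1", "t2"], "t1": [], "t2": []}
avg_et = {"t0": 10.0, "t1": 5.0, "t2": 8.0}
avg_comm = {("t0", "t1"): 2.0, ("t0", "t2"): 1.0}
ranks = upward_rank(succ, avg_et, avg_comm)
order = sorted(succ, key=lambda t: -ranks[t])  # schedule by decreasing rank
```

Here rank(t0) = 10 + max(2 + 5, 1 + 8) = 19, so t0 is scheduled first, as any entry task must be.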

B. The MaxRe scheduling algorithm design

Suppose the user-required reliability is ℜ (for example, the required reliability for building the China ChangZheng II F rocket is 0.97). It is reasonable to take the required reliability for each task in the workflow to be the n-th root of ℜ (Formula 6), because for the exit tasks we cannot arbitrarily distinguish their importance without any preliminaries, and for the entry and internal tasks we believe that they are all equally important for their descendant tasks.

r = ℜ^{1/n}   (6)

To decide the processors on which the current task τi will be scheduled, the Total time (TT) and Current reliability (CR) are used:

Definition 2 (TT)

TT(pj) = ET(τi, pj) + Σ_{τk ∈ on(pj)} ET(τk, pj)   (7)

where ET(τi, pj) is the execution time when scheduling task τi on processor pj, and on(pj) is the set of tasks that have already been scheduled on processor pj.

Definition 3 (CR) CR(pj) is the probability that no failures occur on processor pj during the TT(pj) period. By the memoryless property of the Poisson process, the current reliability value is:

CR(pj) = e^{−λj TT(pj)} = e^{−λj ET(τi, pj)} × e^{−λj Σ_{τk ∈ on(pj)} ET(τk, pj)} = R(τi, pj) × ∏_{τk ∈ on(pj)} R(τk, pj)   (8)

The MaxRe scheduling algorithm is given in Alg. 1. In lines 1-6, we compute task τi’s execution time ET(τi, pj) on each processor pj and, based on it, the corresponding probability R(τi, pj) that no failures occur during that execution period. Line 7 computes the communication times between consecutive tasks. Based on the execution time set (ET) and the communication times (C), line 8 computes the task priority set (TP) using Formula 5, and line 9 sorts all tasks according to their priority values. Line 10 takes the user’s required reliability value ℜ and computes its n-th root r (Formula 6), which serves as the required reliability value for each task in the workflow. In lines 11-24, we repeatedly select the head task τi from the sorted set U (line 13), use Formula 7 and Formula 8 to compute the total time TT(τi, pj) and the current reliability CR(τi, pj) for each processor, and pass CR(τi, pj) to replica_num(r, τi, CR) (Alg. 2) to decide the number of replicas ξ for task τi. In lines 20-21, the ξ processors with the maximum CR(τi, pj) values are selected, and task τi is scheduled onto these processors. Then τi is deleted from U, and this whole process repeats until all tasks are scheduled.

Algorithm 1 The MaxRe Scheduling Algorithm
Require: G = (V, E), ℜ, and Λ = {λ0, λ1, ..., λm−1}.
Ensure: The processors to which the tasks will be scheduled.
1: for each task τi ∈ V do
2:   for each processor pj ∈ P do
3:     ET(τi, pj) ← compute the execution time using (τi.load, pj.speed);
4:     R(τi, pj) ← compute the reliability using (ET(τi, pj), λj);
5:   end for
6: end for
7: C ← compute the average communication times using (G, P);
8: TP ← compute the priority values for all tasks using Formula 5;
9: sort(V, TP); // sort all tasks by priority value
10: r ← root(ℜ); // compute the n-th root of ℜ using Formula 6
11: Θ = ∅, U = V; // start scheduling
12: while U ≠ ∅ do
13:   τi = head(U);
14:   for each processor pj ∈ P do
15:     TT(τi, pj) ← compute the total execution time using Formula 7;
16:     CR(τi, pj) ← (TT(τi, pj), λj); // compute CR using Formula 8
17:   end for
18:   sort(P, CR); // sort all processors by CR(τi, pj)
19:   ξ ← replica_num(r, τi, CR); // compute the number of replicas
20:   S ← select the first ξ maximum-CR processors from the sorted P;
21:   schedule task τi on the processors in S;
22:   put τi into Θ;
23:   U ← U \ {τi};
24: end while

The replica_num(r, τi, CR) algorithm is given in Alg. 2. The processors with the maximum CR values are selected to execute the user’s task (line 2), and this process is repeated until the achieved reliability is not less than r (lines 3-6). The number of replicas for task τi is ξ.

Algorithm 2 Decide the number of replicas for task τi: ξ ← replica_num(r, τi, CR)
Require: r, τi, CR.
Ensure: The number of replicas for task τi.
// the variable counter stores the number of replicas
1: counter = 0;
// the variable fail is the probability that all scheduled processors fail
2: fail = 1 − CR(τi, p0);
3: while (1 − fail) < r && counter < m do
4:   counter = counter + 1;
5:   fail = fail × (1 − CR(τi, p_counter));
6: end while
7: return counter;

C. Analysis of the MaxRe

Proposition 3 Denote the reliability value provided by the MaxRe algorithm as Ψ. When the required number of replicas for each task does not exceed the total number of processors, we have ℜ ≤ Ψ.

Proof: Based on the description of the MaxRe algorithm, the reliability r for task τi satisfies:

r ≤ 1 − ∏_{pk ∈ sche(τi)} (1 − CR(τi, pk))   (9)

where sche(τi) is the set of processors on which task τi’s replicas are scheduled. Letting F(τi) = 1 − ∏_{pk ∈ sche(τi)} (1 − CR(τi, pk)), we have:

ℜ = r^n ≤ ∏_{i=0}^{n−1} F(τi)   (10)

Next, we only need to prove that ∏_{i=0}^{n−1} F(τi) ≤ Ψ.

In Formula 8, CR(τi, pj) is the probability that no failures occur on processor pj from the beginning of the first task scheduled on pj until the finish of task τi. So F(τi) is the probability that, for task τi, at least one scheduled processor encounters no failure from the very beginning until τi’s finish. Randomly select two tasks τi and τj from the set V, giving F(τi) and F(τj) respectively. We now analyze the relation between the probability that both tasks succeed and the value F(τi) × F(τj). There are three situations:
1) If sche(τi) ∩ sche(τj) = Ø, the probability of both tasks’ success (Ψ) is equal to F(τi) × F(τj).
2) If sche(τi) = sche(τj), and supposing task τj is scheduled later than τi, the probability of both tasks’ success is equal to F(τj), and F(τj) > F(τi)F(τj).
3) If sche(τi) ∩ sche(τj) ≠ Ø and sche(τi) ≠ sche(τj), the probability of both tasks’ success satisfies:
1 − ∏_{pk ∈ sche(τi) ∪ sche(τj)} (1 − CR(τi(j), pk)) > (1 − ∏_{pk ∈ sche(τi)} (1 − CR(τi, pk))) × (1 − ∏_{pk ∈ sche(τj)} (1 − CR(τj, pk))).

Therefore, generalized to the case of n tasks, we have:

∏_{i=0}^{n−1} F(τi) ≤ Ψ   (11)

with equality holding if and only if all tasks in the set V are scheduled on totally different processors. Above all, ℜ ≤ Ψ.

[Fig. 2. An example of MaxRe scheduling result.]

Theorem 1 The time complexity of the MaxRe scheduling algorithm is O(MN log M + N log N + eN).

Proof: The time complexity of Alg. 2 is O(ξ), where ξ is the number of replicas and ξ ≤ M. In the MaxRe algorithm, lines 1 to 6 take O(MN) time, and lines 7 to 8 take O(e + eN). Using quicksort, line 9 takes O(N log N). Lines 12 to 23 take O(N(M + M log M + ξ)). Above all, the time complexity of the MaxRe scheduling algorithm is O(MN + e + eN + N log N + MN + MN log M + ξN), that is, O(MN log M + N log N + eN), where M is the number of processors and N is the number of tasks in the workflow.
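As an illustration of Alg. 1 and Alg. 2, the following simplified Python sketch implements the core MaxRe loop under the paper’s model. It is not the authors’ implementation: communication times and the priority computation are omitted (tasks are assumed already sorted by rank), and the instance data are invented for illustration:

```python
import math

def max_re(order, loads, speeds, lams, R_user):
    """order: task ids sorted by decreasing upward rank; loads[t]: task load;
    speeds[j], lams[j]: speed and failure rate of processor j;
    R_user: required job reliability (the user's R)."""
    n, m = len(order), len(speeds)
    r = R_user ** (1.0 / n)            # Formula 6: per-task requirement
    busy = [0.0] * m                   # accumulated ET per processor (on(pj))
    schedule = {}
    for t in order:
        et = [loads[t] / speeds[j] for j in range(m)]
        # Formula 8: CR(t, pj) = exp(-lam_j * (ET(t, pj) + busy_j))
        cr = [math.exp(-lams[j] * (et[j] + busy[j])) for j in range(m)]
        ranked = sorted(range(m), key=lambda j: -cr[j])
        # Alg. 2 idea: add processors by decreasing CR until
        # 1 - prod(1 - CR) >= r, or all m processors are used.
        chosen, fail = [], 1.0
        for j in ranked:
            chosen.append(j)
            fail *= 1.0 - cr[j]
            if 1.0 - fail >= r:
                break
        schedule[t] = chosen
        for j in chosen:
            busy[j] += et[j]
    return schedule

# Illustrative instance: 2 tasks, 3 processors.
sched = max_re(
    order=["t0", "t1"],
    loads={"t0": 100.0, "t1": 200.0},
    speeds=[10.0, 8.0, 5.0], lams=[0.002, 0.004, 0.008],
    R_user=0.95,
)
```

In this instance the fastest, most reliable processor alone satisfies the per-task requirement for t0, while t1 (scheduled onto an already busy processor, so with a lower CR) needs a second replica, showing the dynamic replica count.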

D. The reliability of the MaxRe’s result

Proposition 4 The reliability Ψ provided by the MaxRe scheduling algorithm satisfies:

ℜ ≤ Ψ ≤ ∏_{i=0}^{n−1} (1 − ∏_{pj ∈ sche(τi)} (1 − R(τi, pj)))   (12)

Proof: ℜ ≤ Ψ has been given in Proposition 3. Let F′(τi) = 1 − ∏_{pj ∈ sche(τi)} (1 − R(τi, pj)); F′(τi) is the probability that at least one replica of task τi succeeds, not considering the dependencies with the other tasks, and ∏_{i=0}^{n−1} F′(τi) is the probability that, for all tasks and not considering the dependencies, at least one replica of each task succeeds.
1) If all the replicas of all tasks are scheduled on totally different processors, the equality Ψ = ∏_{i=0}^{n−1} F′(τi) holds.
2) If at least two replicas are scheduled on the same processor, consider the example shown in Fig. 2: the workflow consists of 4 tasks τ0, τ1, τ2, τ3, and the system consists of 4 processors. After MaxRe scheduling, τ0 → {p1, p3}, τ1 → {p2, p4}, τ2 → {p1, p3, p4}, τ3 → {p1, p2, p3}. By the processors’ fail-stop property, the failure of τ0^1 results in the failure of both replicas τ2^1 and τ3^2, and the failures of τ0^1, τ1^1, τ2^0 and τ3^1 result in all processors’ failure. So Ψ < ∏_{i=0}^{n−1} F′(τi).

Above all, ℜ ≤ Ψ ≤ ∏_{i=0}^{n−1} (1 − ∏_{pj ∈ sche(τi)} (1 − R(τi, pj))) is established, with equality holding if and only if all replicas of all tasks are scheduled on totally different processors.

IV. EXPERIMENTS

We make use of 4 workflows of considerable complexity to evaluate the MaxRe scheduling algorithm. As shown in Fig. 3(a), the first workflow is a classic workflow example from [15]. To study the algorithm’s practicability, we also select 3 realistic workflows: LQCD (Fig. 3(b)) [18] has a structure similar to the sample, while Stencil (Fig. 3(c)) [5] and Doolittle (Fig. 3(d)) are two realistic workflows with multiple entry and exit nodes.

[Fig. 3. The workflows used in the experiments: (a) Sample 1; (b) LQCD; (c) Stencil; (d) Doolittle.]

The parameters for the tasks and processors are shown in Tab. I.

TABLE I
THE PARAMETERS FOR THE TASKS AND PROCESSORS

Task:      Load 100 ∼ 500 | Com load 9 ∼ 29
Processor: No. 10/20 | Speed 5 ∼ 19 | λ × 10³ 2 ∼ 8 | Com speed 0.8 ∼ 1.2

Load is the computation load of a task, and Com load is the transmission load from a parent task to one of its child tasks. No. is the number of processors in the system, Speed is the computation speed of a processor, λ is the parameter of the Poisson distribution, and Com speed is the communication speed between two processors. All these parameters are initialized with random values in the corresponding ranges.

The MaxRe scheduling algorithm is evaluated from three aspects: the verification of Proposition 4, the resource usage in comparison with the FTSA algorithm, and the execution time in comparison with the FTSA algorithm. Because both the FTSA and CAFT algorithms employ the same original active replication scheme, and our main target is to compare resource usage against the active replication scheme, only the FTSA algorithm is considered in the experiments; besides, we believe FTSA and CAFT do not differ much in resource usage compared with MaxRe.

A. The verification of Proposition 4

The correctness of Proposition 4 is evaluated in three experiments (Fig. 4). In each experiment, the workflow is executed 1000 times, with the user-required reliability set to 0.93, 0.95 and 0.97 respectively. The exact reliability of the MaxRe schedule is estimated as the success rate = success/1000. In Fig. 4(a), when the user-required reliability is 0.93, Ψ ≥ 0.96 and ∏_{i=0}^{n−1} F′(τi) ≥ 0.98. In Fig. 4(b), when the user-required reliability is 0.95, we have 0.97 < Ψ < 0.98 and ∏_{i=0}^{n−1} F′(τi) > 0.98. In Fig. 4(c), when the user-required reliability is 0.97, we have 0.98 < Ψ < 0.99 and ∏_{i=0}^{n−1} F′(τi) > 0.99. Therefore, Proposition 4 is fully verified by these experiments.

[Fig. 4. The verification of Proposition 4: R(user), R(MaxRe) and R(F′(x)) over the four workflows when the user-required reliability is (a) 0.93; (b) 0.95; (c) 0.97.]

B. The resource usage in comparison with the FTSA algorithm

The FTSA (Fault Tolerant Scheduling Algorithm) is introduced in [6] as a fault-tolerant extension of the classic

600

700

MaxRe

Resource usage

500

1000

FTSA MaxRe

400

800

300

600

200

400

100

200

250

MaxRe

200

FTSA-E FTSA-L MaxRe-E MaxRe-L

400

120

200

200

100

FTSA-E FTSA-L MaxRe-E MaxRe-L

150

80

150

300

250

FTSA-E FTSA-L MaxRe-E MaxRe-L

140

Time(second)

500

FTSA

160

300

1200 FTSA

600

100

60

100

100

40 50 50

0

0 workflow 1

workflow 2

workflow 3

20

0 workflow 1

workflow 4

workflow 2

workflow 3

workflow 4

workflow 1

workflow 2

workflow 3

workflow 1

(a)

(b)

(c)

900

FTSA

1600

800

MaxRe

1400

700

workflow 2

workflow 3

workflow 1

workflow 4

(a)

workflow 2

workflow 3

workflow 1

workflow 4

(b)

workflow 2

workflow 3

workflow 4

workflow 3

workflow 4

(c)

1600

1800

1000

0

0

0

workflow 4

1400 FTSA MaxRe

1200

1200

FTSA-E FTSA-L MaxRe-E MaxRe-L

200

1000

600

300

250

FTSA MaxRe

1000 800

500

250 200

200

160 140 120

150

800 400

FTSA-E FTSA-L MaxRe-E MaxRe-L

180

FTSA-E FTSA-L MaxRe-E MaxRe-L

600 400

200

400

100

200

200

0

0

0

workflow 1

workflow 2

workflow 3

workflow 4

workflow 1

workflow 2

workflow 3

workflow 4

100

workflow 1

(e)

workflow 2

workflow 3

50

50

0

0

workflow 4

(f)

workflow 1

Tasks τ0 τ1 τ2 τ3 τ4 τ5 τ6 τ7 τ8 τ9

FTSA p6 p9 p6 p9 p2 p0 p6 p3 p0 p9 p2 p1 p6 p2 p9 p0 p4 p8 p6 p9

60 40

MaxRe p0 p0 p9 p2 p3 p5 p6 p4 p0 p6 p0 p9 p7 p1 p0 p2

workflow 4

0 workflow 1

workflow 2

(e)

workflow 3

workflow 4

workflow 1

workflow 2

(f)

FTSA is 0.45, which equals to the sum of 0-failure’s and 1failure’s probability. The MaxRe sets ℜ = 0.45, schedules less processors than FTSA (e.g. τ0 , τ1 , τ4 , τ5 need only 1 processor). Moreover, the average CU A cost by FTSA is 396, while by MaxRe is 262 (Fig. 5(a)). For all situations (Fig. 5), we get the following conclusions: 1) The MaxRe consumes at most 70% fewer resources than the FTSA algorithm in Fig. 5(e) and Fig. 5(f). 2) To tolerant more failures, FTSA algorithm needs more resources than the MaxRe algorithm. When the system consists of 10 processors, as shown in Fig. 5(a), Fig. 5(c), and Fig. 5(e), the average ratio of CU A(M axRe)/CU A(F T SA) decreases from 71.3%, 49.4% to 31.9%. When the system consists of 20 processors, the average ratio decreases from 71.7% (Fig. 5(b)), 45%(Fig. 5(d)) to 32.2% (Fig. 5(f)). 3) The workflow that consists of more tasks requires more CUAs. 4) The workflow consumes fewer CUAs if more processors exist in the system. This is because a larger system always have more faster processors.

HEFT algorithm [15]. In FTSA, at each step of the scheduling process, the free task τi with the highest priority is simulated by mapping on all processors. The first ε + 1 (ε is the number of failures that FTSA can tolerate) processors that allow the minimum finish time are scheduled. The resource usage is measured by the metric of CUA (CPU Usage Amount), which is defined in Formula 13: (ET (τi , pj ) × xij )

workflow 3

Fig. 7. The execution time: (a) describes the execution time when ε = 1, m = 10; (b) describes the execution time when ε = 1, m = 20; (c) describes the execution time when ε = 2, m = 10; (d) describes the execution time when ε = 2, m = 20; (e) describes the execution time when ε = 3, m = 10; (f) describes the execution time when ε = 3, m = 20;

The schedule for Sample 1 by FTSA and MaxRe

i<n j<m ∑ ∑

workflow 2

(d)

Fig. 5. The resource usage: (a) describes the resource usage when ε = 1, m = 10; (b) describes the resource usage when ε = 1, m = 20; (c) describes the resource usage when ε = 2, m = 10; (d) describes the resource usage when ε = 2, m = 20; (e) describes the resource usage when ε = 3, m = 10; (f) describes the resource usage when ε = 3, m = 20;

CU A =

80

100

20

(d)

Fig. 6.

100

150

600

300

(13)

i=0 j=0

where xij = 1 if the task τi is submitted on processor pj , otherwise xij = 0. In this experiment, we set the ε value with 1, 2, and 3, and the number of processors with m = 10 and m = 20. The ability of tolerating ε failures is translated into reliability value by computing the probability of at most ε failures occurring. The achieved reliability by FTSA algorithm is also the requirement reliability for the MaxRe algorithm. Taking the Sample 1(Fig. 3(a)), ε = 1, m = 10 as an example, as shown in Fig. 6, FTSA schedules all tasks to 2 processors (e.g. τ0 is scheduled on p6 and p9 ). Using the parameters in Tab. I, we get the reliability value achieved by

C. The execution time with comparison to the FTSA algorithm

Both the earliest finish time and the latest finish time are considered in the FTSA and MaxRe algorithms. In FTSA-E and MaxRe-E, once one replica of task τi finishes successfully, its result is sent to its successors. In FTSA-L and MaxRe-L, the result of task τi is sent to its successors only after all of its replicas finish. To evaluate the time performance, we compute te and tf: for the earliest finish time, te = time(FTSA-E)/time(MaxRe-E), while for the latest finish time, tf = time(FTSA-L)/time(MaxRe-L). The


TABLE II
THE COMPARISON ON TIME BETWEEN THE FTSA AND THE MAXRE

Experiment |  a   |  b   |  c   |  d    |  e    |  f
    te     | 0.79 | 0.81 | 0.93 | 0.86  | 0.825 | 0.896
    tf     | 0.82 | 0.74 | 1.02 | 0.906 | 0.96  | 0.96

[3] Q. Zheng, B. Veeravalli, On the Design of Fault-Tolerant Scheduling Strategies Using Primary-Backup Approach for Computational Grids with Low Replication Costs. IEEE Transactions on Computers, vol. 58(3), pp. 380-393, 2009.
[4] Q. Zheng, B. Veeravalli, C.-K. Tham, On the design of communication-aware fault-tolerant scheduling algorithms for precedence constrained tasks in grid computing systems with dedicated communication devices. Journal of Parallel and Distributed Computing, vol. 69(3), pp. 282-294, 2009.
[5] A. Benoit, M. Hakem, Y. Robert, Contention awareness and fault-tolerance scheduling for precedence constrained tasks in heterogeneous systems. Parallel Computing, vol. 35(2), pp. 83-108, 2009.
[6] A. Benoit, M. Hakem, Y. Robert, Fault Tolerant Scheduling of Precedence Task Graphs on Heterogeneous Platforms. INRIA Research Report No. 2008-03, 2007.
[7] H. Jin, X.H. Sun, Z. Zheng, et al., Performance under Failures of DAG-based Parallel Computing. Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, Paris, pp. 236-243, 2009.
[8] J.J. Dongarra, E. Jeannot, E. Saule, et al., Bi-objective Scheduling Algorithms for Optimizing Makespan and Reliability on Heterogeneous Systems. In Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures, ACM Press, San Diego, pp. 280-288, 2007.
[9] A. Dogan, F. Ozguner, Biobjective scheduling algorithms for execution time-reliability trade-off in heterogeneous computing systems. The Computer Journal, vol. 48(3), pp. 300-314, 2005.
[10] Y. He, Z. Shao, B. Xiao, et al., Reliability Driven Task Scheduling for Tightly Coupled Heterogeneous Systems. In Proc. IASTED International Conference on Parallel and Distributed Computing and Systems (IASTED PDCS), pp. 465-470, Marina Del Rey, CA, Nov. 2003.
[11] S. Swaminathan, G. Manimaran, A Reliability-aware Value-based Scheduler for Dynamic Multiprocessor Real-time Systems. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), pp. 98-104, Florida, US, 2002.
[12] K. Hashimoto, T. Tsuchiya, T. Kikuno, Effective scheduling of duplicated tasks for fault-tolerance in multiprocessor systems. IEICE Transactions on Information and Systems, vol. E85-D(3), pp. 525-534, 2002.
[13] X. Qin, H. Jiang, A novel fault-tolerant scheduling algorithm for precedence constrained tasks in real-time heterogeneous systems. Parallel Computing, vol. 32, pp. 331-356, 2006.
[14] M. Hakem, F. Butelle, Reliability and Scheduling on Systems Subject to Failures. In Proceedings of the 2007 International Conference on Parallel Processing, pp. 38-47, 2007.
[15] H. Topcuoglu, S. Hariri, M.Y. Wu, Performance effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, vol. 13(3), pp. 260-274, 2002.
[16] C.M. Wang, S.T. Wang, H.M. Chen, et al., A Reliability-Aware Approach for Web Services Execution Planning. 2007 IEEE Congress on Services, pp. 278-283, Salt Lake City, USA, 2007.
[17] A. Girault, H. Kalla, M. Sighireanu, et al., An Algorithm for Automatically Obtaining Distributed and Fault-Tolerant Static Schedules. The 2003 International Conference on Dependable Systems and Networks (DSN'03), pp. 1-10, San Francisco, USA, 2003.
[18] L. Piccoli, X.-H. Sun, J.N. Simone, D.J. Holmgren, et al., The LQCD Workflow Experience: What Have We Learned. Posters of the ACM/IEEE SuperComputing Conference (SC07), Nov. 2007.
[19] N. Raju, Y. Liu, C.B. Leangsuksun, et al., Reliability Analysis in HPC clusters. Proceedings of the High Availability and Performance Computing Workshop, 2006.
[20] J.W. Young, First Order Approximation to the Optimal Checkpoint Interval. Comm. ACM, vol. 17(9), pp. 530-531, 1974.


results are shown in Tab. II. From Tab. II and Fig. 7, we draw the following conclusions: 1) As the numbers of processors, tasks, and failures in the system increase, both te and tf increase. For workflow 4 in Fig. 7(c), Fig. 7(d), Fig. 7(e), and Fig. 7(f), MaxRe-L is even shorter than FTSA-L. This shows that the MaxRe algorithm is more suitable for large-scale systems. 2) The te values are mostly greater than 0.8, and some tf values even exceed 1. In the worst case, MaxRe-E is 21% longer than FTSA-E (Fig. 7(a)), and MaxRe-L is 26% longer than FTSA-L (Fig. 7(b)). Compared with the up-to-70% savings in resource usage, this overhead is quite acceptable.

V. CONCLUSION AND FUTURE WORKS

To satisfy the user's reliability requirement with minimum resources, we design the MaxRe scheduling algorithm for heterogeneous systems. The MaxRe algorithm schedules a dynamic number of replicas for different tasks. Both theoretical analysis and experiments prove that the MaxRe algorithm satisfies the user's reliability requirement. Compared with the FTSA algorithm, the experiments show that MaxRe is more scalable, especially when more processors, tasks, and failures exist in the system. Specifically, its resource usage is up to 70% lower than FTSA's, and its time performance is acceptable, even shorter than FTSA's in some experiments. Our future work will concentrate on scheduling algorithms that reduce resource usage while ensuring both reliability and deadline requirements. The analysis of communication-link failures should also be incorporated. Moreover, more work is needed on system performance analysis, for example, the maximum reliability value achievable by the system.

ACKNOWLEDGMENT

The authors would like to thank Mr. Rishiraj Bhattacharyya, who gave many constructive comments on this paper.
The first and second authors of this research are supported by a governmental scholarship from the China Scholarship Council. The first author is also partly supported by a grant from the Graduate School of ISEE, Kyushu University, for students' overseas travel.

REFERENCES

[1] C. Dabrowski, Reliability in grid computing systems. Concurrency and Computation: Practice and Experience, vol. 21(8), pp. 927-959, 2009.
[2] I. Koren, C.M. Krishna, Fault-Tolerant Systems. Morgan Kaufmann, San Francisco, 2006.
