I am writing performance-sensitive code. I implemented a simple scheduler to distribute the workload, and the master thread is in charge of the scheduler.
    cpu_set_t cpus;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    for(int i_group =0; i_group<n_groups; i_group++){
        std::cout << i_t<< "\t"<<i_group << "th group of cpu" <<std::endl;
        for(int i =index ; i < index+group_size[i_group]; i++){
            struct timeval start, end;
            double spent_time;
            gettimeofday(&start, NULL);
            arguments[i].i_t=i_t;
            arguments[i].F_x=F_xs[i_t];
            arguments[i].F_y=F_ys[i_t];
            arguments[i].F_z=F_zs[i_t];
            CPU_ZERO(&cpus);
            CPU_SET(arguments[i].thread_id, &cpus);
            int err= pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpus);
            if(err!=0){
                std::cout << err <<std::endl;
                exit(-1);
            }
            arguments[i].i_t=i_t;
            pthread_create( &threads[i], &attr, &cpu_work, &arguments[i]);
            gettimeofday(&end, NULL);
            spent_time = ((end.tv_sec - start.tv_sec) * 1000000u + end.tv_usec - start.tv_usec) / 1.e6;
            std::cout <<"create: " << spent_time << "s " << std::endl;
        }
        i_t++;
        cpu_count++;
        arr_finish[i_group]=false;
    }
} // closes an enclosing block that is not shown in this excerpt
The master thread creates the child threads as shown above. To keep the explanation simple, I will assume i_group=1. The child threads divide and conquer a batch of matrix-matrix multiplications. Here rank means thread_id.
    // Range of i_z slices handled by this thread (rank is 1-based)
    int local_first = size[2]*( rank -1 )/n_compute_thread ;
    int local_end = size[2] * rank/n_compute_thread-1;
    //mkl_set_num_threads_local(10);
    gettimeofday(&start, NULL);
    for(int i_z=local_first; i_z<=local_end; i_z++ ){
        cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                     size[0], size[1], size[0], 1.0, F_x, size[0],
                     rho[i_z], size[1], 0.0, T_gamma[i_z], size[1] );
    }
    for(int i_z=local_first; i_z<=local_end; i_z++ ){
        cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                     size[0], size[1], size[1], 1.0, T_gamma[i_z], size[0],
                     F_y, size[1], 0.0, T_gamma2[i_z], size[0] );
    }
    gettimeofday(&end, NULL);
    // Elapsed wall-clock time in seconds, computed the same way as in the master thread
    spent_time = ((end.tv_sec - start.tv_sec) * 1000000u + end.tv_usec - start.tv_usec) / 1.e6;
    std::cout <<i_t <<"\t"<< arg->thread_id <<"\t"<< sched_getcpu()<< "\t" << "compute: " <<spent_time << "s" <<std::endl;
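The timing above uses gettimeofday, which measures wall-clock time only. One way to narrow down where the time goes is to also record the CPU time consumed by the calling thread. The following is a minimal sketch, not part of my actual code: the names Elapsed, to_seconds and time_work are made up for illustration, and it assumes Linux clock_gettime with CLOCK_MONOTONIC and CLOCK_THREAD_CPUTIME_ID.

```cpp
#include <cassert>
#include <time.h>

// Hypothetical helper type: wall-clock and per-thread CPU time of one region.
struct Elapsed {
    double wall_s;  // elapsed wall-clock time in seconds
    double cpu_s;   // CPU time consumed by the calling thread in seconds
};

// Difference b - a in seconds.
static double to_seconds(const timespec& a, const timespec& b) {
    return static_cast<double>(b.tv_sec - a.tv_sec)
         + static_cast<double>(b.tv_nsec - a.tv_nsec) / 1e9;
}

// Runs work() once on the calling thread and measures both clocks around it.
template <typename Work>
Elapsed time_work(Work work) {
    timespec w0, w1, c0, c1;
    int rc = clock_gettime(CLOCK_MONOTONIC, &w0);
    assert(rc == 0);
    rc = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &c0);
    assert(rc == 0);
    work();
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &c1);
    clock_gettime(CLOCK_MONOTONIC, &w1);
    (void)rc;  // silence unused warning when NDEBUG is defined
    return Elapsed{to_seconds(w0, w1), to_seconds(c0, c1)};
}
```

In the worker, the two dgemm loops could be wrapped in a lambda and passed to time_work. If a slow thread shows a wall time much larger than its CPU time, it was probably waiting rather than computing. Note that the CPU clock covers only the calling thread, so any helper threads started by the BLAS library are not counted.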
Even though the workload is distributed fairly evenly, the performance of the threads varies a lot. See the result below.
5 65 4 4 compute: 0.270229s
5 64 1 1 compute: 0.284958s
5 65 2 2 compute: 0.741197s
5 65 3 3 compute: 0.76302s
The second column shows how many matrix-matrix multiplications were done in a particular thread, and the last column shows the elapsed time. When I first saw this result, I thought it was related to thread affinity, so I added several lines to control the binding of threads. However, that did not change the trend in the last column.
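For reference, the counts in the second column (64, 65, 65, 65) follow from the integer arithmetic in local_first and local_end. The sketch below reproduces them. The helper name slices_per_rank is made up for illustration, and size[2] = 259 with n_compute_thread = 4 is an assumption chosen because it yields exactly these counts.

```cpp
#include <cassert>
#include <vector>

// Number of i_z slices assigned to each rank (1-based), using the same
// integer arithmetic as local_first / local_end in the worker code.
std::vector<int> slices_per_rank(int n_slices, int n_compute_thread) {
    assert(n_compute_thread > 0);
    std::vector<int> counts;
    for (int rank = 1; rank <= n_compute_thread; ++rank) {
        int local_first = n_slices * (rank - 1) / n_compute_thread;
        int local_end   = n_slices * rank / n_compute_thread - 1;
        counts.push_back(local_end - local_first + 1);  // inclusive range
    }
    return counts;
}
```

With the assumed values, slices_per_rank(259, 4) gives 64 slices for rank 1 and 65 for ranks 2 to 4, so the per-thread work differs by at most one slice and cannot by itself explain a factor of almost three in run time.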
My computer has 20 physical cores and 20 virtual cores. I created only 4 child threads for this test. It was tested on a Linux machine.
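Since the machine has both physical and virtual cores, it may be worth checking which logical CPU numbers share a physical core before choosing the CPUs to bind to. A minimal sketch, assuming the standard Linux sysfs topology files are available; the helper names siblings_path and read_siblings are made up for illustration.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Path of the sysfs file listing the logical CPUs that share a physical
// core with the given logical CPU (Linux only).
std::string siblings_path(int cpu) {
    assert(cpu >= 0);
    return "/sys/devices/system/cpu/cpu" + std::to_string(cpu)
         + "/topology/thread_siblings_list";
}

// Returns the content of that file as one line of text, or an empty
// string if the file cannot be read (for example on a non-Linux system).
std::string read_siblings(int cpu) {
    std::ifstream in(siblings_path(cpu));
    std::string line;
    if (in) {
        std::getline(in, line);
    }
    return line;
}
```

Calling read_siblings for each of the CPUs used in the test (1 to 4 in the output above) shows whether any two of them are hardware threads of the same core, in which case they would compete for the same execution units.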
via Chebli Mohamed