Wednesday, August 5, 2015

Performance difference of pthreads


I am writing performance-sensitive code. I implemented a simple scheduler to distribute workloads, and a master thread takes charge of the scheduler.

cpu_set_t cpus;
pthread_attr_t attr;
pthread_attr_init(&attr);
for(int i_group = 0; i_group < n_groups; i_group++){
    std::cout << i_t << "\t" << i_group << "th group of cpu" << std::endl;
    for(int i = index; i < index + group_size[i_group]; i++){
        struct timeval start, end;
        double spent_time;
        gettimeofday(&start, NULL);
        arguments[i].i_t = i_t;
        arguments[i].F_x = F_xs[i_t];
        arguments[i].F_y = F_ys[i_t];
        arguments[i].F_z = F_zs[i_t];
        // pin the new thread to the CPU matching its thread_id
        CPU_ZERO(&cpus);
        CPU_SET(arguments[i].thread_id, &cpus);
        int err = pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpus);
        if(err != 0){
            std::cout << err << std::endl;
            exit(-1);
        }
        pthread_create(&threads[i], &attr, &cpu_work, &arguments[i]);
        gettimeofday(&end, NULL);
        spent_time = ((end.tv_sec - start.tv_sec) * 1000000u + end.tv_usec - start.tv_usec) / 1.e6;
        std::cout << "create: " << spent_time << "s" << std::endl;
    }
    i_t++;
    cpu_count++;
    arr_finish[i_group] = false;
}

The master thread creates the workers as shown above. For a simple explanation, I will assume i_group = 1. The child threads divide and conquer a batch of matrix-matrix multiplications. Here, rank means thread_id.

int local_first = size[2]*( rank -1 )/n_compute_thread ;
int local_end = size[2] * rank/n_compute_thread-1;
//mkl_set_num_threads_local(10); 

gettimeofday(&start, NULL);
for(int i_z=local_first; i_z<=local_end; i_z++ ){
    cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                size[0], size[1], size[0], 1.0,  F_x, size[0],
                rho[i_z], size[1], 0.0, T_gamma[i_z], size[1] );
}
for(int i_z=local_first; i_z<=local_end; i_z++ ){
    cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                 size[0], size[1], size[1], 1.0, T_gamma[i_z], size[0],
                 F_y, size[1], 0.0, T_gamma2[i_z], size[0] );
}
gettimeofday(&end, NULL);
spent_time = ((end.tv_sec - start.tv_sec) * 1000000u + end.tv_usec - start.tv_usec) / 1.e6;
std::cout << i_t << "\t" << arg->thread_id << "\t" << sched_getcpu() << "\t" << "compute: " << spent_time << "s" << std::endl;

Even though the workload is fairly distributed, the performance of each thread varies too much. See the result below:

5 65 4 4 compute: 0.270229s
5 64 1 1 compute: 0.284958s
5 65 2 2 compute: 0.741197s
5 65 3 3 compute: 0.76302s

The second column shows how many matrix-matrix multiplications were done by a particular thread; the last column shows the elapsed time. When I first saw this result, I thought it was related to thread affinity, so I added the lines above to control the binding of threads. However, that did not change the trend in the last column.

My computer has 20 physical cores and 20 virtual (hyper-threaded) cores. I created only 4 child threads for this test. It was, of course, run on a Linux machine.



