In this section we will look in more detail at how Unified Memory works. Memory is allocated using cudaMallocManaged:

```cpp
float *x;
cudaMallocManaged(&x, n * sizeof(float));
```

After that, the CUDA runtime will take care of whether x is used as a host or a device pointer. There is, however, one additional requirement: before the host reads data produced by a kernel, the kernel must have completed. To make sure that it has, we call cudaDeviceSynchronize.

In pre-Pascal architectures, cudaMallocManaged allocates memory on the device. When the CPU accesses data that is actually on the device, it page faults and the data is transferred to host memory. That is basically demand paging, but instead of the data being on disk, it is in GPU memory. The opposite direction, from host to device, did not quite work the same way: when a kernel is launched, the migration engine on the GPU transfers all allocated data to GPU memory, in case the GPU needs it.

Starting from the Pascal architecture, the GPU uses "demand paging" like the CPU. If a thread accesses a page that is not resident in GPU memory, it page faults and the migration engine transfers that page to GPU memory.

We illustrate this difference with the code below. Note that w points to memory that is allocated but never used on the GPU.

```cpp
template <typename T>
__global__ void saxpy(T *z, T *x, T *y, T a, int n);
```

To compare, we run the code on a Tesla K80 (Kepler) and a Tesla P100 (Pascal). NOTE: Unified Memory features are limited on Windows (and WSL).

As mentioned before, the pre-Pascal K80 has limited support for page migration. When we run nvprof on the K80 we get the following:

```
Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
   12  1.3333MB  128.00KB  2.0000MB  16.00000MB  2.365652ms  Host To Device
  140  146.29KB  4.0000KB  0.9961MB  20.00000MB  3.146866ms  Device To Host
```

It can be described as "demand paging in one direction". Initially, when cudaMallocManaged is called, 4MB is allocated on the device for each of the pointers x, y, z, and w. After allocation, the four blocks are initialized on the CPU; this causes a copy of 16MB (via page faults) from device to host memory. When the kernel is launched, the runtime does not know which data will be used, so it transfers all 16MB from the host to the device. Finally, the values of z are accessed by the host, which causes an additional 4MB to be transferred from device to host and brings the device-to-host total to 20MB.