How to re-use global memory between kernel invocations

How to re-use global memory between kernel invocations

devmonkey
Hi,

My use case is neural-network related.

I need to copy a large amount of sample data to global memory. This data does not change between training cycles (each cycle is a kernel invocation), so it can sit on the card all day or week. However, the network weights are updated on the host after every invocation and therefore need to be copied back to the card before every kernel invocation.

Can anyone suggest the correct approach to this? Or should I not be copying data at all, but rather mapping the card's memory into the host and writing to it directly?

Thanks, Joe

Re: How to re-use global memory between kernel invocations

Wade Walker
Administrator
Hi Joe,

I'm not quite clear about your use case: you say the data doesn't change between kernel invocations, but then you say it's changed on the host and needs to be copied back out to the card between kernel invocations. Do you mean that the kernels don't change it, but the host does? Or something else?

Also, are you on an architecture that shares physical memory between the host and the OpenCL device (like Intel OpenCL)? Or are you on an architecture that has separate device memory (like most Nvidia and AMD cards)?

Re: How to re-use global memory between kernel invocations

Wade Walker
Administrator
In reply to this post by devmonkey
Ah, after reading more closely I think I get it: you've got sample input data that you're using to train a neural net (copied to the device once and left there), and you've got the neuron weights (modified on the host; I'm not sure whether they're also modified on the card).

Performance still depends on whether you're on a shared-memory device or not, but usually you'd create the sample data buffers with CL_MEM_READ_ONLY|CL_MEM_USE_HOST_PTR, and the weights buffers with CL_MEM_USE_HOST_PTR (assuming the host already allocated the sample data and weights arrays initially). This should minimize copying, even though on a non-shared-memory device there will still be unavoidable copying out to the device. I'm not sure whether you can also use CL_MEM_READ_ONLY on the weights data; that depends on your algorithm and whether your kernels ever write to the weights.
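
Roughly, the setup might look like the sketch below. This is untested and illustrative only: the context, queue, kernel, kernel-argument order, and array names are placeholders for whatever your program actually uses.

/* Sketch: sample data uploaded once and left resident on the card,
 * weights refreshed from the host before every training cycle. */
#include <CL/cl.h>

void train(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
           float *samples, size_t num_samples,
           float *weights, size_t num_weights,
           size_t global_size, int num_cycles)
{
    cl_int err;

    /* Sample data: copied to the card once; kernels only read it. */
    cl_mem sample_buf = clCreateBuffer(ctx,
        CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
        num_samples * sizeof(float), samples, &err);

    /* Weights: rewritten by the host before every invocation. */
    cl_mem weight_buf = clCreateBuffer(ctx,
        CL_MEM_USE_HOST_PTR,
        num_weights * sizeof(float), weights, &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &sample_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &weight_buf);

    for (int cycle = 0; cycle < num_cycles; ++cycle) {
        /* Map the weights buffer, apply the host-side updates, then unmap
         * so the new values are visible to the device for the next run. */
        float *w = (float *) clEnqueueMapBuffer(queue, weight_buf, CL_TRUE,
            CL_MAP_WRITE, 0, num_weights * sizeof(float), 0, NULL, NULL, &err);
        /* ... update the weights through `w` here ... */
        clEnqueueUnmapMemObject(queue, weight_buf, w, 0, NULL, NULL);

        /* The sample data is never re-sent; it stays resident on the card. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
            &global_size, NULL, 0, NULL, NULL);
        clFinish(queue);
    }

    clReleaseMemObject(sample_buf);
    clReleaseMemObject(weight_buf);
}

The map/unmap in the loop is one way to push the per-cycle weight changes out to the card; calling clEnqueueWriteBuffer from a separate host array would also work. Which is cheaper depends on whether the device shares memory with the host.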