I've been knee-deep in this stuff for the last few months (and in similar stuff for years), and I spotted a few obvious, simple things that were easy to address. It might be a bit off-topic for the forum, but it's a bit quiet otherwise.
Actually, I meant the input values, vin, etc. The non-memory arguments are handled by the driver (kernel.setArg(), etc.), are potentially cached on the server anyway, and in any event are too small to matter in most cases. It's the memory buffer synchronisation that slows things down.
I don't have a specific example handy, but the example I gave should be simple enough. Perhaps think of the allocations on the device like any other memory allocations: they stay around with their contents until you free them, as far as any kernel is concerned at least. If you have an array in CPU code, you don't (normally) copy the array in order to call a function that works on it, and a GPU is no different.
Your code was already basically doing it, but it was copying the same data to the CPU and then back to the GPU on every loop iteration, for no purpose or effect.
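To make that concrete without needing a real device, here's a rough sketch in plain C++. It is not OpenCL: MockDevice, upload, download, run_kernel and the two loop functions are all made-up names for illustration, and the "kernel" is an arbitrary in-place update. In real host code the upload and download would roughly correspond to enqueueWriteBuffer and enqueueReadBuffer calls on a buffer you created once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a device. "Device memory" is just a vector that
// keeps its contents between kernel runs, and every host<->device copy is counted.
struct MockDevice {
    std::vector<float> mem;     // persists until overwritten or released
    std::size_t transfers = 0;  // number of host<->device copies so far

    void upload(const std::vector<float>& host) {
        mem = host;
        ++transfers;
    }

    void download(std::vector<float>& host) {
        assert(host.size() == mem.size());
        host = mem;
        ++transfers;
    }

    // Stand-in for a kernel launch: works in place on device memory.
    void run_kernel() {
        for (float& v : mem) v = v * 0.5f + 1.0f;
    }
};

// Copies the data over and back on every iteration.
std::size_t round_trip_every_iteration(std::vector<float>& data, int iterations) {
    MockDevice dev;
    for (int i = 0; i < iterations; ++i) {
        dev.upload(data);
        dev.run_kernel();
        dev.download(data);
    }
    return dev.transfers;
}

// Uploads once, iterates on the device, downloads once at the end.
std::size_t keep_resident(std::vector<float>& data, int iterations) {
    MockDevice dev;
    dev.upload(data);
    for (int i = 0; i < iterations; ++i) dev.run_kernel();
    dev.download(data);
    return dev.transfers;
}
```

Both functions leave the same values in data. For 10 iterations the first one does 20 copies and the second does 2, and on real hardware each of those copies is a trip across the bus.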
More or less, but just write the results to global memory. Local memory only exists for the duration of the kernel execution, and is only shared amongst work-items in the same work-group (I think this is how your original code was treating the global memory). Once the results have been written by the kernel, you can simply read that memory from the CPU. I mentioned an additional kernel in case you didn't want to put this into the main kernel; it would just grab some samples from the 'current result'.
But since in this case it's only for debugging it probably isn't too important how it's done.
It could easily explain such a difference. Mobile devices usually have slower memory and I/O, and sometimes fewer transistors (i.e. fewer processors) running at lower clock speeds. And even apart from that, there are so many different devices with varying capabilities and generations.
On the flip side, you might actually have more to gain from better code which more fully utilises the processors and lowers the memory bandwidth requirements. Even on a slow card, registers are fast.