This all sounds pretty normal, and there are many possible reasons for what you're seeing.
1. To execute on the GPU, your data has to be copied over the bus from host (CPU) memory to the GPU, then the results copied back. If you're only doing one multiply per element, the execution time of your program will be dominated by the data copy time. To get the full benefit of a GPU, you have to move data over to it, then do *lots* of computation on it before copying results back.
2. Trig functions are more expensive than simple arithmetic like multiply and add. A modern GPU's execution units can complete a multiply every clock cycle, for every thread in a warp (a "wavefront" on AMD hardware). But trig functions are usually done by "special function units", which are fewer in number than the multipliers and are also slower. So for anything that you want to be fast, find a way to take the trig out of it :)
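3. You also have to set up the problem just right. A bunch of things about the way you write OpenCL code can affect the performance: the work-group size you choose, how your work-items access global memory, and whether you make use of local memory, to name a few. It'll just take reading and practice to get good at this.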
Is there any specific problem you're trying to solve, or are you just learning OpenCL in general?