This post is written in response to a Quora question, which asks (somewhat vaguely) why we cannot use GPU hardware for all computation tasks. Feel free to upvote my answer on Quora!
CPUs and GPUs are fundamentally very different computational devices, but not many people realize it. A CPU has a few low-latency cores, elaborate flow control (prefetching, branch prediction, etc.), large caches, and a large amount of relatively inexpensive RAM. A GPU is a massively parallel device that uses expensive high-throughput memory: GPU memory is optimized for throughput, but not necessarily for latency.
Each GPU core is slow, but there can be thousands of them. When a GPU starts thousands of threads, each thread knows its “number” and uses this number to figure out which part of the “puzzle” it needs to solve (by loading and storing the corresponding areas of memory). For example, to carry out a scalar product between two vectors, it is perfectly fine to start one GPU thread per pair of vector elements, with each thread multiplying just a single pair. However, this is quite unusual from the perspective of a software developer who has been programming CPUs all their life.
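A minimal CUDA sketch of this one-thread-per-pair style (the kernel name is mine, and the final summation is deliberately left on the host for brevity; a real implementation would also reduce on the GPU):

```
#include <cstdio>
#include <cuda_runtime.h>

// Each thread multiplies exactly one pair of elements; its "number"
// (computed from blockIdx/threadIdx) tells it which pair it owns.
__global__ void pairwiseProduct(const float* a, const float* b,
                                float* prod, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) prod[i] = a[i] * b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *prod;
    // Unified memory keeps the sketch short; explicit cudaMemcpy works too.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&prod, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    pairwiseProduct<<<(n + 255) / 256, 256>>>(a, b, prod, n);
    cudaDeviceSynchronize();

    double dot = 0.0;  // final reduction done on the host for simplicity
    for (int i = 0; i < n; ++i) dot += prod[i];
    printf("dot = %f\n", dot);

    cudaFree(a); cudaFree(b); cudaFree(prod);
    return 0;
}
```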
GPU designers make a number of trade-offs that are very different from CPU trade-offs (in terms of flow control, cache size and management, etc.), and these trade-offs are particularly well suited for parallelizable tasks. However, this does not make the GPU universally faster than the CPU. The GPU works well for massively parallel tasks such as matrix multiplication, but it can be quite inefficient for tasks where massive parallelization is impossible or difficult.
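To make the contrast concrete, here is a deliberately naive CUDA matrix-multiplication sketch: every output element is an independent piece of work, so an n-by-n result keeps n*n threads busy (a tuned implementation would use shared-memory tiling or a library such as cuBLAS):

```
// Naive matrix multiplication: one thread per output element, so an
// n x n result exposes n*n independent pieces of work to the GPU.
__global__ void matMul(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

// Launch (assuming dA, dB, dC are already allocated and filled on the device):
//   dim3 block(16, 16);
//   dim3 grid((n + 15) / 16, (n + 15) / 16);
//   matMul<<<grid, block>>>(dA, dB, dC, n);
```

A task with a long serial dependency chain offers no such decomposition, which is exactly when those thousands of slow cores sit idle.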
Given a large number of “data-hungry” cores, it is IMHO more important (than in the case of the CPU) to have high-bandwidth memory, while higher memory latency can be tolerated. Yet, due to the high cost of GPU memory, its amount is limited. Thus, the GPU often relies on external lower-bandwidth memory (such as CPU RAM) to fetch data. If we did not have CPU memory, loading data directly from a disk (even from an SSD) would slow down many GPU workloads quite substantially. In some cases, this problem can be alleviated by connecting GPUs with a fast interconnect (NVLink, InfiniBand), but this comes at extra cost and does not resolve all the issues caused by having only a very limited amount of memory.
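Schematically, the staging pattern looks like this (a hypothetical helper; stageToGpu is my name for it, and it assumes the data already sits in CPU RAM):

```
#include <cuda_runtime.h>

// Typical staging pattern: data lives in large, cheap CPU RAM and is
// copied over the (comparatively slow) bus into small, expensive GPU memory.
void stageToGpu(const float* host_data, size_t n) {
    float* dev_data = nullptr;
    cudaMalloc(&dev_data, n * sizeof(float));
    // This copy, not the computation, is often the bottleneck.
    cudaMemcpy(dev_data, host_data, n * sizeof(float),
               cudaMemcpyHostToDevice);
    // ... launch kernels that consume dev_data here ...
    cudaFree(dev_data);
}
```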
Some answers claim that all GPU cores can only do the same thing, but this is only partially correct. Cores in the same group (warp) do operate in lock-step: to process a branch, the GPU has to pause the cores in the warp that did not take the branch and reactivate them when the branch finishes. Different warps, however, can operate independently (e.g., execute different CUDA kernels).
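A tiny kernel-only sketch of such branch divergence (the branch on the thread index is artificial, chosen purely to split every warp):

```
// Threads of a warp execute in lock-step: when this branch splits a
// warp, the hardware masks off one side while the other side runs,
// then swaps, so both paths are serialized within that warp.
__global__ void divergent(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        out[i] = 1.0f;   // even lanes run while odd lanes wait...
    else
        out[i] = -1.0f;  // ...then odd lanes run while even lanes wait
}
```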
Furthermore, GPU cores are simpler than CPU cores primarily in terms of flow control. Yet, they are far from primitive and support a wide range of arithmetic operations (including fast lower-precision operations). Unlike the CPU, which manages its caches automatically, the GPU has fast shared memory, which is managed explicitly by the software developer (there is also a small L1 cache). Shared memory is essentially a manually managed cache.
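For example, a per-block summation can stage its slice of the input into shared memory and reduce it there (a sketch that assumes the kernel is launched with exactly 256 threads per block):

```
// Per-block summation using shared memory as a manually managed cache.
// Launch with 256 threads per block so blockDim.x matches the tile size.
__global__ void blockSum(const float* in, float* blockSums, int n) {
    __shared__ float tile[256];               // fast on-chip memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                          // staged data is now visible
    // Tree reduction entirely within shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) blockSums[blockIdx.x] = tile[0];
}
```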
Note that not all GPUs support recursive calls (and those that do seem to be quite restrictive about the recursion depth), and none of the GPUs that I know of support virtual memory. In particular, the current CUDA recursion depth seems to be limited to 24 (some of the related device limits can be queried and adjusted from host code, as sketched at the end of this post). GPUs do not have interrupts and lack support for communicating with external I/O devices. All these limitations make it difficult or impossible to use the GPU as the main processing unit running an operating system (see also the following paper for more details: GPUfs: the case for operating system services on GPUs. M. Silberstein, B. Ford, E. Witchel, 2014). I am convinced that future computation systems are going to be hybrid systems that combine low-latency, very generic processing units with high-throughput specialized units suitable for massively parallel tasks.
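As promised above, a small sketch of querying and raising the per-thread stack limit, which is what bounds device-side recursion in practice (a minimal example using the standard CUDA runtime API; the exact defaults vary by device and toolkit version):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Device-side recursion is bounded by the per-thread stack size;
    // there is no virtual memory to page a deep stack in and out.
    size_t stack_bytes = 0;
    cudaDeviceGetLimit(&stack_bytes, cudaLimitStackSize);
    printf("per-thread stack: %zu bytes\n", stack_bytes);

    // The limit can be raised (within hardware caps) for deeper recursion.
    cudaDeviceSetLimit(cudaLimitStackSize, 16 * 1024);
    cudaDeviceGetLimit(&stack_bytes, cudaLimitStackSize);
    printf("per-thread stack now: %zu bytes\n", stack_bytes);
    return 0;
}
```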