What are the capabilities of GPT-4?

GPT-4 possesses a variety of capabilities beyond the previous state of the art in the field of artificial intelligence. Although it is primarily known for the breadth of abilities it developed during training, GPT-4 also surpasses many domain-specific models in their own domains of expertise. This is especially impressive since GPT-4 was trained on a very general task (predicting the next textual token) and has not received any training tailored to these specific tasks.

Information about GPT-4's capabilities comes from the “Sparks of Artificial General Intelligence” paper, as well as from the initial technical report by OpenAI.

The most notable capabilities of GPT-4 include:

  • Logical reasoning. GPT-4 achieves very high results on a range of multiple-choice and free-form answer tests in STEM and the humanities, as well as on tests of logical reasoning. For example, on the MMLU benchmark, GPT-4 has an accuracy of 86%, compared with the 25% expected by chance and the 75% that was the state of the art in AI before GPT-4's release. GPT-4 has not been trained on these questions specifically, so it cannot be memorizing answers. Here are two examples of questions that are typical of the MMLU benchmark:

  • Competence at exams. On many exams, GPT-4 achieves results comparable to humans. For example, on a range of Advanced Placement tests (college-level exams taken by high-school students in the US), GPT-4 scores well above 80%. One notable exam on the list is the Uniform Bar Exam - an exam used in the majority of US states to certify lawyers - on which GPT-4 scored in the 90th percentile.

    • It is important to contextualize these numbers. While OpenAI attempted to simulate exam conditions as closely as possible, the question remains what kind of capability is demonstrated by GPT-4 achieving a passing grade. For example, the bar exam in California has long been criticized for relying too heavily on memorizing specific rules and precedents, as opposed to the more practical skills lawyers use in their work. California does not use the UBE - the bar exam GPT-4 was tested on - but an argument can be made that GPT-4's high performance on these exams is largely due to memorizing a large number of factoids, rather than the more general competence in the subject that we hope to see in exam takers. Nonetheless, together with GPT-4's performance on a range of logical benchmarks, it seems clear that it is capable of solving a range of technical and logical problems.

  • Understanding of multiple languages. Although GPT-4 was trained primarily on English, its training data nonetheless contained some text in other languages, and GPT-4 has acquired the ability to understand them. When prompted in different languages, GPT-4 retains its high capabilities (though prompting in English does produce better outputs), even in cases where the total amount of text in a given language in its training data is very small and would not, by itself, be enough to train a model to anywhere near a comparable level. The high consistency of GPT-4's abilities across many different languages demonstrates that its reasoning ability is separate from the English language itself.

  • Translation. Because GPT-4 understands multiple languages, it can be used to translate from one language to another.

  • Visual reasoning. GPT-4 can reason logically about visual images. This ability is present in all versions of GPT-4, but is much stronger in the version of GPT-4 augmented with the ability to tokenize images directly. An example of this ability can be found below:

  • Rudimentary drawing ability. Because some image formats (such as SVG and TikZ) are text-based and happened to be included in GPT-4's training data, GPT-4 has acquired a rudimentary ability to draw. It must be emphasized that GPT-4 acquired this ability without any intention on the part of its developers.

  • Broad, encyclopedic knowledge. As mentioned above, GPT-4 possesses broad knowledge on a large variety of topics. In fact, for most trivia questions, one can bet that GPT-4 will produce an accurate answer (but see below about hallucinations).

  • Summarization. GPT-4 possesses a much larger context window than previous GPT models. Because of this, entire texts can be entered directly into the context window, with a request for GPT-4 to summarize the text or otherwise extract some critical information from it. GPT-4 manages this sort of task with high competence.

  • Programming. GPT-4 is a passable programmer and performs comparably to humans on a suite of tasks covering a variety of programming challenges. For example, on a list of tasks from the LeetCode website covering easy, medium, and hard difficulties, GPT-4 achieves pass rates of 68.2%, 40%, and 10.7%, compared to human averages of 72.2%, 37.7%, and 7%, respectively. GPT-4 also achieves good results on tasks like code comprehension, reverse engineering, direct code execution, and so on.

  • Mathematical ability. GPT-4 possesses substantial mathematical ability, though not yet sufficient to conduct independent mathematical research.
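The drawing ability mentioned above rests on the fact that an SVG image is nothing but plain text. As a minimal illustration (a hypothetical snippet, not taken from either paper), the following Python code assembles a complete SVG drawing as an ordinary string - exactly the kind of text a next-token predictor can emit:

```python
# Illustrative only: an SVG "image" is plain text markup, so a model
# trained to predict text tokens can, in principle, produce drawings.
def simple_svg(width=100, height=100):
    """Return the text of an SVG file drawing a circle inside a square."""
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
        f'<rect x="0" y="0" width="{width}" height="{height}" '
        f'fill="none" stroke="black"/>'
        f'<circle cx="{width // 2}" cy="{height // 2}" '
        f'r="{min(width, height) // 3}" fill="skyblue"/>'
        f"</svg>"
    )

svg_text = simple_svg()
# Saving svg_text to a .svg file yields a viewable image.
```

A model that emits such a string token by token is "drawing" without ever having been shown pixels.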
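To give a sense of the difficulty tiers behind the LeetCode pass rates quoted above, here is a classic easy-tier problem ("two sum"), shown as a generic illustration rather than one of the tasks GPT-4 was actually evaluated on:

```python
def two_sum(nums, target):
    """Return indices of the two numbers in nums that add up to target.

    A classic easy-difficulty LeetCode-style problem, roughly the level
    at which GPT-4 passes about two thirds of the time.
    """
    seen = {}  # value -> index of values visited so far
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []  # no pair sums to target

print(two_sum([2, 7, 11, 15], 9))  # → [0, 1]
```

Medium and hard tiers add algorithmic subtlety (dynamic programming, tricky edge cases), which is where GPT-4's pass rate drops sharply.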

Despite this broad range of capabilities, it would be wrong not to bring up GPT-4's glaring weaknesses.

  • Hallucination is the term for a large language model such as GPT-4 making up false information in response to a factual question. For example, when asked to cite precedents for a legal case, GPT-4 might invent entirely fictitious cases. Because GPT-4 is frequently anthropomorphized, especially due to the peculiarities of the ChatGPT interface, it would not be a gross mistake to call this behavior lying: GPT-4 will confidently assert falsehoods in answer to factual questions, with no provocation and no way for the reader to distinguish the lies from normal, factual information. The frequency of these “hallucinations” depends on the specifics of the question, but is generally around 20-40%.

  • Consistency. For a variety of reasons, GPT-4 has trouble staying on a particular topic for a prolonged period of time. This significantly impacts attempts to use GPT-4 as part of an autonomous system, as GPT-4's “thinking” tends to drift away from the topic fairly quickly.

  • Creativity. This last point is based not on testing by OpenAI, but on the anecdotal experience of several writers I know who have attempted to use GPT-4 for creative writing. Overall, GPT-4's suggestions remain “tropey” and not innovative, though whether this problem can be resolved with a more careful choice of prompts remains to be seen.

Despite some crucial limitations, GPT-4 appears to have a very broad and deep set of capabilities, which makes this line of models a crucial topic of research for the AI safety community.