If you're reading this book, you've probably already heard about this thing called the Arm assembly language and know that understanding it is the key to analyzing binaries that run on Arm. But what is this language, and why does it exist? After all, programmers usually write code in high-level languages such as C/C++, and hardly anyone programs in assembly directly; high-level languages are simply far more convenient to write.
Unfortunately, these high-level languages are too complex for processors to interpret directly. Instead, programmers compile these high-level programs down into the binary machine code that the processor can run.
This machine code is not quite the same as assembly language. If you were to look at it directly in a text editor, it would look unintelligible. Processors also don't run assembly language; they run only machine code. So why, then, is assembly language so important in reverse engineering?
To understand the purpose of assembly, let's do a quick tour of the history of computing to see how we got to where we are and how everything connects.
Back in the mists of time when it all started, people decided to create computers and have them perform simple tasks. Computers don't speak our human languages (they are just electronic devices, after all), and so we needed a way to communicate with them electronically. At the lowest level, computers operate on electrical signals, and these signals are formed by switching electrical voltages between one of two levels: on and off.
The first problem is that we need a way to describe these "ons" and "offs" for communication, storage, and simply describing the state of the system. Since there are two states, it was only natural to use the binary system for encoding these values. Each binary digit (or bit) could be 0 or 1. Although each bit can store only the smallest amount of information possible, stringing multiple bits together allows representation of much larger numbers. For example, the number 30,284,334,537 could be represented in just 35 bits as the following:
11100001101000101100100010111001001
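This is easy to check with a few lines of C. The sketch below is only an illustration (the variable names are arbitrary); it prints the binary representation of 30,284,334,537 and reproduces the 35-bit pattern shown above.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t value = 30284334537ULL;  /* the example number from the text */
    int started = 0;

    /* Walk from the most significant bit down, skipping leading zeros. */
    for (int bit = 63; bit >= 0; bit--) {
        int b = (int)((value >> bit) & 1);
        if (b) started = 1;
        if (started) putchar('0' + b);
    }
    putchar('\n');  /* prints 11100001101000101100100010111001001 (35 bits) */
    return 0;
}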
Already this system allows for encoding large numbers, but now we have a new problem: where does one number in memory (or on a magnetic tape) end and the next one begin? This is perhaps a strange question to ask modern readers, but back when computers were first being designed, this was a serious problem. The simplest solution here would be to create fixed-size groupings of bits. Computer scientists, never wanting to miss out on a good naming pun, called this group of binary digits or bits a byte.
So, how many bits should be in a byte? This might seem like a blindingly obvious question to our modern ears, since we all know that a modern byte is 8 bits. But it was not always so.
Originally, different systems made different choices for how many bits would be in their bytes. The predecessor of the 8-bit byte we know today is the 6-bit Binary Coded Decimal Interchange Code (BCDIC) format for representing alphanumeric information used in early IBM computers, such as the IBM 1620 in 1959. Before that, bytes were often 4 bits long, and earlier still, a byte stood for an arbitrary number of bits greater than 1. Only later, with IBM's 8-bit Extended Binary Coded Decimal Interchange Code (EBCDIC), introduced in the 1960s with the System/360 mainframe product line and its byte-addressable memory of 8-bit bytes, did the byte start to standardize around 8 bits. This then led to the adoption of the 8-bit storage size in other widely used computer systems, including the Intel 8080 and Motorola 6800.
The following excerpt is from a book titled Planning a Computer System, published in 1962, listing three main reasons for adopting the 8-bit byte1:
An 8-bit byte can hold one of 256 distinct values, from 00000000 to 11111111. The interpretation of those values, of course, depends on the software using them. For example, we can treat a byte as an unsigned integer and represent numbers from 0 to 255 inclusive. We can also use the two's complement scheme to represent signed numbers from -128 to 127 inclusive.
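The same bit pattern can be read either way. The short C sketch below uses an arbitrarily chosen example byte, 0xB9, to show both interpretations.

#include <stdio.h>

int main(void) {
    unsigned char raw = 0xB9;                      /* example bit pattern: 1011 1001 */

    unsigned char as_unsigned = raw;               /* unsigned range 0..255          */
    signed char   as_signed   = (signed char)raw;  /* two's complement, -128..127    */

    printf("0x%02X as unsigned: %u\n", raw, as_unsigned);  /* prints 185 */
    printf("0x%02X as signed:   %d\n", raw, as_signed);    /* prints -71 on
                                                               two's complement machines */
    return 0;
}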
Of course, computers didn't just use bytes for encoding and processing integers. They would also often store and process human-readable letters and numbers, called characters.
Early character encodings, such as ASCII, settled on using 7 bits per character, but this gave only a limited set of 128 possible characters. This allowed for encoding English-language letters and digits, as well as a few symbol and control characters, but could not represent many of the letters used in other languages. The EBCDIC standard, using its 8-bit bytes, chose a different character set entirely, with code pages for "swapping" to different languages. But ultimately this character set was too cumbersome and inflexible.
Over time, it became clear that we needed a truly universal character set, supporting all the world's living languages and special symbols. This culminated in the creation of the Unicode project in 1987. A few different Unicode encodings exist, but the dominant encoding used on the Web is UTF-8. Characters within the ASCII character set are included verbatim in UTF-8, and "extended characters" can spread out over multiple consecutive bytes.
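The C sketch below illustrates this difference; it hard-codes the UTF-8 bytes of the euro sign so it does not depend on the compiler's source encoding, and simply prints the bytes behind an ASCII character and behind an extended character.

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *ascii    = "A";               /* plain ASCII: one byte       */
    const char *extended = "\xE2\x82\xAC";    /* U+20AC (euro sign) in UTF-8 */

    printf("'A' is %zu byte:", strlen(ascii));
    for (size_t i = 0; i < strlen(ascii); i++)
        printf(" %02X", (unsigned char)ascii[i]);       /* 41 */
    printf("\n");

    printf("euro sign is %zu bytes:", strlen(extended));
    for (size_t i = 0; i < strlen(extended); i++)
        printf(" %02X", (unsigned char)extended[i]);    /* E2 82 AC */
    printf("\n");
    return 0;
}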
Since characters are now encoded as bytes, we can represent each character using two hexadecimal digits. For example, the characters A, R, and M are normally encoded with the octets shown in Figure 1.1.
Figure 1.1: Letters A, R, and M and their hexadecimal values
Each hexadecimal digit can be encoded with a 4-bit pattern ranging from 0000 to 1111, as shown in Figure 1.2.
Figure 1.2: Hexadecimal ASCII values and their 8-bit binary equivalents
Since two hexadecimal values are required to encode an ASCII character, 8 bits seemed like the ideal size for storing text in most written languages around the world, or a multiple of 8 bits for characters that cannot be represented in 8 bits alone.
Using this pattern, we can more easily interpret the meaning of a long string of bits. The following bit pattern encodes the word Arm:
0100 0001 0101 0010 0100 1101
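Reading this pattern back is just a matter of splitting it into bytes and looking up each value, as in the following minimal C sketch:

#include <stdio.h>

int main(void) {
    /* The 24-bit pattern above, split into three bytes. */
    unsigned char bytes[] = { 0x41, 0x52, 0x4D };   /* 0100 0001, 0101 0010, 0100 1101 */

    for (int i = 0; i < 3; i++)
        printf("%02X -> %c\n", bytes[i], bytes[i]);  /* 41 -> A, 52 -> R, 4D -> M */
    return 0;
}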
One uniquely powerful aspect of computers, as opposed to the mechanical calculators that predated them, is that they can also encode their logic as data. This code can also be stored in memory or on disk and be processed or changed on demand. For example, a software update can completely change the operating system of a computer without the need to purchase a new machine.
We've already seen how numbers and characters are encoded, but how is this logic encoded? This is where the processor architecture and its instruction set come into play.
If we were to create our own computer processor from scratch, we could design our own instruction encoding, mapping binary patterns to machine codes that our processor can interpret and respond to, in effect, creating our own "machine language." Since machine codes are meant to "instruct" the circuitry to perform an "operation," these machine codes are also referred to as instruction codes, or, more commonly, operation codes (opcodes).
In practice, most people use existing computer processors and therefore use the instruction encodings defined by the processor manufacturer. On Arm, instruction encodings have a fixed size and can be either 32-bit or 16-bit, depending on the instruction set in use by the program. The processor fetches and interprets each instruction and runs each in turn to perform the logic of the program. Each instruction is a binary pattern, or instruction encoding, which follows specific rules defined by the Arm architecture.
By way of example, let's assume we're building a tiny 16-bit instruction set and are defining how each instruction will look. Our first task is to designate part of the encoding as specifying exactly what type of instruction is to be run, called the opcode. For example, we might set the first 7 bits of the instruction to be an opcode and specify the opcodes for addition and subtraction, as shown in Table 1.1.
Table 1.1: Addition and Subtraction Opcodes
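To make the idea concrete, the C sketch below packs and unpacks such a 16-bit instruction. The field layout and opcode values here are invented purely for illustration; they are not the encodings from Table 1.1, nor those of any real Arm instruction set.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical layout for the toy 16-bit instruction:
   bits 15..9 hold the 7-bit opcode, bits 8..0 are left for operands.
   The opcode values below are made up for this example. */
#define OP_ADD 0x01
#define OP_SUB 0x02

static uint16_t encode(uint8_t opcode, uint16_t operands) {
    return (uint16_t)(((opcode & 0x7F) << 9) | (operands & 0x1FF));
}

static uint8_t decode_opcode(uint16_t instruction) {
    return (uint8_t)((instruction >> 9) & 0x7F);
}

int main(void) {
    uint16_t instruction = encode(OP_ADD, 0x005);
    printf("encoded instruction: 0x%04X\n", instruction);               /* 0x0205 */
    printf("decoded opcode:      0x%02X\n", decode_opcode(instruction)); /* 0x01  */
    return 0;
}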
Writing machine code by hand is possible...