CS 111

Scribe Notes for 10/5/2010

by Kathryn Baumgarten

Last time we ended with: modularity.

How to get hard modularity:
(function calls don't give hard modularity)

  1. Run each module on a separate computer (both modules are protected from each other).
    This is more appropriate today with cheaper computers; also, you can get a performance boost from parallelism *if* you play your cards right.
    Example: Client-Service organization.

    Client Module Code (packaged in a subroutine; client calls factorial(5);):
        send(factorial, {"!", 5});      // (server, {command, args})
        receive(factorial, response);
        if (response.opcode == "ok")
            print(response.val);
        else
            print("error");


    Server Module Code:
        while (1) {
            receive(factorial, request);
            if (request.opcode == "!") {
                int n = request.val;
                int p = 1;
                for (int i = 1; i <= n; i++)
                    p *= i;
                response = {"ok", p};
            } else
                response = {"no_good"};
            send(factorial, response);
        }


    Note: There is no parallelism here, because the client just sits and waits rather than running another process.

    Pros:
    • Limited error propagation (the client and service don't share state).
    • The client can keep going if the server crashes, as long as it sets a timeout to stop waiting on the server after too long.
    • Either side can continue running (serving other requests) if the other goes into an infinite loop.

    Cons:
    • Slower: there is latency between the modules.
    • Authentication issues are present on a network (especially on the Internet).
    • Cost: need more computers in order to run each module on its own physical computer.
    • Transmission errors.
    • Note: QNX does this (there is a catch as to why this isn't bad).


  2. Extend the machine instruction set to contain new instructions, one for each server operation.
    (Protects the server module but not the client module.)
    This is impractical because it is not scalable, and it involves the "hardware guys".

Virtualization

Therefore, we need to build a virtual machine (built on top of a physical machine).

Process: Write an x86 simulator (in C or any other language), and then add to this emulator a "factl" (factorial long) program/instruction. Then run the program that needed "factl" on the emulator. The client sits inside of the program, and the server sits inside of the emulator. The client is never really in control, so it can't cause errors in the server.

This gives hard modularity, as it protects the server from error propagation.

However, it is also slow: the physical machine has to execute multiple real instructions for each instruction the emulated program executes, by a factor of approximately 10 real instructions per emulated instruction.

Virtualizable Processor
We can emulate machine x on machine x (if x is virtualizable). (Example: we can emulate an x86 processor on an x86 processor.) This allows emulated programs to run at full hardware speed, as long as they're "nice". The hardware passes control to the emulator from the emulated program, if the emulated program tries to execute a privileged instruction. The assumption made here is that the vast majority of executed instructions are unprivileged.

Denial of Service attacks: the emulated program does nothing but call halt. While the emulator doesn't actually halt, it does have to handle each halt system call, which causes an interrupt, and regular programs can't continue to run (thus service is denied). To protect against this, a limit is placed on the number of system calls an emulated program can make. Additionally, a hardware timer fires an interrupt once every 10 milliseconds; each timer interrupt returns control to the emulator so it can make sure no emulated program is using up all the resources.

To create this division between privileged and unprivileged instructions, instructions to modify the emulator are stored in memory, and the memory is partitioned into accessible/inaccessible (unprotected/protected) regions. This gives increased speed at the cost of complexity.

Using a virtualizable processor to extend the instruction set:
addl
mull
halt                 // emulator: define 'halt' to return the factorial of the next byte
5                    // must assume that the emulator supports this weird instruction convention

movl $5, %eax
leal 1(%eax), %ebx   // compute factorial from %eax
jmp .                // jump to self - when the timer fires, the emulator computes the factorial
// ** must trust the code behind the emulator to do this **


Linux uses int [byte {0-255}] to generate an interrupt in a virtual processor; executing it is how unprivileged code transfers control into privileged code. The processor looks into a table of pointers, the interrupt address table, and calls the function pointed to by that byte's entry in the table. (Example: byte = 0x80 in Linux, for a system call.)

This will push onto the kernel's stack: the emulated program's stack segment (ss), stack pointer (esp), eflags, code segment (cs), instruction pointer (eip), and an error code (what the problem was that caused the interrupt). Then it sets eip to the value at entry 0x80 of the interrupt address table. The emulator can now do what it likes, including executing privileged instructions. When it is done, the reti instruction restores the emulated program's state and returns from the interrupt.

Caution: if the interrupt address table is not put into protected memory, an attacker can modify the table to have a system call run their own code.

The "I am privileged" register exists to record whether the current code can execute privileged instructions. When it is set to 1, we are in kernel mode and can execute any instruction we want; when it is set to 0, we can execute only unprivileged instructions. The instruction that sets and clears this bit is itself privileged - taking an interrupt sets it to 1. (The default value when the computer boots is 1, since the operating system needs to run its setup code; it is changed to 0 when the OS hands control to applications.)

In Linux, a system call is made with the system call number stored in %eax, and up to 6 arguments in %ebx, %ecx, %edx, %esi, %edi, and %ebp. (If there are more than 6 arguments, you're doing it wrong!)
This gives protected transfer of control at the cost of some speed. It uses option (2) from above (extend the machine instruction set). However, we can also use option (1) to get protected transfer of control between client and server (example: QNX).

That is, QNX uses virtualizable processors to implement message passing between modules on separate physical/virtual computers, while Linux uses virtual processors to implement an extended instruction set through emulation.

Layering
Building a more useful abstraction from a lower level abstraction.

Hardware instructions abstract the hardware to the kernel; the kernel abstracts hardware instructions to applications through system calls. Can we nest this (i.e., put layers on top of the kernel before applications)?

On x86, yes: the hardware provides 4 privilege levels (rings 0-3). Linux uses only ring 0 (kernel) and ring 3 (applications).

Considerations for Layering/Abstraction
  1. Access to instructions:
    -Privileged code can access all instructions.
    -Unprivileged code can only access unprivileged instructions.

  2. Access to registers:
    In addition to privileged instructions, we have privileged registers, which control access to the system. These are rarely used because of performance impact. (Normal registers can be accessed at full speed by any code, regardless of its privilege level.)

    Example: %cpl (current privilege level): 0 through 3 on x86. Because writing this register would allow an application to raise its privilege level to the kernel's, it is a privileged register.

    Virtualization of registers:
    An example (in Linux) where virtual registers would be necessary is
    cmd_line$ cat | dog    // two commands run simultaneously, but we have only one register set; therefore, we need to virtualize the register set
    Process descriptor: this describes each process, with a process id, a copy of the stack, and a copy of the virtual registers (there is a process descriptor for cat, for dog, etc.). While a process is running, the virtual registers in its descriptor hold stale junk data, because the real registers are used to avoid a performance hit. However, interrupts occur every 10 ms, and each one is a good time to give another process control; at that point the current register values are stored into the virtual registers to save the state of the process being suspended.
    Note: The more registers you have, the slower context-switching gets. The kernel only needs to save and restore the registers it uses (for example, get_pid is cheap to run in an interrupt). Similarly, generally the floating point registers don't need to be saved/restored (unless they are being used).

  3. Access to primary memory:
    Both process 1 and process 2 see the same virtual address, 0xffffff, for the base of the stack (the initial %esp). This address is mapped to different addresses in physical memory.
    Virtual Memory
    -----------
    |   text  |
    -----------
    |   data  |
    -----------
    |   heap  |
    |     |   |
    |     V   |
    -----------
    |    ^    |
    |    |    |
    |  stack  |
    ----------- <-- 0xffffff
                    (process 1 and process 2 see this)
    			
    Physical Memory
    ------------
    |   text1  |
    ------------
    |   data1  |
    ------------
    |   heap1  |
    |     |    |
    |     V    |
    ------------
    |    ^     |
    |    |     |
    |  stack1  |
    ------------ <-- 0xee33ac
    |          |
    |   ...    | 
    |          |
    ------------
    |   text2  |
    ------------
    |   data2  |
    ------------
    |   heap2  |
    |     |    |
    |     V    |
    ------------
    |    ^     |
    |    |     |
    |  stack2  |
    ------------ <-- 0x44dacc 
    			
  4. Access to devices:
    There is generally no hardware support for giving applications direct device access, because it is needed so rarely. Therefore, device access typically goes through system calls (which adds some small overhead).

    The exception is graphics devices/output (common, not universal, to have hardware support).
    *The VMMU (Virtual Memory Management Unit) can map to the graphics card's memory, too.

    There are different kinds of devices:
    -Request response devices (e.g. disk drive, graphics card)
    -Spontaneous data generation devices (e.g. keyboard, network card)

    Unix's Big Idea:
    Use a single abstraction for all devices (open, read, write, close, lseek*).
    In practice, devices are a messy area, so extras are needed. An example is the system call ioctl, which is used when the abstraction can't express the needed I/O.

Try this: $ cat /proc/cpuinfo