CSRG TR/5 -- September 1, 1982 -- Joy, et. al.

4.2BSD System Manual

Revised July, 1983

William Joy, Eric Cooper, Robert Fabry,
Samuel Leffler, Kirk McKusick and David Mosher

Computer Systems Research Group
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, CA 94720

(415) 642-7780

ABSTRACT

This document summarizes the facilities provided by the 4.2BSD version of the UNIX operating system. It does not attempt to act as a tutorial for use of the system nor does it attempt to explain or justify the design of the system facilities. It gives neither motivation nor implementation details, in favor of brevity.

The first section describes the basic kernel functions provided to a UNIX process: process naming and protection, memory management, software interrupts, object references (descriptors), time and statistics functions, and resource controls. These facilities, as well as facilities for bootstrap, shutdown and process accounting, are provided solely by the kernel.

The second section describes the standard system abstractions for files and file systems, communication, terminal handling, and process control and debugging. These facilities are implemented by the operating system or by network server processes.

† UNIX is a trademark of Bell Laboratories.

Introduction.
0. Notation and types
1. Kernel primitives
1.1. Processes and protection: .1. Host and process identifiers; .2. Process creation and termination; .3. User and group ids; .4. Process groups
1.2. Memory management: .1. Text, data and stack; .2. Mapping pages; .3. Page protection control; .4. Giving and getting advice
1.3. Signals: .1. Overview; .2. Signal types; .3. Signal handlers; .4. Sending signals; .5. Protecting critical sections; .6. Signal stacks
1.4. Timing and statistics: .1. Real time; .2. Interval time
1.5. Descriptors: .1. The reference table; .2. Descriptor properties; .3. Managing descriptor references; .4. Multiplexing requests; .5. Descriptor wrapping
1.6. Resource controls: .1. Process priorities; .2. Resource utilization; .3. Resource limits
1.7. System operation support: .1. Bootstrap operations; .2. Shutdown operations; .3. Accounting
2. System facilities
2.1. Generic operations: .1. Read and write; .2. Input/output control; .3. Non-blocking and asynchronous operations
2.2. File system: .1 Overview; .2. Naming; .3. Creation and removal; .3.1. Directory creation and removal; .3.2. File creation; .3.3. Creating references to devices; .3.4. Portal creation; .3.6. File, device, and portal removal; .4. Reading and modifying file attributes; .5. Links and renaming; .6. Extension and truncation; .7. Checking accessibility; .8. Locking; .9. Disc quotas
2.3. Inteprocess communication: .1. Interprocess communication primitives; .1.1. Communication domains; .1.2. Socket types and protocols; .1.3. Socket creation, naming and service establishment; .1.4. Accepting connections; .1.5. Making connections; .1.6. Sending and receiving data; .1.7. Scatter/gather and exchanging access rights; .1.8. Using read and write with sockets; .1.9. Shutting down halves of full-duplex connections; .1.10. Socket and protocol options; .2. UNIX domain; .2.1. Types of sockets; .2.2. Naming; .2.3. Access rights transmission; .3. INTERNET domain; .3.1. Socket types and protocols; .3.2. Socket naming; .3.3. Access rights transmission; .3.4. Raw access
2.4. Terminals and devices: .1. Terminals; .1.1. Terminal input; .1.1.1 Input modes; .1.1.2 Interrupt characters; .1.1.3 Line editing; .1.2. Terminal output; .1.3. Terminal control operations; .1.4. Terminal hardware support; .2. Structured devices; .3. Unstructured devices
2.5. Process control and debugging
I. Summary of facilities

1. Notation and types

The notation used to describe system calls is a variant of a C language call, consisting of a prototype call followed by declaration of parameters and results. An additional keyword result, not part of the normal C language, is used to indicate which of the declared entities receive results. As an example, consider the read call, as described in section 2.1:


cc = read(fd, buf, nbytes);
result int cc; int fd; result char *buf; int nbytes;

The first line shows how the read routine is called, with three parameters. As shown on the second line cc is an integer and read also returns information in the parameter buf.

Description of all error conditions arising from each system call is not provided here; they appear in the programmer's manual. In particular, when accessed from the C language, many calls return a characteristic -1 value when an error occurs, returning the error code in the global variable errno. Other languages may present errors in different ways.

A number of system standard types are defined in the include file <sys/types.h> and used in the specifications here and in many C programs. These include caddr_t giving a memory address (typically as a character pointer), off_t giving a file offset (typically as a long integer), and a set of unsigned types u_char, u_short, u_int and u_long, shorthand names for unsigned char, unsigned short, etc.

2. Kernel primitives

The facilities available to a UNIX user process are logically divided into two parts: kernel facilities directly implemented by UNIX code running in the operating system, and system facilities implemented either by the system, or in cooperation with a server process. These kernel facilities are described in this section 1.

The facilities implemented in the kernel are those which define the UNIX virtual machine which each process runs in. Like many real machines, this virtual machine has memory management hardware, an interrupt facility, timers and counters. The UNIX virtual machine also allows access to files and other objects through a set of descriptors. Each descriptor resembles a device controller, and supports a set of operations. Like devices on real machines, some of which are internal to the machine and some of which are external, parts of the descriptor machinery are built-in to the operating system, while other parts are often implemented in server processes on other machines. The facilities provided through the descriptor machinery are described in section 2.

2.1. Processes and protection

2.1.1. Host and process identifiers

Each UNIX host has associated with it a 32-bit host id, and a host name of up to 255 characters. These are set (by a privileged user) and returned by the calls:


sethostid(hostid)
long hostid;

hostid = gethostid();
result long hostid;

sethostname(name, len)
char *name; int len;

len = gethostname(buf, buflen)
result int len; result char *buf; int buflen;

On each host runs a set of processes. Each process is largely independent of other processes, having its own protection domain, address space, timers, and an independent set of references to system or user implemented objects.

Each process in a host is named by an integer called the process id. This number is in the range 1-30000 and is returned by the getpid routine:


pid = getpid();
result int pid;

On each UNIX host this identifier is guaranteed to be unique; in a multi-host environment, the (hostid, process id) pairs are guaranteed unique.

2.1.2. Process creation and termination

A new process is created by making a logical duplicate of an existing process:


pid = fork();
result int pid;

The fork call returns twice, once in the parent process, where pid is the process identifier of the child, and once in the child process where pid is 0. The parent-child relationship induces a hierarchical structure on the set of processes in the system.

A process may terminate by executing an exit call:


exit(status)
int status;

returning 8 bits of exit status to its parent.

When a child process exits or terminates abnormally, the parent process receives information about any event which caused termination of the child process. A second call provides a non-blocking interface and may also be used to retrieve information about resources consumed by the process during its lifetime.


#include <sys/wait.h>

pid = wait(astatus);
result int pid; result union wait *astatus;

pid = wait3(astatus, options, arusage);
result int pid; result union waitstatus *astatus;
int options; result struct rusage *arusage;

A process can overlay itself with the memory image of another process, passing the newly created process a set of parameters, using the call:


execve(name, argv, envp)
char *name, **argv, **envp;

The specified name must be a file which is in a format recognized by the system, either a binary executable file or a file which causes the execution of a specified interpreter program to process its contents.

2.1.3. User and group ids

Each process in the system has associated with it two user-id's: a real user id and a effective user id, both non-negative 16 bit integers. Each process has an real accounting group id and an effective accounting group id and a set of access group id's. The group id's are non-negative 16 bit integers. Each process may be in several different access groups, with the maximum concurrent number of access groups a system compilation parameter, the constant NGROUPS in the file <sys/param.h>, guaranteed to be at least 8.

The real and effective user ids associated with a process are returned by:


ruid = getuid();
result int ruid;

euid = geteuid();
result int euid;

the real and effective accounting group ids by:


rgid = getgid();
result int rgid;

egid = getegid();
result int egid;

and the access group id set is returned by a getgroups call:


ngroups = getgroups(gidsetsize, gidset);
result int ngroups; int gidsetsize; result int gidset[gidsetsize];

The user and group id's are assigned at login time using the setreuid, setregid, and setgroups calls:


setreuid(ruid, euid);
int ruid, euid;

setregid(rgid, egid);
int rgid, egid;

setgroups(gidsetsize, gidset)
int gidsetsize; int gidset[gidsetsize];

The setreuid call sets both the real and effective user-id's, while the setregid call sets both the real and effective accounting group id's. Unless the caller is the super-user, ruid must be equal to either the current real or effective user-id, and rgid equal to either the current real or effective accounting group id. The setgroups call is restricted to the super-user.

2.1.4. Process groups

Each process in the system is also normally associated with a process group. The group of processes in a process group is sometimes referred to as a job and manipulated by high-level system software (such as the shell). The current process group of a process is returned by the getpgrp call:


pgrp = getpgrp(pid);
result int pgrp; int pid;

When a process is in a specific process group it may receive software interrupts affecting the group, causing the group to suspend or resume execution or to be interrupted or terminated. In particular, a system terminal has a process group and only processes which are in the process group of the terminal may read from the terminal, allowing arbitration of terminals among several different jobs.

The process group associated with a process may be changed by the setpgrp call:


setpgrp(pid, pgrp);
int pid, pgrp;

Newly created processes are assigned process id's distinct from all processes and process groups, and the same process group as their parent. A normal (unprivileged) process may set its process group equal to its process id. A privileged process may set the process group of any process to any value.

2.2. Memory management ^†

† This section represents the interface planned for later releases of the system. Of the calls described in this section, only sbrk and getpagesize are included in 4.2BSD.

2.2.1. Text, data and stack

Each process begins execution with three logical areas of memory called text, data and stack. The text area is read-only and shared, while the data and stack areas are private to the process. Both the data and stack areas may be extended and contracted on program request. The call


addr = sbrk(incr);
result caddr_t addr; int incr;

changes the size of the data area by incr bytes and returns the new end of the data area, while


addr = sstk(incr);
result caddr_t addr; int incr;

changes the size of the stack area. The stack area is also automatically extended as needed. On the VAX the text and data areas are adjacent in the P0 region, while the stack section is in the P1 region, and grows downward.

2.2.2. Mapping pages

The system supports sharing of data between processes by allowing pages to be mapped into memory. These mapped pages may be shared with other processes or private to the process. Protection and sharing options are defined in <mman.h> as:


/* protections are chosen from these bits, or-ed together */
#define	PROT_READ	0x4	/* pages can be read */
#define	PROT_WRITE	0x2	/* pages can be written */
#define	PROT_EXEC	0x1	/* pages can be executed */

/* sharing types; choose either SHARED or PRIVATE */
#define	MAP_SHARED	1	/* share changes */
#define	MAP_PRIVATE	2	/* changes are private */

The cpu-dependent size of a page is returned by the getpagesize system call:


pagesize = getpagesize();
result int pagesize;

The call:


mmap(addr, len, prot, share, fd, pos);
caddr_t addr; int len, prot, share, fd; off_t pos;

causes the pages starting at addr and continuing for len bytes to be mapped from the object represented by descriptor fd, at absolute position pos. The parameter share specifies whether modifications made to this mapped copy of the page, are to be kept private, or are to be shared with other references. The parameter prot specifies the accessibility of the mapped pages. The addr, len, and pos parameters must all be multiples of the pagesize.

A process can move pages within its own memory by using the mremap call:


mremap(addr, len, prot, share, fromaddr);
caddr_t addr; int len, prot, share; caddr_t fromaddr;

This call maps the pages starting at fromaddr to the address specified by addr.

A mapping can be removed by the call


munmap(addr, len);
caddr_t addr; int len;

This causes further references to these pages to refer to private pages initialized to zero.

2.2.3. Page protection control

A process can control the protection of pages using the call


mprotect(addr, len, prot);
caddr_t addr; int len, prot;

This call changes the specified pages to have protection prot.

2.2.4. Giving and getting advice

A process that has knowledge of its memory behavior may use the madvise call:


madvise(addr, len, behav);
caddr_t addr; int len, behav;

Behav describes expected behavior, as given in <mman.h>:


#define	MADV_NORMAL	0	/* no further special treatment */
#define	MADV_RANDOM	1	/* expect random page references */
#define	MADV_SEQUENTIAL	2	/* expect sequential references */
#define	MADV_WILLNEED	3	/* will need these pages */
#define	MADV_DONTNEED	4	/* don't need these pages */

Finally, a process may obtain information about whether pages are core resident by using the call


mincore(addr, len, vec)
caddr_t addr; int len; result char *vec;

Here the current core residency of the pages is returned in the character array vec, with a value of 1 meaning that the page is in-core.

2.3. Signals

2.3.1. Overview

The system defines a set of signals that may be delivered to a process. Signal delivery resembles the occurrence of a hardware interrupt: the signal is blocked from further occurrence, the current process context is saved, and a new one is built. A process may specify the handler to which a signal is delivered, or specify that the signal is to be blocked or ignored. A process may also specify that a default action is to be taken when signals occur.

Some signals will cause a process to exit when they are not caught. This may be accompanied by creation of a core image file, containing the current memory image of the process for use in post-mortem debugging. A process may choose to have signals delivered on a special stack, so that sophisticated software stack manipulations are possible.

All signals have the same priority. If multiple signals are pending simultaneously, the order in which they are delivered to a process is implementation specific. Signal routines execute with the signal that caused their invocation blocked, but other signals may yet occur. Mechanisms are provided whereby critical sections of code may protect themselves against the occurrence of specified signals.

2.3.2. Signal types

The signals defined by the system fall into one of five classes: hardware conditions, software conditions, input/output notification, process control, or resource control. The set of signals is defined in the file .

Hardware signals are derived from exceptional conditions which may occur during execution. Such signals include SIGFPE representing floating point and other arithmetic exceptions, SIGILL for illegal instruction execution, SIGSEGV for addresses outside the currently assigned area of memory, and SIGBUS for accesses that violate memory protection constraints. Other, more cpu-specific hardware signals exist, such as those for the various customer-reserved instructions on the VAX (SIGIOT, SIGEMT, and SIGTRAP).

Software signals reflect interrupts generated by user request: SIGINT for the normal interrupt signal; SIGQUIT for the more powerful quit signal, that normally causes a core image to be generated; SIGHUP and SIGTERM that cause graceful process termination, either because a user has “hung up”, or by user or program request; and SIGKILL, a more powerful termination signal which a process cannot catch or ignore. Other software signals (SIGALRM, SIGVTALRM, SIGPROF) indicate the expiration of interval timers.

A process can request notification via a SIGIO signal when input or output is possible on a descriptor, or when a non-blocking operation completes. A process may request to receive a SIGURG signal when an urgent condition arises.

A process may be stopped by a signal sent to it or the members of its process group. The SIGSTOP signal is a powerful stop signal, because it cannot be caught. Other stop signals SIGTSTP, SIGTTIN, and SIGTTOU are used when a user request, input request, or output request respectively is the reason the process is being stopped. A SIGCONT signal is sent to a process when it is continued from a stopped state. Processes may receive notification with a SIGCHLD signal when a child process changes state, either by stopping or by terminating.

Exceeding resource limits may cause signals to be generated. SIGXCPU occurs when a process nears its CPU time limit and SIGXFSZ warns that the limit on file size creation has been reached.

2.3.3. Signal handlers

A process has a handler associated with each signal that controls the way the signal is delivered. The call


#include <signal.h>

struct sigvec {
	int	(*sv_handler)();
	int	sv_mask;
	int	sv_onstack;
};

sigvec(signo, sv, osv)
int signo; struct sigvec *sv; result struct sigvec *osv;

assigns interrupt handler address sv_handler to signal signo. Each handler address specifies either an interrupt routine for the signal, that the signal is to be ignored, or that a default action (usually process termination) is to occur if the signal occurs. The constants SIG_IGN and SIG_DEF used as values for sv_handler cause ignoring or defaulting of a condition. The sv_mask and sv_onstack values specify the signal mask to be used when the handler is invoked and whether the handler should operate on the normal run-time stack or a special signal stack (see below). If osv is non-zero, the previous signal vector is returned.

When a signal condition arises for a process, the signal is added to a set of signals pending for the process. If the signal is not currently blocked by the process then it will be delivered. The process of signal delivery adds the signal to be delivered and those signals specified in the associated signal handler's sv_mask to a set of those masked for the process, saves the current process context, and places the process in the context of the signal handling routine. The call is arranged so that if the signal handling routine exits normally the signal mask will be restored and the process will resume execution in the original context. If the process wishes to resume in a different context, then it must arrange to restore the signal mask itself.

The mask of blocked signals is independent of handlers for signals. It prevents signals from being delivered much as a raised hardware interrupt priority level prevents hardware interrupts. Preventing an interrupt from occurring by changing the handler is analogous to disabling a device from further interrupts.

The signal handling routine sv_handler is called by a C call of the form


(*sv_handler)(signo, code, scp);
int signo; long code; struct sigcontext *scp;

The signo gives the number of the signal that occurred, and the code, a word of information supplied by the hardware. The scp parameter is a pointer to a machine-dependent structure containing the information for restoring the context before the signal.

2.3.4. Sending signals

A process can send a signal to another process or group of processes with the calls:


kill(pid, signo)
int pid, signo;

killpgrp(pgrp, signo)
int pgrp, signo;

Unless the process sending the signal is privileged, it and the process receiving the signal must have the same effective user id.

Signals are also sent implicitly from a terminal device to the process group associated with the terminal when certain input characters are typed.

2.3.5. Protecting critical sections

To block a section of code against one or more signals, a sigblock call may be used to add a set of signals to the existing mask, returning the old mask:


oldmask = sigblock(mask);
result long oldmask; long mask;

The old mask can then be restored later with sigsetmask,


oldmask = sigsetmask(mask);
result long oldmask; long mask;

The sigblock call can be used to read the current mask by specifying an empty mask.

It is possible to check conditions with some signals blocked, and then to pause waiting for a signal and restoring the mask, by using:


sigpause(mask);
long mask;

2.3.6. Signal stacks

Applications that maintain complex or fixed size stacks can use the call


struct sigstack {
	caddr_t	ss_sp;
	int	ss_onstack;
};

sigstack(ss, oss)
struct sigstack *ss; result struct sigstack *oss;

to provide the system with a stack based at ss_sp for delivery of signals. The value ss_onstack indicates whether the process is currently on the signal stack, a notion maintained in software by the system.

When a signal is to be delivered, the system checks whether the process is on a signal stack. If not, then the process is switched to the signal stack for delivery, with the return from the signal arranged to restore the previous stack.

If the process wishes to take a non-local exit from the signal routine, or run code from the signal stack that uses a different stack, a sigstack call should be used to reset the signal stack.

2.4. Timers

2.4.1. Real time

The system's notion of the current Greenwich time and the current time zone is set and returned by the call by the calls:


#include <sys/time.h>

settimeofday(tvp, tzp);
struct timeval *tp;
struct timezone *tzp;

gettimeofday(tp, tzp);
result struct timeval *tp;
result struct timezone *tzp;

where the structures are defined in <sys/time.h> as:


struct timeval {
	long	tv_sec;	/* seconds since Jan 1, 1970 */
	long	tv_usec;	/* and microseconds */
};

struct timezone {
	int	tz_minuteswest;	/* of Greenwich */
	int	tz_dsttime;	/* type of dst correction to apply */
};

Earlier versions of UNIX contained only a 1-second resolution version of this call, which remains as a library routine:


time(tvsec)
result long *tvsec;

returning only the tv_sec field from the gettimeofday call.

2.4.2. Interval time

The system provides each process with three interval timers, defined in <sys/time.h>:


#define	ITIMER_REAL	0	/* real time intervals */
#define	ITIMER_VIRTUAL	1	/* virtual time intervals */
#define	ITIMER_PROF	2	/* user and system virtual time */

The ITIMER_REAL timer decrements in real time. It could be used by a library routine to maintain a wakeup service queue. A SIGALRM signal is delivered when this timer expires.

The ITIMER_VIRTUAL timer decrements in process virtual time. It runs only when the process is executing. A SIGVTALRM signal is delivered when it expires.

The ITIMER_PROF timer decrements both in process virtual time and when the system is running on behalf of the process. It is designed to be used by processes to statistically profile their execution. A SIGPROF signal is delivered when it expires.

A timer value is defined by the itimerval structure:


struct itimerval {
	struct	timeval it_interval;	/* timer interval */
	struct	timeval it_value;	/* current value */
};

and a timer is set or read by the call:


getitimer(which, value);
int which; result struct itimerval *value;

setitimer(which, value, ovalue);
int which; struct itimerval *value; result struct itimerval *ovalue;

The third argument to setitimer specifies an optional structure to receive the previous contents of the interval timer. A timer can be disabled by specifying a timer value of 0.

The system rounds argument timer intervals to be not less than the resolution of its clock. This clock resolution can be determined by loading a very small value into a timer and reading the timer back to see what value resulted.

The alarm system call of earlier versions of UNIX is provided as a library routine using the ITIMER_REAL timer. The process profiling facilities of earlier versions of UNIX remain because it is not always possible to guarantee the automatic restart of system calls after receipt of a signal.


profil(buf, bufsize, offset, scale);
result char *buf; int bufsize, offset, scale;

2.5. Descriptors


.PP
.NH 3
The reference table
.PP
Each process has access to resources through
\fIdescriptors\fP.  Each descriptor is a handle allowing
the process to reference objects such as files, devices
and communications links.
.PP
Rather than allowing processes direct access to descriptors, the system
introduces a level of indirection, so that descriptors may be shared
between processes.  Each process has a \fIdescriptor reference table\fP,
containing pointers to the actual descriptors.  The descriptors
themselves thus have multiple references, and are reference counted by the
system.
.PP
Each process has a fixed size descriptor reference table, where
the size is returned by the \fIgetdtablesize\fP call:
.DS
nds = getdtablesize();
result int nds;
.DE
and guaranteed to be at least 20.  The entries in the descriptor reference
table are referred to by small integers; for example if there
are 20 slots they are numbered 0 to 19.
.NH 3
Descriptor properties
.PP
Each descriptor has a logical set of properties maintained
by the system and defined by its \fItype\fP.
Each type supports a set of operations;
some operations, such as reading and writing, are common to several
abstractions, while others are unique.
The generic operations applying to many of these types are described
in section 2.1.  Naming contexts, files and directories are described in
section 2.2.  Section 2.3 describes communications domains and sockets.
Terminals and (structured and unstructured) devices are described
in section 2.4.
.NH 3
Managing descriptor references
.PP
A duplicate of a descriptor reference may be made by doing
.DS
new = dup(old);
result int new; int old;
.DE
returning a copy of descriptor reference \fIold\fP indistinguishable from
the original.  The \fInew\fP chosen by the system will be the
smallest unused descriptor reference slot.
A copy of a descriptor reference may be made in a specific slot
by doing
.DS
dup2(old, new);
int old, new;
.DE
The \fIdup2\fP call causes the system to deallocate the descriptor reference
current occupying slot \fInew\fP, if any, replacing it with a reference
to the same descriptor as old.
This deallocation is also performed by:
.DS
close(old);
int old;
.DE
.NH 3
Multiplexing requests
.PP
The system provides a
standard way to do
synchronous and asynchronous multiplexing of operations.
.PP
Synchronous multiplexing is performed by using the \fIselect\fP call:
.DS
nds = select(nd, in, out, except, tvp);
result int nds; int nd; result *in, *out, *except;
struct timeval *tvp;
.DE
The \fIselect\fP call examines the descriptors
specified by the
sets \fIin\fP, \fIout\fP and \fIexcept\fP, replacing
the specified bit masks by the subsets that select for input,
output, and exceptional conditions respectively (\fInd\fP
indicates the size, in bytes, of the bit masks).
If any descriptors meet the following criteria,
then the number of such descriptors is returned in \fInds\fP and the
bit masks are updated.
.if n .ds bu *
.if t .ds bu \(bu
.IP \*(bu
A descriptor selects for input if an input oriented operation
such as \fIread\fP or \fIreceive\fP is possible, or if a
connection request may be accepted (see section 2.3.1.4).
.IP \*(bu
A descriptor selects for output if an output oriented operation
such as \fIwrite\fP or \fIsend\fP is possible, or if an operation
that was ``in progress'', such as connection establishment,
has completed (see section 2.1.3).
.IP \*(bu
A descriptor selects for an exceptional condition if a condition
that would cause a SIGURG signal to be generated exists (see section 1.3.2).
.LP
If none of the specified conditions is true, the operation blocks for
at most the amount of time specified by \fItvp\fP, or waits for one of the
conditions to arise if \fItvp\fP is given as 0.
.PP
Options affecting i/o on a descriptor
may be read and set by the call:
.DS
._d
dopt = fcntl(d, cmd, arg)
result int dopt; int d, cmd, arg;

/* interesting values for cmd */
#define	F_SETFL	3	/* set descriptor options */
#define	F_GETFL	4	/* get descriptor options */
#define	F_SETOWN	5	/* set descriptor owner (pid/pgrp) */
#define	F_GETOWN	6	/* get descriptor owner (pid/pgrp) */
.DE
The F_SETFL \fIcmd\fP may be used to set a descriptor in 
non-blocking i/o mode and/or enable signalling when i/o is
possible.  F_SETOWN may be used to specify a process or process
group to be signalled when using the latter mode of operation.
.PP
Operations on non-blocking descriptors will
either complete immediately,
note an error EWOULDBLOCK,
partially complete an input or output operation returning a partial count,
or return an error EINPROGRESS noting that the requested operation is
in progress.
A descriptor which has signalling enabled will cause the specified process
and/or process group
be signaled, with a SIGIO for input, output, or in-progress
operation complete, or
a SIGURG for exceptional conditions.
.PP
For example, when writing to a terminal
using non-blocking output,
the system will accept only as much data as there is buffer space for
and return; when making a connection on a \fIsocket\fP, the operation may
return indicating that the connection establishment is ``in progress''.
The \fIselect\fP facility can be used to determine when further
output is possible on the terminal, or when the connection establishment
attempt is complete.
.NH 3
Descriptor wrapping.\(dg
.PP
.FS
\(dg The facilities described in this section are not included
in 4.2BSD.
.FE
A user process may build descriptors of a specified type by
\fIwrapping\fP a communications channel with a system supplied protocol
translator:
.DS
new = wrap(old, proto)
result int new; int old; struct dprop *proto;
.DE
Operations on the descriptor \fIold\fP are then translated by the
system provided protocol translator into requests on the underyling
object \fIold\fP in a way defined by the protocol.
The protocols supported by the kernel may vary from system to system
and are described in the programmers manual.
.PP
Protocols may be based on communications multiplexing or a rights-passing
style of handling multiple requests made on the same object.  For instance,
a protocol for implementing a file abstraction may or may not include
locally generated ``read-ahead'' requests.  A protocol that provides for
read-ahead may provide higher performance but have a more difficult
implementation.
.PP
Another example is the terminal driving facilities.  Normally a terminal
is associated with a communications line and the terminal type
and standard terminal access protocol is wrapped around a synchronous
communications line and given to the user.  If a virtual terminal
is required, the terminal driver can be wrapped around a communications
link, the other end of which is held by a virtual terminal protocol
interpreter.

2.6. Resource controls


.NH 3
Process priorities
.PP
The system gives CPU scheduling priority to processes that have not used
CPU time recently.  This tends to favor interactive processes and
processes that execute only for short periods.
It is possible to determine the priority currently
assigned to a process, process group, or the processes of a specified user,
or to alter this priority using the calls:
.DS
._d
#define	PRIO_PROCESS	0	/* process */
#define	PRIO_PGRP	1	/* process group */
#define	PRIO_USER	2	/* user id */

prio = getpriority(which, who);
result int prio; int which, who;

setpriority(which, who, prio);
int which, who, prio;
.DE
The value \fIprio\fP is in the range \-20 to 20.
The default priority is 0; lower priorities cause more
favorable execution.
The \fIgetpriority\fP call returns the highest priority (lowest numerical value)
enjoyed by any of the specified processes.
The \fIsetpriority\fP call sets the priorities of all of the
specified processes to the specified value.
Only the super-user may lower priorities.
.NH 3
Resource utilization
.PP
The resources used by a process are returned by a \fIgetrusage\fP call,
returning information in a structure defined in :
.DS
._d
#define	RUSAGE_SELF	0		/* usage by this process */
#define	RUSAGE_CHILDREN	-1		/* usage by all children */

getrusage(who, rusage)
int who; result struct rusage *rusage;

._f
struct rusage {
	struct	timeval ru_utime;	/* user time used */
	struct	timeval ru_stime;	/* system time used */
	int	ru_maxrss;	/* maximum core resident set size: kbytes */
	int	ru_ixrss;	/* integral shared memory size (kbytes*sec) */
	int	ru_idrss;	/* unshared data " */
	int	ru_isrss;	/* unshared stack " */
	int	ru_minflt;	/* page-reclaims */
	int	ru_majflt;	/* page faults */
	int	ru_nswap;	/* swaps */
	int	ru_inblock;	/* block input operations */
	int	ru_oublock;	/* block output " */
	int	ru_msgsnd;	/* messages sent */
	int	ru_msgrcv;	/* messages received */
	int	ru_nsignals;	/* signals received */
	int	ru_nvcsw;	/* voluntary context switches */
	int	ru_nivcsw;	/* involuntary " */
};
.DE
The \fIwho\fP parameter specifies whose resource usage is to be returned.
The resources used by the current process, or by all
the terminated children of the current process may be requested.
.NH 3
Resource limits
.PP
The resources of a process for which limits are controlled by the
kernel are defined in , and controlled by the
\fIgetrlimit\fP and \fIsetrlimit\fP calls:
.DS
._d
#define	RLIMIT_CPU	0	/* cpu time in milliseconds */
#define	RLIMIT_FSIZE	1	/* maximum file size */
#define	RLIMIT_DATA	2	/* maximum data segment size */
#define	RLIMIT_STACK	3	/* maximum stack segment size */
#define	RLIMIT_CORE	4	/* maximum core file size */
#define	RLIMIT_RSS	5	/* maximum resident set size */

#define	RLIM_NLIMITS	6

#define	RLIM_INFINITY	0x7f\&f\&f\&f\&f\&f\&f

._f
struct rlimit {
	int	rlim_cur;	/* current (soft) limit */
	int	rlim_max;	/* hard limit */
};

getrlimit(resource, rlp)
int resource; result struct rlimit *rlp;

setrlimit(resource, rlp)
int resource; struct rlimit *rlp;
.DE
.PP
Only the super-user can raise the maximum limits.
Other users may only
alter \fIrlim_cur\fP within the range from 0 to \fIrlim_max\fP
or (irreversibly) lower \fIrlim_max\fP.

2.7. System operation support


.PP
Unless noted otherwise,
the calls in this section are permitted only to a privileged user.
.NH 3
Bootstrap operations
.PP
The call
.DS
mount(blkdev, dir, ronly);
char *blkdev, *dir; int ronly;
.DE
extends the UNIX name space.  The \fImount\fP call specifies
a block device \fIblkdev\fP containing a UNIX file system
to be made available starting at \fIdir\fP.  If \fIronly\fP is
set then the file system is read-only; writes to the file system
will not be permitted and access times will not be updated
when files are referenced.
\fIDir\fP is normally a name in the root directory.
.PP
The call
.DS
swapon(blkdev, size);
char *blkdev; int size;
.DE
specifies a device to be made available for paging and swapping.
.PP
.NH 3
Shutdown operations
.PP
The call
.DS
unmount(dir);
char *dir;
.DE
unmounts the file system mounted on \fIdir\fP.
This call will succeed only if the file system is
not currently being used.
.PP
The call
.DS
sync();
.DE
schedules input/output to clean all system buffer caches.
(This call does not require priveleged status.)
.PP
The call
.DS
reboot(how)
int how;
.DE
causes a machine halt or reboot.  The call may request a reboot
by specifying \fIhow\fP as RB_AUTOBOOT, or that the machine be halted
with RB_HALT.  These constants are defined in .
.NH 3
Accounting
.PP
The system optionally keeps an accounting record in a file
for each process that exits on the system.
The format of this record is beyond the scope of this document.
The accounting may be enabled to a file \fIname\fP by doing
.DS
acct(path);
char *path;
.DE
If \fIpath\fP is null, then accounting is disabled.  Otherwise,
the named file becomes the accounting file.

3. System facilities


This section discusses the system facilities that
are not considered part of the kernel.
.PP
The system abstractions described are:
.IP "Directory contexts
.br
A directory context is a position in the UNIX file system name
space.  Operations on files and other named objects in a file system are
always specified relative to such a context.
.IP "Files
.br
Files are used to store uninterpreted sequence of bytes on which
random access \fIreads\fP and \fIwrites\fP may occur.
Pages from files may also be mapped into process address space.
A directory may be read as a file\(dg.
.FS
\(dg Support for mapping files is not included in the 4.2 release.
.FE
.IP "Communications domains
.br
A communications domain represents
an interprocess communications environment, such as the communications
facilities of the UNIX system,
communications in the INTERNET, or the resource sharing protocols
and access rights of a resource sharing system on a local network.
.IP "Sockets
.br
A socket is an endpoint of communication and the focal
point for IPC in a communications domain.  Sockets may be created in pairs,
or given names and used to rendezvous with other sockets
in a communications domain, accepting connections from these
sockets or exchanging messages with them.  These operations model
a labeled or unlabeled communications graph, and can be used in a
wide variety of communications domains.  Sockets can have different
\fItypes\fP\| to provide different semantics of communication,
increasing the flexibility of the model.
.IP "Terminals and other devices
.br
Devices include
terminals, providing input editing and interrupt generation
and output flow control and editing, magnetic tapes,
disks and other peripherals.  They often support the generic
\fIread\fP and \fIwrite\fP operations as well as a number of \fIioctl\fP\|s.
.IP "Processes
.br
Process descriptors provide facilities for control and debugging of
other processes.
.ds ss 2

3.1. Generic operations


.PP
.PP
Many system abstractions support the
operations \fIread\fP, \fIwrite\fP and \fIioctl\fP.  We describe
the basics of these common primitives here.
Similarly, the mechanisms whereby normally synchronous operations
may occur in a non-blocking or asynchronous fashion are
common to all system-defined abstractions and are described here.
.NH 3
Read and write
.PP
The \fIread\fP and \fIwrite\fP system calls can be applied
to communications channels, files, terminals and devices.
They have the form:
.DS
cc = read(fd, buf, nbytes);
result int cc; int fd; result caddr_t buf; int nbytes;

cc = write(fd, buf, nbytes);
result int cc; int fd; caddr_t buf; int nbytes;
.DE
The \fIread\fP call transfers as much data as possible from the
object defined by \fIfd\fP to the buffer at address \fIbuf\fP of
size \fInbytes\fP.  The number of bytes transferred is
returned in \fIcc\fP, which is \-1 if a return occurred before
any data was transferred because of an error or use of non-blocking
operations.
.PP
The \fIwrite\fP call transfers data from the buffer to the
object defined by \fIfd\fP.  Depending on the type of \fIfd\fP,
it is possible that the \fIwrite\fP call will accept some portion
of the provided bytes; the user should resubmit the other bytes
in a later request in this case.
Error returns because of interrupted or otherwise incomplete operations
are possible.
.PP
Scattering of data on input or gathering of data for output
is also possible using an array of input/output vector descriptors.
The type for the descriptors is defined in  as:
.DS
._f
struct iovec {
	caddr_t	iov_msg;	/* base of a component */
	int	iov_len;	/* length of a component */
};
.DE
The calls using an array of descriptors are:
.DS
cc = readv(fd, iov, iovlen);
result int cc; int fd; struct iovec *iov; int iovlen;

cc = writev(fd, iov, iovlen);
result int cc; int fd; struct iovec *iov; int iovlen;
.DE
Here \fIiovlen\fP is the count of elements in the \fIiov\fP array.
.NH 3
Input/output control
.PP
Control operations on an object are performed by the \fIioctl\fP
operation:
.DS
ioctl(fd, request, buffer);
int fd, request; caddr_t buffer;
.DE
This operation causes the specified \fIrequest\fP to be performed
on the object \fIfd\fP.  The \fIrequest\fP parameter specifies
whether the argument buffer is to be read, written, read and written,
or is not needed, and also the size of the buffer, as well as the
request.
Different descriptor types and subtypes within descriptor types
may use distinct \fIioctl\fP requests.  For example,
operations on terminals control flushing of input and output
queues and setting of terminal parameters; operations on
disks cause formatting operations to occur; operations on tapes
control tape positioning.
.PP
The names for basic control operations are defined in .
.NH 3
Non-blocking and asynchronous operations
.PP
A process that wishes to do non-blocking operations on one of
its descriptors sets the descriptor in non-blocking mode as
described in section 1.5.4.  Thereafter the \fIread\fP call will
return a specific EWOULDBLOCK error indication if there is no data to be
\fIread\fP.  The process may
\fIdselect\fP the associated descriptor to determine when a read is
possible.
.PP
Output attempted when a descriptor can accept less than is requested
will either accept some of the provided data, returning a shorter than normal
length, or return an error indicating that the operation would block.
More output can be performed as soon as a \fIselect\fP call indicates
the object is writeable.
.PP
Operations other than data input or output
may be performed on a descriptor in a non-blocking fashion.
These operations will return with a characteristic error indicating
that they are in progress
if they cannot return immediately.  The descriptor
may then be \fIselect\fPed for \fIwrite\fP to find out
when the operation can be retried.  When \fIselect\fP indicates
the descriptor is writeable, a respecification of the original
operation will return the result of the operation.

3.2. File system


.NH 3
Overview
.PP
The file system abstraction provides access to a hierarchical
file system structure.
The file system contains directories (each of which may contain
other sub-directories) as well as files and references to other
objects such as devices and inter-process communications sockets.
.PP
Each file is organized as a linear array of bytes.  No record
boundaries or system related information is present in
a file.
Files may be read and written in a random-access fashion.
The user may read the data in a directory as though
it were an ordinary file to determine the names of the contained files,
but only the system may write into the directories.
The file system stores only a small amount of ownership, protection and usage
information with a file.
.NH 3
Naming
.PP
The file system calls take \fIpath name\fP arguments.
These consist of a zero or more component \fIfile names\fP
separated by ``/'' characters, where each file name
is up to 255 ASCII characters excluding null and ``/''.
.PP
Each process always has two naming contexts: one for the
root directory of the file system and one for the
current working directory.  These are used
by the system in the filename translation process.
If a path name begins with a ``/'', it is called
a full path name and interpreted relative to the root directory context.
If the path name does not begin with a ``/'' it is called
a relative path name and interpreted relative to the current directory
context.
.PP
The system limits
the total length of a path name to 1024 characters.
.PP
The file name ``..'' in each directory refers to
the parent directory of that directory.
The parent directory of a file system is always the systems root directory.
.PP
The calls
.DS
chdir(path);
char *path;

chroot(path)
char *path;
.DE
change the current working directory and root directory context of a process.
Only the super-user can change the root directory context of a process.
.NH 3
Creation and removal
.PP
The file system allows directories, files, special devices,
and ``portals'' to be created and removed from the file system.
.NH 4
Directory creation and removal
.PP
A directory is created with the \fImkdir\fP system call:
.DS
mkdir(path, mode);
char *path; int mode;
.DE
and removed with the \fIrmdir\fP system call:
.DS
rmdir(path);
char *path;
.DE
A directory must be empty if it is to be deleted.
.NH 4
File creation
.PP
Files are created with the \fIopen\fP system call,
.DS
fd = open(path, oflag, mode);
result int fd; char *path; int oflag, mode;
.DE
The \fIpath\fP parameter specifies the name of the
file to be created.  The \fIoflag\fP parameter must
include O_CREAT from below to cause the file to be created.  The protection
for the new file is specified in \fImode\fP.  Bits for \fIoflag\fP are
defined in :
.DS
._d
#define	O_RDONLY	000	/* open for reading */
#define	O_WRONLY	001	/* open for writing */
#define	O_RDWR	002	/* open for read & write */
#define	O_NDELAY	004 	/* non-blocking open */
#define	O_APPEND	010	/* append on each write */
#define	O_CREAT	01000	/* open with file create */
#define	O_TRUNC	02000	/* open with truncation */
#define	O_EXCL	04000	/* error on create if file exists */
.DE
.PP
One of O_RDONLY, O_WRONLY and O_RDWR should be specified,
indicating what types of operations are desired to be performed
on the open file.  The operations will be checked against the user's
access rights to the file before allowing the \fIopen\fP to succeed.
Specifying O_APPEND causes writes to automatically append to the
file.
The flag O_CREAT causes the file to be created if it does not
exist, with the specified \fImode\fP, owned by the current user
and the group of the containing directory.
.PP
If the open specifies to create the file with O_EXCL
and the file already exists, then the \fIopen\fP will fail
without affecting the file in any way.  This provides a
simple exclusive access facility.
.NH 4
Creating references to devices
.PP
The file system allows entries which reference peripheral devices.
Peripherals are distinguished as \fIblock\fP or \fIcharacter\fP
devices according by their ability to support block-oriented
operations.
Devices are identified by their ``major'' and ``minor''
device numbers.  The major device number determines the kind
of peripheral it is, while the minor device number indicates
one of possibly many peripherals of that kind.
Structured devices have all operations performed internally
in ``block'' quantities while
unstructured devices often have a number of
special \fIioctl\fP operations, and may have input and output
performed in large units.
The \fImknod\fP call creates special entries:
.DS
mknod(path, mode, dev);
char *path; int mode, dev;
.DE
where \fImode\fP is formed from the object type
and access permissions.  The parameter \fIdev\fP is a configuration
dependent parameter used to identify specific character or
block i/o devices.
.NH 4
Portal creation\(dg
.PP
.FS
\(dg The \fIportal\fP call is not implemented in 4.2BSD.
.FE
The call
.DS
fd = portal(name, server, param, dtype, protocol, domain, socktype)
result int fd; char *name, *server, *param; int dtype, protocol;
int domain, socktype;
.DE
places a \fIname\fP in the file system name space that causes connection to a
server process when the name is used.
The portal call returns an active portal in \fIfd\fP as though an
access had occurred to activate an inactive portal, as now described.
.PP
When an inactive portal is accesseed, the system sets up a socket
of the specified \fIsocktype\fP in the specified communications
\fIdomain\fP (see section 2.3), and creates the \fIserver\fP process,
giving it the specified \fIparam\fP as argument to help it identify
the portal, and also giving it the newly created socket as descriptor
number 0.  The accessor of the portal will create a socket in the same
\fIdomain\fP and \fIconnect\fP to the server.  The user will then
\fIwrap\fP the socket in the specified \fIprotocol\fP to create an object of
the required descriptor type \fIdtype\fP and proceed with the
operation which was in progress before the portal was encountered.
.PP
While the server process holds the socket (which it received as \fIfd\fP
from the \fIportal\fP call on descriptor 0 at activation) further references
will result in connections being made to the same socket.
.NH 4
File, device, and portal removal
.PP
A reference to a file, special device or portal may be removed with the
\fIunlink\fP call,
.DS
unlink(path);
char *path;
.DE
The caller must have write access to the directory in which
the file is located for this call to be successful.
.NH 3
Reading and modifying file attributes
.PP
Detailed information about the attributes of a file
may be obtained with the calls:
.DS
#include 

stat(path, stb);
char *path; result struct stat *stb;

fstat(fd, stb);
int fd; result struct stat *stb;
.DE
The \fIstat\fP structure includes the file
type, protection, ownership, access times,
size, and a count of hard links.
If the file is a symbolic link, then the status of the link
itself (rather than the file the link references)
may be found using the \fIlstat\fP call:
.DS
lstat(path, stb);
char *path; result struct stat *stb;
.DE
.PP
Newly created files are assigned the user id of the
process that created it and the group id of the directory
in which it was created.  The ownership of a file may
be changed by either of the calls
.DS
chown(path, owner, group);
char *path; int owner, group;

fchown(fd, owner, group);
int fd, owner, group;
.DE
.PP
In addition to ownership, each file has three levels of access
protection associated with it.  These levels are owner relative,
group relative, and global (all users and groups).  Each level
of access has separate indicators for read permission, write
permission, and execute permission.
The protection bits associated with a file may be set by either
of the calls:
.DS
chmod(path, mode);
char *path; int mode;

fchmod(fd, mode);
int fd, mode;
.DE
where \fImode\fP is a value indicating the new protection
of the file.  The file mode is a three digit octal number.
Each digit encodes read access as 4, write access as 2 and execute
access as 1, or'ed together.  The 0700 bits describe owner
access, the 070 bits describe the access rights for processes in the same
group as the file, and the 07 bits describe the access rights
for other processes. 
.PP
Finally, the access and modify times on a file may be set by the call:
.DS
utimes(path, tvp)
char *path; struct timeval *tvp[2];
.DE
This is particularly useful when moving files between media, to
preserve relationships between the times the file was modified.
.NH 3
Links and renaming
.PP
Links allow multiple names for a file
to exist.  Links exist independently of the file linked to.
.PP
Two types of links exist, \fIhard\fP links and \fIsymbolic\fP
links.  A hard link is a reference counting mechanism that
allows a file to have multiple names within the same file
system.  Symbolic links cause string substitution
during the pathname interpretation process.
.PP
Hard links and symbolic links have different
properties.  A hard link insures the target
file will always be accessible, even after its original
directory entry is removed; no such guarantee exists for a symbolic link.
Symbolic links can span file systems boundaries.
.PP
The following calls create a new link, named \fIpath2\fP,
to \fIpath1\fP:
.DS
link(path1, path2);
char *path1, *path2;

symlink(path1, path2);
char *path1, *path2;
.DE
The \fIunlink\fP primitive may be used to remove
either type of link. 
.PP
If a file is a symbolic link, the ``value'' of the
link may be read with the \fIreadlink\fP call,
.DS
len = readlink(path, buf, bufsize);
result int len; result char *path, *buf; int bufsize;
.DE
This call returns, in \fIbuf\fP, the null-terminated string
substituted into pathnames passing through \fIpath\fP\|.
.PP
Atomic renaming of file system resident objects is possible
with the \fIrename\fP call:
.DS
rename(oldname, newname);
char *oldname, *newname;
.DE
where both \fIoldname\fP and \fInewname\fP must be
in the same file system.
If \fInewname\fP exists and is a directory, then it must be empty.
.NH 3
Extension and truncation
.PP
Files are created with zero length and may be extended
simply by writing or appending to them.  While a file is
open the system maintains a pointer into the file
indicating the current location in the file associated with
the descriptor.  This pointer may be moved about in the
file in a random access fashion.
To set the current offset into a file, the \fIlseek\fP
call may be used,
.DS
oldoffset = lseek(fd, offset, type);
result off_t oldoffset; int fd; off_t offset; int type;
.DE
where \fItype\fP is given in  as one of,
.DS
._d
#define	L_SET	0	/* set absolute file offset */
#define	L_INCR	1	/* set file offset relative to current position */
#define	L_XTND	2	/* set offset relative to end-of-file */
.DE
The call ``lseek(fd, 0, L_INCR)''
returns the current offset into the file.
.PP
Files may have ``holes'' in them.  Holes are void areas in the
linear extent of the file where data has never been
written.  These may be created by seeking to
a location in a file past the current end-of-file and writing.
Holes are treated by the system as zero valued bytes.
.PP
A file may be truncated with either of the calls:
.DS
truncate(path, length);
char *path; int length;

ftruncate(fd, length);
int fd, length;
.DE
reducing the size of the specified file to \fIlength\fP bytes.
.NH 3
Checking accessibility
.PP
A process running with
different real and effective user ids
may interrogate the accessibility of a file to the
real user by using
the \fIaccess\fP call:
.DS
accessible = access(path, how);
result int accessible; char *path; int how;
.DE
Here \fIhow\fP is constructed by or'ing the following bits, defined
in :
.DS
._d
#define	F_OK	0	/* file exists */
#define	X_OK	1	/* file is executable */
#define	W_OK	2	/* file is writable */
#define	R_OK	4	/* file is readable */
.DE
The presence or absence of advisory locks does not affect the
result of \fIaccess\fP\|.
.NH 3
Locking
.PP
The file system provides basic facilities that allow cooperating processes
to synchronize their access to shared files.  A process may
place an advisory \fIread\fP or \fIwrite\fP lock on a file,
so that other cooperating processes may avoid interfering
with the process' access.  This simple mechanism
provides locking with file granularity.  More granular
locking can be built using the IPC facilities to provide a lock
manager.
The system does not force processes to obey the locks;
they are of an advisory nature only.
.PP
Locking is performed after an \fIopen\fP call by applying the
\fIflock\fP primitive,
.DS
flock(fd, how);
int fd, how;
.DE
where the \fIhow\fP parameter is formed from bits defined in :
.DS
._d
#define	LOCK_SH	1	/* shared lock */
#define	LOCK_EX	2	/* exclusive lock */
#define	LOCK_NB	4	/* don't block when locking */
#define	LOCK_UN	8	/* unlock */
.DE
Successive lock calls may be used to increase or
decrease the level of locking.  If an object is currently
locked by another process when a \fIflock\fP call is made,
the caller will be blocked until the current lock owner
releases the lock; this may be avoided by including LOCK_NB
in the \fIhow\fP parameter.
Specifying LOCK_UN removes all locks associated with the descriptor.
Advisory locks held by a process are automatically deleted when
the process terminates.
.NH 3
Disk quotas
.PP
As an optional facility, each file system may be requested to
impose limits on a user's disk usage.
Two quantities are limited: the total amount of disk space which
a user may allocate in a file system and the total number of files
a user may create in a file system.  Quotas are expressed as
\fIhard\fP limits and \fIsoft\fP limits.  A hard limit is
always imposed; if a user would exceed a hard limit, the operation
which caused the resource request will fail.  A soft limit results
in the user receiving a warning message, but with allocation succeeding.
Facilities are provided to turn soft limits into hard limits if a
user has exceeded a soft limit for an unreasonable period of time.
.PP
To enable disk quotas on a file system the \fIsetquota\fP call
is used:
.DS
setquota(special, file)
char *special, *file;
.DE
where \fIspecial\fP refers to a structured device file where
a mounted file system exists, and
\fIfile\fP refers to a disk quota file (residing on the file
system associated with \fIspecial\fP) from which user quotas
should be obtained.  The format of the disk quota file is 
implementation dependent.
.PP
To manipulate disk quotas the \fIquota\fP call is provided:
.DS
#include 

quota(cmd, uid, arg, addr)
int cmd, uid, arg; caddr_t addr;
.DE
The indicated \fIcmd\fP is applied to the user ID \fIuid\fP.
The parameters \fIarg\fP and \fIaddr\fP are command specific.
The file  contains definitions pertinent to the
use of this call.

3.3. Interprocess communications


.NH 3
Interprocess communication primitives
.NH 4
Communication domains
.PP
The system provides access to an extensible set of 
communication \fIdomains\fP.  A communication domain
is identified by a manifest constant defined in the
file .
Important standard domains supported by the system are the ``unix''
domain, AF_UNIX, for communication within the system, and the ``internet''
domain for communication in the DARPA internet, AF_INET.  Other domains can
be added to the system.
.NH 4
Socket types and protocols
.PP
Within a domain, communication takes place between communication endpoints
known as \fIsockets\fP.  Each socket has the potential to exchange
information with other sockets within the domain.
.PP
Each socket has an associated
abstract type, which describes the semantics of communication using that
socket.  Properties such as reliability, ordering, and prevention
of duplication of messages are determined by the type.
The basic set of socket types is defined in :
.DS
/* Standard socket types */
._d
#define	SOCK_DGRAM	1	/* datagram */
#define	SOCK_STREAM	2	/* virtual circuit */
#define	SOCK_RAW	3	/* raw socket */
#define	SOCK_RDM	4	/* reliably-delivered message */
#define	SOCK_SEQPACKET	5	/* sequenced packets */
.DE
The SOCK_DGRAM type models the semantics of datagrams in network communication:
messages may be lost or duplicated and may arrive out-of-order.
The SOCK_RDM type models the semantics of reliable datagrams: messages
arrive unduplicated and in-order, the sender is notified if
messages are lost.
The \fIsend\fP and \fIreceive\fP operations (described below)
generate reliable/unreliable datagrams.
The SOCK_STREAM type models connection-based virtual circuits: two-way
byte streams with no record boundaries.
The SOCK_SEQPACKET type models a connection-based,
full-duplex, reliable, sequenced packet exchange;
the sender is notified if messages are lost, and messages are never
duplicated or presented out-of-order.
Users of the last two abstractions may use the facilities for
out-of-band transmission to send out-of-band data.
.PP
SOCK_RAW is used for unprocessed access to internal network layers
and interfaces; it has no specific semantics.
.PP
Other socket types can be defined.\(dg
.FS
\(dg 4.2BSD does not support the SOCK_RDM and SOCK_SEQPACKET types.
.FE
.PP
Each socket may have a concrete \fIprotocol\fP associated with it.
This protocol is used within the domain to provide the semantics
required by the socket type.
For example, within the ``internet'' domain, the SOCK_DGRAM type may be
implemented by the UDP user datagram protocol, and the SOCK_STREAM
type may be implemented by the TCP transmission control protocol, while
no standard protocols to provide SOCK_RDM or SOCK_SEQPACKET sockets exist.
.NH 4
Socket creation, naming and service establishment
.PP
Sockets may be \fIconnected\fP or \fIunconnected\fP.  An unconnected
socket descriptor is obtained by the \fIsocket\fP call:
.DS
s = socket(domain, type, protocol);
result int s; int domain, type, protocol;
.DE
.PP
An unconnected socket descriptor may yield a connected socket descriptor
in one of two ways: either by actively connecting to another socket,
or by becoming associated with a name in the communications domain and
\fIaccepting\fP a connection from another socket.
.PP
To accept connections, a socket must first have a binding
to a name within the communications domain.  Such a binding
is established by a \fIbind\fP call:
.DS
bind(s, name, namelen);
int s; char *name; int namelen;
.DE
A socket's bound name may be retrieved with a \fIgetsockname\fP call:
.DS
getsockname(s, name, namelen);
int s; result caddr_t name; result int *namelen;
.DE
while the peer's name can be retrieved with \fIgetpeername\fP:
.DS
getpeername(s, name, namelen);
int s; result caddr_t name; result int *namelen;
.DE
Domains may support sockets with several names.
.NH 4
Accepting connections
.PP
Once a binding is made, it is possible to \fIlisten\fP for
connections:
.DS
listen(s, backlog);
int s, backlog;
.DE
The \fIbacklog\fP specifies the maximum count of connections
that can be simultaneously queued awaiting acceptance.
.PP
An \fIaccept\fP call:
.DS
t = accept(s, name, anamelen);
result int t; int s; result caddr_t name; result int *anamelen;
.DE
returns a descriptor for a new, connected, socket
from the queue of pending connections on \fIs\fP.
.NH 4
Making connections
.PP
An active connection to a named socket is made by the \fIconnect\fP call:
.DS
connect(s, name, namelen);
int s; caddr_t name; int namelen;
.DE
.PP
It is also possible to create connected pairs of sockets without
using the domain's name space to rendezvous; this is done with the
\fIsocketpair\fP call\(dg:
.FS
\(dg 4.2BSD supports \fIsocketpair\fP creation only in the ``unix''
communication domain.
.FE
.DS
socketpair(d, type, protocol, sv);
int d, type, protocol; result int sv[2];
.DE
Here the returned \fIsv\fP descriptors correspond to those obtained with
\fIaccept\fP and \fIconnect\fP.
.PP
The call
.DS
pipe(pv)
result int pv[2];
.DE
creates a pair of SOCK_STREAM sockets in the UNIX domain,
with pv[0] only writeable and pv[1] only readable.
.NH 4
Sending and receiving data
.PP
Messages may be sent from a socket by:
.DS
cc = sendto(s, buf, len, flags, to, tolen);
result int cc; int s; caddr_t buf; int len, flags; caddr_t to; int tolen;
.DE
if the socket is not connected or:
.DS
cc = send(s, buf, len, flags);
result int cc; int s; caddr_t buf; int len, flags;
.DE
if the socket is connected.
The corresponding receive primitives are:
.DS
msglen = recvfrom(s, buf, len, flags, from, fromlenaddr);
result int msglen; int s; result caddr_t buf; int len, flags;
result caddr_t from; result int *fromlenaddr;
.DE
and
.DS
msglen = recv(s, buf, len, flags);
result int msglen; int s; result caddr_t buf; int len, flags;
.DE
.PP
In the unconnected case,
the parameters \fIto\fP and \fItolen\fP
specify the destination or source of the message, while
the \fIfrom\fP parameter stores the source of the message,
and \fI*fromlenaddr\fP initially gives the size of the \fIfrom\fP
buffer and is updated to reflect the true length of the \fIfrom\fP
address.
.PP
All calls cause the message to be received in or sent from
the message buffer of length \fIlen\fP bytes, starting at address \fIbuf\fP.
The \fIflags\fP specify
peeking at a message without reading it or sending or receiving
high-priority out-of-band messages, as follows:
.DS
._d
#define	MSG_PEEK	0x1	/* peek at incoming message */
#define	MSG_OOB	0x2	/* process out-of-band data */
.DE
.NH 4
Scatter/gather and exchanging access rights
.PP
It is possible scatter and gather data and to exchange access rights
with messages.  When either of these operations is involved,
the number of parameters to the call becomes large.
Thus the system defines a message header structure, in ,
which can be
used to conveniently contain the parameters to the calls:
.DS
.if t .ta .5i 1.25i 2i 2.7i
.if n ._f
struct msghdr {
	caddr_t	msg_name;		/* optional address */
	int	msg_namelen;	/* size of address */
	struct	iov *msg_iov;	/* scatter/gather array */
	int	msg_iovlen;		/* # elements in msg_iov */
	caddr_t	msg_accrights;	/* access rights sent/received */
	int	msg_accrightslen;	/* size of msg_accrights */
};
.DE
Here \fImsg_name\fP and \fImsg_namelen\fP specify the source or destination
address if the socket is unconnected; \fImsg_name\fP may be given as
a null pointer if no names are desired or required.
The \fImsg_iov\fP and \fImsg_iovlen\fP describe the scatter/gather
locations, as described in section 2.1.3.
Access rights to be sent along with the message are specified
in \fImsg_accrights\fP, which has length \fImsg_accrightslen\fP.
In the ``unix'' domain these are an array of integer descriptors,
taken from the sending process and duplicated in the receiver.
.PP
This structure is used in the operations \fIsendmsg\fP and \fIrecvmsg\fP:
.DS
sendmsg(s, msg, flags);
int s; struct msghdr *msg; int flags;

msglen = recvmsg(s, msg, flags);
result int msglen; int s; result struct msghdr *msg; int flags;
.DE
.NH 4
Using read and write with sockets
.PP
The normal UNIX \fIread\fP and \fIwrite\fP calls may be
applied to connected sockets and translated into \fIsend\fP and \fIreceive\fP
calls from or to a single area of memory and discarding any rights
received.  A process may operate on a virtual circuit socket, a terminal
or a file with blocking or non-blocking input/output
operations without distinguishing the descriptor type.
.NH 4
Shutting down halves of full-duplex connections
.PP
A process that has a full-duplex socket such as a virtual circuit
and no longer wishes to read from or write to this socket can
give the call:
.DS
shutdown(s, direction);
int s, direction;
.DE
where \fIdirection\fP is 0 to not read further, 1 to not
write further, or 2 to completely shut the connection down.
.NH 4
Socket and protocol options
.PP
Sockets, and their underlying communication protocols, may
support \fIoptions\fP.  These options may be used to manipulate
implementation specific or non-standard facilities. 
The \fIgetsockopt\fP
and \fIsetsockopt\fP calls are used to control options:
.DS
getsockopt(s, level, optname, optval, optlen)
int s, level, optname; result caddr_t optval; result int *optlen;

setsockopt(s, level, optname, optval, optlen)
int s, level, optname; caddr_t optval; int optlen;
.DE
The option \fIoptname\fP is interpreted at the indicated
protocol \fIlevel\fP for socket \fIs\fP.  If a value is specified
with \fIoptval\fP and \fIoptlen\fP, it is interpreted by
the software operating at the specified \fIlevel\fP.  The \fIlevel\fP
SOL_SOCKET is reserved to indicate options maintained
by the socket facilities.  Other \fIlevel\fP values indicate
a particular protocol which is to act on the option request;
these values are normally interpreted as a ``protocol number''.
.NH 3
UNIX domain
.PP
This section describes briefly the properties of the UNIX communications
domain.
.NH 4
Types of sockets
.PP
In the UNIX domain,
the SOCK_STREAM abstraction provides pipe-like
facilities, while SOCK_DGRAM provides (usually)
reliable message-style communications.
.NH 4
Naming
.PP
Socket names are strings and may appear in the UNIX file
system name space through portals\(dg.
.FS
\(dg The 4.2BSD implementation of the UNIX domain embeds
bound sockets in the UNIX file system name space; this
is a side effect of the implementation.
.FE
.NH 4
Access rights transmission
.PP
The ability to pass UNIX descriptors with messages in this domain
allows migration of service within the system and allows
user processes to be used in building system facilities.
.NH 3
INTERNET domain
.PP
This section describes briefly how the INTERNET domain is
mapped to the model described in this section.  More
information will be found in the document describing the
network implementation in 4.2BSD.
.NH 4
Socket types and protocols
.PP
SOCK_STREAM is supported by the INTERNET TCP protocol;
SOCK_DGRAM by the UDP protocol.  The SOCK_SEQPACKET
has no direct INTERNET family analogue; a protocol
based on one from the XEROX NS family and layered on
top of IP could be implemented to fill this gap.
.NH 4
Socket naming
.PP
Sockets in the INTERNET domain have names composed of the 32 bit
internet address, and a 16 bit port number.
Options may be used to
provide source routing for the address, security options,
or additional address for subnets of INTERNET for which the basic 32 bit
addresses are insufficient.
.NH 4
Access rights transmission
.PP
No access rights transmission facilities are provided in the INTERNET domain.
.NH 4
Raw access
.PP
The INTERNET domain allows the super-user access to the raw facilities
of the various network interfaces and the various internal layers
of the protocol implementation.  This allows administrative and debugging
functions to occur.  These interfaces are modeled as SOCK_RAW sockets.

3.4. Terminals and Devices


.NH 3
Terminals
.PP
Terminals support \fIread\fP and \fIwrite\fP i/o operations,
as well as a collection of terminal specific \fIioctl\fP operations,
to control input character editing, and output delays.
.NH 4
Terminal input
.PP
Terminals are handled according to the underlying communication
characteristics such as baud rate and required delays,
and a set of software parameters.
.NH 5
Input modes
.PP
A terminal is in one of three possible modes: \fIraw\fP, \fIcbreak\fP,
or \fIcooked\fP.
In raw mode all input is passed through to the
reading process immediately and without interpretation.
In cbreak mode, the handler interprets input only by looking
for characters that cause interrupts or output flow control;
all other characters are made available as in raw mode.
In cooked mode, input
is processed to provide standard line-oriented local editing functions,
and input is presented on a line-by-line basis.
.NH 5
Interrupt characters
.PP
Interrupt characters are interpreted by the terminal handler only in
cbreak and cooked modes, and
cause a software interrupt to be sent to all processes in the process
group associated with the terminal.  Interrupt characters exist
to send SIGINT
and SIGQUIT signals,
and to stop a process group
with the SIGTSTP signal either immediately, or when
all input up to the stop character has been read.
.NH 5
Line editing
.PP
When the terminal is in cooked mode, editing of an input line
is performed.  Editing facilities allow deletion of the previous
character or word, or deletion of the current input line. 
In addition, a special character may be used to reprint the current
input line after some number of editing operations have been applied.
.PP
Certain other characters are interpreted specially when a process is
in cooked mode.  The \fIend of line\fP character determines
the end of an input record.  The \fIend of file\fP character simulates
an end of file occurrence on terminal input.  Flow control is provided
by \fIstop output\fP and \fIstart output\fP control characters.  Output
may be flushed with the \fIflush output\fP character; and a \fIliteral
character\fP may be used to force literal input of the immediately
following character in the input line.
.NH 4
Terminal output
.PP
On output, the terminal handler provides some simple formatting services.
These include converting the carriage return character to the
two character return-linefeed sequence, displaying non-graphic
ASCII characters as ``^character'', inserting delays after certain
standard control characters, expanding tabs, and providing translations
for upper-case only terminals.
.NH 4
Terminal control operations
.PP
When a terminal is first opened it is initialized to a standard
state and configured with a set of standard control, editing,
and interrupt characters.  A process
may alter this configuration with certain
control operations, specifying parameters in a standard structure:
.DS
._f
struct ttymode {
	short	tt_ispeed;	/* input speed */
	int	tt_iflags;	/* input flags */
	short	tt_ospeed;	/* output speed */
	int	tt_oflags;	/* output flags */
};
.DE
and ``special characters'' are specified with the 
\fIttychars\fP structure,
.DS
._f
struct ttychars {
	char	tc_erasec;	/* erase char */
	char	tc_killc;	/* erase line */
	char	tc_intrc;	/* interrupt */
	char	tc_quitc;	/* quit */
	char	tc_startc;	/* start output */
	char	tc_stopc;	/* stop output */
	char	tc_eofc;	/* end-of-file */
	char	tc_brkc;	/* input delimiter (like nl) */
	char	tc_suspc;	/* stop process signal */
	char	tc_dsuspc;	/* delayed stop process signal */
	char	tc_rprntc;	/* reprint line */
	char	tc_flushc;	/* flush output (toggles) */
	char	tc_werasc;	/* word erase */
	char	tc_lnextc;	/* literal next character */
};
.DE
.NH 4
Terminal hardware support
.PP
The terminal handler allows a user to access basic
hardware related functions; e.g. line speed,
modem control, parity, and stop bits.  A special signal,
SIGHUP, is automatically
sent to processes in a terminal's process
group when a carrier transition is detected.  This is
normally associated with a user hanging up on a modem
controlled terminal line.
.NH 3
Structured devices
.PP
Structures devices are typified by disks and magnetic
tapes, but may represent any random-access device.
The system performs read-modify-write type buffering actions on block
devices to allow them to be read and written in a totally random
access fashion like ordinary files.
File systems are normally created in block devices.
.NH 3
Unstructured devices
.PP
Unstructured devices are those devices which
do not support block structure.  Familiar unstructured devices
are raw communications lines (with
no terminal handler), raster plotters, magnetic tape and disks unfettered
by buffering and permitting large block input/output and positioning
and formatting commands.

3.5. Process and kernel descriptors


.PP
The status of the facilities in this section is still under discussion.
The \fIptrace\fP facility of 4.1BSD is provided in 4.2BSD.  Planned
enhancements would allow a descriptor based process control facility.

I. Summary of facilities


.PP
.de h
.br
.if n .ne 8
\fB\\$1 \\$2\fP
.br
..
.nr H1 0
.NH
Kernel primitives
.LP
.h 1.1. "Process naming and protection
.in +5
.TS
lw(1.6i) aw(3i).
sethostid	set UNIX host id
gethostid	get UNIX host id
sethostname	set UNIX host name
gethostname	get UNIX host name
getpid	get process id
fork	create new process
exit	terminate a process
execve	execute a different process
getuid	get user id
geteuid	get effective user id
setreuid	set real and effective user id's
getgid	get accounting group id
getegid	get effective accounting group id
getgroups	get access group set
setregid	set real and effective group id's
setgroups	set access group set
getpgrp	get process group
setpgrp	set process group
.TE
.in -5
.h 1.2 "Memory management
.in +5
.TS
lw(1.6i) aw(3i).
	memory management definitions
sbrk	change data section size
sstk\(dg	change stack section size
.FS
\(dg Not supported in 4.2BSD.
.FE
getpagesize	get memory page size
mmap\(dg	map pages of memory
mremap\(dg	remap pages in memory
munmap\(dg	unmap memory
mprotect\(dg	change protection of pages
madvise\(dg	give memory management advice
mincore\(dg	determine core residency of pages
.TE
.in -5
.h 1.3 "Signals
.in +5
.TS
lw(1.6i) aw(3i).
	signal definitions
sigvec	set handler for signal
kill	send signal to process
killpgrp	send signal to process group
sigblock	block set of signals
sigsetmask	restore set of blocked signals
sigpause	wait for signals
sigstack	set software stack for signals
.TE
.in -5
.h 1.4 "Timing and statistics
.in +5
.TS
lw(1.6i) aw(3i).
	time-related definitions
gettimeofday	get current time and timezone
settimeofday	set current time and timezone
getitimer	read an interval timer
setitimer	get and set an interval timer
profil	profile process
.TE
.in -5
.h 1.5 "Descriptors
.in +5
.TS
lw(1.6i) aw(3i).
getdtablesize	descriptor reference table size
dup	duplicate descriptor
dup2	duplicate to specified index
close	close descriptor
select	multiplex input/output
fcntl	control descriptor options
wrap\(dg	wrap descriptor with protocol
.FS
\(dg Not supported in 4.2BSD.
.FE
.TE
.in -5
.h 1.6 "Resource controls
.in +5
.TS
lw(1.6i) aw(3i).
	resource-related definitions
getpriority	get process priority
setpriority	set process priority
getrusage	get resource usage
getrlimit	get resource limitations
setrlimit	set resource limitations
.TE
.in -5
.h 1.7 "System operation support
.in +5
.TS
lw(1.6i) aw(3i).
mount	mount a device file system
swapon	add a swap device
umount	umount a file system
sync	flush system caches
reboot	reboot a machine
acct	specify accounting file
.TE
.in -5
.NH
System facilities
.LP
.h 2.1 "Generic operations
.in +5
.TS
lw(1.6i) aw(3i).
read	read data
write	write data
	scatter-gather related definitions
readv	scattered data input
writev	gathered data output
	standard control operations
ioctl	device control operation
.TE
.in -5
.h 2.2 "File system
.PP
Operations marked with a * exist in two forms: as shown,
operating on a file name, and operating on a file descriptor,
when the name is preceded with a ``f''.
.in +5
.TS
lw(1.6i) aw(3i).
	file system definitions
chdir	change directory
chroot	change root directory
mkdir	make a directory
rmdir	remove a directory
open	open a new or existing file
mknod	make a special file
portal\(dg	make a portal entry
unlink	remove a link
stat*	return status for a file	
lstat	returned status of link
chown*	change owner
chmod*	change mode
utimes	change access/modify times
link	make a hard link
symlink	make a symbolic link
readlink	read contents of symbolic link
rename	change name of file
lseek	reposition within file
truncate*	truncate file
access	determine accessibility
flock	lock a file
.TE
.in -5
.h 2.3 "Communications
.in +5
.TS
lw(1.6i) aw(3i).
	standard definitions
socket	create socket
bind	bind socket to name
getsockname	get socket name
listen	allow queueing of connections
accept	accept a connection
connect	connect to peer socket
socketpair	create pair of connected sockets
sendto	send data to named socket
send	send data to connected socket
recvfrom	receive data on unconnected socket
recv	receive data on connected socket
sendmsg	send gathered data and/or rights
recvmsg	receive scattered data and/or rights
shutdown	partially close full-duplex connection
getsockopt	get socket option
setsockopt	set socket option
.TE
.in -5
.h 2.5 "Terminals, block and character devices
.in +5
.TS
lw(1.6i) aw(3i).
.TE
.in -5
.h 2.4 "Processes and kernel hooks
.in +5
.TS
lw(1.6i) aw(3i).
.TE
.in -5