Advanced Character Driver Operations
Ted Baker  Andy Wang
COP 5641 / CIS 4930
Topics

Managing ioctl command numbers

Blocking/unblocking a process
Seeking on a device
Access control


ioctl

For operations beyond simple data transfers





Eject the media
Report error information
Change hardware settings
Self destruct
Alternatives


Embedded commands in the data stream
Driver-specific file systems
ioctl

User-level interface
int ioctl(int fd, unsigned long cmd, ...);

Variable number of arguments


In this context, it is meant to pass a single optional argument


Just a way to bypass the type checking
Difficult to audit ioctl calls


Problematic for the system call interface
E.g., 32-bit vs. 64-bit modes
Currently runs under lock_kernel(), the global kernel lock

See vfs_ioctl() in fs/ioctl.c
ioctl

Driver-level interface
int (*ioctl) (struct inode *inode,
struct file *filp,
unsigned int cmd,
unsigned long arg);
 cmd is passed from the user unchanged
 arg can be an integer or a pointer

Compiler does not type check
Choosing the ioctl Commands

Need a numbering scheme to avoid mistakes


E.g., issuing a command to the wrong device
(changing the baud rate of an audio device)
Check include/asm/ioctl.h and
Documentation/ioctl/ioctl-decoding.txt
Choosing the ioctl Commands

A command number uses four bitfields



Defined in <linux/ioctl.h>
<direction, type, number, size>
direction: direction of data transfer




_IOC_NONE
_IOC_READ
_IOC_WRITE
_IOC_READ | _IOC_WRITE
Choosing the ioctl Commands



type (ioctl device type)

8-bit (_IOC_TYPEBITS) magic number

Associated with the device
number

8-bit (_IOC_NRBITS) sequential number

Unique within device
size: size of user data involved

The width is either 13 or 14 bits (_IOC_SIZEBITS)
Choosing the ioctl Commands

Useful macros to create ioctl command numbers
_IO(type, nr)
_IOR(type, nr, datatype)
_IOW(type, nr, datatype)
_IOWR(type, nr, datatype)
The macro figures out that size = sizeof(datatype)
Example
cmd = _IOWR('k', 1, struct foo)
Choosing the ioctl Commands

Useful macros to decode ioctl command
numbers




_IOC_DIR(nr)
_IOC_TYPE(nr)
_IOC_NR(nr)
_IOC_SIZE(nr)
Choosing the ioctl Commands

The scull example
/* Use 'k' as magic number */
#define SCULL_IOC_MAGIC 'k'
/* Please use a different 8-bit number in your code */
#define SCULL_IOCRESET _IO(SCULL_IOC_MAGIC, 0)
Choosing the ioctl Commands

The scull example
/*
* S means "Set" through a ptr,
* T means "Tell" directly with the argument value
* G means "Get": reply by setting through a pointer
* Q means "Query": response is on the return value
* X means "eXchange": switch G and S atomically
* H means "sHift": switch T and Q atomically
*/
#define SCULL_IOCSQUANTUM _IOW(SCULL_IOC_MAGIC, 1, int)
#define SCULL_IOCSQSET    _IOW(SCULL_IOC_MAGIC, 2, int)
#define SCULL_IOCTQUANTUM _IO(SCULL_IOC_MAGIC, 3)
#define SCULL_IOCTQSET    _IO(SCULL_IOC_MAGIC, 4)
#define SCULL_IOCGQUANTUM _IOR(SCULL_IOC_MAGIC, 5, int)
(The X commands set a new value and return the old one.)
Choosing the ioctl Commands

The scull example
#define SCULL_IOCGQSET    _IOR(SCULL_IOC_MAGIC, 6, int)
#define SCULL_IOCQQUANTUM _IO(SCULL_IOC_MAGIC, 7)
#define SCULL_IOCQQSET    _IO(SCULL_IOC_MAGIC, 8)
#define SCULL_IOCXQUANTUM _IOWR(SCULL_IOC_MAGIC, 9, int)
#define SCULL_IOCXQSET    _IOWR(SCULL_IOC_MAGIC, 10, int)
#define SCULL_IOCHQUANTUM _IO(SCULL_IOC_MAGIC, 11)
#define SCULL_IOCHQSET    _IO(SCULL_IOC_MAGIC, 12)
#define SCULL_IOC_MAXNR 14
The Return Value

When the command number is not supported


Return -EINVAL
Or -ENOTTY (according to the POSIX standard)
The Predefined Commands

Handled by the kernel first


Will not be passed down to device drivers
Three groups

For any file (regular, device, FIFO, socket)
Magic number: 'T'
For regular files only
Specific to the file system type
Using the ioctl Argument


If it is an integer, just use it directly
If it is a pointer

Need to check for valid user address
int access_ok(int type, const void *addr,
unsigned long size);
type: either VERIFY_READ or VERIFY_WRITE
Returns 1 for success, 0 for failure
On failure, the driver returns -EFAULT to the caller
Defined in <asm/uaccess.h>
Mostly called by memory-access routines
Using the ioctl Argument

The scull example
int scull_ioctl(struct inode *inode, struct file *filp,
unsigned int cmd, unsigned long arg) {
int err = 0, tmp;
int retval = 0;
/* check the magic number and whether the command is defined */
if (_IOC_TYPE(cmd) != SCULL_IOC_MAGIC) {
return -ENOTTY;
}
if (_IOC_NR(cmd) > SCULL_IOC_MAXNR) {
return -ENOTTY;
}
…
Using the ioctl Argument

The scull example
…
/* the concept of "read" and "write" is reversed here */
if (_IOC_DIR(cmd) & _IOC_READ) {
err = !access_ok(VERIFY_WRITE, (void __user *) arg,
_IOC_SIZE(cmd));
} else if (_IOC_DIR(cmd) & _IOC_WRITE) {
err = !access_ok(VERIFY_READ, (void __user *) arg,
_IOC_SIZE(cmd));
}
if (err) return -EFAULT;
…
Using the ioctl Argument

Data transfer functions are optimized for the most
commonly used data sizes (1, 2, 4, and 8 bytes)

If the size mismatches

Cryptic compiler error message:
"conversion to non-scalar type requested"
Use copy_to_user and copy_from_user instead
#include <asm/uaccess.h>

put_user(datum, ptr)

Writes to a user-space address
Calls access_ok()
Returns 0 on success, -EFAULT on error
Using the ioctl Argument

__put_user(datum, ptr)
 Does not check with access_ok()
 Can still fail if the user-space memory is not writable

get_user(local, ptr)
 Reads from a user-space address
 Calls access_ok()
 Stores the retrieved value in local
 Returns 0 on success, -EFAULT on error

__get_user(local, ptr)
 Does not check with access_ok()
 Can still fail if the user-space memory is not readable
Capabilities and Restricted Operations



Limit certain ioctl operations to privileged users
See <linux/capability.h> for the full set of
capabilities
To check a certain capability, call
int capable(int capability);

In the scull example
if (!capable(CAP_SYS_ADMIN)) {
    return -EPERM;
}
CAP_SYS_ADMIN is a catch-all capability for many
system administration operations
The Implementation of the ioctl Commands

A giant switch statement
…
switch(cmd) {
case SCULL_IOCRESET:
scull_quantum = SCULL_QUANTUM;
scull_qset = SCULL_QSET;
break;
case SCULL_IOCSQUANTUM: /* Set: arg points to the value */
if (!capable(CAP_SYS_ADMIN)) {
return -EPERM;
}
retval = __get_user(scull_quantum, (int __user *)arg);
break;
…
The Implementation of the ioctl Commands
…
case SCULL_IOCTQUANTUM: /* Tell: arg is the value */
if (!capable(CAP_SYS_ADMIN)) {
return -EPERM;
}
scull_quantum = arg;
break;
case SCULL_IOCGQUANTUM: /* Get: arg is pointer to result */
retval = __put_user(scull_quantum, (int __user *) arg);
break;
case SCULL_IOCQQUANTUM: /* Query: return it (> 0) */
return scull_quantum;
…
The Implementation of the ioctl Commands
…
case SCULL_IOCXQUANTUM: /* eXchange: use arg as pointer */
if (!capable(CAP_SYS_ADMIN)) {
return -EPERM;
}
tmp = scull_quantum;
retval = __get_user(scull_quantum, (int __user *) arg);
if (retval == 0) {
retval = __put_user(tmp, (int __user *) arg);
}
break;
…
The Implementation of the ioctl Commands
…
case SCULL_IOCHQUANTUM: /* sHift: like Tell + Query */
if (!capable(CAP_SYS_ADMIN)) {
return -EPERM;
}
tmp = scull_quantum;
scull_quantum = arg;
return tmp;
default: /* redundant, as cmd was checked against MAXNR */
return -ENOTTY;
} /* switch */
return retval;
} /* scull_ioctl */
The Implementation of the ioctl Commands

Six ways to pass and receive arguments from
the user space

Need to know command number
int quantum;
ioctl(fd,SCULL_IOCSQUANTUM, &quantum); /* Set by pointer */
ioctl(fd,SCULL_IOCTQUANTUM, quantum); /* Set by value */
ioctl(fd,SCULL_IOCGQUANTUM, &quantum); /* Get by pointer */
quantum = ioctl(fd,SCULL_IOCQQUANTUM); /* Get by return value */
ioctl(fd,SCULL_IOCXQUANTUM, &quantum); /* Exchange by pointer */
/* Exchange by value */
quantum = ioctl(fd,SCULL_IOCHQUANTUM, quantum);
Device Control Without ioctl

Writing control sequences into the data
stream itself


Example: console escape sequences
Advantages:


No need to implement ioctl methods
Disadvantages:

Need to make sure that escape sequences do not
appear in the normal data stream (e.g., cat a binary file)

Need to parse the data stream
Blocking I/O


Needed when no data is available for reads
When the device is not ready to accept data

Output buffer is full
Introduction to Sleeping
Introduction to Sleeping


A process is removed from the scheduler’s
run queue
Certain rules

Never sleep when running in an atomic context



Multiple steps must be performed without concurrent
accesses
Not while holding a spinlock, seqlock, or RCU lock
Not while disabling interrupts
Introduction to Sleeping

Okay to sleep while holding a semaphore




Other threads waiting for the semaphore will also sleep
Need to keep it short
Make sure that it is not blocking the process that will wake
it up
After waking up



Make no assumptions about the state of the system
The resource one is waiting for might be gone again
Must check the wait condition again
Introduction to Sleeping

Wait queue: contains a list of processes
waiting for a specific event

#include <linux/wait.h>

To initialize statically, call
DECLARE_WAIT_QUEUE_HEAD(my_queue);

To initialize dynamically, call
wait_queue_head_t my_queue;
init_waitqueue_head(&my_queue);
Simple Sleeping

Call variants of the wait_event macros

wait_event(queue, condition)
 queue: the wait queue head (passed by value)
 Waits until the boolean condition becomes true
 Puts the process into an uninterruptible sleep
 Usually not what you want

wait_event_interruptible(queue, condition)
 Can be interrupted by any signal
 Returns nonzero if sleep was interrupted
 Your driver should then return -ERESTARTSYS
Simple Sleeping

wait_event_killable(queue, condition)
 Can be interrupted only by fatal signals

wait_event_timeout(queue, condition, timeout)
 Waits for a limited time (in jiffies)
 Returns 0 after the timeout, regardless of how the
 condition evaluates

wait_event_interruptible_timeout(queue,
                                 condition,
                                 timeout)
Simple Sleeping

To wake up, call variants of wake_up
functions
void wake_up(wait_queue_head_t *queue);

Wakes up all processes waiting on the queue
void wake_up_interruptible(wait_queue_head_t *queue);

Wakes up processes that perform an interruptible sleep
Simple Sleeping

Example module: sleepy
static DECLARE_WAIT_QUEUE_HEAD(wq);
static int flag = 0;
ssize_t sleepy_read(struct file *filp, char __user *buf,
                    size_t count, loff_t *pos) {
    printk(KERN_DEBUG "process %i (%s) going to sleep\n",
           current->pid, current->comm);
    wait_event_interruptible(wq, flag != 0);
    /* multiple threads can wake up at this point */
    flag = 0;
    printk(KERN_DEBUG "awoken %i (%s)\n", current->pid,
           current->comm);
    return 0; /* EOF */
}
Simple Sleeping

Example module: sleepy
ssize_t sleepy_write(struct file *filp, const char __user *buf,
size_t count, loff_t *pos) {
printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
current->pid, current->comm);
flag = 1;
wake_up_interruptible(&wq);
return count; /* succeed, to avoid retrial */
}
Blocking and Nonblocking Operations

By default, operations block



If no data is available for reads
If no space is available for writes
Non-blocking I/O is indicated by the
O_NONBLOCK flag in filp->flags




Defined in <linux/fcntl.h>
Only open, read, and write calls are affected
They return -EAGAIN immediately instead of blocking
Applications need to distinguish non-blocking
returns vs. EOFs
A Blocking I/O Example

scullpipe

A read process



Blocks when no data is available
Wakes a blocking write when buffer space becomes
available
A write process


Blocks when no buffer space is available
Wakes a blocking read process when data arrives
A Blocking I/O Example

scullpipe data structure
struct scull_pipe {
wait_queue_head_t inq, outq; /* read and write queues */
char *buffer, *end; /* begin of buf, end of buf */
int buffersize; /* used in pointer arithmetic */
char *rp, *wp; /* where to read, where to write */
int nreaders, nwriters; /* number of openings for r/w */
struct fasync_struct *async_queue; /* asynchronous readers */
struct semaphore sem; /* mutual exclusion semaphore */
struct cdev cdev; /* Char device structure */
};
A Blocking I/O Example
static ssize_t scull_p_read(struct file *filp, char __user *buf,
size_t count, loff_t *f_pos) {
struct scull_pipe *dev = filp->private_data;
if (down_interruptible(&dev->sem)) return -ERESTARTSYS;
while (dev->rp == dev->wp) { /* nothing to read */
up(&dev->sem); /* release the lock */
if (filp->f_flags & O_NONBLOCK)
return -EAGAIN;
if (wait_event_interruptible(dev->inq, (dev->rp != dev->wp)))
return -ERESTARTSYS;
if (down_interruptible(&dev->sem)) return -ERESTARTSYS;
}
A Blocking I/O Example
if (dev->wp > dev->rp)
count = min(count, (size_t)(dev->wp - dev->rp));
else /* the write pointer has wrapped */
count = min(count, (size_t)(dev->end - dev->rp));
if (copy_to_user(buf, dev->rp, count)) {
up (&dev->sem);
return -EFAULT;
}
dev->rp += count;
if (dev->rp == dev->end) dev->rp = dev->buffer; /* wrapped */
up (&dev->sem);
/* finally, awake any writers and return */
wake_up_interruptible(&dev->outq);
return count;
}
Advanced Sleeping
Advanced Sleeping


Uses low-level functions to effect a sleep
How a process sleeps
1. Allocate and initialize a wait_queue_t structure
   (the queue element)
DEFINE_WAIT(my_wait);

Or
wait_queue_t my_wait;
init_wait(&my_wait);
Advanced Sleeping
2. Add to the proper wait queue and mark the process
as being asleep

TASK_RUNNING → TASK_INTERRUPTIBLE or
TASK_UNINTERRUPTIBLE

Call
void prepare_to_wait(wait_queue_head_t *queue,
                     wait_queue_t *wait,
                     int state);
Advanced Sleeping
3. Give up the processor
Double-check the sleeping condition before going to
sleep
 The wakeup thread might have changed the condition
 between steps 1 and 2
if (/* sleeping condition */) {
    schedule(); /* yield the CPU */
}
Advanced Sleeping
4. Return from sleep
Remove the process from the wait queue if
schedule() was not called
void finish_wait(wait_queue_head_t *queue,
wait_queue_t *wait);
Advanced Sleeping

scullpipe write method
/* How much space is free? */
static int spacefree(struct scull_pipe *dev) {
if (dev->rp == dev->wp)
return dev->buffersize - 1;
return ((dev->rp + dev->buffersize - dev->wp)
% dev->buffersize) - 1;
}
Advanced Sleeping
static ssize_t
scull_p_write(struct file *filp, const char __user *buf,
size_t count, loff_t *f_pos) {
struct scull_pipe *dev = filp->private_data;
int result;
if (down_interruptible(&dev->sem)) return -ERESTARTSYS;
/* Wait for space for writing */
result = scull_getwritespace(dev, filp);
if (result)
return result; /* scull_getwritespace called up(&dev->sem) */
/* ok, space is there, accept something */
count = min(count, (size_t)spacefree(dev));
Advanced Sleeping
if (dev->wp >= dev->rp)
    count = min(count, (size_t)(dev->end - dev->wp));
else /* the write pointer has wrapped, fill up to rp - 1 */
    count = min(count, (size_t)(dev->rp - dev->wp - 1));
if (copy_from_user(dev->wp, buf, count)) {
    up(&dev->sem); return -EFAULT;
}
dev->wp += count;
if (dev->wp == dev->end) dev->wp = dev->buffer; /* wrapped */
up(&dev->sem);
wake_up_interruptible(&dev->inq);
/* notify asynchronous readers who are waiting */
if (dev->async_queue)
    kill_fasync(&dev->async_queue, SIGIO, POLL_IN);
return count;
}
Advanced Sleeping (Scenario 1)
/* Wait for space for writing; caller must hold device semaphore.
 * On error the semaphore will be released before returning. */
static int scull_getwritespace(struct scull_pipe *dev,
                               struct file *filp) {
  while (spacefree(dev) == 0) { /* full */
    DEFINE_WAIT(wait);
    up(&dev->sem);
    if (filp->f_flags & O_NONBLOCK) return -EAGAIN;
    prepare_to_wait(&dev->outq, &wait, TASK_INTERRUPTIBLE);
    if (spacefree(dev) == 0) schedule();
    finish_wait(&dev->outq, &wait);
    if (signal_pending(current)) return -ERESTARTSYS;
    if (down_interruptible(&dev->sem)) return -ERESTARTSYS;
  }
  return 0;
}
Scenario 1: the queue stays full
1. Entering the loop: queue full, task state RUNNING
2. After prepare_to_wait: queue full, task state RUNNING → INTERRUPTIBLE
3. spacefree(dev) is still 0, so schedule() is called: the task sleeps in the INTERRUPTIBLE state
Advanced Sleeping (Scenario 2)
/* Wait for space for writing; caller must hold device semaphore.
 * On error the semaphore will be released before returning. */
static int scull_getwritespace(struct scull_pipe *dev,
                               struct file *filp) {
  while (spacefree(dev) == 0) { /* full */
    DEFINE_WAIT(wait);
    up(&dev->sem);
    if (filp->f_flags & O_NONBLOCK) return -EAGAIN;
    prepare_to_wait(&dev->outq, &wait, TASK_INTERRUPTIBLE);
    if (spacefree(dev) == 0) schedule();
    finish_wait(&dev->outq, &wait);
    if (signal_pending(current)) return -ERESTARTSYS;
    if (down_interruptible(&dev->sem)) return -ERESTARTSYS;
  }
  return 0;
}
Scenario 2: the wakeup arrives before prepare_to_wait
1. Entering the loop: queue full, task state RUNNING
2. A reader frees space and calls wake_up before prepare_to_wait: queue no longer full, task state stays RUNNING
3. After prepare_to_wait: task state RUNNING → INTERRUPTIBLE
4. The re-check sees spacefree(dev) != 0, so schedule() is skipped: no sleep, and finish_wait restores RUNNING
Advanced Sleeping (Scenario 3)
/* Wait for space for writing; caller must hold device semaphore.
 * On error the semaphore will be released before returning. */
static int scull_getwritespace(struct scull_pipe *dev,
                               struct file *filp) {
  while (spacefree(dev) == 0) { /* full */
    DEFINE_WAIT(wait);
    up(&dev->sem);
    if (filp->f_flags & O_NONBLOCK) return -EAGAIN;
    prepare_to_wait(&dev->outq, &wait, TASK_INTERRUPTIBLE);
    if (spacefree(dev) == 0) schedule();
    finish_wait(&dev->outq, &wait);
    if (signal_pending(current)) return -ERESTARTSYS;
    if (down_interruptible(&dev->sem)) return -ERESTARTSYS;
  }
  return 0;
}
Scenario 3: the wakeup arrives between prepare_to_wait and schedule
1. Entering the loop: queue full, task state RUNNING
2. After prepare_to_wait: queue full, task state RUNNING → INTERRUPTIBLE
3. A reader frees space and calls wake_up: queue no longer full, task state INTERRUPTIBLE → RUNNING
4. The task does not sleep: even if the re-check still ran schedule(), a RUNNING task is not removed from the run queue
Advanced Sleeping (Scenario 4)
/* Wait for space for writing; caller must hold device semaphore.
 * On error the semaphore will be released before returning. */
static int scull_getwritespace(struct scull_pipe *dev,
                               struct file *filp) {
  while (spacefree(dev) == 0) { /* full */
    DEFINE_WAIT(wait);
    up(&dev->sem);
    if (filp->f_flags & O_NONBLOCK) return -EAGAIN;
    prepare_to_wait(&dev->outq, &wait, TASK_INTERRUPTIBLE);
    if (spacefree(dev) == 0) schedule();
    finish_wait(&dev->outq, &wait);
    if (signal_pending(current)) return -ERESTARTSYS;
    if (down_interruptible(&dev->sem)) return -ERESTARTSYS;
  }
  return 0;
}
Scenario 4: the wakeup arrives after the task has slept
1. Entering the loop: queue full, task state RUNNING
2. After prepare_to_wait: queue full, task state RUNNING → INTERRUPTIBLE
3. spacefree(dev) is still 0, so schedule() is called: the task sleeps
4. Later a reader frees space and calls wake_up: queue no longer full, task state INTERRUPTIBLE → RUNNING, and the loop re-checks the condition
More Examples of Advanced Sleeping


See linux/wait.h
Implementations of wait_event, and
wait_event_interruptible
Exclusive Waits

Avoid waking up all processes waiting on a
queue


Wake up only one process
Call
void prepare_to_wait_exclusive(wait_queue_head_t *queue,
                               wait_queue_t *wait, int state);

Set the WQ_FLAG_EXCLUSIVE flag


Add the queue entry to the end of the wait queue
wake_up stops after waking the first process with
the flag set
The Details of Waking Up
/* wakes up all processes waiting on the queue */
void wake_up(wait_queue_head_t *queue);
/* wakes up processes that perform an interruptible sleep */
void wake_up_interruptible(wait_queue_head_t *queue);
/* wake up to nr exclusive waiters */
void wake_up_nr(wait_queue_head_t *queue, int nr);
void wake_up_interruptible_nr(wait_queue_head_t *queue, int nr);
/* wake up all exclusive waiters */
void wake_up_all(wait_queue_head_t *queue);
void wake_up_interruptible_all(wait_queue_head_t *queue);
/* do not lose the CPU during this call */
void wake_up_interruptible_sync(wait_queue_head_t *queue);
Ancient History: sleep_on


Not safe
Deprecated
Testing the scullpipe Driver

Window 1
% cat /dev/scullpipe

%
Window 2
Testing the scullpipe Driver

Window 1
% cat /dev/scullpipe

Window 2
% ls -aF > /dev/scullpipe
Testing the scullpipe Driver

Window 1
% cat /dev/scullpipe
./
../
file1
file2

Window 2
% ls -aF > /dev/scullpipe
poll and select

Nonblocking I/Os often involve the use of
poll, select, and epoll system calls





Allow a process to determine whether it can read
or write open files without blocking
Can block a process until any of a set of file
descriptors becomes available for reading or
writing
select introduced in BSD Unix
poll introduced in System V
epoll added in 2.5.45 for better scaling
poll and select

All three calls supported through the poll
method
unsigned int (*poll) (struct file *filp,
poll_table *wait);
1. Call poll_wait on one or more wait queues that could
indicate a change in the poll status

If no file descriptors are available, wait
2. Return a bit mask describing the operations that could
be immediately performed without blocking
poll and select


poll_table defined in <linux/poll.h>
To add a wait queue into the poll_table,
call
void poll_wait(struct file *,
wait_queue_head_t *,
poll_table *);

Bit mask flags defined in <linux/poll.h>

POLLIN

Set if the device can be read without blocking
poll and select

POLLOUT
 Set if the device can be written without blocking

POLLRDNORM
 Set if "normal" data is available for reading
 A readable device returns (POLLIN | POLLRDNORM)

POLLWRNORM
 Same meaning as POLLOUT
 A writable device returns (POLLOUT | POLLWRNORM)

POLLPRI
 High-priority data can be read without blocking
poll and select

POLLHUP
 Set when a process reading the device sees end-of-file

POLLERR
 An error condition has occurred

POLLRDBAND
 Out-of-band data is available for reading
 Associated with sockets

POLLWRBAND
 Data with nonzero priority can be written to the device
poll and select

Example
static unsigned int scull_p_poll(struct file *filp,
poll_table *wait) {
struct scull_pipe *dev = filp->private_data;
unsigned int mask = 0;
down(&dev->sem);
poll_wait(filp, &dev->inq, wait);
poll_wait(filp, &dev->outq, wait);
if (dev->rp != dev->wp) /* circular buffer not empty */
mask |= POLLIN | POLLRDNORM; /* readable */
if (spacefree(dev)) /* circular buffer not full */
mask |= POLLOUT | POLLWRNORM; /* writable */
up(&dev->sem);
return mask;
}
poll and select

No end-of-file support

The reader sees an end-of-file when all writers
close the file


Check dev->nwriters in read and poll
Problem when a reader opens the scullpipe before
the writer
 Need blocking within open
Interaction with read and write

Reading from the device

If there is data in the input buffer, return at least
one byte
 poll returns POLLIN | POLLRDNORM

If no data is available
 If O_NONBLOCK is set, return -EAGAIN
 poll must report the device unreadable until one byte
 arrives

At the end-of-file, read returns 0, and poll returns
POLLHUP
Interaction with read and write

Writing to the device

If there is space in the output buffer, accept at
least one byte
 poll reports that the device is writable by returning
 POLLOUT | POLLWRNORM

If the output buffer is full, write blocks
 If O_NONBLOCK is set, write returns -EAGAIN
 poll reports that the file is not writable

If the device is full (no room even for new data),
write returns -ENOSPC
Interaction with read and write

In write, never wait for data transmission before
returning


Or, select may block
To make sure the output buffer is actually
transmitted, use fsync call
Interaction with read and write

To flush pending output, call fsync
int (*fsync) (struct file *file,
struct dentry *dentry, int datasync);


Should return only when the device has been
completely flushed
datasync:

Used by file systems, ignored by drivers
The Underlying Data Structure
The Underlying Data Structure

When the poll call completes, the poll_table
is deallocated, with all wait queue entries
removed

epoll reduces the overhead of setting up and
tearing down this data structure on every I/O call
Asynchronous Notification

Polling


Inefficient for rare events
A solution: asynchronous notification


Application receives a signal whenever data
becomes available
Two steps


Specify a process as the owner of the file (so that the
kernel knows whom to notify)
Set the FASYNC flag in the device via fcntl command
Asynchronous Notification

Example (user space)
/* create a signal handler */
signal(SIGIO, &input_handler);
/* set current pid the owner of the stdin */
fcntl(STDIN_FILENO, F_SETOWN, getpid());
/* obtain the current file control flags */
oflags = fcntl(STDIN_FILENO, F_GETFL);
/* set the asynchronous flag */
fcntl(STDIN_FILENO, F_SETFL, oflags | FASYNC);
Asynchronous Notification

Some catches

Not all devices support asynchronous notification


Usually available for sockets and ttys
Need to know which input file to process

Still need to use poll or select
The Driver’s Point of View
1. When F_SETOWN is invoked, a value is
assigned to filp->f_owner
2. When F_SETFL is executed to change the
status of FASYNC

The driver’s fasync method is called
static int
scull_p_fasync(int fd, struct file *filp, int mode) {
struct scull_pipe *dev = filp->private_data;
return fasync_helper(fd, filp, mode, &dev->async_queue);
}
The Driver’s Point of View

fasync_helper adds or removes processes from
the asynchronous list
int fasync_helper(int fd, struct file *filp, int mode,
                  struct fasync_struct **fa);
3. When data arrives, send a SIGIO signal to
all processes registered for asynchronous
notification

Near the end of write, notify blocked readers
if (dev->async_queue)
kill_fasync(&dev->async_queue, SIGIO, POLL_IN);

Similarly for read, as needed
The Driver’s Point of View
4. When the file is closed, remove the file from
the list of asynchronous readers in the
release method
scull_p_fasync(-1, filp, 0);
The llseek Implementation

Implements lseek and llseek system calls

Modifies filp->f_pos
loff_t scull_llseek(struct file *filp, loff_t off, int whence) {
struct scull_dev *dev = filp->private_data;
loff_t newpos;
switch(whence) {
case 0: /* SEEK_SET */
newpos = off;
break;
case 1: /* SEEK_CUR, relative to the current position */
newpos = filp->f_pos + off;
break;
The llseek Implementation
case 2: /* SEEK_END, relative to the end of the file */
newpos = dev->size + off;
break;
default: /* can't happen */
return -EINVAL;
}
if (newpos < 0) return -EINVAL;
filp->f_pos = newpos;
return newpos;
}
The llseek Implementation

Does not make sense for serial ports and
keyboard inputs

Need to inform the kernel by calling
nonseekable_open in the open method
int nonseekable_open(struct inode *inode, struct file *filp);

Replace the llseek method with no_llseek
(defined in <linux/fs.h>) in your
file_operations structure
Access Control on a Device File


Prevents unauthorized users from using the
device
Sometimes permits only one authorized user
to open the device at a time
Single-Open Devices

Example: scullsingle
static atomic_t scull_s_available = ATOMIC_INIT(1);
static int scull_s_open(struct inode *inode, struct file *filp) {
struct scull_dev *dev = &scull_s_device;
/* atomic_dec_and_test returns true if the decremented value is 0 */
if (!atomic_dec_and_test(&scull_s_available)) {
atomic_inc(&scull_s_available);
return -EBUSY; /* already open */
}
/* then, everything else is the same as before */
if ((filp->f_flags & O_ACCMODE) == O_WRONLY) scull_trim(dev);
filp->private_data = dev;
return 0; /* success */
}
Single-Open Devices

In the release call, marks the device idle
static int
scull_s_release(struct inode *inode, struct file *filp) {
atomic_inc(&scull_s_available); /* release the device */
return 0;
}
Restricting Access to a Single User (with
multiple processes) at a Time


Example: sculluid
Includes the following in the open call
spin_lock(&scull_u_lock);
if (scull_u_count && /* someone is using the device */
(scull_u_owner != current->uid) && /* not the same user */
(scull_u_owner != current->euid) && /* not the same effective uid (for su) */
!capable(CAP_DAC_OVERRIDE)) { /* not root override */
spin_unlock(&scull_u_lock);
return -EBUSY; /* -EPERM would confuse the user */
}
if (scull_u_count == 0) scull_u_owner = current->uid;
scull_u_count++;
spin_unlock(&scull_u_lock);
Restricting Access to a Single User (with
Multiple Processes) at a Time

Includes the following in the release call
static int scull_u_release(struct inode *inode,
struct file *filp) {
spin_lock(&scull_u_lock);
scull_u_count--; /* nothing else */
spin_unlock(&scull_u_lock);
return 0;
}
Blocking open as an Alternative to
EBUSY (scullwuid)

A user might prefer to wait over getting errors

E.g., data communication channel
spin_lock(&scull_w_lock);
while (!scull_w_available()) {
spin_unlock(&scull_w_lock);
if (filp->f_flags & O_NONBLOCK) return -EAGAIN;
if (wait_event_interruptible(scull_w_wait,
scull_w_available()))
return -ERESTARTSYS; /* tell the fs layer to handle it */
spin_lock(&scull_w_lock);
}
if (scull_w_count == 0) scull_w_owner = current->uid;
scull_w_count++;
spin_unlock(&scull_w_lock);
Blocking open as an Alternative to
EBUSY (scullwuid)

The release method wakes pending
processes
static int scull_w_release(struct inode *inode,
struct file *filp) {
int temp;
spin_lock(&scull_w_lock);
scull_w_count--;
temp = scull_w_count;
spin_unlock(&scull_w_lock);
if (temp == 0)
wake_up_interruptible_sync(&scull_w_wait);
return 0;
}
Blocking open as an Alternative to
EBUSY

Might not be the right semantics for
interactive users


Blocking on cp vs. getting a return value of
-EBUSY or -EPERM
Incompatible policies for the same device

One solution: one device node per policy
Cloning the Device on open

Allows the creation of private, virtual devices


E.g., one virtual scull device per process,
keyed by its controlling tty's device number
Example: scullpriv
Cloning the Device on open
static int scull_c_open(struct inode *inode, struct file *filp) {
struct scull_dev *dev;
dev_t key;
if (!current->signal->tty) {
PDEBUG("Process \"%s\" has no ctl tty\n", current->comm);
return -EINVAL;
}
key = tty_devnum(current->signal->tty);
spin_lock(&scull_c_lock);
dev = scull_c_lookfor_device(key);
spin_unlock(&scull_c_lock);
if (!dev) return -ENOMEM;
.../* then, everything else is the same as before */
}
Cloning the Device on open
/* The clone-specific data structure includes a key field */
struct scull_listitem {
struct scull_dev device;
dev_t key;
struct list_head list;
};
/* The list of devices, and a lock to protect it */
static LIST_HEAD(scull_c_list);
static spinlock_t scull_c_lock = SPIN_LOCK_UNLOCKED;
Cloning the Device on open
/* Look for a device or create one if missing */
static struct scull_dev *scull_c_lookfor_device(dev_t key) {
struct scull_listitem *lptr;
list_for_each_entry(lptr, &scull_c_list, list) {
if (lptr->key == key)
return &(lptr->device);
}
/* not found */
lptr = kmalloc(sizeof(struct scull_listitem), GFP_KERNEL);
if (!lptr) return NULL;
Cloning the Device on open
/* initialize the device */
memset(lptr, 0, sizeof(struct scull_listitem));
lptr->key = key;
scull_trim(&(lptr->device)); /* initialize it */
init_MUTEX(&(lptr->device.sem));
/* place it in the list */
list_add(&lptr->list, &scull_c_list);
return &(lptr->device);
}
What’s going on?
scull_c_list is a struct list_head that anchors a
circular doubly linked list; each scull_listitem
embeds its own list_head to join that list
struct list_head {
struct list_head *next;
struct list_head *prev;
};
struct scull_listitem {
struct scull_dev device;
dev_t key;
struct list_head list; /* links this item into scull_c_list */
};