Backups
• In Linux, what file systems should you back up and how
often?
– /home – changes the most often based on the frequency of
usage by the users
• back up as often as possible, every night or at least once per week
– /var – changes as software produces new/appends to files
• log files should be archived as they are rotated, other files might be
backed up every night (mail, web site content) or once a week
– /usr – changes occur as you upgrade software or install new
software
• may not need to be backed up at all if you don’t mind re-installing, or
you could back it up when you have made substantive changes like
installing several new software packages
– / – changes usually occur only to /etc and /root
• back up /root rarely, but you might want to back up /etc occasionally (say, once per month) if you are making changes to config files
RAID
• Redundant array of independent disks
– or redundant array of inexpensive disks
• Used to ensure file system integrity
– by storing redundant information, any damage to some
portion of a file system can be restored by utilizing the
surviving portions
– you could potentially use RAID for backups but that is
not its main purpose nor its strength
• There are several RAID levels, but moving from one level to the next does not necessarily mean an improvement
– the levels all use different storage schemes
RAID: Levels
Level 0 – Striping at block level, no redundancy
  Advantages: Improved disk performance over a standard disk drive
  Disadvantages: No redundancy
  Usage: For superior disk performance without redundancy, where increased cost is not a concern

Level 1 – Complete mirror
  Advantages: Provides 100% redundancy and improves disk access for parallel reads
  Disadvantages: Most costly form of RAID
  Usage: Safest form of RAID if cost is not a factor

Level 2 – Striping at bit level, redundancy through Hamming codes
  Advantages: Fast access for single disk operation
  Disadvantages: Hamming codes are time consuming to compute
  Usage: Not used in practice because of Hamming codes
RAID: Levels
Level 3 – Striping at byte level, parity bit redundancy
  Advantages: Fast access for single disk operation, compromise between expense and redundancy
  Disadvantages: All drives are active for any single access so it cannot accommodate parallel accesses
  Usage: Useful for single user systems

Level 4 – Striping at block level, single parity disk
  Advantages: Larger stripes accommodate parallel accesses (like RAID 0) but improves over RAID 0 because of redundancy
  Disadvantages: The single parity disk is a bottleneck, defeating the advantage gained by striping
  Usage: Not used in practice because the parity disk is a bottleneck

Level 5 – Striping at block level, parity distributed across disks
  Advantages: Same as 4
  Disadvantages: None
  Usage: Useful for multi-user systems (e.g. file servers)
RAID: Levels
Level 6 – Striping at block level, parity distributed across disks and duplicated
  Advantages: Same as 4 & 5 except that, with double the parity information, it provides a greater degree of redundancy
  Disadvantages: More expensive than RAIDs 3-5
  Usage: Same as 5

Level 7 – RAID 3 (or 4) with a real-time operating system controller
  Advantages: Faster single disk access than RAID 3
  Disadvantages: More expensive
  Usage: Same as 3

Level 10 – RAID 0 & 1 with striping at block level
  Advantages: Same as 0 and 1 combined
  Disadvantages: Twice as much disk drive as RAID 1
  Usage: For file servers that require both parallel access & redundancy

Level 53 – Extra disks to support RAID 3 and RAID 5 striping
  Advantages: Best overall access as one can access either the RAID 3 or the RAID 5 set of disks
  Disadvantages: Requires more disks so is more expensive than 3 or 5
  Usage: Useful for multi-user systems (e.g. file servers)
RAID: Parity Computation
• For RAID levels 3-6, parity information is used for
redundancy
– Parity is computed by XOR-ing one bit from each disk
drive
– Consider four drives storing the following bytes in the
same locations (e.g. same surface, sector, track)
• 00000000
• 11110000
• 10101010
• 00101111
– XOR-ing them bitwise (by column) requires XOR-ing the
first bit of byte 1 and byte 2, XOR-ing the first bit of byte
3 and byte 4 and then XOR-ing the results: (0 XOR 1)
XOR (1 XOR 0) = 1 XOR 1 = 0
– This yields the following parity byte: 01110101
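– To check the arithmetic, here is a minimal shell sketch (the hex values are just the four example bytes above) that XORs the bytes and prints the resulting parity byte in binary

  parity=$(( 0x00 ^ 0xF0 ^ 0xAA ^ 0x2F ))              # XOR the four example bytes
  printf '%08d\n' "$(echo "obase=2; $parity" | bc)"    # prints 01110101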
RAID: Parity Computation
• Now, if any one of those drives has a damaged
block, we can restore it by XOR-ing the four
surviving blocks
– whether it is one of bytes 1-4 or the parity byte
– this works unless damage occurs to blocks on different
drives in the same locations
• Assume we have 5 disks, it will be unusual for the
same location to have bad blocks on different
drives
– we might have bad blocks but in different locations
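• As a sketch of the recovery, suppose the drive holding 11110000 fails; XOR-ing the three surviving data bytes with the parity byte reproduces the lost byte

  lost=$(( 0x00 ^ 0xAA ^ 0x2F ^ 0x75 ))                # survivors XORed with the parity byte 01110101
  printf '%08d\n' "$(echo "obase=2; $lost" | bc)"      # prints 11110000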
Backup Strategies
• You want to back up some portion of your file
system
– How often?
– To where will it be backed up?
– Do you perform a whole backup or an incremental
backup?
– Assuming you want to do occasional backups, how
long do you keep older backups?
– Are your backups meant to ensure file system integrity, to create archives, or both?
Backup Strategies
• Assume you have 7 tapes and you will create a
full backup at the beginning of each week
followed by incremental backups during the week
• You will rotate tapes
• In order to retain the oldest information, you
might use a scheme like the following
Tape Number   Week   Usage
1             1      Full backup
2             1      Daily incremental backups for 3 days
3             1      Daily incremental backups for 3 days
4             2      Full backup
5             2      Daily incremental backups for 3 days
6             2      Daily incremental backups for 3 days
7             3      Full backup
2             3      Daily incremental backups for 3 days
3             3      Daily incremental backups for 3 days
5             4      Full backup
6             4      Daily incremental backups for 3 days
1             4      Daily incremental backups for 3 days
Restoration from Backups
• A file was deleted some time ago and we want to
restore it
– Use the most recent backup
– If not found, continue backward in time through the
backups until you locate it
– Once found, stop, you don’t want an earlier version
• the pattern of restoring the file is the same whether you have used full backups or incremental backups, except that if you know when the file was deleted, you can go straight to the previous full backup, saving time
• Now assume you want to restore an entire file system
– You must start from the last full backup
– Then, working your way forward, restore from each
incremental backup because new files were added and old
files were altered
Incremental vs Full Backups
• The advantages of incremental backups are
– performing an incremental backup is faster
(possibly a lot faster) than doing full backups
– performing an incremental backup should take a
lot less storage space on your backup medium than
a full backup
• The disadvantages of incremental backups are
– restoring the full file system is a lot more work
– unless you have indexed each incremental backup,
knowing where a file was stored requires searching
backward through time
Backups: Where?
• Where are you going to place your backup?
– Individual users wishing to back up only specific files might use USB or optical disk storage
– Individual users wishing to back up their full file system might purchase a second, external, hard disk drive
• they are cheap enough
• very convenient and fast access
• you have many options, an external hard disk or possibly a
remotely accessible hard disk via cloud storage
– For an organization, this may not be as practical
• Use magnetic tape media which is far cheaper in terms of
cost per unit of storage (once you purchase the actual tape
device, the individual tapes are cheap)
Backups: How?
• There are several Linux programs available to
perform backups
– dump – good for full and incremental backups of full file systems (cannot back up individual directories or files)
– tar – originally developed to bundle files together
and save to tape (tar = tape archive)
• tar can be used on individual files, directories or full file
systems
• some incremental backup capabilities but not as useful
as dump
– cpio – no built-in incremental backup capabilities
but can be done in conjunction with find
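• Hedged examples of each (the device name, paths and dump levels below are illustrative, not values from the text)

  dump -0u -f /dev/st0 /home            # level 0 (full) dump of /home to a tape device
  dump -1u -f /dev/st0 /home            # level 1 (incremental) dump: only files changed since the last lower-level dump
  tar -czf /backups/home.tar.gz /home   # bundle and compress /home into a tar archive
  find /home -mtime -1 | cpio -o > /backups/home-incr.cpio   # incremental-style archive of files modified in the last day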
Task Scheduling: rc.local
• One way to schedule tasks is to place them in the
/etc/rc.d/rc.local file
– This script is executed at the end of system initialization
after every reboot
– If you have tasks to perform at that time, place them here
– You might wish to
• Rotate log files
• Execute scripts to obtain statistical information
• Datamine log files
• Examine disk space for issues like bad permissions or users using too much space
– You would probably not want to do backups via rc.local
but instead use one of the other scheduling tools like cron
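• A sketch of what such additions to /etc/rc.d/rc.local might look like (the script names are hypothetical)

  /usr/local/sbin/rotate_logs.sh
  /usr/local/sbin/collect_stats.sh >> /var/log/boot_stats.log
  /usr/local/sbin/check_disk_usage.sh | mail -s "disk report" root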
Task Scheduling: anacron
• With at and cron you can schedule tasks for
different times
– But if the system were down during a scheduled
time, the task is not executed
– anacron is a scheduling program that reads scheduled activities from a file (/etc/anacrontab) and executes them as soon as possible after the specified time period elapses
• thus, if the system is down, the tasks are run once the
system is running
– anacron calls upon the program run-parts
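• A sketch of /etc/anacrontab entries (format: period in days, delay in minutes, job identifier, command; the exact values below are illustrative)

  1    5    cron.daily     run-parts /etc/cron.daily
  7    10   cron.weekly    run-parts /etc/cron.weekly
  30   15   cron.monthly   run-parts /etc/cron.monthly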
Task Scheduling: at and batch
• These programs run tasks one time
– at runs the program(s) at the given time/date
– batch runs the program(s) when the system load drops below 80%
• Syntax:
– at [-f file] timespecifier
– batch [-f file]
• If you do not specify a file, you are placed in the
at> prompt to enter your commands (exit the
prompt with control+d)
• Otherwise, the file will be a script consisting of
the commands/programs to execute
Task Scheduling: at
• For at, the time specifier uses the format
– HH:MM [AM|PM] [day]
• AM|PM can be upper or lower case if included and if not
included, then HH is interpreted as military time (add 12
hours for pm such as 15 for 3 pm)
– If no day is given, then at schedules the task for the
next occurrence of the time, e.g., 3:45 am would
execute the next time it is 3:45 am
• The day, if given, will use one of these formats
– MMDDYY
– MM/DD/YY
– DD.MM.YY
– YYYY-MM-DD
Task Scheduling: at
• You can also specify the following for time and
day
– noon
– midnight
– teatime (4 pm)
– today
– tomorrow
• Or, you can use now + time unit as in
– now + 5 minutes
– now + 2 days
– now + 1 week
• the time/day scheduled is relative to the time you submit it
– NOTE: you cannot specify in seconds, only minutes,
hours, days, weeks, months
Task Scheduling: at and batch
• To view scheduled tasks for at and batch, use atq
– this lists tasks including the task, the time and a task
number
• To delete a task scheduled by at or batch, use atrm tasknumber, as in atrm 5 to remove job number 5 (see the example below)
• When a task is scheduled, the task also includes
the value of your current SHELL in order for the
task to run in that shell
• You can control which users can or cannot use
at/batch
– by placing usernames in the files /etc/at.allow to allow
those users and /etc/at.deny to deny those users
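• A hedged example of the workflow (the script names and times are made up)

  at -f ./nightly_report.sh 11:30 PM tomorrow   # schedule the script for 11:30 pm tomorrow
  batch -f ./big_computation.sh                 # run the script once the system load drops low enough
  atq                                           # list pending at/batch jobs and their job numbers
  atrm 5                                        # remove job number 5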
Task Scheduling: crontab
• The cron program is used to schedule recurring
tasks
– You specify the recurrence and the task in one line of a
file
– You submit one file containing all tasks for one user (one file per user)
– Thus, each recurrence/task goes on its own line of that single file
– The format of this file is
• 1 2 3 4 5 task
• where 1 2 3 4 5 specify the recurrence (see next slide)
– If this information is placed in a file foo, schedule all
tasks with
• crontab foo
Task Scheduling: crontab
• Recurrence is specified as
1. minute of the hour (0-59)
2. hour of the day (0-23, military time)
3. day of the month (1-31 but make sure that the day
matches the month if you use 29-31)
4. month of the year (1-12)
5. day of the week (0-7 where 0 and 7 are both Sunday,
1-6 are Monday-Saturday)
• * can be used to indicate “every time”
• You can list multiple items separating them by commas such as 5,10,15,20
• You can use */value to indicate “every value” such as */5 for every 5 (minutes, hours, days)
Task Scheduling: crontab
• Let’s look at some examples
– 30 12 15 * * – 12:30 pm on the 15th of every month
– 30 12 * * 0 – 12:30 pm every Sunday
– 30 12 15 * 0 – 12:30 pm on the 15th of every month and also every Sunday (when both day fields are given, cron runs the task if either matches)
– */10 0,12 * * * – at 12:00 am and 12:00 pm, every 10 minutes (e.g., 12:00, 12:10, 12:20, …, 12:50), every day of every month
• Here we add the tasks to see full entries

  0    0   *   *   *   ./backup /home
  */5  *   *   *   *   ./intruder_alert
  30   0   15  *   *   ./usage_report >> disk_data.dat
  15   3   1   1   *   ./end_of_year_statistics

• Notice each example invokes a script because we may not have room to place all of the commands on one line that we want to execute
Task Scheduling: crontab
• Since a user can only submit one crontab file
– the user must either delete and resubmit the file if the user wishes to change what is scheduled
– or edit what is scheduled
– crontab -l – lists the schedule for this user
– crontab -r – deletes what is scheduled for this user
– crontab -e – places the scheduled entries in a vi editor to be edited (see the example below)
• As root, you can view other users' crontab listings
– add -u username to the crontab -l command
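• For example (foxr is the sample user used throughout these notes)

  crontab foo          # install the schedule stored in the file foo
  crontab -l           # list this user's scheduled entries
  crontab -e           # edit the entries in vi
  crontab -r           # remove all of this user's entries
  crontab -u foxr -l   # as root, view foxr's entries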
Task Scheduling: crontab
• Similar to at, there are files /etc/cron.allow and
/etc/cron.deny
• There are pre-established directories called
/etc/cron.hourly, /etc/cron.daily, /etc/cron.weekly
and /etc/cron.monthly
– anacron is responsible for executing any scripts in these directories every hour, day, week or month
• When the crontab job is submitted, it saves
environment variables to be used when executing
the tasks
– HOME, LOGNAME, PATH, SHELL
System Monitoring: terminology
• One goal of an operating system is to ensure that
processes are making progress toward completion
– this is known as liveness
– The system administrator should inspect the system to
ensure this
– Starvation – when a process is not making progress
because resources that it needs are being held by other
processes
– Starvation can arise because we allow a process to
hold onto a resource until it is done with it and the OS
switches off between processes using multitasking
• whenever a process, say P1, reaches the CPU, another
process already has access to the resource that P1 needs, so
P1 doesn’t make progress
System Monitoring: example
• Process P0 and P1 both need to access file F0
– Assume F0 currently stores the value 0
• P0 will add 3 to it
• P1 will subtract 2 from it
• No matter what, when done, F0 should store 1 (0 + 3 – 2)
– Consider this situation:
1. P0 begins executing, reads the datum from the file and stores the datum in a local variable, X
2. P0 adds 3 to X. X is now 3 (the file is still storing 0)
3. The CPU is interrupted and the operating system performs a context switch to P1
4. P1 begins executing, reads the datum from the file and stores the datum in a local variable, Y
5. P1 subtracts 2 from Y. Y is now -2 (the file is still storing 0)
6. P1 writes Y back to the file (the file now stores -2)
7. The CPU is interrupted and the operating system performs a context switch to P0
8. P0 writes X (3) back to the file (the file now stores 3)
System Monitoring: example
• If step 8 occurs before the interruption in step 3, the file will be ok
• If step 7 occurs before step 6, then step 6 may occur after step 8 (resulting in the file storing -2)
• So there are three possible results
– File stores 1 (correct answer)
– File stores 3
– File stores -2
• We need to enforce mutually exclusive access to the file so that each process completes its read-modify-write without interruption, ensuring the file always ends up with the right answer (1)
System Monitoring: example
• Now consider this variation of the previous problem where we have two files, F0 and F1
1. P0 begins executing
2. P0 requests access to F0
3. As no other process is using F0, the operating system grants P0 access
4. A context switch forces the CPU to switch to P1
5. P1 begins executing
6. P1 requests access to F1
7. As no other process is using F1, the operating system grants P1 access
8. A context switch forces the CPU to switch to P0
9. P0 resumes executing
10. While holding onto F0, P0 requests access to F1
11. The operating system denies access to P0 as P1 is holding onto F1. P0 enters a waiting state and the CPU performs a context switch to P1
12. P1 resumes executing
13. While holding onto F1, P1 requests access to F0
14. The operating system denies access to P1 as P0 is holding onto F0; P1 enters a waiting state
System Monitoring: Deadlock
• The situation on the previous slide is deadlock
– Neither process can continue to make progress because
each holds a resource the other needs
– Because neither process can continue, neither will
release the resource it is holding onto
• Operating systems might deal with deadlock in
different ways
– Prevention – do not allow a process to start if it could
cause a deadlock
– Avoidance – do not grant a process access to a resource if doing so could result in a deadlock
• these are both overly conservative
– Detection – every so often, look for a deadlock
• if one is found, kill the processes involved and restart them
later at different times
System Monitoring: Deadlock
• Most operating systems (including Linux) apply
the Ostrich algorithm
– Don’t even deal with deadlock, let it happen and let the
user or system administrator handle it
– A notable exception is that the kernel cannot be
involved in a deadlock because the kernel is always
granted resources in a specific order and only granted
to another process once the kernel is done with it
• So it might fall on the system administrator to
ensure that currently running processes are
making progress toward completion and not stuck
in a deadlock
System Monitoring: Fairness
• To promote liveness, we want to make sure
that processes are treated fairly
– They are not kept waiting too long
– No process should be allowed to monopolize a
resource
• One shared resource is all of memory
– We want to ensure that processes are given enough
pieces of memory (frames) so that they do not
succumb to poor performance from a lot of page
swapping
– Too much time spent swapping leads to a situation
called thrashing
System Monitoring: Thrashing Example
• Imagine a process containing a loop that iterates n times
• The process' code occupies two pages and its data another two pages, but the process has been allocated only two frames
• Each iteration of the loop requires both
pages resulting in a page fault from the
first half to the second half and another
page fault to resume the first half
– The same is true of the data
• The code results in 4 * n page faults!
– Had this process been given 4 frames
instead of 2, it would only have 4 total
page faults (one fault for each of the first
time the page was referenced)
System Monitoring Tools
• We divide our look at system monitoring tools
into
– Monitoring processes and the processor
• We have already looked at the System Monitor GUI, top
and ps
– Monitoring memory
– Monitoring I/O
– Monitoring the network
• We have already seen netstat, ss, …
– Miscellaneous tools that can be used to help you
determine system performance
System Monitoring Tools: mpstat
• mpstat – processor performance
– intended for multiple processor systems but can
also report on the performance of a single
processor
– for each processor, indicates the percentage of time spent on user processes, system processes, hardware interrupts, software interrupts, guest activities, idle time, cycle stealing, and waiting for input and output
• for user processes, the time is broken into niced and non-niced processes
• cycle stealing is a situation where the CPU is forced to wait because another device (usually the disk) has higher priority to use memory or a bus
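• A brief usage sketch (the interval and count are arbitrary)

  mpstat              # a single report averaged since boot
  mpstat -P ALL 2 5   # per-processor statistics, sampled every 2 seconds, 5 times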
System Monitoring Tools: sar
• The sar program reports on archived statistics of the
processor utilization
• By default, processor performance is recorded every 10 minutes in /var/log/sa using files named sa# or sar# (# being a number)
– mpstat provides an average; sar reports on each 10 minute interval
– each interval is described by the number of CPUs, the percentage of CPU time in user and nice modes, system time, wait time, cycle stealing and idle time
• You can alter sar’s output through options that restrict
the intervals (start, stop time, duration)
– you can also alter what sar reports on providing statistics
on I/O transfer rate, memory usage, paging, network usage
and block device usage
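• Hedged examples (the file name follows the sa# convention noted above; confirm the options against your version's man page)

  sar -u                           # CPU utilization for today's recorded intervals
  sar -u -f /var/log/sa/sa15       # CPU utilization recorded on the 15th of the month
  sar -u -s 09:00:00 -e 12:00:00   # restrict the report to the 9 am to noon intervals
  sar -r                           # memory usage; -b (I/O), -B (paging) and -n DEV (network) select other reports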
System Monitoring Tools: pidstat
• The previous programs showed you CPU
utilization information
• pidstat provides you CPU information for each
running process
– That is, it breaks down for each process how much
time the CPU spends on the process itself versus
system and guest usage and total usage for the process
– You are also told which processor the process is
running on (if multiple processors)
– As with sar, you can alter pidstat’s output information
through options
• -d for I/O usage including reading/writing to hard disk
• -r for virtual memory and page fault information
• -w for task switching activity
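• A brief sketch (the PID, interval and count are arbitrary)

  pidstat 5 3          # per-process CPU usage, sampled every 5 seconds, 3 times
  pidstat -d           # per-process disk read/write activity
  pidstat -r -p 1234   # virtual memory and page fault information for process 1234
  pidstat -w           # per-process task (context) switching activity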
System Monitoring Tools: others
• uptime – the amount of time the system has
been running since the last boot/reboot
– Also the current users logged in
– The average load over the past 1, 5 and 15 minutes
• strace – primarily used to debug system code
– Given a Linux program, what system calls does it
make?
• example: pwd: brk, mmap, access, open, fstat, mmap,
close, open, read, fstat, mmap, mprotect, mmap (four
times), arch_prctl, mprotect (twice), munmap, brk
(twice), open, fstat, mmap, close, getcwd, fstat, mmap,
write, close, munmap, close and exit_group
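• Brief usage sketches (pwd.trace is an arbitrary file name)

  uptime                    # time since boot, users logged in, 1/5/15 minute load averages
  strace pwd                # list the system calls made while running pwd
  strace -o pwd.trace pwd   # write the trace to the file pwd.trace instead of the screen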
Memory System Monitoring: vmstat
• This program reports on the average amount of
memory utilization, disk swaps and other items
since the last boot/reboot
• The information looks like the following:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  2     64 112976  65360 413114    0    0    71    30  203   60  2  1 85  0  0
• Explanations for the abbreviations are given in
the table on the next slide
Header   Meaning
r        Number of processes waiting for run time
b        Number of processes in uninterruptible sleep
swpd     Amount of virtual memory used (in KB)
free     Amount of free RAM (in KB)
buff     Amount of RAM used as buffers (in KB)
cache    Amount of RAM used to cache hard disk data (in KB)
si       Average KB/second of data swapped into memory from disk
so       Average KB/second of data swapped out of memory to disk
bi       Average blocks/second read in from a block device (disk)
bo       Average blocks/second written out to a block device (disk)
in       Number of interrupts/second
cs       Number of context switches/second
us       User time (non-kernel)
sy       System (kernel) time
id       Idle time
wa       Wait time
st       Cycle stealing time
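• A brief usage sketch (the interval and count are arbitrary)

  vmstat        # averages since the last boot, as in the sample output above
  vmstat 5 3    # three samples taken 5 seconds apart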
Memory System Monitoring: free
• The free program reports on the current memory
utilization instead of an average over time like
vmstat
• Free reports on the amount of currently used
memory and the amount of free or available
memory
– The -/+ buffers/cached indicates the amount of
memory allocated for buffers of running applications,
or disk caches
                    total       used       free     shared    buffers     cached
Mem:              1020648     907160     113488          0      65436     471488
-/+ buffers/cache:            370236     650412
Swap:              524280          8     524272
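• A brief usage sketch

  free                 # figures in KB, as shown above
  free -m              # figures in MB
  watch -n 5 free -m   # re-run free every 5 seconds to watch memory usage change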
I/O System Monitoring: Tools
• iostat – reports on file system utilization
• lpstat – reports on printer utilization
• ip – network interface information (IP addresses, etc)
• ss – socket usage statistics
• netstat – port utilization
• nstat, rtacct – TCP/IP information such as number of TCP and UDP packets
• nmap – scans the ports that are available and the
services that are available (note: you would submit this
on another computer, not necessarily your own)
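• Hedged usage sketches for these tools (common options; confirm against your distribution's man pages)

  iostat -x 5      # extended device utilization statistics every 5 seconds
  lpstat -t        # full printer status report
  ip addr show     # MAC and IP address information for each interface
  ss -t -a         # all TCP sockets
  nmap localhost   # scan the local machine's open ports (only scan hosts you administer)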
Summary
Name             Reports on                                      Comments
df, du           File system info
free             Memory info
iostat           File system and I/O info
lpstat           I/O (printer) info
mpstat           Processor info
netstat          Network info                                    Obsolete, replaced by ss
nmap, nstat      Network info
pidstat          Processor, process, memory, VM and I/O info
ps               Process info
rtacct           Network info
System Monitor   Processor, process, memory and network info     Graphical, persistent
sar              Processor, memory, VM, I/O and network info
ss               Network info
stat             File system info
strace           Process info
top              Processor, process and memory info              Persistent
uptime           Processor info                                  System uptime
vmstat           Memory, VM and I/O info
who                                                              Lists logged in users
Log Files
• Log files are repositories of messages as
generated by various software
– many messages are logged by syslogd or auditd
• They are collected primarily so the system
administrator can go back and inspect what might
have happened at some time in the past
– or look at how the given software is performing now
• Different log files store different types of
information but mostly they store the events that
are noteworthy
– as a system administrator, you will need to know how
to read the log files for useful information and which
files to examine and for what reasons
Log Files: What You Can Get
• The following are some of the types of
information you might obtain by analyzing
different log files
– Who is and has tried to log in
– What services were successfully started or stopped at
system initialization time
– Events related to hardware or software errors
– What yum updates and installations occurred
(successfully or unsuccessfully)
– What jobs successfully ran via crontab
– Who has accessed your Apache web server and which
pages have been requested (if you are running Apache)
Log Files: syslogd
• Recall from chapter 11 that we configured
what programs syslog should log and to where
– this information was placed in /etc/syslog.conf and
looked something like this:
*.info;mail.none;authpriv.none;cron.none        /var/log/messages
authpriv.*                                      /var/log/secure
mail.*                                          /var/log/maillog
cron.*                                          /var/log/cron
uucp,news.crit                                  /var/log/spooler
local7.*                                        /var/log/boot.log

Many sources are logged to /var/log/messages
Log Files: Log Entries
• Log files will tend to store the following
information
– date and time of event
– host name of the computer on which the event
arose
– name of the program that generated the log
message
– a (short) description of the event (less than one line); if the process that generated the event is not syslogd or the kernel, a PID is included
Log Files: messages Log
• As we saw, several different programs have their
messages logged to /var/log/messages
• Let’s take a look at some of that log file
– Nov 23 10:29:16 mycomputer sshd[1781]: Server
listening on 0.0.0.0 port 22.
– Nov 23 10:29:41 mycomputer pam: gdm-password[2041]:
pam_unix(gdm-password:session): session opened for
user foxr by (uid=0)
– Nov 23 10:32:18 mycomputer su: pam_unix(su:session):
session opened for user root by foxr(uid=500)
• the first message indicates that the sshd server has started and is listening on port 22
• the second is an attempt through PAM to authenticate foxr (uid=0 means that root handled the login attempt)
• the third indicates an su session in which foxr (uid=500) switched to root
Log Files: secure Log
• All authentication attempts are logged in to the
secure log file, let’s look at some excerpts
– Nov 23 11:37:20 mycomputer su:
pam_unix(su:session): session opened for user root by
foxr(uid=500)
– Nov 23 11:37:27 mycomputer unix_chkpwd[4993]:
password check failed for user (root)
– Nov 23 11:37:27 mycomputer su: pam_unix(su:auth):
authentication failure; logname=foxr uid=500 euid=0
tty=ptrs/1 ruser=foxr rhost= user=root
• we see a successful su attempt followed by two messages
related to a single event: a failed su attempt
Log Files: cron Log
• The cron log stores information generated by crond
when either running crontab jobs or anacron jobs
– Nov 20 11:01:01 mycomputer anacron[5013]: Anacron
started on 2012-11-20
– Nov 20 15:10:01 mycomputer CROND[5042]: (foxr)
CMD (./my_scheduled_script >> output.txt)
– Nov 20 16:43:01 mycomputer CROND[5311]: (foxr)
CMD (echo “did this work?”)
– Nov 20 16:44:01 mycomputer CROND[5314]: (foxr)
CMD (echo “did this work?”)
• notice the first entry is not of a scheduled job being executed but
of anacron being started demonstrating that logged entries are
not only of operations of software but the starting and stopping
of software
• the last two entries are of a crontab job that merely executed an
echo statement rather than the invocation of a shell script
Log Files: boot Log
• The boot.log file is very different in that it
reports on the success when performing
operations at system boot time including
mounting file systems and starting services
Starting udev:
Setting hostname mycomputer:                                  [  OK  ]
Checking filesystems
/dev/sda1: clean, 93551/256000 files, 784149/1024000 blocks
/dev/sda5: recovering journal
/dev/sda5: clean, 472/40320 files, 26591/16180 blocks         [  OK  ]
Remounting root file system in read-write mode:               [  OK  ]
Log Files: boot Log
Mounting local filesystems:                                   [  OK  ]
Enabling local file system quotas:                            [  OK  ]
Enabling /etc/fstab swaps:                                    [  OK  ]
Iptables: Applying firewall rules:                            [  OK  ]
Bringing up loopback interface:                               [  OK  ]
Bringing up interface eth0:                                   [  OK  ]
Starting auditd:                                              [  OK  ]
Starting portreserve:                                         [  OK  ]
Starting system logger:                                       [  OK  ]
Starting irqbalance:                                          [  OK  ]
Starting crond:                                               [  OK  ]
Starting atd:                                                 [  OK  ]
Starting certmonger:                                          [  OK  ]
Log Files: audit Log
• The audit daemon, auditd, logs a number of
different events based on preset rules and rules
that you can add via the audit rules file in
/etc/audit/audit.rules (refer back to chapter 11)
– The audit logs are stored in /var/log/audit
• As the audit log files grow rapidly, rather than
searching these files you will primarily use two
tools to search for you
– aureport – gives basic statistics of the types of events logged
– ausearch – output all logged messages that match
given criteria
Log Files: aureport Options
Option   Meaning
-au      Authentication attempts
-c       Configuration changes
-e       Events
-f       File operations
-i       Convert numeric (UID, GID, etc) entries into text
-l       Login attempts
-m       Modification of user accounts
-n       Anomalous events
-p       Process initiated events
-s       System calls
-u       User initiated events
-x       Processes executed
Log Files: ausearch Options
Option      Meaning
-a EID      All entries for event # EID
-gi GID     All entries of processes owned by group GID
-i          Convert numeric (UID, GID, etc) entries into text
-k string   All entries that contain string
-m type     All entries whose message type is listed in type
-p PID      All entries generated by process PID
-pp PID     All entries generated by process whose parent is PID
-sc name    All entries generated by the system call name (name may either be a string or number)
-ui UID     All entries generated by user UID
-x name     All entries generated by the executable program name
Log Files: aureport Output
Summary Report
======================
Range of time in logs: 03/19/2013 10:11:02.774 –
04/16/2013 10:21:15.081
Selected time for report: 03/19/2013 10:11:02 –
04/16/2013 10:21:15.081
Number of changes in configuration: 18
Number of changes to accounts, groups, or roles: 47
Number of logins: 20
Number of failed logins: 1
Number of authentications: 164
Number of failed authentications: 5
Number of users: 3
Number of terminals: 16
Number of host names: 5
Number of executables: 21
Number of files: 0
Number of AVC's: 0
Number of MAC events: 20
Number of failed syscalls: 0
Number of anomaly events: 3
Number of responses to anomaly events: 0
Number of crypto events: 68
Number of keys: 0
Number of process IDs: 4787
Number of events: 29145
The above summary report is produced when aureport is run with no options
Log Files: aureport Output
• Let’s look at aureport with options
– Below we see an excerpt of aureport -au to show authentication events (events 2-4 below) and aureport -e to show all events (events 10-13 below)
2. 03/19/2013 10:11:16 foxr ? :0 /usr/libexec/gdm-session-worker yes 35602
3. 03/19/2013 10:15:59 foxr ? ? /usr/sbin/userhelper yes 35609
4. 03/19/2013 10:22:35 root ? pts/0 /bin/su yes 35617
10. 03/19/2013 10:11:16 35607 USER_START 500 yes
11. 03/19/2013 10:11:16 35608 USER_LOGIN 500 yes
12. 03/19/2013 10:15:59 35609 USER_AUTH 500 yes
13. 03/19/2013 10:15:59 35610 USER_ACCT 500 yes
Log Files: ausearch Output
• Refer back to event 35610 on the previous slide; we can examine the full log entry using ausearch
– ausearch -a 35610
time->Tue Mar 19 10:11:16 2013
type=USER_START msg=audit(1363702276.674:35607):
user pid=1948 uid=0 auid=500 ses=2
subj=system_u:system_r:xdm_t:s0-s0:c0.c1023
msg='op=PAM:session_open acct="foxr"
exe="/usr/libexec/gdm-session-worker" hostname=? addr=?
terminal=:0 res=success'
Log Files: Others
• Xorg – log files generated by X windows
– including modules loaded when X windows is started and failures of
X windows components. There are also messages from the audit
client for window opening and closing events
• maillog – messages generated by the mail system
– even if your computer is not running email, email messages are
automatically generated and sent to root
• yum.log – entries here indicate yum operations
• lastlog – last log in attempts for users, this information is not
stored in text so you cannot view it directly
• httpd/access_log – the default log file generated by Apache for
every received http request
• cups – messages logged from printer requests and printer errors
• utmp, wtmp, btmp – these log files show who is currently
logged into the system and failed log in attempts
Log Files: Rotation
• Log files can grow rapidly
• We do not want log files to get too big
– but we also do not want to delete them
• Instead, we rotate log files
– The older approach was to tack on a number to the
end of the log file’s name
• e.g., file becomes file.1, file.1 becomes file.2, file.2
becomes file.3, and we create an empty version for file
– Most log files now are time stamped with their
creation date
• file is the current file, the older files are file-20140829,
file-20140822, file-20140815, etc
Log Files: Rotation
• We will employ the logrotate program to rotate
our files
– For a given log file, we list instructions in /etc/logrotate.conf
• How often to rotate it (daily, weekly, monthly, etc)
• How many log files to retain (if 4, then we will keep the
current file and three others, deleting the fourth oldest)
• Ownership and permission information for created log files
• Whether to automatically compress older files or not
– The following is the entry found for the log file wtmp
/var/log/wtmp {
monthly
create 0664 root utmp
rotate 1
}
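• As a further sketch, an entry for a hypothetical application log using the options described above might look like

  /var/log/myapp.log {
      weekly
      rotate 4
      compress
      create 0640 root root
  }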
Disaster Planning and Recovery
• Risk assessment: identifying an organization's assets and their vulnerabilities
• Threat analysis: identifying the threats based
on vulnerabilities
• An organization will undergo this type of
analysis to generate plans and processes to
protect themselves from these threats
• Included in this process is disaster planning
Disaster Planning and Recovery
• First step: prioritize your goals
– This will allow you to determine which threats are
the most important to safeguard against
• Second step: catalog assets
• Third step: identify vulnerabilities
• Fourth step: identify threats
• Fifth step: safeguard against threats
– When the threats are disasters, we then put their
safeguards and recovery from them in our disaster
plan
Disaster Planning and Recovery
• There are many types of assets for
organizations but as we are interested in Linux,
we will limit the assets to
– Hardware
– Software
– Data
• You might think that hardware is the most
critical asset but in fact it will be your data
– The data may require confidentiality and security
while your hardware does not
Disaster Planning and Recovery
• Solutions for data threats
– Encryption – data cannot be accessed without the proper
key
– Authentication – data is only accessed by those who have
adequate access rights
– Intrusion detection software – test that no one has broken
into your system
– RAID and backups – data is available as needed
– Distributed access – place data at different sites so that if
one site goes down, the data is still available
– Software solutions – protect against attacks (e.g., denial of
service)
– Education – for the employees to make sure they are not
careless with the data (e.g., leaving it on a flash drive) and
to protect against the disloyal employee
Disaster Planning and Recovery
• Solutions for hardware threats
– Protection from vandalism through cameras, human
monitors or guards
– Protection against theft through cameras, bolting
equipment down, taking inventories
– Protection against fire through fire alarms, sprinklers
or fire retardant chemicals
– Protection against flood, smoke, heat damage –
various mechanisms
– Protection from power surges through surge protectors
and uninterruptible power supplies
Disaster Planning and Recovery
• To write a disaster plan, you need to envision
the possible disasters and what you should do
about them
– Include contact information
• All personnel
• Form emergency response teams
• Emergency numbers for fire, police, etc
– Full inventory of IT infrastructure (hardware,
software including versions installed, servers,
network components), also all licenses
– A copy of the plan
Disaster Planning and Recovery
• Let’s consider an example: fire damages your
building, your organization has multiple sites
– Evacuate the building, contact the fire department
– Is the disaster real? If not, return to business,
cancel the fire department
– Otherwise, contact your team leaders, check to
ensure everyone is accounted for
– Alert other sites that they will have to take up the
slack in processing and data access
Disaster Planning and Recovery
• Example continued
– After the fire is out, assess the damage
– Get an estimate for how long your site will be
unavailable
• Shift processing and data to other sites
• Contact personnel to let them know what to do about reporting to work (work from home, report to another site)
• Determine the damage to hardware, look at your recovery
plan to determine how you will replace the damaged
equipment
• Work with other sites if your site will be out of action for any
duration other than perhaps a day or two
• After the disaster is recovered from, analyze your
plan for flaws and fix it up!
Troubleshooting
Problem 1: System is running ineffectively.
Description: Simple tasks are taking too long to execute. Log in is taking more time
than expected. There is a delay between issuing a command and seeing its result.
Steps to determine the problem:
Use top, ps, the system monitor, and/or mpstat, sar and pidstat to view the running
processes.
Is CPU load heavy, approaching 100%?
Are there processes that are taking most of the CPU time, or are there many processes each taking little time but combined causing a heavy load?
Use vmstat and free to examine main memory and swap space utilization
Is main memory full?
Is the system spending a lot of time swapping?
Are there too many processes in memory?
Use uptime to see how long the system has been running without a reboot;
while Linux seldom needs to be rebooted, a reboot may resolve the problem
Troubleshooting
• Short term solutions:
– Identify processes that can be halted and scheduled for later
– Identify processes whose priorities can be lowered through
renice, or those processes which could be moved to the
background
– Alternatively, can you contact the users and ask some of
them to discontinue their processes and/or log off?
– Also, reboot the computer if the above steps do not solve
the problem
• Long term solutions:
– purchase more main memory
– increase the size of the swap partition (possibly add a
second hard disk to contain more of the swap partition)
– purchase a more powerful processor (or additional
processors)
Troubleshooting
Problem 3: Inadequate hard disk space
Description: One or more of the file systems is filling up or has become full.
Users cannot save files. Or, swap space is commonly low on available space.
Steps to determine the problem:
Use df to view how full each file system is
Use find to search for inordinately large user files and core dumps
(if /home is low on space), or log and spool files (if /var is low)
If swap space is low, examine swap history using vmstat, sar, and pidstat
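Example commands for these steps (the size threshold and paths are arbitrary):
  df -h                                    # how full is each file system (human readable sizes)?
  du -sh /home/* | sort -h | tail          # which home directories use the most space?
  find /home -type f -size +100M           # unusually large user files
  find / -type f -name core 2>/dev/null    # leftover core dumps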
Troubleshooting
• Short term solutions:
– back up the file system which is running out of space
– delete overly large files and warn the user/owner of the files
(e.g., “I have removed several core dumps found in your
directory”)
– ask users to clean up their file space.
• Long term solutions:
– back up all file systems
– purchase additional hard disks and either segment users onto
different partitions (e.g., /home/1 and /home/2) or repartition the
file system so that the partition can be moved either to the new
hard disk or to be split across the hard disks
– implement disk quotas if necessary to prevent user spaces from
filling up in the future
– initiate mail quotas
– move large log files to an archive
Troubleshooting
Problem 4: Suspicious system behavior.
Description: Services or programs are not working as they should.
System might be too slow. Files might have disappeared.
Steps to determine the problem:
Examine your log files, particularly secure, lastlog and btmp, to look for
unusual patterns of logins
Look for running processes with peculiar ownership
Use ausearch to look at authentication events, particularly failed ones
Look for evidence of computer virus or Trojan horse
Troubleshooting
• Short term solutions:
– kill any suspicious processes (with apologies to any users
who own those processes)
– run antiviral software
– reboot the computer if needed
– examine your firewall to make sure it is running
• Long term solutions:
– implement a more secure authentication system and a more
secure firewall
– implement an intrusion detection system
– discuss account protection with your users
– require all users to change passwords at the next log in
– delete any suspicious user accounts
Troubleshooting
Problem 6: Network not responding.
Description: You are unable to reach other computers via your
web browser or other network tools.
Steps to determine the problem:
See if the network service is running
Use ip to check the status of your interface device(s), do you have
MAC addresses? IP addresses? Do you have a router or gateway
connection?
Check the physical connection to the network to see if there is something
wrong with the cable or port
Use ping and/or traceroute to see if you can reach your gateway, if
successful, use ping/traceroute to reach a computer on your
local area network and then a computer on the Internet
Test to see if you can reach computers using IP addresses but not IP
aliases (check your resolv.conf file)
See if other users are also unable to communicate via the network
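Example commands for these steps (the addresses and host names are placeholders):
  ip addr show                 # do the interfaces have MAC and IP addresses?
  ping -c 4 192.168.1.1        # can you reach your gateway?
  traceroute www.example.com   # where along the route does traffic stop?
  ping -c 4 8.8.8.8            # reach a host by IP address...
  ping -c 4 www.example.com    # ...versus by name (tests DNS/resolv.conf)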
Troubleshooting
• Short term solutions:
– restart the network service
– check your network configuration files (e.g., ifcfg-eth0)
– if you are using DHCP, make sure your DHCP server is responding
– make sure your name servers are responding
– if they are unavailable, you may still be able to reach the network using IP addresses (instead of aliases), or place the mapping information in your /etc/hosts file
– if none of this works, reboot your computer
– reboot the DHCP server
• Long term solutions:
– reconfigure the network itself by replacing the DHCP server and/or
network gateway
– test your network cables
– try an alternate network interface device (e.g., replace your Ethernet
card with a new one).
Troubleshooting
Problem 9: System does not initialize correctly
Description: Upon boot/reboot, the operating system does
not come up in a usable mode.
Steps to determine the problem:
Check dmesg for errors during system boot
If the system initializes to Linux, see what runlevel /etc/inittab is set to
Does /sbin/init exist?
Is the root file system being mounted?
Is vmlinuz available?
Is GRUB configured correctly?
Troubleshooting
• Short term solution
– if errors arose during boot (from dmesg), try to diagnose
the cause of those errors (bad device, bad kernel image)
and reboot. If the system came up in the wrong runlevel,
run telinit to change the runlevel and alter the inittab file’s
default statement to modify the default runlevel
– if there are errors arising during the init process, you might
need to repartition one or more of your file systems
– if the system does not come up at all, check the GRUB
command line (shortly after booting, press ‘c’ to interrupt
the process and drop to a command line prompt)
• Long term solution:
– reinstall the OS