Backups
• In Linux, what file systems should you back up and how often?
  – /home – changes the most often, depending on how heavily the users use the system
    • back up as often as possible: every night or at least once per week
  – /var – changes as software creates new files or appends to existing ones
    • log files should be archived as they are rotated; other files (mail, web site content) might be backed up every night or once a week
  – /usr – changes as you upgrade software or install new software
    • may not need to be backed up at all if you don’t mind re-installing, or you could back it up when you have made substantive changes like installing several new software packages
  – / – changes usually occur only in /etc and /root
    • back up /root rarely, but you might want to back up /etc occasionally (for example, once per month) if you are making changes to config files

RAID
• Redundant array of independent disks
  – or redundant array of inexpensive disks
• Used to ensure file system integrity
  – by storing redundant information, damage to one portion of a file system can be repaired from the surviving portions
  – you could potentially use RAID for backups but that is neither its main purpose nor its strength
• There are several RAID levels, but a higher level number is not necessarily an improvement over a lower one
  – the levels all use different storage schemes

RAID: Levels
Level 0
  Description: Striping at block level, no redundancy
  A: Improved disk performance over a standard disk drive
  D: No redundancy
  Usage: For superior disk performance without redundancy; where increased cost is not a concern
Level 1
  Description: Complete mirror
  A: Provides 100% redundancy and improves disk access for parallel reads
  D: Most costly form of RAID
  Usage: Safest form of RAID if cost is not a factor
Level 2
  Description: Striping at bit level, redundancy through Hamming codes
  A: Fast access for single disk operation
  D: Hamming codes are time consuming to compute
  Usage: Not used in practice because of Hamming codes
Level 3
  Description: Striping at byte level, parity bit redundancy
  A: Fast access for single disk operation, compromise between expense and redundancy
  D: All drives active for any single access so cannot accommodate parallel accesses
  Usage: Useful for single user systems
Level 4
  Description: Striping at block level, single parity disk
  A: Larger stripes accommodate parallel accesses (like RAID 0) but improves over RAID 0 because of redundancy
  D: Single parity disk is a bottleneck, defeating the advantage gained by striping
  Usage: Not used in practice because the parity disk is a bottleneck
Level 5
  Description: Striping at block level, parity distributed across disks
  A: Same as 4
  D: None
  Usage: Useful for multi-user systems (e.g. file servers)
Level 6
  Description: Striping at block level, parity distributed across disks and duplicated
  A: Same as 4 & 5 except that with double the parity information, it provides a greater degree of redundancy
  D: More expensive than RAIDs 3-5
  Usage: Same as 5
Level 7
  Description: RAID 3 (or 4) with real-time operating system controller
  A: Faster single disk access over RAID 3
  D: More expensive
  Usage: Same as 3
Level 10
  Description: RAID 0 & 1 with striping at block level
  A: Same as 0 and 1 combined
  D: Twice as much disk drive as RAID 1
  Usage: For file servers that require both parallel access & redundancy
Level 53
  Description: Extra disks to support RAID 3 and RAID 5 striping
  A: Best overall access as one can access the RAID 3 or the RAID 5 set of disks
  D: Requires more disks so is more expensive than 3 or 5
  Usage: Useful for multi-user systems (e.g. file servers)

RAID: Parity Computation
• For RAID levels 3-6, parity information is used for redundancy
  – Parity is computed by XOR-ing one bit from each disk drive
  – Consider four drives storing the following bytes in the same locations (e.g.
same surface, sector, track)
    • 00000000
    • 11110000
    • 10101010
    • 00101111
  – XOR-ing them bitwise (by column) means, for the first column, XOR-ing the first bits of bytes 1 and 2, XOR-ing the first bits of bytes 3 and 4, and then XOR-ing the two results: (0 XOR 1) XOR (1 XOR 0) = 1 XOR 1 = 0
  – Repeating this for every column yields the following parity byte: 01110101

RAID: Parity Computation
• Now, if any one of those drives has a damaged block, we can restore it by XOR-ing the four surviving blocks
  – this holds whether the damaged block is one of bytes 1-4 or the parity byte
  – it fails only if blocks at the same location are damaged on more than one drive
• With 5 disks, it would be unusual for the same location to have bad blocks on different drives
  – we might have bad blocks, but in different locations

Backup Strategies
• You want to back up some portion of your file system
  – How often?
  – To where will it be backed up?
  – Do you perform a full backup or an incremental backup?
  – Assuming you want to do occasional backups, how long do you keep older backups?
  – Are your backups meant to ensure file system integrity, to create archives, or both?
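The RAID parity computation described above can be checked with a few lines of shell. This is only a minimal sketch, assuming a POSIX shell; the function name xor_bits is illustrative and not part of any RAID tool. It XORs the four example bytes column by column to get the parity byte, then rebuilds the third byte from the three surviving data bytes plus the parity byte, as if the third drive had failed.

```shell
#!/bin/sh
# XOR two equal-length bit strings column by column (hypothetical helper).
xor_bits() {
  a=$1 b=$2 out=
  while [ -n "$a" ]; do
    bit_a=${a%"${a#?}"}   # first character of a
    bit_b=${b%"${b#?}"}   # first character of b
    if [ "$bit_a" = "$bit_b" ]; then out="${out}0"; else out="${out}1"; fi
    a=${a#?}              # drop the first character of each string
    b=${b#?}
  done
  printf '%s\n' "$out"
}

# Parity byte: XOR of the four data bytes from the example.
parity=00000000
for byte in 00000000 11110000 10101010 00101111; do
  parity=$(xor_bits "$parity" "$byte")
done
echo "parity byte:    $parity"      # 01110101

# Pretend the third drive (10101010) failed: XOR the parity byte
# with the three surviving data bytes to rebuild it.
recovered=$parity
for byte in 00000000 11110000 00101111; do
  recovered=$(xor_bits "$recovered" "$byte")
done
echo "recovered byte: $recovered"   # 10101010
```

Real RAID controllers do this on whole blocks in hardware or in the kernel; the string-based loop here is chosen only because it mirrors the bit-by-bit description.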
Backup Strategies
• Assume you have 7 tapes and you will create a full backup at the beginning of each week followed by incremental backups during the week
• You will rotate tapes
• In order to retain the oldest information, you might use a scheme like the following

  Tape Number   Week   Usage
  1             1      Full backup
  2             1      Daily incremental backups for 3 days
  3             1      Daily incremental backups for 3 days
  4             2      Full backup
  5             2      Daily incremental backups for 3 days
  6             2      Daily incremental backups for 3 days
  7             3      Full backup
  2             3      Daily incremental backups for 3 days
  3             3      Daily incremental backups for 3 days
  5             4      Full backup
  6             4      Daily incremental backups for 3 days
  1             4      Daily incremental backups for 3 days

Restoration from Backups
• A file was deleted some time ago and we want to restore it
  – Use the most recent backup
  – If not found, continue backward in time through the backups until you locate it
  – Once found, stop; you don’t want an earlier version
    • the pattern of restoring the file is the same whether you have used full backups or incremental backups, except that if you know when the file was deleted, you can go right to the previous full backup, saving time
• Now assume you want to restore an entire file system
  – You must start from the last full backup
  – Then, working your way forward, restore from each incremental backup because new files were added and old files were altered

Incremental vs Full Backups
• The advantages of incremental backups are
  – performing an incremental backup is faster (possibly a lot faster) than doing a full backup
  – an incremental backup should take a lot less storage space on your backup medium than a full backup
• The disadvantages of incremental backups are
  – restoring the full file system is a lot more work
  – unless you have indexed each incremental backup, knowing where a file was stored requires searching backward through time

Backups: Where?
• Where are you going to place your backup?
  – Individual users wishing to back up only specific files might use USB or optical disk storage
  – Individual users wishing to back up their full file system might purchase a second, external, hard disk drive
    • they are cheap enough
    • very convenient and fast access
    • you have many options: an external hard disk or possibly a remotely accessible hard disk via cloud storage
  – For an organization, this may not be as practical
    • Use magnetic tape media, which is far cheaper in terms of cost per unit of storage (once you purchase the actual tape device, the individual tapes are cheap)

Backups: How?
• There are several Linux programs available to perform backups
  – dump – good for full and incremental backups of full file systems (cannot back up individual directories or files)
  – tar – originally developed to bundle files together and save to tape (tar = tape archive)
    • tar can be used on individual files, directories or full file systems
    • some incremental backup capabilities but not as useful as dump
  – cpio – no built-in incremental backup capabilities but incremental backups can be done in conjunction with find

Task Scheduling: rc.local
• One way to schedule tasks is to place them in the /etc/rc.d/rc.local file
  – This script is executed at the end of system initialization after every reboot
  – If you have tasks to perform at that time, place them here
  – You might wish to
    • Rotate log files
    • Execute scripts to obtain statistical information
    • Datamine log files
    • Examine disk space for issues like bad permissions or users using too much space
  – You would probably not want to do backups via rc.local but instead use one of the other scheduling tools like cron

Task Scheduling: anacron
• With at and cron you can schedule tasks for different times
  – But if the system is down at a scheduled time, the task is not executed
  – anacron is a scheduling program that reads scheduled activities from a file (/etc/anacrontab) and executes each one as soon as possible after its specified time period elapses;
    thus, if the system is down at the scheduled time, the tasks are run once the system is running again
  – anacron calls upon the program run-parts

Task Scheduling: at and batch
• These programs run tasks one time
  – at runs the program(s) at the given time/date
  – batch runs the program(s) when the system load average drops below a threshold (0.8 by default on Red Hat-based systems)
• Syntax:
  – at [-f file] timespecifier
  – batch [-f file]
• If you do not specify a file, you are placed at the at> prompt to enter your commands (exit the prompt with control+d)
• Otherwise, the file is a script consisting of the commands/programs to execute

Task Scheduling: at
• For at, the time specifier uses the format
  – HH:MM [AM|PM] [day]
    • AM|PM, if included, can be upper or lower case; if it is omitted, HH is interpreted as military (24-hour) time (add 12 hours for pm, such as 15 for 3 pm)
  – If no day is given, then at schedules the task for the next occurrence of the time, e.g., 3:45 am would execute the next time it is 3:45 am
• The day, if given, will use one of these formats
  – MMDDYY
  – MM/DD/YY
  – DD.MM.YY
  – YYYY-MM-DD

Task Scheduling: at
• You can also specify the following for time and day
  – noon
  – midnight
  – teatime (4 pm)
  – today
  – tomorrow
• Or, you can use now + time unit as in
  – now + 5 minutes
  – now + 2 days
  – now + 1 week
    • the time/day scheduled is relative to the time you submit it
  – NOTE: you cannot specify seconds, only minutes, hours, days, weeks, months

Task Scheduling: at and batch
• To view scheduled tasks for at and batch, use atq
  – this lists each scheduled task with its task number and time
• To delete a task scheduled by at or batch, use atrm tasknumber, as in atrm 5 to remove task number 5
• When a task is scheduled, the task also includes the value of your current SHELL in order for the task to run in that shell
• You can control which users can or cannot use at/batch
  – by placing usernames in the file /etc/at.allow to allow those users and in /etc/at.deny to deny those users

Task Scheduling: crontab
• The cron program is used to
schedule recurring tasks
  – You specify the recurrence and the task in one line of a file
  – You submit one file of all tasks for one user (one file per user)
  – Thus, all of your recurrence/task lines must go in a single file
  – The format of each line is
    • 1 2 3 4 5 task
    • where 1 2 3 4 5 specify the recurrence (see next slide)
  – If this information is placed in a file foo, schedule all tasks with
    • crontab foo

Task Scheduling: crontab
• Recurrence is specified as
  1. minute of the hour (0-59)
  2. hour of the day (0-23, military time)
  3. day of the month (1-31, but make sure that the day exists in the month if you use 29-31)
  4. month of the year (1-12)
  5. day of the week (0-7 where 0 and 7 are both Sunday, 1-6 are Monday-Saturday)
• * can be used to indicate “every time”
• You can list multiple items, separating them by commas, such as 5,10,15,20
• You can use */value to indicate “every value” such as */5 for every 5 (minutes, hours, days)

Task Scheduling: crontab
• Let’s look at some examples
  – 30 12 15 * * – 12:30 pm on the 15th of every month
  – 30 12 * * 0 – 12:30 pm every Sunday
  – 30 12 15 * 0 – 12:30 pm on the 15th of every month and also every Sunday (when both the day of the month and the day of the week are restricted, cron runs the task when either one matches)
  – */10 0,12 * * * – every 10 minutes during the 12 am and 12 pm hours (e.g., 12:00, 12:10, 12:20, …, 12:50) every day of every month
• Here we add the tasks to see full entries

  0    0  *   *  *  ./backup /home
  */5  *  *   *  *  ./intruder_alert
  30   0  15  *  *  ./usage_report >> disk_data.dat
  15   3  1   1  *  ./end_of_year_statistics

• Notice each example invokes a script because we may not have room to place all of the commands that we want to execute on one line

Task Scheduling: crontab
• Since a user can only submit one crontab job
  – the user must either delete and resubmit the job if the user wishes to change what is scheduled
  – or edit what is scheduled
  – crontab -l – lists the schedule for this user
  – crontab -r – deletes what is scheduled for this user
  – crontab -e – places the scheduled entries in a vi editor to be edited
• As root, you can view other users’ crontab
listings
  – add -u username to the crontab -l command

Task Scheduling: crontab
• Similar to at, there are files /etc/cron.allow and /etc/cron.deny
• There are pre-established directories called /etc/cron.hourly, /etc/cron.daily, /etc/cron.weekly and /etc/cron.monthly
  – anacron is responsible for executing the scripts in these directories every hour, day, week or month
• When the crontab job is submitted, it saves environment variables to be used when executing the tasks
  – HOME, LOGNAME, PATH, SHELL

System Monitoring: terminology
• One goal of an operating system is to ensure that processes are making progress toward completion
  – this is known as liveness
  – The system administrator should inspect the system to ensure this
  – Starvation – when a process is not making progress because resources that it needs are being held by other processes
  – Starvation can arise because we allow a process to hold onto a resource until it is done with it and the OS switches between processes using multitasking
    • whenever a process, say P1, reaches the CPU, another process already has access to the resource that P1 needs, so P1 doesn’t make progress

System Monitoring: example
• Processes P0 and P1 both need to access file F0
  – Assume F0 currently stores the value 0
    • P0 will add 3 to it
    • P1 will subtract 2 from it
    • No matter what, when done, F0 should store 1 (0 + 3 – 2)
  – Consider this situation:
    1. P0 begins executing, reads the datum from the file and stores the datum in a local variable, X
    2. P0 adds 3 to X. X is now 3 (the file is still storing 0)
    3. The CPU is interrupted and the operating system performs a context switch to P1
    4. P1 begins executing, reads the datum from the file and stores the datum in a local variable, Y
    5. P1 subtracts 2 from Y.
       Y is now -2 (the file is still storing 0)
    6. P1 writes Y back to the file (the file now stores -2)
    7. The CPU is interrupted and the operating system performs a context switch to P0
    8. P0 writes X (3) back to the file (the file now stores 3)

System Monitoring: example
• If step 8 occurs before the interruption in step 3, the file will be ok
• If step 7 occurs before step 6 then step 6 may occur after step 8 (resulting in the file storing -2)
• So there are three possible results
  – File stores 1 (correct answer)
  – File stores 3
  – File stores -2
• We need to enforce mutually exclusive access to the file so that one process finishes its read, update and write before the other begins, and the file always ends up with the right answer (1)

System Monitoring: example
• Now consider this variation of the previous problem where we have two files, F0 and F1
  1. P0 begins executing
  2. P0 requests access to F0
  3. As no other process is using F0, the operating system grants P0 access
  4. A context switch forces the CPU to switch to P1
  5. P1 begins executing
  6. P1 requests access to F1
  7. As no other process is using F1, the operating system grants P1 access
  8. A context switch forces the CPU to switch to P0
  9. P0 resumes executing
  10. While holding onto F0, P0 requests access to F1
  11. The operating system denies access to P0 as P1 is holding onto F1. P0 enters a waiting state and the CPU performs a context switch to P1
  12. P1 resumes executing
  13. While holding onto F1, P1 requests access to F0
  14. The operating system denies access to P1 as P0 is holding onto F0. P1 enters a waiting state

System Monitoring: Deadlock
• The situation on the previous slide is deadlock
  – Neither process can continue to make progress because each holds a resource the other needs
  – Because neither process can continue, neither will release the resource it is holding onto
• Operating systems might deal with deadlock in different ways
  – Prevention – do not allow a process to start if it could cause a deadlock
  – Avoidance – do not grant a process access to a resource if doing so could result in a deadlock
    • these are both overly conservative
  – Detection – every so often, look for a deadlock
    • if one is found, kill the processes involved and restart them later at different times

System Monitoring: Deadlock
• Most operating systems (including Linux) apply the Ostrich algorithm
  – Don’t even deal with deadlock, let it happen and let the user or system administrator handle it
  – A notable exception is that the kernel cannot be involved in a deadlock because resources are always granted to the kernel in a specific order and are granted to another process only once the kernel is done with them
• So it might fall on the system administrator to ensure that currently running processes are making progress toward completion and are not stuck in a deadlock

System Monitoring: Fairness
• To promote liveness, we want to make sure that processes are treated fairly
  – They are not kept waiting too long
  – No process should be allowed to monopolize a resource
• One shared resource is all of memory
  – We want to ensure that processes are given enough pieces of memory (frames) so that they do not succumb to poor performance from a lot of page swapping
  – Too much time spent swapping leads to a situation called thrashing

System Monitoring: Thrashing Example
• Imagine a process with the code as shown to the right
• The process’ code occupies two pages and its data two pages, but the process has been given only 2 frames (one for code and one for data)
• Each iteration of the loop requires both pages
resulting in a page fault from the first half to the second half and another page fault to resume the first half
  – The same is true of the data
• The code results in 4 * n page faults for n iterations of the loop!
  – Had this process been given 4 frames instead of 2, it would only have 4 total page faults (one fault for the first time each page is referenced)

System Monitoring Tools
• We divide our look at system monitoring tools into
  – Monitoring processes and the processor
    • We have already looked at the System Monitor GUI, top and ps
  – Monitoring memory
  – Monitoring I/O
  – Monitoring the network
    • We have already seen netstat, ss, …
  – Miscellaneous tools that can be used to help you determine system performance

System Monitoring Tools: mpstat
• mpstat – processor performance
  – intended for multiple processor systems but can also report on the performance of a single processor
  – for each processor, indicates the percentage of time spent on user processes, system processes, hardware interrupts, software interrupts, guest activities, idle time, cycle stealing, and waiting for input and output
    • for user processes, broken into niced processes and non-niced processes
    • cycle stealing (reported as %steal) is time a virtual CPU is forced to wait while the hypervisor services another virtual processor

System Monitoring Tools: sar
• The sar program reports on archived statistics of processor utilization
• By default, processor performance is recorded every 10 minutes in /var/log/sa using files named sa# or sar# (# being a number)
  – mpstat provides an average, whereas sar reports on 10-minute intervals
  – each interval is described by number of CPUs, percentage of CPU time for user and nice modes, system time, wait time, cycle stealing and idle time
• You can alter sar’s output through options that restrict the intervals (start, stop time, duration)
  – you can also alter what sar reports on, providing statistics on I/O transfer rate, memory usage, paging, network usage and block
device usage

System Monitoring Tools: pidstat
• The previous programs showed you CPU utilization information
• pidstat provides you CPU information for each running process
  – That is, it breaks down for each process how much time the CPU spends on the process itself versus system and guest usage, and total usage for the process
  – You are also told which processor the process is running on (if multiple processors)
  – As with sar, you can alter pidstat’s output information through options
    • -d for I/O usage including reading/writing to hard disk
    • -r for virtual memory and page fault information
    • -w for task switching activity

System Monitoring Tools: others
• uptime – the amount of time the system has been running since the last boot/reboot
  – Also the number of users currently logged in
  – The average load over the past 1, 5 and 15 minutes
• strace – primarily used to debug system code
  – Given a Linux program, what system calls does it make?
    • example: pwd: brk, mmap, access, open, fstat, mmap, close, open, read, fstat, mmap, mprotect, mmap (four times), arch_prctl, mprotect (twice), munmap, brk (twice), open, fstat, mmap, close, getcwd, fstat, mmap, write, close, munmap, close and exit_group

Memory System Monitoring: vmstat
• This program reports on the average amount of memory utilization, disk swaps and other items since the last boot/reboot
• The information looks like the following:

  procs -----------memory---------- ---swap--- -----io---- --system-- ----cpu----
   r  b  swpd    free   buff   cache   si   so   bi   bo    in   cs  us sy id wa st
   1  2    64  112976  65360  413114    0    0   71   30   203   60   2  1 85  0  0

• Explanations for the abbreviations are given in the table below

  Header   Meaning
  r        Number of processes waiting for run time
  b        Number of processes in uninterruptible sleep
  swpd     Amount of virtual memory used (in KB)
  free     Amount of free RAM (in KB)
  buff     Amount of RAM used as buffers (in KB)
  cache    Amount of RAM used to cache hard disk data (in KB)
  si       Average KB/second of data swapped into memory from disk
  so       Average KB/second of data swapped out of memory to disk
  bi       Average blocks/second read in from block devices
  bo       Average blocks/second written out to block devices
  in       Number of interrupts/second
  cs       Number of context switches/second
  us       User time (non-kernel)
  sy       System (kernel) time
  id       Idle time
  wa       Wait time (waiting for I/O)
  st       Steal time (time taken from a virtual machine by the hypervisor)

Memory System Monitoring: free
• The free program reports on the current memory utilization instead of an average over time like vmstat
• free reports on the amount of currently used memory and the amount of free or available memory
  – The -/+ buffers/cache line shows used memory with buffers and disk caches subtracted, and free memory with them added back

               total     used     free   shared  buffers   cached
  Mem:       1020648   907160   113488        0    65436   471488
  -/+ buffers/cache:   370236   650412
  Swap:       524280        8   524272

I/O System Monitoring: Tools
• iostat – reports on file system utilization
• lpstat – reports on printer utilization
• ip – network interface information (IP addresses, etc)
• ss – socket usage statistics
• netstat – port utilization
• nstat, rtacct – TCP/IP information such as number of TCP and UDP packets
• nmap – scans the ports that are available and the services that are available (note: you would submit this on another computer, not necessarily your own)

Summary
• Table of tools against the kind of information each reports (per-tool check marks not reproduced)
  – Tools: df, du; free; iostat; lpstat; mpstat; netstat; nmap, nstat; pidstat; ps; rtacct; System Monitor; sar; ss; stat; strace; top; uptime; vmstat; who
  – Columns: Processor info, Process info, Memory info, VM info, File system info, I/O info, Network info, Comments
  – Comments: netstat – obsolete, replaced by ss; System Monitor – graphical, persistent; top – persistent; uptime – system uptime; who – lists logged in users

Log Files
• Log files are repositories of messages as generated by various software
  – many messages are logged by syslogd or auditd
• They are collected primarily so the system administrator can go back and inspect what might have happened at some time in the past –
or look at how the given software is performing now
• Different log files store different types of information but mostly they store the events that are noteworthy
  – as a system administrator, you will need to know how to read the log files for useful information, which files to examine, and for what reasons

Log Files: What You Can Get
• The following are some of the types of information you might obtain by analyzing different log files
  – Who is logged in and who has tried to log in
  – What services were successfully started or stopped at system initialization time
  – Events related to hardware or software errors
  – What yum updates and installations occurred (successfully or unsuccessfully)
  – What jobs successfully ran via crontab
  – Who has accessed your Apache web server and which pages have been requested (if you are running Apache)

Log Files: syslogd
• Recall from chapter 11 that we configured what programs syslog should log and to where
  – this information was placed in /etc/syslog.conf and looked something like this:

  *.info            /var/log/messages
  mail.none         /var/log/messages
  authpriv.none     /var/log/messages
  cron.none         /var/log/messages
  authpriv.*        /var/log/secure
  mail.*            /var/log/maillog
  cron.*            /var/log/cron
  uucp,news.crit    /var/log/spooler
  local7.*          /var/log/boot.log

• Many sources are logged to /var/log/messages

Log Files: Log Entries
• Log files will tend to store the following information
  – date and time of event
  – host name of the computer on which the event arose
  – name of the program that generated the log message
  – a (short) description of the event (less than one line); if the process that generated the event is not syslogd or the kernel, a PID is included

Log Files: messages Log
• As we saw, several different programs have their messages logged to /var/log/messages
• Let’s take a look at some of that log file
  – Nov 23 10:29:16 mycomputer sshd[1781]: Server listening on 0.0.0.0 port 22.
  – Nov 23 10:29:41 mycomputer pam: gdm-password[2041]: pam_unix(gdm-password:session): session opened for user foxr by (uid=0)
  – Nov 23 10:32:18 mycomputer su: pam_unix(su:session): session opened for user root by foxr(uid=500)
    • the first message indicates that sshd has started and is listening for connections on port 22
    • the second indicates that a login session was opened for foxr through PAM (uid=0 means that root handled the login)
    • the third indicates an su session in which foxr (uid=500) switched to root

Log Files: secure Log
• All authentication attempts are logged in the secure log file; let’s look at some excerpts
  – Nov 23 11:37:20 mycomputer su: pam_unix(su:session): session opened for user root by foxr(uid=500)
  – Nov 23 11:37:27 mycomputer unix_chkpwd[4993]: password check failed for user (root)
  – Nov 23 11:37:27 mycomputer su: pam_unix(su:auth): authentication failure; logname=foxr uid=500 euid=0 tty=pts/1 ruser=foxr rhost= user=root
    • we see a successful su attempt followed by two messages related to a single event: a failed su attempt

Log Files: cron Log
• The cron log stores information generated by crond when running either crontab jobs or anacron jobs
  – Nov 20 11:01:01 mycomputer anacron[5013]: Anacron started on 2012-11-20
  – Nov 20 15:10:01 mycomputer CROND[5042]: (foxr) CMD (./my_scheduled_script >> output.txt)
  – Nov 20 16:43:01 mycomputer CROND[5311]: (foxr) CMD (echo “did this work?”)
  – Nov 20 16:44:01 mycomputer CROND[5314]: (foxr) CMD (echo “did this work?”)
    • notice the first entry is not of a scheduled job being executed but of anacron being started, demonstrating that log entries record not only the operations of software but also the starting and stopping of software
    • the last two entries are of a crontab job that merely executed an echo statement rather than invoking a shell script

Log Files: boot Log
• The boot.log file is very different in that it reports the success or failure of operations performed at system boot time, including mounting file systems and starting services
  Starting udev:
  Setting hostname mycomputer:                          [ OK ]
  Checking filesystems
  /dev/sda1: clean, 93551/256000 files, 784149/1024000 blocks
  /dev/sda5: recovering journal
  /dev/sda5: clean, 472/40320 files, 26591/16180 blocks
                                                        [ OK ]
  Remounting root file system in read-write mode:       [ OK ]

Log Files: boot Log

  Mounting local filesystems:                           [ OK ]
  Enabling local file system quotas:                    [ OK ]
  Enabling /etc/fstab swaps:                            [ OK ]

  Iptables: Applying firewall rules:                    [ OK ]
  Bringing up loopback interface:                       [ OK ]
  Bringing up interface eth0:                           [ OK ]
  Starting auditd:                                      [ OK ]
  Starting portreserve:                                 [ OK ]
  Starting system logger:                               [ OK ]
  Starting irqbalance:                                  [ OK ]
  Starting crond:                                       [ OK ]
  Starting atd:                                         [ OK ]
  Starting certmonger:                                  [ OK ]

Log Files: audit Log
• The audit daemon, auditd, logs a number of different events based on preset rules and rules that you can add via the audit rules file /etc/audit/audit.rules (refer back to chapter 11)
  – The audit logs are stored in /var/log/audit
• As the audit log files grow rapidly, rather than searching these files yourself you will primarily use two tools to search for you
  – aureport – gives basic statistics on the types of events logged
  – ausearch – outputs all logged messages that match given criteria

Log Files: aureport Options

  Option   Meaning
  -au      Authentication attempts
  -c       Configuration changes
  -e       Events
  -f       File operations
  -i       Convert numeric (UID, GID, etc) entries into text
  -l       Login attempts
  -m       Modification of user accounts
  -n       Anomalous events
  -p       Process initiated events
  -s       System calls
  -u       User initiated events
  -x       Processes executed

Log Files: ausearch Options

  Option       Meaning
  -a EID       All entries for event # EID
  -gi GID      All entries of processes owned by group GID
  -i           Convert numeric (UID, GID, etc) entries into text
  -k string    All entries that contain string
  -m type      All entries whose message type is listed in type
  -p PID       All entries generated by process PID
  -pp PID      All entries generated by process whose parent is PID
  -sc name     All entries generated by the system call name (name may either be a string or number)
  -ui UID      All entries generated by user UID
  -x name      All entries generated by the executable program name

Log Files: aureport Output
• With no options, aureport produces a summary:

  Summary Report
  ======================
  Range of time in logs: 03/19/2013 10:11:02.774 – 04/16/2013 10:21:15.081
  Selected time for report: 03/19/2013 10:11:02 – 04/16/2013 10:21:15.081
  Number of changes in configuration: 18
  Number of changes to accounts, groups, or roles: 47
  Number of logins: 20
  Number of failed logins: 1
  Number of authentications: 164
  Number of failed authentications: 5
  Number of users: 3
  Number of terminals: 16
  Number of host names: 5
  Number of executables: 21
  Number of files: 0
  Number of AVC's: 0
  Number of MAC events: 20
  Number of failed syscalls: 0
  Number of anomaly events: 3
  Number of responses to anomaly events: 0
  Number of crypto events: 68
  Number of keys: 0
  Number of process IDs: 4787
  Number of events: 29145

Log Files: aureport Output
• Let’s look at aureport with options
  – Below we see an excerpt of aureport -au to show authentication events (events 2-4 below) and aureport -e to show all events (events 10-13 below)

  2. 03/19/2013 10:11:16 foxr ? :0 /usr/libexec/gdm-sessionworker yes 35602
  3. 03/19/2013 10:15:59 foxr ? ? /usr/sbin/userhelper yes 35609
  4. 03/19/2013 10:22:35 root ? pts/0 /bin/su yes 35617

  10. 03/19/2013 10:11:16 35607 USER_START 500 yes
  11. 03/19/2013 10:11:16 35608 USER_LOGIN 500 yes
  12. 03/19/2013 10:15:59 35609 USER_AUTH 500 yes
  13. 03/19/2013 10:15:59 35610 USER_ACCT 500 yes

Log Files: ausearch Output
• Refer back to event 35607 on the previous slide; we can examine the full log entry using ausearch
  – ausearch -a 35607

  time->Tue Mar 19 10:11:16 2013
  type=USER_START msg=audit(1363702276.674:35607): user pid=1948 uid=0 auid=500 ses=2 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 msg='op=PAM:session_open acct="foxr" exe="/usr/libexec/gdm-session-worker" hostname=? addr=?
terminal=:0 res=success' Log Files: Others • Xorg – log files generated by X windows – including modules loaded when X windows is started and failures of X windows components. There are also messages from the audit client for window opening and closing events • maillog – messages generated by the mail system – even if your computer is not running email, email messages are automatically generated and sent to root • yum.log – entries here indicate yum operations • lastlog – last log in attempts for users, this information is not stored in text so you cannot view it directly • httpd/access_log – the default log file generated by Apache for every received http request • cups – messages logged from printer requests and printer errors • utmp, wtmp, btmp – these log files show who is currently logged into the system and failed log in attempts Log Files: Rotation • Log files can grow rapidly • We do not want log files to get too big – but we also do not want to delete them • Instead, we rotate log files – The older approach was to tack on a number to the end of the log file’s name • e.g., file becomes file.1, file.1 becomes file.2, file.2 becomes file.3, and we create an empty version for file – Most log files now are time stamped with their creation date • file is the current file, the older files are file-20140829, file-20140822, file-20140815, etc Log Files: Rotation • We will employ the logrotate program to rotate our files – For a given log file, we list in /etc/logrotate.conf instructions • How often to rotate it (daily, weekly, monthly, etc) • How many log files to retain (if 4, then we will keep the current file and three others, deleting the fourth oldest) • Ownership and permission information for created log files • Whether to automatically compress older files or not – The following is the entry found for the log file wtmp /var/log/wtmp { monthly create 0664 root utmp rotate 1 } Disaster Planning and Recovery • Risk assessment: identifying the assets and their 
vulnerabilities of an organization
• Threat analysis: identifying the threats based on vulnerabilities
• An organization will undergo this type of analysis to generate plans and processes to protect itself from these threats
• Included in this process is disaster planning

Disaster Planning and Recovery
• First step: prioritize your goals
  – This will allow you to determine which threats are the most important to safeguard against
• Second step: catalog assets
• Third step: identify vulnerabilities
• Fourth step: identify threats
• Fifth step: safeguard against threats
  – When the threats are disasters, we then put their safeguards and recovery from them in our disaster plan

Disaster Planning and Recovery
• There are many types of assets for organizations but, as we are interested in Linux, we will limit the assets to
  – Hardware
  – Software
  – Data
• You might think that hardware is the most critical asset but in fact it will be your data
  – The data may require confidentiality and security while your hardware does not

Disaster Planning and Recovery
• Solutions for data threats
  – Encryption – data cannot be accessed without the proper key
  – Authentication – data is only accessed by those who have adequate access rights
  – Intrusion detection software – test that no one has broken into your system
  – RAID and backups – data is available as needed
  – Distributed access – place data at different sites so that if one site goes down, the data is still available
  – Software solutions – protect against attacks (e.g., denial of service)
  – Education – for the employees, to make sure they are not careless with the data (e.g., leaving it on a flash drive) and to protect against the disloyal employee

Disaster Planning and Recovery
• Solutions for hardware threats
  – Protection from vandalism through cameras, human monitors or guards
  – Protection against theft through cameras, bolting equipment down, taking inventories
  – Protection against fire through fire alarms, sprinklers or fire
retardant chemicals
  – Protection against flood, smoke, heat damage – various mechanisms
  – Protection from power surges through surge protectors and uninterruptible power supplies

Disaster Planning and Recovery
• To write a disaster plan, you need to envision the possible disasters and what you should do about them
  – Include contact information
    • All personnel
    • Form emergency response teams
    • Emergency numbers for fire, police, etc.
  – Full inventory of IT infrastructure (hardware, software including versions installed, servers, network components), also all licenses
  – A copy of the plan

Disaster Planning and Recovery
• Let’s consider an example: fire damages your building, your organization has multiple sites
  – Evacuate the building, contact the fire department
  – Is the disaster real? If not, return to business, cancel the fire department
  – Otherwise, contact your team leaders, check to ensure everyone is accounted for
  – Alert other sites that they will have to take up the slack in processing and data access

Disaster Planning and Recovery
• Example continued
  – After the fire is out, assess the damage
  – Get an estimate for how long your site will be unavailable
    • Shift processing and data to other sites
    • Contact personnel to let them know what to do about reporting to the site (work from home, report to another site)
    • Determine the damage to hardware, look at your recovery plan to determine how you will replace the damaged equipment
    • Work with other sites if your site will be out of action for any duration longer than perhaps a day or two
    • After the disaster is recovered from, analyze your plan for flaws and fix it up!

Troubleshooting
Problem 1: System is running ineffectively.
Description: Simple tasks are taking too long to execute. Log in is taking more time than expected. There is a delay between issuing a command and seeing its result.
Steps to determine the problem:
Use top, ps, the system monitor, and/or mpstat, sar and pidstat to view the running processes.
Is CPU load heavy, approaching 100%? Are there processes that are taking most of the CPU time, or are there many processes taking little time but combined cause a heavy load?
Use vmstat and free to examine main memory and swap space utilization. Is main memory full? Is the system spending a lot of time swapping? Are there too many processes in memory?
Use uptime to see how long the system has been running without a reboot; while Linux seldom needs to be rebooted, a reboot may resolve the problem

Troubleshooting
• Short term solutions:
  – Identify processes that can be halted and scheduled for later
  – Identify processes whose priorities can be lowered through renice, or those processes which could be moved to the background
  – Alternatively, can you contact the users and ask some of them to discontinue their processes and/or log off?
  – Also, reboot the computer if the above steps do not solve the problem
• Long term solutions:
  – purchase more main memory
  – increase the size of the swap partition (possibly add a second hard disk to contain more of the swap partition)
  – purchase a more powerful processor (or additional processors)

Troubleshooting
Problem 3: Inadequate hard disk space
Description: One or more of the file systems is filling up or has become full. Users cannot save files. Or, swap space is commonly low on available space.
Steps to determine the problem:
Use df to view how full each file system is
Use find to search for inordinately large user files and core dumps (if /home is low on space), or log and spool files (if /var is low)
If swap space is low, examine swap history using vmstat, sar, and pidstat

Troubleshooting
• Short term solutions:
  – back up the file system which is running out of space
  – delete overly large files and warn the user/owner of the files (e.g., “I have removed several core dumps found in your directory”)
  – ask users to clean up their file space.
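
The df and find checks listed for Problem 3 can also be approximated in a script. The following is a minimal Python sketch using only the standard library, not a replacement for those tools. The function names percent_used and largest_files, and the default of five results, are hypothetical choices made for illustration, and the percentage may differ slightly from what df reports because df accounts for reserved blocks.

```python
import os
import shutil


def percent_used(path):
    """Return the percentage of the file system holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def largest_files(root, count=5):
    """Return up to `count` (size_in_bytes, path) pairs under `root`, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                # lstat so that symbolic links are not followed
                found.append((os.lstat(full).st_size, full))
            except OSError:
                # the file vanished or is unreadable; skip it
                continue
    found.sort(reverse=True)
    return found[:count]


if __name__ == "__main__":
    # Hypothetical usage: report on the root file system and the caller's home
    print(f"/ is {percent_used('/'):.1f}% full")
    for size, path in largest_files(os.path.expanduser("~")):
        print(f"{size:>12}  {path}")
```

Run as root against /home or /var, a script like this gives a quick list of candidates to archive or to raise with the files' owners before anything is deleted.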
• Long term solutions:
  – back up all file systems
  – purchase additional hard disks and either segment users onto different partitions (e.g., /home/1 and /home/2) or repartition the file system so that the partition can be moved to the new hard disk or split across the hard disks
  – implement disk quotas if necessary to prevent user spaces from filling up in the future
  – initiate mail quotas
  – move large log files to an archive

Troubleshooting
Problem 4: Suspicious system behavior.
Description: Services or programs are not working as they should. System might be too slow. Files might have disappeared.
Steps to determine the problem:
Examine your log files, particularly secure, lastlog and btmp, to look for unusual patterns of logins
Look for running processes with peculiar ownership
Use ausearch to look at authentication events, particularly failed ones
Look for evidence of a computer virus or Trojan horse

Troubleshooting
• Short term solutions:
  – kill any suspicious processes (with apologies to any users who own those processes)
  – run antiviral software
  – reboot the computer if needed
  – examine your firewall to make sure it is running
• Long term solutions:
  – implement a more secure authentication system and a more secure firewall
  – implement an intrusion detection system
  – discuss account protection with your users
  – require all users to change passwords at the next log in
  – delete any suspicious user accounts

Troubleshooting
Problem 6: Network not responding.
Description: You are unable to reach other computers via your web browser or other network tools.
Steps to determine the problem:
See if the network service is running
Use ip to check the status of your interface device(s): do you have MAC addresses? IP addresses? Do you have a router or gateway connection?
Check the physical connection to the network to see if there is something wrong with the cable or port
Use ping and/or traceroute to see if you can reach your gateway; if successful, use ping/traceroute to reach a computer on your local area network and then a computer on the Internet
Test to see if you can reach computers using IP addresses but not IP aliases (check your resolv.conf file)
See if other users are also unable to communicate via the network

Troubleshooting
• Short term solution:
  – restart the network service
  – check your network configuration files (e.g., ifcfg-eth0)
  – if you are using DHCP, make sure your DHCP server is responding
  – make sure your name servers are responding
  – if your name servers are unavailable, you may still be able to reach the network using IP addresses (instead of aliases), or place the mapping information in your /etc/hosts file
  – if none of this works, reboot your computer
  – reboot the DHCP server.
• Long term solutions:
  – reconfigure the network itself by replacing the DHCP server and/or network gateway
  – test your network cables
  – try an alternate network interface device (e.g., replace your Ethernet card with a new one).

Troubleshooting
Problem 9: System does not initialize correctly
Description: Upon boot/reboot, the operating system does not come up in a usable mode.
Steps to determine the problem:
Check dmesg for errors during system boot
If the system initializes to Linux, see what runlevel /etc/inittab is set to
Does /sbin/init exist? Is the root file system being mounted? Is vmlinuz available? Is GRUB configured correctly?

Troubleshooting
• Short term solution
  – if errors arose during boot (from dmesg), try to diagnose the cause of those errors (bad device, bad kernel image) and reboot.
    If the system came up in the wrong runlevel, run telinit to change the runlevel and alter the initdefault statement in the inittab file to modify the default runlevel
  – if there are errors arising during the init process, you might need to repartition one or more of your file systems
  – if the system does not come up at all, check the GRUB command line (shortly after booting, press ‘c’ to interrupt the process and drop to a command line prompt)
• Long term solution:
  – reinstall the OS
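
On systems that still use a SysV-style /etc/inittab, the default runlevel is the second field of the entry whose action is initdefault (for example, id:5:initdefault:). As a minimal sketch, assuming that file format, the following Python function extracts it. The function name default_runlevel is hypothetical, and systems booted by systemd generally do not use this file, so the result there may be None.

```python
def default_runlevel(inittab_text):
    """Return the default runlevel named in inittab text, or None if absent."""
    for line in inittab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(":")
        # inittab entries have the form id:runlevels:action:process
        if len(fields) >= 3 and fields[2] == "initdefault":
            return fields[1]
    return None


if __name__ == "__main__":
    try:
        with open("/etc/inittab") as handle:
            print("default runlevel:", default_runlevel(handle.read()))
    except OSError:
        print("could not read /etc/inittab on this system")
```

Comparing this value with the runlevel the system actually reached is one way to tell a misconfigured default from a failure later in the init process.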