Introduction to Linux and HPC
Presented by: Al Ritacco, Shailender Nagpal
AGENDA
• Introduction to Linux
• How to request an HPC account
• How to Login to HPC
• Basic Linux commands
• Available resources
• How to submit a job to the cluster
What is UNIX
• Unix is an operating system (OS), just as Microsoft Windows is an OS for computers
– Runs on many computer "servers" and provides a multi-user, multi-tasking environment
– Orchestrates the various parts of the computer (the processor, the on-board memory, the disk drives, keyboards, video monitors, etc.) to perform useful tasks
• The Unix operating system comprises three parts: the kernel (with commands to interact with it), standard utility programs/services, and system configuration files
What is Linux
• Linux is a Unix-like operating system, a “souped-up” Unix that provides additional user-friendly programs
– Both a command line interface (CLI) and a graphical user interface (GUI) are available to execute commands
• What exactly does this mean?
– It means we can install and run scientific software as well as business applications
Why Unix/Linux?
• UNIX is good for automating computer tasks:
– Performing complex operations with very few keystrokes
– Operating on a large number of objects, e.g.:
• Parsing file contents (pattern matching)
• Manipulating text files containing scientific data
• UNIX is fast
• Linux (≈ UNIX) is free and runs on most PCs and Macs, plus specialty hardware such as mobile devices
• Much scientific software is freely available on Linux
Getting an account
• To get started using the UMass Linux servers, you need an account. Fill out this form:
https://ghpcc06.umassrc.org/hpc/index.php
• Your PI has to authorize the request
• To connect to the HPC server from Windows, use the PuTTY client; from a Mac, use SSH in the Terminal
http://wiki.umassrc.org/wiki/index.php/Connecting_to_the_Cluster
Working on a Linux Computer
• Linux as a personal workstation
• Linux/Unix as a central “server” (multi-user)
– Connecting requires three pieces of information: user name, password, and server name or IP address
• “PuTTY” on Windows can be used to connect to UMass Research Computing servers
– Remote login may not allow for displaying graphics; interaction is in text mode only
– Graphics ("X") can be displayed using special tools (Xming)
Logging into Linux
• Why do we need to log in?
– To track who can log in and what access they have
• Logging in
– Use SSH client software
– Log in to a particular server, which has a designated name:
• Ex: hpcc01.umassmed.edu, ghpcc06.umassrc.org
• User credentials: user name, password
– SSH client for Windows: PuTTY
– SSH client for Mac/Linux: Terminal
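From a Mac or Linux Terminal, the login step can be sketched as follows. The account name is a placeholder (an assumption, not a real account), the server name comes from the example above, and the actual connection line is left commented out so the snippet can be run without contacting the server.

```shell
# Placeholder account name; replace with the user name you were given.
hpc_user="username"
# Server name taken from the example above.
hpc_host="ghpcc06.umassrc.org"

ssh_target="${hpc_user}@${hpc_host}"
echo "To log in, run: ssh ${ssh_target}"

# Uncomment to open the session; ssh then prompts for your password.
# ssh "${ssh_target}"
```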
Connect to UMass Servers
How do I interact with Linux?
• Using a command line interface (CLI) where we
explicitly type commands and have Linux execute
them (using a command shell)
• What is a command shell?
– A program that interprets the commands we wish to have executed by Linux
• Enter “bash”
– The Bourne Again SHell
Logging out of Linux
• Logging Out of Linux:
– To end your session use the “exit” command from the
command prompt:
[username@hpcc02 ~]$ exit
Connection to hpcc02 closed
• You can also use the key sequence <Ctrl>+D to close a session
Before we begin learning…
• We will use the term Linux and UNIX interchangeably
• Many variants (distributions) of Linux exist: Red Hat, Ubuntu, CentOS, Debian, etc.
• Commands across Linux distributions are exactly (or almost exactly) the same
• Most of the commands we will cover also apply to other *NIX-based operating systems
Files and Linux
• Linux users are working with
– Applications
– Files
• There are several different file types defined for
different types of usage in Linux
– Basic files: text or binary files (sequence files, etc.)
– Executable files (programs), such as bowtie, gate, ls, cp, cd, etc.
Things we need to do on a shell
• Just like with a Windows PC, users need to:
– Create, edit, move, rename and delete files
– Organize files into folders and navigate the filesystem
– Organize users and control permissions of what they can
see and do
– View and manage processes, services
– Install and run programs and work with their output
• In Linux, you have to learn "commands" to get these things done, typing them at the "shell" or "command line"
Filesystem: Relative and Absolute Path
• The Linux file system is hierarchical and resembles a tree structure
• A user in the “admin” directory can access the “steve” directory by specifying the relative path “steve” or the absolute path “/users/admin/steve”. Similarly, the parent directory “users” can be accessed by specifying “..” or “/users”
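The difference between the two kinds of path can be tried out safely in a throwaway directory. This is a minimal sketch: the tree is rebuilt under a temporary directory (an assumption made for the demo; on the slide it sits at the filesystem root), and the variable names are made up for illustration.

```shell
# Build a throwaway copy of the tree: <tmp>/users/admin/steve
root=$(mktemp -d)
mkdir -p "$root/users/admin/steve"

cd "$root/users/admin"           # start in "admin"
cd steve                         # relative path: resolved from where we are
rel=$(pwd -P)

cd "$root/users/admin/steve"     # absolute path: resolved from the top
abs=$(pwd -P)

cd "$root/users/admin"
cd ..                            # ".." always means the parent directory
parent=$(basename "$(pwd -P)")

echo "relative: $rel"
echo "absolute: $abs"
echo "parent of admin: $parent"
```

Both "cd" forms end in the same directory; only the starting point of the lookup differs.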
Linux Layout
• Linux commands are typically installed under:
– /bin           Linux commands
– /sbin          Typical system commands
– /usr/bin       User level commands (editors, etc.)
– /share/bin     Specific cluster software
– /share/apps    Specific genome based cluster tools
Basic command structure
• Basic form of a Unix command is:
command [-options] [arguments]
• Example: ls -l /tmp
– “ls” is the command. It lists contents of a directory
– “-l” is the option or flag or modifier of the default behavior
of command. Try “ls”.
– “/tmp” is the argument. Contents of this directory are
shown
• Aborting a shell command
– Most Unix systems let you abort the current command by typing Control-C
Note on Linux and commands
• Linux commands are case sensitive so:
– Exit is not the same as exit
– Bowtie is not the same as BOWTIE
– Gate is not the same as gatE
• In Linux we use a / as a directory separator
– In Windows we use \ as the directory separator
• Linux file names can be descriptive and do not
require a file extension
Basic Linux commands (List 1)
ls       List the contents of a directory
cp       Copy file(s)
rm       Remove file(s)
mv       Move file(s)
cd       Change location to another directory
mkdir    Make a new directory
pwd      Display the path of the current directory
rmdir    Remove a directory
cat      Display contents of a file
Basic Linux commands (List 1 ..contd)
head       Display beginning of file
tail       Display end of file
clear      Clear up the shell window
vi         Open a file for editing in the VI editor
passwd     Change the password
less       Displays contents of file with scrolling
more       Displays contents of file with scrolling
history    Displays history of commands executed
Basic Linux commands (List 1 ..contd)
date      Displays the current date and time
who       Displays who is currently logged in
whoami    Displays your username
last      Displays recent login activity
exit      Exit the shell
wc        Count lines, words and characters in file
grep      Search for string pattern in file
man       Display “manual” page for chosen command
Determining Present Working Directory with “pwd”
• When users log in, they are placed in their HOME directory, which is usually under the “/home” directory
• The Linux shell account name and the home directory name are usually the same, so “/home/snagpal” would be the home directory location for user “snagpal”
• As users navigate the filesystem, they can check where they currently are by running the “pwd” command
[snagpal@u15982204 ~]$ pwd
/home/snagpal
• In Windows, you can view the same in the Windows Explorer address bar
Changing directories with “cd”
• Often, users need to go to another directory that is:
– a sub-directory, below the present working directory in the tree hierarchy
– a directory elsewhere in the tree, reached through the parent of the present working directory
• In both cases, absolute and relative paths can be used. Let's say the user is currently in “/home/snagpal” and needs to access
– A sub-directory of the home directory
cd linuxcourse
cd /home/snagpal/linuxcourse
– A directory outside the home directory
cd ../../usr/local
cd /usr/local
Listing files and directories with “ls”
• “ls” lists files and sub-directories in a chosen directory.
Windows explorer offers a rich, graphical equivalent
– To list files in the current directory
ls
– To list files in another directory (absolute path)
ls /usr/local
– To modify the default view of the output to a long list
ls -l /usr/local
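The effect of options on “ls” is easiest to see on a directory whose contents are known. A minimal sketch, using a temporary directory and made-up file names (assumptions for the demo only):

```shell
# Create a scratch directory with two visible files and one hidden file.
demo=$(mktemp -d)
touch "$demo/a.txt" "$demo/b.txt" "$demo/.hidden"

plain=$(ls "$demo")        # default view: names only, hidden files skipped
all=$(ls -a "$demo")       # -a also lists entries whose names start with "."
ls -l "$demo"              # -l long list: permissions, owner, size, date

# Count how many names the default view printed.
n_plain=$(printf '%s\n' "$plain" | wc -l | tr -d ' ')
echo "default view shows $n_plain entries"
```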
Making Directories with “mkdir”
• To create new sub-directories in the home folder or elsewhere
on the filesystem, use the “mkdir” command
• Absolute or relative paths can be specified
mkdir linuxcourse
mkdir /home/snagpal/linuxcourse
Removing Directories with “rmdir”
• To remove directories in the home folder or elsewhere on the filesystem, use the “rmdir” command. The directory must be empty
• Absolute or relative paths can be specified
rmdir linuxcourse
rmdir /home/snagpal/linuxcourse
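A short sketch of “mkdir” and “rmdir” together, run in a temporary directory so nothing real is touched. The directory and file names are made up for illustration; the point is that “rmdir” removes only empty directories.

```shell
work=$(mktemp -d)
cd "$work"

mkdir linuxcourse            # relative path
mkdir "$work/data"           # absolute path
touch data/notes.txt         # "data" is now non-empty

rmdir linuxcourse            # succeeds: the directory is empty

# rmdir refuses to remove a directory that still contains files.
if rmdir data 2>/dev/null; then
    data_removed=yes
else
    data_removed=no
fi
echo "data removed? $data_removed"
```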
Copying, Moving and Removing files
• Users needing to make duplicates of a file can easily do so using the “cp” command. It requires the source and destination locations to be specified (absolute or relative path)
cp /share/training/linux/test.txt /home/snagpal
cp /share/training/linux/test.txt .
• The dot “.” represents the current working directory. Copying leaves the file in its original source location. Moving removes it from the source location, and can also rename the file
mv /home/snagpal/test.txt /home/snagpal/file.txt
mv test.txt file.txt
• To remove a file, use “rm”
rm test.txt
rm /home/snagpal/file.txt
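The three commands can be tried in sequence in a scratch directory. This is a sketch with made-up file names, not the training files themselves:

```shell
work=$(mktemp -d)
cd "$work"
echo "sample text" > test.txt

cp test.txt copy.txt         # copy: test.txt is still there afterwards
mv copy.txt file.txt         # move within one directory = rename
mkdir archive
mv file.txt archive/         # move into another directory, name unchanged
rm test.txt                  # remove permanently (there is no recycle bin)

remaining=$(ls)
echo "left in the scratch directory: $remaining"
```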
File Naming conventions in Linux
• To name files and directories, use:
– characters    A-Z, a-z
– numbers       0-9
– period        .
– dash          -
– underscore    _
• File and directory names containing shell metacharacters should be avoided, such as: \ / < > ! $ % ^ & * | { } [ ] " ' ` ; ~
Creating and editing files
• Linux has many text editors, most commonly “vi”, but “emacs”, “pico” and “nano” can also be installed
• Most common syntax is:
vi newfile.txt         # Creates new file
vi existingfile.txt    # Opens existing file
• The filename is checked to see if it exists. If it does, the file is displayed. If not, a new, empty file with that name is opened (it is written to disk when you save)
• By default, “vi” opens in command mode. Users can scroll in the file (up, down, page up, page down), move the cursor, delete lines, undo, etc.
• To enter the “write” or “insert” mode for adding text, press the “i” or “a” key. To leave insert mode, press the “Esc” key
The “vi” editor (…contd)
• To exit the “vi” editor and return to the Linux prompt, first return to command mode by pressing the “Esc” key. Then press the “:” key to enter command line mode and type one of:
:wq    saves the current changes and exits vi
:w!    saves the current changes but doesn’t exit vi
:q!    exits vi without saving any changes
• There are many more commands available in command mode and command line mode. A vi tutorial is suggested
Searching for patterns in text with “grep”
• Grep searches line-by-line for a specified pattern, and outputs
any line that matches the pattern. Basic syntax for the grep
command is: grep [options] pattern [files]
cp /share/training/linux/seq.fasta .
grep ">" seq.fasta
grep TCGAAGA seq.fasta
• Many “options” are available. Grep also searches using regular expressions (an expression that describes the characteristics of one or more strings), e.g. “te?xt” or “.*omics” with “grep -E”
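The grep calls above can be tried without access to the training directory. The sketch below writes a tiny FASTA file with made-up sequences (an assumption for the demo) and uses the “-c” option to count matching lines instead of printing them.

```shell
work=$(mktemp -d)

# Two made-up sequences; header lines start with ">".
cat > "$work/seq.fasta" <<'EOF'
>seq1 demo
ACGTTCGAAGATT
>seq2 demo
GGGCCCAAATTT
EOF

grep ">" "$work/seq.fasta"                    # print the header lines
headers=$(grep -c ">" "$work/seq.fasta")      # -c: count matching lines
hits=$(grep -c TCGAAGA "$work/seq.fasta")     # lines containing the motif

# Extended regular expression: "e?" means the "e" is optional,
# so both "text" and "txt" match, while "tent" does not.
ext=$(printf 'text\ntxt\ntent\n' | grep -c -E '^te?xt$')

echo "headers=$headers motif_lines=$hits regex_matches=$ext"
```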
Counting words in file with “wc”
• The “wc” command counts lines, words and characters in a file; “wc -l” prints only the line count
cp /share/training/linux/abstract.txt .
cat abstract.txt
wc abstract.txt
wc -l abstract.txt
Text processing Linux Commands
$ head -2 file_name                        List the first two lines
$ tail -2 file_name                        List the last two lines
$ head -5 file_name | tail -1              List the fifth line
$ cat file_name | head -50 | tail -1       List the 50th line
$ cat file | sort -rn | tail -5            List the last 5 items (sorted in reverse numerical order)
$ sort -rn file | uniq -c                  Sort a file, and count the number of line occurrences
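These one-liners are easier to follow on a file small enough to check by eye. A minimal sketch with a made-up six-line file of numbers; the final awk/paste step only reshapes the “uniq -c” output onto one line for display and is not part of the commands above.

```shell
work=$(mktemp -d)
printf '%s\n' 7 3 7 10 3 7 > "$work/numbers.txt"

# Third line: keep the first three lines, then the last of those.
third=$(head -3 "$work/numbers.txt" | tail -1)

# Smallest value: sort in reverse numerical order, take the last line.
smallest=$(sort -rn "$work/numbers.txt" | tail -1)

# Occurrences per value: uniq -c needs sorted input to group duplicates.
counts=$(sort -rn "$work/numbers.txt" | uniq -c \
         | awk '{print $2 ":" $1}' | paste -sd' ' -)

echo "third=$third smallest=$smallest counts=$counts"
```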
Miscellaneous commands
• Displaying current date and time with "date"
date
• Clearing the terminal with "clear"
clear
• Displaying history of commands with "history"
history
Getting Help in Unix
• Use the man command, followed by the name of the
command you need help with
– Type ‘man ls’ to see the manual page for the "ls" command
man ls
User convenience features
• Shell tab completion with suggestions
• Shell expansion of wild-cards for specifying multiple arguments
ls -l *.txt
• Combining options/flags
ls -la *.txt
• Using long flag names with "--", e.g. "ls --all"
• Copying and pasting via the clipboard with left and right mouse clicks
Tying Linux commands together
• All commands are executed left to right (LR)
– Output is passed along in the same manner
• Linux pipes ‘|’ feed the output of one command into the next command
• Ex: determine how many sequences we have
$ cat sequence.fastq | wc
There are 4 lines per sequence in a FASTQ file, so how can we determine the number of sequences (x/4)?
$ wc -l sequence.fastq | awk '{print $1}' | xargs -I{} echo "scale=0; {}/4" | bc -l
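The same count can also be done with the shell's built-in integer arithmetic instead of awk, xargs and bc. This is an alternative sketch, not the method shown above; the FASTQ content is made up and holds exactly two reads.

```shell
work=$(mktemp -d)

# Two made-up reads, 4 lines each (header, bases, "+", qualities).
cat > "$work/sequence.fastq" <<'EOF'
@read1
ACGT
+
IIII
@read2
TTGA
+
IIII
EOF

# Reading from "<" makes wc print only the number, without the file name.
lines=$(wc -l < "$work/sequence.fastq")
reads=$(( lines / 4 ))       # integer division done by the shell
echo "$reads sequences"
```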
Linux/UNIX Redirection
• What is redirection?
– Linux uses < and > to redirect input and output respectively
– Redirecting with > lets the user save the output of a command to a file, for example. In the same way, < makes a command read its input from a file instead of the keyboard
– Ex: echo "test" > file1    # writes “test” to file1
– Ex: cat < file1    # outputs the contents of “file1”
Redirection (..contd)
• A word on redirection: be careful when redirecting to a file, as a single > (redirect output from stdout to a file) will overwrite (or create) the file, whereas >> (two > signs in a row) will append to the file, preserving its existing contents.
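The difference between > and >> shows up after three writes to the same file. A small sketch in a temporary directory; the file name and contents are made up.

```shell
work=$(mktemp -d)

echo "first"  >  "$work/file1"    # creates file1
echo "second" >  "$work/file1"    # overwrites: "first" is gone
echo "third"  >> "$work/file1"    # appends after "second"

cat < "$work/file1"               # read the file back via input redirection
n_lines=$(wc -l < "$work/file1" | tr -d ' ')
first_line=$(head -1 "$work/file1")
```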
Redirection (…contd)
• If we create two files (file1 and file2) with Line1 and Line2 in them respectively
• We can then create a new file using the > redirection operator
$ cat file1 file2 > file3
Redirection (…contd)
• Using bowtie with redirection
– Ex: analyze fastq files to look for all alignments per read, with hits guaranteed to be in the best stratum (with ties broken by quality), reporting end-to-end hits with at most 2 mismatches
• In the bowtie example we are redirecting the output of the bowtie alignment to the file we have named ‘output_file’ in your scratch dir
$ bowtie -a --best -v 2 upstream_mate downstream_mate.fastq > ~/scratch/output_file
Shortcut BASH keystrokes
• Keyboard shortcut timesavers in BASH
– CTRL + A    Move cursor to start of line
– CTRL + C    Stop a program
– CTRL + D    Logout (same as ‘exit’ command)
– CTRL + E    Move cursor to end of line
– CTRL + Z    Suspend program
– TAB         Command completion (type part of command and hit tab to complete command)
– TAB TAB     Shows all commands available
Executing Commands
• PATH
– The shell finds commands by searching the directories listed in your PATH
• For example: when we type a command such as ‘ls’, it runs because the directory containing it is part of the search PATH
– An example PATH is
$ echo $PATH
/bin:/sbin/:/home/ritaccoa
– Commands which are not in your PATH will not be found and therefore not executed, unless you type their full path
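How the search PATH decides whether a command is found can be sketched with a throwaway script. The script name “hpcdemo_hello” and the temporary directory are made up for illustration; “command -v” reports where the shell would find a command.

```shell
tools=$(mktemp -d)

# A tiny executable script in a directory that is not yet on the PATH.
printf '#!/bin/bash\necho "hello from hpcdemo_hello"\n' > "$tools/hpcdemo_hello"
chmod +x "$tools/hpcdemo_hello"

# Before: the shell cannot find the command by name.
if command -v hpcdemo_hello >/dev/null 2>&1; then
    before=found
else
    before=missing
fi

# Prepend the directory to PATH (affects this shell session only).
PATH="$tools:$PATH"

after=$(command -v hpcdemo_hello)   # now resolves to the full path
out=$(hpcdemo_hello)                # and can be run by name
echo "before=$before after=$after"
```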
Calling external bioinformatics programs
• On our server, several bioinformatics software packages are installed
$ module avail
• The general way to use a software package is to load its module
$ module load bowtie/1.0.2
$ bowtie --help
HPC infrastructure at UMass RC
• Massachusetts Green High Performance Computing Cluster
– 10264 cores available; each node has 196 - 512 GB RAM. 12 GPU nodes available
– 400 TB of high performance EMC Isilon X series storage
– FDR based Infiniband (IB) network and a 10GE network for the storage environment
• Software related to research installed:
– Physics, Medical Physics, Genomics, Chemistry…
Basic terminology
• What is a node?
• What is a CPU?
• What is a core?
• What is an Operating System
– What is a kernel?
• What is a process?
– Single process OS and processes
– Concurrent (Multi-tasking) OS and processes
– Multiple cores (SMP) and Linux processes
Basic Terminology
• What is a Node?
– A single computer/blade which contains X number
of CPUs and Y number of cores per CPU
• What is a CPU?
– The central processing unit (CPU) carries out all of the instructions that a computer system requires to perform a given task
• What is a core?
– A core is a processor within a CPU chip (there can be many cores on a given CPU)
Basic Terminology
• What is a process?
– A process is a program executing (ex: iTunes)
• What is a Kernel?
– The kernel is the glue between the hardware and
the user. The Kernel schedules processes.
– The kernel can be thought of as a crossing guard
directing traffic for optimal performance
Basic Terminology
• Processes and tasks
– Single process OS and processes
• Single processing OSs can run only one user process (a single task) at a given time
• Each task runs to completion before another task is started
• MS-DOS is an example of this type of single-task OS
• Linux Processes and Cores
– A one to one relationship is optimal for performance
Basic Terminology
• Processes and tasks, cont
– Concurrent (Multi-tasking) OS and processes
• A concurrent OS provides users the ability to execute
many programs simultaneously
• Linux provides users the ability to execute: an editor, a
music player, and other tasks simultaneously, thus
allowing for multi-tasking
– Multiple cores (SMP) and Linux processes
• A process can take advantage of more than one core while running. Such processes are typically called multithreaded.
Short Review
• If a node has four CPUs and two cores per
CPU, how many total cores are there?
• In Linux can we execute an editor and a
program to search a genome at the same
time?
• How many processes should we execute on a
node which has two CPUs with 8 cores each?
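The first and third review questions reduce to one multiplication: total cores = CPUs × cores per CPU. A minimal sketch, assuming the one-process-per-core guideline from the earlier slide; the function name is made up for illustration.

```shell
# Total cores on a node = number of CPUs x cores per CPU.
cores_per_node() {
    echo $(( $1 * $2 ))
}

q1=$(cores_per_node 4 2)     # four CPUs, two cores each
q3=$(cores_per_node 2 8)     # two CPUs, eight cores each

echo "Node with 4 CPUs x 2 cores: $q1 cores"
echo "Node with 2 CPUs x 8 cores: $q3 cores, so up to $q3 single-core processes"
```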
What is HPC?
• HPC = High Performance Computing
– Infrastructure where hundreds or thousands of
computers are networked together with shared
common storage
– Multiple users can login and use the infrastructure
– More than 1 computer can be used to complete a
computing task
– Special tools/skills required to leverage HPC
environment – Linux, LSF commands
Definitions
HPC Term: Definition
Node: A single computer available to perform computing tasks
Rack: A cabinet in which multiple nodes can be stacked vertically and/or horizontally, allowing for efficient housing, networking and power management
Cluster: A collection of computer “nodes” that are on the same network for inter-node communication, shared storage and to execute jobs
CPU: The electronic circuitry (microprocessor) within a computer that carries out the instructions of a computer program
Core: Independent processing unit within a CPU that can execute program instructions. A modern CPU can have multiple cores
Head node: In a cluster, one or a few nodes can be designated as a head node where users typically are able to log in and create/monitor jobs
Definitions (…contd)
HPC Term: Definition
Compute node: Compute nodes in a cluster execute jobs submitted from a head node. Users cannot log in to a compute node directly
Process: A process is an instance of a computer program that is being executed. It contains the program code and its current activity
Thread: A thread is the smallest sequence of programmed instructions that can be managed independently by the scheduler of an OS
Job: A job is a Linux command that is designated to be executed on a compute node rather than the head node
Job array: Identical jobs that differ only in an iterator variable
Parallel job: Jobs that break a complex computing task into smaller tasks, such that each task is executed on different nodes simultaneously
Queue: Designated “lanes” for submitting different types of jobs depending on priority, resources required or expected duration of execution
Definitions (…contd)
HPC Term: Definition
Scheduler: HPC software that allows for efficient utilization of cluster resources based on submitted job types
Job Management: HPC software that keeps track of jobs submitted
Research computing: One of the departments within UMass Medical School responsible for supporting the HPC infrastructure on campus. Not related to “IT”
Cloud computing: A variant of HPC infrastructure which is not limited to a particular organization, where computing resources are requested on demand
Distributed computing: Buzzword similar to High Performance Computing
Parallel computing: Buzzword similar to High Performance Computing
Why do you need HPC?
• Needs assessment:
– Use software that’s only available on Linux
• Install it yourself on your own Linux PC?
• RC already has it installed?
– Automate data crunching tasks
• Routine incoming data that needs to be crunched?
• Workflow available within RC to handle it?
– Simulations
• Molecular dynamics simulations taking too much time?
HPC is not for these!
• Running Windows software with point-and-click interfaces
• Working with office documents (spreadsheets, slides, etc.)
• Video games, music or general video
• Web browsing
• Emails
Policies for HPC use
• If you have a “need” to use HPC, RC group can
help, but there are expectations:
– Understanding of the constraints of our HPC
implementation – CPUs, memory, local and shared
storage, networking, etc
– Good knowledge of your own tasks/jobs that you
are going to run – expected run times, utilization
of memory, disk space and network bandwidth
– Fair share policies
Typical HPC environment
[Figure: “The Cluster”. A cluster head, a switch, many slave nodes and a NAS/SAN storage unit. Connections: internal cluster traffic (Ethernet, 1 Gb/s); NAS storage (Ethernet, 1 Gb/s); public network (Ethernet, 100 Mb/s).]
What is a computing “Job”?
• A computing “job” is an instruction to the HPC system to execute a command or script
– Simple Linux commands that can be executed within milliseconds would probably not qualify to be submitted as a “job”
– Any command that is expected to take up a big portion of CPU or memory for more than a few seconds on a node would qualify to be submitted as a “job”. Why? (Hint: multi-user environment)
How to submit a “job”
• The basic syntax is:
bsub <valid linux command>
• bsub: LSF command for submitting a job
• Let's say a user wants to count the number of lines in a FASTQ file. On a Linux PC, the command is
wc -l reads.fastq
• To submit a job to do the work, run
bsub wc -l reads.fastq
Specifying more “job” options
• Jobs can be marked with options for better job tracking and resource management
– Jobs should be submitted with parameters such as queue name, estimated runtime, job name, memory required, output and error files, etc.
• These can be passed on in the bsub command
bsub -q short -W 1:00 -R "rusage[mem=2048]" -J "Myjob" -o hpc.out -e hpc.err wc -l reads.fastq
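A long bsub line is easier to review when it is assembled from named values first. The sketch below only builds and prints the command (a dry run) and does not call bsub, so it can be tried on any machine; the variable names are made up, and the option values are the ones from the example above.

```shell
# Job parameters, kept in variables so they are easy to change.
queue="short"
walltime="1:00"        # HH:MM
mem_mb=2048
job_name="Myjob"

# Assemble the submission command as text for review.
bsub_cmd="bsub -q ${queue} -W ${walltime} -R \"rusage[mem=${mem_mb}]\" -J \"${job_name}\" -o hpc.out -e hpc.err wc -l reads.fastq"
echo "$bsub_cmd"

# On a cluster login node, uncomment the next line to actually submit:
# eval "$bsub_cmd"
```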
Job submission “options”
Option flag    Description
-q    Name of queue to use. On our systems, possible values are “short” (<=4 hrs execution time), “long” and “interactive”
-W    Run time limit (allocation of node time). Specify hours and minutes as HH:MM
-J    Job name. Eg. “Myjob”
-o    Output file. Eg. “hpc.out”
-e    Error file. Eg. “hpc.err”
-R    Resources requested from assigned node. Eg. “-R rusage[mem=1024]”, “-R span[hosts=1]”
-n    Number of cores to use on assigned node. Eg. “-n 8”
Why use the correct queue?
• Match requirements to resources
• Jobs dispatch quicker
• Better for entire cluster
• Help GHPCC staff determine when new resources are needed
Demo
Create a script “hello-job-array.sh”
#!/bin/bash
#BSUB -q short
#BSUB -W 00:10
#BSUB -n 1
#BSUB -R "rusage[mem=1024]"
#BSUB -J "myTask[1-80]"
#BSUB -o logs/out.%J.%I
echo "Hello Job $LSB_JOBID Task $LSB_JOBINDEX"
To execute on the shell, run: bsub < hello-job-array.sh
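In a job array, each task usually uses its index to pick its own input. The sketch below imitates that locally by setting LSB_JOBINDEX in a loop (on the cluster, LSF sets it for each task); the “sample_N.fastq” naming scheme and the function name are hypothetical.

```shell
# Map the array index of the current task to an input file name.
pick_input() {
    echo "sample_${LSB_JOBINDEX}.fastq"
}

# Local stand-in for three array tasks; LSF would run these as separate jobs.
inputs=""
for LSB_JOBINDEX in 1 2 3; do
    inputs="${inputs:+$inputs }$(pick_input)"
    echo "Task $LSB_JOBINDEX would process $(pick_input)"
done
```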
Learning to use HPC
• Linux is a pre-requisite to using any HPC
system
– Plenty of linux tutorials on the internet
– Attend our “Intro to linux” sessions when offered
• Our website is a good resource for learning to
use HPC, visit
www.umassrc.org
• Lots of examples provided
Disk usage best practices
• Archive your data
– Make backups of your data on mid-long term storage
• Use local storage if possible
– Local storage always faster than network
• Don’t use farline for cluster processing
HPC Best practices
• When submitting a large number of jobs
please consider:
– Single CPU jobs versus multi CPU Jobs
– Correct amount of memory for your job
– Job Arrays
– Job dependencies
HPC Best practices cont.
• The earlier your jobs are submitted, the earlier they will gain the needed LSF resources
• Redirect all LSF output to one directory for convenience
• Add one of the following to your LSF job directives (redirects stdout/stderr). Use the first form for single jobs and the second for job arrays; %J is the job ID and %I is the job array index:
#BSUB -o $HOME/LSF_jobs_output/LSF_job.%J.out
#BSUB -o $HOME/LSF_jobs_output/LSF_job.%J.%I.out
HPC Best practices cont.
LSF Queues and policies
• Fair share attempts to equalize CPU (slot) resources for
Labs and users at job submission.
• The priority of a job is calculated in relation to other
submitted jobs. The priority for jobs will change as
jobs complete and job slots become available
• All labs start with an equal weight
• Each lab member shares in this weight when
submitting jobs
• Weights are measured from job submissions per user
and per lab
• Weights are based on CPU time used and a decay time
Working with bioinformatics data files:
A demo
• Log on to the UMass server using PuTTY on Windows or Terminal on Mac
• Request an interactive shell session on one of the compute nodes for this demo
$ bsub -q interactive -W 4:00 -Is bash
• Navigate to the training directory or copy the examples to your home directory
$ cd /share/training/linux-bioinformatics
$ cp /share/training/linux-bioinformatics/* ~
Working with bioinformatics data files:
A demo (…contd)
• We have a file with genomic sequence, called
“sequence.fa”, and a file with NGS reads, “reads.fq”.
Confirm them
$ ls
• We can examine a file using this Linux command
$ file sequence.fa
sequence.fa: ASCII text
• Let's look at the attributes of the files in this directory
$ ls -l
Working with bioinformatics data files:
A demo (…contd)
• The “cat” command can be used to display the
contents of one or more files to the screen
$ cat sequence.fa
• It may be better to scroll through the file page by page
$ less sequence.fa
• Display just the first line of file (header)
$ head -1 sequence.fa
• Display the last 3 lines of the file
$ tail -3 sequence.fa
Working with bioinformatics data files:
A demo (…contd)
• Determine number of lines in FASTQ file
$ wc -l reads.fq
• Count the number of reads in FASTQ file
$ x=`wc -l reads.fq | cut -f 1 -d ' '`
$ echo "$((x/4)) reads"
• Search for pattern in the sequence file and count
$ grep -c ACGTCA sequence.fa
• Search for adapter and count reads beginning with it
$ grep ^ACGTCA reads.fq | wc -l
Working with bioinformatics data files:
A demo (…contd)
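• Case-insensitive search and count
$ grep -i ^ACGTCA reads.fq
$ grep -i ^ACGTCA reads.fq | wc -l
• Display all headers in sequence file (quote the pattern so the shell does not treat “>” as output redirection)
$ grep "^>" sequence.fa
• Count number of bases in single-sequence FASTA file (skip the header line and strip newlines before counting)
$ tail -n +2 sequence.fa | tr -d '\n' | wc -m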
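Counting bases can be checked on a file small enough to count by hand. The sketch below uses a made-up FASTA file (18 bases over two lines) and drops header lines with “grep -v”, which also works when a file holds more than one sequence.

```shell
work=$(mktemp -d)

# Made-up single-sequence FASTA: 10 + 8 = 18 bases.
cat > "$work/sequence.fa" <<'EOF'
>chrDemo made-up sequence
ACGTACGTAC
GGTTAACC
EOF

# Quoting "^>" keeps the shell from treating ">" as a redirection.
headers=$(grep -c "^>" "$work/sequence.fa")

# Drop header lines, remove the newlines, then count what is left.
bases=$(grep -v "^>" "$work/sequence.fa" | tr -d '\n' | wc -c | tr -d ' ')

echo "$headers header(s), $bases bases"
```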
Working with bioinformatics data files:
A demo (…contd)
• Now let's align the reads to the sequence file (chr19)
module load bowtie2/2-2.1.0
module load samtools/1.2
• If you still have enough time remaining on this compute node (interactive sessions can be requested for up to 8 hours), build the index and run bowtie2
bowtie2-build sequence.fa reference
bowtie2 -p 1 -x reference -U reads.fq -S reads.fq.sam
• You can also submit this alignment as a compute job
Working with bioinformatics data files:
A demo (…contd)
• Create a bowtie script, “bowtie-align.sh”, with the following content
#!/bin/bash
module load bowtie2/2-2.1.0
module load samtools/1.2
bowtie2-build sequence.fa reference
bowtie2 -p 8 -x reference -U reads.fq -S reads.fq.sam
samtools view -b reads.fq.sam -o reads.fq.bam
Working with bioinformatics data files:
A demo (…contd)
• Now submit this script as a compute job
bsub -W 4:00 -q short -R "rusage[mem=4096]" -J "bowtie-job" -o ngs.out -e ngs.err ./bowtie-align.sh
• Another way of writing the script is to include all of the command line options in the script itself (next slide)
• Then submit the compute job as
bsub < bowtie-align2.sh
Working with bioinformatics data files:
A demo (…contd)
#!/bin/bash
#BSUB -J "SeqAlignJob"
#BSUB -R rusage[mem=4096]
#BSUB -q short
#BSUB -W 4:00
#BSUB -o ngs.out
#BSUB -e ngs.err
module load bowtie2/2-2.1.0
module load samtools/1.2
bowtie2-build sequence.fa reference
bowtie2 -p 8 -x reference -U reads.fq -S reads.fq.sam
samtools view -b reads.fq.sam -o reads.fq.bam