Python for High Performance Computing

Who is this for?
• You already know Python?
• Know the basics of NumPy?
• Know the basics of SciPy?
• Know the basics of Matplotlib?
• Interested in HPC applications?

Facts about Python: pros and cons
For most users, computer speed is not as important as the human time invested in coding!

Is Python a good fit for your problem? Yes, if:
• Rapid development is your primary concern
• You are prototyping new algorithms/ideas
• Communication is NOT frequent in your programs (if it is, you will need to spend time writing Python MPI code)
• Your application can make good use of the proper Python libraries: NumPy, SciPy…
• You need to easily and quickly glue together existing components of your program written in different languages
• The CPU-intensive parts of your application can be migrated to C or Fortran

Python in an interactive shell on Beagle

Python looks for MODULES:
• In the same path as the running script
• In paths specified in $PYTHONPATH
• In the system-wide modules folder: site-packages

ams@login1:~> module avail python
python/2.7.1(default) python/2.7.3 python/2.7.3-vanilla python/3.3.0-vanilla
ams@login1:~> module load python/2.7.3
Python version 2.7.3 loaded
ams@login1:~> export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/numpy/1.7.0/lib/python2.7/site-packages
ams@login1:~> export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/scipy/0.12.0/lib/python2.7/site-packages
ams@login1:~> python
Python 2.7.3 (default, Jul 10 2013, 13:27:48)
[GCC 4.7.2 20120920 (Cray Inc.)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> import scipy

PYTHONPATH in a PBS script on Beagle

…
#PBS -l mppwidth=480
#PBS -j oe
. /opt/modules/3.2.6.6/init/bash
module swap PrgEnv-cray PrgEnv-gnu
module load numpy
export PATH=/soft/python/2.7/2.7.3/python/bin/:$PATH
export LD_LIBRARY_PATH=/soft/python/2.7/2.7.3/python/lib/:$LD_LIBRARY_PATH
export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/numpy/1.7.0/lib/python2.7/site-packages
export PYTHONPATH=$PYTHONPATH:/soft/python/2.7/2.7.3/modules/mpi4py/1.3/lib/python2.7/site-packages/
cd $PBS_O_WORKDIR
aprun -n 20 -N 1 python myampijob.py

NumPy

NumPy maps large regions of memory to a single object:
• A powerful N-dimensional array object with fast calculations (in an "element-by-element fashion")
• Supports all native data types (ints, floats…) and lets you describe custom objects
• A homogeneous collection: every item has exactly the same data type, takes up the same size block of memory, and each block of memory in the array is interpreted in exactly the same way.

>>> z = np.arange(3, dtype=np.float)
>>> z
array([ 0.,  1.,  2.])

Basic scientific toolkit:
• Linear algebra
• Fourier transform
• Random number capabilities
• Basic statistics

Array Data Structure
Shape of the array: a tuple of N integers (one for each dimension). A data type object (dtype) describes how the bytes in each fixed-size block of memory should be interpreted.
Strides: a tuple giving the number of bytes you have to move in memory to step to the next element along each dimension (e.g., from one row to the next row). Setting this attribute to another tuple changes the way the memory is viewed.

>>> a = np.arange(9).reshape(3,3)
>>> a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> a.dtype
dtype('int64')
>>> a.strides
(24, 8)
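To see the "memory is reinterpreted through strides" idea in action, here is a minimal sketch (my addition, not from the original slides) using numpy.lib.stride_tricks.as_strided; it assumes the 8-byte int64 dtype shown above:

import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(9).reshape(3, 3)   # int64: strides are (24, 8)
# Swapping the strides makes the same buffer read column-first,
# i.e. we get the transpose without touching any data.
t = as_strided(a, shape=(3, 3), strides=(8, 24))
print t
# [[0 3 6]
#  [1 4 7]
#  [2 5 8]]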
NumPy Data Types

NumPy data types comparison:

import numpy as np
import timeit

setup = """
import numpy as np
A = np.ones((1000,1000,3), dtype=datatype)
"""
datatypes = "np.uint8", "np.uint16", "np.uint32", "np.uint64", "np.float16", "np.float32", "np.float64"
stmt1 = """
A = A * 255 / 255 - 1 + 1
"""
stmt2 = """
A[:,:,:2] = A[:,:,:2] * A[:,:,:2]
"""
for stmt in [stmt1, stmt2]:
    print stmt
    for d in datatypes:
        s = setup.replace("datatype", d)
        T = timeit.Timer(stmt=stmt, setup=s)
        print d, ":", min(T.repeat(number=30))
    print
    print

ams@login1:~> python datatest.py
A = A * 255 / 255 - 1 + 1
np.uint8 : 1.2955300808
np.uint16 : 1.58471012115
np.uint32 : 1.67583012581
np.uint64 : 1.88479685783
np.float16 : 9.77974891663
np.float32 : 1.0303478241
np.float64 : 1.50705814362

A[:,:,:2] = A[:,:,:2] * A[:,:,:2]
np.uint8 : 1.17167806625
np.uint16 : 1.38199210167
np.uint32 : 1.75739693642
np.uint64 : 1.88689804077
np.float16 : 2.96481609344
np.float32 : 1.71590399742
np.float64 : 1.86970305443

Note how np.float16 stands out as much slower in both tests.

"Structured" Arrays

Elements of an array can be any fixed-size data structure!

# A "data structure" (dtype) describes the fields and
# type of the items in each array element.
>>> from numpy import *
>>> particle_dtype = dtype([('mass','float32'), ('velocity', 'float32')])
# This must be a list of tuples
>>> fmt = dtype([('name', 'S10'),
...              ('age', int),
...              ('weight', float)
...             ])
>>> a = empty((3,4), dtype=fmt)
>>> a.itemsize
26
>>> particles = array([(1,1), (1,2), (2,1), (1,3)], dtype=particle_dtype)
>>> print particles
[(1.0, 1.0) (1.0, 2.0) (2.0, 1.0) (1.0, 3.0)]
>>> print particles['mass']
[ 1.  1.  2.  1.]
>>> print particles['velocity']
[ 1.  2.  1.  3.]
>>> particles[0]
(1.0, 1.0)
>>> particles.sort(order=('velocity','mass'))
>>> print particles
[(1.0, 1.0) (2.0, 1.0) (1.0, 2.0) (1.0, 3.0)]

"Element-by-element fashion"

>>> a = np.array([[1, 2], [3, 4]])
>>> b = np.array([[2, 3], [4, 5]])
>>> a + b
array([[3, 5], [7, 9]])
>>> a * b
array([[ 2,  6], [12, 20]])
>>> np.multiply(a, b)
array([[ 2,  6], [12, 20]])
>>> a ** b
array([[   1,    8], [  81, 1024]])
>>> np.dot(a,b)
array([[10, 13], [22, 29]])

NumPy array properties

>>> import numpy as np
>>> b = np.array([4,5,6])
>>> b.dtype
dtype('int64')
>>> b[2] = 12.6        # assignment casts to the array's dtype
>>> b
array([ 4,  5, 12])
>>> b.nbytes
24
>>> b > 2
array([ True,  True,  True], dtype=bool)
>>> c = np.where(b > 5)
>>> c
(array([2]),)
>>> b[c]
array([12])
>>> a = np.array([[ 0,  1],
...               [30,  3],
...               [ 4,  5],
...               [ 6,  7]])
>>> a.shape
(4, 2)
>>> d = a.T
>>> d
array([[ 0, 30,  4,  6],
       [ 1,  3,  5,  7]])
>>> a.strides
(16, 8)
>>> a.T.strides
(8, 16)

Transpose doesn't move values in memory; it only swaps the order of the strides of the array.
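A quick sketch (my addition, not from the slides) confirming that the transpose is just a strided view of the same memory:

>>> a = np.array([[0, 1], [30, 3], [4, 5], [6, 7]])
>>> a.T.base is a        # the transpose shares a's memory
True
>>> a.T[1, 0] = 99       # writing through the view...
>>> a[0, 1]              # ...changes the original array
99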
Broadcasting: combining different dimensionalities

>>> a = np.arange(12).reshape(4,3)
>>> a
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
>>> b = np.arange(3)
>>> b
array([0, 1, 2])
>>> b.shape
(3,)
>>> a + b
array([[ 0,  2,  4],
       [ 3,  5,  7],
       [ 6,  8, 10],
       [ 9, 11, 13]])
>>> a * b
array([[ 0,  1,  4],
       [ 0,  4, 10],
       [ 0,  7, 16],
       [ 0, 10, 22]])

The smaller array broadcasts across the larger array so that they have compatible shapes — WITHOUT COPYING DATA!

Broadcasting rule
If both of your arrays are two-dimensional, their corresponding sizes have to be either equal, or one of them has to be 1!

>>> b = np.arange(4)
>>> b
array([0, 1, 2, 3])
>>> a + b
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (4,3) (4)
>>> b = np.arange(4).reshape(4,1)
>>> b
array([[0],
       [1],
       [2],
       [3]])
>>> a + b
array([[ 0,  1,  2],
       [ 4,  5,  6],
       [ 8,  9, 10],
       [12, 13, 14]])

If you have two arrays with different numbers of dimensions, say one 1x2x3 and the other 2x3, then you compare only the trailing common dimensions, in this case 2x3.

Instantiating NumPy arrays

>>> import numpy as np
>>> a = np.array([1, 2, 3])
>>> a
array([1, 2, 3])
>>> b = np.ones((3,2))
>>> b
array([[ 1., 1.], [ 1., 1.], [ 1., 1.]])
>>> b.shape
(3, 2)
>>> c = np.zeros((1,3), int)
>>> c
array([[0, 0, 0]])
>>> c.dtype
dtype('int64')
>>> d = np.linspace(1,5,11)
>>> d
array([ 1. ,  1.4,  1.8,  2.2,  2.6,  3. ,  3.4,  3.8,  4.2,  4.6,  5. ])

NumPy & Slicing

The general form is var[lower:upper:step].

>>> a = np.arange(36).reshape(6,6)
>>> a[:3]
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17]])
>>> a[::2]
array([[ 0,  1,  2,  3,  4,  5],
       [12, 13, 14, 15, 16, 17],
       [24, 25, 26, 27, 28, 29]])
>>> a[0,3:5]
array([3, 4])
>>> a[4:,4:]
array([[28, 29],
       [34, 35]])
>>> a[:,2]
array([ 2,  8, 14, 20, 26, 32])
>>> a[2::2,::2]
array([[12, 14, 16],
       [24, 26, 28]])

Slices are References!
Slices are references to memory in the original array. Changing values in a slice also changes the original array.

>>> a = np.array((0,1,2,3,4))
>>> b = a[2:4]
>>> b
array([2, 3])
>>> b[0] = 10
>>> a
array([ 0,  1, 10,  3,  4])

Searching and Indexing

>>> a = np.array([1, 3, 0, -5, 0], float)
>>> np.where(a != 0)
(array([0, 1, 3]),)
>>> a[a != 0]
array([ 1.,  3., -5.])
>>> len(a[np.where(a > 0)])
2
>>> x = np.arange(9.).reshape(3, 3)
>>> x
array([[ 0.,  1.,  2.],
       [ 3.,  4.,  5.],
       [ 6.,  7.,  8.]])
>>> np.where( x > 5 )
(array([2, 2, 2]), array([0, 1, 2]))
>>> b = np.array([10,20,30,40,50])
>>> i = np.where(a > 0)
>>> i
(array([0, 1]),)
>>> b[i]
array([10, 20])
>>> b
array([10, 20, 30, 40, 50])

This is a fast way to search (and extract) individual elements of a NumPy array — much faster than using a for loop or list comprehension. Note, however, that fancy indexing creates copies instead of references!

Indexing with newaxis
newaxis is a special index that inserts a new axis in the array at the specified location. Each newaxis increases the array's dimensionality by 1.

>>> a = array((0,1,2))
>>> b = array((0,10,20,30))
>>> shape(a)
(3,)
>>> y = a[newaxis,:]
>>> shape(y)
(1, 3)
>>> y = a[:, newaxis, newaxis]
>>> shape(y)
(3, 1, 1)
>>> y = a[:, newaxis]
>>> shape(y)
(3, 1)
>>> y = a + b[:, newaxis]
>>> y
array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22],
       [30, 31, 32]])
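To back up the note above that fancy indexing creates copies (while slices are views), a minimal sketch of my own:

>>> a = np.arange(5)
>>> s = a[1:3]          # slice: a view
>>> f = a[[1, 2]]       # fancy index: a copy
>>> s[0] = 100
>>> f[1] = 200
>>> a                   # the slice write shows up, the fancy write doesn't
array([  0, 100,   2,   3,   4])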
Example: Arrays from/to ASCII files

vi data.txt
-- BEGINNING OF THE FILE
% Day, Month, Year, Skip, Avg
01, 01, 2014, x876, 13
% crazy day!
% we don't have Jan 03rd
04, 11, 2013, xfed, 55

vi fromto.py
from numpy import *
# loadtxt() automatically generates an array from a text file
data = loadtxt('data.txt', skiprows=1, dtype=int, delimiter=",",
               usecols=(0, 1, 2, 4), comments="%")
savetxt('fromto.txt', data)

Fancy Indexing in 2D

>>> a = np.arange(0,60).reshape(6,10)
>>> a[0:6,0:6]
>>> a[(0,1,2,3,4),(1,2,3,4,5)]
array([ 1, 12, 23, 34, 45])
>>> a[3:,[0, 2, 5]]
array([[30, 32, 35],
       [40, 42, 45],
       [50, 52, 55]])
>>> mask = np.array([1,0,1,0,0,1], dtype=bool)
>>> a[mask,2]
array([ 2, 22, 52])

Example: Broadcasting & Fancy Indexing

"""
Calculate the sinc function: sin(r)/r. Use a Cartesian x,y grid
and calculate ``r = sqrt(x**2+y**2)`` with 0 in the center of the grid.
Calculate the function for -15,15 for both x and y.
"""
from numpy import linspace, sin, sqrt, newaxis
from pylab import imshow, gray, show
x = linspace(-15,15,101)
# flip y up so that it is a "column" vector
y = linspace(-15,15,101)[:,newaxis]
# because of broadcasting rules, r is 2D
r = sqrt(x**2 + y**2)
# calculate our function
sinc = sin(r)/r
# replace any location where r is 0 with 1.0
sinc[r==0] = 1.0
imshow(sinc, extent=[-15,15,-15,15])
gray()
show()

Array function methods: op.reduce(a, axis)

# 1-D examples
>>> a = array([1,2,3,4])
>>> add.reduce(a)
10
>>> a = array(['ab','cd','ef'], dtype=object)
>>> add.reduce(a)
'abcdef'
>>> a = array([1,1,0,1])
>>> logical_and.reduce(a)
False
>>> logical_or.reduce(a)
True

# multidimensional example
>>> a = arange(40).reshape(8,5)
>>> a
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24],
       [25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34],
       [35, 36, 37, 38, 39]])
>>> b = a[::2,0:3]
>>> b
array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22],
       [30, 31, 32]])
# default axis is 0: sum columns
>>> add.reduce(b)
array([60, 64, 68])
# axis 1: sum rows
>>> add.reduce(b,1)
array([ 3, 33, 63, 93])

Array function methods: op.accumulate(a)
op.accumulate(a) creates a new array containing the intermediate results of the reduce operation at each element in a.

>>> a = array([1,2,3,4])
>>> add.accumulate(a)
array([ 1,  3,  6, 10])
>>> a = array(['ab','cd','ef'], dtype=object)
>>> add.accumulate(a)
array(['ab', 'abcd', 'abcdef'], dtype=object)
>>> a = array([1,1,0])
>>> logical_and.accumulate(a)
array([ True,  True, False], dtype=bool)
>>> logical_or.accumulate(a)
array([ True,  True,  True], dtype=bool)

Array function methods: op.reduceat()
op.reduceat(a, indices) applies op to ranges in the 1-D array a defined by the values in indices. The resulting array has the same length as indices.

>>> a = array([0,10,20,30,40,50])
>>> indices = array([1,4])
>>> add.reduceat(a,indices)
array([60, 90])

NOTE: For multidimensional arrays, reduceat() is always applied along the last axis (sum of rows for 2-D arrays). This is different from the default for reduce() and accumulate().

Array function methods: op.outer()
op.outer(a,b) forms all possible combinations of elements between a and b using op. The shape of the resulting array results from concatenating the shapes of a and b. (Order matters!)
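The slide gives no example for op.outer(), so here is a minimal sketch (my addition) using add.outer:

>>> a = np.array([0, 1, 2])
>>> b = np.array([10, 20, 30, 40])
>>> np.add.outer(a, b)           # result[i, j] = a[i] + b[j]
array([[10, 20, 30, 40],
       [11, 21, 31, 41],
       [12, 22, 32, 42]])
>>> np.add.outer(a, b).shape     # shapes concatenate: (3,) + (4,) -> (3, 4)
(3, 4)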
Array function methods: op.choose()

>>> b
array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22],
       [30, 31, 32]])

# clip lower values up to 10
>>> b < 10
array([[ True,  True,  True],
       [False, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)
>>> choose(b<10, (b,10))
array([[10, 10, 10],
       [10, 11, 12],
       [20, 21, 22],
       [30, 31, 32]])

# clip lower and upper values
>>> lt = b < 10
>>> gt = b > 15
>>> choice = lt + 2 * gt
>>> choice
array([[1, 1, 1],
       [0, 0, 0],
       [2, 2, 2],
       [2, 2, 2]])
>>> choose(choice, (b,10,15))
array([[10, 10, 10],
       [10, 11, 12],
       [15, 15, 15],
       [15, 15, 15]])

Array function methods: op.concatenate()

concatenate((a0,a1,…,aN), axis=0)
The input arrays (a0,a1,…,aN) are concatenated along the given axis. They must have the same shape along every axis except the one given.
>>> concatenate((x,y))
>>> concatenate((x,y),1)
>>> array((x,y))
(Here x and y are two arrays with matching shapes; array((x,y)) stacks them along a new axis.)

NumPy & speed

ams@login1:~> python -m timeit -s "s=0" "for i in range(1000000): s+=i"
10 loops, best of 3: 106 msec per loop
ams@login1:~> python -m timeit "sum(range(1000000))"
10 loops, best of 3: 41.6 msec per loop
ams@login1:~> python -m timeit -s "import numpy" "numpy.sum(numpy.arange(1000000))"
100 loops, best of 3: 4.16 msec per loop

Use NumPy array functions instead of loops, because they have been optimized to work on arrays.

When NumPy "slows down"
If you are only going to iterate over a set of data (an array) one item at a time, expect a slowdown with NumPy, because NumPy adds additional layers of abstraction when calling the underlying C. There is a way around it — DEREFERENCING: pull the elements out into plain Python scalars before the tight loop:

zi = complex(z[i])
qi = complex(q[i])
for iteration in range(maxiter):
    zi = zi*zi + qi

Don't use NumPy as a container!
Adding two elements from a list is much faster than adding two scalars from an array. Python lists are optimized to be containers of arbitrary objects, and using Python functions (sum) on a NumPy array will be slower than using them on a list of floats.

ams@login1:~> python -m timeit -s "import numpy" "numpy.sum((numpy.linspace(10,100,1000)).tolist())"
10000 loops, best of 3: 196 usec per loop
ams@login1:~> python -m timeit -s "import numpy" "sum((numpy.linspace(10,100,1000)).tolist())"
10000 loops, best of 3: 55 usec per loop
ams@login1:~> python -m timeit -s "import numpy" "sum((numpy.linspace(10,100,1000)))"
1000 loops, best of 3: 441 usec per loop

Vectorization = get rid of for loops
A vector function is any function that operates on a whole NumPy array.

import numpy as np
from time import time

# 1. store the data set into a NumPy array
myarray = np.linspace(1,5,1000000)
t1 = time()
for i in range(len(myarray)):
    myarray[i] += 1
print " time for loop\n" + str(time() - t1)

# 2. replace the loop with the equivalent vector operation
t2 = time()
myarray += 1
print " time for vectorized version:" + str(time() - t2)

ams@login1:~/presentation> python vectorsimple.py
time for loop
1.765157938
time for vectorized version: 0.00224208831787

Vectorized functions

module load python vanilla

Where vectorization really counts is in multidimensional arrays, like images:

import time
import matplotlib.pyplot as plt

# img_0001.jpg shows two different proteins, expressed differently over time;
# you can measure the red and green channels.
image = plt.imread("protein.jpg")

# We think the low green values are actually non-specific, background noise.
# Make anything in the green channel less than 40 be a zero.

################ Loop ##########
time3 = time.time()
for i in xrange(0, image.shape[0]):
    for k in xrange(0, image.shape[1]):
        if image[i,k,1] < 40:
            image[i,k,1] = 0
time4 = time.time()
timeloop = time4 - time3
print 'time for loops: ', timeloop

## Built-in vectorized functions
time1 = time.time()
image[image[:,:,1] < 40, 1] = 0
time2 = time.time()
timevec = time2 - time1
print 'time for vectorized: ', timevec
print 'vectorized is', (timeloop / timevec), 'times faster'

plt.imshow(image)
plt.show()
# That's the difference between your code taking a month to run, or a few hours.

There is np.vectorize, but its purpose is to transform functions which are not NumPy-aware (e.g., take floats as input and return floats as output) into functions that can operate on (and return) NumPy arrays.
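A minimal sketch (my addition, with a made-up clip_green helper) of what np.vectorize does:

import numpy as np

def clip_green(v):
    # plain Python function: one float in, one float out
    return 0.0 if v < 40 else v

vclip = np.vectorize(clip_green)   # now it accepts whole arrays
print vclip(np.array([10.0, 50.0, 39.9]))
# [  0.  50.   0.]

Note that np.vectorize is a convenience, not a performance tool: under the hood it still loops in Python.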
Techniques for improving Python performance

Remarks on efficiency: xrange vs range
XRANGE: less memory consuming — it doesn't generate the whole list, it generates one element at a time, as needed.
RANGE: has to generate the whole list of numbers in memory before the enumeration starts.

python xrangeinstead.py
:::Results of test - range v. xrange:::
Adding 1 1000000 times
Runtime using range: 0.236502885818
Runtime using xrange: 0.193695068359

When you only need to iterate, use xrange instead of the range function!

Remarks on efficiency: List vs Dictionary
Lists are searched sequentially; when you need keyed lookups, use a dictionary! How much of a difference do the two make as far as performance goes?

ams@login1:~/presentation> python -m timeit 'tmp=[]; tmp.append(True); x=tmp[0]'
1000000 loops, best of 3: 0.326 usec per loop
ams@login1:~/presentation> python -m timeit 'tmp={}; tmp[0]=True; x=tmp[0]'
1000000 loops, best of 3: 0.259 usec per loop

Remarks on efficiency: if-else vs try-except
The rule of thumb: "exceptions should never happen" — the except block is costly!

python ifelse.py
:::Results of test - if/else v. try/except:::
Square root of 10000000 numbers
Runtime using if/else no errors: 0.748867034912
Runtime using try/except no errors: 0.665756940842
Runtime using if/else with errors: 0.545713186264
Runtime using try/except with errors: 2.7115380764

Once errors actually occur, if-else is much faster than try-except!

Remarks on efficiency: resizing a NumPy array?
Avoid resizing NumPy arrays!!! For arrays that need to grow/shrink, convert them to a list with the tolist() method, add/remove list elements, and convert back to a NumPy array again.

python noresize.py
:::Results of test - numpy.resize v. convert to list:::
grow array 1000000 times
Runtime using numpy.resize: 1.68510603905
Runtime using list conversion: 0.388118982315

Remarks on efficiency: avoid "."
Avoid module prefixes in frequently called functions! Instead, create local aliases:

x = range(1000000)
for i in x:
    x[i] = math.sin(x[i])

# faster: pre-load the function into a local name
fastsin = math.sin
x = range(1000000)
for i in x:
    x[i] = fastsin(x[i])

python avoidprefix.py
:::Results of test - module prefixes:::
Sin of integers 0 to 1000000
Runtime using module prefix: 0.611773014069
Runtime pre-loading function: 0.488955020905

Remarks on efficiency: inlining functions
Function calls are in general expensive in Python, so for small functions a performance gain can be obtained by inlining them manually:

def myfunc(x):
    return x + 1

x = 0
for i in xrange(1000000):
    x = myfunc(x)

# inline version
x = 0
for i in xrange(1000000):
    x = x + 1

python inlinefunctions.py
:::Results of test - inline vs. user-defined functions:::
Adding 1 1000000 times
Runtime using user-defined function: 0.373387098312
Runtime inline function: 0.207398891449

Remarks on efficiency: eval & exec
Let eval and exec work on compiled code!

python eoncompiled.py
:::Results of test - strings v. compiled code for exec:::
Adding 1 10000 times
Runtime using string: 0.162658929825
Runtime using compiled code: 0.00853800773621

Unless eval or exec is called only once with the same argument, you should compile the argument first and thereafter work with the compiled version.
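A minimal sketch (my reconstruction, not the original eoncompiled.py) of the compile-once pattern:

stmt = "x = x + 1"
code = compile(stmt, "<string>", "exec")   # parse and compile the string once

x = 0
for i in xrange(10000):
    exec code                              # reuse the compiled code object

# versus re-parsing the string on every iteration:
x = 0
for i in xrange(10000):
    exec stmt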
Remarks on efficiency: math module vs NumPy
Avoid using NumPy functions with scalar arguments! Use the math module for calculations on a single number; use NumPy for calculations on all the numbers in an array.

python numpynoscalar.py
:::Results of test - math vs. numpy for scalar arguments:::
Sin of integers 0 to 100000
Runtime using math.sin: 0.0568699836731
Runtime using numpy.sin: 0.659817934036

NOTE: These efficiency considerations are significant only when the mathematical functions are called a large number of times!

Case study on Numerical Efficiency
Running a matrix-vector product with a 1000x1000 matrix, four ways:
• Pure Python loops.
• Reduce/map loop: we express the matrix-vector product as a sum of the scalar products between each row and the vector — a for loop in just one dimension.
• Map/reduce.
• Dot product: we rely on NumPy's matrix multiplication operator to perform the scalar product.

python casestudy.py
plain loops: 7.81488e+00, reduce/map loop: 7.77240e+00, map/reduce: 3.36382e+00, dot: 1.10629e-02

A reminder of how filter, map and reduce work:
>>> foo = [2, 18, 9, 22, 17, 24, 8, 12]
>>> print filter(lambda x: x % 3 == 0, foo)
[18, 9, 24, 12]
>>> print map(lambda x: x * 2 + 10, foo)
[14, 46, 28, 54, 44, 58, 26, 34]
>>> print reduce(lambda x, y: x + y, foo)
112

Pickling
Pickling makes a byte-stream representation of any Python object that can be stored, transmitted and unpickled. The byte stream can be used to send objects across a network (MPI) instead of storing them in a file.
In Python you can define functions inside other functions, and classes inside other classes. But because of the way the "pickle" protocol works, you can only pickle/unpickle (and, in the context of mpi4py, send/receive) functions and classes defined at the top level of a *.py file — not any function or class defined inside an outer one.

Pickle vs cPickle
The pickle module is rather slow for large data structures:

import pickle, time, cPickle

pick_size = 2000000
print 'creating pickles %d items big' % (pick_size)
r = range(pick_size)
t = time.time()
for n in xrange(10):
    d = pickle.dumps(r)
print '%d sec pickle' % (time.time()-t)
t = time.time()
for n in xrange(10):
    d = cPickle.dumps(r)
print '%d sec cPickle' % (time.time()-t)
t = time.time()
for n in xrange(10):
    d = cPickle.dumps(r, protocol=pickle.HIGHEST_PROTOCOL)
print '%d sec cPickle latest protocol' % (time.time()-t)

python picklevscpickle.py
creating pickles 2000000 items big
58 sec pickle
5 sec cPickle
1 sec cPickle latest protocol
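To illustrate the top-level-only restriction mentioned under Pickling, a short sketch of my own (the function names are made up):

import cPickle

def top_level():
    return 42

def make_nested():
    def nested():            # defined inside another function
        return 42
    return nested

cPickle.dumps(top_level)     # fine: found as a top-level module name
try:
    cPickle.dumps(make_nested())
except cPickle.PicklingError, e:
    print "nested functions cannot be pickled:", e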
Other useful scientific Python libraries

Cython
Cython is an extension-module writing language that looks a lot like Python, except for optional type declarations for variables. It translates Python-looking code into C code that compiles to reasonably fast C, and works across all platforms (Win/Mac/Lin). The type declarations allow the Cython compiler to replace generic, highly dynamic Python code with specific and very fast compiled code, which can then be loaded into the Python run-time dynamically. Even NumPy arrays can be declared with Cython.

Cython "step by step"

# Load the proper environment
ams@login1:~> module load python/2.7.3-vanilla
Python version 2.7.3-vanilla loaded

# Extract your Python my_function to a new .pyx file
vi my_function.pyx

# "cimport" is used to import special compile-time information about the
# numpy module: we have to tell Cython where to find the NumPy libraries,
# and on the Python side it needs to know where NumPy is
cimport numpy as np
import numpy as np

# annotate Cython types
cdef unsigned int i, j
…

# In your Python file (mandelbrot_final.py), reference the function
# which you are going to compile to C
import my_function

Cython "step by step"… Cython types

# Help Cython by adding annotations — they tell it what the objects are
# (use unsigned int when there are no negative indices in the for loop):
cdef unsigned int i
cdef int iteration
cdef complex zi, qi
cdef list output

Cython setup.py

# create a setup.py file in order to produce a compiled module
vi setup.py

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy

ext = Extension("my_function", ["my_function.pyx"],
                include_dirs=[numpy.get_include()])
setup(ext_modules=[ext], cmdclass={'build_ext': build_ext})

# to compile the extension module in this directory
# (you will get my_function.so):
python setup.py build_ext --inplace

Matplotlib
A Python 2D plotting library which produces publication-quality figures:
• Line plots
• Scatter plots
• Polynomial fitting (np.polyfit)
• Random sampling
• 3D surface plots
• Histograms
• Bar charts
…

Matplotlib on Beagle
It is not recommended to execute Matplotlib commands on the compute nodes. A good approach is instead to pickle your result data, put the Matplotlib commands in a separate script, and execute that script on the login nodes.

Generating and plotting a Mandelbrot set using MPI4PY and Matplotlib:

ams@login1:/lustre/beagle/ams/mandel> qsub mandel.pbs
ams@login1:/lustre/beagle/ams/mandel> vi mandel_final.py
….
import pickle
outfile = open('/lustre/beagle/ams/mandel/mandelbrot_data.dat', 'wb')
pickle.dump(C, outfile)
…
ams@login1:/lustre/beagle/ams/mandel> python mandelplot.py

SciPy
A collection of algorithms and mathematical tools, built on top of NumPy, which gives you structure for handling your arrays of data:
• Special functions (scipy.special)
• Integration (scipy.integrate)
• Optimization (scipy.optimize)
• Interpolation (scipy.interpolate)
• Fourier transforms (scipy.fftpack)
• Signal processing (scipy.signal)
• Linear algebra (scipy.linalg)
• Sparse eigenvalue problems with ARPACK
• Statistics (scipy.stats)
• Weave (scipy.weave)
…

Special Functions

>>> from scipy import special
>>> special?
Airy functions; elliptic functions and integrals; Bessel functions (zeros, integrals, derivatives); spherical Bessel functions; Ricatti-Bessel functions; Struve functions; raw statistical functions (friendly versions in scipy.stats); gamma and related functions; error function and Fresnel integrals; Legendre functions; orthogonal polynomials (15 types); hypergeometric functions… All of your favorite special functions for any occasion!
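A quick taste of scipy.special — a few calls I chose for illustration (outputs rounded):

>>> from scipy import special
>>> special.gamma(5)        # gamma(n) == (n-1)!
24.0
>>> special.erf(1.0)        # error function
0.8427007929497149
>>> special.jv(0, 1.0)      # Bessel function of the first kind, order 0
0.7651976865579666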
Building N-dimensional arrays

>>> import numpy as np
>>> import scipy as sp

NumPy methods: arange, zeros, ones, linspace.
SciPy methods:
• mgrid[] — get equally spaced points in N output arrays for an N-dimensional (mesh) grid.
• ogrid[] — construct an "open" grid of points (not filled in, but correctly shaped so that math operations broadcast correctly).

>>> sp.mgrid[0:5]
array([0, 1, 2, 3, 4])
>>> sp.mgrid[0:1:5j]
array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])
>>> sp.mgrid[0:3,0:3]
array([[[0, 0, 0],
        [1, 1, 1],
        [2, 2, 2]],
       [[0, 1, 2],
        [0, 1, 2],
        [0, 1, 2]]])
>>> np.mgrid[0:3,0:3].shape
(2, 3, 3)
>>> sp.ogrid[:3,:3]
[array([[0],
        [1],
        [2]]), array([[0, 1, 2]])]

Weave: Inline C Code
You can run C code in Python through the weave module! It understands NumPy arrays.
weave.inline(): includes C/C++ code directly in Python code, with on-the-fly compilation and execution.

wtest.py
#! /usr/bin/env python
import numpy as np
import scipy
from scipy import weave

x = np.arange(1000)/1000.
y = np.empty(1000, dtype=np.double)
n = len(x)
code = """
int i;
for (i=0;i<n;i++) {
    if (*x<0.5)
        *y = exp(-(*x)*2);
    else
        *y = exp(-(*x));
    x++; y++;
}
"""
weave.inline(code, ['n','x','y'])
print x, y

Inline C Code on Beagle
From /lustre/beagle/`whoami` do qsub mypbs.pbs:

#!/bin/bash
#PBS -N myjob
#PBS -l walltime=00:30:00
#PBS -l mppwidth=24
#PBS -j oe
. /opt/modules/3.2.6.7/init/bash
module swap PrgEnv-cray PrgEnv-gnu
module load python/2.7.3
export HOME=/lustre/beagle/$LOGNAME
export PATH=/soft/python/2.7/2.7.3/python/bin/:$PATH
export LD_LIBRARY_PATH=/soft/python/2.7/2.7.3/python/lib/:$LD_LIBRARY_PATH
export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/numpy/1.7.0/lib/python2.7/site-packages
export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/scipy/0.12.0/lib/python2.7/site-packages
# Location of the libraries for python 2.7
export LDFLAGS="-L/soft/python/2.7/2.7.3/lib"
## This is necessary when using weave, because otherwise python will attempt to work on /home
export PYTHONCOMPILED="/lustre/beagle/ams/mypythonlibs"
cd $PBS_O_WORKDIR
aprun -n 1 -d 24 python wtest.py

Weave Speed Comparison
First set the proper environment. Remember: on Beagle the Cray environment is the default!

vi setup.sh
export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/numpy/1.7.0/lib/python2.7/site-packages
export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/scipy/0.12.0/lib/python2.7/site-packages
module swap PrgEnv-cray PrgEnv-gnu
module swap gcc/4.8.1 gcc/4.7.2
module load python/2.7.3

ams@login1:~> . setup.sh
Python version 2.7.3 loaded

vi weavetimeit.py
#! /usr/bin/env python
import timeit
import numpy as np
import scipy
from scipy import weave

def foo():
    x = np.arange(100)/100.
    y = np.empty(100, dtype=np.double)
    n = len(x)
    code = """
    int i;
    for (i=0;i<n;i++) {
        if (*x<0.5)
            *y = exp(-(*x)*2);
        else
            *y = exp(-(*x));
        x++; y++;
    }
    """
    weave.inline(code, ['n','x','y'])

if __name__ == '__main__':
    from timeit import Timer
    t = Timer("foo()", "from __main__ import foo")
    print t.timeit()

python weavetimeit.py
18.4321630001

And the same work done with pure Python loops:

>>> import timeit
>>> setup = """
... import numpy as np
...
... def my_function(x,y):
...     lx = len(x)
...     for k in xrange(lx):
...         if (x[k] < 0.5):
...             y[k] = np.exp(-x[k]*2)
...         else:
...             y[k] = np.exp(-x[k])
... """
>>> t = timeit.timeit('my_function(np.arange(100)/100., np.empty(100, dtype=np.double))', setup=setup)
>>> print 'Pure Python loops', t
Pure Python loops 531.699547052

WOW!

Python modules for parallel computations

Parallel Programming
You can parallelize your code by splitting up either the function or the data set. Your parallel code is going to run as fast as the slowest process. Communication is the slowest part of a parallel program, and communication between processes on different nodes is the most expensive.

Distributed Memory Model
(diagram in the original slides: several processor/memory pairs connected by a high-speed network)

Python & the Global Interpreter Lock (GIL)
Python does have threads, but by itself it does not use multiple processors, due to the GIL! The GIL ensures that only one thread runs at a time. One thread is always the main one: the one that started the program. The GIL is released on I/O; while a thread is running, it holds the GIL. When we run two threads on different cores they compete for the GIL, so running on two cores can be almost twice as slow as running on one core, due to the huge amount of signal handling. The GIL is not enforced if you use C extension modules in Python, because there you are side-stepping the interpreter.

Python & Multithreading
When to use multithreading? Threads are only good for I/O-bound tasks — e.g., one thread needs to talk to a server, and while it waits for the server to respond, another thread can do other work.

qsub threadedfinal.pbs
('Lapse for thread with threads(', 2, ': ', 9.686963081359863)
('Lapse with multithreading (', 10, ': ', 9.939857006072998)
qsub sequentialfinal.pbs
('Lapse for sequential : ', 8.240514993667603)

For CPU-bound work, performance is worse than in the sequential case!

Python & Multiprocessing
When to use multiprocessing? For non-numerical calculations. Two processes can run at exactly the same time; they don't need to interrupt each other, and there is no interprocess communication. Thread safe! Memory safe! We know how to divide the job up equally among the processes, and how to coordinate them and bring the results back together again: each process does its work and puts the result into a queue. Multiprocessing can spawn multiple processes, but they will still be bound within a single node. Downside: memory overhead.

qsub multiprocfinal.pbs
('Lapse for multiprocessing (', 10, ': ', 1.2027549743652344)
('Lapse for multiprocessing (', 2, ': ', 4.594444990158081)

MPI: Message Passing Interface
When to use MPI? For numerical calculations; for communication among processes which have separate address spaces; once you want to scale to a cluster or supercomputer. It is most useful on distributed-memory machines. Every process has a copy of the same code:

aprun -n 5 python hello.py   # starts all 5 instances of that code

You don't move executables around — you are just passing messages: pure data.

export PYTHONPATH=$PYTHONPATH:/soft/python/2.7/2.7.3/modules/mpi4py/1.3/lib/python2.7/site-packages
MPI4PY quick review
MPI4Py allows a user to develop, wrap or interface with the MPI interface from Python. It uses pickling to serialize objects (any Python object) and pass them between processors, or broadcasts arrays directly.

from mpi4py import MPI    # import the MPI module
comm = MPI.COMM_WORLD     # initializes MPI; refers to all processes (numbered from 0 to n-1)
# A communicator is a logical unit that defines which processes
# are allowed to send and receive messages.
size = comm.Get_size()    # returns the number of processes in the communicator
rank = comm.Get_rank()    # gives the number of a process; each process has a different rank

MPI4Py supports two methods of communication:
• Python's pickling method (lowercase, e.g. bcast(...)): for general Python objects (lists, tuples…) only, because it auto-detects the data type. Your data is serialized into a buffer — you get a bucket of bytes with a size.
• Buffer method (uppercase, e.g. Bcast(...)): MPI4Py natively understands NumPy arrays, and performance will be faster if you use them, but you have to initialize the arrays and specify the data type, e.g. randNum = numpy.zeros(1).

SPMD: Single Program Multiple Data
• The most commonly used model: there is only one file of code, and the code is written so as to assign certain lines to certain processes.
• The MPI standard talks about moving around buffers — raw blocks of data. We use the NumPy module to get access to such buffers: we move around NumPy arrays!
• It is usually bad practice to perform I/O (e.g., call print) from any process besides the root process, though it can often be a useful tool for debugging!!!

MPI Communication: Point-Point
Only two processes participate: a sender and a receiver.
• Send/Recv are blocking functions: they wait until the message is received.
• Isend/Irecv are non-blocking functions: the process can do some work until the message is received.
On a Recv you do not always need to specify the source; you can allow the calling process to accept a message from any process that happens to be sending to it:
from mpi4py.MPI import ANY_SOURCE, or source=MPI.ANY_SOURCE

MPI Communication: Collective
All processes in a communicator participate. Collective communication functions: barrier, reduce, gather, scatter…

Principles to remember:
• Load balancing: a program is inefficient if it is not using all of the available resources (e.g., processes idling because they are waiting on another).
• Communication is expensive: Broadcast and Reduce are designed to optimize communication among the whole communicator. However, any sort of message passing is extremely expensive and is one of the main obstacles to obtaining speedups when parallelizing an algorithm.
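A minimal collective-communication sketch of my own, using the uppercase Bcast with a pre-allocated NumPy buffer as described above:

from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# every rank must allocate the buffer before the broadcast
data = numpy.zeros(3)
if rank == 0:
    data[:] = [1.0, 2.0, 3.0]
comm.Bcast(data, root=0)    # after this, every rank holds [ 1.  2.  3.]
print "rank", rank, "has", data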
Load Balancing: Example on 20 nodes

qsub loadmpi.pbs
qsub non_loadbalanced_mpi.pbs
0.467461109161
0.46592092514
0.468319892883
0.491802930832
0.471796989441
….
0.539043903351

Efficiency comparison

ams@login1:/lustre/beagle/ams/threading/matprint> export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/numpy/1.7.0/lib/python2.7/site-packages
ams@login1:/lustre/beagle/ams/threading/matprint> export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/scipy/0.12.0/lib/python2.7/site-packages
ams@login1:/lustre/beagle/ams/threading/matprint> export PYTHONPATH=$PYTHONPATH:/soft/python/2.7/2.7.3/modules/matplotlib/1.2.1/lib/python2.7/site-packages
ams@login1:/lustre/beagle/ams/threading/matprint> export PYTHONPATH=$PYTHONPATH:/soft/python/2.7/2.7.3/modules/mpi4py/1.3/lib/python2.7/site-packages
ams@login1:/lustre/beagle/ams/threading/matprint> module load python/2.7.3
Switching to GNU compiler environment
Python version 2.7.3 loaded
ams@login1:/lustre/beagle/ams/threading/matprint> python matplotlib_efficiancy.py

Efficiency comparison: calculate prime numbers
(plot in the original slides)

Efficiency comparison: Mandelbrot Set

qsub mandel_mpi_numpy.pbs          # using 10 nodes:
all tasks, mean: 0.27 seconds
all tasks, min: 0.23 seconds
all tasks, max: 0.36 seconds
all tasks, sum: 2.72 seconds

qsub mandel_mpi_numpy_cython.pbs   # using 10 nodes:
all tasks, mean: 0.08 seconds
all tasks, min: 0.04 seconds
all tasks, max: 0.23 seconds
all tasks, sum: 0.84 seconds

Conclusion
• When it comes to doing a lot of heavy number crunching, pure Python is not really an option.
• NumPy can get you most of the way to compiled speeds through vectorization (use of NumPy functions instead of for loops).
• In situations where you still need the last ounce of speed in a critical section and you already know C/C++, Weave is a simple and speedy solution.
• If, however, you are not already familiar with C, then you may find Cython to be exactly what you are looking for to get the speed you need out of Python.
• For parallelism within a node, use multiprocessing; to scale across nodes, use MPI4Py.

Further reading:
Matplotlib example gallery: http://matplotlib.org/gallery.html
Great tutorial on mpi4py: http://www.bu.edu/pasi/files/2011/01/Lisandro-Dalcin-mpi4py.pdf
NumPy Example List with Doc: http://www.biochem-caflisch.uzh.ch/zhou/Numpy_Example_List_With_Doc.html
NumPy / SciPy documentation: http://docs.scipy.org/doc/

Questions
Please write to: Ana Marija Sokovic (creator of this presentation), ams@ci.uchicago.edu,
or Beagle Support, beagle-support@ci.uchicago.edu.
All scripts mentioned in this presentation can be found at: http://wiki.ci.uchicago.edu/Beagle/PYTHON