SERPent Cookbook

Version 25/09/12

Luke W. Peck (email: lwp@star.ucl.ac.uk) & Danielle M. Fenech (email: dmf@star.ucl.ac.uk)

October 8, 2012

Contents

1 Introduction to SERPent
  1.1 SumThreshold Method
  1.2 SERPent's Implementation of the SumThreshold Method
  1.3 Outputs from SERPent

2 Running SERPent
  2.1 File Inputs
  2.2 e-MERLIN Reduction Tools
  2.3 Baseline Choice
  2.4 Flagging Options

3 SERPent Requirements, Performance, and Future Improvements
  3.1 Requirements
  3.2 Performance
    3.2.1 RFI Flagging Performance
    3.2.2 Lovell Stationary Scan Performance
    3.2.3 Speed Performance
  3.3 Future Improvements and Work

4 Other Notes on SERPent
  4.1 SERPent Updates and Bug Fixes
  4.2 FAQ
  4.3 FUQ (Frequently Unasked Questions)

*** Please note that SERPent is by no means a finished product and is still in its beta-testing stage at this time.
***

1 Introduction to SERPent

Modern interferometers are becoming ever more sensitive and powerful, and their data sets ever larger, so the automation of certain procedures in the reduction and calibration of interferometric data has become vital. The main 'bottleneck' in the calibration procedure is the removal of Radio Frequency Interference (RFI). Flagging bad visibilities was traditionally a manual task, whereby the astronomer would cycle through visibilities and highlight bad data points. This was sufficient while datasets were of the order of Megabytes (MB) and took a 'reasonable' amount of the astronomer's time, albeit being a tedious process. However, with improvements in receivers, electronics, correlators, optical fibre networks and other aspects of the interferometric setup, datasets are now of the order of Gigabytes (GB), and observations now span wide frequency ranges into bands which are not protected for radio astronomy, increasing the incidence of RFI. The Scripted E-merlin Rfi-mitigation PypelinE for iNTerferometry (SERPent) was created to tackle this problem for the RFI environment affecting e-MERLIN, using ParselTongue, a Python-based language which is able to call upon AIPS tasks.

1.1 SumThreshold Method

The most effective thresholding method was demonstrated by Offringa et al. 2010b [3] to be SumThreshold, and this is the adopted RFI detection method. An overview of the method is given here; for a more in-depth analysis of the method please see the aforementioned literature.

Threshold methods work on the basis that RFI increases visibility amplitudes for the times and frequencies at which it is present. There will therefore be a considerable difference compared to RFI-free visibility amplitudes, making the RFI statistical outliers. If these outliers exceed a certain threshold condition then they are detected and flagged.
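This basic principle can be sketched in a few lines of NumPy (the amplitudes and the cutoff here are invented, purely to illustrate plain per-sample thresholding):

```python
import numpy as np

# Plain thresholding: flag any visibility amplitude above a fixed cutoff.
# 'amps' is a hypothetical 1-D array of visibility amplitudes; the cutoff
# value is arbitrary and only illustrates the principle.
amps = np.array([1.2, 0.9, 55.0, 1.1, 0.8, 120.0])
threshold = 10.0
flags = amps > threshold          # boolean mask of statistical outliers
print(np.flatnonzero(flags))      # indices of flagged samples -> [2 5]
```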
The threshold level is dictated by the statistics of the sample population, which can be the entire observation (all time scans, frequency channels, baselines etc.) or a smaller portion, for example separate baselines and IFs. This has the advantage not only of increased computational performance (Python is faster when operating on many smaller chunks of data rather than one big chunk), but also of more reliable statistics: RFI may be independent of baseline, and its distribution may differ between IFs, so splitting the data helps. This is particularly relevant for L band observations, where the RFI is more problematic.

The SumThreshold method works on data which is separated by baseline and stokes and arranged in a 2D array, with the individual time scans and frequency channels comprising the array axes, i.e. time-frequency space. The frequency channels are further split by IF for the reasons stated above. The idea is that peak RFI and broadband RFI will be easily detectable when the visibility amplitudes are arranged in time-frequency space.

The e-MERLIN correlator outputs three numbers associated with any single visibility: the real part, the imaginary part and the weight of the visibility. When appending visibilities to the time-frequency array, if the weight is greater than 0.0 (i.e. data exists for that time and frequency), then the magnitude of the complex visibility is taken to constitute the amplitude. If the weight is 0.0 or less (i.e. no data exists for this baseline, time scan etc.), then the amplitude is set to 0.0. This has no effect on the statistics or threshold value, but acts as a substitute for that elemental position within the array. The Python module NumPy was employed to create and manipulate the 2D arrays, as its core is implemented in compiled code (intrinsically faster than pure Python) and has been optimized (note that how this module is compiled and called upon can have a significant effect on performance).
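The array-filling rule just described might look like the following sketch. The visibility records and array shapes here are invented; SERPent's actual code reads these values from AIPS via ParselTongue.

```python
import numpy as np

# Building the time-frequency amplitude array described above.
# Each visibility is (real, imag, weight); records with weight <= 0.0
# get amplitude 0.0 as a placeholder. Shapes and values are illustrative.
ntimes, nchans = 3, 4
amp = np.zeros((ntimes, nchans))

# Hypothetical correlator output: vis[t][f] = (real, imag, weight)
vis = [[(1.0, 1.0, 1.0)] * nchans for _ in range(ntimes)]
vis[1][2] = (5.0, 12.0, 1.0)   # a strong point, |5 + 12j| = 13
vis[2][0] = (9.9, 9.9, 0.0)    # weight <= 0: no data for this sample

for t in range(ntimes):
    for f in range(nchans):
        re, im, wt = vis[t][f]
        amp[t, f] = np.sqrt(re**2 + im**2) if wt > 0.0 else 0.0

print(amp[1, 2], amp[2, 0])    # -> 13.0 0.0
```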
There are two concepts associated with the SumThreshold method: the threshold and the subset size, i.e. a small slice of the total elements (in this case visibility amplitudes) in a certain direction of the array (time or frequency). The difference between the SumThreshold method (a type of combinatorial thresholding) and normal thresholding is that after each individual element in the array has been tested against the first threshold level χ1, the values of a group of elements can be averaged and tested against a smaller threshold level χi, where i is the subset number, i.e. the number of elements averaged and tested. Empirically a small set of subsets i = [1, 2, 4, 8, 16, 32, 64] works well (Offringa et al. 2010b) [3]. A window of size i cycles through the array in one direction (e.g. time) over every possible position for the given array and subset size. After each subset cycle a binary array of identical size records the positions of any elements which are flagged: 0.0 denotes a normal visibility and 1.0 signifies RFI in the time direction (2.0 for the frequency direction, and higher values for any subsequent runs of the flagger). At the beginning of the next subset cycle, for any element within the flag array whose value is greater than 0.0, the corresponding amplitude in the visibility array is reduced to the threshold level χi, which progressively gets smaller with increasing subset size. If the average of a group of elements of any subset size i is found to be greater than the threshold level χi, then all elements within that window are flagged. This method is applied in both array directions (i.e. time and frequency).

1.2 SERPent's Implementation of the SumThreshold Method

In addition to the SumThreshold methodology, certain clauses have been added to prevent the algorithm from overflagging the dataset. If any threshold level reaches the mean plus the variance estimate, the flagging run for that direction stops.
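A one-directional pass of this scheme can be sketched as follows. This is a simplified illustration, not SERPent's actual implementation: the flag array here is boolean rather than the multi-valued array described above, and the thresholds are assumed to shrink with subset size as χ_i = χ_1 / ρ^(log2 i), as in the SumThreshold literature.

```python
import numpy as np

def sumthreshold_1d(amps, chi1, rho=1.5, max_subset=32):
    """One-direction SumThreshold pass (simplified sketch).

    Slides a window of size i = 1, 2, 4, ... max_subset along 'amps';
    if the window mean exceeds the threshold chi_i, every sample in the
    window is flagged. Previously flagged samples are replaced by chi_i
    so they cannot dominate the later, larger-window tests.
    """
    amps = amps.astype(float)
    flags = np.zeros(amps.shape, dtype=bool)
    i = 1
    while i <= max_subset:
        chi_i = chi1 / rho ** np.log2(i)     # threshold shrinks with i
        amps[flags] = chi_i                  # clip earlier detections
        for start in range(len(amps) - i + 1):
            window = slice(start, start + i)
            if amps[window].mean() > chi_i:
                flags[window] = True
        i *= 2
    return flags

# A single strong spike plus a run of weaker RFI (values invented):
data = np.array([1.0, 1.0, 50.0, 1.0, 5.0, 5.0, 5.0, 5.0, 1.0, 1.0])
print(sumthreshold_1d(data, chi1=10.0, max_subset=8))
# -> flags at indices 2 and 4-7: the spike is caught at subset size 1,
#    the weaker run only once the averaged test reaches subset size 4.
```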
The flagging process can run multiple times at the cost of computational time; the code's default is a second run, performed if the maximum value within the array is a certain factor of the median and if there are flags from the previous run. On this second run all flagged visibilities from the first run are set to 0.0 in the visibility array so they do not skew the statistics, and the run can then look for weaker RFI which may remain. This may be necessary: some RFI in the e-MERLIN commissioning data was found to be over 10,000 times stronger than the astronomical signal, with weaker RFI still present underneath. Note that the first run's subsets increase in size in binary steps up to 32, while the second run goes deeper, up to 256. These can easily be changed manually to lower values to save time if there isn't much RFI in the observations.

The first threshold level can be calculated by a range of methods and statistics. The variance of a sample is an important component of this threshold, and various methods are described and tested by Fridman (2008) [1]. The author concluded that Exponential Weighting is the best method from the point of view of LOSS: a measure of the difference in standard deviation between a robustly estimated variance and a simple estimate, in the absence of outliers. The Median Absolute Deviation (MAD) and the Median of Pairwise Averaged Squares, however, are the most effective ways to remove outliers, although neither is as efficient as the other methods, i.e. they need more samples to produce the same power. Since the sample size in any given observation from e-MERLIN will be of adequate size, this is not such an issue. The breakdown point of MAD is also very high (0.5), i.e. almost half the data may be contaminated by outliers (Fridman 2008) [1]. MAD is adopted by this algorithm due to these robust properties.
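The MAD estimator is a one-liner in NumPy. This sketch (with invented data) also shows the robustness to outliers that motivates the choice:

```python
import numpy as np

# Median Absolute Deviation, with the 1.4826 factor that makes the
# estimate consistent with a Gaussian standard deviation.
def mad(x):
    x = np.asarray(x, dtype=float)
    return 1.4826 * np.median(np.abs(x - np.median(x)))

# One gross outlier barely moves the MAD, while it dominates the
# ordinary standard deviation of the same data (~39 here):
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
print(mad(x))    # -> 1.4826
```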
Again, the author stresses that the type and intensity of RFI, the type of observation and the method of implementation are important factors when deciding which estimate to use for any given interferometer.

The variance estimate (MAD) used in the SERPent algorithm is defined by equation 1, where median_i(x_i) is the median of the original population. Each sample of the population is modified by taking the absolute value of the sample minus this median. The median of this new, absolute median-subtracted population is taken and multiplied by a constant 1.4826 to make the estimate consistent with that of an expected Gaussian distribution.

MAD = 1.4826 · median_j{ |x_j − median_i(x_i)| }    (1)

The first threshold level χ1 is thus determined by an estimate of the mean x̄, the variance σ and an aggressiveness parameter β (equation 2) (Niamsuwan, Johnson & Ellingson 2005) [2]. Since the median is less sensitive to outliers, it is preferred to the traditional mean in this equation (thus x̄ = median), and the MAD to the traditional standard deviation for the variance, for similar reasons (σ = MAD). If the data are Gaussian in nature then the MAD value will be similar to the standard deviation (and the median to the mean). A range of values for β was tested until a stable value was found for multiple observations and frequencies, of around β = 25. Increasing the value of β reduces the aggressiveness of the threshold and decreasing the value increases the aggressiveness.

χ1 = x̄ + βσ    (2)

The subsequent threshold levels are determined by equation 3, where ρ = 1.5 (which empirically works well for the SumThreshold method (Offringa et al. 2010b) [3]) defines how 'coarse' the difference between threshold levels is, and i is the subset size.
χi = χ1 / ρ^(log2 i)    (3)

In summary, SERPent first calculates the median and MAD for each IF, baseline and stokes, and removes any strong individual visibility which is over the first threshold. It then sets any flagged visibility amplitudes to zero and recalculates the statistics. It then starts the 'first' run of the SumThreshold method in both the time and frequency directions. After this is completed it again sets any flagged visibility amplitudes to zero and recalculates the statistics. The second SumThreshold run is then performed to try to remove remaining weak RFI.

1.3 Outputs from SERPent

SERPent reads visibilities directly from AIPS, finds the RFI and records the flags in a NumPy array. This array is then pickled using the cPickle module in Python, and named according to its source name, baseline and IF number. These Pickle files will be located in whatever folder the user designated as the output folder (see section 2.1). A corresponding dictionary entry is created containing this same pickle name as the key, and all the information (source number, baseline, time information for the time axis, etc.) which will be used when creating the FG rows. A dummy text file is then created as the Pickle files are read and the FG rows are created. Once all the Pickle files have been read and the FG rows appended to the dummy file, the actual FG text file is created and the FG extension table header is written. All the rows from the dummy file are then appended below the header in the real FG text file, and the dummy file is deleted. SERPent then runs the AIPS task REFLG, which condenses FG rows in an efficient manner and outputs another FG text file with the condensed rows.

The outputs that will be found in the user's directory are: the Pickle files, the original FG text file named <name-of-source>.fg, and the REFLG FG text file named <name-of-source>_r.fg. SERPent will also automatically read in these flag files, so the user will have 2 FG tables for their source in AIPS.
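The pickle round-trip described in this section can be sketched as follows. The file-naming convention shown is illustrative only, not SERPent's exact scheme, and the modern pickle module stands in for Python 2's cPickle.

```python
import pickle   # SERPent itself uses the Python 2 cPickle module
import numpy as np

# Sketch of pickling a per-baseline, per-IF flag array and reading it
# back, as done when assembling the FG rows. Names are hypothetical.
flags = np.zeros((4, 8))       # time x frequency flag array
flags[1, 3] = 1.0              # one RFI hit flagged in the time direction

name = "1436+6336__5-7__IF1.p"
with open(name, "wb") as f:
    pickle.dump(flags, f)

with open(name, "rb") as f:
    restored = pickle.load(f)

print(np.array_equal(flags, restored))   # -> True
```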
2 Running SERPent

There are only two files associated with SERPent: SERPent.py, the main script file where all the business is conducted, and SERPent_input.py, the input file where the user has to enter details about the AIPS data and what they want SERPent to do. To run the script, make sure both files are in the same directory as each other and simply type in the terminal:

parseltongue SERPent.py

and some printouts will appear on your terminal. The next few sections describe the SERPent input file and what the user needs to enter.

2.1 File Inputs

There are six variables in total which require details about your observation, whether you require a log file containing timed SERPent runs, the number of CPUs available on your computer and your designated directory path. Below is an example of what you'll see.

# Data Information:
AIPS_user_number = 101   # The AIPS user number the data is on.
Name = '1436+6336'       # The uvdata name of the catalogue.
Klass = 'SPLIT '         # The uvdata klass of the catalogue.
Disk = 1                 # The uvdata disk number of the catalogue.
Seq = 1                  # The uvdata sequence number of the catalogue.

# Log Information:
write2log = 'no'         # 'yes' only if using the multi.py script file to
                         # write timing details to a log file (only used
                         # for performance testing)

# Parallelization Information:
NCPU = 4                 # Define here the number of CPUs you want to use.
                         # Parallelization is currently implemented on a
                         # baseline basis and thus the maximum number of
                         # CPUs utilized will be the number of baselines in
                         # the data, i.e. for e-MERLIN a maximum of 21 CPUs
                         # will be used by SERPent (21 baselines).

# Directory Information:
path2folder = '/home/lwp/aips_test/'

The first five variables relate to AIPS. AIPS_user_number is the AIPS user number of the catalogue your observation is loaded to, and should be an integer. Name is the name of your observation (probably the name of the source for single-source files) and is a string.
Klass is the class name of your observation and is a string. Disk is the disk number your catalogue is located on and is an integer. Seq is the sequence number of the observation you want to run SERPent on and is an integer. write2log is a variable for keeping a log of multiple runs for performance tests with SERPent using different numbers of CPUs, sources etc.; for the majority of users this can be ignored. NCPU defines the number of CPUs you want to run SERPent on. For information concerning how the number of CPUs affects the performance of SERPent, please consult section 3.2.3. The final variable, path2folder, defines the directory where you want all the SERPent outputs such as the Pickle files and FG text files to go, and is a string.

SERPent can flag multi-source files (e.g. Klass = 'MULTTB') in its current condition, but with the following constraints: all the sources must have been SPLIT from the multi-source file and be on the same disk number and sequence number as the multi-source file; the Names of the SPLIT files must be the source names given in the SU table of the multi-source file; and you must input the details of the multi-source file in the variables in the input file. It is recommended however that SERPent is run on single-source files, as the files will be big for e-MERLIN / new-generation interferometers and will take some time to run.

2.2 e-MERLIN Reduction Tools

The next couple of variables relate to other reduction passages written into SERPent as a consequence of issues arising in our COBRaS e-MERLIN commissioning data. Currently there are two extra reduction tools available in SERPent in addition to the SumThreshold RFI flagging sequence: one for removing the Lovell Stationary Scans and another for removing in-scan zero-level amplitudes.
# Phase Cal Information
phasecal = '2007_0718'   # If one of the sources (multi files) or the source
                         # is the phase cal, please write the name of the
                         # source as this variable. Else put 'no'. This
                         # information is used for the Lovell Stationary
                         # Scans passage of SERPent if the source is the
                         # phasecal designated here, the telescope is
                         # e-MERLIN and the baseline contains the Lovell
                         # telescope.

# Execute the Zero Level code to flag telescope dropouts and other system
# failures where the visibilities drop to around 0 Jy in the same scans
# where good data is present.
zero_level = 'yes'       # To execute this passage set this variable to
                         # 'yes', else set it to 'no'.

phasecal defines the phase calibrator for your observation. If this is a single-source file and the source is the phase calibrator then this should be the same as the Name input above. If you are running a multi-source file then this should be the name of the catalogue which corresponds to the phase calibrator. The Lovell Stationary Scan passage will only run if the array is e-MERLIN, on baselines which contain the Lovell telescope and on data from the phase calibrator.

The zero_level variable refers to a passage which searches for zero- or low-level amplitudes within scans which contain good data. This acts upon all of the data. Select either 'yes' or 'no' for this option.

2.3 Baseline Choice

There are two variables in the input file which dictate which baselines are to be flagged by SERPent. Below is what you'll be confronted with in the SERPent input file.

# Baseline Information for Flagging
which_baselines = 'choose'   # Variable to define whether to flag all
                             # baselines or a select few.
                             # Options are: 'all' or 'choose'.
baselines = ['5-7', '7-8']   # Chosen baselines for flagging in the correct
                             # format. The order of the baselines does not
                             # matter.

The first variable is called which_baselines, and has two recognized inputs, 'choose' or 'all', both of which are strings.
The 'all' choice selects all baselines in the observation for flagging and ignores the second baseline variable (below). The 'choose' option allows the user to select specific baselines for flagging; the user needs to already know which baselines they want flagged and the associated antenna numbers. If the user has selected which_baselines = 'choose', then the second variable, baselines, will be read by the script and needs the necessary information. The baselines variable is a Python list which requires strings in the strict format antenna number-antenna number; i.e. to include the baseline with antennas 5 and 7, the item in the list takes the form '5-7'.

2.4 Flagging Options

The final part of the input file concerns the flagging options. This gives the user complete control over the available flagging options, without having to dig around in the code to modify them. There is a 'default' setting which flags well, so it's not essential to change anything, but if you want to tinker with the parameters then see below for a description of the variables you'll see in the input file.

# Flagger Options:
flagging_options = 'choose'     # Variable to define whether to use the
                                # flagging options below or the default
                                # options in the SERPent.py file. 'default'
                                # ignores whatever variables are set in
                                # this file.
                                # Options are: 'choose' or 'default'
aggressiveness_first_run = 25   # How aggressive the first run is
                                # Note: a lower number is more aggressive
max_subset_first_run = 32       # Maximum subset for the first run of SERPent
                                # Note: should be a binary number: 1,2,4,8,16...
aggressiveness_second_run = 25  # How aggressive the second run is
max_subset_second_run = 256     # Maximum subset for the second run of SERPent
rho = 1.5                       # Difference in coarseness between each
                                # threshold level

The first variable, flagging_options, decides whether SERPent should use its inbuilt parameter values or read the user-defined ones. For the record, the values displayed above are the current SERPent 'default' settings. This variable has two options, 'choose' or 'default', both of which are strings. 'default' ignores the user-input parameter values and uses the SERPent default values; 'choose' makes SERPent read the rest of the variables in this section.

The variable aggressiveness_first_run defines the aggressiveness of the first run of the SumThreshold method. A higher value makes the run less aggressive, and a lower value more aggressive. This value should be an integer, although a floating point value is also acceptable.

The max_subset_first_run variable defines the maximum size a subset will take (see section 1.1 for more information). This needs to be a binary number (1, 2, 4, 8, 16, 32, 64...) and thus an integer, as the flagging starts with a subset size of 1 which is then doubled until the maximum subset size is met. As the thresholds are constantly lowered following equation 3, the larger subsets test for sustained weaker RFI in both time and frequency.

The next two variables, aggressiveness_second_run and max_subset_second_run, are the same as the first-run parameters, except they correspond to the second run of the SumThreshold method. The default aggressiveness is the same as for the first run, but the default maximum subset differs.
Here we allow the last run to search for weak RFI and have thus increased the maximum subset size to 256, although SERPent may kick out before this maximum is reached if the thresholds get too close to the median of the sample, to avoid flagging good data.

3 SERPent Requirements, Performance, and Future Improvements

3.1 Requirements

SERPent has been run on a number of systems and seems to be fairly stable. Here are the versions which we are running the code on, which should probably be considered the 'minimum' requirements for the code to work:

AIPS release 31DEC11
Python 2.6.5
ParselTongue 2.0 (with Obit 1.1.0)
NumPy 1.6.1

3.2 Performance

Here we document the performance of the early test runs of SERPent on old MERLIN data, early e-MERLIN commissioning data and RFI test data supplied by Rob Beswick (Jodrell Bank). Table 1 shows details of the datasets tested here. All tests used SERPent version 31/07/12.

Table 1: SERPent Performance Test Datasets

Telescope | Dataset Name             | Size    | Band | Visibilities | Sources | Baselines | IFs | Channels | Stokes
MERLIN    | M82V                     | 212 MB  | L    | 82692        | 6       | 21        | 1   | 31       | 2
e-MERLIN  | RFI Test Data: 1436+6336 | 1.63 GB | L    | 5812         | 1       | 10        | 12  | 512      | 4
e-MERLIN  | COBRaS W1 2011: 0555+398 | 2.33 GB | C    | 99149        | 1       | 10        | 4   | 128      | 4

3.2.1 RFI Flagging Performance

SERPent has been tested on both L and C band observations and has been found to flag almost all C band RFI and the majority of L band RFI (more recent runs are still needed, as SERPent has been significantly updated since these tests). The remaining L band RFI is usually weak broadband RFI or very weak RFI close to the median value of the sample.

Firstly we present some results from L band data. Figure 1 shows some RFI test data of 0555+398 (data courtesy of Rob Beswick) with one baseline displayed via the AIPS task SPFLG in time-frequency space. The first IF is completely wiped out with noisy data, and some weak broadband RFI remains in the central IFs.
Almost everything else has been flagged, including some very intricate RFI which cannot be flagged as accurately by more simplistic RFI flagging routines.

Figure 1: AIPS SPFLG image of 0555+398, L band, baseline 7-8, stokes: RR, IFs: 1-12 after SERPent flagging. The AIPS task REFLG was also deployed in this image. The vertical axis is time and the horizontal axis is frequency.

The L band results have shown that SERPent can flag complicated RFI in time-frequency space, and figure 2 shows this also applies to the C band, with the infamous 'wiggly' RFI found in commissioning data. Note that this was very poor quality data and SERPent even started to flag some of the noise. However, this is a good example of the thresholding method in action.

Figure 2: AIPS SPFLG image of 0555+398, C band, baseline 5-7, stokes: RR, IF: 2 before (left) and after SERPent flagging (right). The AIPS task REFLG was also deployed in this image. The vertical axis is time and the horizontal axis is frequency.

To add to the archive of results we present some commissioning COBRaS flagging results for both the L and C band in figures 3 and 4.

3.2.2 Lovell Stationary Scan Performance

A problem unique to the e-MERLIN array is the Lovell stationary scan. Due to the size of the Lovell telescope and the consequent slew time, the Lovell telescope only participates in every alternate phase-cal scan, remaining stationary on the target for the other scans. The other antennas in the array are not affected. This results in the visibilities from baselines containing the Lovell telescope having two different amplitude levels for the phase-cal. In most cases the phase-cal will be brighter than the target, so when the Lovell is observing the phase-cal the received flux will be greater than when the Lovell does not participate in the phase-cal scan and remains on the target source. This behaviour can be seen using the IBLED task within AIPS on the phase-cal source, as figure 5 clearly shows.
This figure also displays another problem with early e-MERLIN commissioning data: multiple amplitude levels for scans throughout the observation. This property has been traced to hardware issues within the receivers, and new filters appear to have resolved the issue for future observations. However, it was necessary to normalize out this problem before flagging this dataset. In the main window each group of points represents one scan, for which there are three distinct amplitude levels. The highest two levels are scans where the Lovell telescope contributes to the observation (including the aforementioned filter issues affecting amplitude levels) and the lowest-level scans are where the Lovell does not contribute.

Figure 3: AIPS SPFLG image of a target pointing for COBRaS: 2033+4113, L band, baseline 6-8, all stokes combined, IFs: 12. The AIPS task REFLG was also deployed in this image. The vertical axis is time and the horizontal axis is frequency.

Across the entire observation (top panel) the Lovell stationary scans are consistent in magnitude and alternate with every other scan, despite the varying amplitude levels of the Lovell on-source scans, indicating that the Lovell dropout scans are indeed the cause of the lowest-level scans in figure 5.

If the array is e-MERLIN, SERPent runs an extra piece of code which first determines the Lovell baselines. It makes a first run through all the integration times, isolates each scan, and evaluates the magnitude of each scan, the highest and lowest scan statistics and the integration time step. A second run again isolates each individual scan and tests the following condition: if the mean of the scan is within ±σ of the lowest mean found in the previous run, then the entire scan is flagged. The results are written to a text file via the cPickle Python module and are combined with the main SumThreshold flagging results later in the script.
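The scan test just described can be sketched as follows. The scan data are invented, and since the text does not spell out exactly which σ is used, the spread of the scan means stands in for it here:

```python
import numpy as np

# Simplified sketch of the Lovell stationary-scan test: a scan is flagged
# when its mean amplitude lies within +/- sigma of the lowest scan mean
# found on a first pass over the data. Scan amplitudes are invented.
scans = [
    np.array([10.0, 10.2, 9.8]),   # Lovell observing the phase cal
    np.array([2.0, 2.1, 1.9]),     # Lovell stationary (dropout) scan
    np.array([10.1, 9.9, 10.0]),
    np.array([1.9, 2.0, 2.2]),
]

means = [s.mean() for s in scans]
lowest = min(means)
sigma = np.std(means)              # stand-in spread estimate

flag_scan = [abs(m - lowest) <= sigma for m in means]
print(flag_scan)                   # -> [False, True, False, True]
```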
Figure 6 shows the IBLED task window for the same phase-cal source as in figure 5 after the Lovell stationary scans have been removed by SERPent.

3.2.3 Speed Performance

To increase computational performance we have parallelized SERPent by splitting the data into 'jobs' which are then evenly distributed across a number of CPUs. SERPent is parallelized over both baselines and IFs to maximize the even spread across CPUs, and uses however many CPUs the user designates in the input file. Our initial tests on modest data sizes reveal a significant increase in performance which scales linearly with data size. The ultimate speed performance would come from a full multi-threading implementation of SERPent across a network. Whilst there are facilities available to us at UCL (4 Leviathan boards, each with 16 CPUs and 100 GB RAM), I believe few such facilities are available to others. It would also require a rewrite of the SERPent code, which would take time, so for the foreseeable future this is not a likely option.

We have analysed a number of datasets for computational performance on two different computer systems. One is a common desktop computer, and the other is a more powerful system. Table 2 gives details of the computer systems we have tested SERPent performance on.

Figure 4: AIPS SPFLG image of a target pointing for COBRaS: 2032+4113, C band, baseline 1-5, all stokes combined, IFs: 4. The AIPS task REFLG was also deployed in this image. The vertical axis is time and the horizontal axis is frequency.

Table 2: Computer Systems

Computer Name | Memory (GB) | NCPUs
Leviathan     | 100         | 16
Desktop       | 4           | 4

We present the computational performance on the 'RFI Test Data' (L band), as it contains the most jobs (120) and is of modest enough size to allow multiple runs over a range of CPUs. Figure 7 demonstrates the time taken for each system to flag the same small dataset.
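The job-splitting scheme described at the start of this section can be sketched with Python's multiprocessing module; flag_job here is a hypothetical stand-in for SERPent's per-job flagger, and the baseline names are invented.

```python
import multiprocessing as mp

# Sketch of SERPent-style parallelization: one 'job' per (baseline, IF)
# pair, distributed over NCPU worker processes.
def flag_job(job):
    baseline, IF = job
    # ... the real code would run the SumThreshold flagger here ...
    return (baseline, IF, "flagged")

if __name__ == "__main__":
    baselines = ["5-7", "5-8", "7-8"]
    n_IF = 4
    jobs = [(bl, IF) for bl in baselines for IF in range(1, n_IF + 1)]

    NCPU = 4
    with mp.Pool(processes=NCPU) as pool:
        results = pool.map(flag_job, jobs)   # 12 jobs spread over 4 CPUs
    print(len(results))                      # -> 12
```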
As the number of CPUs increases, the marginal gain in performance decreases, but there is still a significant gain even at a high number of CPUs. Note that full e-MERLIN data sizes may well be 100 times the size of this dataset, so the small gains here will magnify.

Another way to demonstrate the importance of a high number of CPUs is given in figure 8. Here the performance relative to a single CPU on the same system is shown to follow two linear relations with a turn-off point around 9 CPUs. This will be due to a factor-like effect from the number of jobs being processed over a certain number of CPUs. We can infer that using 16 CPUs on this dataset increased the performance by a factor of 7 compared to using only 1 CPU. Runs on other datasets gave similar results.

Figure 5: AIPS IBLED task window, displaying the phase-cal source 2007+404, stokes LL (for greater clarity). The top panel shows all scans for the entire observation run, and the main central panel shows a small selection of scans for closer inspection, before running SERPent.

It can also be seen that increasing the amount of memory also increases the computational performance, albeit by a smaller amount than the parallelization. Leviathan has 25× the memory of the standard desktop computer in our tests and is consistently faster by a factor of 1.7 when comparing across multiple numbers of CPUs for both computers and datasets. This shows that the limiting factor when running SERPent on interferometric datasets is the sheer volume of data that needs processing, not RAM.

3.3 Future Improvements and Work

We now have a parallelized version of SERPent, but we are looking for ways to further increase the performance (though not in the near future). Segregating the data further into smaller time scans is one possible route of inquiry.

Currently SERPent utilizes the AIPS task REFLG and the various options it possesses.
We have briefly experimented with these options and have found some inputs which do a reasonable job of condensing the number of FG rows. This is definitely subject to future testing, and may be included in the input file as another choice for the user as well as a default option. Currently the 'default' cparm options (1-7) are: [0, 4, 0.70, 0.70, 0.70, 0.70, 2], but this may change with more tests, and we welcome the user to try different options and give us feedback.

On a related note, work still needs to be done on the number of FG rows in the extension tables. Currently different AIPS tasks have different limits on how many FG rows can be read in, and AIPS has an absolute upper limit of around 1 million rows. Preliminary outputs from COBRaS commissioning data have the number of FG rows at around 6 million after running SERPent! Recent (15/06/2012) email correspondence with Eric (NRAO) resulted in him conceding that the AIPS limit may have to be increased to 10 million. This, however, does not resolve the issue, and it may be up to us to write the REFLG task in Python and add it as part of SERPent, since we cannot even load the FG table into AIPS to use the REFLG task. Another alternative would be to chop up the FG text file and input it in pieces into AIPS.

Figure 6: AIPS IBLED task window, displaying the phase-cal source: 2007+404, Stokes LL (for greater clarity). The top panel shows all scans for the entire observation run, and the main central panel shows a small selection of scans for closer inspection after running SERPent. The lowest-level scans present in figure 5 have been removed.

4 Other Notes on SERPent

4.1 SERPent Updates and Bug Fixes

Here I've put some of the updates / bug fixes to SERPent, so that if anyone using an older version runs into these problems, they know that they have been fixed or improved in a newer version.

25/09/12. More Aggressive Flagging. Many updates.
I've reduced the kickout safety clause within the SumThreshold flagging sequence so that the algorithm flags closer to the median. There are two levels: a more aggressive level for L and S band, and a less aggressive one for C band and higher frequencies.

25/09/12. Lovell-Mk2 Baseline Removal. SERPent detects whether this baseline exists, automatically removes it from the list of jobs to process, and flags it at the end of the SERPent script. If anyone has a reason to keep this baseline, let me know.

25/09/12. Lovell Stationary Scans & Zero Amplitude In-scan Removals. This version contains a passage which detects and automatically removes the Lovell stationary scans for Lovell baselines when the user sets the phasecal variable in the input file to the name of the phase calibrator. Some of the COBRaS commissioning data also contained in-scan zero-level-amplitude visibilities, probably from collecting data whilst the antennas were still slewing, or from a system failure. A zero-level passage has been added, with a user option in the input file to decide whether it is needed. For a blind, automated run it's best to select 'yes'!

Figure 7: The time taken to flag the e-MERLIN RFI Test dataset (L band, 1.63 GB) using a common Desktop computer (4 GB) and Leviathan (100 GB) over a range of CPUs. Each point is an average of 3 runs using the same number of CPUs.

11/09/12. Further Parallelization. I've parallelized SERPent further by splitting the jobs by both baseline and IF. This creates (number of baselines x number of IFs) jobs, which are distributed amongst the CPUs selected by the user. As e-MERLIN moves from commissioning observations to full-bandwidth Legacy datasets, the number of IFs (and thus jobs) will increase, and this new parallelization will squeeze out a little more performance.

27/07/12. Parallelization.
We have now implemented a simple parallelization over baselines in SERPent. This sends different baselines to different CPUs, thus running multiple flagging runs in parallel. Please consult section 3.2.3 for more information.

12/06/12. Bug Fix. Many thanks to Megan, who spotted a bug in how the antenna numbers were read into the FG rows. Previous versions only worked for antenna numbers < 10, but she found a simple solution: rsplit the antenna numbers from a string into a list, and read them in from there.

4.2 FAQ

Due to the flooding of my inbox with questions concerning SERPent (sarcasm), I've anticipated some FAQs which people in the future may have.

Figure 8: The speed ratio of running SERPent on multiple CPUs on the Desktop and Leviathan relative to a single CPU on the same systems. The two linear relations meet around 9 CPUs. Even at a high number of CPUs there are significant gains in performance, which should increase further beyond 16 CPUs.

Q. Does SERPent work on arrays other than e-MERLIN?
A. Whilst all of the tests were on MERLIN and e-MERLIN data, there should be no reason why it would not work on other arrays such as the EVLA, EVN etc., as long as the data are in FITS format and can be read into AIPS. This script works on both single-source and multi-source files, although you do need to SPLIT the multi-source file before running SERPent.

Q. Can I change any of the parameters in the flagger definition?
A. Of course. You are free to change anything you like to suit your needs. Please consult section 2.4 to help you set the parameters.

Q. Why SERPent?
A. Because the script is written in ParselTongue, which is a Python-based language, and my legacy project is called COBRaS. I was just keeping in theme with all these previous acronyms.

Q. How did you come to think of SERPent?
A.
Like many of my epiphanies for solving problems I encountered when writing this pypeline, I came up with the idea in the shower (unfortunately this is actually true). I thought of the word 'serpent' and, like any respectable astronomer, spent the rest of my time in the shower striving to match semi-relevant words to the initials.

Q. Why were you thinking about this in the shower?
A. What I do in the shower, or why, is strictly private.

4.3 FUQ (Frequently Unasked Questions)

Q. How can I ever repay you for your heroic, selfless actions in writing this script?
A. We ask for nothing. However, if you ever see us in a pub near you without a drink (actually, even if we already have a drink), etiquette would dictate that you should probably buy us one...

References

[1] P. A. Fridman. Statistically Stable Estimates of Variance in Radio-Astronomy Observations as Tools for Radio-Frequency Interference Mitigation. ApJ, 135:1810-1824, May 2008.

[2] N. Niamsuwan, J. T. Johnson, and S. W. Ellingson. Examination of a simple pulse-blanking technique for radio frequency interference mitigation. Radio Science, 40:5, June 2005.

[3] A. R. Offringa, A. G. de Bruyn, M. Biehl, S. Zaroubi, G. Bernardi, and V. N. Pandey. Post-correlation radio frequency interference classification methods. MNRAS, 405:155-167, June 2010.