SERPent Cookbook
Version 25/09/12
Luke W. Peck∗ & Danielle M. Fenech†
October 8, 2012

∗ email: lwp@star.ucl.ac.uk
† email: dmf@star.ucl.ac.uk
Contents

1 Introduction to SERPent
  1.1 SumThreshold Method
  1.2 SERPent’s Implementation of the SumThreshold Method
  1.3 Outputs from SERPent

2 Running SERPent
  2.1 File Inputs
  2.2 e-MERLIN Reduction Tools
  2.3 Baseline Choice
  2.4 Flagging Options

3 SERPent Requirements, Performance, and Future Improvements
  3.1 Requirements
  3.2 Performance
    3.2.1 RFI Flagging Performance
    3.2.2 Lovell Stationary Scan Performance
    3.2.3 Speed Performance
  3.3 Future Improvements and Work

4 Other Notes on SERPent
  4.1 SERPent Updates and Bug Fixes
  4.2 FAQ
  4.3 FUQ (Frequently Unasked Questions)
*** Please note that SERPent is by no means a finished product and is still in its beta-testing
stage at this time. ***
1 Introduction to SERPent
As modern interferometers become ever more sensitive and powerful, and data sets correspondingly
larger, automating certain procedures in the reduction and calibration of interferometric data is
vital. The main ‘bottleneck’ in this calibration procedure is the removal of Radio Frequency
Interference (RFI). The flagging of bad visibilities was traditionally a manual task, whereby the
astronomer would cycle through visibilities and highlight bad data points. This was sufficient
while datasets were of the order of Megabytes (MB): it took a ‘reasonable’ amount of the
astronomer’s time, albeit being a tedious process. However, with improvements in receivers,
electronics, correlators, optical-fibre networks and other aspects of the interferometric setup,
datasets are now of the order of Gigabytes (GB), and observations now span wide frequency
ranges into bands which are not protected for radio astronomy, increasing the incidence of RFI.
The Scripted E-merlin Rfi-mitigation PypelinE for iNTerferometry (SERPent) was created to
tackle this problem for the RFI environment affecting e-MERLIN, using ParselTongue: a Python-based
language which is able to call upon AIPS tasks.
1.1 SumThreshold Method
Offringa et al. 2010b [3] demonstrated that the most effective thresholding method is the
SumThreshold, and this is the adopted RFI detection method. An overview of the method is given
here; for a more in-depth analysis of the method please see the aforementioned literature.
Threshold methods work on the basis that RFI increases visibility amplitudes at the times and
frequencies where it is present. There will therefore be a considerable difference compared to
RFI-free visibility amplitudes, making the RFI statistical outliers. If these outliers exceed
a certain threshold condition then they are detected and flagged. The threshold level is dictated
by the statistics of the sample population, which can be the entire observation (all time scans,
frequency channels, baselines etc.) or a smaller portion, for example separate baselines and IFs.
This has the advantage not only of increased computational performance (Python is faster when
operating on many smaller chunks of data rather than one big chunk), but also of more reliable
statistics: RFI may be independent of baseline, and its distribution may differ between IFs, so
splitting the data helps. This is particularly relevant for L band observations, where the RFI
is more problematic.
The SumThreshold method works on data which is separated by baseline and stokes and
arranged in a 2D array, with the individual time scans and frequency channels comprising the
array axes, i.e. time–frequency space. The frequency channels are further split by IF for
the reasons stated above. The idea is that peaked RFI and broadband RFI will be easily
detectable when the visibility amplitudes are arranged in time–frequency space. The e-MERLIN
correlator outputs three numbers for any single visibility: the real part, the imaginary
part and the weight of the visibility. When appending visibilities in time–frequency space, if
the weight is greater than 0.0, i.e. data exists for that time and frequency, then the magnitude
of the complex visibility (from its real and imaginary parts) is taken as the amplitude. If the
weight is 0.0 or less, i.e. no data exists for this baseline, time scan etc., then the amplitude
is set to 0.0. This has no effect on the statistics or threshold value, but acts as a placeholder
for that element of the array. The Python module NumPy was employed to create and manipulate the
2D arrays, as the module is implemented in compiled code (which is intrinsically faster than
Python) and has been optimized¹.
There are two concepts associated with the SumThreshold method: the threshold and the
subset size, i.e. a small slice of the total elements (in this case visibility amplitudes) in a
certain direction of the array (time or frequency). The difference between the SumThreshold
method (a type of combinatorial thresholding) and normal thresholding is that after each
individual element in the array has been tested against the first threshold level χ1, the values
of a group of elements can be averaged and tested against a smaller threshold level χi, where i
is the subset size, i.e. the number of elements averaged and tested. Empirically a small set of
subset sizes i = [1, 2, 4, 8, 16, 32, 64] works well (Offringa et al. 2010b) [3]. A window of
size i cycles through the array in one direction (e.g. time) over every possible position for
the given array and subset size. After each subset cycle a binary array of identical size records
the positions of any elements which are flagged: 0.0 denotes a normal visibility and 1.0 signifies
RFI found in the time direction (2.0 for the frequency direction and higher values for any
subsequent runs of the flagger). At the beginning of the next subset cycle, for any element of
the flag array whose value is greater than 0.0, the corresponding amplitude in the visibility
array is reduced to the threshold level χi, which progressively gets smaller with increasing
subset size. If the average of a group of elements of any subset size i is found to be greater
than the threshold level χi, then all elements within that window are flagged. This method is
implemented in both array directions (i.e. time and frequency).
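As an illustration of the window logic described above, here is a minimal, hypothetical sketch of a single SumThreshold pass along one direction of the array (this is our own sketch, not the SERPent source; the per-subset thresholds `chi` are assumed to have been computed already):

```python
import numpy as np

def sum_threshold_1d(amps, flags, chi, subsets=(1, 2, 4, 8, 16, 32)):
    """One SumThreshold pass along a 1D slice (e.g. the time direction).

    amps  -- 1D array of visibility amplitudes
    flags -- same-size array; values > 0 mark already-flagged samples
    chi   -- dict mapping subset size i to threshold level chi_i
    """
    amps = amps.copy()
    for i in subsets:
        # Previously flagged samples are clipped down to the current
        # threshold so they cannot dominate later window averages.
        amps[flags > 0] = np.minimum(amps[flags > 0], chi[i])
        # Slide a window of size i over every possible position.
        for start in range(len(amps) - i + 1):
            if amps[start:start + i].mean() > chi[i]:
                # The whole window is flagged when its average exceeds chi_i.
                flags[start:start + i] = 1.0
    return flags
```

In SERPent the equivalent pass is then repeated along the frequency direction, writing 2.0 into the flag array instead of 1.0.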
1.2 SERPent’s Implementation of the SumThreshold Method
In addition to the SumThreshold methodology, certain clauses have been added to prevent the
algorithm from overflagging the dataset. If any threshold level reaches the mean + variance
estimate, the flagging run for that direction stops. The flagging process can run multiple times
at the cost of computational time; a second run is written into the code as the default, executed
if the maximum value within the array is a certain factor of the median and if there are flags
from the previous run. On this second run all visibilities flagged in the first run are set to
0.0 in the visibility array, so they do not skew the statistics, and the run can then look for
weaker RFI which may remain. This may be necessary, as some RFI in the e-MERLIN commissioning
data were found to be over 10,000 times stronger than the astronomical signal while some weaker
RFI was still present. Note that the first run’s subsets increase in size in binary steps up to
32, while the second run goes deeper, to 256. These values can easily be changed manually to
lower values to save time if there is not much RFI in the observations.
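The control flow just described can be sketched as follows (a hypothetical reduction of the logic, with `flag_fn` standing in for one complete SumThreshold run):

```python
import numpy as np

def flag_two_runs(amps, flag_fn, factor=10.0):
    """Run the flagger once, then a deeper second time if strong RFI was seen.

    flag_fn(data, max_subset) -> boolean flag array for one SumThreshold run.
    """
    flags = flag_fn(amps, max_subset=32)       # first run: subsets up to 32
    # Only rerun if something was flagged and the peak is far above the median.
    if flags.any() and amps.max() > factor * np.median(amps):
        data = amps.copy()
        data[flags] = 0.0                      # zero flagged points so they
                                               # don't skew the statistics
        flags = flags | flag_fn(data, max_subset=256)  # deeper second run
    return flags
```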
The first threshold level can be calculated by a range of methods and statistics. The variance
of a sample is an important component of this threshold, and various methods are described and
tested by Fridman (2008) [1]. The author concluded that Exponential Weighting is the best method
from the point of view of LOSS: a measure of the difference in standard deviation between a
robustly estimated variance and a simple estimate, in the absence of outliers. However, the Median
Absolute Deviation (MAD) and the Median of Pairwise Averaged Squares are the most effective ways
to remove outliers, although the author comments that both are less efficient, i.e. need more
samples to produce the same power, than other methods. Since the sample size in any given
observation from e-MERLIN will be of adequate size, this is not such an issue. The breakdown
point of MAD is also very high (0.5), i.e. almost half the data may be contaminated by outliers
(Fridman 2008) [1]. MAD is adopted by this algorithm because of these robust properties. Again,
the author stresses that the type and intensity of RFI, the type of observation and the method of
implementation are important factors when deciding which estimate to use for any given
interferometer.

¹ It should be noted here that how this module is compiled and called upon can have a significant
effect on performance.
The variance estimate MAD used in the SERPent algorithm is defined by equation 1, where
mediani(xi) is the median of the original population. Each sample of the population is replaced
by the absolute value of its difference from this median. The median of this new
absolute-deviation population is taken and multiplied by the constant 1.4826 to make the estimate
consistent with the standard deviation of a Gaussian distribution.

    MAD = 1.4826 medianj { |xj − mediani(xi)| }    (1)
The first threshold level χ1 is thus determined by an estimate of the mean x̄, the variance σ
and an aggressiveness parameter β (equation 2) (Niamsuwan, Johnson & Ellingson 2005) [2]. Since
the median is less sensitive to outliers, it is preferred to the traditional mean in this equation
(thus x̄ = median), and the MAD is preferred to the traditional standard deviation for similar
reasons (σ = MAD). If the data are Gaussian in nature then the MAD value will be similar to the
standard deviation (and the median to the mean). A range of values for β was tested until a
stable value of around β = 25 was found for multiple observations and frequencies. Increasing
the value of β reduces the aggressiveness of the threshold; decreasing it increases the
aggressiveness.

    χ1 = x̄ + βσ    (2)
The subsequent threshold levels are determined by equation 3, where ρ = 1.5; this empirically
works well for the SumThreshold method (Offringa et al. 2010b) [3] and defines how ‘coarse’ the
difference between threshold levels is, and i is the subset size.

    χi = χ1 / ρ^(log2 i)    (3)
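Equations 1–3 can be written directly in NumPy; the following sketch (ours, not lifted from SERPent.py) computes the MAD, χ1 and the per-subset thresholds:

```python
import numpy as np

def mad(x):
    """Median Absolute Deviation, scaled by 1.4826 for consistency with
    the standard deviation of a Gaussian distribution (equation 1)."""
    x = np.asarray(x, dtype=float)
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def thresholds(x, beta=25.0, rho=1.5, subsets=(1, 2, 4, 8, 16, 32)):
    """chi_1 = median + beta * MAD (equation 2), then for each subset size i
    chi_i = chi_1 / rho**log2(i) (equation 3)."""
    chi1 = np.median(x) + beta * mad(x)
    return {i: chi1 / rho ** np.log2(i) for i in subsets}
```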
In summary, SERPent first calculates the median and MAD for each IF, baseline and stokes, and
removes any strong individual visibility which is over the first threshold. It then sets any
flagged visibility amplitudes to zero and recalculates the statistics. Next it starts the ‘first’
run of the SumThreshold method in both the time and frequency directions. After this is completed
it again sets any flagged visibility amplitudes to zero and recalculates the statistics. The
second SumThreshold run is then performed to try to remove any weak RFI.
1.3 Outputs from SERPent
SERPent reads visibilities directly from AIPS, finds the RFI and records it in a NumPy array.
This array is then pickled using the cPickle module in Python and named according to its source
name, baseline and IF number. These Pickle files will be located in whatever output folder the
user designated (see section 2.1). A corresponding dictionary entry is created with this pickle
name as the key and all the associated information, i.e. source number, baseline and time
information for the time axis, which will be used when creating the FG rows.
A dummy text file is then created as the Pickle files are read and the FG rows are created.
Once all the Pickle files have been read and the FG rows appended to the dummy file, the actual
FG text file is created and the FG extension table header is written. All the rows from the
dummy file are then appended below the header in the real FG text file, and the dummy file is
deleted. SERPent then runs the AIPS verb REFLG, which condenses FG rows in an efficient manner
and outputs another FG text file with the condensed rows.
The outputs found in the user’s directory are: the Pickle files, the original FG text file named
<name-of-source>.fg, and the REFLG FG text file named <name-of-source>_r.fg. SERPent will also
automatically read in these flag files, so the user will have two FG tables for their source
in AIPS.
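For users who want to inspect the pickled flag arrays directly, loading one can be sketched as below (our own helper, not part of SERPent; under Python 2, which SERPent uses, `cPickle` replaces `pickle`):

```python
import pickle  # 'import cPickle as pickle' under Python 2

def load_flag_array(path):
    """Load one of SERPent's pickled per-baseline, per-IF flag arrays.

    The array is 2D (time x frequency): 0.0 marks good data, 1.0 a flag
    from the time direction, 2.0 from the frequency direction, and higher
    values flags from subsequent runs of the flagger.
    """
    with open(path, 'rb') as f:
        return pickle.load(f)
```

Point it at one of the Pickle files in your designated output folder (the names follow the source/baseline/IF scheme described above).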
2 Running SERPent
There are only two files associated with SERPent: SERPent.py, the main script file where all
the business is conducted, and SERPent_input.py, the input file where the user inputs details
about the AIPS data and what they want SERPent to do. To run the script, make sure both files
are in the same directory and simply type in the terminal: parseltongue SERPent.py; some
printouts will then appear on your terminal. The next few sections describe the SERPent input
file and what the user needs to enter.
2.1 File Inputs
There are eight variables in total, requiring details about your observation, whether you want
a log file containing timed SERPent runs, the number of CPUs available on your computer, and
your designated directory path. Below is an example of what you’ll see.
# Data Information:
AIPS_user_number = 101    # The AIPS user number the data is on.
Name = '1436+6336'        # The uvdata name of the catalogue.
Klass = 'SPLIT '          # The uvdata klass of the catalogue.
Disk = 1                  # The uvdata disk number of the catalogue.
Seq = 1                   # The uvdata sequence number of the catalogue.
# Log Information:
write2log = 'no'    # 'yes' only if using the multi.py script file to write timing
                    # details to a log file (only used for performance testing)

# Parallelization Information:
NCPU = 4    # Define here the number of CPUs you want to use.
            # Parallelization is currently implemented on a baseline
            # basis and thus the maximum number of CPUs utilized will be
            # the number of baselines in the data, i.e. for e-MERLIN a
            # maximum of 21 CPUs will be used by SERPent (21 baselines).
# Directory Information:
path2folder = '/home/lwp/aips_test/'
The first five variables relate to AIPS. AIPS_user_number is the AIPS user number of the
catalogue your observation is loaded into, and should be an integer. Name is the name of your
observation (probably the name of the source for single-source files) and is a string. Klass is
the class name of your observation and is a string. Disk is the disk number your catalogue is
located on and is an integer. Seq is the sequence number of the observation you want to run
SERPent on and is an integer.
write2log is a variable for keeping a log of multiple runs for performance tests with SERPent
using different numbers of CPUs, sources etc. The majority of users can ignore it.
NCPU defines the number of CPUs you want to run SERPent on. For information on how the number
of CPUs affects the performance of SERPent, please consult section 3.2.3. The final variable,
path2folder, defines the directory where you want all the SERPent outputs, such as the Pickle
files and FG text files, to go, and is a string.
SERPent can flag multi-source files (e.g. Klass = ‘MULTTB’) in its current condition, but with
the following constraints: all the sources must have been SPLIT from the multi-source file and
be on the same disk number and sequence number as the multi-source file; the Names of the SPLIT
files must match the source names in the SU table of the multi-source file; and you must enter
the details of the multi-source file in the variables in the input file. It is recommended,
however, that SERPent is run on single-source files, as the files will be big for e-MERLIN /
new-generation interferometers and will take some time to run.
2.2 e-MERLIN Reduction Tools
The next couple of variables relate to other reduction passages written into SERPent as a
consequence of issues arising in our COBRaS e-MERLIN commissioning data. Currently there are
two extra reduction tools available in SERPent in addition to the SumThreshold RFI flagging
sequence: one for removing the Lovell stationary scans and another for removing in-scan
zero-level amplitudes.
# Phase Cal Information
phasecal = '2007_0718'    # If one of the sources (multi files) or source is
                          # the phase cal, please write the name of the source
                          # as this variable. Else put 'no'.
                          # This information is used for the Lovell Stationary
                          # Scans passage of SERPent if the source is the phasecal
                          # designated here, the telescope is e-MERLIN and the
                          # baseline contains the Lovell telescope.

# Execute the Zero Level code to flag telescope dropouts and other system failures
# where the visibilities drop to around 0 Jy in the same scans where good data is
# present.
zero_level = 'yes'    # To execute this passage set this variable to 'yes',
                      # else set it to 'no'.
phasecal defines the phase calibrator for your observation. If this is a single-source file and
the source is the phase calibrator, then this should be the same as the Name input above. If
you are running a multi-source file then this should be the name of the catalogue which
corresponds to the phase calibrator. The Lovell stationary scan passage will only run if the
array is e-MERLIN, on baselines which contain the Lovell telescope, and on data from the phase
calibrator.
The zero_level variable refers to a passage which searches for zero- or low-level amplitudes
within scans which contain good data. This acts upon all of the data. Select either ‘yes’ or
‘no’ for this option.
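The condition can be illustrated with a small sketch (ours; SERPent’s actual test may differ): within a single scan, flag samples whose amplitude sits near 0 Jy while the rest of the scan holds good data.

```python
import numpy as np

def zero_level_flags(scan_amps, floor=1e-3):
    """Flag near-zero amplitudes inside a scan that also contains good data.

    Returns a boolean array; a scan with no good data is left alone, since
    the passage only targets dropouts within otherwise good scans.
    """
    scan_amps = np.asarray(scan_amps, dtype=float)
    good = scan_amps > floor
    if good.any() and not good.all():
        return ~good          # dropouts inside a scan with good data
    return np.zeros(scan_amps.size, dtype=bool)
```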
2.3 Baseline Choice
There are two variables in the input file which dictate which baselines are to be flagged by
SERPent. Below is what you’ll be confronted with in the SERPent input file.
# Baseline Information for Flagging
which_baselines = 'choose'    # Variable to define whether to flag all
                              # baselines or a select few.
                              # Options are: 'all' or 'choose'.

baselines = ['5-7', '7-8']    # chosen baselines for flagging in the correct format.
                              # the order of the baselines does not matter.
The first variable is called which_baselines and has two recognised inputs, ‘choose’ or ‘all’,
both of which are strings. The ‘all’ choice selects all baselines in the observation for flagging
and ignores the second baseline variable (below). The ‘choose’ option allows the user to select
specific baselines for flagging. The user will need to know in advance which baselines they want
flagged and the associated antenna numbers.
If the user has selected which_baselines = ‘choose’, then the second variable, baselines, will
be read by the script and will need the necessary information. The baselines variable is a
Python list which requires strings in the strict format antenna number−antenna number; i.e. to
include the baseline with antennas 5 and 7, the item in the list takes the form ‘5-7’.
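A quick way to sanity-check entries in this list before a run (an illustrative helper, not part of SERPent):

```python
def parse_baseline(bl):
    """Split a baseline string in the 'ant-ant' format into antenna numbers."""
    a, b = bl.split('-')
    return int(a), int(b)

baselines = ['5-7', '7-8']
pairs = [parse_baseline(bl) for bl in baselines]  # [(5, 7), (7, 8)]
```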
2.4 Flagging Options
The final part of the input file concerns the flagging options. This gives the user complete
control over the flagging options available without having to dig around in the code to modify
them. There is a ‘default’ setting which flags well, so it is not essential to change anything,
but if you want to tinker with the parameters then see below for a description of the variables
you’ll see in the input file.
# Flagger Options:
flagging_options = 'choose'    # variable to define whether to use the flagging
                               # options below or the default options in the
                               # SERPent.py file. default ignores whatever
                               # variables are set in this file.
                               # Options are: 'choose' or 'default'

aggressiveness_first_run = 25    # How aggressive the first run is
                                 # Note: a lower number is more aggressive
max_subset_first_run = 32    # Maximum subset for the first run of SERPent
                             # Note: should be a binary number: 1,2,4,8,16...

aggressiveness_second_run = 25    # How aggressive the second run is

max_subset_second_run = 256    # Maximum subset for the second run of SERPent

rho = 1.5    # Difference in coarseness between each
             # threshold level
The first variable, flagging_options, decides whether SERPent should use its inbuilt parameter
values or read the user-defined ones. For the record, the values displayed above are the current
SERPent ‘default’ settings. This variable has two options, ‘choose’ or ‘default’, both of which
are strings. ‘default’ ignores the user-input parameter values and uses the SERPent default
values; ‘choose’ makes SERPent read the rest of the variables in this section.
The variable aggressiveness_first_run defines the aggressiveness of the first run of the
SumThreshold method. A higher value actually makes the run less aggressive, and a lower value
more aggressive. This value should be an integer, although a floating-point value is also
acceptable. The max_subset_first_run variable defines the maximum size a subset will take (see
section 1.1 for more information). This needs to be a binary number (1, 2, 4, 8, 16, 32, 64...)
and thus an integer, as the flagging starts with a subset size of 1 which then doubles until the
maximum subset size is met. As the thresholds are continually lowered by equation 3, the larger
subsets test more for constant, weaker RFI in both time and frequency.
The next two variables, aggressiveness_second_run and max_subset_second_run, are the same as
the first-run parameters, except that they correspond to the second run of the SumThreshold
method. The default aggressiveness is the same as for the first run, but the second run’s
maximum subset differs: we allowed the last run to search for weak RFI and thus increased the
maximum subset size to 256, although SERPent may stop before this maximum is reached if the
thresholds get too close to the median of the sample, to avoid flagging good data.
3 SERPent Requirements, Performance, and Future Improvements

3.1 Requirements
SERPent has been run on a number of systems and seems to be fairly stable. Here are the versions
we are running the code on, which should probably be considered the ‘minimum’ requirements for
the code to work:

• AIPS release 31DEC11
• Python 2.6.5
• ParselTongue 2.0 (with Obit 1.1.0)
• NumPy 1.6.1
3.2 Performance
Here we document the performance of early test runs of SERPent on old MERLIN data, early
e-MERLIN commissioning data, and RFI test data supplied by Rob Beswick (Jodrell Bank). Table 1
gives details of the datasets tested here. All tests used SERPent version 31/07/12.
Table 1: SERPent Performance Test Datasets

Telescope   Dataset Name                Size     Band  Visibilities  Sources  Baselines  IFs  Channels  Stokes
MERLIN      M82V                        212 MB   L     82692         6        21         1    31        2
e-MERLIN    RFI Test Data: 1436+6336    1.63 GB  L     5812          1        10         12   512       4
e-MERLIN    COBRaS W1 2011: 0555+398    2.33 GB  C     99149         1        10         4    128       4

3.2.1 RFI Flagging Performance
SERPent has been tested on both L and C band observations and has been found to flag almost all
C band RFI and the majority of L band RFI (more recent runs are needed, as SERPent has been
significantly updated since these tests). The remaining L band RFI is usually weak broadband
RFI or very weak RFI close to the median value of the sample.
First we present some results from L band data. Figure 1 shows some RFI test data of
0555+398 (data courtesy of Rob Beswick) with one baseline displayed via the AIPS task SPFLG
in time–frequency space. The first IF is completely wiped out with noisy data, and some weak
broadband RFI remains in the central IFs. Almost everything else has been flagged, including
some very intricate RFI which cannot be removed as accurately with more simplistic RFI flagging
routines.
Figure 1: AIPS SPFLG image of 0555+398, L band, baseline 7 − 8, stokes: RR, IF: 1 − 12 after SERPent flagging.
The AIPS task REFLG was also deployed in this image. The vertical axis is time and horizontal axis is frequency.
The L band results have shown that SERPent can flag complicated RFI in time–frequency space,
and figure 2 shows that this also applies to the C band, with the infamous ‘wiggly’ RFI found
in commissioning data. Note that this was very poor quality data and SERPent even started to
flag some of the noise. However, this is a good example of the thresholding method in action.
Figure 2: AIPS SPFLG image of 0555+398, C band, baseline 5 − 7, stokes: RR, IF: 2 before (left) and after SERPent
flagging (right). The AIPS task REFLG was also deployed in this image. The vertical axis is time and horizontal
axis is frequency.
To add to the archive of results we present some commissioning COBRaS flagging results for
both the L and C bands in figures 3 and 4.
3.2.2 Lovell Stationary Scan Performance
A problem unique to the e-MERLIN array is the Lovell stationary scan. Due to the size of the
Lovell telescope and the consequent slew time, the Lovell telescope only participates in every
alternate phase-cal scan, remaining stationary on the target for the other scans. The other
antennas in the array are not affected. As a result, the visibilities from baselines containing
the Lovell telescope have two different amplitude levels for the phase-cal. In most cases the
phase-cal will be brighter than the target, so when the Lovell is observing the phase-cal the
received flux will be greater than when the Lovell does not participate in the phase-cal scan
and remains on the target source.
This behaviour can be seen using the IBLED task within AIPS on the phase-cal source, as figure
5 clearly shows. This figure also displays another problem with early e-MERLIN commissioning
data: multiple amplitude levels for scans throughout the observation. This property has been
traced to hardware issues within the receivers, and new filters appear to have resolved the
issue for future observations. However, it was necessary to normalize this problem before
flagging this dataset.
In the main window each group of points represents one scan, for which there are three distinct
amplitude levels. The highest two levels are scans where the Lovell telescope contributes to
the observation (including the aforementioned filter issues affecting the amplitude levels) and
the lowest-level scans are those where the Lovell does not contribute. Across the entire
observation (top panel) the Lovell stationary scans are consistent in magnitude and alternate
with every other scan, despite the varying amplitude levels of the Lovell on-source scans,
indicating that the Lovell dropout scans are indeed the cause of the lowest-level scans in
figure 5.

Figure 3: AIPS SPFLG image of a target pointing for COBRaS: 2033+4113, L band, baseline 6−8, all stokes
combined, IF: 12. The AIPS task REFLG was also deployed in this image. The vertical axis is time and the
horizontal axis is frequency.
If the array is e-MERLIN, SERPent runs an extra piece of code which first determines the Lovell
baselines. It makes a first pass through all the integration times, isolating each scan and
evaluating the magnitude of each scan, the highest and lowest scan statistics and the
integration time step. A second pass again isolates each individual scan and tests the
following condition: if the mean of the scan is within ±σ of the lowest mean found in the
previous pass, then the entire scan is flagged. The results are written to a text file via the
cPickle Python module and are combined with the main SumThreshold flagging results later in
the script. Figure 6 shows the IBLED task window on the same phase-cal source as in figure 5
after the Lovell stationary scans have been removed by SERPent.
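The scan test can be reduced to the following sketch (a simplified reading of the condition above, not SERPent’s code; `scan_means` are the per-scan mean amplitudes and `sigma` the spread found on the first pass):

```python
import numpy as np

def lovell_dropout_scans(scan_means, sigma):
    """Mark scans whose mean lies within +/- sigma of the lowest scan mean,
    i.e. the candidate Lovell stationary (dropout) scans."""
    scan_means = np.asarray(scan_means, dtype=float)
    lowest = scan_means.min()
    return np.abs(scan_means - lowest) <= sigma
```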
3.2.3 Speed Performance
To increase computational performance we have parallelized SERPent by splitting the data into
‘jobs’ which are then evenly distributed across a number of CPUs. SERPent is parallelized over
both baselines and IFs to maximize the even spread across CPUs, and uses however many CPUs the
user designates in the input file. Our initial tests on modest data sizes reveal a significant
increase in performance which scales linearly with data size.
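The job-splitting idea can be illustrated with Python’s multiprocessing module (a generic sketch, not SERPent’s actual mechanism; `flag_job` is a stand-in for flagging one (baseline, IF) chunk):

```python
from multiprocessing import Pool

def flag_job(job):
    """Stand-in for flagging one (baseline, IF) chunk of the data."""
    baseline, if_num = job
    # ... the SumThreshold work for this chunk would happen here ...
    return (baseline, if_num, 'done')

if __name__ == '__main__':
    # One job per (baseline, IF) pair, spread evenly over NCPU workers.
    jobs = [(bl, if_num) for bl in ('5-7', '7-8') for if_num in (1, 2)]
    pool = Pool(processes=4)
    results = pool.map(flag_job, jobs)   # 4 jobs distributed over 4 CPUs
    pool.close()
    pool.join()
```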
The ultimate speed performance would come from a full multi-threading implementation of SERPent
across a network. Whilst there are facilities available to us at UCL (4 Leviathan boards, each
with 16 CPUs and 100 GB RAM), few such facilities are available to others. It would also require
a rewrite of the SERPent code, which would take time, so for the foreseeable future this is not
a likely option.
We have analysed a number of datasets for computational performance on two different computer
systems: one is a common desktop computer, and the other is a more powerful system. Table 2
gives details of the computer systems on which we have tested SERPent performance.

Figure 4: AIPS SPFLG image of a target pointing for COBRaS: 2032+4113, C band, baseline 1−5, all stokes
combined, IF: 4. The AIPS task REFLG was also deployed in this image. The vertical axis is time and the
horizontal axis is frequency.
Table 2: Computer Systems

Computer Name  Memory (GB)  NCPUs
Leviathan      100          16
Desktop        4            4
We present the computational performance on the ‘RFI Test Data’ (L band), as it contains the
most jobs (120) and is of modest enough size to perform multiple runs over a range of CPU
counts. Figure 7 shows the time taken for each system to flag the same small dataset. As the
number of CPUs increases, the marginal gain in performance shrinks, but there is still a
significant gain even at high numbers of CPUs. Note that full e-MERLIN data sizes may well be
100 times the size of this dataset, and thus the small gains here will magnify.
Another way to demonstrate the importance of a high number of CPUs is given in figure 8. Here
the performance ratio relative to a single CPU on the same system shows two linear relations
with a turn-off point around 9 CPUs. This is likely due to how the fixed number of jobs divides
over a given number of CPUs. We can infer that using 16 CPUs on this dataset increased the
performance by a factor of 7 compared to using only 1 CPU. Runs on other datasets gave similar
results.
It can also be seen that increasing the amount of memory also increases the computational
performance, albeit by a smaller amount than the parallelization. Leviathan has 25× more memory
than the standard desktop computer in our tests and is consistently faster by a factor of 1.7
when comparing multiple numbers of CPUs across both computers and datasets. This shows that the
limiting factor in running SERPent on interferometric datasets is the sheer volume of data that
needs processing, not RAM.

Figure 5: AIPS IBLED task window, displaying the phase-cal source: 2007+404, stokes LL (for greater clarity).
The top panel shows all scans for the entire observation run, and the main central panel shows a small selection
of scans for closer inspection, before running SERPent.
3.3 Future Improvements and Work
• We now have a parallelized version of SERPent, but we are looking for ways to increase the performance further (though not in the near future). Segregating the data into smaller timescans is one possible route of inquiry.
• Currently SERPent utilizes the AIPS verb REFLG and the various options it possesses. We have briefly experimented with these options and have found some inputs which do a reasonable job of condensing the number of FG rows. This is definitely subject to future testing and may be included in the input file as another choice for the user, alongside a default option. Currently the ‘default’ cparm options (1-7) are: [0, 4, 0.70, 0.70, 0.70, 0.70, 2], but this may change with more tests, and we welcome users to try different options and give us feedback.
• On a related note, work still needs to be done on the number of FG rows in the extension tables. Different AIPS tasks currently have different limits on how many FG rows can be read in, and AIPS has an absolute upper limit of around 1 million rows. Preliminary outputs from COBRaS commissioning data have around 6 million FG rows after running SERPent! Recent (15/06/2012) email correspondence with Eric (NRAO) resulted in him conceding that the AIPS limit may have to be increased to 10 million. This does not fully resolve the issue, however, and it may be up to us to write the REFLG task in Python and add it as part of SERPent, since we cannot even load the FG table into AIPS to use the REFLG task. Another alternative would be to chop up the FG text file and feed it into AIPS in pieces.

Figure 6: AIPS IBLED task window, displaying the phase-cal source 2007+404, Stokes LL (for greater clarity). The top panel shows all scans for the entire observation run, and the main central panel shows a small selection of scans for closer inspection, after running SERPent. The lowest-level scans present in figure 5 have been removed.
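The last alternative, chopping the FG text file into AIPS-digestible pieces, could be sketched as below. This is an illustrative helper, not existing SERPent code; the 900,000-row default is a hypothetical choice that sits under the ~1 million row limit mentioned above.

```python
def chop_fg_table(path, rows_per_piece=900000):
    """Split a plain-text FG table into pieces small enough for AIPS to read.

    Hypothetical helper: 'path' is assumed to be a text file with one FG
    row per line; each piece is written as path.partN and the list of
    piece filenames is returned, ready to be loaded one at a time.
    """
    with open(path) as f:
        rows = f.readlines()
    pieces = []
    for i in range(0, len(rows), rows_per_piece):
        part = "%s.part%d" % (path, i // rows_per_piece)
        with open(part, "w") as out:
            out.writelines(rows[i:i + rows_per_piece])
        pieces.append(part)
    return pieces
```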
4 Other Notes on SERPent
4.1 SERPent Updates and Bug Fixes
Here I’ve collected some of the updates and bug fixes to SERPent, so that anyone using an older version who runs into these problems knows they have been fixed or improved in a newer version.
• 25/09/12. More Aggressive Flagging.
Many updates. I’ve reduced the kickout safety clause within the SumThreshold flagging sequence so the algorithm flags closer to the median. There are two levels: a more aggressive level for L and S band, and a less aggressive one for C band and higher frequencies.
• 25/09/12. Lovell-Mk2 Baseline Removal.
SERPent detects whether this baseline exists, automatically removes it from the list of jobs to process, and flags it at the end of the SERPent script. If anyone has a reason to keep this baseline, let me know.
• 25/09/12. Lovell Stationary Scans & Zero-Amplitude In-scan Removals.
This version contains a passage which detects and automatically removes the Lovell stationary scans on Lovell baselines when the user sets the phasecal variable in the input file to the name of the phase calibrator. Some of the COBRaS commissioning data also contained in-scan zero-level amplitude visibilities, probably from collecting data while the antennas were still slewing, or from a system failure. A zero-level passage has been added, with an option in the input file to decide whether it is needed. For a blind, automated run it is best to select ‘yes’!

Figure 7: Time taken to flag the e-MERLIN RFI Test dataset (L band, 1.63 GB) using a common Desktop computer (4 GB) and Leviathan (100 GB) over a range of CPUs. Each point is an average of 3 runs using the same number of CPUs.
• 11/09/12. Further Parallelization.
I’ve parallelized SERPent further by splitting the jobs by both baseline and IF. This creates (number of baselines × number of IFs) jobs, which are distributed amongst the CPUs selected by the user. As e-MERLIN moves from commissioning observations to full-bandwidth Legacy datasets, the number of IFs (and thus jobs) will increase, and this new parallelization will squeeze out a little more performance.
• 27/07/12. Parallelization.
We have now implemented a simple parallelization over baselines in SERPent. This sends different baselines to different CPUs, running multiple flagging jobs in parallel. Please consult section 3.2.3 for more information.
• 12/06/12. Bug Fix.
Many thanks to Megan, who spotted a bug in how the antenna numbers were read into the FG rows. Previous versions only worked for antenna numbers < 10, but she found a simple solution: rsplit the antenna numbers from a string into a list, and read them in from there.
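That fix can be sketched as follows. The entry format shown is an assumption for illustration (not the exact AIPS FG text), but the principle is the one described above: split the string into tokens rather than slice fixed character positions, so multi-digit antenna numbers parse correctly.

```python
def parse_antennas(entry):
    """Read antenna numbers from a textual entry such as 'ANTS = 5 12'.

    Slicing fixed character positions breaks for antenna numbers >= 10;
    rsplitting on '=' and tokenizing handles any width. (The entry
    format here is illustrative, not the exact AIPS FG text.)
    """
    return [int(tok) for tok in entry.rsplit("=", 1)[1].split()]
```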
4.2 FAQ
Due to the flooding of my inbox with questions concerning SERPent (sarcasm), I’ve anticipated some FAQs which people may have in the future.
Q. Does SERPent work on arrays other than e-MERLIN?
Figure 8: Speedup of running SERPent on multiple CPUs on the Desktop (4 GB) and Leviathan (100 GB), relative to a single CPU on the same system, for the e-MERLIN L-band dataset (120 jobs). Even at high numbers of CPUs there are significant gains in performance, which should increase further beyond 16 CPUs.
A. Whilst all of our tests used MERLIN and e-MERLIN data, there is no reason why it should not work on other arrays such as the EVLA and EVN, as long as the data are in FITS format and can be read into AIPS. The script works on both single-source and multi-source files, although you do need to SPLIT a multi-source file before running SERPent.
Q. Can I change any of the parameters in the flagger definition?
A. Of course. You are free to change anything you like to suit your needs. Please consult section
2.4 to help you set the parameters.
Q. Why SERPent?
A. Because the script is written in ParselTongue, a Python-based language, and my legacy project is called COBRaS. I was just keeping in theme with all these previous acronyms.
Q. How did you come to think of SERPent?
A. Like many of my epiphanies for solving problems I encountered while writing this pypeline, I came up with the idea in the shower (unfortunately this is actually true). I thought of the word ‘serpent’ and, like any respectable astronomer, spent the rest of my time in the shower striving to match semi-relevant words to the initials.
Q. Why were you thinking about this in the shower?
A. What I do in the shower, or why, is strictly private.
4.3 FUQ (Frequently Unasked Questions)
Q. How can I ever repay you for your heroic, selfless actions in writing this script?
A. We ask for nothing. However, if you ever see us in a pub near you without a drink (actually, even if we already have a drink), etiquette would dictate that you should probably buy us one...
References
[1] P. A. Fridman. Statistically Stable Estimates of Variance in Radio-Astronomy Observations as
Tools for Radio-Frequency Interference Mitigation. ApJ, 135:1810–1824, May 2008.
[2] N. Niamsuwan, J. T. Johnson, and S. W. Ellingson. Examination of a simple pulse-blanking
technique for radio frequency interference mitigation. Radio Science, 40:5, June 2005.
[3] A. R. Offringa, A. G. de Bruyn, M. Biehl, S. Zaroubi, G. Bernardi, and V. N. Pandey. Post-correlation radio frequency interference classification methods. MNRAS, 405:155–167, June 2010.