C/67-2
COMPUTER APPROACHES FOR THE HANDLING
OF LARGE SOCIAL SCIENCE DATA FILES
A Progress Report to the National Science Foundation
on Grant GS-727
Submitted by
Ithiel de Sola Pool, James M. Beshers
Stuart McIntosh, and David Griffel
Center for International Studies
M.I.T.
January 1967
ACKNOWLEDGEMENTS

In addition to the National Science Foundation which supported this project, we wish to extend our thanks to Project MAC, an MIT research program sponsored by the Advanced Research Projects Agency, Department of Defense, under Office of Naval Research Contract Number N-402(01), which made exceptional computer facilities available to us, and to the Comcom project of the MIT Center for International Studies [supported by ARPA under Air Force Office of Scientific Research Contract Number AF 49(638)-1237], which supported much of the initial program development.
Summary

This is a report on an experiment in computer methods for handling large social science data files. Data processing methods in the social sciences have been heavily conditioned by the limitations of the data processing equipment available, such as punch cards and tapes. Third generation computers permit on-line time-shared interactive data analysis in entirely new ways. We sought to use these new capabilities in a system that would enable social scientists to do things that were previously impractical.
Among the most important objectives of the system is to permit social scientists to work on data from a whole library of data simultaneously, rather than on single studies. (This we call working on multi-source data.) Another important objective was to permit social scientists to work simultaneously on data at different levels of aggregation. For example, a file of voting statistics by precinct and a file of individual responses to a public opinion poll might be combined to examine how non-voters in a one-party neighborhood differ in attitudes from non-voters in a contested neighborhood. (This we call working on multi-level data.) Another objective was to permit researchers who do not know computer programming to work on-line on the computer.
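The multi-level case can be pictured with a minimal sketch (in present-day Python notation). The precinct figures, field names, and threshold below are invented for illustration; the sketch shows only the idea of joining individual poll responses to precinct-level voting statistics, not how the system described below does it.

    # Hypothetical sketch: combine a precinct-level voting file with
    # individual-level poll responses. All names and figures are invented.

    precinct_stats = {
        # precinct id -> share of the vote won by the leading party
        "P-01": {"leading_party_share": 0.82},   # effectively one-party
        "P-02": {"leading_party_share": 0.51},   # contested
    }

    poll_responses = [
        # individual respondents from a public opinion poll
        {"id": 1, "precinct": "P-01", "voted": False, "attitude": "alienated"},
        {"id": 2, "precinct": "P-02", "voted": False, "attitude": "indifferent"},
        {"id": 3, "precinct": "P-01", "voted": True,  "attitude": "partisan"},
    ]

    def neighborhood_type(precinct_id, threshold=0.65):
        """Classify a precinct as one-party or contested from the aggregate file."""
        share = precinct_stats[precinct_id]["leading_party_share"]
        return "one-party" if share >= threshold else "contested"

    # Compare non-voters across the two kinds of neighborhood.
    for r in poll_responses:
        if not r["voted"]:
            print(neighborhood_type(r["precinct"]), r["attitude"])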
To meet these objectives a data analysis system called the Admins system was developed. The system consists of four sub-systems as follows:
(1) The Organizer: a sub-system for creation of machine-executable codebooks.
(2) The Processor: a sub-system to bring the data into correspondence with that codebook.
(3) The Structurer: a sub-system for inverting data files.
(4) The Cross-analyzer: a sub-system for analyzing data.
The development of the Admins system has not been a theoretical exercise. It has been developed in the environment of an actual data archive used by students and faculty members at MIT for the conduct of their research. Approximately 15 users are now using the system.
The Organizer Sub-System
The codebooks and data as they come to our data archive are full of errors. Experience has shown that error correction takes a good half of the analyst's time. The job of the Organizer sub-system is to enable the researcher to produce an error-free machine-executable codebook covering that part of the data which he is going to use right away. (This machine-executable codebook we call an "adform", short for administrative form.)
The researcher types at the console the name by which he wants to designate a question. He then types the text that corresponds to that name, along with question and answer descriptions (these will be labels on output tables), the
possible answers, and where this data is to be found on the data storage medium (e.g. card and column). He can also type any rules to which the data must conform, e.g. single-punch, multiple-punch, if non-voter is punched vote cannot be punched, etc. For each question the researcher provides that minimum information which will permit unambiguous interpretation of the data format.
The computer then checks for errors in the codebook, such as failure to provide all the needed information or contradictory information provided, e.g. two questions listed in the same punches. The errors are printed out and corrected by the user, until after 2-3 cycles he is likely to have a complete and clean adform.
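A minimal sketch (in present-day Python notation) of what one adform entry might carry, and of the kind of completeness check made on it, follows. The field names, codes, and the rule shown are hypothetical and do not reproduce the actual adform syntax.

    # Hypothetical sketch of one adform entry: a question name, its answer
    # labels, where the data live (card and column), and a format rule.

    adform_entry = {
        "question": "vote-intention",
        "labels": {1: "will vote", 2: "will not vote", 9: "no answer"},
        "location": {"card": 2, "column": 14},
        "rule": "single-punch",     # exactly one punch allowed in the column
    }

    def check_entry(entry):
        """Report the kinds of errors the Organizer looks for: missing or
        contradictory information in the codebook entry."""
        errors = []
        for field in ("question", "labels", "location", "rule"):
            if field not in entry:
                errors.append("missing field: " + field)
        if entry.get("rule") == "single-punch" and not entry.get("labels"):
            errors.append("single-punch rule given but no answer codes listed")
        return errors

    print(check_entry(adform_entry))   # [] when the entry is complete and coherent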
He needs to do this only with questions he is about to use because he can always add adforms later.
Errors, of course, still remain, but not errors in the format of the adform. What may remain are substantive errors or failure of the adform to conform to the data. To check the latter matter we turn to the next sub-system.
The Processor Sub-System
Now the user puts data onto the disc. Up to now he has looked only at the codebook, not at data.
The processor checks the data to see if there are errors that consist of discrepancy between the data and the adform. For example, are there multiple punches in a single column variable? Are there punches somewhere that should be blank? To save time, the processor checks errors in various ways that the user may specify. He may, for example, test every 10th record, test the first hundred records, print out the first hundred errors then stop, disregard less than three errors, etc.
As the processor scans records it also computes the marginals, so by the time a clean data file, consonant with the adform, is produced, a complete listing of marginals is also available to guide the user in his future analysis.
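The processor's scan can be sketched as follows (present-day Python notation). The card column and the single-punch rule checked here are hypothetical; the point is only that error detection and the marginals come out of the same pass over the records.

    # Hypothetical sketch of the processor's scan: check each record against a
    # single-punch rule for one column and accumulate the marginals as we go.
    from collections import Counter

    records = ["1", "2", "12", "", "1"]   # punches found in one column, per record

    marginals = Counter()
    errors = []
    for i, punches in enumerate(records):
        if len(punches) != 1:             # single-punch rule: exactly one punch expected
            errors.append((i, punches or "blank"))
        else:
            marginals[punches] += 1

    print("errors:", errors)              # multiple punches or a blank column
    print("marginals:", dict(marginals))  # frequency of each code, ready for analysis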
The Analyzer Sub-System*

The analyzer sub-system permits the user to give a name to any set of records. For example, we may name as "college-educated" all persons who said they had attended college. "Union", "intersection", and other Boolean commands permit new indexes to be constructed; for example, the name "old-boys" may be used for those "college-educated" who are also "old". Each of these names designates an index, i.e., a list of pointers to those persons who have that characteristic. By building indexes out of indexes complex indexes can be constructed.
The user confused by the building of indexes out of indexes can type an index name and ask the computer to tell him the constructions that compose that index, or for that matter all the constructions in which that index is used. The computer keeps track of all that the analyst has done.

*The inversion of the files need not be explained in this summary.
The analyzer produces labelled cross-tabulation tables of one variable against another, or other statistics about an index; if desired, statistical tests can be applied. There are over 50 different instructions the user can give the analyzer sub-system. If he forgets what an instruction is he can ask, and the system will print out its definition and the proper format for it.
Multi-source and multi-level data are easily handled by the analyzer thanks to the pointers and naming conventions which can be equated across sources.
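The naming of indexes can be sketched with the "college-educated" and "old-boys" names used above (present-day Python notation); the records and helper names are invented.

    # Hypothetical sketch of naming indexes and combining them with Boolean
    # commands. An index is kept as a set of pointers (record numbers).

    responses = {
        # record number -> (attended college, age)
        0: (True, 67), 1: (True, 34), 2: (False, 71), 3: (True, 70),
    }

    indexes = {}
    indexes["college-educated"] = {r for r, (college, _) in responses.items() if college}
    indexes["old"] = {r for r, (_, age) in responses.items() if age >= 65}

    # "intersection" builds a new named index out of two existing ones.
    indexes["old-boys"] = indexes["college-educated"] & indexes["old"]

    print(sorted(indexes["old-boys"]))    # pointers to persons with both characteristics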
File Management

Let us note that the same Admins system that permits improved analysis of data about respondents and their characteristics can be turned onto the data archive itself to become the library management system. To the computer, a study and its characteristics, a question and its characteristics, or a respondent and his characteristics are all the same sort of thing. Thus the Admins system can be used for the construction of a catalogue of the data archive, the construction of indexes to it, the creation of complex indexes to the collection, and the searching of the indexes for any particular kind of data.
The Future

Further steps in the development of Admins include:
(1) Research on how Admins is used by its users to see how it can be simplified and made more natural to use.
(2) Improvement of file management capabilities. (This we call macro-organization.)
(3) Introducing a small scope to give the user the advantages of a facility like a pencil and paper rather than just a facility like a typewriter.
Part 1
PURPOSES OF THE PROJECT
On June 1, 1965 we commenced work on a pilot study of "Computer
Approaches for the Handling of Large Social Science Data Files".
In the
proposal for the project that had been submitted to the National Science
Foundation we made the following statements about our purposes and plans:
"The needs of the social sciences are not being fully met
by present trends in computer methods. This project is designed
to see that these needs are met, and thereby, we believe, to
drastically change the approach of social scientists to the
analysis of data.
"The social sciences are among those sciences which are
data-rich and theory-poor. Typical social research studies use
large bodies of statistical records such as social surveys, the
census, records of economic transactions, etc. Computation
centers and computer systems have been designed largely to serve
the needs of natural scientists and engineers who emphasize
computation rather than manipulation of large data files.
"The new system on which we are working and which we will
continue to develop has a number of important characteristics.
1. It is part of the on-line time-shared system being pioneered at MIT by Project MAC. The fact that we will be working on an on-line time-shared system also means the introduction into social science analysis of naturalistic question and answer modes of man-machine interaction, with quick response rather than the present long time lags and overmassive data production.

2. The research methods that result from our work are relatively computer-independent and program-independent. That is to say, they can be used by social scientists who are not themselves programmers.

3. Our systems provide for the handling of large archives of multi-level data.

4. The systems being developed will use list structures in storage to facilitate use of the multi-source multi-level data.

5. The purpose is to create social data bases that facilitate the construction and testing of computer models of social systems."
Part 2
PROJECT ACTIVITIES
Our approach to finding better methods for handling social science
data was largely empirical--learning by doing.
We established a function-
ing prototype data archive with over 1000 surveys and with a score or two
of users.
At the same time we explored the problems of current data handling in a series of seminars and conferences, particularly during the first year of the project. Three different types of seminars and meetings were held.
Firstly, two meetings of the Technical Committee of the Council of
Social Science Data Archives were held here at MIT.
The first of these
was the inaugural meeting of the Technical Committee at which general
problems of data handling were outlined.
The second meeting, held this
fall, was on the subject of a telecommunications experiment which is being
supported by NSF and in which Berkeley, The University of Michigan, the
Roper Center, and MIT are participating.
A second series of meetings were seminars of persons in the Cambridge
community involved in computer data handling.
Among the activities from
which persons came were Project INTREX (concerned with automated library
systems),
the TIP system (concerned with retrieval of physics literature),
some economists concerned with economic series, the General Inquirer group
from Harvard (concerned with content
analysis data), and city planners
concerned with urban and demographic data as well as survey research specialists.
The seminars were largely devoted to the presentation of specific
data handling programs and problems.
The third series of seminars were user seminars.
We listened to
presentations by data users from different substantive fields of study. Each user described their substantive analysis problems.
The discussions
allowed interaction between these substantive users and the data base
oriented people who had participated in the previous seminar series.
Common problems that were revealed in these discussions deeply affected
our understanding of user requirements.
Let us list
some of the sessions.
Professors Lerner, Beshers, Tilly,
and others discussed examples of large, complex data files.
Professor
Beshers went into the problems of multi-level data, including particularly
census data, and also aggregate data such as that found in the city and
county data books.
Professors Charles Tilly and Stephan Thernstrom of
Harvard University and Gilbert Shapiro of Boston College presented examples
of computer methods for historical archives such as records of political
disturbances and of social mobility in the nineteenth century.
Among
other presentations three dealt with problems arising in the analysis of
more or less free-formatted text (which in this report we call lightly structured data). Professor Frank Bonilla and Peter Bos described their list processing approach to qualitative data analysis which has been partially developed under the auspices of this project. Professor Philip Stone of Harvard presented his programs used for content analysis; Professor Joseph Weizenbaum described the ELIZA program which represents natural language conversation.
Professor Carl Overhage described the
INTREX project and its approach to data problems.
When we first started working under the NSF grant in 1965, we had already begun development of a data handling system in connection with work that we were doing at MIT on several research projects that required data processing. We continued with that effort to develop a modern data handling system. We named it the MITSAS system.
The basic objective of the MITSAS system was to produce tables (and
apply statistical tests to the data in them) rapidly, on demand of the
user, with the user able to draw the data from any one of a large number
of files (for example, surveys) in a large archive.
Other longer range
objectives were to facilitate analysis of multi-level data (e.g. combining
analysis of survey data with analysis of related census data) and analysing
loosely structured textual flows.
However, the initial problem to which we
addressed ourselves was to break two bottlenecks:
one, the need to pro-
duce tables quickly and easily on demand by persons not competent to write
computer programs, and, two, to take into an archive a large flow of raw
data.
The first of these bottlenecks was successfully broken by the development by Noel Morris of a CROSSTAB program which with great speed produces a considerable number of tables in bulk processing mode and which will print out such tables on-line at remote consoles one at a time as selected. The program is designed to accept survey data in virtually any format.
Included in the system are facilities for doing simple re-
coding on input data and producing output in standard format on magnetic
tape.
The system is divided into several sections:
the cross tabulation
routines read and process survey input data and write magnetic tape containing tabular information; the labelling routine accepts information
about the variables used in the input data; the statistical package computes requested statistical measures; the output routine produces designated tabular output in readable format.
The main merits of the CROSSTAB program are its very high speed and the relative flexibility and simplicity of the control cards used to designate the variables to be run, to group codes and rearrange them if desired, etc.
Although, for reasons about to be discussed, the basic approach of the MITSAS system has been abandoned in favor of a new and improved approach, the CROSSTAB program continues to be extensively used and has been the basis for further programming developments for particular applications in data processing. It has proved especially useful to persons desiring to process large amounts of uniformly formatted data. It has been used much more in batch processing mode than on-line although the on-line production of tables is provided for.
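A cross tabulation of the kind CROSSTAB produces can be sketched in a few lines (present-day Python notation). The variables and codes are invented, and the sketch says nothing about CROSSTAB's actual control cards or speed.

    # Hypothetical sketch of a cross tabulation: count respondents by the
    # joint values of two coded variables and print a labelled table.
    from collections import Counter

    respondents = [
        {"education": "college", "vote": "yes"},
        {"education": "college", "vote": "no"},
        {"education": "no college", "vote": "yes"},
        {"education": "no college", "vote": "yes"},
    ]

    table = Counter((r["education"], r["vote"]) for r in respondents)

    for education in ("college", "no college"):
        counts = [table[(education, vote)] for vote in ("yes", "no")]
        print(f"{education:>12}: yes={counts[0]}  no={counts[1]}")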
The other bottleneck to which the MITSAS system addressed itself was that of accepting large amounts of data into an archive. The basic problem was to get that data into shape where it could be used by the social scientist. The philosophy of our approach all along has been to get a new survey into the computer files with an absolute minimum of human manipulation.
When an old survey comes to an archive from some original source, what arrives are two pieces of equipment: (1) a codebook, and (2) a deck of cards or a tape. Sometimes other things come along too such as
the original questionnaire separated from the codebook, or reports written
about the study, or memorandum information about the data such as the
number of cards in the deck, the sampling methods used, etc.
If
a couple
of thousand surveys come to a small archive, as is entirely possible these
days, the odds are that all but a few hundred of these will not be used
by anyone for years; it is also certain that there will be large numbers
of errors in the material received.
There is no guarantee that the codebook
and the card deck actually match 100%.
There are likely to be errors and
omissions in the codebook as well as keypunching errors in the cards.
To find errors in, clean, edit, and later on otherwise manipulate the codebooks to provide machine readable labels, etc., can be by ordinary methods a matter of weeks of work for even a single survey.
The objective of the
MITSAS system was to reduce this labor to manageable proportions in a
number of ways.
We desired that it
should take no more than an hour to
get the codebook and forms for an average survey ready for the keypunch
operator.
This obviously meant that the survey went into the computer
uncleaned and full of errors, inconsistencies and ambiguities.
It
soon became apparent that the scale of our new venture imposed
radical constraints upon our system.
The conventional methods for handling
single social surveys, which we were attempting to extend, simply could not meet the problems posed by a large heterogeneous set of surveys. The "bad" data problem was insurmountable if one were to attempt to re-format and
"clean" each set of data as it came in.
The place to straighten out most
of this trouble was on-line, on the console, after a real user had appeared.
To avoid an intolerable volume of wasted effort, cleaning and editing operations should be done in response to client demand rather than automatically.
Even more important, the procedures for dealing with errors are not invariant between users; each will want to make his own decisions.
Nonetheless,
a certain amount of initial checking was felt to be necessary in order to
place the data in the archive, and some rapidly usable tools were programmed
for this purpose.
The MITSAS system was a collection of separate programs, each designed
to meet a current problem in survey processing.
As far as developed, the MITSAS system included: (1) An X-ray program which gives a hole count of the punches in the card images. This is used initially to ascertain which columns are multi-punched, which areas of the cards are used, etc. It is
also used to establish that the different decks in a single study are
complete, and that punches occur in the areas in which the codebook says
they should occur.
(2) A tape map program which tells us in what order the cards in a study are arranged on the tape. (3) A sort program to rearrange the cards on a tape into a desired order, or to correct an error in card order. (4) An edit program which is used to recode and re-organize a data deck into packed binary format, by which new variables can be created or placed in a desirable form, e.g. eliminating multiple punching.
The recoding capability provided by the edit program was designed to
achieve the following objectives:
(1) Recode multiple punches (this vas
intended to take much of the burden off cross tabulation and statistieal
routines).
(2) Redo poorly coded multiple level filtration questions (a
filtration question is one where the code for a certain response is dependent upon a response made to a previous question, e.g. "if
reason why").
nonvoter, ask
(3) Collapse data poor questions into more compact ones .
(4) Clean errors in dirty surveys.
(5) Sort disorganized surveys and
inform the user of incomplete respondent decks.
(6) Rearrange structural
format of the entire survey, i.e. group together questions pertaining to
similar topics on the same card.
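Two of these recoding objectives, collapsing a multiple punch and resolving a filtration question, can be sketched as follows (present-day Python notation); the codes and rules are hypothetical and are not the MITSAS edit language.

    # Hypothetical sketch of two edit-program recodes: collapse a multiple-punch
    # column to a single output code, and resolve a filtration question whose
    # code depends on the answer to a previous question.

    def collapse_multiple_punch(punches):
        """Map any combination of punches in the column to one output code."""
        return 1 if punches else 0

    def recode_filtration(voted, reason_code):
        """'If nonvoter, ask reason why': the reason code is meaningful only
        when the respondent is a nonvoter; otherwise mark it not applicable."""
        NOT_APPLICABLE = 0
        return reason_code if not voted else NOT_APPLICABLE

    print(collapse_multiple_punch("147"))                 # several punches -> one code
    print(recode_filtration(voted=True, reason_code=3))   # voter -> not applicable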
A generalized editing system has to meet two of the more prevalent problems in social science data: (1) that of adapting to our methods the obsolete conventions concerning the storage medium (e.g. punch card format), and (2) discrepancy errors in the data itself. The editing system needs to contain a problem-oriented language with sufficient power to edit and/or recode both convention discrepancies and error discrepancies. An editing system needs also to contain an inspection capability for comparing actual data with a protocol and providing feedback for proper editing decisions.
These were the objectives of our first approach to an edit routine.
Despite considerable progress in developing the MITSAS system our
experience with that effort and the ideas generated in our seminar discussions led us to abandon that first approach.
The interfaces between separate
component programs proved extremely difficult to design when not integrated
into a unified system. Furthermore, and even more important, we came to recognize that it is at least premature and possibly destructive of creative analysis to standardize structures for data content.
The MITSAS system from
the beginning had sought to maintain the original data in something very
nearly like its original format.
But, in response to constraints that
arose from the requirements of second generation computers, we did anticipate
at least a modest amount of editing to force the data into economical forms
for storage, for retrieval from tape, and for simplification in analysis.
But as we noted above even this editing involved intolerable amounts of
human effort; furthermore, the availability of third generation computer
resources in Project MAC,
with random access to large disk files, suggested
a quite different approach free from some of the restrictions of MITSAS.
It made it possible for the user to define data structures to meet his
own needs rather than conforming to system requirements.
The goal of a standardized and generally acceptable data structure may
never be achieved and in fact, in a dynamic science, probably should not be
aspired to.
Furthermore, the cost of trying to implement such standards
is considerable.
For example, it took a competent technical assistant over a year to rearrange about 100 surveys according to the wishes of an experienced survey analyst.
In the light of these considerations we scrapped our first attempt
(as must generally be done in developing any complex computer system)
and started over again.
The new approach which we call the ADMINS system has been developed
by Stuart McIntosh and David Griffel.
Their system harnesses computer-
usable interactivity to its fullest extent, in order to allow the user
to re-structure data in accordance with his own desires and to re-organize
the naming of his data.
The system, in order not to lose information and
to permit replication of transformations earlier made in the data, keeps
a record of all changes made and keeps the codebook and the data in correspondence with each other.
The system was designed to meet a number of other criteria too. Analysis of what social scientists do when working with a data base yielded the following design criteria for the system:
(1)
The system must have the capability, under flexible error checking procedures, of keeping (a) the data prototype contained in the codebook, and (b) the data files, in correspondence with each other.
The user must have the ability to change either data or
prototype as he judges necessary.
(2)
The system must allow a social scientist to build indexes from
the data both within a single data file and across data files.
The indexes may be information indexes, a listing of the location of responses bearing on some subject, for example, social
mobility; or they may be social science indices, e.g. a composite measure of social mobility.
These indexes as retrieved
from the computer are a major analysis tool.
The manipulation
of these indexes may take the form of co-occurrence (for example,
cross tabulation) tables or tree construction.
(3)
The system design should be embodied in a computer system such
as Project MAC where there is highly responsive interaction
between man and machine.
(4)
The system must be user oriented.
The goal is to place a
social scientist on-line, where he may interact with his data
without the aid of programmer, clerical, or technical interpreters.
(5)
The system must allow the social science user to provide considerable feedback into its design and embodiment.
The ADMINS system was designed to meet the above criteria.
In rough
outline the substance of the system consists of four sub-systems as
follows:
(1)
The Organizer is a sub-system which permits creation of machine-executable codebooks, both for data processing and data auditing.
(2)
The Processor is a sub-system which uses the executable codebook to transform the data, to bring it into correspondence with that codebook.
(3)
The Structurer is a sub-system for inverting data files in a
way that will permit highly efficient analysis with third
generation computer systems.
(4)
The Cross-analyzer is a sub-system which, operating in an interactive mode, builds indexes and also produces co-occurrence tables both within and across files.
At present, the ADMINS system is being used as a working prototype and is constantly being improved as a result of procedural research and user feedback. The development of the ADMINS system has not been a theoretical exercise.
It
has been developed in the environment of an actual data
archive used by students and faculty members at MIT for the conduct of their
research.
The MIT data archive now contains over a thousand foreign surveys
with several hundred American surveys on order.
The overwhelming majority
of these materials come from the Roper Public Opinion Research Center in Williamstown. The MIT archive is a member of the International Survey Library Association established by the Roper Center. Our foreign survey collection contains the equivalent of approximately 1,250,000 cards.
We are developing a computer usable catalog of this collection of
codebooks.
We can use the ADMINS system to analyze our collection so as
to find the codebook description which fits the analyst's particular interest.
We can also use ADMINS to analyze the data file described by a par-
ticular codebook.
Also, where the codebooks have been designed by CENIS researchers we are making the codebooks machine readable and then developing methods for producing adforms, i.e. machine-executable codebooks, from these machine readable codebooks. Further, we are developing computer-usable classified indexes to these machine readable codebooks which we will integrate into the Roper item index. We will be able to transfer this experience to our total collection when all of the codebooks become computer usable.
Attached is a list in Appendix 2-A of the MIT CENIS surveys whose codebooks we are making computer usable in support of the required analysis.
Finally, there have been several classroom sessions and innumerable tutorial sessions with the fifteen or so most active users who are learning to use the ADMINS system for their particular substantive analysis. A list of ADMINS users is attached as Appendix 2-B.
Appendix 2-A
Professor Frank Bonilla in his study of the Venezuelan Elite has
biographic and career data for 200 respondents on which he is doing a
trend analysis using Admins.
Professor Daniel Lerner administered attitudinal surveys to elite
panels in three European countries (England, France, Germany) over five
time periods (1955,
1956, 1959, 1961, 1965).
The analysis of this con-
temporary multi-source data has already begun on Admins.
This research
involves analyzing attitude change towards the concept of the European
community vis-a-vis the Atlantic community within these three countries
over the designated time period.
Professor Ithiel Pool is analyzing (in conjunction with a mass
media simulation of the communist bloc) eighteen surveys administered by
Radio Free Europe over five eastern European countries during the late
fifties and early sixties. Here too we have contemporary social data of a multi-source nature. This research involves analyzing the effect of mass media and communications--specifically Radio Free Europe--on political attitude change in five eastern European countries over the designated time period.
Professor Jose Silva (in conjunction with the Cendes project at
MIT CIS) has a survey administered to forty social groups (e.g., students, priests) in Venezuela in the early sixties. He wishes to build social indices which are applicable across these many groups. Professor Silva has already used an earlier version of Admins to cross two source files (e.g., social groups), building four social science indexes. These social indices (e.g., propensity to violence, want-get vs. satisfaction) when built across the differing social groups will provide initialization parameters for a dynamic social simulation of Venezuelan society.
Professors Ithiel Pool and George Angell have biographic and
attitudinal surveys administered to 200 MIT sophomores taking an experimental
introductory course in social science, and 100 MIT sophomores in a control
group.
The attitudinal surveys were administered both before and after the
course was given.
As well, results from psychological tests, histories at MIT, and admissions data for these 300 were available. Here we have five different data sources prepared at different times but describing the same population; an integrated analysis of this data is planned on Admins.
This research is studying the political and social attitudes of MIT undergraduates--who are viewed as a prototype group for science undergraduates
at other universities--and the effect of a sophisticated introductory social
science course on these attitudes.
In the elite European panel data already described there exists
a cross reference file of people who repeated over the time periods, i.e., we have a record of the attitude changes over time of individuals. Professor Morton Gorden wishes to analyze the "repeaters", which are a sub-
group of the entire panel, on Admins using the multi-level capabilities
of the analyzer to relate the identical individuals over time.
Professor James Beshers has urban data for the Boston area organized
by individual, by family unit, and by residence district.
In his study of
migration he needs to trace individual movement by relating individual,
family and residence subfiles. Summarized migration data will be input to a dynamic Markov model of migration flow. Results produced by these models will feed back into the data analysis.
Appendix 2-B
ADMINS USERS
Name                 Research Topic

J. Beshers           Urban Planning
P. Alaman            Urban Planning
I. Pool              Eastern European Surveys
G. Angell            Educational Data
F. Bonilla           Venezuelan Elite Study
J. Silva             Venezuelan Survey
D. Lerner            European Elite Interviews
M. Gorden            European Elite Interviews
F. Frey              Turkish Peasant Survey
A. Kessler           Turkish Peasant Survey
Course 17.91         Student Research
Course 17.92         Student Research
David Griffel        Data Archive
Stuart McIntosh      Data Archive
Part 3
THE CURRENT ADMINS SYSTEM
Our purpose in the design of the current ADMINS system is to analyze
data at a console, starting with the rawest of materials and ending up with
analyzed measures and arrays.
To do this ADMINS provides highly interactive
sub-systems that will (1) allow the necessary clerical operations so that
the social scientist can re-structure the content of his data by recoding
his variables and by re-grouping them; (2) provide a re-organizing capability by allowing the social scientist to name the grouped data according
to his purpose; (3) allow statistical manipulations on the named data files.
The data structures that we deal with are basically matrices.
There
are N records, each record describing an individual or social aggregate
of some sort. The structure of the data in each of these records is specified by a codebook.
Our objective is to bring this codebook 'scaffolding'
into
correspondence with the data by a series of clerical operations that are
interactive.
That is,
we can perform a clerical operation,
refer to the
result of this operation, and with this new evidence proceed to another
clerical operation.
We must also arrange our clerical operations so that
we can work in more than one file of data at the same time, because our
analysis will
require that we can refer to and use data from different
files, create new files, name new data sub-sets and indeed completely reorganize the original data according to our current purpose.
We have to be
able to save the results of our operations in a public and explicit way so
that they can be resumed and/or replicated, and at all times the codebook
and data must be in correspondence.
These activities must be computer
based (not people based); what is required of people is a knowledge of the
clerical operations, an ability to name the results of their operations
and a purpose that they can make public vis-a-vis specific data. The following descriptive narrative indicates how ADMINS is used for the analysis of social data.
The Organizer Sub-System
The codebook that describes the data does not usually come to us in
computer usable form.
In order to make it
computer usable we type in at
the console the questions and answers in the codebook for that portion of
the data which we wish to analyze, and add to this a variety of control
statements.
The information typed at the console we call an adform, short for administrative form.
This is normally typed selectively by the ques-
tions of interest to the user, rather than typing one codebook out from
beginning to end, though it can be done either way.
The adform is input to the Organizer sub-system. If the researcher starts by typing only that which he wants to use right away he does not limit himself later on, because data resulting from different adforms can be combined for analysis at will.
The codebook describes the state of the data as it ought to be. Assuming there were no errors in the actual recording of the data, the codebook will be an exact description of the data. Of course, there might have been errors in the codebook descriptions as well. Because errors are inevitable and in fact turn out empirically to consume a major portion of the analyst's time, auditing capabilities are provided. The function of the audit is to take the codebook description of the data as it ought to be, and find if the data compares exactly with the codebook description or not. When it does not, there will be an error message of an audit nature telling us of the discrepancy (the lack of correspondence) between what the codebook says there ought to be against what the data is saying there is.
We cannot emphasize too much the importance of providing error correcting operations.
If we learned anything from the series of seminars
and data analysis experiments that we conducted it
was that errors of one
sort or another take at least half of the time of data analysts.
The particular computer system configuration that we use, the MAC system here at MIT, has several different programs for editing textual input
data of moderate size,.
One of these programs is called EDL.
program we use to edit adforms.
It
is this
In other words, if we make typographic
errors of one kind or another typing in our adform, we use the various
change,
delete, retype, printing features of EDL to allow us to pick up
our typographic errors as we go along and change them.
editing in the generic sense.
editing also.
This is called
There are many other uses of the word
For example, the word editing is used sometims to apply
to the change that one makes in transforming an input code (i.e. the list
of responses under a question as it cones in)
into an output code (i.e,
the revised grouping of these for analysis)
There are programs called
editing routines that do this and this is a very different kind of editing
to which we refer to below.
In the ADMINS system the social scientist never alters the actual source data, even if he considers it to have errors in it. He alters it for his use if he so desires. He does such altering of data after he has processed the codebook and the data file under the organizer-processor loop to find out discrepancies, i.e. lack of correspondence between the adform and the data. The analyst can then either change the codebook or change the data as he concludes he ought. The function of altering the data can also be called editing, and we have programs which will allow us to so alter the data. The generic term that we have applied to the act of altering the input record to the output record for all of the respondents in a given condition is "transformation", and the term we have used for changing a particular record condition after one discovers an error in the data is "altering".
The output from the organizer program, which has taken the adform as input, is, in effect, a machine-executable codebook that knows everything there is to know about the state of a particular data file. This organizer program is a kind of application compiler, and one is, in effect, processing this normative "data" (which is another use of the word processing): the adform is input to this application compiler program called the organizer, where one first of all organizes in what we call a diagnostic mode. In this mode one gets back two kinds of error messages. If there are some syntactical errors in the adform, that is, one has made some kind of transformation statement or audit statement incorrectly according to the syntactical conventions of adforming, one will get an error message. One also gets error messages of another type, for the organizer checks on adform incoherency as well. For example, one has nine subject descriptions, nine entries, but ten codes mentioned for transformation. This will be described as an error in the diagnostic mode of organizing the adform.
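The coherence check just described (nine subject descriptions and nine entries, but ten codes mentioned for transformation) can be sketched as follows (present-day Python notation); the adform fields are hypothetical.

    # Hypothetical sketch of one organizer coherence check in diagnostic mode:
    # the number of entries under a question must match the number of codes
    # mentioned in its transformation statement.

    adform_question = {
        "name": "occupation",
        "entries": ["professional", "clerical", "manual", "farm", "student",
                    "housewife", "retired", "unemployed", "other"],   # nine entries
        "transformation_codes": list(range(1, 11)),                   # ten codes
    }

    def diagnose(question):
        messages = []
        n_entries = len(question["entries"])
        n_codes = len(question["transformation_codes"])
        if n_entries != n_codes:
            messages.append(
                f"{question['name']}: {n_entries} entries but "
                f"{n_codes} codes mentioned for transformation")
        return messages

    print(diagnose(adform_question))   # flags the incoherency for this question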
There are a large variety of messages that the computer gives about syntactical and coherence errors when it is running in the diagnostic mode. It describes these errors in effect by running a historical commentary which starts at the beginning and continues until, in effect, the complexity that ensues by one error sitting on top of another so disorients the beast that it stops 'talking'. However, these error messages are isolated for the user by the question category in which they occurred and for that part of the statement within the question, i.e. the adform statement within the question, with which they are concerned. It is quite easy for the person to go to this error message at the console and relate it to the particular part of the adform where the trouble occurred. As the first set of errors are eliminated, deeper ones will be flagged on the next round.
After two or three rounds of this organizer diagnostic loop one eventually clears out all of one's syntactical and coherence errors. The computer informs one that the adform is within itself error-free. One then runs the adform in a non-diagnostic mode. This results in outputting a machine-executable codebook, which contains programs for auditing and transforming data, tables of subject description (e.g. question and answer text), tables of format locations, and so on.
The Processor Sub-System

After organizing the adform one then usually puts some data on the disk. Note that up to now we have worked only on the codebook without having to have any data. We now wish to process data under this machine-executable codebook, first to look for errors in correspondence between the two. We have a program called the processor that is, in effect, a program that runs the data under the controls in the executable codebook. This process program has many different types of error statements and different types of modes of use. One can sample items of the data as one is processing them, one can set for errors by category, one can set for the total numbers of errors, one can run in silence mode but say with error verifications every 100 errors, one can operate in dummy mode, and so on. That is, there are many different types of control available during the running of the processor.
The provision of this variety of controls can save enormous amounts
of computer time.
Instead of running from beginning of the file to end
and tabulating every error, one can tell
the program to scan the data
until N errors have been found, stop there and report.
Thus, corrections
can be made as soon as enough data exists to spot them.
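Such a stopping rule can be sketched as a simple loop (present-day Python notation); the record layout, the rule checked, and the control names are hypothetical.

    # Hypothetical sketch of a processor control: scan the file, check one rule
    # per record, and stop as soon as N errors have been found rather than
    # tabulating every error in the whole file.

    def scan_until(records, is_valid, max_errors=100, sample_every=1):
        """Return the positions of the first max_errors offending records,
        optionally testing only every k-th record to save time."""
        found = []
        for i, record in enumerate(records):
            if i % sample_every:
                continue
            if not is_valid(record):
                found.append(i)
                if len(found) >= max_errors:
                    break          # stop there and report
        return found

    records = ["1", "3", "12", "", "2", "99"]
    print(scan_until(records, lambda r: len(r) == 1, max_errors=2))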
The result of the running of the processor is what we call an error report. An error report is in effect an analysis of errors by data category. For every category of question which is being processed, we have the errors which occurred in this category by type of error. This report is very useful. If we have, say, 150 errors and we found that 149 were all due to the same discrepancy, then one has a very useful overview as to what one wishes to do: whether one wishes to change the codebook for the 149 errors, which seems the likely thing to do when it is probably an error of interpretation, or whether one wishes to go in and alter the data; there are tools for either. There is the edit of data--the alter instruction which we discussed previously--when one wishes to change the data. There is EDL if one wishes to change the adform with the error report in hand. Since there are usually many errors, one has to think whether one is going back to change the adform in one of many different kinds of ways or whether one is going to alter the data somehow.
When one has made the necessary changes to codebook and/or data, one
runs the data again under control of the now corrected adform.
Another
type of summary file that one gets from this processing of the data under
codebook control is what is called the marginals, that is the aggregate
frequency of each entry under each question or category that we have chosen to process. The marginals are in effect the initial information necessary for one to implement one's analysis plan.
The marginals are obtained in a
report form where the frequency for each entry appears alongside the ordinary
language subject descriptions by which it
is designated in the adform.
Such processing of data could just as well be done without the user
sitting at the console if the data were in good shape, if there were no errors of one kind or another.
However, our experience to date is--or to
say it more particularly--we have never yet processed a data file where
this was the case.
Errors in data go from a variety of unique, scattered,
individual type errors that are not particularly errors of quantity per se
(they are errors of consistency) to the other end of the spectrum where the
data are not at all what the codebook says they are supposed to be, that is,
the data are not there or are not there in the code format that one expects
them to be in.
In either case, the interactive capability helps one to
learn very quickly what way the data process is going, whether one wants
to quit and go back to do something else again in the adform, or get some
other data, or to learn as one goes finding what exactly is wrong with any
given particular data and decide what to do with it.
In the ADMIIS system,
at the same moment that one has finally established that one's data are
clean and correct as far as one cares (for minor substantive errors can be
labelled as such and left), one has one's marginals automatically.
In order to accomplish the processing the user has to specify, for the data he is interested in, and only the data he is interested in, what state it ought to be in. He need not talk about the data he is not interested in, and if the data he is interested in is not as it ought to be the program will tell him so. He does not have to state in what way it ought not to be, only in the way it ought to be. The user specifies only the data he is interested in and the condition that it ought to be in. The computer, or more correctly the ADMINS program designed for this purpose, is programmed to ignore the data that the user is not interested in and to tell him what of the data he is interested in is not in the condition that it ought to be. This means that the user does not have to state all of the possible 'ought not' conditions for data, but only the "ought" conditions.
It
should also be noted that there is an append feature that allows
one to process data at different times and combine the results.
The append
feature would be used in processing under the following circumstances.
Let
us presume that we had previously done a survey in two interviewing waves
and that we only wished to look at one wave of this survey, and then append
the other wave later.
Or, let us assume that we wished to select from a
particular survey in some specified--random or otherwise--way some of the
survey respondents to process and later go and look at some of the other
survey respondents and append them also to the previously processed data.
This permits one to complete one portion of the analysis without waiting
to establish that there are no errors in other portions of the data first.
In addition it permits the saving of computer time by sampling as one is
processing, by random choice over ID numbers,
for example.
The result of the process operation is that we have a file of marginal
frequencies and a report file in which there are zero errors if we carried
the correspondencing of codebook and data to the point of clearing out all
of the errors.
We also have a file of data, and, of course, from the result
of the Organizer sub-system we have a subject description file for the data
(e.g. the codebook) and a file of, in effect, basic control information about
the data state.
The subject description file, the control information file, the marginals (which we called the aggregates file), and the errors file are all
in a sense intermediate files generated by ADMINS for use by the ADMINS
system during the operations that we call organizing and processing.
We
needed an error report file to help us to decide what to do with errors.
We needed an aggregates file in order to help us decide what to do about
analysis, we needed a subject description file because it
can be used in
analysis, in effect to label the data that we are analyzing, and we needed
a control information file to assist the system to know everything about
the state of the data.
The user in carrying out the activities that generated these files has
not had to think about computer technique problems.
He has focussed solely
on substantive questions about his data and what he wants to do with it.
We have designed into the ADMINS system the responsibility for the system
system the responsibility for the system
to attend to problems of form and packing and not tie the user as he is
tied in conventional cross tabulation systems to the very tedious, tiresome,
detailed, finicky specifications of form and packing.
We now come to a rather technical point, but one which lies at the heart of the efficiency of the ADMINS system. The usual form in which data has been stored and handled in virtually all social science data processing systems--since the days of the punch card counter-sorter--is what we call an item record file. For each respondent, or other unit of analysis, we keep a file of each item about him in some set sequence; e.g., the IBM card is about an individual and records his answer to each question in turn. It is logically obviously possible, and for certain purposes it is much more efficient, to invert the method of record keeping, to make the unit record a particular answer (or other category) and under that to list in sequence all the individuals who gave such an answer. The inverted file is what we call a category file.
The advantages of working with a category file will become clearer as we talk about the problems of social science data analysis in the discussion of the Analyzer sub-system. Here let us simply take an extreme case. Suppose in a cross section sample of the population one wished to find the Ph.D.s who were non-voters. An analysis using item records would require the processing of every record to ascertain if the respondent was a Ph.D. and if he was a non-voter. An analysis using category records would go to the relatively short list of Ph.D.s and to the relatively short list of non-voters and make up a new very short list of the intersection of these two short lists.
In an interactive operation at a console, where the analyst asks one specific question at a time and then goes on to another specific question depending on the answer to the first, it is almost certain that much efficiency will be gained by keeping the data in category form. The analysis is proceeding in terms of categories, no more than a half dozen or so at a time.
The researcher ought to be able to pick up a category without having the computer scan every one.
For that reason, once the Organizer and Processor operations have put the data into shape for analysis, but before the analysis begins, we invert the data file--that is, the process file. A process file, which is an item record file in the ADMINS system, along with the control information files from the organizer output and the subject description file and the aggregation file that was the basis for the marginals, is input to a program which is called the structure program, which could equally be called the invert program. This program outputs what we call category records. We in effect take the items from the item record file, and invert them to categories; within each category is the control information for this category, the subject description for this category, the number of items that exist and the responses for each item within this category; this with the aggregation of responses for all the items under this category composes a category record. It is these category records which are input to the analyzer.
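The inversion from item records to category records can be sketched as follows (present-day Python notation); the question and records are invented, and the sketch omits the control and subject description information that a real category record carries.

    # Hypothetical sketch of inverting an item record file (one record per
    # respondent) into category records (one record per answer category,
    # listing the respondents who gave that answer).

    item_records = [
        {"id": 0, "religion": "Protestant", "education": "college"},
        {"id": 1, "religion": "Catholic",   "education": "none"},
        {"id": 2, "religion": "Protestant", "education": "none"},
    ]

    def invert(records, question):
        """Build category records for one question: answer -> list of item pointers."""
        categories = {}
        for record in records:
            categories.setdefault(record[question], []).append(record["id"])
        return categories

    print(invert(item_records, "religion"))   # {'Protestant': [0, 2], 'Catholic': [1]}
    # Picking up one category no longer requires scanning every item record.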
The Analyzer Sub-System
Once a file has been inverted, the ADMINS system's Analyzer is able to
go to the disk and read into core just the categories necessary for a specific
operation.
It is secure in knowing that as soon as these categories have
been used for this specific purpose it
can delete them in core because it
always has readily accessible copies of them on the disk.
This swinging of
categories in and out of core allows a lot of the flexibility which characterizes the analyzer sub-system. Such a mode of operation, which makes full use of the available 'fast' memory that the computer has, is almost impossible in a tape based system.
use of the available 'fast" memory that the computer has, is almost inpossible in a tape based system.
This is due to the fact that a magnetic
tape is a serial access device, that is,
in order to read a section of
tape, a tape head has to be mechanically moved along the tape till the appropriate section is reached and then the reading can begin. With a disk, however, it is as if one had many tape heads moving along different parts of
the disk, all under control of a program residing in core.
(This is not how a disk really works but it is a useful way to think about it.)
An
efficient system is the one that has to bring into core only that material
which one wishes to analyse, i.e. the wanted categories.
The basic thing to understand about what one is doing when one
analyzes a file is this:
one is seeking out particular characteristics of one or more categories, combining the characteristics thus sought, and then identifying the individuals who have the intersection of characteristics. This process can become exceedingly complex as one becomes more and more elegant in the combinations of characteristics that one pursues. What one is doing when one analyzes a file is that one is re-structuring
the content, and re-organizing by giving new names.
The other way that one proceeds in
analysis is
that as one goes along
one brings statistical tests to the measurements that one has isolated by
the first kind of analysis.
One need not do this but one can.
For example,
one can go to a personnel file looking for the characteristics of various
different types of people according to some purpose, looking at their
education, their language skills, their geographical area experience, and
so on.
One can come up with the individuals who have certain distinctive
characteristics, and leave it at that. One brings no statistical test to this measurement at all. However, if one is looking at a file from the point of view of social science, one normally brings some variety of statistical tests to these measurements according to the scientific
purpose one has.
The activity of isolating characteristics we call indexing. What is indexing? An index answers the question: where are the people located who have a particular characteristic we are pursuing? This location is, of course, the location in the record, not the location out there where real live people live. In other words, when we know where in the record these people are who have a particular characteristic and how many of them there are who have this particular characteristic, we have what we call an index to that particular characteristic.
In any particular category record, there are two types of characteristics that one uses.
There is the characteristic that describes the form of the record (e.g. the ID number of the record); that is, a characteristic relevant only because there exists a record.
However, most of the
characteristics described are characteristics of the content of the record;
they tell
you what occupation, what age,
what name, what social security
number, and so on characterizes the object (here a person) described by
the record.
In ADMINS we can re-structure the file according to what is in the content of the record. This is something which is most difficult on conventional analysis systems. Magnetic tape is not an addressable medium of embodiment, but disk is. Re-structuring implies re-structuring of data pointers. If you cannot address a tape flexibly, you cannot re-structure the data on it without actually physically moving data around which, if only for economic reasons, is impractical. However, a disk is addressable, so we can in effect simulate the re-structuring of the file by really moving data pointers around--pointers which addressably reference data existent on the disk.
The generic term to be used for this activity is that of indexing. An index states that such and such characteristics of a record exist at a particular place. When you have many indexes you are in effect looking at the records in the file from the point of view of those which you have indexed, as opposed to the point of view by which they are in fact physically in the file.
Let us give an example of an index before we go any further. If some
If som
respondents were asked, "what is your religion?" and the responses that
they were allowed to choose from were names of different religions, for example, Protestant, Catholic, Jewish, Mormon, and no religious affiliation, then an index to the Protestant religion would be an index to the
code representing the Protestant religion in the particular question
asked; it would be an index listing the persons who took this option in
response to this religion question.
The ADMINS naming mechanisms are used to re-organize such indexes
according to the user's purposes.
If
one has indexed Protestants and has
also indexed persons with university level of education then the combination
of the characteristics "Protestant and university educated" would be the
result of the intersection of these two indexes.
This new index could be
named "Establishment" and could be called upon whenever it
is needed by
That is the name in this case may be the name for
referring to that namea
a concept in a social theory under which the data are to be re-organized.
The term intersection is borrowed from the language of set theory in
mathematics. For example, we could have a set, say set A, containing the names of men wearing green shoes. A second set, B, could be the names of men wearing white shirts. The intersection of these two sets, A and B, would
be the names of men wearing green shoes and white shirts.
Notice that an
index of the contents of the set is the names of the individuals which
belong to it.
That is,
given an indexed set, we have a way of tracing back
to the individuals who make it
up.
In an ADMINS index, the analogy to these
names of individual elements are pointers to the individual items.
The
basic indexes are defined in terms of characteristics recorded in the
categories and entries describing each item.
The instruction "intersection"
can be used to combine indexes constructed in this way to obtain an index
which contains pointers to people who were in all of the original indexes.
The intersection instruction can, of course, be used on indexes which were
constructed by previous intersection instructions.
In our earlier example we had two simple indexes. That is, we had two indexes which were built up from the categories and entries of the individual responses.
The first
was an index to Protestants and the second was an index to university
educated people.
The intersection of these two indexes gave us an index
which pointed to people who were both Protestants and who had university
training.
This index once constructed was named and could be referenced
in further indexing instructions.
Another instruction in the analyzer, whose name we have again borrowed from the language of set theory, is union. Whereas intersect gave us an ANDing of the original indexes, the union instruction gives an ORing of the original indexes.
To return to our example,
if we unioned an index of
Protestants with the index of university-trained people, we would obtain
an index that had the following types of people in it:
Protestants who
did not have university education, Protestants who did have university
education, and university educated people who are not Protestant.
That is,
each member of the new index must have been in at least one of the original
indexes, and perhaps may have been in both of them.
The complement instruc-
tion is used to construct an index whose members are not members of the
original index.
The relative complement is quite similar to complement except that it only deals with a subset of our total population.
The effect we have then is the following. One can build simple indexes by referencing categories and their entries. One can then build constructions based on these indexes using the indexing instructions: intersect, union, complement and relative complement. The results of these constructions may be input to further instructions. This process can continue indefinitely until the purpose of the user is satisfied.
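As a rough modern illustration of these four instructions (and not the ADMINS implementation itself), the sketch below treats indexes as Python sets of record positions; the function names simply mirror the instruction names.

    # Set-theoretic instructions over indexes, where an index is a set of
    # record positions.  Illustrative only.
    def intersect(a, b):
        return a & b                     # ANDing: members of both indexes

    def union(a, b):
        return a | b                     # ORing: members of either index

    def complement(index, population):
        return population - index        # everyone not in the index

    def relative_complement(index, subset):
        return subset - index            # complement taken within a subset only

    population  = set(range(10))
    protestants = {0, 2, 5, 7}
    university  = {2, 3, 7, 9}

    establishment = intersect(protestants, university)   # could be named 'establishment'
    print(establishment, union(protestants, university), complement(protestants, population))

The result of any of these operations is itself an index and can be fed to further instructions, which is the sense in which the process can continue indefinitely.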
The index to a particular characteristic of a category is not the only type of simple index that may be constructed. One may construct an index that is the category itself; one may construct an index to a numerical value of a particular category. Nevertheless, one knows what has been indexed and one knows what further operation one wishes to implement on the construction. After the user gives a particular indexing instruction he sees the number of people in the index (raw figure and percentage of population) and the name he has assigned to the index.
This information is immediately printed on the console in an interactive mode for him to make decisions of a subsequent indexing nature. When the index is the result of an intersection the computer also returns the statistical significance, for example, the probabilistic measure of non-randomness for this particular intersection.
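The report does not specify which probabilistic measure is computed; as one hedged illustration only, an intersection can be compared with the overlap expected if the two characteristics were independent.

    # Compare the observed size of an intersection with the size expected under
    # independence.  This is an illustrative stand-in, not the ADMINS statistic.
    def expected_overlap(n_a, n_b, n_total):
        return n_a * n_b / n_total

    n_total, n_prot, n_univ, observed = 1000, 400, 300, 180
    expected = expected_overlap(n_prot, n_univ, n_total)   # 120.0
    print(observed, expected, observed / expected)         # a ratio well above 1 suggests association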
Suppose a user does not have his adform by his side and cannot remember what all the categories and entries represent.
The "subject descrip-
tion" instruction will allow the user to get the subject description and
marginals for a particular category, either in toto, or selectively for the question, or for some of the entries for this question as required.
The analyzer also has a directory capability, i.e. a capability for keeping a record of the names of the ongoing constructions as the user pursues a particular analysis. We will talk about this in detail later. By constructions we mean, of course, that which results from using the set theoretic instructions. (We call the instruction 'cross', which is typed at the console to invoke the analyzer, a command. We call all of the instructions which are used in the analysis, and which we have mentioned so far--index, intersect, union, complement, relative complement, and subject description--instructions.)
Constructions can be very complex and each construction is known by the name which is assigned to it by the user. In the classified directory, previously mentioned, one has in effect a contents list of the constructions and their names. There are other instructions for gaining access to the information in the classified directory, which we will discuss later.
The construction of complex indexes may, if it is convenient to the user, be perceived as passing one's way down a tree. However, when one has arrived at a particular cluster of characteristics (that is, a complex index) one may wish to take the complex indexes and make them into the columns of a table. One may choose to put in the rows of this table, for example, two categories, that is, two questions with their entries. By invoking the table instruction one gets the result of the intersection of the indexes in the columns with the category entries in the rows.
This information may be supported by statistical tests of these intersections such as simple percentages of one kind or another or a probabilistic measure of non-randomness or any other statistical test the user may think appropriate. One may also put indexes in the rows of the table. The way the user uses the tabling facility is to put up in the columns as many indexes as he can keep in analysis attention space against the various categories and indexes in the rows, bringing tests with which he feels comfortable to these measurements. He keeps taking the results of the intersections (that are in the cells of the table) which are of interest
to him, making new complex indexes out of them, putting them back up as columns
of the new table, introducing other indexes or the categories to the rows, and
working through what is in effect a data-sieving cycle in order to get out of the data the significances that he is seeking.
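The tabling idea can be suggested by the following sketch; this is an illustrative Python rendering, with hypothetical names, of intersecting column indexes against the entries of a row category and counting the cells.

    # Each cell of the table is the size of the intersection of a column index
    # with the index to one entry of a row category.  Illustrative only.
    def table(records, column_indexes, row_category, row_entries):
        rows = {}
        for entry in row_entries:
            row_index = {pos for pos, rec in enumerate(records)
                         if rec.get(row_category) == entry}
            rows[entry] = {name: len(col & row_index)
                           for name, col in column_indexes.items()}
        return rows

    records = [
        {"religion": "protestant", "education": "university"},
        {"religion": "catholic",   "education": "university"},
        {"religion": "protestant", "education": "secondary"},
    ]
    columns = {"protestant": {0, 2}, "catholic": {1}}
    print(table(records, columns, "education", ["university", "secondary"]))

Cells of interest can then be taken up as new indexes and put back into the columns of the next table, which is the data-sieving cycle described above.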
Just what does the user have to do to, say, get from the computer a table with five columns with indexes in them and two or three questions plus some indexes in the rows? The user builds each of the five indexes according to the discussion that has gone previously. Similarly for any indexes in the rows. He has, of course, thereby named his indexes. He has nothing to do to bring in the categories except to type their names in the relevant row instruction. He then names a particular statistical test that he wishes to invoke with the 'stat' instruction, invokes the 'table' instruction, and consequent upon this the table is constructed. He may have the table come out at the console for interactive decision making. It may at the same time be being filed on the disk file of the given name that he has allocated for this table.
The naming restrictions for files are the restrictions of the MAC
system.
We have two names of six characters each for a file.
This, however, should not be confused with the naming conventions for indexes in the Analyzer.
In effect, the naming conventions for indexes can be considered from two
points of view.
We call a name for an index that is six characters or less,
a symbol, and a name that is greater than six characters a name. Thus, an index can have associated with it a symbol and/or a name. Furthermore, the way one chooses to use the symbols and names may be considered from the point of view of faceted use. That is to say, symbols may be used rather in the way the Dewey decimal classification uses symbols.
That is,
one can break down
a symbol into sections.
The differences between sections of the symbol string are facets of the concept one is naming.
Likewise, one may do the same using
names, with periods delineating variable length sections.
One could have three
word forms separated by two periods that constitute a name for an index.
Each
one of these word forms is a facet within the particular use of the name for
this index.
This is rather similar to the Ranganathan Colon classification use of names.
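A small sketch of faceted matching, in Python and with an invented '*' convention for an open facet (the actual ADMINS request syntax is not reproduced here):

    # A name is split into facets at the periods; a request may fix some facets
    # and leave others open ('*').  Illustrative only.
    def matches(name, pattern):
        n, p = name.split("."), pattern.split(".")
        return len(n) == len(p) and all(pf in ("*", nf) for nf, pf in zip(n, p))

    names = ["elite.attitude.nato", "elite.attitude.eec", "mass.attitude.nato"]
    print([n for n in names if matches(n, "elite.attitude.*")])
    print([n for n in names if matches(n, "*.attitude.nato")])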
The analyzer has thirty or forty error messages as to what is incorrect instruction usage which may occur under the pressure of analysis. We mentioned the existence of coherence errors when we were discussing adforming. Likewise, it is possible to make coherence errors during analysis. For example, one instructs the analyzer to build an index to a non-existent entry in a category, say to the ninth entry in a category which only has five.
Or one
instructs the analyzer to build an index to a category but this category does
not exist.
Or by the type of syntactic expressions one uses in instructing
the analyzer to build an index one is in effect saying that one believes the
category one is
referencing is nominal whereas the analyzer knows it contains numerical values. Or one references the name of an index but this index does not exist.
Or one builds a complex index, but this complex index has already
been built, in which case the analyzer just tells you it has been built and
gives you the name you previously assigned it.
If the analyzer detects an
error, an informative but polite comment is firmly printed and the user is
allowed to continue.
Although we have just begun to talk about the analyzer,
already we have
referred to quite a few different instructions with different options in their
syntax.
How can a forgetful user keep track in his mind of all the possible
options the syntax offers, especially during actual analysis?
There is an instruction that the user can give. The console will then list the various instructions available in the analyzer. There is also a 'syntax' instruction which, when invoked along with the name of an operational instruction, will give the appropriate syntax for the use of this
particular instruction.
We have mentioned on several previous occasions the classified
directory.
This is a very important feature of the analyser.
We have to
remind ourselves that when we are building indexes to the original file of
individuals we are in effect simulating the construction of a new file of
the individuals who exhibit the relevant characteristics which define these
complex indexes.
As we build many of these indexes we have to invent some means of keeping a usable record of them. The classified directory is the means by which we do this. In a classified directory there is a record of all of one's constructions and the names for each of these constructions. The classified directory also contains information as to how to rebuild these constructions on request; under control of a complex memory purging algorithm, the analyzer throws away all of the pointers to the locations of the individuals who exhibit particular characteristics, but using the classified directory it knows how to rebuild these constructions.
Further-
more, as the only way a user has of accessing the indexes he has constructed
is through the names he has given to these indexes, the classified directory
must keep these names in an accessible fashion, and there can be many, many
names.
Therefore, we need several instructions for accessing the names in
the classified directory in selective ways and we also need procedures which
look at the classified directory in order to rebuild the constructions as
required, so as to rebuild the old constructions that were originally named.
If the classified directory were only concerned with the analysis of one file
this might appear to be a relatively simple housekeeping matter.
However,
as the analyser has been designed to work many files at the same time or
many files discontinuously, that is to say, work one file, stop working
this file, go work another file, come back and work the file you were
working previously and so on, it is evident that the classified directory
assumes a very important role as it
activities.
is responsible for recording all these
Furthermore, when a user has completed a certain amount of
analysis and ceases to work, one has to have economic and intelligent ways
of saving the results of his operations for further use when he comes back
to the analyzer and wishes to start off analysis exactly where he left off
before, instead of going back to the beginning again and redoing work that
he has already done.
The simplest instruction for seeing the contents of the classified
directory is the list
instruction which requests that the analyzer print
the classified directory out on the console.
That is,
for each index,
it will print out the name, the symbol, the number of people falling under
that index, and the line number where this index was constructed.
The line
number can be used as a cross reference to the line in the console typescript where the instruction was given.
The next level of complexity would be to ask for specific indexes by name, that is, give the 'list' instruction followed by names and/or symbols and the computer returns the listing just for those names and symbols. Alternatively, using the facet feature one could ask for indexes, in a particular facet of which there are certain word forms, and another facet of which may bear any word form. Likewise for symbols. The 'list construction' instruction, that is, the 'listc' instruction, is similar to the list instruction except for each index printed out additional information is provided.
That is, the listc instruction displays the actual sequence of instructions that were used in order to build up this construction. Again, one may use the implicit operations, requiring indexes with certain explicit facets and leaving other facets implicit. The 'allist' instruction is again similar to the list instruction except that for each index printed out the user is also told in which source file this index has been constructed, as opposed to the list and listc instructions where the printing out is ordered by source file.
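The following sketch suggests, in present-day terms, what a classified directory entry might carry and how list and listc differ; the field names and layout are hypothetical, not the ADMINS record format.

    # A directory entry records the name, symbol, size, typescript line number,
    # and the instructions needed to rebuild the construction after its pointers
    # have been purged.  Illustrative only.
    directory = {
        "establishment": {
            "symbol": "estab",
            "count": 112,
            "line": 47,
            "recipe": ["index protestants", "index university", "intersect"],
        },
    }

    def list_entries(directory, names=None):
        """'list': name, symbol, count and line number for each index."""
        for name, e in directory.items():
            if names is None or name in names:
                print(name, e["symbol"], e["count"], e["line"])

    def listc_entries(directory, names=None):
        """'listc': as list, but also the instructions that built the index."""
        for name, e in directory.items():
            if names is None or name in names:
                print(name, e["symbol"], e["count"], e["line"], e["recipe"])

    list_entries(directory)
    listc_entries(directory, names=["establishment"])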
One may think of the classified directory as a hierarchical structure in the sense that every index may be imbedded under some more complex index; and, in turn, under any complex index exist some simpler indexes from which the complex index was constructed. There are instructions which allow one to view the classified directory in this way. The 'sources' instruction will cause the analyzer to print back those indexes which were the sources of the construction the user referred to by name in his sources instruction. For example, the sources of 'establishment', i.e. university educated Protestants, are the index of name 'protestants' and the index of name 'university educated'. That is, the sources of an index are those indexes one level below the source index in the hierarchy.
In effect, the inverse of this instruction is the 'result' instruction. This shows, for those indexes requested, which indexes are constructions based on these indexes. For example, one of the results of 'protestants' is 'establishment'. And one of the results of the index 'university educated' is also 'establishment'. One may also use the faceting capability of the names to build a hierarchy of names and display this hierarchy with the list and listc instructions using the implicit request, that is, requesting display of indexes where one or more facets are specified and the other facets are left open.
The analyzer also contains instructions for various recoding operations deemed essential for the analysis process. For example, one may take two category records and attempt to equate them item by item resulting in a new category which is the ratio of the two previous ones. For example, one has the ability to build a category where the value for each item is the number of indexes out of a group of indexes in which that item was present. This is useful in index construction using majority or threshold logic. That is, classifying a person as falling under a certain concept if he possessed a majority of a group of prescribed characteristics. For example, if we have seven indexes which represent seven indicators of propensity to violence, one may choose to describe a person as being prone to violence if he is characterized by at least three of these without regard to which three.
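Threshold logic of this kind reduces to counting, for each item, how many indicator indexes contain it; the sketch below is an illustrative Python rendering, not the analyzer's own recoding instruction.

    # Build a category whose value for each item is the number of indexes, out
    # of a group, in which that item appears, then index the items at or above
    # a threshold.  Illustrative only.
    def count_category(population, indexes):
        return {item: sum(item in idx for idx in indexes) for item in population}

    def threshold_index(counts, k):
        return {item for item, c in counts.items() if c >= k}

    population = set(range(8))
    violence_indicators = [{0, 1, 2}, {1, 3}, {1, 2, 5}, {2, 6}, {1, 2}, {4}, {2}]
    counts = count_category(population, violence_indicators)
    prone = threshold_index(counts, 3)   # prone to violence: at least 3 of the 7 indicators
    print(counts, prone)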
Usually, after some complex analysis the user would like to firm up and consolidate the concepts he has investigated. He may do this by writing an adform which, instead of referencing the raw data as his previous adform did, references the category records which were the output of his previous adform and combines the entries of these categories in ways his analysis has pointed out would best support his analysis purpose. These combinations may be of entries within a category or may draw upon entries from several categories. That is, the user may build a macro-category by referencing the entries in several different questions. It is this operation of building new categories which causes us to use the generic term category, as opposed to the specific term question.
In analyzing survey data the user is only dealing with questions for his first pass through the system. His second adform, however, does not contain questions, but contains categories, where each category may be the combination of several different responses to different questions.
We see from this that the ADMINS system is able to accept as input the output from the ADMINS system, and since it can do this once it can do this indefinitely; that is, one can go around the loop of adforming, analyzing the data processed under the adform, building a higher level adform based on the analysis, etc., etc. If the user only wishes to re-group or re-categorize one or two or three of his concepts, he may do this while he is in the analyzer by using the re-categorize instruction which takes in several indexes and makes them the entries under a macro-category which is constructed and embodied in a category record.
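In miniature, and with hypothetical names, the re-categorize idea looks like the sketch below: a group of named indexes becomes the entries of a new macro-category, with a residual entry for items in none of them (a rule of precedence would be needed for items in more than one, which is glossed over here).

    # Several indexes become the entries of a macro-category recorded per item.
    # Illustrative only.
    def recategorize(population, named_indexes, residual="other"):
        category = {}
        for item in population:
            category[item] = residual
            for entry, index in named_indexes.items():
                if item in index:
                    category[item] = entry
                    break                 # first matching entry wins
        return category

    population = set(range(6))
    macro = recategorize(population, {"establishment": {0, 3}, "outsider": {1, 4}})
    print(macro)   # {0: 'establishment', 1: 'outsider', 2: 'other', ...}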
In ADMINS one has the ability to work several source files in a parallel analysis. This is constrained, of course, by the structure of the codebooks for the data. A very simple example would be where two different groups of individuals such as Venezuelan students and priests have been interviewed and the interviews were similar. One may wish to analyze the student file and the priests' file in parallel, building parallel indexes, giving parallel names to this indexing construction as one goes. Because the same codebook in effect is applicable to both of these files one has the simplest example of the use of a classified directory for working two files.
Where codebooks are concerned with the same type of substantive questions, but the structure is not exactly the same from codebook to codebook, one can by intelligent use of the adform design a kind of template that is adequate across these codebooks. Nevertheless, when one is in the analyzer building indexes to different files, one may be confronted with a question in one codebook which is slightly different in structure to a question in another. One can then build an index to each of these two questions where the construction of each of these indexes is somewhat different, but if conceptually they are considered to be similar they may be named with the same name.
In order to give effect to these operations one
would work a file, build the index and name it,
and then one would go in
and work the other file, building the index and giving that the same name. One could then put this same named index in the column of a table and work the two files in parallel for similar categories in the rows of the table. The point here is that an index, a construction, is a property associated with a file, while a name can be a more global property associated with many files, pointing to different constructions, yet treated similarly when these files are worked in parallel.
Conceivably one would wish to change the name of a construction in
one file.
There is an instruction called 'reclassify'
which allows one to
change the name of a construction in a particular file whereas the instruction 'rename' would rename the constructions in all the files. The 'reclassify'
instruction points up the fact that in naming different constructions in
different files with the same name one is in effect classifying them as
falling under the same concept.
The analyzer understands this, that is,
the analyzer understands that when you use the same name for different constructions then the different constructions have similar conceptual meaning
to you.
The form of printing of tables of index results, etc. is geared
to this fact.
That is,
when one gets a table while working more than one
file one gets for each row of the table a summary line representing the
size of the intersection in each of the files being worked.
In the analyzer one can do all of the recoding transformations that
one would have done in the adform.
However, the structure of the question or even the semantic content of the question, although considered to be conceptually similar from file to file, may actually be very different. Therefore, to do this the work required in keeping track of the different constructions and of the naming in the analyzer is quite considerable. Therefore, if one wants, one can build a kind of template to the codebooks at the outset as part of one's substantive analysis plan which makes it easier to design an organization of the record analysis plan using the many contingent features in the analyzer programs.
It
should be emphasized, however, that even if
the codebooks are
not semantically equivalent, as long as the user has an analysis purpose which relates them conceptually, one can write adforms which are dissimilar
to embody different codebooks.
One can organize and process the data files
separately, but can analyze them in parallel after one has independently
built up indexes to the same conceptual level in each of the files.
An example of multi-source file analysis is the following.
If one had
elite panel interviews administered in Britain during the years 1955, again
in 1956, again in 1959, and again in 1961 and 1965, and likewise for France
and Germany, one would have fifteen files which recorded elite attitudes
over five time periods in three countries.
One could analyze in parallel
with respect to time, one could analyze in parallel with respect to country,
and one could analyze all fifteen files in parallel.
In effect, any long-
itudinal sample can be viewed as amenable to this multi-source analysis
where one is looking for change in some variable over some time period.
Specifically, one analyzes multi-source data with the analyzer by using the 'work' instruction to focus the analyzer on the particular source file
or source files one wishes to work.
When one is working more than one source
file all the relevant analyzer instructions:
index, intersect, union, com-
plement, tables, etc. are applied in parallel to all the files that one is
working.
So, in effect, each analyzer instruction is multi-source in that when the work list as set by the work instruction contains more
than one source file name, these instructions operate in parallel on those
source files named.
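The work-list idea can be suggested by the following sketch, in which a single (hypothetical) index instruction is applied to every file currently being worked, producing a like-named construction in each file's directory; this is an illustration, not the analyzer's code.

    # One instruction, applied in parallel to every source file on the work list.
    # Illustrative only.
    files = {
        "students": [{"religion": "protestant"}, {"religion": "catholic"}],
        "priests":  [{"religion": "catholic"},  {"religion": "catholic"}],
    }
    work_list = ["students", "priests"]
    directories = {f: {} for f in files}       # one classified directory per file

    def index_instruction(name, category, entry):
        for f in work_list:
            idx = {pos for pos, rec in enumerate(files[f]) if rec.get(category) == entry}
            directories[f][name] = idx

    index_instruction("catholics", "religion", "catholic")
    print(directories)   # the same name, a different construction in each file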
All of the instructions that one has given to the analyzer are saved
in an instruction file.
This means that one could, if one wished, read the
instruction file into a different analyzer, and in effect in one fell swoop
give all of the instructions to the analyzer that one had given stepwise
in a previous analysis.
One can amend or edit the instruction file in the
same way as one could amend or edit the adform, as described previously.
One can also read in the instruction file but be working a different file
under these instructions from the file on which they were built.
Thus one
could have built a want-get-satisfaction index to a sample of students in
one analyzer and one could read the instruction file that built this index
into another analyzer while working on the priests.
The combination of
using the classified directory that results from analysis plus the reading
in of instruction files from other analyses allows for great flexibility
in the use of the analyzer.
Let us close the discussion of the analyzer by illustrating its use
in the analysis of multi-level files.
By a multi-level file we mean that
there exists some token in the main file around which a subfile can usefully
be constructed.
(The subfile may often be demographic in character.)
For
example, the main file could be of individuals, where each individual may
have a token corresponding to the region or district in which he lives.
One could construct a subfile for a region, by building an index to a region
token and subfiling categories for the items in this region index.
Another
example of subfiling would be, for example, building a kinship subfile
based on the main file.
That is,
for each individual in the main file, a
token related him to his family group.
We would wish to build a new file
of families taking representative information
off the family members and
perhaps aggregating certain information about the family members, e.g.
number of adult males. As well as constructing subfiles one might have subfile data in some raw state and one could write an adform for it, organize it, and process it, and analyze it with respect to the main file from which it comes, relating tokens in the main file to the subfile and vice versa.
All the power of the analyzer can be brought to bear on these multi-
level files.
For example, one builds indexes in them and asks for tables
relating these indexes and keeps a classified directory.
In extremely
complex cases, these multi-level files might be multi-source as well.
That is, for each source file one can build different subfiles, say based on
region or kinship.
In subfiling via a common bibliographic token, e.g. ID number, or a content token, e.g. region code, common to the files that we are linking, we are in effect using given cross reference relations to bring together the characteristics of what were in the first place perhaps in different files into some common file through these linking tokens of cross reference relations.
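As an illustrative sketch (hypothetical field names, not the ADMINS file format), subfiling via a linking token amounts to grouping the items of the main file by that token and taking aggregate information off the members:

    # Group main-file items by a content token (a region code) and aggregate
    # over the members to form subfile records.  Illustrative only.
    from collections import defaultdict

    main_file = [
        {"id": 1, "region": "R1", "voted": 1, "adult_male": 1},
        {"id": 2, "region": "R1", "voted": 0, "adult_male": 0},
        {"id": 3, "region": "R2", "voted": 1, "adult_male": 1},
    ]

    def build_subfile(records, token):
        groups = defaultdict(list)
        for pos, rec in enumerate(records):
            groups[rec[token]].append(pos)         # pointers back into the main file
        return [{token: value,
                 "members": members,
                 "n_voted": sum(records[p]["voted"] for p in members),
                 "n_adult_males": sum(records[p]["adult_male"] for p in members)}
                for value, members in groups.items()]

    print(build_subfile(main_file, "region"))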
Part 4
ADMINS SYSTEM - THE FUTURE
Admins is not a one shot effort. It is a changing, growing, progressing thing. As long as it exists it will never stop developing. As it develops it goes through many stages and rites du passage. Among the most notable mileposts are: 1) when it first runs, relatively free of
bugs; 2) when it
first is used successfully by ordinary users who are con-
cerned with results, not with the system; 3) when it
becomes sufficiently
complete and reliable so that the primary R & D effort is in improving
user convenience,
i.e., making the user unaware of the requirements of
the system; 4) when it passes from a prototype scale of operations limited
to a few experimental users to being available on a mass production basis.
Admins has already passed milestones one and two. It has not yet reached three or four. It is a working, functioning system used daily for data analysis by all the main survey research projects at MIT because it gives better results than the more conventional alternatives. It is, however, still clearly a prototype system, not yet fully embodying all of the main planned features. It is not yet at the point where mass use is the issue.
There are several major areas of development planned for the immediate
future:
1. Development of the routines for analysis of sociometric data.
2. Providing more built-in statistical and analytic packages.
3. Development of an Organizer sub-system and Processor sub-system which will prepare textual data for analysis.
4. Adapting the system to the incoming computers with their new hardware and software capabilities.
5. Solving some scale problems involved in the exponential growth of files in a library in use, i.e. problems in the management of an Admins archive.
6. Providing the user with the quantum jump in convenience and flexibility that a scope display could provide.
The major area of development for the longer run future will be in
putting Admins to work in large data archives in various places in government
and the universities.
For the moment, however, we confine ourselves to
discussion of the short-run plans.
The development of Admins will be a two-fold effort: the first involves the design and programming of further capabilities. The second concerns learning how one uses the powerful tools Admins offers.
1.
The design and programming of further capabilities.
a.
As procedural research feeds back into the program design
new instructions will be added to Admins and existing
Admins instructions modified to obtain closer fits to
user needs.
b.
A data-manipulation problem-oriented language will be
abstracted from the data manipulation parts of the
Analyzer sub-system.
The POL will consist of an inte-
grated package of sub-programs sitting under a host
procedure language (e.g. MAD).
The POL could be used
by sophisticated users to build information models (e.g.
a special purpose analyzer) or scientific models (e.g. a data-based complex simulation). One of the authors of Admins has already programmed a heuristic tree program
based on a primitive POL abstracted from an earlier version of the Analyzer.
c.
Admins will have to be re-programmed for the third generation computer utilities (e.g., MULTICS, e.g., the IBM 360/67). As Admins is almost completely programmed in MAD and MAD is advertised as being available on the third generation computers (although no decision has yet been reached as to what third generation language to reprogram Admins into) the change-over can be designed as a gradual one. It would be naive, however, to expect Admins--which is currently in MAD--to fit third generation computers with MAD compilers. Although Admins might run on these computers, re-programming is necessary in order that Admins take full advantage of the capabilities of the new computer.
Furthermore, we have learned from experience; the Admins described in 'Admins - A Progress Report' is three versions removed from the original Admins of a year ago.
Re-
programming affords an opportunity to incorporate improvements that were impossible to merely add as patches to
existing programs.
2.
The following list
includes but a few of a host of procedural
problems raised by Admins.
The problem descriptions are necessarily vague, for problems are that which is still not well understood.
a.
In building classifications within a single source file
there are many substantive problems, but the methodological
process is understood because the classification need only
empirically fit
one set of data; however, when multi-
source data is introduced, and one's purpose is to construct a classification which gives an empirical fit
across the source files, the methodological procedures
involved are not yet fully clear.
b. The form of comparison one uses within a single source file
or single level of data is well understood.
Many simple
comparison tests, using percentages and ratios, exist.
What
is unclear as yet is how to apply comparisons across levels
of multi-level data or across source files of multi-source
data.
c.
Subfile construction can be simulated by using the set
theoretic operations in the Analyzer or alternatively
subfiles can be physically constructed using the multilevel instructions of the Analyzer.
We need to explore
the trade-offs between the two with respect to types of
applications.
In effect the systems design and program design--in providing tools
to handle multi-level and multi-source data--have erected a structure within
which the above and many other related issues can be explored and perhaps
resolved.
We now need to feed back methodological experience to our a priori
notions embodied in computer programs.
The Administration of an Admins Based Archive
The macro-organization problems that arise with the large scale use of the Admins system can be best understood if seen from the perspective of the Admins archive administrator.
At any time even in the small prototype
MIT archive Admins may be serving up to a dozen active users, some of whom
are processing input data files into Admins, all of whom are using up to a
half-dozen complex computer programs each of which generate report files
(e.g.,
any use of the organization sub-system requires the adform disk file
and generates four intermediary files; any use of the processor requires
three of these intermediary files plus an input data file and in turn generates two additional report files).
As users may be processing more than
one adform, and analysing category records from different source files, the
organization of the user's personal files in itself presents a problem. This problem is added to--from the administrator's perspective--for he is respon-
sible for data integrity and the problem is multiplied by the number of
active users.
The macro-organization of the data presents a more involved problem
than that of programs.
For example, in current Admins, data integrity and
privacy is maintained by a double linkage scheme similar to that used to
maintain program uniqueness.
Admins programs constantly change--reflecting Admins growth and development--documentation is altered and certain report and data files become,
as a result of these changes, obsolete.
Users must be informed of and
protected from a constantly changing system.
The problems of a changing system are taken into account in the macroorganization of the system.
In current Admins each core-image has a unique
existence in a common file directory.
Admins users link to these core-images
through the administrator's file directory, thereby ensuring unique existence of the core-images. Documentation exists in disk files in the same common files and is referenced by the user interrogating a core-image which reads
the appropriate documentation from a disk file and presents it to the user
(e.g., the 'Syntax'
instruction of the Analyzer).
Programs and documentations
change simultaneously and users are informed of these changes by interrogating
their core-images.
All Admins commands keep records of their usage, permitting the administrator to get a general usage picture.
He is provided with tools for a
more exacting scrutinization of usage that he feels is worth following up.
Much remains to be done in putting these records at the service of
the system administrator.
We intend to develop a process catalog which will
serve the following functions.
1.
The process catalog will contain records of all current Admins
usage as well as retrospective records of past Admins usage.
These records will be integrated with the computer utility
backup and retrieval systems to allow both current and retrospective retrieval of report and data files.
2.
We have already begun design and implementation (using 1000
survey codebooks as test data) of a subject and data format
catalog of hard copy prototype plus concomitant tape stored
data.
This collection catalog will be related to the process
catalog in the following manner.
Users will interrogate a file
of catalog 'cards' containing catalog information lifted from hard copy prototypes (e.g., codebooks) and--when our
loosely structured capability becomes operational--derived
from 150 word subject descriptions.
This interrogation will
be done using the Analyzer sub-system. When the user has 'analyzed' the subject catalog and decided on a data file(s)
for substantive analysis, part of the collection catalog will
become the initial entries--for that user--in the process
catalog.
As the user proceeds the process catalog will record the location of report and data files, the various states
of data that the user produces, etc.
3.
Adforms, data states and classified directories meeting prescribed standards may become public data in the sense that
other users may subsequently use them and build upon them.
For, in effect, if a data file of interest to several users
exists, one user may have designed the adform and processed
the data and built a classified directory.
A subsequent user,
it may be the same one or a different one, may choose to
proceed by building private analyses on this public classified
directory.
The function of the process catalog is to physically
co-ordinate the administrative decisions required for such
activity to proceed.
In addition to developing further the macro-organization of the system,
we need to conduct a good deal of research on how the system is used by real
users--procedural research.
Most of the needed procedural research is for
the purpose of better fitting Admins to its users.
The system provides
on-line documentation of the syntax of commands and on-line explanations
of what the commands do; the system also provides remark files for user
comments.
These are intended to encourage users to feed back on the
basis of their experience into the system design.
The system was from the
beginning designed with an eye on a set of specifications of substantive
user requirements.
However, empirical procedural research, observing what
users actually do, is the only way the Admins system can be massaged for
finer and finer fits to user requirements. Feedback must be encouraged from users.
The burden at present rests on the system designer and while
this is necessary it is not sufficient.
The way we have tried to encourage users to take an interest in
procedural research is quite simple.
Users have bad problems with their
data analysis.
We encourage them to bring us examples that have previously
bothered them.
From such discussion we get ideas of data management functions which are required that we have not yet contemplated.
Usually,
we can tell the user how to use the system to achieve what they have in
mind.
We have also independently undertaken data analysis as a procedural
research venture and for exercising as yet unexercised parts of programs.
When the user is struggling we are able to help out and at the same time get
some insights as to how to better fit the system to user requirements.
Even
so, it is difficult to get users to feed back, and there seem to be two main
reasons.
Just as the Admins system as a whole evolves over time, so too for
any one user its use evolves over time.
The system must be simple enough
to use so that the beginning user need invest no more than fifteen to twenty
hours in it before he begins to get better and faster results than he was
getting before by alternative approaches.
Of course, he will at that point
have no idea of many of the available instructions and many of the powers
of the system.
These he learns in the process of use, finding more and
more possibilities as he gains experience.
It
is like learning a language.
The procedural research problem can also be understood from the frame of
reference of how Admins was designed.
As we are concerned with re-struc-
turing and re-organizing item records containing categories of entry data
and relating item records to one another in a file, we have
abstracted from many records processing applications many basic procedures
which we relate together in our Admins information system.
A complex system
will have several command sub-systems and many clerical operation instructions
within each sub-system.
We can provide aids as to appropriate syntax, ex-
planation of the function of instructions and informative error messages,
but none of this tells a user how to put all the bits and pieces together
for his particular use, either at the planning level or at the operational
level.
Procedural research means that the system designers must simulate
users and work at making the system easier to use. They must also work at persuading users to take it in easy stages in order to master a complex information system; however, as the information system is merely a tool for
a substantive user who understands substantive information very well, impatience with 'going back to school' to learn clerical operations is
quite
understandable.
Third Generation Computer Configuration Considerations
As the computer utility is very much with us today let us try to
place Admins in relation to it.
The third generation computer utility will
consist of a large central computer configuration, integrating a few processors,
core memories and disk and drum storage devices.
The hierarchy of storage
devices will include a small associative memory, core memory, drum, disk,
and tape.
Hundreds of remote devices--typewriters, display scopes, small
computers, fast printers--will be linked to the main configuration.
Res-
ponsibility for control of input, user queuing, and user storage references
will lie with the supervisory and I/O co-ordination elements of the utility.
The utility will offer various computer languages--compilers, assemblers, special procedure languages (e.g., symbol manipulation).
As well the user
will be provided with a host of commands for creating and maintaining his
disk files.
Storage will be organised to accomodate sharing of files (con-
taining either programs or data) between users.
Sophisticated compilers
will allow programming of 'pure procedures' making programs shareable even during actual execution. All storage will be organized so as to make data location 'invisible' to the uninterested users.
The third generation utility
computer system will place previously unheard of but properly channeled
computational power at a user's fingertips.
Project MAC at MIT has been among the principals involved in the
development of the third generation computer utility.
Admins is currently
operational on the MAC CTSS which is a 7090 based pilot of the MULTICS system currently under development at MAC--which is based on a third generation GE645 computer.
CTSS as with all pilots has been constantly changing
and improving over its four year life span; as is rare, however, with pilots,
CTSS is
sufficiently reliable that hundreds of users in the MIT community
and all over the U.S.
depend on it
for their computer needs.
Admins was
designed to be an interactive system sitting on top of CTSS, using the
CTSS file system and programming languages as its base.
CTSS was designed as a base for application systems as is evident in
the Admins command structure, the reliance of Admins on CTSS file maintenance procedures and the macro-organization of Admins.
As CTSS has grown and developed to become MULTICS, Admins will grow and develop in order to rest on top of MULTICS. Some of the program design of Admins has learned from MULTICS and offers a preview of MULTICS capabilities. For example, MULTICS will use a segment-paging structure at all levels of memory. This affords many advantages, one of which is relocatability across all storage levels. Also the MULTICS file directory structure will offer complex macro list structuring capabilities where the atoms are the users' files.
The Admins classified directory is a paged memory embodying a list
structure
controlling Admins' basic data atoms--category records.
If the third generation computer utility is as described--a sophisticated interactive programming system--this leaves open the area of applications systems.
Admins will be an information utility for social data--
a system for data preparation and analysis--sitting on top of a programming
utility--a system for program preparation and running.
An information utility
comparable in scope and power to Admins is only possible if
it
can rest on a
programming utility comparable in scope and depth to CTSS (and eventually MULTICS).
IBM has said it will offer a 360 model 67 as a time-shared utility computer. To date all released specifications show it very similar to MULTICS. As the MIT Computation Center will obtain the IBM 360/67 its influence will also be reflected in Admins.
Admins to date does not use a display scope as part of its I/O
configuration.
We intend to obtain a scope and the necessary small satellite computer (e.g., PDP-7) to run the scope. We plan to integrate this visual capability into Admins for sight decision making which, in dealing with text such as codebooks or tables, is far more rapid and natural than line by line printout. It is like giving the analyst a pencil and pad instead of a typewriter.
A COMMUNITY INFORMATION SYSTEM
by
David Griffel and Stuart McIntosh
The key issue is to decide on what records to start with in a computer based data management system.
The inputs must be computer usable; the item records may contain data on individuals or on aggregates; and bibliographic cross reference tokens must exist between files if new files
are to be constructed by extracting data from different source files.
ADMINS as it stands now can comfortably handle many different types
of file.
The problem of rationalizing categories of income, categories
of age, categories of region, etc. across various input files which is
essential for any type of useful comparison can also be handled by
ADMINS.
The ability to re-structure and re-organize files is basically
what one has to do in a community information system. If the system is
to be people based it is evident that the people are going to spend most
of their time re-categorizing categories in order to get them in step from
file to file, and then use a background computer system for processing.
If the system is to be computer based, it does not seem feasible that it
be background only.
For example, look ahead information must be available
in order to standardize on category merging and on making the subfile
linkages which at the outset is a feedback process.
We can say this with
our present ADMINS experience behind us and our ADMINS-development experience in front of us.
A perusal of the enclosed ADMINS-development
paper will indicate that we will have a capability of designing the community records application and of testing it
out on representative selections
of the data base.
The final production system will require its own resource in terms
of processing capability and storage capability.
Users can be at a distance,
e.g. in their institution whilst the data operation may be an installation
elsewhere.
People based systems normally can not deal with file integration
of such complexity in any way that is economic to support.
It
is possible
that a well designed computer based data management system will do the job.