Summary lecture notes on Data, Files and Records.

Data
Data is raw facts and figures (expressed using numeric, alphabetic and special character symbols) about business activities, transactions, events and happenings. For example, data can be the recorded hours worked by an employee, invoices, receipts, orders, etc. Data itself is meaningless unless it is manipulated and transformed (processed) into meaningful forms (information).

Data processing stages
Data processing stages refers to the process of collecting data at the point of activity up to the time the data is transformed into information. Raw data is assumed to pass through five stages:
i) Data origination
ii) Data preparation (data verification and transcription)
iii) Data input
iv) Data processing (classifying, sorting, simplifying, calculations)
v) Data output

Sources of data (origination)
Data can originate from the following sources:
i) Business transactions - invoices, receipts, customer orders, etc.
ii) Census (counting of people and the distribution of wealth)
iii) Observation (observing some happening and recording the details)
iv) Experiments (observations and recordings about the progress of an experiment)
v) Market survey (sampling to test the reaction to a product or service being introduced)

Verification
Data verification is the process of checking for errors in input data before it is carried forward for processing. Verification is carried out at the data preparation stage, normally by more than one person or machine. The same data is compared and checked character by character. If there is any difference, the machine halts (shows an error message) and allows the operator to make corrections.

Types of errors detected by verification:
i) Transcription errors - wrongly copying characters (the most common, e.g. confusing the letter O with zero, or I with L)
ii) Transposition errors - wrongly switching the order of characters
iii) Omission errors
iv) Double transposition errors

There are six (6) steps required to process a transaction:
i) Data entry/capture
ii) Validation
iii) Data processing
iv) Storage
v) Output generation
vi) Query support

Data entry/capture
This is the collection of raw data from the outside world (manually or by using data capture devices) into an information system. Examples of data entry include:
- Entering hours worked from workers' timesheets in order to know how many hours each person worked in that month
- Conducting a survey of customers' opinions and entering the data (in the form of a questionnaire)
- Using a form on a website to collect visitors' opinions
- Entering student records into an information system

Note the difference between capture and input. Input simply means loading the acquired data into an information system, such as:
- Data keyed in from documents by keyboard operators in either off-line or on-line mode
- Directly typing the hours from the timecards/timesheets into a spreadsheet
- Reading source documents with automatic reading devices - MICR, OCR, OMR, bar code readers, etc.

Other typical input devices include: keyboards, mice, flatbed scanners, bar code readers, joysticks, digital data tablets (for graphic drawing) and electronic cash registers.

Validation
Validation is ensuring that inputted data is of the right type and within reasonable limits. Databases and spreadsheets can have validation rules built into data fields to reject impossible entries.
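As an illustration, here is a minimal sketch (in Python, using a hypothetical "hours worked" field and an assumed monthly limit) of a validation rule that rejects impossible entries before the data is accepted for processing:

    # Minimal sketch of a field validation rule (hypothetical field and limits).
    def validate_hours_worked(value):
        """Reject entries that are missing, non-numeric or outside a sensible range."""
        if value is None or str(value).strip() == "":
            return "Rejected: field is missing"
        if not str(value).strip().isdigit():
            return "Rejected: hours must be numeric"
        if not 0 <= int(value) <= 200:          # assumed permissible monthly range
            return "Rejected: hours outside the permissible range"
        return "Accepted"

    print(validate_hours_worked("160"))   # Accepted
    print(validate_hours_worked("999"))   # Rejected: impossible entry
    print(validate_hours_worked("abc"))   # Rejected: wrong type of data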
Validation can include the following checks (see C.S. French, p. 148). A field is any item of data within a record; it is made up of a number of characters, e.g. a name, a date, an amount, gender (sex), etc.

Types of validation checks
- Size checks - fields are checked to ensure that they contain the right number of characters, e.g. a customer number specified as 6 numeric characters means the program will validate for exactly 6 numeric characters.
- Presence check - data items are checked to ensure that all required fields are present and have been entered.
- Range checks (also called limit checks) - numbers and codes are checked to ensure that they are within the permissible range, e.g. an organization may decide to give a discount in the range of 20% to 30%; no discount less than 20% or more than 30% will be accepted.
- Character check - fields are checked to ensure that they contain only characters of the correct type, e.g. no letters in numeric fields.
- Format check - fields are checked to ensure that the correct format is followed, with letters and numbers in the correct order, e.g. a date could be in either British or American format.
- Reasonableness check - product quantities are checked to ensure that they are not abnormally high or low (the amount of goods ordered is reasonable).
- Check digit verification (e.g. to validate a credit card number) - a means of using an arithmetical relationship between the last digit of a number and its other digits. The check digit guards against transposition errors. When the number is input, the validation program performs the same calculation on the number as was performed when the check digit was generated in the first place, and rejects the number if the result does not agree with the check digit. A sketch of one such scheme is given after this list.
- Consistency check - numeric fields contain numbers only, and certain characters may be disallowed, e.g. ?, *, $ in a name field.
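As an illustration of check digit verification, here is a minimal sketch (in Python) using the Luhn algorithm, the check digit scheme widely used for credit card numbers; the notes above do not prescribe a particular scheme, so this is offered only as an example of the general idea:

    # Sketch of check digit verification using the Luhn algorithm.
    def luhn_valid(number: str) -> bool:
        digits = [int(d) for d in number if d.isdigit()]
        total = 0
        for i, d in enumerate(reversed(digits)):
            if i % 2 == 1:            # double every second digit from the right
                d *= 2
                if d > 9:
                    d -= 9            # equivalent to adding the two digits together
            total += d
        return total % 10 == 0        # valid numbers sum to a multiple of 10

    print(luhn_valid("4539148803436467"))   # True  - check digit agrees
    print(luhn_valid("4539148803436476"))   # False - last two digits transposed

The same calculation is performed when the number is first issued (to choose the check digit) and again whenever the number is input, so most copying and transposition errors are detected at data entry.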
Processing (manipulation) stage
At this stage data is converted into information.
i) Information can be described as processed raw facts and figures which have been transformed into meaningful forms.
ii) Information is processed data. It is meaningful and allows an organization to make decisions and solve problems, for example survey data being converted into graphs, or wages being calculated from hours worked.
Typical processing software includes word processors, spreadsheets, databases, payroll systems, etc.

Storage
After data has been processed, it is vital that it is stored. Typical storage devices (magnetic and optical) include:
- Hard disk (fast access, large capacity)
- Floppy disk (slow access, low capacity)
- Flash disk
- Tapes (e.g. QIC, quarter-inch cartridge tape)
- Optical discs
With storage, the main issues are speed, reliability and storage capacity.

Output
Output is the final generation of information to the outside world. Output generation is facilitated by output devices, which include:
- CRT (cathode ray tube) monitors and LCD displays
- Printers
- Sound cards/speakers
- Plotters

Query
At this stage information is queried for further verification or for conformity (agreement with certain accepted standards or norms).

General qualities of good information
For information to be useful, it must possess several characteristic attributes. These attributes add value or increase the potential of information:
i) Accurate
ii) Meaningful
iii) Reliable
iv) Easy to use
v) User targeted
vi) Relevant
vii) Timely
viii) Complete
ix) Error free
x) Verifiable

a. Accurate - for information to be useful, it must be exact and precise. This helps decision makers to make accurate decisions.
b. Meaningful - good information must be meaningful and easy to understand, not misleading.
c. Reliable - good information must accurately represent the events or activities of an organization. In addition, it must be easy to follow and retrievable.
d. Easy to use (understandable) - good information must be user friendly. Information must not be complicated, because people may not really understand it and it will then be of little use.
e. User targeted - information must be specific, brief and targeted at its users.
f. Timely - good information should be provided in good time and received at the right time.
g. Relevant - good information should make a difference to the decision maker by reducing uncertainty or by adding knowledge or value.
h. Complete - it must include all relevant data or aspects of the data that are required.
i. Error free - data sets must contain a minimum number of errors.
j. Verifiable - information generated from the same process must be the same, or within reasonable variance.

Task: the value of information; the role of information in business.

Data organization
Data organization refers to the way we store and arrange data in order to make storage, manipulation and retrieval of information easy. Information is organized in units according to the data hierarchy shown below:

File -> Records -> Fields -> Characters
(A file is made up of records, each record is made up of fields, and each field is made up of characters.)

A file consists of a number of records. A file holds data that is required for providing information. Some files are processed at regular intervals to provide information (e.g. a payroll file) and others hold data that is required at regular intervals (e.g. a file with price items).

There are two common ways of viewing files:
a) Logical files. A logical file is a file viewed in terms of what data items its records contain and what processing operations may be performed upon the file (i.e. entities and attributes).
i) Entity: entities are real-world things (objects, people, events, etc.) about which there is a need to record data, e.g. an item of stock, an employee, a transaction. Entities can be tangible or intangible.
ii) Attribute: attributes are the individual properties/characteristics that describe and identify an entity, e.g. the attributes of an invoice (name, address, customer number, quantity, price, description).
b) Physical files. A physical file is a file viewed in terms of how the data is stored on a storage device and how the processing operations are made possible (i.e. physical records, fields and characters).

A character is the smallest element in a file; it can be alphabetic, numeric or special.

A field is any item of data within a record. It is made up of a number of characters, e.g. a name, a date, an amount, gender (sex), etc.

A record is a collection of data pertaining to one entity. A record is made up of a number of related fields, e.g. a customer's record or an employee's payroll record. For example, a bank's customer file may contain a record for each customer holding the account number, branch number, name, address, phone number and current balance.

A record can be recognized or identified by the record key field. A key field is a data element or field used to identify each record on a file. It is unique and should not be duplicated. Examples include a bank account number, an employee ID number, a student ID number, an invoice number, etc.
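As a small illustration of this hierarchy, here is a sketch (in Python, with hypothetical account numbers and field values) of a customer file made up of records, each record made up of fields, and each record identified by a unique key field:

    # Sketch of a customer "file" as a collection of records keyed by a key field.
    # Each record is a set of fields; each field value is a string of characters
    # (or a number).
    customer_file = {
        "1001234567": {                      # key field: account number (unique)
            "branch_number": "012",
            "name": "J. Banda",
            "address": "Plot 5, Lusaka",
            "current_balance": 2500.00,
        },
        "1001234568": {
            "branch_number": "014",
            "name": "M. Phiri",
            "address": "Plot 9, Kitwe",
            "current_balance": 780.50,
        },
    }

    # The key field lets us retrieve one record directly.
    record = customer_file["1001234567"]
    print(record["name"], record["current_balance"])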
Types of files
i) Master file
ii) Transaction file (movement file)
iii) Reference file
Other file types: archive file, back-up file, program file, data file, work file, scratch file.

i) Master files always contain data. They contain up-to-date information on a set of similar entities, and they are fairly permanent in nature, e.g. an employee file, a customer ledger or a payroll file. They are updated regularly to reflect what is happening in an organization.

ii) Transaction (movement) files contain data that records events. Records in a transaction file are placed in time order, and they are processed to update the related master file records. An incoming-deliveries file is a transaction file and might be used to update the company's master stock file. Similarly, a sales ledger master file will be updated from all the orders received.

iii) Reference files are files with a reasonable amount of permanency. They hold data used for reference purposes, for example price lists, tables of rates of pay, VAT rates, names and addresses of customers/business partners, etc. (Temporary or work files, by contrast, are deleted when processing is complete.)

Grandfather-father-son method of updating
The method used to update a master file depends on its organization. In order to update a sequential master file, the transactions must first be sorted into the same sequence as the master file. A record from each file is then read and the record keys compared. If the master record has no matching transaction, it is copied across unchanged to the new master file and another master record is read into memory, until a match is found. The matching master record is then updated in memory and another transaction record is read. If it is for the same master record, the record in memory is updated again, and so on until a transaction for a different record is read. The updated master record is then written to the new file, another master record is read into memory and compared with the current transaction, and the process continues.

This method of updating is known as the grandfather-father-son method, with a new file being created each time an update is carried out. Normally at least three generations of the master file are kept for backup purposes. If the latest version of the master file is corrupted or destroyed, it can be recreated by re-running the previous update, using the old master file and the matching transactions.

Day 1: the transaction file and the old master file on Tape A (1st generation) are used in an update run that produces a new master file on Tape B (2nd generation). After the update on day 1, Tape A is the 'father' tape and Tape B is the 'son' tape.

Day 2: the old master file on Tape B (2nd generation) is updated to produce a new master file on Tape C (3rd generation). After the update on day 2, Tape A becomes the 'grandfather' tape, Tape B the 'father' tape and Tape C the 'son' tape.
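To make the sequential update concrete, here is a minimal sketch (in Python, with hypothetical record layouts and values) of producing a new 'son' master file from the old 'father' master file and a sorted transaction file; the old generation is left untouched so it can be used to re-run the update if needed:

    # Sketch of a sequential (grandfather-father-son) master file update.
    # Both files must already be sorted on the same key.
    old_master = [                          # the 'father' generation
        {"key": 100, "balance": 50},
        {"key": 200, "balance": 75},
        {"key": 300, "balance": 20},
    ]
    transactions = [                        # sorted into master file sequence
        {"key": 200, "amount": 25},
        {"key": 200, "amount": 10},
        {"key": 300, "amount": 5},
    ]

    new_master = []                         # the 'son' generation
    t = 0
    for master_record in old_master:
        record = dict(master_record)        # update a copy in memory
        # Apply every transaction whose key matches this master record.
        while t < len(transactions) and transactions[t]["key"] == record["key"]:
            record["balance"] += transactions[t]["amount"]
            t += 1
        # Records with no matching transaction are copied across unchanged.
        new_master.append(record)

    print(new_master)
    # [{'key': 100, 'balance': 50}, {'key': 200, 'balance': 110}, {'key': 300, 'balance': 25}]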
File access
File access is the process of getting to a file by searching, locating and retrieving a particular record from it. File access depends on the following:
a) The storage type (DASD or SASD)
b) The organization method (i.e. file organization)

1. Direct Access Storage Devices (DASD)
This type of device allows any particular item of data on the disk to be read directly, without having to read all the rest of the recorded data. These storage devices are based on spinning disks upon which data is recorded. Within each category of direct access device there are a great number of variations in design, performance, capacity and cost. These devices include:
- Magnetic disk drives
- Optical storage

A) Magnetic disk drives: this is the most common form of disk-based storage; all PCs rely on it. Two of the basic types of magnetic disk are:
i) Hard disk drives (also called Winchester drives)
ii) Floppy disk drives - limited storage capacity (floppy disk 1.44 MB, Zip disk 100-250 MB); no longer in common use.

B) Optical drives: optical drives use storage techniques based on light instead of the principles of magnetism. They use reflected light to read data, based on the compact disc. Tiny pits are burned or pressed into the thin coating of metal or other material deposited on a disc, and the pit patterns represent the stream of digital data used to encode images and sounds. Optical discs can store audio, video, text and program instructions. They include CDs and DVDs:
- CD-ROM (Compact Disc - Read-Only Memory)
- CD-R (Compact Disc - Recordable; data cannot be overwritten once it is recorded)
- CD-RW (Compact Disc - Rewritable; data can be overwritten time and again)
CDs have a maximum of 700 MB of data storage, or 80 minutes of audio play.
- DVD (Digital Versatile Disc): DVD discs store much more data than CDs because both sides can be used, along with sophisticated data compression technologies. Standard DVD discs store from 4.7 GB to 9.4 GB.

However, with the advent of multimedia files containing video, graphics and sound, newer optical discs have been developed, e.g. Blu-ray. Blu-ray discs can have a maximum storage capacity of 50 GB (TDK) or 25 GB (Sony, Philips, Fujifilm, etc.).

Blu-ray Disc (also known as Blu-ray or BD) is an optical disc storage media format. Its main uses are high-definition video and data storage. The disc has the same dimensions as a standard DVD or CD. The name Blu-ray Disc is derived from the blue (violet-coloured) laser used to read and write this type of disc. Because of its shorter wavelength (405 nm), substantially more data can be stored on a Blu-ray Disc than on the DVD format, which uses a red (650 nm) laser. A dual-layer Blu-ray Disc can store 50 GB, almost six times the capacity of a dual-layer DVD. Blu-ray Disc was developed by the Blu-ray Disc Association, a group of companies representing consumer electronics, computer hardware and motion picture production. As of July 2, 2008, more than 650 Blu-ray Disc films had been commercially released in the United States and more than 410 Blu-ray Disc titles had been released in Japan. During the high-definition optical disc format war, Blu-ray Disc competed with the HD DVD format. On February 19, 2008, Toshiba, the main company supporting HD DVD, announced that it would no longer develop, manufacture and market HD DVD players and recorders, leading almost all other HD DVD supporters to follow suit and effectively ending the format war.

2. Serial Access Storage Devices (SASD)
Serial access devices are those in which a particular item of data can only be read after reading all the intervening items of data. The only important serial access storage device in current use is magnetic tape. There are three basic types of magnetic tape device; however, due to the huge demand for storage capacity, they are no longer used in most organizations unless necessary:
- Reel-to-reel tape devices
- Cartridge tape devices (e.g. an 8 mm tape 112 m long can store up to 5 GB of data)
- Digital audio tape (DAT) devices
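As a rough illustration of the difference between direct and serial access (a sketch only: an ordinary disk file of fixed-length records stands in for the storage device, and the record size and file name are assumptions):

    # Sketch: direct vs. serial access to fixed-length records.
    RECORD_SIZE = 32                                   # bytes per record (assumed)

    # Write a small "file" of 1000 fixed-length records.
    with open("records.dat", "wb") as f:
        for i in range(1000):
            f.write(f"record-{i}".ljust(RECORD_SIZE).encode())

    # Direct access (DASD style): jump straight to record 700.
    with open("records.dat", "rb") as f:
        f.seek(700 * RECORD_SIZE)                      # address computed directly
        print(f.read(RECORD_SIZE).decode().strip())

    # Serial access (SASD style): read every intervening record first.
    with open("records.dat", "rb") as f:
        for _ in range(700):
            f.read(RECORD_SIZE)                        # must pass over records in turn
        print(f.read(RECORD_SIZE).decode().strip())

Both approaches print the same record; the difference is in how much data must be read to reach it.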
File organization
File organization is the physical placement or arrangement of records within a file. The way in which files are stored has a direct bearing on how quickly the data they contain can be accessed. Records on a file can be ordered or unordered. File organization depends upon:
- The way in which the file is going to be used
- The number of records to be processed each time the file is updated
- Whether individual records need to be accessed quickly or not
- The type of storage device chosen

Ways in which files can be organized

1. Serial file organization
There is no sequence or order to the records stored in a serial file. They are stored in the order in which they are received, and new records are added at the end of the file. In order to access a record in a serial file, the whole file has to be read from the beginning until the desired record is located.

2. Sequential file organization
Sequential files are organized so that the records are stored according to the order of the values of a chosen record attribute (field). The records can be in ascending or descending order based upon the attribute value, e.g. records in your bank's customer file may be stored in ascending customer account number order, or arranged in alphabetical order. The field used for sequencing the records is usually the primary key of the record (such as the customer account number) or a combination of other fields.

3. Indexed sequential file organization
As with sequential files, indexed sequential file records are stored in a sequence that reflects the value in the key field of each record. The file also contains an index with pointers to certain data records in the file, and the index helps in locating a record in the data file. Indexed sequential files are used where individual records need to be accessed very quickly without having to start searching from the beginning. They reside on direct access storage devices.

4. Direct (random) file organization
The records are not stored in any particular sequence. Instead, a mathematical relationship is established between a record's key value and the address of its physical location on the storage medium. Direct files also rely on direct access storage devices. This is also known as random file organization.

Random file organization allows extremely fast access to individual records, but the file cannot be processed sequentially. It is suitable for on-line enquiry systems where a fast response is required.

To access a record in a random file, its address is calculated from the record key using a hashing algorithm, and the record at that address is read. If it is not the required record, the next record is read and examined, and so on until either the record is found or a blank record is encountered.

To add a record to a random file, its address is calculated using the hashing algorithm and the relevant block is read into memory. If the block is empty, the record is written to the file; otherwise, the next block is read and examined until an empty space is found.

To delete a record from a random file, the record must not be physically deleted, because this would make any records that caused collisions inaccessible. Therefore, the record is flagged as deleted by setting an extra field (e.g. a Boolean field) in the record to indicate that it has been deleted. A sketch of this addressing scheme is given below.
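Here is a minimal sketch (in Python, with an assumed number of blocks and a simple modulo hash) of how a random file calculates a record's address from its key, probes forward when two keys collide, and flags deletions instead of physically removing records:

    # Sketch of direct (random) file organization using hashed addresses.
    TABLE_SIZE = 7                                    # number of blocks (assumed)
    blocks = [None] * TABLE_SIZE                      # None means a blank block

    def address(key):
        return key % TABLE_SIZE                       # simple hashing algorithm

    def add(key, data):
        a = address(key)
        while blocks[a] is not None and not blocks[a]["deleted"]:
            a = (a + 1) % TABLE_SIZE                  # collision: examine the next block
        blocks[a] = {"key": key, "data": data, "deleted": False}

    def find(key):
        a = address(key)
        while blocks[a] is not None:                  # stop at a blank block
            if blocks[a]["key"] == key and not blocks[a]["deleted"]:
                return blocks[a]
            a = (a + 1) % TABLE_SIZE
        return None

    def delete(key):
        record = find(key)
        if record:
            record["deleted"] = True                  # flag only, never remove physically

    add(10, "stock item A")                           # 10 % 7 = 3
    add(17, "stock item B")                           # 17 % 7 = 3 as well: a collision
    delete(10)
    print(find(17))                                   # still reachable after the deletion

Because record 17 originally collided with record 10, physically removing record 10 would have broken the search chain at a blank block; flagging it as deleted keeps record 17 reachable.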
Factors which may influence file organization
These are the various factors or circumstances which may have a bearing on the file organization method adopted. They include:
- The storage medium (DASD or SASD)
- The processing method (e.g. real-time processing calls for direct file organization)
- The volume of data
- The hit rate
- The type of file (transaction files are usually serially organized, while master files are sequentially or directly organized)

Hit rate
The hit rate measures the proportion of records accessed during a particular run. It is calculated by dividing the number of records accessed by the total number of records on the file, and multiplying by 100 to express the result as a percentage. For example, if 270 employee records out of 300 on the file are accessed on a particular payroll run, the hit rate is 270/300 x 100 = 90%.

In general, if the hit rate is high (say over 70%) whenever the file is processed, it is efficient to use sequential file organization, whereas if the hit rate is low it is preferable to use a randomly organized file, where records can be updated in any sequence. If the hit rate is sometimes high and sometimes low, an indexed sequential file organization is appropriate.

Hit rate = (number of transaction file records / number of master file records) x 100

©2013 Prepared by Sinkala Henry. University of Lusaka