Chapter 18 © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. BUSINESS ANALYTICS: DATA ANALYSIS AND DECISION MAKING Importing Data into Excel Introduction Any statistical analysis presumes that you have the appropriate data in a format suitable for analysis. The data might exist: In an Excel® file—which might still need to be rearranged to get it in the form of a rectangular data set. In a text file (or ASCII file)—which is any file that can be opened and read in a text editor such as Notepad; it can be imported into Excel using Excel’s text import wizard. In a relational database (such as Access, SQL Server, Oracle)—which can be imported into Excel by forming a query using the Microsoft Query package. A query specifies exactly which data you want to import. On the Web—which can be imported into Excel by creating a query and then running it in Excel. Once the data is imported, it may need to be cleansed to fix wrong values. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Rearranging Excel Data Even if the data already exists in Excel, it may need to be rearranged in the form of a data set—a rectangular array of data with observations in rows, variables in columns, and variable names in the top row. Sometimes simple cutting and pasting works. In other cases, advanced Excel functions are required. In all cases, it is best to map out a plan and then decide how to implement it. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Example 18.1: Baseball Salaries Original.xlsx Objective: To rearrange the data from the baseball Web queries into a single data set. Solution: Data on baseball salaries was imported into Excel from a Web site, with a separate Web query for each of the 30 teams. The results for a typical team are shown to the right, with only a few players listed. To rearrange all of the data into four long columns with the headings Player, Team, Salary, and Position, follow these steps: 1. Insert a blank column before column B, and enter the label Team in cell B2. 2. Cut the Arizona Diamondbacks team name from cell A1 and paste it next to the first Arizona player in cell B3. Then copy it down for the other Arizona players. 3. Repeat step 2 for each of the other teams. 4. Delete unnecessary rows of labels for the other teams. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Example 18.2: CPI.xlsx (slide 1 of 2) Objective: To rearrange the monthly data into two long columns, one with month-year and one with the CPI. Solution: Monthly data on the Consumer Price Index (CPI) was imported from the Web using a Web query. A few rows appear below. The desired results after rearranging are shown to the right. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Example 18.2: CPI.xlsx (slide 2 of 2) To rearrange the data into two long columns, follow these steps: Create the range name Data (for all the CPI values, not the headings in row 1 or column A). Add a new worksheet for the rearranged data, create the column headings in row 1, enter 1 in cells A2 and B2, and enter 1913 in cell C2. To generate the recurring pattern of 1 to 12 in column B, enter the formula =IF(B2<12,B2+1,1) in cell B3 and copy this down as far as necessary. To generate the pattern in column A, enter the formula =IF(B3=1,A2+1,A2) in cell A3 and copy it down. To generate the years in column C, enter the formula =IF(B3=1,C2+1,C2) in cell C3 and copy it down. To generate the month-year values in column D, enter the formula =DATE(C2,B2,1) in cell D2 and copy it down. To generate the CPI values in column E, enter the formula =INDEX(Data,A2,B2) in cell E2 and copy it down. Copy columns D and E and paste them over themselves as values. Then delete columns A-C. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Importing Text Data Most software packages store documents in a proprietary binary format that is readable only by that package. However, most software packages allow you to save a file as a text file, usually with a .txt extension. A text file can be fixed width or delimited. With fixed width, each variable’s value starts and stops at fixed positions (columns) in the line. With delimited data, there is a delimiter character, usually a tab, comma, semicolon, or space, that separates the values in a line. Each line of data has the same length, and the columns line up. The lines are typically of different lengths and do not line up nicely. Excel can import either fixed-width or delimited formats with its text import wizard. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Example 18.3: srn_tmp.txt (slide 1 of 2) Objective: To import the fixed-width text file data into Excel by using Excel’s text import wizard. Solution: The text file srn_tmp.txt was downloaded from the Web. It contains state, regional, and national (srn) annual data (1895-2005) on temperature. A small portion of the data is shown below. The Web site from which the text file was downloaded also has a data dictionary, srn_data.txt, that indicates what the variables are and how they are stored in columns. Part of this data dictionary is shown to the right. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Example 18.3: srn_tmp.txt (slide 2 of 2) Open the srn_tmp.txt file within Excel, and the first step of the text import wizard appears. Select Fixed width. The second step of the wizard allows you to separate (or parse) the columns as listed in the data dictionary. Click on the third, fourth, and fifth positions on the ruler. The last step of the wizard allows you to fine-tune the import, column by column, but bypass this step and simply click Finish. The data is imported into Excel, as shown below. Create column headings in row 1, using the data dictionary as a guide. Use “Save As” to save the .txt file as an .xlsx (or .xls) file. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Example 18.4: Mobile Subscriptions.txt (slide 1 of 2) Objective: To see how delimited text data can be imported into Excel with the import text wizard. Solution: Annual data by country on the number of mobile subscribers during 2002-2009 was downloaded from the Web into the file Mobile Subscriptions.txt. A portion of the file is shown below. This time there are column headings in row 1, but the ragged lines indicate that this file must be delimited, not fixed width. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Example 18.4: Mobile Subscriptions.txt Open the Mobile Subscriptions.txt file in Excel, and check the Delimited option in step 1 of the wizard. In step 2, Excel guesses that the file is tab-delimited, which is correct, so click on Finish to accept this. Some of the imported data is shown below. (slide 2 of 2) The Flags column contains no data, so it can be deleted. Column A can also be deleted because it includes a constant value, Mobile Subscribers. The numbers in column D can be reformatted to Numeric with 0 decimals. Use “Save As” to save the file in .xlsx (.xls) format. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Comments About Importing Text Data If a text file is comma-delimited and is saved as a .csv file, you can open it directly into Excel, without the import text wizard. Excel automatically parses the values between the commas into columns. Sometimes you will find a table of data on the Web where there is no option to save it in some type of format, text or otherwise. You can try copying and pasting the data into Excel, but it is possible that everything will be pasted into a single column. If this happens, highlight the data in this column and click on the Text to Columns button on Excel’s Data ribbon. The purpose of this button is to parse delimited data in a single column into several columns. Whenever you parse text data into Excel columns, there is always the chance that the data will not line up properly—that is, the data will get into the wrong columns. Look closely at the parsed data before proceeding. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Importing Relational Database Data Database packages such as Access, SQL Server, Oracle, and many others are extremely complex and powerful packages. For database creation, querying, manipulation, and reporting, they have many advantages over spreadsheets. However, they are not nearly as powerful as spreadsheets for statistical analysis. It is often necessary to import data from a database package into Excel, where the statistical analysis can be performed. Microsoft includes software called Microsoft Query in its Office suite that makes the importing relatively easy. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Introduction to Relational Databases The Excel “databases” are often called flat files or, more simply, tables. They are also called single-table databases, where table is the database term for a rectangular range of data, with rows corresponding to records and columns corresponding to fields. Flat files are fine for relatively simple database applications, but they are not powerful enough for more complex applications. A relational database is a set of related tables, where each table is a rectangular arrangement of fields and records, and the tables are linked explicitly. The linked fields are called keys. A primary key must contain unique values, whereas a foreign key can contain duplicate values. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Using Microsoft Query There are two ways to import data into Excel in Excel 2007 or later. The first method uses the From Access button in the Get External Data group on the Data ribbon. It is limited to importing whole tables or saved queries. The second method employs the Microsoft Query software, which allows you to import all or part of the data from many database packages into Excel. 1. 2. 3. It comes with the Office package, but may need to be installed. Once Microsoft Query is installed, importing data from Access (or any other supported database package) is essentially a three-step process: Define the source, so that Excel knows what type of database the data is in and where the data is located. Use Microsoft Query to define a query. Return the data to Excel. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Example 18.5: Shirt Orders.mdb (slide 1 of 3) Objective: To illustrate how Microsoft Query can be used to import the results of queries on the Shirt Orders database into Excel. Solution: Fine Shirt Company has created an Access database file that has information on its sales to its customers during the period of 2005 through 2009. There are three related tables in this database, Customers, Products, and Orders, with a link between the CustomerID fields in the Customers and Orders tables and a link between the ProductID fields in the Products and Orders tables, as shown in the diagram below. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Example 18.5: Shirt Orders.mdb (slide 2 of 3) The entire import procedure is done within Excel and Microsoft Query, not Access. First, define a data source. In the Choose Data Source dialog box, make sure the Shirt Orders item is selected and the bottom checkbox is unchecked, and click OK. This brings up the Add Tables dialog box and begins the second step, where you define the query. Open a blank spreadsheet in Excel and select From Microsoft Query from the From Other Sources dropdown menu on the Data ribbon. Select the top <New Data Source> item in the Choose Data Source dialog box and then click OK. Fill in the Create New Data Source dialog box, and then the ODBC Microsoft Access Setup dialog box to indicate which database file you want to use. Click OK to return to the Choose Data Source dialog box. Specify which tables are relevant for the query, which fields you want to return to Excel, and which records meet the criteria you spell out. The final step is to get these data back into Excel. Select the Return Data to Microsoft Excel menu item from the File menu. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Example 18.5: Shirt Orders.mdb The results of one query (Find all of the records in the Orders table that correspond to orders for at least 80 units made by the customer Shirts R Us for the product Longsleeve Tunic, and return the dates and units ordered for these orders) are shown to the right. Once the results of the query data are returned to Excel, you can then begin the statistical analysis of the data. (slide 3 of 3) The data are still linked to the query. This means that you can refresh the data in Excel if the Access data change. You can also get back to Microsoft Query so that you can edit your query. If your ultimate goal is to create a pivot table based on the database data, you can do this directly, as shown in the next example. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Example 18.5 (Continued): Shirt Orders.mdb (slide 1 of 2) Objective: To illustrate how Microsoft Query can be used to import data directly into a pivot table. Solution: Fine Shirt Company would like to break down revenue from its various customers and products by using pivot tables. To base a pivot table on external data, go through Microsoft Query, not through the usual PivotTable button on the Insert menu. Get into Microsoft Query and define a query. When you select the Return Data to Microsoft Excel menu item from the File menu, you see a dialog box where you can specify the type of report you want and where you want it. Select PivotTable Report (or PivotChart and PivotTable Report). From here, you can create any pivot tables in the usual way. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Example 18.5 (Continued): Shirt Orders.mdb (slide 2 of 2) One possible pivot table, showing total revenue broken down by product, customer (using the Report Filter area at the top), and quarter of year, is shown below. You have the option of obtaining corresponding pivot charts automatically. The pivot table is linked to the query. This means that you can go back to Microsoft Query, edit the query, and return to Excel to update the pivot table. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. SQL Statements Queries represent a large part of the power behind relational databases. To standardize queries across packages, SQL (structured query language) was developed several decades ago. It is often called the “language of databases.” Behind each query developed in Microsoft Query is an SQL statement. The statements can be viewed by clicking the SQL button in the Query toolbar once you have created a query. The statements include keywords such as SELECT, FROM, WHERE, and AND, as shown in the example below. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Web Queries (slide 1 of 2) Web sites that contain data are structured in all sorts of ways, and the steps required to import the data into Excel for analysis vary greatly. Many Web sites provide buttons that allow you to download the data directly into Excel. On some Web sites, the only way to get the data into Excel is to cut and paste. Web query is an Excel tool that lies between these two extremes. Web queries search for HTML <Table> tags, find the corresponding data, and bring them into Excel in the usual row and column format. Many data sets on the Web import beautifully with Web queries, but some return virtually nothing. Sometimes you need to run several Web queries on the same basic site to get all of the data you want. In a URL, the part to the right of the question mark (if any) is called the query string, and it specifies exactly which data you want. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Web Queries (slide 2 of 2) To get data into Excel using a Web query: Make sure you have an active connection to the Web, and open a new workbook in Excel. Click the From Web button on the Data ribbon. Fill in the dialog box, the most important part of which is the URL (the address of the page) at the top. Once you enter the URL, click Go. You will see the Web page with yellow arrows next to all of the tables. Click any of these yellow arrows to change them to green checkmarks. The selected tables will then be imported into Excel. After you click Import, specify where to place the results. A link to the Web page remains, so you can refresh to obtain the latest data. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Cleansing Data You cannot count on real-world data sets to be perfect. This is especially the case when you obtain data from external sources such as the Web. It is your responsibility to correct any problems before you do any serious analysis. Cleansing data requires careful detective work to uncover all possible errors that might be present. Once an error is found, it is not always clear how to correct it (for example, missing data). Some subjectivity and common sense must be used when cleansing data sets. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Example 18.6: Data Cleansing.xlsx (slide 1 of 2) Objective: To find and fix errors in this company’s data set. Solution: The data file has data on 1500 customers of a particular company. A portion of these data appears below. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part. Example 18.6: Data Cleansing.xlsx The data set has a number of problems, all of which you might encounter in real data sets: Duplicate SSNs Duplicate birthdates (as a result of the code 9999 being used for missing data and then formatted as a date) Negative ages (as a result of the Y2K problem) Too many categories (as a result of a space being entered before a value in a text field) Quotes around numbers (causing them to be interpreted as text, not numbers) Suspiciously small incomes (as a result of incomes being entered without trailing zeroes) Suspicious outliers (as a result of the average spent being entered rather than the total amount spent) Use Excel tools to search for the suspicious data values. (slide 2 of 2) Sort on columns and look for unusual or duplicate values. Create a scatterplot of the data and look for suspicious outliers. Then use other Excel tools, such as Find and Replace, to fix the errors. © 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.