30 - NYU Stern School of Business

Dealing with Data!
Norman White
2011
1011101 0011001 00000001 ….
A proposal for a Stern course that prepares students for dealing with real data
in the real world.
Background: Most college courses spend their time on the concepts and
techniques of analyzing data, but virtually no time on how to handle the data
and get it into a form to be analyzed. This course is focused on how one
deals with data, from the initial acquisition to its final analysis.
Topics include data acquisition, data cleaning and formatting, common data
formats, data representation and storage, data transformations, database
management systems, “big data” or NoSQL solutions for storing and
analyzing data, common analysis tools including Excel, SAS and MATLAB, data
mining and data visualization.
This course should be valuable background for students in information
systems, operations, finance, marketing and accounting, as well as non-Stern
students in computer science, economics, sociology, and any of the
sciences.
Lecture Outline
Week
Topic
1)
Course overview, Introduction to data, formats,
representation. Binary, character, floating point.
2)
Files, records and fields, sequential processing,
sorting and merging data, random access.
Homework. Simple sort-merge reporting problem.
3)
Handling unstructured data. Converting text data
to common formats like CSV, tab-delimited, fixed
format, XML. Inputting data into Excel. Common
problems.
Homework. Load a text file into Excel and analyze it.
4)
Common preprocessing tools: Unix tools such as sed, grep,
cut, awk, perl, python, etc. Concept of pipeline
processing.
Homework. Use Unix tools to convert an unstructured
text file to a CSV file suitable for loading into
Excel.
5)
Relational databases. Overview of features and
functions. E/R diagrams.
Homework. E/R diagram of a business case.
6)
Query languages, SQL, including joins and
aggregation features.
7)
SQL continued.
Homework. Use SQL on a multi-table database to
answer questions.
8)
Business analytics: Excel, SAS, MATLAB, Stata, R.
9)
Mid-term
Homework. Final Project outline
10)
“Big Data”. How do we handle terabytes and
petabytes of unstructured data? Discussion of
the Google File System, MapReduce and Hadoop.
Problems of handling web, social network data
and other high-frequency data.
11)
“Big Data” analytics. How do we scale database
systems, data mining and other analytical
techniques to handle massive databases? Pig,
Mahout, Pegasus, Cassandra, Hive, HBase.
(Discuss the PageRank problem.)
Homework. Run a MapReduce job to develop a
count of word trigrams in a large textual data set,
or run Pegasus to analyze a large social network.
12)
Data visualization. A picture is worth a thousand
words. Show how large amounts of data can be
displayed using graphical techniques. Give
examples of some standard techniques. Treemap,
Tufte,
13)
Final Project Presentations
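
Illustration (not part of the original proposal): the Week 11 homework asks for a trigram count produced by a MapReduce job. The sketch below shows the same map and reduce steps on a single machine in plain Python, with no Hadoop involved. The function names map_trigrams, reduce_counts and trigram_counts are hypothetical, and splitting words on whitespace is an assumption. A real job would distribute the map step across many input splits.

```python
from collections import Counter
from itertools import chain


def map_trigrams(line):
    """Map step: emit a (trigram, 1) pair for each run of three consecutive words."""
    words = line.lower().split()  # assumption: words are separated by whitespace
    return [(tuple(words[i:i + 3]), 1) for i in range(len(words) - 2)]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each trigram key."""
    totals = Counter()
    for trigram, count in pairs:
        totals[trigram] += count
    return totals


def trigram_counts(lines):
    """Run the map step over every line, then reduce all emitted pairs."""
    return reduce_counts(chain.from_iterable(map_trigrams(line) for line in lines))


if __name__ == "__main__":
    sample = ["the quick brown fox", "the quick brown dog"]
    for trigram, n in trigram_counts(sample).most_common():
        print(" ".join(trigram), n)
```

Because each line is mapped independently, trigrams that span a line break are not counted in this sketch.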