Government of the Russian Federation
Federal State Autonomous Educational Institution of Higher Professional Education
National Research University "Higher School of Economics"
Faculty of Business Informatics
Discipline program
"Advanced methods of data analysis and big data in business intelligence" for the field of study 38.04.05 "Business Informatics", Master's training
Program’s author:
Nikolay V. Markov, nikolay.markoff@gmail.com
Approved at the meeting of the Department of information and business in the sphere of information technologies «____»____________ 2014
Head of Department, Svetlana V. Maltseva _____________________
Recommended by the EMS section of «Business Informatics» «____»____________ 2014
Chairman, Y. V. Taratukhina ____________________
Moscow, 2014
This program cannot be used by other parts of the university or by other institutions of higher education without the permission of the department that developed it.
1. Scope and normative references
This program of an academic discipline establishes minimum requirements for knowledge and skills of the student and determines the content and types of studies and reports.
The program is intended for teachers of this discipline, teaching assistants and students of the field of study 38.04.05 "Business Informatics" (Master's training) enrolled in the master's program "Big Data Systems".
The program is developed in accordance with:
the working curriculum of the University for the field of study 080500.68 "Business Informatics", Master's training, master's program "Big Data Systems", approved in 2014.
2. Goals for studying
To form theoretical knowledge and practical skills in the collection, storage, processing and analysis of big data.
To develop practical skills in analyzing big data for a wide range of applications, including analysis of corporate data, analysis of financial data from data warehouses of world markets, modeling of data storage and processing, and prediction of complex indicators.
3. Student competences, generated as a result of studying
As a result of studying the discipline, a student should:
Understand the theory and fundamentals of storage, processing and analysis of big data, and advanced tools for collection, storage, transmission and visualization of big data.
Be able to process and analyze large amounts of data using modern IBM InfoSphere software packages.
Have the skills to use neural networks and fuzzy models for compression, processing and analysis of big data, and to assess their effectiveness.
As a result of studying the discipline, the student acquires the following competences:
Competence | GEF/NRU code | Descriptors - the main features of the development (indicators of achievement results) | Forms and methods of teaching, contributing to the formation and development of competence
Ability to offer concepts, models, invent and test methods and tools of professional activity | СК-2 | Demonstrates | Lectures, workshops, homework
The ability to apply the methods of system analysis and modeling to evaluate and design | ПК-13 | Owns and uses | Lectures, workshops, homework
Ability to develop and apply mathematical models to justify the design decisions in the field of ICT | ПК-14 | Owns and uses | Lectures, workshops, homework
Ability to organize self and collective research work at the enterprise and manage it | ПК-16 | Owns and uses | Lectures, workshops, homework
4. Place of the discipline in the structure of the educational program
As part of the master's program «Big Data Systems» this discipline is a compulsory subject.
To master the discipline, students should:
Know the content of the following disciplines: numerical methods, optimization methods, data analysis, discrete mathematics, theoretical foundations of computer science, computer systems, networks, telecommunications, information systems for managing a production company.
Be able to use mathematical and IT tools for management tasks.
The main provisions of the discipline should be used in the further study of the discipline "Elaboration and implementation of big data".
5. Topical plan of an academic discipline
№ Topic name
1 Introduction to the analysis and management of big data
2 Data Management
3 Model of distributed file systems and databases computing
4 Search for similarities in the data
5 Analysis of streaming data
6 Link analysis
7 Frequent itemsets analysis
8 Clustering algorithms and their applications
9 Neural networks and their applications
10 Advertising on the Web
11 Decision support system
12 Analysis of social network graphs
13 Reducing the dimension of data
14 Large scale machine learning
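Total: 180 hours
Hour columns: Classroom hours (Lectures, Seminars, Workshops) and Homework
Values preceding the columns: 2, 2, 2, 2, 7, 7
Hours by column, column total first:
38: 4, 2, 2, 2, 4, 4, 4, 2, 2, 4, 2, 2
38: 4, 2, 2, 2, 4, 4, 4, 2, 2, 4, 2, 2
106: 8, 8, 8, 8, 7, 8, 8, 8, 7, 8, 7, 7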
6. Forms of students' knowledge control
Type of control | Form of control | 1st year (1, 2) | Parameters
Current (week) | Thesis | 1 | Volume 25-20 pp., result evaluation: 2 weeks
Final (week) | Exam | 1 | Oral exam, 20 min per student
6.1 Criteria for assessing knowledge and skills
The student should demonstrate knowledge of the sections of the discipline and the ability to present the results of homework and tests in accordance with the required competences.
Grades for all forms of control are given on a 10-point scale.
The final grade for the discipline consists of the grades for:
work in practical classes - O1
control work - O2
answer at the exam - O3
according to the formula: O = 0.2 * O1 + 0.4 * O2 + 0.4 * O3
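As an illustration of how the three components combine into the final grade, the sketch below assumes that the weights 0.2, 0.4 and 0.4 apply to O1, O2 and O3 respectively. The function name and the sample grades are hypothetical.

```python
def final_grade(o1: float, o2: float, o3: float) -> float:
    """Weighted final grade on the 10-point scale.

    Assumption: the weights 0.2, 0.4, 0.4 belong to O1, O2, O3 in that order.
    """
    for grade in (o1, o2, o3):
        if not 0 <= grade <= 10:
            raise ValueError("grades must lie on the 10-point scale")
    return 0.2 * o1 + 0.4 * o2 + 0.4 * o3


# Hypothetical student: 8 for practical classes, 7 for the control work, 9 at the exam.
print(round(final_grade(8, 7, 9), 1))
```

Because the weights sum to 1, the result stays on the same 10-point scale as the components.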
7. Program content
Topic 1. Introduction to the analysis and management of big data
What is big data? Characteristics of big data. Big data as one of the global challenges of our time.
Data analysis, basic principles and methods. Statistical modeling and simulation based on machine learning. Bonferroni principle. Hash functions and indexes. Base of natural logarithms.
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Ullman J. D. Mining of Massive Datasets. Stanford University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
Topic 2. Data Management
Foundations of data management. Stages of working with data: collection, compression, storage, processing, analysis. Principles of storage and data management. Data compression methods.
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Ullman J. D. Mining of Massive Datasets. Stanford University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
Topic 3. Model of distributed file systems and databases computing
Distributed file systems. Physical organization of computing nodes. The MapReduce approach: Map tasks and Reduce tasks. Algorithms using MapReduce and their applications: matrix-vector multiplication, relational-algebra operations on databases, grouping and aggregation.
Extensions to MapReduce. Workflow systems. Communication cost models. Complexity theory for MapReduce: reducer size and graph models.
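The matrix-vector multiplication mentioned in this topic can be sketched as a single MapReduce round simulated in plain Python. The function names, the sparse-triple representation of the matrix and the toy data are assumptions made for illustration; a real job would run on a framework such as Hadoop.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# A sparse matrix entry (row, column, value); the vector is assumed to fit in memory.
Entry = Tuple[int, int, float]


def map_phase(entries: Iterable[Entry], v: List[float]) -> List[Tuple[int, float]]:
    """Map task: emit the pair (row, m_ij * v_j) for every matrix entry."""
    return [(i, m_ij * v[j]) for i, j, m_ij in entries]


def shuffle(pairs: Iterable[Tuple[int, float]]) -> Dict[int, List[float]]:
    """Group intermediate values by key, as the framework would do."""
    groups: Dict[int, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[int, List[float]]) -> Dict[int, float]:
    """Reduce task: sum all products that share a row index."""
    return {row: sum(values) for row, values in groups.items()}


def matrix_vector(entries: Iterable[Entry], v: List[float]) -> Dict[int, float]:
    return reduce_phase(shuffle(map_phase(entries, v)))


# Toy data: M = [[1, 2], [0, 3]], v = [10, 1]
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
print(matrix_vector(M, [10.0, 1.0]))
```

The key is the row index, so each Reduce task produces one component of the result vector.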
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Ullman J. D. Mining of Massive Datasets. Stanford University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
Topic 4. Search for similarities in the data
Applications of near-neighbor search. Jaccard similarity of sets. Similarity of documents. Collaborative filtering.
Shingling of documents. k-shingles: choosing the shingle size, hashing shingles, shingles built from words.
Locality-sensitive hashing (LSH). Distance measures: Euclidean distance, Jaccard distance, cosine distance. Locality-sensitive functions. LSH families and their applications.
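A minimal sketch of comparing two documents by the Jaccard similarity of their k-shingles (the k-character substrings of a document). The function names, the default k = 5 and the convention that two empty sets have similarity 1 are assumptions made for illustration.

```python
from typing import Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Return the set of k-character shingles of a whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < k:
        return {normalized} if normalized else set()
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """|A intersect B| / |A union B|; two empty sets are treated as identical (assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    return 1.0 - jaccard_similarity(a, b)


doc1 = "Big data analysis in business intelligence"
doc2 = "Big data analytics in business intelligence"
print(jaccard_similarity(shingles(doc1), shingles(doc2)))
```

Computing this for every pair of documents does not scale, which is the motivation for locality-sensitive hashing.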
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Ullman J. D. Mining of Massive Datasets. Stanford University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
Topic 5. Analysis of streaming data
The stream data model. Data-stream management systems. Sampling of stream data.
Filtering streams. Flajolet-Martin algorithm. Alon-Matias-Szegedy algorithm. Datar-Gionis-Indyk-Motwani (DGIM) algorithm.
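The idea behind the Flajolet-Martin algorithm named in this topic can be sketched as follows: hash every stream element, track the longest run of trailing zero bits R, and estimate the number of distinct elements as 2^R. The helper names, the 32-bit MD5-based hash and the single-hash estimate are simplifying assumptions; practical versions combine many hash functions.

```python
import hashlib
from typing import Iterable


def trailing_zeros(n: int, width: int = 32) -> int:
    """Number of trailing zero bits of n; an all-zero value counts as `width` zeros."""
    if n == 0:
        return width
    return (n & -n).bit_length() - 1


def hash32(item: str, seed: int = 0) -> int:
    """Deterministic 32-bit hash built from MD5 (an illustrative choice)."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def fm_estimate(stream: Iterable[str], seed: int = 0) -> int:
    """Estimate the number of distinct elements as 2**R, R being the longest tail of zeros."""
    max_tail = 0
    for item in stream:
        max_tail = max(max_tail, trailing_zeros(hash32(item, seed)))
    return 2 ** max_tail


stream = ["user%d" % (i % 100) for i in range(10_000)]  # 100 distinct values
print(fm_estimate(stream))
```

Only one integer of state is kept regardless of the stream length, and repeated elements never change the estimate.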
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Ullman J. D. Mining of Massive Datasets. Stanford University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
Topic 6. Link analysis
PageRank algorithm. Early search engines, structure of the Web. Transition matrix. PageRank iterations using MapReduce.
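A minimal sketch of the PageRank iteration over the transition matrix, written so that the inner loops mirror a map step (each page emits shares of its rank) and a reduce step (shares are summed per target). The damping parameter beta = 0.85, the function name and the four-page graph are assumptions made for illustration.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], beta: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power iteration v' = beta * M v + (1 - beta) / n.

    Assumptions: every page has at least one out-link (no dead ends)
    and every link target is itself a key of `links`.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # "Reduce" side: start from the teleport share, then add contributions.
        new_rank = {p: (1.0 - beta) / n for p in pages}
        for page, targets in links.items():
            # "Map" side: a page splits its rank equally among its out-links.
            share = beta * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web graph.
graph = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
for page, score in pagerank(graph).items():
    print(page, round(score, 3))
```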
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Ullman J. D. Mining of Massive Datasets. Stanford University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
Topic 7. Frequent itemsets analysis
Definition and applications of frequent itemsets. The market-basket model. Association rules. A-Priori algorithm. Monotonicity of itemsets. Handling large datasets in main memory. Park-Chen-Yu (PCY) algorithm.
Multistage algorithm. Multihash algorithm. Limited-pass algorithms. Savasere-Omiecinski-Navathe (SON) algorithm. Toivonen's algorithm. Counting frequent itemsets: sampling, hybrid methods.
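The two-pass structure of the A-Priori algorithm for frequent pairs can be sketched as below: the first pass counts single items, and the second pass counts only pairs whose items are both frequent. The function name, the support threshold and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple


def apriori_pairs(baskets: List[Set[str]], support: int) -> Dict[Tuple[str, str], int]:
    """Return the pairs of items that occur together in at least `support` baskets."""
    # Pass 1: count single items and keep the frequent ones.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {item for item, count in item_counts.items() if count >= support}

    # Pass 2: a pair can be frequent only if both of its items are frequent,
    # so only such candidate pairs are counted.
    pair_counts: Counter = Counter()
    for basket in baskets:
        candidates = sorted(basket & frequent_items)
        pair_counts.update(combinations(candidates, 2))
    return {pair: count for pair, count in pair_counts.items() if count >= support}


baskets = [
    {"milk", "bread"},
    {"milk", "bread", "beer"},
    {"milk", "beer"},
    {"bread", "butter"},
]
print(apriori_pairs(baskets, support=2))
```

Pruning infrequent items before the second pass is what keeps the pair counts small enough for main memory.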
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Ullman J. D. Mining of Massive Datasets. Stanford University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
Topic 8. Clustering algorithms and their applications
Introduction to clustering algorithms: points, spaces and distances. Clustering strategies. Hierarchical clustering in Euclidean and non-Euclidean spaces, its efficiency. K-means algorithm. Bradley-Fayyad-Reina (BFR) algorithm. CURE algorithm. Cluster tree. GRGPF algorithm. Application of clustering algorithms in streaming and parallel computing.
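A minimal sketch of the k-means algorithm in Euclidean space. Using the first k points as initial centroids and a fixed number of iterations are simplifying assumptions; the function names and sample points are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def squared_distance(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points: List[Point], k: int, iterations: int = 10):
    """Return (centroids, clusters) after a fixed number of refinement rounds."""
    # Simplification: the first k points serve as the initial centroids.
    centroids = [tuple(float(x) for x in p) for p in points[:k]]
    clusters: List[List[Point]] = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = k_means(points, k=2)
print(centroids)
```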
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Ullman J. D. Mining of Massive Datasets. Stanford University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
Topic 9. Neural networks and their applications
Definition, structure and typology of neural networks. Kohonen maps. Backpropagation neural networks. Application of neural networks in economics, logistics and IT.
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Ullman J. D. Mining of Massive Datasets. Stanford University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
5. Heaton J. Introduction to the Math of Neural Networks. Heaton Research, 2010
Topic 10. Advertising on the Web
Online and offline algorithms. Greedy algorithms. Competitive ratio. The matching problem. The Balance algorithm and the generalized Balance algorithm.
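A minimal sketch of the Balance algorithm for online ad assignment: each arriving query goes to the bidding advertiser with the largest remaining budget. The unit bid price, the alphabetical tie-breaking, the function name and the two-advertiser example are simplifying assumptions.

```python
from typing import Dict, List, Optional, Set


def balance(queries: List[str], bids: Dict[str, Set[str]],
            budgets: Dict[str, int]) -> List[Optional[str]]:
    """Assign queries online; None marks a query that no advertiser can pay for.

    Simplifications: every bid costs one budget unit, and ties go to the
    alphabetically first advertiser.
    """
    remaining = dict(budgets)
    assignment: List[Optional[str]] = []
    for query in queries:
        eligible = [a for a in sorted(bids) if query in bids[a] and remaining[a] > 0]
        if not eligible:
            assignment.append(None)
            continue
        # max() keeps the first of several equal maxima, i.e. the alphabetical tie-break.
        winner = max(eligible, key=lambda a: remaining[a])
        remaining[winner] -= 1
        assignment.append(winner)
    return assignment


# Advertiser A bids only on query "x", B bids on "x" and "y"; both have budget 4.
result = balance(list("xxxxyyyy"), {"A": {"x"}, "B": {"x", "y"}}, {"A": 4, "B": 4})
print(result)
```

In this example an offline optimum would serve all 8 queries (A takes every "x", B every "y"), while Balance serves 6, which illustrates the notion of a competitive ratio.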
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Ullman J. D. Mining of Massive Datasets. Stanford University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
Topic 11. Decision support system
Model of a decision support system. Utility matrix. Making decisions based on the contents of the data. Identification of the properties and parameters of the data. Collaborative filtering. Measuring similarity. Dimensionality reduction. UV-decomposition. Standard deviation.
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Ullman J. D. Mining of Massive Datasets. Stanford University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
Topic 12. Analysis of social network graphs
What is a social network? Social networks as graphs. Types of social networks. Clustering of social network graphs, distance in graphs. Girvan-Newman algorithm. Bipartite graphs and subgraphs. Maximum likelihood algorithm.
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Ullman J. D. Mining of Massive Datasets. Stanford University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
8. Literature
Basic literature
1. Minelli M., Chambers M., Dhiraj A. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons, 2012
2. Ye N. The Handbook of Data Mining. Lawrence Erlbaum Associates, 2003
3. Leskovec J., Rajaraman A., Ullman J. D. Mining of Massive Datasets. Stanford University, 2010
Additional literature
4. Eaton C., Deutsch T., Deroos D., Lapis G., Zikopoulos P. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
5. Heaton J. Introduction to the Math of Neural Networks. Heaton Research, 2010
9. Knowledge control questions
1. Big data. The problem of big data today.
2. Data management. Methods of data collection and data preparation. Principles of storage and data management.
3. Models of distributed file systems. Google File System and Hadoop.
4. MapReduce. The paradigm and the essence of its structure.
5. Search for similarities. Jaccard similarity. Shingling. Locality-sensitive hashing (LSH).
6. Stream data model. Flajolet-Martin algorithm. Alon-Matias-Szegedy algorithm. Datar-Gionis-Indyk-Motwani (DGIM) algorithm.
7. Link analysis. PageRank.
8. Definition and applications of frequent itemsets. The market-basket model. Association rules. A-Priori algorithm.
9. Definition and applications of frequent itemsets. Monotonicity of itemsets. Handling large datasets in main memory. Park-Chen-Yu (PCY) algorithm. Multistage algorithm.
10. Definition and applications of frequent itemsets. Multistage algorithm. Multihash algorithm. Limited-pass algorithms. Savasere-Omiecinski-Navathe (SON) algorithm. Toivonen's algorithm. Counting frequent itemsets: sampling, hybrid methods.
11. Clustering algorithms. Clustering strategies. Hierarchical clustering in Euclidean and non-Euclidean spaces, its efficiency. K-means algorithm.
12. Bradley-Fayyad-Reina (BFR) algorithm. CURE algorithm. Cluster tree. GRGPF algorithm. Application of clustering algorithms in streaming and parallel computing.
13. Definition, structure and typology of neural networks. Kohonen maps. Backpropagation neural networks. Application of neural networks in economics, logistics and IT.
14. Online and offline algorithms. Greedy algorithms. Competitive ratio. The matching problem. The Balance algorithm and the generalized Balance algorithm.
15. The model of a decision support system. Utility matrix. Collaborative filtering. Measuring similarity. UV-decomposition. Standard deviation.
16. What is a social network? Social networks as graphs. Types of social networks. Clustering of social network graphs, distance in graphs. Girvan-Newman algorithm. Maximum likelihood algorithm.
Developers:
NRU-HSE ________ _______ professor ________ _____ Nikolay V. Markov
(workplace) (position) (initials, surname)