Data Science Foundations: Big Data, ML, and Tools

Data Science Foundations

Big Data: the dynamic, large, and disparate volumes of data being created by people, tools, and machines. It requires new, innovative, and scalable technology to collect, host, and analytically process the vast amount of data gathered in order to derive real-time business insights that relate to consumers, risk, profit, performance, productivity management, and enhanced shareholder value.

Machine Learning: a subset of AI that uses computer algorithms to analyze data and make intelligent decisions based on what it has learned, without being explicitly programmed. Machine learning algorithms are trained with large sets of data and learn from examples; they do not follow rules-based algorithms. Machine learning is what enables machines to solve problems on their own and make accurate predictions using the provided data.

Deep Learning: a subset of machine learning that uses layered neural networks to simulate human decision-making. Deep learning algorithms can label and categorize information and identify patterns. It is what enables AI systems to continuously learn on the job and improve the quality and accuracy of results by determining whether decisions were correct.

Data Science vs Data Scientist

• A "data scientist" is an individual who finds solutions to problems by analyzing data of various sizes using the appropriate and available tools and then tells a story to their major stakeholders.
• "Data science" is the art of taking Big Data and telling a story with it in order to reveal insights that assist organizations in making strategic decisions.

The "V"s of Big Data

• Velocity: the speed at which data accumulates (it never stops)
• Volume: the scale of the data (the increase in the amount stored)
• Variety: the diversity of the data (structured = rows/columns, unstructured = videos/tweets), which comes from different sources; an estimated 80% is unstructured
• Veracity: the quality and origin of the data, and its conformity to facts and accuracy
• Value: the ability and need to turn data into value

Data Science Writing Reports:

1. Cover page
2. Table of contents
3. Executive summary
4. Introductory section
5. Methodology section
6. Results section
7. Discussion section
8. Conclusion section
9. References
10. Acknowledgment

Terminology: "Data Sets" - collection of data

• Data structure
  o Tabular data (table)
  o Hierarchical (tree)
  o Network (graph, like connections on social media)
  o Raw data (like images, etc.)
• Private vs open data
  o Open data: datacatalogs.org, Kaggle, Dataset Search on Google
  o Community Data License Agreement (CDLA)
    - CDLA-Sharing: can use, but must share under the same licensing terms
    - CDLA-Permissive: can use, no obligations
• Data Asset eXchange (DAX)

Methodology for Data Science

From Problem to Approach:

1. What is the problem you are trying to solve?
2. How can you use data to answer the question? (What are you REALLY asking? Do you want to decrease costs, increase productivity, etc.?)

• Business understanding:
  o Ask the RIGHT question
  o Get "buy-in" from stakeholders
• Analytic approach (how can you use the data to answer the question?):
  o Probabilities = predictive
  o Show relationships = descriptive
  o Yes/no = classification

Working with the Data:

3. What data is needed to answer the question? Remember that costs vs benefits can be a factor here.
4. Where is the data sourced from and how will you get it?
5. Is the data representative of the problem to be solved?
6. What additional work is needed to manipulate and use the data?

• Data Understanding: "which ingredients are required to make the perfect recipe"
  o Who, what, when, where, why, how
• Data Understanding - collection:
• Data Understanding - descriptive statistics (increase data quality):
  o Does any data appear to be missing? If so, does this mean something important?
• Data preparation: "features" = characteristics that help solve the problem; finding them and pulling them out can help

Driving the Answer:

7. How can we visualize this data to get and show the answer?
8. Does the model REALLY answer the question or does it need to be adjusted? Do we need different data, etc.?
9. Can you put the model into practice?
10. Can you get constructive feedback using this model?

Tools for Data Science

1. Fully integrated (no programming needed)
   • Open source: KNIME, Orange
   • Commercial: Watson Studio with Watson OpenScale (through cloud), H2O.ai
   • Cloud: Azure Machine Learning
2. Execution environments
   • Open source: Apache Spark (most used), Apache Flink (stream processing: processes real-time data streams)
3. Data asset management / data lineage (data needs to be annotated with metadata)
   • Open source: Apache Atlas, Egeria, Kylo
   • Commercial: Informatica, IBM InfoSphere Information Governance
   • Cloud: Amazon Web Services DB, Cloudant, Db2
4. Data management
   • Open source: MySQL, PostgreSQL, MongoDB, CouchDB, Apache Cassandra, Hadoop FS, Ceph, Elasticsearch (text data)
   • Commercial: Oracle, SQL Server, IBM Db2
5. Data integration/transformation (ETL or ELT): data refining/cleaning
   • Open source: Apache Airflow, Apache Kafka, Kubeflow, Spark SQL, Node-RED
   • Commercial: Informatica, IBM InfoSphere DataStage, Talend, Watson Studio Data Refinery
   • Cloud: Informatica, Data Refinery
     o Data Refinery makes it easy to cleanse data
6. Data visualization
   • Open source: Hue, Kibana, Apache Superset
   • Commercial: Tableau, Microsoft Power BI, IBM Cognos Analytics, Watson Studio Desktop
   • Cloud: Datameer, Cognos, Watson Studio
7. Model building
   • Commercial: SPSS Modeler and Modeler flows, SAS Enterprise Miner
   • Cloud: Watson Machine Learning, AI Platform Training (Google)
8. Model deployment
   • Open source: PredictionIO, Seldon, MLeap, TensorFlow / TensorFlow Lite
   • Commercial: SPSS
   • Cloud: SPSS Modeler, Watson Machine Learning
9. Model monitoring/assessment
   • Open source: ModelDB, Prometheus, AI Fairness 360, Adversarial Robustness 360 (helps prevent attacks and makes the model more robust), AI Explainability 360 (explains the model)
   • Cloud: Amazon SageMaker Model Monitor, Watson OpenScale
10. Code asset management / version management
   • Open source: GitHub, GitLab, Bitbucket
11. Development environment
   • Open source:
     o Jupyter (Python), JupyterLab (next generation)
       - Two-process model: a client (the interface a person uses to send code; the client is the browser) and a kernel (executes the code and sends the result back for display)
       - Jupyter notebooks = your code, metadata, contents, and outputs
     o Apache Zeppelin, RStudio, Spyder
   • Commercial: Watson Studio

Libraries: collections of functions and methods that let you do things without writing the code yourself

Python

• SCIENTIFIC COMPUTING LIBRARIES
  o Pandas (data structures and tools)
  o NumPy (pandas is built on this; for arrays and matrices)
• VISUALIZATION LIBRARIES
  o Matplotlib (plots and graphs)
  o Seaborn (based on Matplotlib: heat maps, time series, violin plots)
• HIGH-LEVEL MACHINE LEARNING AND DEEP LEARNING
  o scikit-learn (built on NumPy, SciPy, and Matplotlib; statistical modeling including regression, classification, clustering, etc.)
  o Keras (deep learning neural networks)
• DEEP LEARNING LIBRARIES
  o TensorFlow (deep learning: production and deployment)
  o PyTorch (deep learning: regression, classification, ...)

Libraries in other languages

• APACHE SPARK: general-purpose
  cluster-computing framework; can be used from many languages, including Scala and Scala libraries like Vegas and BigDL
• R LIBRARIES: dplyr (data manipulation), stringr (string manipulation), ggplot2/plotly/lattice/leaflet (data visualization), caret (machine learning); to install, use the command install.packages("package name")

Languages of data science:

• Python, R, SQL (we will see these used in the following): recommended first
• Scala, Java, C++, Julia: some of the most popular
  o Java - Hadoop: manages data processing and storage for big data applications running in clustered systems
• Go, Ruby, Visual Basic also have benefits

Python

Why Python is Great

• It's a high-level, general-purpose programming language that can be applied to many different classes of problems.
• It has a large standard library that provides tools suited to many different tasks, including but not limited to databases, automation, web scraping, text processing, image processing, machine learning, and data analytics.
• For data science, you can use Python's scientific computing libraries such as Pandas, NumPy, SciPy, and Matplotlib.
• For artificial intelligence, it has TensorFlow, PyTorch, Keras, and scikit-learn.
• Python can also be used for Natural Language Processing (NLP) using the Natural Language Toolkit (NLTK).

Comments

• Use a "#" to put a comment in a line of code; it is ignored when the code runs. It is really good practice to write out what you are doing and why, in case anyone else looks at the code.
• OR use three quotation marks, """ hello world """, for MULTI-LINE comments (technically an unused string literal, which Python evaluates and discards).

Types

• Text Type: str (string: a sequence of characters such as numbers, letters, or symbols)
• Numeric Types: int (no decimal), float (decimal), complex (real plus imaginary part, written with j, e.g. 1+2j)
• Sequence Types: list [], tuple (), range
• Mapping Type: dict {'apple': 'banana'}
• Set Types: set {'apple', 'banana'}, frozenset({'apple', 'banana'})
• Boolean Type: bool (Boolean values: True (1), False (0); must be capitalized!)
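The types listed so far can be checked directly with the built-in type() function. The snippet below is a minimal sketch rather than part of the original notes; the `samples` dictionary and its values are made up for illustration.

```python
# Each key is the type name from the notes; each value is a literal of that type.
# The sample values are illustrative assumptions, not taken from the notes.
samples = {
    "str": "hello",
    "int": 7,
    "float": 7.5,
    "complex": 1 + 2j,
    "list": ["apple", "banana"],
    "tuple": ("apple", "banana"),
    "range": range(3),
    "dict": {"apple": "banana"},
    "set": {"apple", "banana"},
    "frozenset": frozenset({"apple", "banana"}),
    "bool": True,
}

for expected_name, value in samples.items():
    # type(value).__name__ gives the name of the value's type as a string
    print(expected_name, "->", type(value).__name__)

# Booleans behave like the integers 1 and 0 in arithmetic and comparisons.
two = True + True
false_is_zero = (False == 0)
print(two, false_is_zero)
```

Writing `true` or `false` in lowercase raises a NameError, which is why the capitalization note above matters.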
• Binary Types: bytes, bytearray, memoryview

Python collections (arrays):

  o List is a collection which is ordered and changeable. Allows duplicate members.
  o Tuple is a collection which is ordered and unchangeable. Allows duplicate members.
  o Set is a collection which is unordered and unindexed. No duplicate members.
  o Dictionary is a collection which is changeable and, as of Python 3.7, keeps insertion order. No duplicate keys.

Expressions = Math Operations

    a + b     addition (sum of a and b)
    -a        negative of a (unary negation)
    a - b     subtraction (b subtracted from a)
    a * b     multiplication (product of a and b)
    a / b     division (a divided by b; always gives a FLOAT answer)
    a // b    floor division (a divided by b, rounded down; an INTEGER answer for integers)
    a ** b    exponent (a raised to the power of b)
    +a        unary plus (leaves the number unchanged)

Variables

• Names:
  o Store your values; for example, my_variable = 1 stores 1 under the name my_variable for you to use later
  o Cannot start with a number and can only contain letters, numbers, and underscores
  o Can assign multiple values at a time (i.e. x, y, z = 'banana', 'apple', 'cherry'; print(x) then gives banana, etc.)
• Output:
  o print() function
• Global variables: create the variable OUTSIDE a function and you can use it INSIDE any function. A variable created inside a function is local; to create or change a global variable from inside a function, declare it with the global keyword.
• Casting:
  o Done through constructor functions such as int() and str()

Strings

• Strings are arrays
  o Since they are arrays, you can loop through them with a FOR loop
• Square brackets can be used to access elements of a string (i.e. a = 'hello world', print(a[0]) gives h)
• Length uses the len() function (i.e. a = 'hi', print(len(a)) results in 2)
• Can check for a substring using IF statements:
    txt = "The best things in life are free!"
    if "free" in txt:
        print("Yes, 'free' is present.")
• Concatenate strings:
  o Use the (+) operator
      a = 'hi'
      b = 'there'
      c = a + b
      print(c)          # result is hithere
    To add a space between them, use c = a + " " + b
  o Can't combine strings and numbers this way (convert the number with str() first)

STRING INDEX: identifies the position of each element in the string (for example, 'Kayla' has elements 0, 1, 2, 3, 4, OR the index can be negative: -5, -4, -3, -2, -1)

STRING METHODS

    Method          Description
    capitalize()    Converts the first character to upper case
    casefold()      Converts string into lower case
    center()        Returns a centered string
    count()         Returns the number of times a specified value occurs in a string
    encode()        Returns an encoded version of the string
    endswith()      Returns True if the string ends with the specified value
    expandtabs()    Sets the tab size of the string
    find()          Searches the string for a specified value and returns the position where it was found (-1 if not found)
    format()        Formats specified values in a string
    format_map()    Formats specified values in a string
    index()         Searches the string for a specified value and returns the position where it was found (raises an error if not found)
    isalnum()       Returns True if all characters in the string are alphanumeric
    isalpha()       Returns True if all characters in the string are in the alphabet
    isdecimal()     Returns True if all characters in the string are decimals
    isdigit()       Returns True if all characters in the string are digits
    isidentifier()  Returns True if the string is an identifier
    islower()       Returns True if all characters in the string are lower case
    isnumeric()     Returns True if all characters in the string are numeric
    isprintable()   Returns True if all characters in the string are printable
    isspace()       Returns True if all characters in the string are whitespace
    istitle()       Returns True if the string follows the rules of a title
    isupper()       Returns True if all characters in the string are upper case
    join()          Joins the elements of an iterable into one string, using this string as the separator
    ljust()         Returns a left-justified version of the string
    lower()         Converts a string into lower case
    lstrip()        Returns a left-trimmed version of the string
    maketrans()     Returns a translation table to be used in translations
    partition()     Returns a tuple where the string is parted into three parts
    replace()       Returns a string where a specified value is replaced with a specified value
    rfind()         Searches the string for a specified value and returns the last position where it was found (-1 if not found)
    rindex()        Searches the string for a specified value and returns the last position where it was found (raises an error if not found)
    rjust()         Returns a right-justified version of the string
    rpartition()    Returns a tuple where the string is parted into three parts, splitting at the last occurrence
    rsplit()        Splits the string at the specified separator, starting from the right, and returns a list
    rstrip()        Returns a right-trimmed version of the string
    split()         Splits the string at the specified separator, and returns a list
    splitlines()    Splits the string at line breaks and returns a list
    startswith()    Returns True if the string starts with the specified value
    strip()         Returns a trimmed version of the string
    swapcase()      Swaps cases: lower case becomes upper case and vice versa
    title()         Converts the first character of each word to upper case
    translate()     Returns a translated string
    upper()         Converts a string into upper case
    zfill()         Fills the string with a specified number of 0 values at the beginning

STRING SLICING: a[start:stop] returns the characters from index start up to, but not including, index stop

STRING STRIDING: a[start:stop:step] takes every step-th character of the slice

ESCAPE SEQUENCE

• \n = new line in the text of your string
• \t = tab in your string
• \\ = puts an actual backslash in your string

Tuple ( )

Immutable = ordered and cannot change; allows duplicates

• Comma separated: must have AT LEAST ONE COMMA TO BE A TUPLE. Items can be any data type, and one tuple can contain different data types.
• Concatenate = combine (+ to add tuples and * to multiply tuples). Tuples can be nested within tuples.
• Check if something is in a tuple:
    thistuple = ("apple", "banana", "cherry")
    if "apple" in thistuple:
        print("Yes, 'apple' is in the fruits tuple")

**The only way to change something in a tuple is to convert it to a list; then you can alter the list (and convert it back with tuple()).

UNPACK TUPLE

    fruits = ("apple", "banana",
              "cherry")
    (green, yellow, red) = fruits
    print(green)     # results in apple
    print(yellow)    # results in banana
    print(red)       # results in cherry

LOOP TUPLE

  o Use a "FOR" loop
      thistuple = ("apple", "banana", "cherry")
      for x in thistuple:
          print(x)
  o Through the index: use range() and len()
      thistuple = ("apple", "banana", "cherry")
      for i in range(len(thistuple)):
          print(thistuple[i])
  o Use a "WHILE" loop
      thistuple = ("apple", "banana", "cherry")
      i = 0
      while i < len(thistuple):
          print(thistuple[i])
          i = i + 1

JOIN TUPLE by adding two tuples together with "+", or multiply with "*", which just repeats the tuple that many times

TUPLE METHODS

    count()    Returns the number of times a value occurs in the tuple
    index()    Searches the tuple for a value and returns the position where it was found

Tuples also support negative indexing and ranges of indexes (slices).

Lists [ ]

Mutable = can change; allow duplicates; items can be any data type, and different data types can be mixed

NOTABLE FUNCTIONS:

• len() function for the number of elements
• range() function

LOOP

• Use a "for" loop to loop through a list
    thislist = ["apple", "banana", "cherry"]
    for x in thislist:
        print(x)
• Use the range() and len() functions to create an iterable and loop through the items by their index number
    thislist = ["apple", "banana", "cherry"]
    for i in range(len(thislist)):
        print(thislist[i])
• Use a "while" loop and the len() function. Remember: always increase the index by 1 after each iteration!
    thislist = ["apple", "banana", "cherry"]
    i = 0
    while i < len(thislist):
        print(thislist[i])
        i = i + 1

LIST METHODS

    Method     Description
    append()   Adds an element at the end of the list
    clear()    Removes all the elements from the list
    copy()     Returns a copy of the list. Use this because a list can be changed: the copy lets you make changes without changing the original list. OR make a new list with the list() function.
    count()    Returns the number of elements with the specified value
    extend()   Adds the elements of a list (or any iterable) to the end of the current list
                   list1 = ["a", "b", "c"]
                   list2 = [1, 2, 3]
                   list1.extend(list2)
                   print(list1)
    index()    Returns the index of the first element with the specified value
    insert()   Adds an element at the specified position
    list()     Built-in function that can be used to make a copy of a list [i.e. mylist = list(thislist)]
    pop()      Removes the element at the specified position
    remove()   Removes the first item with the specified value
    reverse()  Reverses the order of the list
    sort()     Sorts the list in ascending order. To sort descending, use the argument reverse=True. Sorting is case sensitive; solve this with the argument key=str.lower so the list is compared uniformly.

Dictionaries { } with keys and values

• Used to store data values in key:value PAIRS. A dictionary is ordered, changeable, and does not allow duplicate keys; values can be any data type.
• Example: thisdict = {"brand": "ford", "model": "mustang", "year": 1964}

FUNCTIONS TO USE WITH DICTIONARY:

• Length using len()
• type()

ACCESSING ITEMS IN DICTIONARY:

• Refer to the key name, i.e. x = thisdict["model"] (note the square brackets)
• .get() will give the same result (but returns None instead of raising an error when the key is missing)
• .keys() will return a view of all the keys in the dictionary
• .values() will return a view of all the values in the dictionary
• .items() will return a view of all the key:value pairs as tuples
• Check if a key exists with the IN keyword:
    if "model" in thisdict:
        print("Yes, this key is in this dictionary")
  The expression "model" in thisdict can also be used on its own to return a True or False Boolean.

CHANGE ITEMS IN DICTIONARY:
• Refer to its key: thisdict["year"] = 2018 will change it from 1964 to 2018
• .update(): thisdict.update({"year": 2018})

ADD/REMOVE ITEMS IN DICTIONARY:

• Add:
  o Make a new key: thisdict["color"] = "red" will add a new key:value pair
  o .update() will do the same: thisdict.update({"color": "red"})
• Remove:
  o .pop() removes the item with that key name
  o .popitem() removes the last inserted item
  o The del keyword removes the item with that key name (del thisdict["model"]); it can also delete the whole dictionary
  o .clear() empties the dictionary

LOOP THROUGH A DICTIONARY:

• FOR loop: for x in thisdict: print(x) returns all the keys in the dictionary one by one
• FOR loop: for x in thisdict: print(thisdict[x]) returns all the values one by one
• .values() method: for x in thisdict.values(): print(x) prints all values
• .keys() method: for x in thisdict.keys(): print(x) prints all keys
• .items() method: for x, y in thisdict.items(): print(x, y) will print keys and values

COPY DICTIONARY:

• .copy() method: mydict = thisdict.copy(), then print(mydict)
• dict() function: mydict = dict(thisdict), then print(mydict)

NESTED DICTIONARY: either create a dictionary that has 3 nested dictionaries within it OR create 3 separate dictionaries and combine them into one new dictionary

DICTIONARY METHODS

    Method        Description
    clear()       Removes all the elements from the dictionary
    copy()        Returns a copy of the dictionary
    fromkeys()    Returns a dictionary with the specified keys and value
    get()         Returns the value of the specified key
    items()       Returns a view containing a tuple for each key:value pair
    keys()        Returns a view containing the dictionary's keys
    pop()         Removes the element with the specified key
    popitem()     Removes the last inserted key-value pair
    setdefault()  Returns the value of the specified key. If the key does not exist: inserts the key, with the specified value
    update()      Updates the dictionary with the specified key-value pairs
    values()      Returns a view of all the values in the dictionary

Sets { }

• Used to store multiple items in a single variable; items can be any data type and types can be mixed; a set is unordered and unindexed
• Constructor = set()
• Length of a set with the len() function

ACCESS ITEMS:

• Find a specific item using the "IN" keyword
    thisset = {"apple", "banana", "cherry"}
    print("banana" in thisset)
• Loop through the set using a "FOR" loop
    thisset = {"apple", "banana", "cherry"}
    for x in thisset:
        print(x)

ADDING/REMOVING ITEMS - YOU CANNOT CHANGE AN ITEM IN A SET, BUT YOU CAN ADD OR REMOVE ITEMS

• Add:
  o Add a specific element with the .add() method
  o Add the items of another set with the .update() method, i.e. seta.update(setb) adds the items of setb to seta
• Remove:
  o .remove() method: thisset.remove("banana"); if the item doesn't exist, it will raise an error
  o .discard() method: thisset.discard("banana"); if the item doesn't exist, it will NOT raise an error
  o .pop() removes an arbitrary item: sets are unordered, so you won't know which item gets removed
      thisset = {"apple", "banana", "cherry"}
      x = thisset.pop()
      print(x)
      print(thisset)
  o .clear() empties the set
  o del thisset will delete the set altogether

LOOP SETS

• Use a "FOR" loop, i.e. for x in thisset: print(x) will print all the elements in the set

JOIN SETS

• union() = returns a new set with all the values from the sets you combined
• update() = adds the values from the set you choose into the other set
• **Both union() and update() exclude duplicates!
• intersection_update() keeps only the items present in BOTH sets (i.e. it keeps the shared items) [i.e. x.intersection_update(y)]
• intersection() creates a NEW set that contains only the items shared by the sets [i.e.
  z = x.intersection(y)]

SET METHODS

    Method                         Description
    add()                          Adds an element to the set
    clear()                        Removes all the elements from the set
    copy()                         Returns a copy of the set
    difference()                   Returns a set containing the difference between two or more sets
    difference_update()            Removes the items in this set that are also included in another, specified set
    discard()                      Removes the specified item
    intersection()                 Returns a set that is the intersection of two other sets
    intersection_update()          Removes the items in this set that are not present in other, specified set(s)
    isdisjoint()                   Returns True if two sets have no items in common
    issubset()                     Returns whether another set contains this set or not
    issuperset()                   Returns whether this set contains another set or not
    pop()                          Removes an element from the set
    remove()                       Removes the specified element
    symmetric_difference()         Returns a set with the symmetric differences of two sets
    symmetric_difference_update()  Inserts the symmetric differences from this set and another
    union()                        Returns a set containing the union of sets
    update()                       Updates the set with the union of this set and others

Conditions and Branching

COMPARISON OPERATORS, AKA CONDITIONS:

    Equals:                    a == b
    Not equals:                a != b
    Less than:                 a < b
    Less than or equal to:     a <= b
    Greater than:              a > b
    Greater than or equal to:  a >= b

• How these are used for branching (if/else):
  o In "if statements" and loops
• IF
    a = 33
    b = 200
    if b > a:
        print("b is greater than a")
• ELIF: if the previous conditions were NOT true, then try this condition
    a = 33
    b = 33
    if b > a:
        print("b is greater than a")
    elif a == b:
        print("a and b are equal")
• ELSE: catches anything that wasn't caught by the previous conditions
    a = 200
    b = 33
    if b > a:
        print("b is greater than a")
    elif a == b:
        print("a and b are equal")
    else:
        print("a is greater than b")
• Or you can have an ELSE statement without the ELIF
    a = 200
    b = 33
    if b > a:
        print("b is greater than a")
    else:
        print("b is not greater than a")
  o Shorthand IF
      if a > b: print("a is greater than b")
  o Shorthand if...else (ternary
    operators, or conditional expressions)
      a = 2
      b = 330
      print("A") if a > b else print("B")
  o AND/OR: combine the comparison operators above with and/or to decide whether a statement is printed
  o Nested if: you can have if statements inside if statements, ending with an else statement
  o Pass statement: use pass if you need to leave the body empty but don't want to get an error

Loops

• WHILE loops: execute a set of statements as long as the condition is TRUE
  o Print i as long as i is less than 6
      i = 1
      while i < 6:
          print(i)
          i += 1
  o You must set the variable first (we set it to 1) and you MUST increment i or the loop will go on forever
  o The break statement can be used to stop the loop even if the condition is still true (i.e. if i == 3: break)
  o The continue statement skips the rest of the current iteration and continues with the next one
    - Continue to the next iteration if i is 3:
        i = 0
        while i < 6:
            i += 1
            if i == 3:
                continue
            print(i)
  o Use the else statement to run code once the condition is no longer true
      i = 1
      while i < 6:
          print(i)
          i += 1
      else:
          print("no")
• FOR loops: used for iterating over a sequence (that is, a list, a tuple, a dictionary, a set, or a string)
  o The for loop does not require an indexing variable to be set beforehand.
  o Break statement
      fruits = ["apple", "banana", "cherry"]
      for x in fruits:
          print(x)
          if x == "banana":
              break
  o Continue statement
      fruits = ["apple", "banana", "cherry"]
      for x in fruits:
          if x == "banana":
              continue
          print(x)
  o range() function
      for x in range(2, 30, 3):
          print(x)
  o ELSE in a FOR loop **else will not be executed if the loop is stopped by a break statement**
      for x in range(6):
          if x == 3:
              break          # leaving via break skips the else block
          print(x)
      else:
          print("Finally finished!")
  o Nested loops: the "inner loop" will be executed one time for each iteration of the "outer loop"
      adj = ["red", "big", "tasty"]
      fruits = ["apple", "banana", "cherry"]
      for x in adj:
          for y in fruits:
              print(x, y)
  o Pass statement: a placeholder when a loop has no content, so you don't get an error

Functions

A function is a block of code which only runs when it is called. You can pass data, known as parameters, into a function.
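As a small sketch of "only runs when it is called", the function below also puts and/or, a nested if, and pass to work, since the branching notes above mention those three without an example. The name describe_number and its rules are invented for illustration and are not from the original notes.

```python
def describe_number(n):
    """Return a short label for n (hypothetical helper for illustration)."""
    # "or": true when at least one of the two comparisons is true
    if n < 0 or n > 100:
        return "out of range"

    # "and": true only when both comparisons are true
    # (always the case here, because out-of-range values returned above)
    if n >= 0 and n <= 100:
        # nested if: a second decision inside the first one
        if n % 2 == 0:
            label = "even"
        else:
            label = "odd"

    if n == 0:
        pass  # placeholder: an empty body here would be a syntax error

    return label


# Nothing has been printed so far: the body runs only when the function is called.
for value in (4, 7, 250):
    print(value, "->", describe_number(value))
```

The value passed in (4, 7, 250) is the data handed to the parameter n on each call.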
A function can return data as a result.

CREATING A FUNCTION

• A Python function is defined by using the def keyword:
    def my_function():
        print("Hello from a function")
• To call the function, use the function name you created followed by parentheses (i.e. my_function())
• Arguments: information can be passed into functions as arguments. Arguments are specified after the function name, inside the parentheses. You can add as many arguments as you want; just separate them with a comma.
    def my_function(fname):
        print(fname + " Refsnes")

    my_function("Emil")
    my_function("Tobias")
    my_function("Linus")
• Parameter vs argument:
  o A parameter is the variable listed inside the parentheses in the function definition.
  o An argument (args) is the value that is sent to the function when it is called.
    - You must pass exactly as many arguments as the function expects, i.e. for def myfun(fname, lname): the call must be myfun("Kayla", "Morris")
    - If you don't know how many arguments you will have, put a * before the parameter (i.e. def myfun(*name)); the function then receives a tuple of arguments, as in myfun("kayla", "kyle", "brady")
  o Keyword arguments (kwargs) let you send arguments with key=value syntax [i.e. def myfun(child1, child2), called as myfun(child1="brady", child2="kyle")]
    - If the number of keyword arguments is unknown, put two asterisks (**) before the parameter
  o You can pass a list as an argument
• Return values: the return statement gives a value back as the result (i.e. this returns whatever you pass to my_function, times 5)
    def my_function(x):
        return 5 * x
• Recursion (a function can call itself): recursion is a common mathematical and programming concept. It means that a function calls itself. This has the benefit of meaning that you can loop through data to reach a result.
  o Be very careful with recursion, as it can be quite easy to slip into writing a function which never terminates, or one that uses excess amounts of memory or processor power.
  o In this example, tri_recursion() is a function that we have defined to call itself ("recurse").
    We use the k variable as the data, which decrements (-1) every time we recurse. The recursion ends when k is no longer greater than 0 (i.e. when it is 0).
      def tri_recursion(k):
          if k > 0:
              result = k + tri_recursion(k - 1)
              print(result)
          else:
              result = 0
          return result

      print("\n\nRecursion Example Results")
      tri_recursion(6)

LAMBDA

• A lambda function is a small anonymous function. A lambda function can take any number of arguments, but can only have one expression.
• Why use them? The power of lambda is better shown when you use it as an anonymous function inside another function.
  o Example: use that function definition to make a function that always doubles the number you send in:
      def myfunc(n):
          return lambda a: a * n

      mydoubler = myfunc(2)
      print(mydoubler(11))    # result: 22

Exception Handling (Try/Except)

• The try block lets you test a block of code for errors.
• The except block lets you handle the error.
• The finally block lets you execute code regardless of the result of the try and except blocks.
• try/except/else: the else block gives you code to run when no error was raised.
• raise lets you decide when to raise an error yourself.

Classes and Objects

Python is an object-oriented programming language. A class is like an object constructor, or a "blueprint" for creating objects.

• CREATE A CLASS
  o Use the 'class' keyword (i.e.
    class MyClass:); the property underneath will be x = 5
  o Create an object with p1 = MyClass(); now you can print the value of x with print(p1.x)
  o For classes to be useful, they need a function called __init__ to initialize the object. Example:
      class Person:
          def __init__(self, name, age):
              self.name = name
              self.age = age

      p1 = Person("John", 36)
      print(p1.name)
      print(p1.age)
  o Objects can also contain methods, which are functions that belong to the object. Create a method in the Person class: insert a function that prints a greeting and execute it on the p1 object:
      class Person:
          def __init__(self, name, age):
              self.name = name
              self.age = age

          def myfunc(self):
              print("Hello my name is " + self.name)

      p1 = Person("John", 36)
      p1.myfunc()
  o The self parameter is a reference to the current instance of the class, and is used to access variables that belong to the class. It does not have to be named self; you can call it whatever you like, but it has to be the first parameter of any method in the class.
  o You can:
    - Modify object properties (i.e. p1.age = 40)
    - Delete object properties or the object (i.e. del p1.age) or (del p1)

INHERITANCE: inheritance allows us to define a class that inherits all the methods and properties from another class.

  o Parent class is the class being inherited from, also called the base class.
  o Child class is the class that inherits from another class, also called the derived class.

1. Create a parent class: define the class, define the function to perform, define the variable (a 'Person' object), and execute the function.
2. Create the child class: with the parent class defined above, define the new child class, then define a new Student variable and execute.
3. Use the super() function to inherit all the methods and properties from the parent class.
4. Add properties and add methods to get the final version: define the parent class and its function, define the child class, add super() to inherit the parent class attributes and then add the child's own properties below it, add methods, then define a variable and execute.

CREATE AN ITERATOR: an iterator is an object that contains a countable number of values and can be iterated upon. Technically, it is an object that implements the iterator protocol, which consists of the methods __iter__() and __next__().

  o As you have learned in the Python Classes/Objects chapter, all classes have a function called __init__(), which allows you to do some initializing when the object is being created.
  o The __iter__() method acts similarly: you can do operations (initializing, etc.), but it must always return the iterator object itself.
  o The __next__() method also allows you to do operations, and must return the next item in the sequence.
  o StopIteration example: to prevent the iteration from going on forever, we can use the StopIteration exception. In the __next__() method, we can add a terminating condition that raises it once the iteration has been done a specified number of times (in this case 20): create the variables, then create the conditions and execute.

Working with Data in Python

Reading and Writing Files

MODULE

• Module: create a module by saving your file in '.py' form, so you have a file of the functions you want to use in other applications
• You can use the module by importing it (i.e. import mymodule)
• When using a function from the module, call it through module_name.function_name
• You can create an alias when you import a module by using 'as' (import mymodule as mx)
• There is a built-in function to list all the function names (or variable names) in a module: the dir() function
• You can also import just the names you want by using 'from' (from mymodule import person1)

FILE HANDLING

• The open() function is the key. Modes for opening:
  o "r" - Read - Default value.
  Opens a file for reading; raises an error if the file does not exist
o "a" - Append - Opens a file for appending; creates the file if it does not exist
o "w" - Write - Opens a file for writing; creates the file if it does not exist (existing content is overwritten)
o "x" - Create - Creates the specified file; raises an error if the file exists
• You can also state whether the file is text ("t") or binary ("b", e.g. images)
• Example: f = open("demofile.txt", "rt") opens the file for reading in text mode
READ FILES - USE BUILT-IN open() AND read()
• If the file is located in the same folder as the Python script:
    f = open("demofile.txt", "r")
    print(f.read())
• If it is located somewhere else on the computer (escape every backslash in the path):
    f = open("D:\\myfiles\\welcome.txt", "r")
    print(f.read())
• Read part of the file: f.read(n) returns the first n characters
• Read lines: f.readline() reads the first line
o Calling it twice reads the first two lines, etc.
• Loop through the lines:
    f = open("demofile.txt", "r")
    for x in f:
        print(x)
• ALWAYS CLOSE THE FILE: f.close()
WRITE/CREATE FILES
o Write or append:
    f = open("demofile2.txt", "a")
    f.write("Now the file has more content!")
    f.close()
o Create new: to create a new file in Python, use the open() function with one of the following modes:
▪ "x" - Create - will create a file; raises an error if the file exists
▪ "a" - Append - will create a file if the specified file does not exist
▪ "w" - Write - will create a file if the specified file does not exist
DELETE FILES
o Import os and delete:
    import os
    os.remove("demofile.txt")

Python References
Date/time
A date in Python is not a data type of its own, but we can import a module named datetime to work with dates as date objects
1. Import datetime
2. Create variable/function
3. Change directive

Directive | Description | Example
%a | Weekday, short version | Wed
%A | Weekday, full version | Wednesday
%w | Weekday as a number 0-6, 0 is Sunday | 3
%d | Day of month 01-31 | 31
%b | Month name, short version | Dec
%B | Month name, full version | December
%m | Month as a number 01-12 | 12
%y | Year, short version, without century | 18
%Y | Year, full version | 2018
%H | Hour 00-23 | 17
%I | Hour 01-12 | 05
%p | AM/PM | PM
%M | Minute 00-59 | 41
%S | Second 00-59 | 08
%f | Microsecond 000000-999999 | 548513
%z | UTC offset | +0100
%Z | Timezone | CST
%j | Day number of year 001-366 | 365
%U | Week number of year, Sunday as the first day of week, 00-53 | 52
%W | Week number of year, Monday as the first day of week, 00-53 | 52
%c | Local version of date and time | Mon Dec 31 17:41:00 2018
%x | Local version of date | 12/31/18
%X | Local version of time | 17:41:00
%% | A % character | %
%G | ISO 8601 year | 2018
%u | ISO 8601 weekday (1-7) | 1
%V | ISO 8601 week number (01-53) | 01

Keywords
Keyword | Description
and | A logical operator
as | To create an alias
assert | For debugging
break | To break out of a loop
class | To define a class
continue | To continue to the next iteration of a loop
def | To define a function
del | To delete an object
elif | Used in conditional statements, same as else if
else | Used in conditional statements
except | Used with exceptions, what to do when an exception occurs
False | Boolean value, result of comparison operations
finally | Used with exceptions, a block of code that will be executed whether or not there is an exception
for | To create a for loop
from | To import specific parts of a module
global | To declare a global variable
if | To make a conditional statement
import | To import a module
in | To check if a value is present in a list, tuple, etc.
is | To test if two variables refer to the same object
lambda | To create an anonymous function
None | Represents a null value
nonlocal | To declare a non-local variable
not | A logical operator
or | A logical operator
pass | A null statement, a statement that will do nothing
raise | To raise an exception
return | To exit a function and return a value
True | Boolean value, result of comparison operations
try | To make a try...except statement
while | To create a while loop
with | Used to simplify exception handling
yield | To produce a value from a generator function (a function containing yield returns a generator)

Built-in Python Functions
Function | Description
abs() | Returns the absolute value of a number
all() | Returns True if all items in an iterable object are true
any() | Returns True if any item in an iterable object is true
ascii() | Returns a readable version of an object. Replaces non-ASCII characters with escape characters
bin() | Returns the binary version of a number
bool() | Returns the boolean value of the specified object
bytearray() | Returns an array of bytes
bytes() | Returns a bytes object
callable() | Returns True if the specified object is callable, otherwise False
chr() | Returns a character from the specified Unicode code
classmethod() | Converts a method into a class method
compile() | Returns the specified source as an object, ready to be executed
complex() | Returns a complex number
delattr() | Deletes the specified attribute (property or method) from the specified object
dict() | Returns a dictionary
dir() | Returns a list of the specified object's properties and methods
divmod() | Returns the quotient and the remainder when argument1 is divided by argument2
enumerate() | Takes a collection (e.g. a tuple) and returns it as an enumerate object
eval() | Evaluates and executes an expression
exec() | Executes the specified code (or object)
filter() | Uses a filter function to exclude items in an iterable object
float() | Returns a floating point number
format() | Formats a specified value
frozenset() | Returns a frozenset object
getattr() | Returns the value of the specified attribute (property or method)
globals() | Returns the current global symbol table as a dictionary
hasattr() | Returns True if the specified object has the specified attribute (property/method)
hash() | Returns the hash value of a specified object
help() | Executes the built-in help system
hex() | Converts a number into a hexadecimal value
id() | Returns the id of an object
input() | Allows user input
int() | Returns an integer number
isinstance() | Returns True if a specified object is an instance of a specified class
issubclass() | Returns True if a specified class is a subclass of a specified class
iter() | Returns an iterator object
len() | Returns the length of an object
list() | Returns a list
locals() | Returns an updated dictionary of the current local symbol table
map() | Returns an iterator with the specified function applied to each item
max() | Returns the largest item in an iterable
memoryview() | Returns a memory view object
min() | Returns the smallest item in an iterable
next() | Returns the next item from an iterator
object() | Returns a new object
oct() | Converts a number into an octal
open() | Opens a file and returns a file object
ord() | Returns the integer Unicode code of the specified character
pow() | Returns the value of x to the power of y
print() | Prints to the standard output device
property() | Gets, sets, deletes a property
range() | Returns a sequence of numbers, starting from 0 and incrementing by 1 (by default)
repr() | Returns a readable version of an object
reversed() | Returns a reversed iterator
round() | Rounds a number
set() | Returns a new set object
setattr() | Sets an attribute (property/method) of an object
slice() | Returns a slice object
sorted() | Returns a sorted list
staticmethod() | Converts a method into a static method
str() | Returns a string object
sum() | Sums the items of an iterable
super() | Returns an object that represents the parent class
tuple() | Returns a tuple
type() | Returns the type of an object
vars() | Returns the __dict__ property of an object
zip() | Returns an iterator, from two or more iterables

Pandas
Pandas is a library for working with data. It has functions for analyzing, cleaning, exploring, and manipulating data. Once you have Python and pip installed on your computer, you just need to install pandas (C:\Users\Your Name>pip install pandas) and import the library as pd to get started. Using something like Anaconda can help since it already has pandas installed.
Series
• .Series(): like a column in a table (a one-dimensional array)
o Create a series from a list, e.g. a = [1, 2, 3] and myvar = pd.Series(a). If no labels are given, the index is 0, 1, 2 as usual
o You can create labels for the elements, e.g. myvar = pd.Series(a, index=["x", "y", "z"]); now 1, 2, 3 are labeled x, y, z
o You can also turn key/value objects into a series: calories = {"day1": 100, "day2": 200} and myvar = pd.Series(calories)
Dataframe
• .DataFrame(): like the whole table (a two-dimensional structure of rows and columns)
o To locate a row within a dataframe, use the loc attribute with the row index: print(df.loc[0])
▪ Multiple indexes: print(df.loc[[0, 1]])
▪ Create named indexes
▪ Locate based on named indexes: print(df.loc["day2"])
• READ CSV OR JSON
o Load the CSV into a dataframe; use to_string() to print the whole dataframe
▪ **BY DEFAULT, print(df) ON A LARGE DATAFRAME SHOWS ONLY THE FIRST AND LAST 5 ROWS; YOU MUST USE to_string() TO SEE THEM ALL**
o JSON works the same way; the only difference is that if the JSON is already in a Python dictionary you can load it right into the dataframe
• ANALYZING DATA: head(), tail(), info(). These can tell you a lot, including any null values (important for cleaning!)
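The Series and DataFrame notes above can be tried out with a short sketch like the following. The column names and numbers are made-up illustration values, not data from these notes, and the sketch assumes pandas is installed.

```python
import pandas as pd

# A Series is a one-dimensional labeled column.
values = [1, 2, 3]
labeled = pd.Series(values, index=["x", "y", "z"])

# A dict becomes a Series whose keys are the labels.
calories = pd.Series({"day1": 100, "day2": 200})

# A DataFrame is a table; the index argument names the rows.
# (Hypothetical workout data, for illustration only.)
df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

row = df.loc["day2"]               # one row, returned as a Series
subset = df.loc[["day1", "day2"]]  # several rows, returned as a DataFrame

print(labeled["y"])
print(row["calories"])
print(df.head(2))  # first two rows
df.info()          # column types and non-null counts
```

Passing a list of labels to loc returns a DataFrame, while a single label returns a Series; this is why the notes show both df.loc[0] and df.loc[[0, 1]].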
CLEANING DATA
o Cleaning empty cells
▪ Remove empty rows: file_name.dropna()
▪ Replace empty values: file_name.fillna(#)
▪ Replace in a specific column: file_name["column_name"] = file_name["column_name"].fillna(#) (assigning the result back is more reliable than inplace=True on a single column)
▪ Replace using mean(), median(), or mode(): x = file_name["column_name"].mean(), then repeat the "specific column" code with x as the value (mode() returns a series, so take mode()[0])
o Cleaning wrong format: either remove the rows or change all the cells in that column to the same format
▪ Change column - Date: use pd.to_datetime()
▪ Remove rows: file_name.dropna(subset=['Column_name'], inplace=True)
o Cleaning wrong data
▪ Replacing values: file_name.loc[row #, 'Column_name'] = new #
▪ Replacing values in a loop (the last # is the value you need to set it to):
    for x in file_name.index:
        if file_name.loc[x, "Column_name"] > #:
            file_name.loc[x, "Column_name"] = #
▪ Removing rows:
    for x in file_name.index:
        if file_name.loc[x, "Column_name"] > #:
            file_name.drop(x, inplace=True)
o Removing duplicates
▪ Check for duplicates: print(file_name.duplicated()) returns Boolean values
▪ Remove: file_name.drop_duplicates(inplace=True)
CORRELATIONS
• corr() method
o Shows the relationship between columns: file_name.corr(). Non-numeric columns are ignored (recent pandas versions need file_name.corr(numeric_only=True) for this)
o The number varies from -1 to 1, with 1 (or -1) being a perfect correlation. Example:
▪ Duration to Duration got a perfect 1
▪ Duration to Calories got 0.922721: good correlation
▪ Duration to Maxpulse got 0.009403: bad correlation, so you cannot predict max pulse based on duration of workout and vice versa
PLOTTING
• plot() method; needs pyplot from matplotlib
o Scatter plot (needs an x and a y column): file_name.plot(kind='scatter', x='Column_x', y='Column_y')
o Histogram (only needs one column): file_name["Column_name"].plot(kind='hist')

Numpy (arrays)
• Can be up to 50x faster than processing lists; the object type is numpy.ndarray
• import numpy as np
• Basics on arrays:
• Creating an object: array_name = np.array([1, 2, 3])
• Dimensions in arrays: a nested array is an array that has other arrays as its elements
o 0-D: one value, arr = np.array(42)
o 1-D: np.array([1, 2, 3])
o 2-D: np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), a matrix
o Check the number of dimensions: print(arr.ndim)
• Indexing
o array_name.size: how many values the array holds
• Slicing
o print(array_name[first:last]): last is not included
o print(array_name[#:]): from that element to the end
o print(array_name[:#]): from the beginning to that element, NOT INCLUDING that element
o STEP: print(array_name[beginning:last(not included):step])
▪ e.g. print(arr[1:5:2]) returns every other element from index 1 up to (not including) index 5
o 2-D: print(arr[row, beginning:end]), e.g. print(arr[1, 1:4])
• Data types: check with the dtype property. You can also create an array of a specific type by adding the dtype argument, and use astype() to change the type of an existing array
o i - integer
o b - boolean
o u - unsigned integer
o f - float
o c - complex float
o m - timedelta
o M - datetime
o O - object
o S - string
o U - unicode string
o V - fixed chunk of memory for other type (void)
• Copy vs view
o .copy(): changes to the copy won't change the original
o .view(): changes to the view will change the original
• Array shape
o array_name.shape: size of the array in each dimension
• Reshaping
o newarr = arr.reshape(# of arrays, # of elements)
o Unknown dimension: use -1 and numpy will calculate it for you
o Flatten an array into 1-D: use reshape(-1)
• Iterating: going through elements one by one
o Use the for loop
o Works on 1-D and 2-D arrays
o To return the actual values (the scalars), iterate in each dimension
o Simpler way: np.nditer(arr)
o Enumeration means mentioning the sequence number of things one by one: np.ndenumerate(arr)
• Join
o arr = np.concatenate((arr1, arr2))
o Join 2-D arrays along axis 1 (side by side, so each row gets longer): arr = np.concatenate((arr1, arr2), axis=1)
o Stacking (done along a new axis): arr = np.stack((arr1, arr2), axis=1)
o Stack horizontally (side by side): arr = np.hstack((arr1, arr2))
o Stack vertically (one on top of the other): arr = np.vstack((arr1, arr2))
o Stack along height (depth): arr = np.dstack((arr1, arr2))
• Split
o Split in 3 ways, example: newarr = np.array_split(arr, 3)
o Split a 2-D array along axis 1 (each row is divided across the pieces): newarr = np.array_split(arr, 3, axis=1)
o Alternative for the same split: newarr = np.hsplit(arr, 3)
o vsplit() and dsplit() are also available
• Search
o where function: x = np.where(arr == 4)
o Find even: x = np.where(arr%2 == 0)
o Find odd: x = np.where(arr%2 == 1)
o Find the index where 7 should be inserted in a sorted array: x = np.searchsorted(arr, 7)
o Search from the right side: x = np.searchsorted(arr, 7, side='right')
• Sort: print(np.sort(arr)) sorts numerically or alphabetically; Boolean arrays can also be sorted
• Filter
o Use a Boolean index list: a list of True/False values, one per element, where only the elements at True positions are kept

Simple APIs
Application Program Interface and REST APIs
• WHAT IS AN API?: An API lets two pieces of software talk to each other. Your program communicates with the other software through the API's inputs and outputs (you just need to know the inputs and outputs)
o API libraries: the API is the only part of the library that you see, whereas the library is the entire process of the data going back and forth
o Example: pandas is a set of software components, and the pandas API is how your Python code communicates with those components to process data
• REST API: REST stands for representational state transfer. REST APIs enable you to communicate via the internet, taking advantage of storage, greater data access, artificial intelligence algorithms, etc.
o They have a set of rules about communication with the web service, the input (request), and the output (response)
Rest APIs, Webscraping, and Working with Files

Databases/SQL
Why SQL is great
• Originally designed for relational databases, but its use has expanded
• Knowing SQL will help you do many different jobs in data science, including business and data analyst, and it's a must in data engineering
• When performing operations with SQL, you access the data directly. There's no need to copy it beforehand. This can speed up workflow executions considerably.
• SQL is the interpreter between you and the database.
• SQL is an American National Standards Institute, or "ANSI," standard, which means if you learn SQL and use it with one database, you will be able to easily apply that SQL knowledge to many other databases.
• There are many different SQL databases available, including MySQL, IBM Db2, PostgreSQL, Apache OpenOffice Base, SQLite, Oracle, MariaDB, Microsoft SQL Server, and more. The syntax of the SQL you write might change a little bit based on the relational database management system you're using.
Basics of SQL
• What is SQL: a query language to get data out of a database
• Basic commands
o Create Table
o Insert
o Select: retrieves data from the table. It is a DML statement and a query, and the result of the query is a set/table
▪ Select * from <tablename>
o Update
o Delete
Relational Database Model
• Data is stored in tabular form (a table) with columns and rows
• RDBMS (relational database management system): a set of software tools that controls the access, organization, and storage of the data
• Advantages of the relational model
• Explain how entity name and attributes map to a relational database
• Working knowledge of SQL and databases
Connect to database and run SQL queries

R
Why R is Great
• R has become the world's largest repository of statistical knowledge.
• As of 2018, R has more than 15,000 publicly released packages, making it possible to conduct complex exploratory data analysis.
• R integrates well with other computer languages, such as C++, Java, C, .Net, and Python.
• Common mathematical operations such as matrix multiplication work straight out of the box.
• R has stronger object-oriented programming facilities than most statistical computing languages

Data Visualization with Python

Machine Learning with Python
Basics:
• Machine learning uses algorithms to identify patterns in the data through model training, and can then make decisions based on that training
• Deep learning is a specialized type of machine learning: a general set of models and techniques that tries to loosely emulate the way the human brain solves a wide range of problems.
o Common uses: natural language processing, image/audio/visual analysis, time series forecasting
o **Requires VERY LARGE sets of labeled data and is computing intensive**
o Built using: TensorFlow, PyTorch, Keras; look for a "model zoo"
o Model Asset eXchange (MAX)
1. SUPERVISED LEARNING: In supervised learning, a human provides input data and the correct outputs. The model tries to identify relationships and dependencies between the input data and the correct output. Generally speaking, supervised learning is used to solve regression and classification problems. Controlled environment.
• Regression: predict a real numeric value (e.g. home values based on home characteristics)
• Classification: does something belong to a class (e.g. identifying spam)
2. UNSUPERVISED LEARNING: In unsupervised learning, the data is not labelled by a human. The models must analyze the data and try to identify patterns and structure within the data based only on the characteristics of the data itself. Clustering and anomaly detection are two examples of this learning style. Less controlled environment.
• Clustering: used to divide the record set into groups (e.g. purchase recommendations based on groups of previous purchases)
• Anomaly detection: identify outliers (e.g. detecting credit card fraud because a transaction is an anomaly)
3. REINFORCEMENT LEARNING: The third type of learning, reinforcement learning, is loosely based on the way human beings and other organisms learn. Think about a mouse in a maze. If the mouse gets to the end of the maze it gets a piece of cheese. This is the "reward" for completing a task. The mouse learns, through trial and error, how to get through the maze to get as much cheese as it can. In a similar way, a reinforcement learning model learns the best set of actions to take, given its current environment, in order to get the most reward over time. This type of learning has recently been very successful in beating the best human players in games such as Go, chess, and popular strategy video games.

Techniques
Python for machine learning
• NumPy: for working with arrays and doing computation
• SciPy: for scientific and high performance computation
• Matplotlib
• Pandas
• scikit-learn: classification, regression, and clustering algorithms, easy to implement
Regression: predicting a continuous variable
• Y = dependent variable (all other variables affect this one); it must be continuous and cannot be a discrete value
• X = independent variables (they affect the y variable)
• Types:
o Simple regression: one independent variable is used to estimate a dependent variable
▪ Linear or non-linear, depending on the nature of the relationship between the values
o Multiple regression: multiple independent variables are used to estimate a dependent variable
▪ Linear
▪ Non-linear
Regression algorithms
Measuring regression model accuracy
• Out-of-sample accuracy can be improved with:
o Train/test split evaluation: gives a more accurate estimate of out-of-sample accuracy, but the result is highly dependent on which data the model is trained and tested on
o K-fold cross-validation: the accuracy of each fold is averaged, and each fold is distinct (no test data is reused in another fold)
• What is an error? The difference between the data points and the trend line of a model
o Mean absolute error: the average of the absolute errors
o Mean squared error: emphasizes large errors because of the squared terms
o Root mean squared error: in the same units as the response, so it is easy to interpret
o Relative absolute error: takes the total absolute error and normalizes it by the total absolute error of a simple predictor (the mean)
o Relative squared error: used to calculate R2 (R2 = 1 - relative squared error), which represents how close the data values are to the trend line; higher R2 = better fit
Statistics for Data Science
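The error measures listed under "Measuring regression model accuracy" can be sketched in plain Python as below. The function name regression_errors and the sample numbers are hypothetical, chosen only for illustration; the relative errors use the mean of the actual values as the simple baseline predictor.

```python
import math


def regression_errors(actual, predicted):
    """Return common regression error measures as a dict (illustrative helper)."""
    n = len(actual)
    mean_actual = sum(actual) / n

    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    sq_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

    mae = sum(abs_errors) / n   # mean absolute error
    mse = sum(sq_errors) / n    # mean squared error
    rmse = math.sqrt(mse)       # root mean squared error

    # Relative errors: normalize by the error of always predicting the mean.
    rae = sum(abs_errors) / sum(abs(a - mean_actual) for a in actual)
    rse = sum(sq_errors) / sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - rse                # higher R2 = better fit

    return {"mae": mae, "mse": mse, "rmse": rmse, "rae": rae, "rse": rse, "r2": r2}


# Made-up values for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
metrics = regression_errors(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In practice a library such as scikit-learn provides these measures ready-made; writing them out by hand is only meant to show how each one relates to the raw errors.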
