Intrusion Detection Systems using Machine Learning
By: Riley Arneson and Kellan Dempsey

Basics of Intrusion Detection Systems
● Intrusion
  ○ Defined as an attack that affects a network's or computer's reliability, privacy, or accessibility
  ○ Example attacks include
    ■ Denial of Service (DoS)
      ● Restricts access to a network or computer
    ■ User-to-root (U2R)
      ● Attempt to gain admin access where the attacker previously had only user access
    ■ Remote-to-local (R2L)
      ● An attacker without an account sends packets to the target machine over the network to gain local access
● Intrusion Detection System (IDS)
  ○ A system designed to identify unauthorized activity and then notify admins or hosts of the suspicious and potentially malicious activity
  ○ Typically positioned between the network switch and the firewall

Concerns about Adequate IDSs in New Technology
● Unmanned Aerial Vehicles (UAVs)
  ○ Used in military and industrial applications for essential tasks such as surveillance and mission control
  ○ A new IDS was proposed by Whelan et al. to handle the risks of wireless technology in a high-threat environment, such as GPS spoofing and jamming
● Internet of Things (IoT)
  ○ Because IoT involves many devices all exchanging data, and because these devices are so popular, there is a high possibility of threats

Types of IDSs
● Deployment-Based IDS
  ○ Host-based IDS
  ○ Network-based IDS
● Detection-Based IDS
  ○ Signature-based detection
  ○ Anomaly detection-based
● Statistics-Based Technique

Deployment-Based IDS
● Host-based IDS
  ○ Installed on a single machine
  ○ Monitors all activity on the host
  ○ Main drawback is that each computer must have the IDS installed, which can lower overall performance because of the extra processing power needed on each node
● Network-based IDS
  ○ Deployed on a network
  ○ Monitors all traffic that occurs on the network
  ○ Often uses machine learning and deep learning

Machine Learning in Network-Based IDS
1. Preprocessing - Essential to ML, as it results in faster and more accurate outputs
   a. Data Cleaning - Handles inconsistent data, missing data, and outliers
   b. Feature Selection/Extraction - Selecting or removing features based on relevance
   c. Feature Scaling - Standardizes all features so that no feature dominates solely because of its magnitude
   d. Handling Categorical Variables - Transforming categorical values such as sex, country, or grade into numerical values
   e. Data Splitting - Split data into training, validation, and testing sets
   f. Handling Data Imbalance - If the classes in a dataset are imbalanced, employ techniques such as oversampling or undersampling
2. Training - 80% of the original data set is used to train the ML algorithm
3. Testing - 20% is used to test the accuracy of the ML algorithm

Detection-Based IDSs
● Signature-based detection
  ○ Focuses on identifying known attack patterns by recognizing attack signatures
  ○ Requires a database of known attack signatures against which suspected intrusions are compared
  ○ Extremely good at recognizing known attacks; however, it cannot detect novel attacks or patterns
● Anomaly detection-based
  ○ Establishes a profile of what a normal user does
  ○ Deviation from the pattern is flagged as an attack
  ○ Can detect novel attacks
  ○ Higher false alarm rate

Statistics-Based Technique
● Collects and analyzes data to build a statistical model of regular user behavior
  ○ Two types of models
  ○ Univariate
    ■ Models only a single feature in isolation
    ■ No regard for its relationship with other variables
  ○ Multivariate
    ■ Analyzes multiple variables at one time and considers their relationships to one another
    ■ More common in complex systems with lots of moving parts

IDS Feature Engineering
● Used for multiple reasons
  ○ Reduces computational complexity, removes and simplifies redundant data, decreases false alarm rates, and improves the accuracy of machine learning algorithms
● Large amounts of data are generated across different fields
  ○ Causes issues, as large datasets require more computational resources
  ○ Creates a need for feature engineering to make the amount of data more manageable
● Feature engineering is a deeply researched and highly utilized field
  ○ Used for pattern recognition, machine learning, and intrusion detection, among other things
  ○ Has incorporated both simple and comprehensive models to improve the performance of feature engineering

Feature Engineering Algorithms
● Multiple advantages
  ○ Reduces the cost of data collection and classification model training, improves classification performance, allows smaller model sizes, and may improve the interpretability of a model
● Feature extraction
  ○ Combines features based on important characteristics
  ○ Effective method for improving a learning algorithm
  ○ Can complicate future analysis
● Feature selection
  ○ Chooses the most important features
  ○ Keeps the original features

IDS Evasion Techniques
● Obfuscation
  ○ All techniques that aim to avoid detection by making malicious activity harder to recognize
  ○ Fragmentation, flooding, and encryption are methods of obfuscation
● Anonymizers
  ○ All techniques that attempt to hide information about malicious network traffic
  ○ Information manipulation (routing, IP, port), proxy servers, and VPNs are all anonymizers
  ○ The Tor network is an anonymization system that encrypts network traffic and routes it through nodes to make the original source harder to find

Obfuscation 1
● Fragmentation
  ○ Tries to bypass an IDS by splitting a packet into fragments that are reassembled after transmission
  ○ The IDS may scan the fragments individually and not recognize the danger
  ○ As a counter, the IDS may use algorithms that reassemble and analyze the fragmented packets
● Flooding
  ○ Technique that attempts to overwhelm the system by creating a high volume of traffic
  ○ Once system resources are exhausted, it may be possible for harmful data to get through the IDS
  ○ DoS and DDoS attacks flood the system with a large amount of traffic in order to disrupt it
  ○ Protocol-level flooding generates TCP/IP or ICMP traffic to attack the protocol handlers within the IDS

Obfuscation 2
● Encryption
  ○ Transforms the data into an unreadable form
  ○ Encryption is normally used for safety, so encrypted traffic is hard for the IDS to inspect, which makes it difficult to ensure security
  ○ The IDS can use metadata to help identify suspicious patterns and behavior that may indicate malicious intent
  ○ Some encrypted data can be decrypted, analyzed, and encrypted again to check it

Anonymizers 1
● Source Routing
  ○ Manipulates the routing path of network packets to slip past an IDS
  ○ Loose source routing specifies some routers along the path and may deviate from the most common path to confuse or bypass an IDS
  ○ Strict source routing specifies the exact router path the data must follow, allowing the packet to avoid security measures
● Source Port Manipulation
  ○ Alters the source port value of network packets to avoid appearing suspicious and to bypass an IDS that may apply less scrutiny to common source ports such as those used by HTTP or HTTPS

Anonymizers 2
● Spoofing IP address
  ○ Alters the source IP address to make a packet look like it is coming from a trusted IP
  ○ Used in DDoS attacks to get around situations where the IDS limits traffic from a single IP
● Address Decoy
  ○ Manipulates network addresses during communication with a network by changing the destination IP address in the packet header
  ○ Modifying the destination can make the packet appear to be heading for a different IP address than the one it is actually heading for
● Proxy servers
  ○ Use an intermediary server between the source and destination to attempt to hide activity from an IDS
  ○ Hide the source of the malicious data
  ○ Can use encryption and tunneling to further protect the source of the malicious data and give the attackers a secure channel in which to communicate
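
Example Sketch: Preprocessing and the 80/20 Split
The "Machine Learning in Network-Based IDS" slide lists feature scaling, handling categorical variables, and an 80%/20% train/test split. The following is a minimal sketch of those three steps using only the Python standard library. The toy records, the field layout (duration, bytes sent, protocol, label), and all function names are assumptions made for illustration; they are not taken from the presentation or from any particular dataset.

```python
import random

# Hypothetical toy records: (duration_seconds, bytes_sent, protocol, label)
RECORDS = [
    (0.2, 300, "tcp", "normal"),
    (0.1, 120, "udp", "normal"),
    (9.5, 90000, "tcp", "attack"),
    (0.3, 450, "tcp", "normal"),
    (0.0, 64, "icmp", "attack"),
    (1.2, 2000, "tcp", "normal"),
    (0.4, 500, "udp", "normal"),
    (7.8, 75000, "tcp", "attack"),
    (0.2, 280, "tcp", "normal"),
    (0.0, 64, "icmp", "attack"),
]


def min_max_scale(column):
    """Feature scaling: map a numeric column onto the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


def one_hot(values):
    """Categorical handling: turn each category into a 0/1 vector."""
    categories = sorted(set(values))
    encoded = [[1 if value == c else 0 for c in categories] for value in values]
    return encoded, categories


def preprocess(records):
    """Scale the numeric fields and one-hot encode the protocol field."""
    durations = min_max_scale([r[0] for r in records])
    sizes = min_max_scale([r[1] for r in records])
    protocols, categories = one_hot([r[2] for r in records])
    labels = [r[3] for r in records]
    features = [[d, s] + p for d, s, p in zip(durations, sizes, protocols)]
    return features, labels, categories


def train_test_split(features, labels, train_fraction=0.8, seed=0):
    """Data splitting: shuffle, then keep train_fraction of rows for training."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train_idx, test_idx = indices[:cut], indices[cut:]
    x_train = [features[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, y_train, x_test, y_test


features, labels, categories = preprocess(RECORDS)
X_train, y_train, X_test, y_test = train_test_split(features, labels)
print(len(X_train), "training rows,", len(X_test), "testing rows")
```

For brevity, this sketch scales the whole dataset before splitting it. A more careful pipeline would usually compute the scaling range from the training rows only, so that no information from the testing rows influences training.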
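
Example Sketch: Univariate Statistical Anomaly Detection
The "Statistics-Based Technique" and "Anomaly detection-based" slides describe building a profile of normal behavior and flagging deviations from it. One simple way to illustrate the univariate case is a z-score test on a single feature. The feature (logins per hour), the baseline values, and the threshold of 3 standard deviations are assumptions chosen for illustration, not values from the presentation.

```python
import statistics


def build_profile(samples):
    """Summarize normal behavior for one feature as (mean, standard deviation)."""
    return statistics.mean(samples), statistics.stdev(samples)


def is_anomalous(value, profile, threshold=3.0):
    """Flag a value whose distance from the mean exceeds threshold deviations."""
    mean, stdev = profile
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


# Hypothetical baseline: logins per hour observed for a normal user
baseline = [4, 5, 6, 5, 4, 6, 5, 5]
profile = build_profile(baseline)

for observed in (6, 40):
    verdict = "anomalous" if is_anomalous(observed, profile) else "normal"
    print(observed, "logins per hour:", verdict)
```

Because the model looks at one feature in isolation, it cannot notice an attack that only stands out through a combination of features, which is the case the multivariate models address. Lowering the threshold catches more deviations at the cost of the higher false alarm rate noted for anomaly detection.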
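
Example Sketch: Signature Matching and Fragment Reassembly
The "Signature-based detection" and "Obfuscation 1" slides explain that an IDS compares traffic against known signatures, that fragmentation can hide a signature when fragments are scanned individually, and that reassembly is a counter. The sketch below illustrates that idea on byte strings. The signature list, the (offset, data) fragment format, and the function names are hypothetical; real IDS engines and real IP reassembly are considerably more involved.

```python
# Hypothetical signature database of known malicious byte patterns
SIGNATURES = [b"/etc/passwd", b"DROP TABLE"]


def matches_signature(payload):
    """Return True if any known signature appears in the payload."""
    return any(signature in payload for signature in SIGNATURES)


def reassemble(fragments):
    """Rebuild a payload from (offset, data) fragments, in offset order."""
    return b"".join(data for _, data in sorted(fragments))


# A malicious request split so that the signature straddles two fragments,
# delivered out of order
fragments = [(11, b"sswd"), (0, b"GET /etc/pa")]

scanned_individually = any(matches_signature(data) for _, data in fragments)
scanned_reassembled = matches_signature(reassemble(fragments))

print("Detected when scanning fragments individually:", scanned_individually)
print("Detected after reassembly:", scanned_reassembled)
```

The split point is chosen so that neither fragment contains the full signature, which is why per-fragment scanning misses it while scanning the reassembled payload does not.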