Introduction
We live in exciting times where machine learning is profoundly influencing a wide range of
applications, from text understanding, image, and speech recognition to health care and
genetics. As a striking example, deep learning techniques are known to perform on par
with ophthalmologists in identifying diabetic eye disease in images. Much of the recent
success is due to better computation infrastructure and large amounts of training data.
Among the many challenges in machine learning, data collection is becoming one of
the critical bottlenecks. It is known that the majority of the time in running machine learning
end-to-end is spent on preparing the data, which includes collecting, cleaning, analysing,
visualising, and feature engineering. While all of these steps are time-consuming, data
collection has recently become a challenge for the following reasons.
First, as machine learning is used in new applications, it is often the case that there is
not enough training data. Traditional applications such as machine translation or object
detection enjoy large amounts of training data that have been accumulated over many years.
On the other hand, more recent applications have little or no training data. As an
illustration, smart factories are increasingly becoming automated, where product quality
control is performed with machine learning. Whenever there is a new product or a new
defect to detect, there is little or no training data to start with. The naïve approach of
manual labelling may not be feasible because it is expensive and requires domain expertise.
This problem applies to any novel application that benefits from machine learning.
Moreover, as deep learning becomes popular, there is an even greater need for training data.
In traditional machine learning, feature engineering is one of the most challenging steps,
where the user needs to understand the application and provide the features used for
training models. Deep learning, on the other hand, can generate features automatically,
which saves us from feature engineering, a significant part of data preparation. However,
in return, deep learning may need larger amounts of training data to perform well.
As a result, there is a pressing need for accurate and scalable data collection techniques
in the era of big data, which motivates a comprehensive survey of the data collection
literature from a data management point of view. There are largely three strategies for data
collection. First, if the goal is to share and search for new datasets, then data acquisition
techniques can be used to discover, augment, or generate datasets. Second, once the datasets
are available, various data labelling techniques can be used to label the individual examples.
Finally, instead of labelling new datasets, it may be better to improve existing data or train
on top of trained models. These three strategies are not necessarily distinct and can be used
together. For example, one could search for and label more datasets while improving
existing ones.
An interesting observation is that data collection techniques come not only from the
machine learning community (including natural language processing and computer vision, which
traditionally use machine learning heavily), but have also been studied for decades by the
data management community, mainly under the names of data science and data analytics.
Purpose
In today's world, information is generated at breakneck speed, and the database is the final
resting place for this data. Information is saved in a database so that it can be managed
simply and efficiently. A database management system is used to carry out all information
control and maintenance procedures. A secure database can withstand a variety of real-world
attacks. Moving forward with databases necessitates security models. These models are
noteworthy in several ways, as they address various database security challenges. In this
study, we discussed a few plausible attacks, as well as countermeasures and oversight
mechanisms. Securing a database is an important step in determining specific and mandated
database security requirements. The purpose of database security is to prevent the database
from being misused or compromised. This article covers the most recent research on the
subject of database security. To begin, we categorise these publications according to the
influencing components of database security, which serve as the categorisation criteria.
Comparing traditional and machine learning (ML) methodologies, a few concept clarifications
are blended in to make these tactics easier to grasp. Furthermore, we note that while the
related investigations have yielded some positive results, there are a few flaws, such as
weak generalisation and deviation from reality. Possible future research in this area is then
suggested. Finally, we review the most crucial points. Database security risks have become
more complicated as IT has advanced. Combing through the database security literature, we
identified the following components that are closely related to database security: data,
portion, defensive structure, and external components. Because the data or information stored
in databases is frequently regarded as a vulnerable yet vital company asset, security is of
paramount importance in any database administration system, particularly for databases that
contain sensitive information. Security methods for commercial frameworks abound, but data
framework security is frequently overlooked. According to several surveys, data security
remains a major concern in the IT business. Security-related incidents are becoming more
common and technologically advanced.
Problem Statements
Machine learning is being adopted by new applications and is a trending technology of the
decade. The era has changed; as a result, we can see large quantities of data being processed
in fractions of a second. It is hard to imagine how much data volume has increased in recent
years. Every day, 2.5 quintillion bytes of data are generated. These pieces of information
can be numerical (temperature, loan amount, customer retention rate), categorical (gender,
colour, highest degree achieved), or even free text (think doctor's notes or opinion polls).
The practice of acquiring and analysing data from a variety of sources is known as data
collection. To build viable artificial intelligence (AI) and machine learning solutions, data
must be collected and stored in a form that makes sense for the business challenge at hand.
However, as machine learning is being adopted in new applications, the amount of available
training data has become a bottleneck. Traditional applications such as machine
translation and object recognition benefit from large volumes of training data acquired over
several years. On the other hand, recent applications have little or no training data. As an
example, smart factories are becoming increasingly automated, with machine learning used to
regulate production efficiency. When a new product or fault has to be spotted, there is little
or no training data to begin with.
Data collection helps us keep track of past occurrences so that we can use data analysis
to uncover recurring patterns. Using machine learning algorithms, we can then build predictive
analytics that search for trends and forecast future changes based on those patterns.
Because predictive analytics are only as strong as the data they are based on, good data
collection procedures are essential for creating high-performing models. The data must be
free of errors (garbage in, garbage out) and must be pertinent to the task at hand. A debt
default model, for example, would not profit from tiger population levels but would benefit
from gas prices over time.
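
To make this concrete, below is a minimal sketch on synthetic data (the gas_price and
tiger_population columns are invented for illustration, not taken from the article) showing
how an irrelevant feature contributes essentially nothing to a model, here measured with
scikit-learn feature importances.

# Sketch: an irrelevant feature adds nothing to a model (synthetic, illustrative data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
gas_price = rng.normal(3.0, 0.5, n)          # relevant signal (hypothetical)
tiger_population = rng.normal(3900, 100, n)  # irrelevant noise (hypothetical)
# Default risk rises with gas price in this made-up relationship.
default = (gas_price + rng.normal(0, 0.3, n) > 3.3).astype(int)

X = np.column_stack([gas_price, tiger_population])
X_train, X_test, y_train, y_test = train_test_split(X, default, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
print("importances [gas_price, tiger_population]:", model.feature_importances_)

The importance assigned to the noise column stays near zero, which is the "garbage in,
garbage out" point in code form.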
Methodology
The author mentions some core concepts needed to better understand machine learning patterns.
They are:
1) Supervised Learning: The author suggests using supervised learning if you know in
advance what you want to teach a machine. This typically requires exposing the algorithm to a
huge set of training data, letting the model examine the output, and adjusting the parameters
until the desired results are obtained. Common supervised learning tasks include prediction,
regression, and classification. The author explains all these tasks in detail in the paper.
2) Unsupervised Learning: The author notes that unsupervised learning enables a machine
to explore a set of data on its own. After the initial exploration, the machine tries to
identify hidden patterns that connect the different variables. Based only on statistical
properties, this type of learning can group the data. Unsupervised learning avoids training
on large labelled data sets and is much faster and easier to deploy than supervised learning.
A short sketch contrasting the two settings follows this list.
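
As a minimal illustration of the difference, assuming scikit-learn and toy data (not an
example from the paper), the supervised model below is trained with labels, while the
unsupervised model must discover the groups on its own:

# Supervised vs. unsupervised learning on the same toy data (illustrative sketch).
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Toy data: 300 points in 3 groups; y holds the known labels.
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

# Supervised: the model sees the labels and learns to predict them.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised accuracy:", clf.score(X, y))

# Unsupervised: the model never sees y and groups the points by similarity.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("cluster assignments for the first 10 points:", km.labels_[:10])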
Based on those approaches, the main methodologies/approaches used in the journal are listed
below; a brief combined code sketch follows the list.
a) Neural Networks: According to the author, Neural networks mimic the structure of the
brain: each artificial neuron connects to several other neurons, and together millions of
neurons create a complex cognitive structure. Neural networks are used for a wide variety of
business applications. In healthcare, they are used in the analysis of medical images, to speed
up diagnostic procedures and search for drugs. In the telecommunications and media
industries, neural networks can be used for machine translation, fraud detection, and virtual
assistant services.
b) Regression: Regression methods are used in supervised ML. The goal of regression
techniques is typically to explain or predict a specific numerical value using a historical
data set. For example, regression methods can take historical property pricing data and then
predict the price of a similar property.
c) Clustering: Clustering algorithms are unsupervised learning methods. A few common
clustering algorithms are K-means, mean-shift, and expectation-maximisation. They group
data points according to similar or shared characteristics. Grouping or clustering techniques
are particularly useful in business applications when there is a need to segment or categorise
large volumes of data.
d) Decision Trees: The decision tree algorithm classifies objects by answering “questions”
about their attributes located at the nodal points. Depending on the answer, one of the
branches is selected, and the process repeats at the next junction until a leaf is reached
and a class is assigned. Decision tree applications include knowledge management platforms
for customer service, predictive pricing, and product planning.
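
To ground the four families above, here is a rough combined sketch on synthetic data,
assuming scikit-learn (an illustrative example only, not code from the journal):

# Illustrative sketch of the four method families on synthetic data.
import numpy as np
from sklearn.datasets import make_classification, make_blobs
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# a) Neural network: a small multi-layer perceptron for binary classification.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0).fit(X_tr, y_tr)
print("neural network accuracy:", mlp.score(X_te, y_te))

# b) Regression: predict a price-like value from one numeric feature.
floor_area = rng.uniform(50, 300, size=(200, 1))
price = 1000 * floor_area[:, 0] + rng.normal(0, 5000, size=200)
reg = LinearRegression().fit(floor_area, price)
print("predicted price for floor area 120:", reg.predict([[120]])[0])

# c) Clustering: group unlabelled points by similarity with K-means.
Xb, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Xb)
print("cluster sizes:", np.bincount(km.labels_))

# d) Decision tree: classify by answering questions about attributes at each node.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print("decision tree accuracy:", tree.score(X_te, y_te))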
Findings
The author gives an example of data collection based on their experience. The example is
based on a smart factory application, in which Sally is a data scientist who works on
product quality. The factory may produce manufacturing components like gears, where it is
important that they have no scratches, dents, or any foreign substance. Sally may want to
train a model on images of the components, which can be used to automatically classify
whether each product has defects or not.
The author provides a decision flow chart for data collection. Starting from the top left,
Sally can begin by asking whether she has enough data. The questions that follow lead to
specific techniques that can be used for acquiring data, labelling data, or improving
existing data or models. This flow chart does not cover all the details in the survey.
Sally would need to acquire datasets if there is little or no data to begin with. She can either
search for relevant datasets on the Web or in a company data lake, or she can create her
dataset by installing camera equipment and photographing the products in the factory. Sally
could supplement the data with external information about the product if the products had
metadata. Once the data is available, then Sally can choose among the labelling techniques
using the categories. Self-labelling using semi-supervised learning is an appealing option if
there are enough existing labels. As investigated, there are numerous variants of self-labelling
depending on the model training assumptions. If there aren't enough labels, Sally can use
crowd-sourcing techniques with a budget to create some. If only a few experts are available
for labelling, active learning may be the best option, if the important examples that influence
the model can be narrowed down. When many workers do not necessarily have the necessary
expertise, general crowdsourcing methods can be used. If Sally already has labels, she may
want to see whether their quality can be improved. If the existing labels are noisy or biased,
various data cleaning methods can be applied to the same data. Finally, if existing models for
product quality are available through tools such as TensorFlow Hub, they can be used to
improve Sally's model via transfer learning.
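
To make the self-labelling option concrete, the following is a minimal, hypothetical sketch
of semi-supervised self-training with scikit-learn (not the author's implementation):
unlabelled examples are marked with -1, and the model promotes its own confident predictions
to labels on each iteration.

# Minimal self-labelling sketch: semi-supervised self-training with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Toy "defect vs. no defect" data; pretend only ~10% of the examples are labelled.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
y_partial = y.copy()
unlabelled = rng.random(len(y)) > 0.10   # hide ~90% of the labels
y_partial[unlabelled] = -1               # -1 marks an unlabelled example

# The base classifier must expose predict_proba so confident predictions
# can be promoted to pseudo-labels.
base = SVC(probability=True, gamma="auto", random_state=0)
self_training = SelfTrainingClassifier(base, threshold=0.9).fit(X, y_partial)

print("accuracy on the true labels:", self_training.score(X, y))

In Sally's setting, X would be features extracted from product images and the hidden labels
would correspond to the photographs that nobody has inspected yet.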
The author shared their experience that it's not always easy to tell if there's enough data and
labels. For example, even if the dataset is small or there are few labels, if the data distribution
is simple to learn, automatic approaches such as semi-supervised learning will outperform
manual approaches such as active learning. Another difficult-to-quantify factor is the amount
of human effort required. When contrasting active learning and data programming, one must
consider the tasks of labelling examples and implementing labelling functions, which are
very different. Implementing a labelling function can range from simple (e.g., looking for
specific keywords) to nearly impossible, depending on the application (e.g., general object
detection).
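
As an illustration of the effort involved, a keyword-based labelling function can be as small
as the hypothetical sketch below (plain Python, no weak-supervision library assumed), which
combines several simple functions by majority vote:

# Hypothetical data-programming sketch: keyword labelling functions + majority vote.
from collections import Counter

ABSTAIN, DEFECT, OK = -1, 1, 0

def lf_scratch(report: str) -> int:
    # Vote "defect" if the inspection note mentions a scratch.
    return DEFECT if "scratch" in report.lower() else ABSTAIN

def lf_dent(report: str) -> int:
    return DEFECT if "dent" in report.lower() else ABSTAIN

def lf_passed(report: str) -> int:
    # Vote "ok" if the note says the part passed inspection.
    return OK if "passed" in report.lower() else ABSTAIN

LABELLING_FUNCTIONS = [lf_scratch, lf_dent, lf_passed]

def weak_label(report: str) -> int:
    """Combine labelling-function votes by simple majority, ignoring abstentions."""
    votes = [lf(report) for lf in LABELLING_FUNCTIONS if lf(report) != ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

reports = [
    "Deep scratch near the gear teeth",
    "Part passed visual inspection",
    "Small dent on the housing, otherwise passed",
]
# -> [1, 0, 1]; the third report gets conflicting votes, resolved here by the first-seen vote.
print([weak_label(r) for r in reports])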
Conclusion/Recommendations
Data collection has always been a major part of machine learning. But when the collected data
and data sets become extensive, managing them becomes difficult. The study of such databases
is therefore crucial. These massive data sets can be handled through advanced computing,
which could lead to even more accurate machine learning. The data collection process can also
be improved: it can be made more accurate by making sure the collected data is relevant to
the subject topic. Another important step is to make sure the machine learning models receive
sufficient data for their calculations, or else they could give inaccurate results.
There may be too many data sets and collections in some circumstances, and the resulting
integration may have a negative impact on the learning model. As a result, selecting the
appropriate data set is critical. Furthermore, if the data set is dynamic (for example, a sensor's
signal stream) and the quality changes, the data set selection can be adjusted dynamically.
Second, many data retrieval systems rely on data set owners to annotate data sets for easier
retrieval; however, understanding and extracting metadata from the data requires additional
automation approaches.
Although most data gathering jobs entail training the model after the data is collected,
enriching or refining the data depending on the model's performance is also an essential area.
Even though there is a wealth of research on model interpretation, it is unclear how to
manage feedback at the data level. In the model fairness literature, modifying the data is one
way to eliminate unfairness. ActiveClean and BoostClean are two noteworthy data modification
algorithms for improving model accuracy through data cleaning. The traditional labelling
literature focuses on accuracy, but the recent push towards weak supervision has produced
huge numbers of weak labels, so sooner or later the trade-off between accuracy and scalability
needs to be understood. For example, because the labels are only weak, there is no guarantee
that the model will eventually attain flawless accuracy. At some point, it may be worthwhile
to invest in more labelling or apply transfer learning to make further improvements. Such
trade-offs can be explored through approximation methods and error analysis, but it is worth
considering whether there is a systematic approach to this kind of evaluation.
In conclusion, much work has been done to develop machine learning through data
collection. Nevertheless, there is still a lot that can be done. Different modes of data
collection with different funding levels can be implemented for accurate data collection and
machine learning.