
Digitization with Big Data

Digitization with Big Data aims to provide COMPANY
ABC with its own search engine and querying system,
capable of processing any sort of digital file, be it
text, document, image, or even video and audio.
29 May 2019
Organizations contain hundreds of thousands of files of different kinds (documents, images, video, voice,
etc.) that often hold high-quality information. However, when there are many files of different types, it is
too costly to run every query against every file in order to find, search, and correlate the most suitable
results.
In the era of Big Data technologies, it is now possible to store data of any size or type, process it,
correlate it, and obtain the best query results for further analytics or decision making.
Digitization with Big Data is the process by which physical or manual records such as text, images, video,
and audio are converted into digital forms.
Digitized data offers the following benefits:
Long term preservation of documents
Orderly archiving of documents
Easy & customized access to information
Easy information dissemination through images, text, voice and video files.
Document Purpose
The purpose of this document is to provide COMPANY ABC with a clear view and understanding of
Digitization with Big Data technologies. The document details the features and functions of the
application, the artifacts, and the architecture of the proposed solution.
The following describes the architecture and operating principle of a system that synthesizes various
kinds of data, such as text documents, PDFs, images, and audio and video files. The overall concept of
the proposed system comprises three levels:
1) Source level
2) Data Processing level
3) Data Presentation level
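The three levels above can be sketched as plain functions. This is a minimal illustration of the flow only; all function and field names below are assumptions, not part of the proposal, and the real system would use Kafka/HDFS at the source level, the Crawler at the processing level, and Apache Solr at the presentation level.

```python
# Minimal sketch of the three proposed levels as plain Python functions.
# All names here are illustrative assumptions.

def source_level(raw_files):
    """Source level: collect raw files from external sources."""
    return [{"name": name, "content": content} for name, content in raw_files]

def data_processing_level(records):
    """Data Processing level: attach system metadata (here, just the extension)."""
    for record in records:
        record["extension"] = record["name"].rsplit(".", 1)[-1]
    return records

def data_presentation_level(records, keyword):
    """Data Presentation level: return names of records matching a keyword."""
    return [r["name"] for r in records if keyword in r["content"]]

files = [("report.pdf", "narcotics seizure report"), ("memo.txt", "weekly memo")]
processed = data_processing_level(source_level(files))
print(data_presentation_level(processed, "narcotics"))  # → ['report.pdf']
```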
All data is located in the Data Lake, ingested from external sources into the Hadoop Distributed File
System (HDFS) through Apache Kafka, which acts as the queue management system. Files stored in HDFS
are consumed by the Crawler, which creates metadata and saves the artifacts. The metadata is divided
into two main types: the meta information entered by the user when the artifact is loaded, and the
artifact's own meta information, such as file extension, creation date, etc.
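As a sketch, the ingest step might package both kinds of metadata for one artifact into a message before it is queued. The `build_ingest_message` helper and its JSON field names are assumptions for illustration, not the proposal's actual schema; in a real deployment the payload would be published to a Kafka topic and landed in HDFS by a consumer, whereas here it is only constructed.

```python
import json
import os

def build_ingest_message(path, user_tags):
    """Build a JSON payload describing one artifact for the ingest queue.

    The field names are illustrative assumptions; a real deployment would
    publish this to an Apache Kafka topic consumed into HDFS.
    """
    name = os.path.basename(path)
    payload = {
        "file_name": name,
        # System metadata derived from the file itself:
        "extension": os.path.splitext(name)[1].lstrip("."),
        # User-entered metadata supplied at load time:
        "tags": list(user_tags),
    }
    return json.dumps(payload)

msg = build_ingest_message("/data/incoming/case_041.pdf", ["Narcotics"])
```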
Metadata consists of tags: during the artifact loading process, the user should choose pre-defined tags
or may create custom ones. Examples of tags: Narcotics, Document subject matter, Robbery. The data is
then sent to the main Data Processor, the Crawler, where artifacts are stored with their correlated
metadata.
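A minimal sketch of how pre-defined and user-created custom tags could be merged when an artifact is loaded. The tag examples come from the text above, but the vocabulary set and the `resolve_tags` helper are assumptions for illustration.

```python
# Illustrative pre-defined tag vocabulary (the example tags come from the
# document; the set itself is an assumption).
PREDEFINED_TAGS = {"Narcotics", "Document subject matter", "Robbery"}

def resolve_tags(chosen, custom):
    """Keep chosen tags that exist in the vocabulary, then add custom ones."""
    valid = [t for t in chosen if t in PREDEFINED_TAGS]
    # Custom tags are accepted as-is, skipping duplicates.
    return valid + [t for t in custom if t not in valid]

tags = resolve_tags(["Narcotics", "Unknown"], ["Border incident"])
# → ['Narcotics', 'Border incident']
```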
Search Engine: The user's request to the Search Engine for the appropriate document, as well as the
search for the corresponding artifacts, is processed by Apache Solr. The engine looks for the searched
keywords in text files, and the search results are automatically indexed. When searching, the system
detects results by looking for the search terms in documents while simultaneously comparing the
metadata (tags). The results are displayed ranked by how closely they correspond and correlate with the
query.
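A hedged sketch of how such a combined keyword-plus-tag search could be expressed against Solr's standard `/select` endpoint using its common query parameters (`q`, `fq`, `rows`, `sort`). The core name and the field names (`text`, `tags`) are assumptions, and only the request URL is built here; no request is sent.

```python
from urllib.parse import urlencode

def build_solr_query(base_url, keyword, tag=None, rows=10):
    """Build a Solr /select URL that searches full text and filters by tag.

    Uses standard Solr query parameters; the core name and field names
    ('text', 'tags') are illustrative assumptions.
    """
    params = {
        "q": f"text:{keyword}",   # keyword search over the indexed text field
        "rows": rows,             # number of results to return
        "sort": "score desc",     # rank by relevance, most relevant first
    }
    if tag is not None:
        params["fq"] = f'tags:"{tag}"'  # filter query on the metadata tag
    return f"{base_url}/select?{urlencode(params)}"

url = build_solr_query("http://localhost:8983/solr/artifacts", "robbery", tag="Robbery")
```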