
Digitization with Big Data

Digitization with Big Data aims to provide COMPANY
ABC with its own search engine and querying system,
capable of processing any sort of digital file, be it
text, document, image, or even video and audio.
29 May 2019
Organizations contain hundreds of thousands of files of different kinds (documents, images, video, voice,
etc.) that often hold high-quality information. However, when there are many files of different types, it is
too costly to run every query against every file in order to find, search, and correlate the most suitable
results.
In the era of Big Data technologies, it is now possible to store data of any size or type, process it,
correlate it, and obtain the best query results for further analytics or decision making.
Digitization with Big Data is the process by which physical or manual records such as text, images, video,
and audio are converted into digital forms.
Digitized data offers the following benefits:
Long term preservation of documents
Orderly archiving of documents
Easy & customized access to information
Easy information dissemination through images, text, voice and video files.
Document Purpose
The purpose of this document is to provide COMPANY ABC with a clear view and understanding of
Digitization with Big Data technologies. The document details the features and functions of the
application, the artifacts, and the architecture of the proposed solution.
The following describes the architecture and operating principle of a system that synthesizes various
kinds of data, such as text documents, PDFs, images, and audio and video files. The overall concept of
the proposed system comprises three levels:
1) Source level
2) Data Processing level
3) Data Presentation level
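The three levels above can be sketched as plain functions. This is a minimal illustration of the flow only; all function and field names below are assumptions, not part of the proposal, and the real system would use Kafka/HDFS at the source level, the Crawler at the processing level, and Apache Solr at the presentation level.

```python
# Minimal sketch of the three proposed levels as plain Python functions.
# All names here are illustrative assumptions.

def source_level(raw_files):
    """Source level: collect raw files from external sources."""
    return [{"name": name, "content": content} for name, content in raw_files]

def data_processing_level(records):
    """Data Processing level: attach system metadata (here, just the extension)."""
    for record in records:
        record["extension"] = record["name"].rsplit(".", 1)[-1]
    return records

def data_presentation_level(records, keyword):
    """Data Presentation level: return names of records matching a keyword."""
    return [r["name"] for r in records if keyword in r["content"]]

files = [("report.pdf", "narcotics seizure report"), ("memo.txt", "weekly memo")]
processed = data_processing_level(source_level(files))
print(data_presentation_level(processed, "narcotics"))  # → ['report.pdf']
```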
All data is located in the Data Lake, ingested from external sources into the Hadoop Distributed File
System (HDFS) through Apache Kafka, which acts as the queue management system. Files stored in HDFS
are consumed by the Crawler, which creates metadata and saves the artifacts. The metadata is divided
into two main types: the meta information entered by the user when the artifact is loaded, and the
artifact's own meta information, such as file extension, creation date, etc.
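As a sketch, the ingest step might package both kinds of metadata for one artifact into a message before it is queued. The `build_ingest_message` helper and its JSON field names are assumptions for illustration, not the proposal's actual schema; in a real deployment the payload would be published to a Kafka topic and landed in HDFS by a consumer, whereas here it is only constructed.

```python
import json
import os

def build_ingest_message(path, user_tags):
    """Build a JSON payload describing one artifact for the ingest queue.

    The field names are illustrative assumptions; a real deployment would
    publish this to an Apache Kafka topic consumed into HDFS.
    """
    name = os.path.basename(path)
    payload = {
        "file_name": name,
        # System metadata derived from the file itself:
        "extension": os.path.splitext(name)[1].lstrip("."),
        # User-entered metadata supplied at load time:
        "tags": list(user_tags),
    }
    return json.dumps(payload)

msg = build_ingest_message("/data/incoming/case_041.pdf", ["Narcotics"])
```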
Metadata consists of tags: during the artifact loading process, the user should choose pre-defined tags
or may create custom ones. Examples of tags: Narcotics, Document subject matter, Robbery. The data is
then sent to the main Data Processor, the Crawler, where artifacts are stored with their correlated
metadata.
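A minimal sketch of how pre-defined and user-created custom tags could be merged when an artifact is loaded. The tag examples come from the text above, but the vocabulary set and the `resolve_tags` helper are assumptions for illustration.

```python
# Illustrative pre-defined tag vocabulary (the example tags come from the
# document; the set itself is an assumption).
PREDEFINED_TAGS = {"Narcotics", "Document subject matter", "Robbery"}

def resolve_tags(chosen, custom):
    """Keep chosen tags that exist in the vocabulary, then add custom ones."""
    valid = [t for t in chosen if t in PREDEFINED_TAGS]
    # Custom tags are accepted as-is, skipping duplicates.
    return valid + [t for t in custom if t not in valid]

tags = resolve_tags(["Narcotics", "Unknown"], ["Border incident"])
# → ['Narcotics', 'Border incident']
```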
Search Engine: The user's request to the Search Engine for the appropriate document, as well as the
search for the corresponding artifacts, is processed by Apache Solr. The engine looks for the searched
keywords in text files, and the search results are automatically indexed. When searching, the system
detects results by looking for the search terms in documents while simultaneously comparing the
metadata (tags). The results are displayed ranked by how closely they correspond and correlate with the
query.
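A hedged sketch of how such a combined keyword-plus-tag search could be expressed against Solr's standard `/select` endpoint using its common query parameters (`q`, `fq`, `rows`, `sort`). The core name and the field names (`text`, `tags`) are assumptions, and only the request URL is built here; no request is sent.

```python
from urllib.parse import urlencode

def build_solr_query(base_url, keyword, tag=None, rows=10):
    """Build a Solr /select URL that searches full text and filters by tag.

    Uses standard Solr query parameters; the core name and field names
    ('text', 'tags') are illustrative assumptions.
    """
    params = {
        "q": f"text:{keyword}",   # keyword search over the indexed text field
        "rows": rows,             # number of results to return
        "sort": "score desc",     # rank by relevance, most relevant first
    }
    if tag is not None:
        params["fq"] = f'tags:"{tag}"'  # filter query on the metadata tag
    return f"{base_url}/select?{urlencode(params)}"

url = build_solr_query("http://localhost:8983/solr/artifacts", "robbery", tag="Robbery")
```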