Abstract
This thesis presents an image retrieval system built on Hadoop. Hadoop is open-source software with powerful parallelization and scalability that has in recent years become a popular technique for processing and storing big data. We design and implement the system to overcome the performance bottlenecks caused by computational complexity and by the large amounts of data involved in constructing an image retrieval (IR) system, and we present our ideas, designs, and results in this paper.
In IR, the efficient indexing of terabyte-scale and larger corpora remains a difficult problem. MapReduce has been proposed as a framework for distributing data-intensive operations across multiple processing machines. In this work, we provide a detailed analysis of four MapReduce indexing strategies of varying complexity. Moreover, we evaluate these indexing strategies by implementing them in an existing IR framework and performing experiments using the Hadoop MapReduce implementation in combination with several large standard TREC test corpora. In particular, we examine the efficiency of the indexing strategies and, for the most efficient strategy, how it scales with respect to corpus size and processing power. Our results attest both to the importance of minimising data transfer between machines for I/O-intensive tasks like indexing and to the suitability of the per-posting-list MapReduce indexing strategy, in particular for indexing at terabyte scale. Hence, we conclude that MapReduce is a suitable framework for the deployment of large-scale indexing.
Table of Contents
ABSTRACT
ACKNOWLEDGEMENT
LIST OF FIGURES AND ILLUSTRATIONS
CHAPTER 1: INTRODUCTION
CHAPTER 2: LITERATURE REVIEW
The HIPI Framework
Data Storage
Image-based MapReduce
XMP Data Model
HADOOP
Hadoop Distributed File System (HDFS)
HBase
Index structures
Single-pass indexing
Distributed indexing
MapReduce
Image Operations in Java
Blurring a Portion of an Image
XMP Packet
XMP schemas
The XMP Packet
XMP Software Development Kit (SDK)
Multithreaded programming in Java
Image Metadata Standards
Example Image Description
The Browse Interface
CHAPTER 3: METHODOLOGY
Hadoop Sequence Files
Adding an External JAR Library to your Hadoop Project
CHAPTER 4: RESULTS
Image Processing Hadoop Java Code
CHAPTER 5: DISCUSSION AND ANALYSIS
Searching the System
Derivation of the Measures
Calculating the Lorenz Information Measure (LIM)
Encoding the Metadata Tags
Working with background
CHAPTER 6: CONCLUSION
REFERENCES
List of Figures and Illustrations
Figures 1–13
Chapter 1: Introduction
Currently, the field of parallel computing finds abundant application in text analysis.
Companies such as Google Inc. and Yahoo! Inc., which derive revenue from delivering
web search results, process large amounts of data and are interested in keeping the
delivered results relevant. Because these companies have access to large numbers of
computers and because the amount of data to be processed is large, search-query
processing is often parallelized across a cluster of computers. To simplify
programming for a distributed environment, several tools have been developed, such as the
MapReduce programming model, which has become popular because of its automatic
parallelism and fault tolerance.
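To make the model concrete, the following is a minimal sketch of a Hadoop MapReduce job in Java: the classic word count, in which the map phase emits a (word, 1) pair for every token and the reduce phase sums the counts for each word. The Hadoop API classes shown (Job, Mapper, Reducer) are part of the framework; the class name and the input/output paths are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word count: map tasks run in parallel across the cluster and
// emit (word, 1) pairs; the framework groups the pairs by key, and the
// reduce phase sums the counts for each word.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);  // emit (word, 1)
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();  // all counts for one word arrive together
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Once packaged, the job is submitted with the standard hadoop jar command; Hadoop splits the input across the cluster, schedules map tasks near the data they read, and restarts failed tasks automatically, which is what makes the parallelism and fault tolerance transparent to the programmer.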
Pig operates as a layer of abstraction on top of the MapReduce programming model. It
frees programmers from having to write MapReduce functions for each low-level data
processing operation and instead allows them to simply describe how data should be
analyzed using higher-level data definition and manipulation statements. Additionally,
because Pig still uses MapReduce underneath, it retains all of its useful traits, ...
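As an illustration of this abstraction, the sketch below expresses the same word count through Pig rather than hand-written map and reduce functions, using Pig's embedded Java API. The PigServer class and the Pig Latin operators used (LOAD, FOREACH, GROUP, COUNT) are part of Apache Pig; the input and output file names are hypothetical.

import java.io.IOException;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Runs a short Pig Latin script from Java. Each registered statement
// describes a data transformation; Pig compiles the whole script into
// MapReduce jobs, so no map or reduce function is written by hand.
public class PigWordCount {
  public static void main(String[] args) throws IOException {
    // ExecType.MAPREDUCE submits to the cluster; ExecType.LOCAL runs in-process.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    pig.store("counts", "wordcounts");  // triggers execution of the compiled job(s)
  }
}

Each Pig Latin statement states what should happen to the data (tokenize, group, count) rather than how to partition the work; Pig decides how many MapReduce jobs to generate and how to chain them, while still inheriting MapReduce's parallelism and fault tolerance.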