Abstract
This thesis presents an image retrieval system built on Hadoop. Hadoop is open-source software with powerful parallelization and scalability that has in recent years become a popular technique for processing and storing big data. We design and implement the system to overcome the performance bottlenecks caused by computational complexity and by the large amounts of data involved in constructing an image retrieval (IR) system, and we present our ideas, designs, and results in this paper.
In IR, the efficient indexing of terabyte-scale and larger corpora remains a difficult problem. MapReduce has been proposed as a framework for distributing data-intensive operations across multiple processing machines. In this work, we provide a detailed analysis of four MapReduce indexing strategies of varying complexity. Moreover, we evaluate these indexing strategies by implementing them in an existing IR framework and performing experiments using the Hadoop MapReduce implementation in combination with several large standard TREC test corpora. In particular, we examine the efficiency of the indexing strategies and, for the most efficient strategy, how it scales with respect to corpus size and processing power. Our results attest both to the importance of minimising data transfer between machines for I/O-intensive tasks like indexing and to the suitability of the per-posting-list MapReduce indexing strategy, in particular for indexing at terabyte scale. Hence, we conclude that MapReduce is a suitable framework for the deployment of large-scale indexing.
Table of Contents
ABSTRACT
ACKNOWLEDGEMENT
LIST OF FIGURES AND ILLUSTRATIONS
CHAPTER 1: INTRODUCTION
CHAPTER 2: LITERATURE REVIEW
The HIPI Framework
Data Storage
Image-based MapReduce
XMP Data Model
HADOOP
Hadoop Distributed File System (HDFS)
HBase
Index structures
Single-pass indexing
Distributed indexing
MapReduce
Image Operations in Java
Blurring a Portion of an Image
XMP Packet
XMP schemas
The XMP Packet
XMP Software Development Kit (SDK)
Multithreaded programming in Java
Image Metadata Standards
Example Image Description
The Browse Interface
CHAPTER 3: METHODOLOGY
Hadoop Sequence Files
Adding an External JAR Library to your Hadoop Project
CHAPTER 4: RESULTS
Image Processing Hadoop Java Code
CHAPTER 5: DISCUSSION AND ANALYSIS
Searching the System
Derivation of the Measures
Calculating the Lorenz Information Measure (LIM)
Encoding the Metadata Tags
Working with background
CHAPTER 6: CONCLUSION
REFERENCES
List of Figures and Illustrations
Figures 1–13
Chapter 1: Introduction
Currently, the field of parallel computing finds abundant application in text analysis.
Companies such as Google Inc. and Yahoo! Inc., which derive revenue from delivering
web search results, process large amounts of data and are interested in keeping the
delivered results relevant. Because these companies have access to large numbers of
computers and because the amount of data to be processed is large, search-query
processing is often parallelized across a cluster of computers. To simplify
programming for a distributed environment, several tools have been developed, such as the
MapReduce programming model, which has become popular because of its automatic
parallelism and fault tolerance.
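To make the model concrete, the following is a minimal sketch of a Hadoop MapReduce job in Java: the classic word count, in which the map phase emits a (word, 1) pair for every token and the reduce phase sums the counts for each word. The Hadoop API classes shown (Job, Mapper, Reducer) are part of the framework; the class name and the input/output paths are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word count: map tasks run in parallel across the cluster and
// emit (word, 1) pairs; the framework groups the pairs by key, and the
// reduce phase sums the counts for each word.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);  // emit (word, 1)
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();  // all counts for one word arrive together
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Once packaged, the job is submitted with the standard hadoop jar command; Hadoop splits the input across the cluster, schedules map tasks near the data they read, and restarts failed tasks automatically, which is what makes the parallelism and fault tolerance transparent to the programmer.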
Pig operates as a layer of abstraction on top of the MapReduce programming model. It
frees programmers from having to write MapReduce functions for each low-level data
processing operation and instead allows them to simply describe how data should be
analyzed using higher-level data definition and manipulation statements. Additionally,
because Pig still uses MapReduce underneath, it retains all of its useful traits, ...
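As an illustration of this abstraction, the sketch below expresses the same word count through Pig rather than hand-written map and reduce functions, using Pig's embedded Java API. The PigServer class and the Pig Latin operators used (LOAD, FOREACH, GROUP, COUNT) are part of Apache Pig; the input and output file names are hypothetical.

import java.io.IOException;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Runs a short Pig Latin script from Java. Each registered statement
// describes a data transformation; Pig compiles the whole script into
// MapReduce jobs, so no map or reduce function is written by hand.
public class PigWordCount {
  public static void main(String[] args) throws IOException {
    // ExecType.MAPREDUCE submits to the cluster; ExecType.LOCAL runs in-process.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    pig.store("counts", "wordcounts");  // triggers execution of the compiled job(s)
  }
}

Each Pig Latin statement states what should happen to the data (tokenize, group, count) rather than how to partition the work; Pig decides how many MapReduce jobs to generate and how to chain them, while still inheriting MapReduce's parallelism and fault tolerance.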