We need to manage research data in a way that preserves its utility over time and across users. This means we must pay attention to data documentation and storage so that, at a minimum, others can access our data, replicate our analyses, and, preferably, extend them. How we achieve these goals depends on whether we produce the data (primary data collection) or reuse it (secondary data analysis; see secondary data, secondary analysis of survey data), as well as what kind of data we are dealing with.
Storage of data relating to research projects should be taken seriously from the outset, to ensure that valuable qualitative data resources are kept safe during the research process and, where data are to be formally archived, beyond it. Both digital and nondigital aspects of storage must be considered by those who create, store, and curate data. Considerations relating to data storage include data preparation procedures, confidentiality of data, physical conditions, and security (Ackerman, 2004).
Over the last two and a half years we have designed, implemented, and deployed a distributed storage system for managing structured data at Google called Bigtable. Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability. Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users. The Bigtable clusters used by these products span a wide range of configurations, from a handful to thousands of servers, and store up to several hundred terabytes of data.
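This excerpt does not spell out Bigtable's data model, but the underlying paper describes it as a sparse, distributed, persistent multidimensional sorted map, indexed by a row key, a column key, and a timestamp. The Python sketch below mimics that logical model in memory purely for illustration; the table name, keys, and sample values are hypothetical, and real Bigtable additionally keeps rows lexicographically sorted and distributed across many machines, which this sketch ignores.

```python
from collections import defaultdict

# In-memory stand-in for Bigtable's logical model:
# (row key, column key, timestamp) -> value.
# webtable[row_key][column_key] is a dict of {timestamp: value},
# holding multiple versions of each cell. All names and sample
# data here are hypothetical.
webtable = defaultdict(lambda: defaultdict(dict))

# Two timestamped versions of the same cell.
webtable["com.example.www"]["anchor:homepage"][100] = "Example"
webtable["com.example.www"]["anchor:homepage"][200] = "Example, Inc."

def read_latest(table, row, column):
    """Return the newest version of a cell, as a reading client would see it."""
    versions = table[row][column]
    return versions[max(versions)] if versions else None

print(read_latest(webtable, "com.example.www", "anchor:homepage"))
# -> "Example, Inc."
```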
Different types of storage mechanisms
If we break the survey down into a sequence of discrete transactions (questions, check items, looping instructions, data storage commands, etc.) and construct a relational database in which each transaction is a row in a table and the table's columns hold the attributes of each transaction, we can efficiently manage both survey content and survey data.
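As a concrete illustration of this idea, the following Python sketch uses the standard-library sqlite3 module to store each survey transaction as a row in a relational table. The schema, column names, and sample transactions are one possible design assumed for illustration, not a prescribed standard.

```python
# A minimal sketch of storing survey content as rows in a relational
# table, one row per discrete transaction. Schema and sample content
# are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE survey_transaction (
        transaction_id INTEGER PRIMARY KEY,  -- order of execution
        kind           TEXT NOT NULL,        -- 'question', 'check', 'loop', 'store'
        content        TEXT NOT NULL,        -- question wording, check rule, etc.
        next_id        INTEGER               -- default routing to the next transaction
    )
""")
conn.executemany(
    "INSERT INTO survey_transaction VALUES (?, ?, ?, ?)",
    [
        (1, "question", "How many people live in this household?", 2),
        (2, "check", "answer(1) BETWEEN 1 AND 20", 3),
        (3, "loop", "repeat transactions 4-6 for each household member", 4),
    ],
)

# Survey content can now be queried like any other relational data.
for row in conn.execute(
    "SELECT transaction_id, kind, content FROM survey_transaction ORDER BY transaction_id"
):
    print(row)
```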
Relational database software is a major segment of the software industry, with vendors such as Oracle, Sybase, IBM, and Microsoft offering competing products. Many commercial applications use relational database systems (inventory control, accounting systems, Web-based retailing, and administrative records systems in hospitals and welfare agencies, to mention a few), so social scientists can piggyback on a mature software market. Seen in the context of relational databases, some of the suggested standards for codebooks and for documenting survey data, such as the Data Documentation Initiative (DDI), resemble relational database designs but fail to use these existing professional tool sets and their standard programming conventions.
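To make the comparison concrete, a codebook's variable-level documentation maps naturally onto a pair of relational tables, which ordinary database tooling can then query and validate. The table and column names below are assumptions for illustration, not part of the DDI specification.

```python
# A sketch of codebook metadata held in relational tables: one table of
# variables, one of value labels. Names and sample rows are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE variable (
        name  TEXT PRIMARY KEY,   -- variable name as it appears in the data file
        label TEXT NOT NULL       -- human-readable description
    );
    CREATE TABLE value_label (
        name  TEXT REFERENCES variable(name),
        code  INTEGER,            -- stored numeric code
        label TEXT,               -- meaning of the code
        PRIMARY KEY (name, code)
    );
""")
conn.execute("INSERT INTO variable VALUES ('marstat', 'Marital status')")
conn.executemany(
    "INSERT INTO value_label VALUES (?, ?, ?)",
    [
        ("marstat", 1, "Married"),
        ("marstat", 2, "Widowed"),
        ("marstat", 3, "Never married"),
    ],
)

# The documentation itself is now queryable with ordinary SQL.
for row in conn.execute(
    "SELECT v.label, l.code, l.label FROM variable v "
    "JOIN value_label l ON v.name = l.name"
):
    print(row)
```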
The primary data collector has several data management choices: (a) design the entire data collection strategy around a relational database, (b) input the post-field data and instrument information into a relational database, or (c) do something ...