threaddb 2.0
A file mapped memory container extension
Release 2.0 18 March 2020
Copyright (C) 2019 The ThreadDB Project
The motivation for implementing ThreadDB was the recurring need to handle large amounts of data in the environment of IC manufacturing. While the std:: containers provided by the STL are powerful and versatile instruments for data management, they require all information to be held in system memory. For large amounts of data this shortcoming leads to conditions where the system runs out of memory. Researching the documentation of available packages, the author did not find a solution that fully covers this demand. The available solutions are limited either in database size or in performance. As a result, special solutions had to be implemented for each individual scenario. ThreadDB was born to provide a more general and standardized means of handling really large volumes of data (up to hundreds of gigabytes) efficiently. The library is dedicated to helping developers manage their data in an efficient and simple way. Performance is achieved by utilizing multiple threads and distributing data among different folders or devices. In addition, care was taken that the interface stays simple and is easy to integrate into existing C/C++ code. The implementation has been designed for minimal additional main memory and disk overhead. While 64-bit systems offer techniques like mmap to directly load and manage large volumes as part of the virtual address space, 32-bit systems suffer from a hard limit of no more than 2 GB of addressable file size. ThreadDB overcomes this restriction by loading only portions of the whole volume into the process main memory. Since ThreadDB was designed to distribute the contents of each package over a variety of files located on different volumes and to restrict the file size, it is also possible to use FAT32 volumes to manage huge amounts of data.
ThreadDB is available for download in precompiled form for Windows x64 and Linux. The author decided not to release the source code to the public in order to keep better control over the way the package is built and utilized. The freely available package is limited in the amount of data and the number of available threads. The purpose of this limit is to avoid legal issues (e.g. export control), to encourage users to send feedback, and to help track the application cases in which ThreadDB is used. The Windows package has been built using Microsoft Visual Studio 2017 (v141). The Linux version has been built on Ubuntu 18.04.1 using g++ 4.8.5.
NOTE: the free ThreadDB version is limited to 100 GB of data and 8 worker threads per process. For legal aspects and warranty, please refer to the disclaimer section of this document.
For feedback, comments or to obtain an unlimited version, please contact the author by e-mail at thethreaddbproject(at)gmail.com.
The ThreadDB library and header files are available as simple .zip or .tar.gz files for Windows and Linux respectively. After downloading, unpack the package using e.g. WinZip or 7-Zip, or under Linux tar -zxvf. The package contains the required interface header files, documentation and an example for running initial tests on your system. Care was taken that no other third-party libraries (like boost) are required.
Integrating the library into your project depends on the IDE you are using. For Microsoft Visual Studio it is necessary to add the ThreadDB "\public" folder provided in your local copy of ThreadDB to your project include path. Then include the interface header file threaddbC.h for C or threaddbCPP.h for C++ in your program code. Next, tell the linker to use the threaddb shared library when building your application. To do so, threaddb.lib has to be added to your linker path and to the linked libraries. For Visual Studio all required settings are available through the Project Settings dialog. For Linux it is necessary to add the include path to your makefile. This can be achieved by using the -I compiler option pointing to the "public" folder of the installation. In addition, one has to specify the linker path via -L and add the library using -l in the build command. Before running your application, also make sure that the threaddb.dll or threaddb.so shared library can be found in your library search path.
To utilize ThreadDB in your application it is first necessary to identify the type of data that should be stored. In general one has to distinguish between fixed and variable length data items. For fixed length items no additional information is required. This follows the general principle of the std:: template containers. Since the size of stored data items is not stored in the database itself, variable length data requires intrusive handling of the item size. This can be achieved either by a unique identifier that distinguishes different types of data or by a class member that provides the item size directly. In both cases two read steps are usually required. In the first step the length (or type) information of the data item is read, and in the second step the data section of the stored item.
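The two-step read for variable length items can be sketched as follows. This is a minimal illustration that does not use the ThreadDB API: ByteStore, storeRecord and loadRecord are hypothetical names, and an in-memory byte vector stands in for a database package.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical in-memory byte store standing in for a database package.
struct ByteStore {
    std::vector<char> bytes;
    std::size_t readPos = 0;

    void write(const void* p, std::size_t n) {
        const char* c = static_cast<const char*>(p);
        bytes.insert(bytes.end(), c, c + n);
    }

    // Returns the number of bytes actually read (0 at the end of the data).
    std::size_t read(void* p, std::size_t n) {
        const std::size_t avail = bytes.size() - readPos;
        const std::size_t k = n < avail ? n : avail;
        if (k != 0) std::memcpy(p, bytes.data() + readPos, k);
        readPos += k;
        return k;
    }
};

// Store a variable length record: the length header first, then the payload.
void storeRecord(ByteStore& s, const std::string& payload) {
    const std::uint32_t len = static_cast<std::uint32_t>(payload.size());
    s.write(&len, sizeof len);
    s.write(payload.data(), payload.size());
}

// Two-step read: step 1 reads the length, step 2 reads the data section.
// Returns false when no further complete record is available.
bool loadRecord(ByteStore& s, std::string& out) {
    std::uint32_t len = 0;
    if (s.read(&len, sizeof len) != sizeof len) return false;
    out.resize(len);
    return s.read(&out[0], len) == len;
}
```

Records written with storeRecord can be read back one after another with loadRecord until it returns false. A type identifier instead of a length header would follow the same pattern, with the size looked up from the type.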
The handling of the ThreadDB database itself is straightforward. In a first step the database is created using ThreadDB_Create. After this, the desired number of threads has to be started using ThreadDB_NewThread. The function also allows specifying a maximum temporary file size using MaxFileSize_p. Especially for very large file volumes, copying or transferring data can become cumbersome. Limiting the file size produces more manageable file packages that can be transferred in parallel. This option also helps to overcome partition limits on file size; on FAT32, for example, a file size of 4 GB cannot be exceeded. On success, the call returns the generated temporary database file name assigned to the thread in the parameter pFileName_p. After the worker threads have been successfully established, one or more packages can be registered using ThreadDB_NewPackage. A package can be seen as a container that holds the required item data. Individual packages can be processed (nearly) independently, while inserting data items into the same package requires additional synchronization. Each package is assigned a memory buffer, which temporarily holds and collects the data items. When the buffer is full, it is written to one of the temporary database files. Because of this optimization, ThreadDB_Synchronize has to be called before the items within a package are recovered. In the current state of the implementation, the package management information is kept permanently in memory. Therefore about 180 bytes of additional memory are consumed for each created package.
Data can be stored as desired in the destination package(s) using ThreadDB_Store. This function takes the provided data and issues a store request to the first available worker thread. The parameters PackageSize_p and PackageCacheLimit_p specified during the creation of the database define the size of the internal package cache. Since the number of packages can become quite large, PackageCacheLimit_p sets a maximum number of buffered packages. If this limit is exceeded, package buffers are withdrawn and flushed to file. To select the packages to flush, the database uses a write time stamp: packages that have not been accessed for the longest time are purged first. In a worst-case scenario, more packages than PackageCacheLimit_p are accessed randomly, which leads to ongoing reading and flushing of package information. This process is usually called "thrashing" and can reduce performance. Which temporary database file the package buffer contents are written to depends on the thread available for processing. This means that the contents of a package are spread among different temporary database files. After the packages have been filled, the interim buffers of the database need to be synchronized by calling ThreadDB_Synchronize.
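The flushing policy described above can be modelled with a small sketch. It is an assumption-laden illustration, not the ThreadDB implementation: PackageCache and its members are hypothetical names, and a simple counter stands in for the time stamp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical model of a package buffer cache with a limit on buffered packages.
class PackageCache {
public:
    explicit PackageCache(std::size_t limit) : limit_(limit), clock_(0) {}

    // Touch a package: buffer it and record a fresh access time stamp.
    // If the limit is exceeded, the package with the oldest stamp is flushed.
    void access(std::uint64_t packageId) {
        stamps_[packageId] = ++clock_;
        if (stamps_.size() > limit_) {
            std::map<std::uint64_t, std::uint64_t>::iterator oldest = stamps_.begin();
            for (std::map<std::uint64_t, std::uint64_t>::iterator it = stamps_.begin();
                 it != stamps_.end(); ++it) {
                if (it->second < oldest->second) oldest = it;
            }
            flushed_.push_back(oldest->first);  // stands in for writing the buffer to file
            stamps_.erase(oldest);
        }
    }

    bool isBuffered(std::uint64_t packageId) const {
        return stamps_.count(packageId) != 0;
    }

    // Package ids in the order in which they were flushed.
    const std::vector<std::uint64_t>& flushed() const { return flushed_; }

private:
    std::size_t limit_;
    std::uint64_t clock_;
    std::map<std::uint64_t, std::uint64_t> stamps_;  // package id -> last access stamp
    std::vector<std::uint64_t> flushed_;
};
```

With a limit of two, accessing packages 1, 2, 1, 3 flushes package 2, and a following access to package 2 flushes package 1. Cycling through more packages than the limit in this way is the thrashing pattern the text warns about.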
There are three ways to read back the stored data. One way is continuous streaming through the package items. For stream reading, the package needs to be opened using ThreadDB_Open. This creates a read handle of type threadDB_ReadInfo, which can then be used for the consecutive read operations with ThreadDB_RecoverContinous. After reading has finished, ThreadDB_Close should be called to release the allocated space back to the system. The second recovery mode is random access to individual items. Individual item access requires the threadDB_ItemInfo to be available. This handle has to be pre-allocated and provided during the ThreadDB_Store operation. The threadDB_ItemInfo created in this way is then used by the function ThreadDB_OpenItem to locate and read the selected data item. Please be aware that each stored item that needs to be addressed directly has to have its own copy of the threadDB_ItemInfo handle. A suitable way to keep these handles is one of the std:: containers. The third mode of reading data is a mixture of continuous and random reading. It allows jumping directly to a data item at any location and continuing with sequential reading from there. This mode is supported by the ThreadDB_OpenItem function, which opens a package for reading at the given threadDB_ItemInfo position. The returned threadDB_ReadInfo cursor can then be used for repeatedly calling ThreadDB_RecoverContinous.
In cases where the contents of the database need to be relocated to a different location or disk, the routine ThreadDB_RelocateFileTo allows copying or moving the temporary database files. The routine is also available while insert and/or read operations are in progress. This is especially useful in conditions where, for example, the disk runs out of space. First, the routine ThreadDB_GetFileCount can be used to determine the number N of registered files. The calling process can then use any index in the range [0, N) to specify a new location for that file. In addition, it is possible to get information on the current location of a specific database file by calling the routine ThreadDB_GetDatabaseFileName. Please be aware that the number of database files may exceed the current number of registered threads. This happens when MaxFileSize_p has been exceeded or an existing database was imported. Information about the generated file names may be helpful in case of error conditions, for example to remove files and free disk space. It is also possible to relocate database files using multiple threads at once. With an eye on performance it might therefore be helpful to limit the file size using the parameter MaxFileSize_p. If a full copy of the database needs to be created, the files are relocated with RelocationType_p set to eCopyFileTo.
To get information about the number of registered packages, ThreadDB_GetPackageCount can be used.
Error handling
Error handling is available via the C++ exception mechanism. In case of unintended conditions, the called procedure throws a std::runtime_error exception. Especially in pure C environments this requires additional care.
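One common way to take this care at a C boundary is to catch every exception inside a thin wrapper and translate it into an error code. The sketch below illustrates the general technique only; riskyOperation and riskyOperationC are hypothetical names and not part of the ThreadDB interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <exception>
#include <stdexcept>

// Hypothetical C++ operation that reports failure by throwing.
static void riskyOperation(int value) {
    if (value < 0) throw std::runtime_error("negative value");
}

// C-compatible wrapper: no exception may leave this function.
// Returns 0 on success, 1 for a standard exception, 2 for anything else.
// On failure the message is copied into the caller-provided buffer.
extern "C" int riskyOperationC(int value, char* errBuf, std::size_t errLen) {
    try {
        riskyOperation(value);
        return 0;
    } catch (const std::exception& e) {
        if (errBuf != 0 && errLen > 0) {
            std::strncpy(errBuf, e.what(), errLen - 1);
            errBuf[errLen - 1] = '\0';  // strncpy does not always terminate
        }
        return 1;
    } catch (...) {
        return 2;
    }
}
```

A C caller then only checks the return value and never sees an exception unwinding through C stack frames.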
The ThreadDB package also provides means for simplified standard container support - threaddbSTLWrapper.h
To utilize std:: containers together with the database, the additional classes pool, item and/or key are required. The pool class provides basic services for administering the additional information required for database interaction. The item class acts as an interface for moving data items into the database and retrieving them from it. The tdb::key class is based on the item class but provides additional comparators. This allows stored data items to be used as keys in sorted data structures like std::map. The pool object has to exist for the lifetime of the std:: container object. It is important to understand that std::map takes the place of traditional database indexes. Query operations with specific keys can then be executed just like searching for items in a std::map. There are many situations where the ordering key is small whereas the data items themselves need considerable space. In these conditions it is sufficient to store only the items in the database to avoid memory overruns. The package provides examples showing the interested reader how to do this. Overall, each stored data element requires some additional administrative information kept in memory. Therefore storing very small data items directly might be less memory efficient.
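The idea of a std::map acting as the index, with small keys in memory and large values held elsewhere, can be sketched without the ThreadDB classes. BulkStore, Handle and lookup are hypothetical stand-ins for the pool, the administrative item information and a query; a std::string replaces the database as backing storage.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Small administrative record kept in memory for each stored value.
struct Handle {
    std::size_t offset;
    std::size_t length;
};

// Hypothetical bulk storage standing in for a database package.
class BulkStore {
public:
    Handle put(const std::string& value) {
        Handle h = {blob_.size(), value.size()};
        blob_ += value;
        return h;
    }
    // Every call creates a new copy of the stored contents.
    std::string get(const Handle& h) const {
        return blob_.substr(h.offset, h.length);
    }
private:
    std::string blob_;
};

// The map is the index: sorted small keys, each pointing at a stored value.
typedef std::map<int, Handle> Index;

bool lookup(const Index& index, const BulkStore& store, int key, std::string& out) {
    Index::const_iterator it = index.find(key);
    if (it == index.end()) return false;
    out = store.get(it->second);
    return true;
}
```

Only the keys and the fixed-size handles occupy main memory, which also shows why very small values gain little: the handle can be as large as the value it describes.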
There are two different implementations of the item class. The first uses the random access capability to retrieve data. To utilize random access, the data to be stored needs to be wrapped into a tdb::ritem. This implementation requires an additional tdb::ItemHandle to address elements in the database. The second implementation utilizes sequential package reading. The advantage of this approach is the reduced memory consumption. If this second approach is used, it is reasonable to keep the number of items per package low, since some scanning overhead is involved in locating the desired data item. When storing tdb::ritem as key values, some synchronization overhead is also to be expected. In general, random access is faster at the cost of a somewhat higher memory consumption. To speed up access to the key elements it is advantageous to specify a larger package block size. This allows direct access to the elements while they are held in memory, which reduces disk traffic. Another aspect of rapid data access is the capability to store key and data items in different databases. This allows keys and data to be stored on different disks, which minimizes concurrent disk seek operations.
The implementation allows storing fixed and variable length data items in the standard containers. For variable length data items a specialization based on std::string is available.
The ThreadDB package also comes with an implementation of a static range tree - threaddbRGrid.h
While std::map provides the capability to search and manage keys in a lexicographical fashion, the standard offers no container for managing spatial range queries. Implementations are available in the boost::geometry template library. These implementations use a balanced tree to store and query spatial elements in various ways. Even so, the project decided to provide a static range tree, rgrid, that allows rapid access to geometrical elements while utilizing the capabilities of ThreadDB efficiently. While the available rtree implementations need to rebalance to avoid degeneration of the tree structure, rgrid follows a concept of predefined tiling. It is therefore essential that, before using this data structure, the user knows the dimensions of the scenario (the overall extent of the elements) and an estimate of the number of elements to be stored. These two parameters define the extent of the tiles and the depth of the rgrid. The advantage of this approach is that elements can be inserted directly into tiles. The depth argument can be estimated from the number of elements N and their dimension D. For 1D scenarios a reasonable value is log2(N), for 2D scenarios log4(N) and for 3D scenarios log8(N). For higher dimensions the general estimate is the logarithm of N to the base 2^D. The rgrid data structure is most efficient if the elements to be stored are distributed somewhat homogeneously in the given scenario window. In the implementation each tile is assigned to an individual package intended to hold a series of elements. This approach minimizes the administrative memory overhead when storing a vast number of small elements. In addition, the concept of individual packages minimizes collisions during multithreaded query operations. While insert operations need to be executed serially, multiple threads are utilized to execute the query operation. Please be aware that this might result in a different ordering of the returned elements.
An important aspect of the rgrid is its capability to be streamed into a separate file or a database package. This allows the current state of the scenario to be preserved without having to rebuild the overall structure. To achieve this, the streaming operation preserves not only the contents of the elements stored within the rgrid, but also its administrative information.
In some situations the user might not want to go through the process of individually inserting elements into a database package but has a stream of data to be moved. To simplify this process, the ThreadDB package comes with an implementation of a simple stream class. This class allows streaming any kind of data directly into and out of an individual database package. The stream class is available through threaddbStream.h
The examples are taken from the test routine mainsample.cpp used during implementation. The following discussion focuses on the most important aspects of inserting and recovering data. The test basically consists of two runs: one with a limited number of packages (30) and a second one with an unlimited number of packages. This is also meant to give an idea of the runtime effects of synchronization overhead due to package flushing.
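The depth estimate can be computed with integer arithmetic. The helper below is a sketch under the assumption that the estimate is rounded up to a whole number of levels; estimateDepth is a hypothetical name and not part of the rgrid interface.

```cpp
#include <cassert>
#include <cstdint>

// Smallest depth d such that (2^dims)^d >= n, i.e. the logarithm of n to the
// base 2^dims, rounded up. dims is the dimension of the scenario (1, 2, 3, ...).
unsigned estimateDepth(std::uint64_t n, unsigned dims) {
    assert(dims >= 1 && dims < 64);
    const std::uint64_t fanout = std::uint64_t(1) << dims;  // tiles per subdivision
    unsigned depth = 0;
    std::uint64_t tiles = 1;
    while (tiles < n) {
        if (tiles > UINT64_MAX / fanout) {  // the next level would overflow,
            ++depth;                        // so it certainly covers n
            break;
        }
        tiles *= fanout;
        ++depth;
    }
    return depth;
}
```

For 1024 elements this gives a depth of 10 in 1D and 5 in 2D, and for 512 elements a depth of 3 in 3D. Strongly clustered data would argue for a different value, since the estimate assumes a roughly homogeneous distribution.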
The following example shows the general workflow of creating a ThreadDB database and inserting data into it. This is demonstrated using fixed size string data.
In a first step the tdb::database is created. The parameters PackageSize_p and PackageCacheLimit_p are provided by the calling routine. Next, four worker threads are registered at the database using tdb::database::NewThread. In the example, the folder "D:\tmp" is used to hold the temporary database files. Then 123 packages are created using tdb::database::NewPackage.
To demonstrate the capability of ThreadDB to handle multiple concurrent threads at once, four threads are created that execute the routine threadStore in parallel. Therefore five threads (four worker threads + the main thread) are utilized to fill the database in parallel by calling tdb::database::Store. The package ids for inserting data are somewhat randomized using the term "(iter + iter % 123) % 123". The store operation of the main thread also provides a handle to store the tdb::ItemInfo entries. Later, this allows random access to individual data items.
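The package id term can be examined in isolation. The helper names below are hypothetical; only the expression itself is taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Package id for a given iteration, using the term from the example.
std::size_t packageIdFor(std::size_t iter, std::size_t packageCount = 123) {
    return (iter + iter % packageCount) % packageCount;
}

// Counts how often each package is selected over a number of iterations.
std::vector<std::size_t> histogram(std::size_t iterations,
                                   std::size_t packageCount = 123) {
    std::vector<std::size_t> hits(packageCount, 0);
    for (std::size_t i = 0; i < iterations; ++i) {
        ++hits[packageIdFor(i, packageCount)];
    }
    return hits;
}
```

The term is equivalent to doubling the iteration index modulo 123. Because 123 is odd, every block of 123 consecutive iterations visits each package exactly once, so consecutive stores land in different packages while the load stays evenly spread.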
Finally tdb::database::Synchronize is called to flush the thread buffers to the temporary database files and prepare for reading.
The snippet also provides an example of how to use tdb::database::RelocateFileTo to move the temporary database file with index 0 to a different location. As demonstrated, this is possible while the asynchronous store operations are in full progress.
The next example focuses on different ways to recover data. In the first part, continuous stream reading is demonstrated, again utilizing multiple threads. Four threads are started executing the routine tdb::database::Recover. Each thread first opens randomly selected packages using tdb::database::Open to obtain a tdb::ReadInfo handle. Then the data items are read in a loop using tdb::database::Recover. After all data items have been read, the parameter pReadBytes_p is zero, indicating that no more data is available. The main thread operates in a similar way.
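The shape of such a read loop, which ends when the number of bytes read is zero, can be sketched with a mock cursor. MockCursor, recoverNext and readAll are hypothetical names, and a byte vector stands in for an opened package; the real tdb::database::Recover signature is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical read cursor over one package holding fixed size items.
struct MockCursor {
    const std::vector<char>* data;
    std::size_t pos;
};

// Reads the next item into buf and returns the number of bytes read.
// A return value of zero signals that no more data is available.
std::size_t recoverNext(MockCursor& cursor, char* buf, std::size_t itemSize) {
    const std::size_t avail = cursor.data->size() - cursor.pos;
    const std::size_t n = itemSize < avail ? itemSize : avail;
    if (n != 0) std::memcpy(buf, cursor.data->data() + cursor.pos, n);
    cursor.pos += n;
    return n;
}

// Streams through the whole package and collects the items.
std::vector<std::string> readAll(const std::vector<char>& package,
                                 std::size_t itemSize) {
    std::vector<std::string> items;
    std::vector<char> buf(itemSize);
    MockCursor cursor = {&package, 0};
    std::size_t readBytes = 0;
    while ((readBytes = recoverNext(cursor, buf.data(), itemSize)) != 0) {
        items.push_back(std::string(buf.data(), readBytes));
    }
    return items;
}
```

Each reader thread in the example would run a loop of this shape on its own read handle.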
The 2nd part shows random access to individual data items. Here the previously stored tdb::ItemInfo is used to select specific item data. Looping through the list of tdb::ItemInfo the routine tdb::database::Recover is called. The example also demonstrates how to utilize tdb::database::Replace to modify individual data items.
The following code snippet illustrates the use of std::map for storing key and value elements in database packages. Two data pools are created based on different databases (keyDB/valueDB) that reside on individual partitions. Next, a std::map is created that contains the generated strings. Both the key and the data items are therefore variable length records. By specifying tdb::ritem, random access is used for data retrieval.
CAUTION: do not run this example with regular std::map inserts on your Windows system. This might consume the main memory immediately and could result in system instabilities. For performance testing it is advisable to use a reduced string size.
The so stored items can then be recovered by calling the routine get(). This recreates the original contents and returns a std::unique_ptr<T> providing the item contents. Please be aware, that every time get() is called, a new copy of the contents is created. The following example illustrates the usage.
For detailed information on how to create and initialize a database, please refer to the related section or the examples above.
This example shows how to utilize std::map with sequential access and fixed data item size. Serial access is introduced by using the tdb::sitem item wrapper. Please consider the limited number of data items specified during pool creation.
First a 1 dimensional tdb::rgrid matrix is created, defining a scenario window extent of 'scenarioWindow' and a maximum tree depth of 10.
Elements can be added to the tdb::rgrid providing the surrounding box and element data as follows.
In the next example, the tdb::rgrid is streamed to a database package. The stream contains only the control structures but not the element data itself. When streaming the tdb::rgrid into the same database that was used during filling, it is important to ensure that the packageID is unique and does not interfere with the packages created while inserting elements into the tdb::rgrid nodes.
The following re-creates a new tdb::rgrid based on the formerly serialized database stream.
The following illustrates the definition of the callback functor and how to issue query operations with one and three threads. In the example the callback function just counts the number of returned elements.
The following list gives a brief overview of past and future extensions and improvements of ThreadDB:
Release 2.0
Tight integration into the std:: container concept. Provide an rgrid spatial search structure. Introduce streaming capabilities into database packages. Windows 32 bit version. Static library.
Release 3.0
Introduce ThreadDB server mode. This major extension will provide means to register processes and threads (agents) from different systems to a central ThreadDB acting as data server. With this additional functionality distributed databases up to terabytes of data will become available. In addition, the process becomes independent of local hardware restrictions.
Release 4.0
Add data compression functionality to minimize the consumed disk space.
Copyright (c) 2019 The ThreadDB Project
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name of the <organization> nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL <COPYRIGHT HOLDER> BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.