User documentation¶
RecSQL package¶
RecSQL is a simple module that provides a numpy.record array frontend to an underlying SQLite table.
The SQLarray
object populates a SQL table from a numpy record array, a
iterable that supplies table records, or a string that contains an
especially simple reStructured text table. The SQL table is held in memory
and functions are provided to run SQL queries and commands on the
underlying database. Queries return record arrays if possible (although a
flag can explicitly change this).
Query results are cached to improve performance. This can be disabled (which is recommened for large data sets).
The SQL table is named on initialization. Later one can refer to this table by the name or the magic name __self__ in SQL statements. Additional tables can be added to the same database (by using the connection keyword of the constructor)
The recsql.rest_table
module uses the base functionality to
parse a restructured text table from a string (such as a doc string)
and returns a nicely structured table. This allows for use of
parameters that are documented in the doc strings.
See also
PyTables is a high-performance interface to table data. In most cases you will probably better off in the long run using PyTables than recSQL.
Important functions and classes¶
A SQLarray
can be constructed by either reading data from a
CSV file or reST table with the SQLarray_fromfile()
function or
constructed directly from a numpy.recarray
via the
SQLarray
constructor.
-
recsql.
SQLarray_fromfile
(filename, **kwargs)¶ Create a
SQLarray
from filename.- Uses the filename suffix to detect the contents:
- rst, txt
- restructure text (see
recsql.rest_table
- csv
- comma-separated (see
recsql.csv_table
)
Arguments: - filename
name of the file that contains the data with the appropriate file extension
- kwargs
- additional arguments for
SQLarray
- additional arguments
recsql.csv_table.Table2array
orrecsql.rest_table.Table2array
such as mode or autoncovert.
- additional arguments for
-
class
recsql.
SQLarray
(name=None, records=None, filename=None, columns=None, cachesize=5, connection=None, is_tmp=False, **kwargs)¶ A SQL table that returns (mostly) rec arrays.
The
SQLarray
can be initialized from- an iterable of records (tuples), given in the records keyword argument, and the column names (provided in columns);
- a string that contains a simple reStructured Text table (see
recsql.rest_table
for details); - a
numpy.recarray
.
Note
SQLite only understands standard Python types and hence has problems with many of the NumPy data types such as
numpy.int64
. When loading a recarray fails we try to convert all data types automatically to Python types (usingrecsql.convert.irecarray_to_py()
). This might loose precision and/or even fail. It is also slow for larger arrays.The class takes the following arguments:
-
SQLarray
([name[, records[, columns[, cachesize=5, connection=None, dbfile=":memory:"]]]])¶
Arguments: - name
table name (can be referred to as ‘__self__’ in SQL queries)
- records
numpy record array that describes the layout and initializes the table OR any iterable (and then columns must be set, too) OR a string that contains a single, simple reStructured text table (and the table name is set from the table name in the reST table.) If
None
then simply associate with existing table name.- filename
Alternatively to records, read a reStructured table from filename.
- columns
sequence of column names (only used if records does not have attribute dtype.names) [
None
]- cachesize
number of (query, result) pairs that are cached [5]
- connection
If not
None
, reuse this connection; this adds a new table to the same database, which allows more complicated queries with cross-joins. The table’s connection is available as the attribute T.connection. [None
]- dbfile
Normally the db is held in memory (”:memory:”) but if a filename is provided then the underlying SQLite db is held on disk and can be accessed and restored (see
SQLarray.save()
). Only works with connection =None
[”:memory:”]- is_tmp
True
: create a tmp table;False
: regular table in db [False
]
Bugs: InterfaceError
: Error binding parameter 0 - probably unsupported typeIn this case the recarray contained types such as
numpy.int64
that are not understood by sqlite and which we were not able to convert to a Python type (usingrecsql.convert.irecarray_to_py()
). Either convert the data manually (by setting the numpy dtypes yourself on the recarray, or better: feed a simple list of tuples (“records”) to this class in records. Make sure that these tuples only contain standard Python types. Together with records you will also have to supply the names of the data columns in the keyword argument columns.If you are reading from a file then it might be simpler to use
recsql.sqlarray.SQLarray_fromfile()
.
-
SELECT
(fields, *args, **kwargs)¶ Execute a simple SQL
SELECT
statement and returns values as new numpy rec array.The arguments fields and the additional optional arguments are simply concatenated with additional SQL statements according to the template:
SELECT <fields> FROM __self__ [args]
The simplest fields argument is
"*"
.- Example:
Create a recarray in which students with average grade less than 3 are listed:
result = T.SELECT("surname, subject, year, avg(grade) AS avg_grade", "WHERE avg_grade < 3", "GROUP BY surname,subject", "ORDER BY avg_grade,surname")
The resulting SQL would be:
SELECT surname, subject, year, avg(grade) AS avg_grade FROM __self__ WHERE avg_grade < 3 GROUP BY surname,subject ORDER BY avg_grade,surname
Note how one can use aggregate functions such avg().
The string ‘__self__’ is automatically replaced with the table name (
T.name
); this can be used for cartesian products such asLEFT JOIN __self__ WHERE ...
Note
See the documentation for
sql()
for more details on the available keyword arguments and the use of?
parameter interpolation.
-
close
()¶ Clean up (if no more connections to the db exist).
- For in-memory: Delete the underlying SQL table from the in-memory database.
- For on-disk: save and close connection
-
connection_count
¶ Number of currently open connections to the database.
(Stored in table sqlarray_master.)
-
has_table
(name)¶ Return
True
if the table name exists in the database.
-
limits
(variable)¶ Return minimum and maximum of variable across all rows of data.
-
merge
(recarray, columns=None)¶ Merge another recarray with the same columns into this table.
Arguments: - recarray
numpy record array that describes the layout and initializes the table
Returns: n number of inserted rows
Raises: Raises an exception if duplicate and incompatible data exist in the main table and the new one.
-
merge_table
(name)¶ Merge an existing table in the database with the __self__ table.
Executes as
'INSERT INTO __self__ SELECT * FROM <name>'
. However, this method is probably used less often than the simplermerge()
.Arguments: name name of the table in the database (must be compatible with __self__) Returns: n number of inserted rows
-
recarray
¶ Return underlying SQL table as a read-only record array.
-
save
()¶ Commit changes to file.
Only works if the SQLarray was created with they dbfile =
FILENAME
keyword. There is currently no way to save a in-memory db.See also
aoft.DB.clone()
-
selection
(SQL, parameters=None, **kwargs)¶ Return a new SQLarray from a SELECT selection.
This method is useful to build complicated selections and essentially new tables from existing data. The result of the SQL query is stored as a new table in the database. By default, a unique name is created but this can be overridden with the name keyword.
Arguments: - SQL
SQL
SELECT
query string. A leadingSELECT * FROM __self__ WHERE
can be omitted (see examples below). The SQL is scrubbed and only data up to the first semicolon is used (note that this means that there cannot even be a semicolon in quotes; if this is a problem, file a bug report and it might be changed).
Keywords: - name
name of the table,
None
autogenerates a name unique to this query. name may not refer to the parent table itself. [None
]- parameters
tuple of values that are safely interpolated into subsequent
?
characters in the SQL string- force
If
True
then an existing table of the same name isDROP``ped first. If ``False
and the table already exists then SQL is ignored and aSQLarray
of the existing table name is returned. [False
]
Returns: a SQLarray
referring to the table name in the database; it also inherits theSQLarray.dbfile
Examples:
s = SQLarray.selection('a > 3') s = SQLarray.selection('a > ?', (3,)) s = SQLarray.selection('SELECT * FROM __self__ WHERE a > ? AND b < ?', (3, 10))
-
sql
(SQL, parameters=None, asrecarray=True, cache=True)¶ Execute sql statement.
Arguments: - SQL : string
Full SQL command; can contain the
?
place holder so that values supplied with theparameters
keyword can be interpolated using thepysqlite
interface.- parameters : tuple
Parameters for
?
interpolation.- asrecarray : boolean
True
: return anumpy.recarray
if possible;False
: return records as a list of tuples. [True
]- cache : boolean
Should the results be cached? Set to
False
for large queries to avoid memory issues. Queries with?
place holders are never cached. [True
]
Returns: For asrecarray =
True
anumpy.recarray
is returned; otherwise a simple list of tuples is returned.Raises: TypeError
if the conversion torecarray
fails for any reason.Warning
There are no sanity checks applied to the SQL.
The last cachesize queries are cached (for cache =
True
) and are returned directly unless the table has been modified.The string “__self__” in SQL is substituted with the table name. See the
SELECT()
method for more details.
-
sql_index
(index_name, column_names, unique=True)¶ Add a named index on given columns to improve performance.
-
sql_select
(fields, *args, **kwargs)¶ Execute a simple SQL
SELECT
statement and returns values as new numpy rec array.The arguments fields and the additional optional arguments are simply concatenated with additional SQL statements according to the template:
SELECT <fields> FROM __self__ [args]
The simplest fields argument is
"*"
.- Example:
Create a recarray in which students with average grade less than 3 are listed:
result = T.SELECT("surname, subject, year, avg(grade) AS avg_grade", "WHERE avg_grade < 3", "GROUP BY surname,subject", "ORDER BY avg_grade,surname")
The resulting SQL would be:
SELECT surname, subject, year, avg(grade) AS avg_grade FROM __self__ WHERE avg_grade < 3 GROUP BY surname,subject ORDER BY avg_grade,surname
Note how one can use aggregate functions such avg().
The string ‘__self__’ is automatically replaced with the table name (
T.name
); this can be used for cartesian products such asLEFT JOIN __self__ WHERE ...
Note
See the documentation for
sql()
for more details on the available keyword arguments and the use of?
parameter interpolation.
For querying the version of the package use
-
recsql.
get_version
()¶ Return current package version as a string.
-
recsql.
get_version_tuple
()¶ Return current package version as a (MAJOR,MINOR,PATCHLEVEL).
Example¶
>>> from recsql import SQLarray
>>> import numpy
>>> a = numpy.rec.fromrecords(numpy.arange(100).reshape(25,4), names='a,b,c,d')
>>> Q = SQLarray('my_name', a)
>>> print repr(Q.recarray)
rec.array([(0, 1, 2, 3), (4, 5, 6, 7), (8, 9, 10, 11), (12, 13, 14, 15),
(16, 17, 18, 19), (20, 21, 22, 23), (24, 25, 26, 27),
(28, 29, 30, 31), (32, 33, 34, 35), (36, 37, 38, 39),
(40, 41, 42, 43), (44, 45, 46, 47), (48, 49, 50, 51),
(52, 53, 54, 55), (56, 57, 58, 59), (60, 61, 62, 63),
(64, 65, 66, 67), (68, 69, 70, 71), (72, 73, 74, 75),
(76, 77, 78, 79), (80, 81, 82, 83), (84, 85, 86, 87),
(88, 89, 90, 91), (92, 93, 94, 95), (96, 97, 98, 99)],
dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])
>>> Q.SELECT('*', 'WHERE a < 10 AND b > 5')
rec.array([(8, 9, 10, 11)],
dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])
# creating new SQLarrays:
>>> R = Q.selection('a < 20 AND b > 5')
>>> print R
<recsql.sqlarray.SQLarray object at 0x...>
Additional SQL functions¶
Note that the SQL database that is used as the backend for
SQLarray
has a few additional functions defined in addition
to the standard SQL available in sqlite. These can be used in
SELECT
statements and often avoid post-processing of record arrays
in python. It is relatively straightforward to add new functions (see
the source code and in particular the
recsql.sqlarray.SQLarray._init_sql_functions()
method; the
functions themselves are defined in the module recsql.sqlfunctions
).
Simple SQL functions¶
Simple functions transform a single input value into a single output value:
Expression | SQL equivalent |
---|---|
y = f(x) | SELECT f(x) AS y |
Additional simple functions have been defined:
Simple SQL f() | description |
---|---|
sqr(x) | square x*x |
sqrt(x) | square root numpy.sqrt() |
pow(x,y) | power x**y |
periodic(x) | wrap angle in degree between -180º and +180º |
regexp(pattern,string) | string REGEXP pattern |
match(pattern,string) | string MATCH pattern (anchored REGEXP) |
fformat(format,x) | string formatting of a single value format % x |
Aggregate SQL functions¶
Aggregate functions combine data from a query; they are typically used with a ‘GROUP BY col’ clause. They can be thought of as numpy ufuncs:
Expression | SQL equivalent |
---|---|
y = f(x1,x2,...xN) | SELECT f(x) AS y ... GROUP BY x |
For completeness, the table also lists sqlite built-in aggregate functions:
Simple aggregate f() | description |
---|---|
avg(x) | mean [sqlite builtin] |
std(x) | standard deviation (using N-1 variance) |
stdN(x) | standard deviation (using N variance), sqrt(<(X - <X>)**2>) |
median(x) | median of the data (see numpy.median() ) |
min(x) | minimum [sqlite builtin] |
max(x) | maximum [sqlite builtin] |
PyAggregate SQL functions¶
PyAggregate functions act on a list of data points in the same way as ordinary aggregate functions but they return python objects such as numpy arrays, or tuples of numpy arrays (eg bin edges and histogram). In order to make this work, specific types have to be declared when returning the results:
For instance, the histogram() function returns a python Object, the tuple (histogram, edges):
a.sql('SELECT histogram(x) AS "x [Object]" FROM __self__', asrecarray=False)
The return type (‘Object’) needs to be declared with the 'AS "x [Object]"'
syntax (note the quotes). (See more details in the sqlite documentation
under adapters and converters.) The following table lists all PyAggregate
functions that have been defined:
PyAggregate | type | signature; description |
---|---|---|
array | NumpyArray | array(x);
a standard numpy.array() |
histogram | Object | histogram(x,nbins,xmin,xmax); histogram x in nbins evenly spaced bins between xmin and xmax |
distribution | Object | distribution(x,nbins,xmin,xmax); normalized histogram whose integral gives 1 |
meanhistogram | Object | meanhistogram(x,y,nbins,xmin,xmax); histogram data points y along x and average all y in each bin |
stdhistogram | Object | stdhistogram(x,y,nbins,xmin,xmax); give the standard deviation (from N-1 variance) std(y) = sqrt(Var(y)) with Var(y) = <(y-<y>)^2> |
medianhistogram | Object | medianhistogram((x,y,nbins,xmin,xmax); median(y) |
minhistogram | Object | minhistogram((x,y,nbins,xmin,xmax); min(y) |
maxhistogram | Object | maxhistogram((x,y,nbins,xmin,xmax); max(y) |
zscorehistogram | Object | zscorehistogram((x,y,nbins,xmin,xmax); <abs(y-<y>)>/std(y) |
Examples of using types in tables¶
The following show how to use the special types.
Declare types as ‘NumpyArray’:
a.sql("CREATE TABLE __self__(a NumpyArray)")
Then you can simply insert python objects (type(my_array) ==
numpy.ndarray
):
a.sql("INSERT INTO __self__(a) values (?)", (my_array,))
When returning results of declared columns one does not have to do anything
(my_array,) = a.sql("SELECT a FROM __self__")
although one can also do
(my_array,) = q.sql('SELECT a AS "a [NumpyArray]" FROM __self__')
but when using a PyAggregate the type must be declared:
a.sql('SELECT histogram(x,10,0.0,1.5) as "hist [Object]" FROM __self__')
Other approaches to interfacing SQLite and NumPy¶
If RecSQL does not what you need it to do then look at these other projects.
See also
esutil.sqlite_util (part of esutil) and hydroclimpy.io.sqlite
If you do not have to rely on SQL then also look at PyTables.