Preface
HISTORY OF kdb+ AND q
kdb+ and q are intellectual descendants of an older programming language, APL. The acronym "APL" stands for "A Programming Language", the name of a book (Iverson, 1962) written by the Canadian computer scientist Kenneth (Ken) Eugene Iverson (1920-2004). Iverson worked on automatic data processing during his years at Harvard (1955-1960), when he found that the conventional mathematical notation wasn't well suited to the task. He proceeded to develop his own notation, borrowing ideas from linear algebra, tensor analysis, and operators à la Oliver Heaviside. This notation was further elaborated at IBM, where Iverson worked alongside Adin Falkoff (1921-2010) from 1960 until 1980. The collaboration between Iverson and Falkoff would span two decades.
The two main ideas behind APL are the efficient-notation idea (Montalbano, 1982) and the stored-program idea. The stored-program idea dates back to John von Neumann (1903-1957), see von Neumann (1945), and amounts to being able to store (and process) code as data; it has been taken a step further in languages such as q, where function names evaluate to their source code. The efficient-notation idea, the idea that developing a concise and expressive syntax is critical to solving complex iterative problems correctly and efficiently, was pioneered by Iverson. In the preface of Iverson (1962) he defines a programming language in the following terms:
Applied mathematics is largely concerned with the design and analysis of explicit procedures for calculating the exact or approximate values of various functions. Such explicit procedures are called algorithms or programs. Because an effective notation for the description of programs exhibits considerable syntactic structure, it is called a programming language.
Later, in 1979, Iverson would give a Turing Award Lecture with the title Notation as a Tool of Thought (Iverson, 1979). Iverson's notation was as effective as it was simple. It relied on simple rules of precedence based on right-to-left evaluation. The fundamental data structure in APL is a multidimensional array. Languages such as APL and its descendants are sometimes referred to as array, vector, or multidimensional programming languages because they implicitly generalise scalar operations to higher-dimensional objects.
Iverson's notation was known as "Iverson's notation" within IBM until the name "APL" was suggested by Falkoff. After the publication of A Programming Language in 1962, the notation was used to describe the IBM System/360 computer. Iverson and Falkoff then focused on the implementation of the programming language. An implementation on System/360 was made available at IBM in 1966 and released to the outside world in 1968.
In 1980, Iverson moved to I. P. Sharp Associates (IPSA), a Canadian software firm based in Toronto. There he was joined by Roger K. W. Hui (b. 1953) and Arthur Whitney (b. 1957). The three of them continued to work on APL, adding new ideas to the programming language. Hui, whose family emigrated to Canada from Hong Kong in 1966, was first exposed to APL at the University of Alberta. Whitney had a background in pure mathematics and had worked with APL at the University of Toronto and Stanford. He had first met Iverson in 1969, at the age of 11, well before he joined IPSA. Iverson had been his father's friend at Harvard. Whitney's family then lived in Alberta but would visit Iverson in his house in Mount Kisco. There Iverson introduced Whitney to programming and APL.
In 1988, Whitney joined Morgan Stanley, where he helped develop A+, an APL-like programming language with a smaller set of primitive functions optimised for fast processing of large volumes of time series data. Unlike APL, A+ allowed functions to have up to nine formal parameters, used semicolons to separate statements (so a single statement could be split into multiple lines), used the last statement of a function as its result, and introduced the so-called dependencies which functioned as global variables. The programming language is now available online, http://www.aplusdev.org/. Programmers can also download the kapl font, which includes the special characters used by APL and A+.
One summer weekend in 1989, Whitney visited Iverson at Kiln Farm and produced - "on one page and in one afternoon" (Hui, 1992) - an interpreter fragment on the AT&T 3B1 computer. Hui studied this fragment and on its basis developed an interpreter for another APL variant, J. Unlike APL and A+, J used the ASCII character set. It included advanced features, such as support for parallel MIMD operations. Whitney's original fragment appears under the name Incunabulum in an appendix in Hui's book, see Hui (1992). Other ideas by Whitney found their way into J: orienting primitives on the leading axis, using prefix rather than suffix for agreement, and total array ordering (Hui, 2006, 1995). Ken Iverson, his son, Eric Iverson, and Hui all ended up working in a company called Jsoftware in the 1990s-2000s.
Whitney left Morgan Stanley in 1993 and co-founded Kx Systems with Janet Lustgarten, where he developed another APL variant, called k. On its basis he developed a columnar in-memory time series database called kdb. Kx Systems was under an exclusive agreement with UBS; when it expired in 1996, k and kdb became generally available. ksql was added in 1998 as a layer on top of k. Some developers regard it as part of the k language. ksql includes SQL-like constructs, such as select.
kdb+ was released in June 2003. This was more or less a total rewrite of kdb for 64-bit systems based on the 4th version of k and q, a macro language layer (or a query language, hence the name) on top of k, defined in terms of k. Both q and k compile to the same byte code that is executed in the kdb+ byte interpreter. For example, type in q is the equivalent of @: in k. q is much more readable than k and most kdb+ developers write their code in q, not k.
The q programming language contains its own table query syntax called q-sql, which in some ways resembles traditional SQL.
MOTIVATION FOR THIS BOOK
q and kdb+ occupy a special ground between - as well as overlapping - the purely technical world of software engineering and the world of data science. On the one hand, they can be used for building data services which communicate with each other, whether this is simply to expose a time series database - something q excels at - or a chain of Complex Event Processing (CEP) engines calculating analytics and signals in real-time. On the other hand, q's fast execution on vectorised time series enables its application to data science, hypothesis testing, research, pattern recognition, and general statistical and machine learning, the fields on which quantitative analysts have traditionally focused.
In practice, the distinction between the two worlds can become blurry: the two are closely interconnected. An idea cannot be validated without rigorous and fast statistical analysis and backtesting, where the algorithms and their parameters are well understood, avoiding black-box solutions. It is then natural to expand our model validation code into the actual production predictive analytics. This is facilitated by the reliability, resilience, scalability, and durability of kdb+.
q's notation as a tool of thought and expression enables it to succeed at both these tasks; its roots, traceable to APL and lambda calculus (Church, 1941), its vector language, the fast columnar database, and its query language make q a unique all-in-one tool.
Although previously notorious as a language which "cannot be googled" due to its short name and lack of documentation, q and kdb+ are nowadays well-documented from the programming language and infrastructural perspectives, with some excellent sources of material both on https://code.kx.com/ and in books, such as Borror (2015); Psaris (2015).
Moreover, kdb+ is widely used in the market as a time series database combined with real-time analytics, whereas a lot of the data science, statistical and machine learning (ML) work is done outside of q, by extracting data into, or interfacing with, Python or R.
The aim of this book is to demonstrate that a lot of the power of q can be harnessed to deal with a large part of everyday data analysis, from data retrieval and data operations - specifically on very large data sets - to performing a range of...