Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN) 0.10
A Performance Library for Deep Learning

The Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN) is an open-source performance library for Deep Learning (DL) applications intended to accelerate DL frameworks on Intel(R) architecture. Intel MKL-DNN includes highly vectorized and threaded building blocks for implementing convolutional neural networks (CNNs) with C and C++ interfaces. This project was created to help the DL community innovate on the Intel(R) processor family.

The library supports the most commonly used primitives necessary to accelerate bleeding-edge image recognition topologies, including Cifar*, AlexNet*, VGG*, GoogleNet*, and ResNet*. The primitives include convolution, inner product, pooling, normalization, and activation, with support for inference operations.

Intel MKL-DNN primitives implement a plain C/C++ application programming interface (API) that can be used in existing C/C++ DNN frameworks as well as in custom DNN applications.

Programming Model

Intel MKL-DNN models memory as a primitive similar to an operation primitive. This allows reconstruction of the graph of computations at run time.

Basic Terminology

Intel MKL-DNN operates on the following main objects:

  - Primitive: any operation, such as convolution, data format reorder, or memory. Primitives can have other primitives as inputs but can have only memory primitives as outputs.
  - Engine: an execution device. Currently the only supported engine is the CPU.
  - Stream: an execution context to which you submit primitives and wait for their completion.

A typical workflow is to create a set of primitives to run, push them to a stream all at once or one at a time, and wait for completion.
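
As a minimal sketch of this workflow, assuming two primitives, some_reorder and some_conv, have already been created (these names are hypothetical; the full construction is shown in the Example section below):

    using namespace mkldnn;

    std::vector<primitive> net;
    net.push_back(some_reorder); /* e.g., a data format reorder */
    net.push_back(some_conv);    /* e.g., a convolution consuming the reorder output */

    /* submit the whole net to an eager stream and block until completion */
    stream(stream::kind::eager).submit(net).wait();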

Creating Primitives

In Intel MKL-DNN, creating primitives involves three levels of abstraction:

  1. Operation/memory descriptor: a logical description of an operation or of memory, such as tensor shapes, data type, data layout, and operation parameters.
  2. Primitive descriptor: the descriptor bound to a specific target engine.
  3. Primitive: a fully instantiated object connected to its input and output primitives.

To create a memory primitive (see the sketch after this list):

  1. Create a memory descriptor. The memory descriptor contains the dimensions, precision, and format of the data layout in memory. The data layout can be either user-specified or set to any. The any format allows the operation primitives (convolution and inner product) to choose the best memory format for optimal performance.
  2. Create a memory primitive descriptor. The memory primitive descriptor contains the memory descriptor and the target engine.
  3. Create a memory primitive. This requires allocating a memory buffer and attaching the data handle to the memory primitive descriptor. Note: in the C++ API, when creating an output memory primitive, you do not need to allocate a buffer unless the output is needed in a user-defined format.
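
A minimal sketch of these three steps (assuming mkldnn.hpp and <vector> are included; the buffer size and dimensions are illustrative):

    using namespace mkldnn;
    auto cpu_engine = engine(engine::cpu, 0);

    /* 1. Memory descriptor: dimensions, data type, and data layout */
    auto md = memory::desc({2, 3, 227, 227}, memory::data_type::f32,
            memory::format::nchw);

    /* 2. Memory primitive descriptor: the descriptor plus a target engine */
    auto mpd = memory::primitive_desc(md, cpu_engine);

    /* 3. Memory primitive: attach a user-allocated buffer */
    std::vector<float> buf(2 * 3 * 227 * 227);
    auto mem = memory(mpd, buf.data());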

To create an operation primitive (see the sketch after this list):

  1. Create a logical description of the operation. For example, the description of a convolution operation contains parameters such as sizes, strides, and propagation type. It also contains the input and output memory descriptors.
  2. Create a primitive descriptor by attaching the target engine to the logical description.
  3. Create an instance of the primitive and specify the input and output primitives.
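
Continuing the sketch, the same three levels for a forward convolution might look like the following. This assumes the memory descriptors (conv_src_md and so on) and memory primitives have been created as in the Example section below:

    /* 1. Logical description: propagation kind, algorithm, shapes,
     *    strides, and padding; the memory descriptors may use format::any */
    auto conv_desc = convolution_forward::desc(prop_kind::forward,
            convolution_direct, conv_src_md, conv_weights_md,
            conv_bias_md, conv_dst_md,
            {1, 1} /* strides */, {0, 0}, {0, 0} /* padding */,
            padding_kind::zero);

    /* 2. Primitive descriptor: the logical description bound to an engine */
    auto conv_pd = convolution_forward::primitive_desc(conv_desc, cpu_engine);

    /* 3. Primitive instance, connected to its input and output primitives */
    auto conv = convolution_forward(conv_pd, conv_src_memory,
            conv_weights_memory, conv_bias_memory, conv_dst_memory);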

Performance Considerations

Miscellaneous Operational Details

Auxiliary Types

Example

This C++ API example demonstrates how to build a neural network topology block that consists of forward convolution and ReLU.

The steps in the example are:

  1. Initialize a CPU engine. The last parameter in the engine() call represents the index of the engine.
    using namespace mkldnn;
    auto cpu_engine = engine(engine::cpu, 0);
  2. Create a vector of primitives that represents the net.
    std::vector<primitive> net;
  3. Allocate input data and create a tensor structure that describes the data.
    std::vector<float> src(2 * 3 * 227 * 227);
    memory::dims conv_src_tz = {2, 3, 227, 227};
    /* similarly specify tensor structure for output, weights and bias */
  4. Create two memory descriptors: one for data in a user format, and one for the convolution input. Choose nchw (minibatch-channels-height-width) format for user data and any for the convolution data format. The any format allows the convolution primitive to choose the data format that is most suitable for its input parameters (convolution kernel sizes, strides, padding, and so on). If the resulting format is different from nchw, the user data must be transformed to the format required for the convolution.
    auto user_src_md = memory::desc({conv_src_tz},
            memory::data_type::f32, memory::format::nchw);
    auto conv_src_md = memory::desc({conv_src_tz},
            memory::data_type::f32, memory::format::any);
    /* similarly create conv_weights_md and conv_dst_md in format::any */
  5. Create a convolution descriptor by specifying the algorithm, propagation kind, shapes of input, weights, bias, output, convolution strides, padding, and kind of padding.
    auto conv_desc = convolution_forward::desc(prop_kind::forward,
            convolution_direct, /* direct convolution algorithm */
            conv_src_md, /* format::any used here to allow convolution to choose a format */
            conv_weights_md, conv_bias_md, conv_dst_md,
            {1, 1}, {0, 0}, {0, 0}, padding_kind::zero);
  6. Create a descriptor of the convolution primitive. Once created, this descriptor has specific memory formats in place of the any formats specified in the convolution descriptor.
    auto conv_pd = convolution_forward::primitive_desc(conv_desc, cpu_engine);
  7. Create a memory primitive that contains user data and check whether the user data format differs from the format that the convolution requires. In case it is different, create a reorder primitive that transforms the user data to the convolution format and add it to the net.
    auto user_src_memory_descriptor
            = memory::primitive_desc(user_src_md, cpu_engine);
    auto user_src_memory = memory(user_src_memory_descriptor, src.data());

    /* Check whether a reorder is necessary */
    auto conv_src_memory = user_src_memory;
    if (memory::primitive_desc(conv_pd.src_primitive_desc())
            != user_src_memory_descriptor) {
        /* Yes, a reorder is necessary. The convolution primitive descriptor
         * contains the descriptor of the memory primitive it requires as
         * input. Because a pointer to the allocated memory is not specified,
         * Intel MKL-DNN allocates the memory. */
        conv_src_memory = memory(conv_pd.src_primitive_desc());

        /* create a reorder between data, make it an input for the convolution */
        auto conv_reorder_src = reorder(user_src_memory, conv_src_memory);

        /* put the reorder in the net */
        net.push_back(conv_reorder_src);
    }
  8. Create a memory primitive for output.
    auto conv_dst_memory = memory(conv_pd.dst_primitive_desc());
  9. Create a convolution primitive and add it to the net.
    /* conv_weights_memory and conv_user_bias_memory are assumed to have been
     * created earlier, analogously to conv_src_memory */

    /* Note that the conv_reorder_src primitive is an input dependency for
     * the convolution primitive, which means that the convolution primitive
     * will not be executed before the data is ready. */
    auto conv = convolution_forward(conv_pd, conv_src_memory,
            conv_weights_memory, conv_user_bias_memory, conv_dst_memory);
    net.push_back(conv);
  10. Create a ReLU primitive. For better performance, keep the ReLU input data format the same as the format chosen by the convolution, and do the same for subsequent operation primitives until another convolution or inner product is encountered.
    auto relu_src_md = conv_pd.dst_primitive_desc().desc();
    auto relu_desc = relu_forward::desc(prop_kind::forward, relu_src_md,
            negative_slope);
    auto relu_pd = relu_forward::primitive_desc(relu_desc, cpu_engine);
    auto relu_dst_memory = memory(relu_pd.dst_primitive_desc());
  11. Pass the convolution primitive as input to the ReLU primitive. Note: this allows the stream to establish dependencies between the primitives.
    auto relu = relu_forward(relu_pd, conv, relu_dst_memory);
    net.push_back(relu);
  12. Finally, create a stream, submit all the primitives, and wait for completion.
    mkldnn::stream(mkldnn::stream::kind::eager).submit(net).wait();

Extended Examples

These examples provide more details about using the API. All of the examples use the same topology: convolution, ReLU, LRN, and pooling.

Legal Information