Skip to contents

This layer takes float32/float64 single or batched audio signal as inputs and computes the Mel spectrogram using Short-Time Fourier Transform and Mel scaling. The input should be a 1D (unbatched) or 2D (batched) tensor representing audio signals. The output will be a 2D or 3D tensor representing Mel spectrograms.

A spectrogram is an image-like representation that shows the frequency spectrum of a signal over time. It uses x-axis to represent time, y-axis to represent frequency, and each pixel to represent intensity. Mel spectrograms are a special type of spectrogram that use the mel scale, which approximates how humans perceive sound. They are commonly used in speech and music processing tasks like speech recognition, speaker identification, and music genre classification.

Usage

layer_mel_spectrogram(
  object,
  fft_length = 2048L,
  sequence_stride = 512L,
  sequence_length = NULL,
  window = "hann",
  sampling_rate = 16000L,
  num_mel_bins = 128L,
  min_freq = 20,
  max_freq = NULL,
  power_to_db = TRUE,
  top_db = 80,
  mag_exp = 2,
  min_power = 1e-10,
  ref_power = 1,
  ...
)

Arguments

object

Object to compose the layer with. A tensor, array, or sequential model.

fft_length

Integer, size of the FFT window.

sequence_stride

Integer, number of samples between successive STFT columns.

sequence_length

Integer, size of the window used for applying window to each audio frame. If NULL, defaults to fft_length.

window

String, name of the window function to use. Available values are "hann" and "hamming". If window is a tensor, it will be used directly as the window and its length must be sequence_length. If window is NULL, no windowing is used. Defaults to "hann".

sampling_rate

Integer, sample rate of the input signal.

num_mel_bins

Integer, number of mel bins to generate.

min_freq

Float, minimum frequency of the mel bins.

max_freq

Float, maximum frequency of the mel bins. If NULL, defaults to sampling_rate / 2.

power_to_db

If TRUE, convert the power spectrogram to decibels.

top_db

Float, minimum negative cut-off max(10 * log10(S)) - top_db.

mag_exp

Float, exponent for the magnitude spectrogram. 1 for magnitude, 2 for power, etc. Default is 2.

min_power

Float, minimum value for power and ref_power.

ref_power

Float, the power is scaled relative to it 10 * log10(S / ref_power).

...

For forward/backward compatability.

Value

The return value depends on the value provided for the first argument. If object is:

  • a keras_model_sequential(), then the layer is added to the sequential model (which is modified in place). To enable piping, the sequential model is also returned, invisibly.

  • a keras_input(), then the output tensor from calling layer(input) is returned.

  • NULL or missing, then a Layer instance is returned.

References

Examples

Unbatched audio signal

layer <- layer_mel_spectrogram(
  num_mel_bins = 64,
  sampling_rate = 8000,
  sequence_stride = 256,
  fft_length = 2048
)
layer(random_uniform(shape = c(16000))) |> shape()

Batched audio signal

layer <- layer_mel_spectrogram(
  num_mel_bins = 80,
  sampling_rate = 8000,
  sequence_stride = 128,
  fft_length = 2048
)
layer(random_uniform(shape = c(2, 16000))) |> shape()

Input Shape

1D (unbatched) or 2D (batched) tensor with shape:(..., samples).

Output Shape

2D (unbatched) or 3D (batched) tensor with shape:(..., num_mel_bins, time).

See also

Other preprocessing layers:
layer_category_encoding()
layer_center_crop()
layer_discretization()
layer_feature_space()
layer_hashed_crossing()
layer_hashing()
layer_integer_lookup()
layer_normalization()
layer_random_brightness()
layer_random_contrast()
layer_random_crop()
layer_random_flip()
layer_random_rotation()
layer_random_translation()
layer_random_zoom()
layer_rescaling()
layer_resizing()
layer_string_lookup()
layer_text_vectorization()

Other layers:
Layer()
layer_activation()
layer_activation_elu()
layer_activation_leaky_relu()
layer_activation_parametric_relu()
layer_activation_relu()
layer_activation_softmax()
layer_activity_regularization()
layer_add()
layer_additive_attention()
layer_alpha_dropout()
layer_attention()
layer_average()
layer_average_pooling_1d()
layer_average_pooling_2d()
layer_average_pooling_3d()
layer_batch_normalization()
layer_bidirectional()
layer_category_encoding()
layer_center_crop()
layer_concatenate()
layer_conv_1d()
layer_conv_1d_transpose()
layer_conv_2d()
layer_conv_2d_transpose()
layer_conv_3d()
layer_conv_3d_transpose()
layer_conv_lstm_1d()
layer_conv_lstm_2d()
layer_conv_lstm_3d()
layer_cropping_1d()
layer_cropping_2d()
layer_cropping_3d()
layer_dense()
layer_depthwise_conv_1d()
layer_depthwise_conv_2d()
layer_discretization()
layer_dot()
layer_dropout()
layer_einsum_dense()
layer_embedding()
layer_feature_space()
layer_flatten()
layer_flax_module_wrapper()
layer_gaussian_dropout()
layer_gaussian_noise()
layer_global_average_pooling_1d()
layer_global_average_pooling_2d()
layer_global_average_pooling_3d()
layer_global_max_pooling_1d()
layer_global_max_pooling_2d()
layer_global_max_pooling_3d()
layer_group_normalization()
layer_group_query_attention()
layer_gru()
layer_hashed_crossing()
layer_hashing()
layer_identity()
layer_integer_lookup()
layer_jax_model_wrapper()
layer_lambda()
layer_layer_normalization()
layer_lstm()
layer_masking()
layer_max_pooling_1d()
layer_max_pooling_2d()
layer_max_pooling_3d()
layer_maximum()
layer_minimum()
layer_multi_head_attention()
layer_multiply()
layer_normalization()
layer_permute()
layer_random_brightness()
layer_random_contrast()
layer_random_crop()
layer_random_flip()
layer_random_rotation()
layer_random_translation()
layer_random_zoom()
layer_repeat_vector()
layer_rescaling()
layer_reshape()
layer_resizing()
layer_rnn()
layer_separable_conv_1d()
layer_separable_conv_2d()
layer_simple_rnn()
layer_spatial_dropout_1d()
layer_spatial_dropout_2d()
layer_spatial_dropout_3d()
layer_spectral_normalization()
layer_string_lookup()
layer_subtract()
layer_text_vectorization()
layer_tfsm()
layer_time_distributed()
layer_torch_module_wrapper()
layer_unit_normalization()
layer_upsampling_1d()
layer_upsampling_2d()
layer_upsampling_3d()
layer_zero_padding_1d()
layer_zero_padding_2d()
layer_zero_padding_3d()
rnn_cell_gru()
rnn_cell_lstm()
rnn_cell_simple()
rnn_cells_stack()