The source code for this blog is available on GitHub.


Sailing in Dark Waters: Quantifying Music Complexity

When it comes down to music, complexity is tricky to measure, not only because of subjectivity but also because of the many aspects that would have to be combined for a robust and concise estimate. Complexity could be defined, for example, in terms of parameters from music theory (e.g., harmony, rhythm), lyrical aspects such as poetic synthesis, and so on. Because of that, we decided to analyze solely the lyrical aspect of the songs. Although I don't see much point in measuring and comparing music this way - other than as an attempt to diagnose symptoms of media massification such as cultural impoverishment - we still defined complexity in terms of two measures and analyzed how they behave through the years.

Lexical Richness

A simpler way to define a measure for lyrics complexity is to compute the ratio between the number of distinct terms and the total number of terms in a document. This measure is mostly known as Lexical Richness ($L_R$).
"Lexical richness is a relation established between the number of repeated (and different) words of a text and its total number of words." [1]. That is, $L_R = \frac{types}{tokens}$, where:
$types$: the number of distinct words in the document.
$tokens$: the total number of words in the document.
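To make the definition concrete, here is a minimal Python sketch of the type-token ratio. It is only an illustration, not the code used in the analysis: the naive word tokenization and lowercasing are assumptions, and the actual preprocessing of the lyrics may differ.

```python
import re


def lexical_richness(lyrics: str) -> float:
    """Type-token ratio: distinct words (types) divided by total words (tokens)."""
    # Naive normalization: lowercase and keep runs of word characters.
    tokens = re.findall(r"\w+", lyrics.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# "mais" appears twice, so the ratio drops below 1.
print(lexical_richness("olha que coisa mais linda mais cheia de graça"))  # ~0.89
```

Values close to 1 indicate little repetition, while heavily repeated choruses push the ratio down.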

Information content

As a more sophisticated approach to quantifying complexity, I thought about computing the average entropy of the song lyrics released in a given period. Since entropy quantifies the average information content of a document, I considered it a promising measure for an analysis through time.
Let $h(w_i) = \log_{2}(1/P(w_i))$ be the (Shannon) information content of a word $w_i$ in a document $d_k$.
We can then compute the average information content of a document, $H(d_k)$, or entropy (measured in bits), such that:
$H(d_k) = \sum_{i=1}^{j} P(w_i) \times h(w_i)$, $\forall w_i \in d_k = \{w_1, w_2, \ldots, w_j\}$.
Considering:
$V_i = \{w_1, w_2, \ldots, w_n\}$: the set of words (or vocabulary) of a dataset.
$Q_i = \{q_1, q_2, \ldots, q_n\}$: the respective occurrence counts of the dataset's words.
$P_i = \{p_1, p_2, \ldots, p_n\}$: the ratio of each word's occurrence count $q_i$ to the vocabulary length $|V|$, such that $p_i = \frac{q_i}{|V|}$.
$D_i = \{d_1, d_2, \ldots, d_k\}$: the set of documents.
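Putting the pieces together, a minimal sketch of the entropy computation could look like the following. It assumes $P(w_i)$ is estimated as the relative frequency of $w_i$ among the document's own tokens; the definitions above are stated at the dataset level (normalizing by $|V|$), so the exact probability estimate used in the original analysis may differ.

```python
import math
import re
from collections import Counter


def entropy_bits(document: str) -> float:
    """Average information content H(d_k) of a document, in bits."""
    # Same naive tokenization as before; the real pipeline may differ.
    tokens = re.findall(r"\w+", document.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    h = 0.0
    for word, count in counts.items():
        p = count / total              # P(w_i): relative frequency (assumption)
        h += p * math.log2(1.0 / p)    # P(w_i) * h(w_i)
    return h


lyrics = "vou pedir pra você voltar vou pedir pra você ficar"
print(f"{entropy_bits(lyrics):.2f} bits")  # higher entropy = less repetitive lyrics
```

Averaging $H(d_k)$ over all songs released in a given period then yields the time series used to track how information content changes through the years.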