Looking at Structured, Unstructured, and Semi-Structured Data (with examples)


Three kinds of data – structured, unstructured and semi-structured – are regularly used in data warehousing. They are typically used at distinct stages of processing, and different techniques are necessary to handle the three types. It’s common to convert between the three kinds of data while loading, transforming, and integrating.

How do you handle the three types of data to achieve optimal results? Let’s look at a worked example of the same set of data represented in all three different ways.

First, the IoT source

As my data source, I will start with a small data sample taken over 90 seconds from an industrial process. An IoT device is recording the rate of gas output from a catalyzed chemical reaction. The gas bubbles through an airlock with an audio transducer attached. The transducer outputs a continuous digital audio signal which is the input data for monitoring the reaction. The frequency of the bubbles measures how fast the reaction is going.

When viewed in a sound editor, the waveform looks like this:

The human eye is very good at detecting patterns, and to us it’s obvious that the signal above contains repeating peaks spaced about 5 seconds apart. Those are individual bubbles going through the airlock. We can easily pick out that feature despite the continual whine of a fan plus various other background noises.

But for automated monitoring and prediction, we need to be far more sophisticated. What's the exact frequency of the peaks? Is the reaction running at the expected speed? Is it speeding up or slowing down? How would a computer fare with this data?
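To make that concrete, here is a minimal sketch of the kind of peak detection involved. It runs on a synthetic stand-in for the recording (the real audio isn't reproduced here), and the sample rate, burst shape, noise level, and threshold are all invented for illustration:

```python
import numpy as np
from scipy.signal import find_peaks

# Synthetic stand-in for the 90-second recording: a short "bubble" burst
# every 5 seconds, buried in background noise. All parameters are invented.
rate = 1000                                  # samples per second
t = np.arange(0, 90, 1 / rate)               # 90-second timeline
rng = np.random.default_rng(0)
signal = 0.2 * rng.standard_normal(t.size)   # fan whine + ambient noise
for start in np.arange(2.0, 90.0, 5.0):      # one bubble every 5 s
    burst = (t >= start) & (t < start + 0.1)
    signal[burst] += np.sin(2 * np.pi * 60 * t[burst])

# Rectify and smooth into an amplitude envelope, then find peaks that
# clear a threshold and are at least 2 s apart.
envelope = np.convolve(np.abs(signal), np.ones(50) / 50, mode="same")
peaks, _ = find_peaks(envelope, height=0.5, distance=2 * rate)
peak_times = t[peaks]
intervals = np.diff(peak_times)

print(f"{peak_times.size} peaks, mean spacing {intervals.mean():.2f} s")
```

On this clean synthetic data the detector recovers the five-second spacing directly; on the real signal, the interfering sounds discussed later in this article would have to be dealt with first.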

Unstructured data

Of course, in data warehousing there's no such thing as entirely unstructured data. If it really were unstructured, it would be impossible to read and could never be used.

Rather than being completely unstructured, it’s more accurate to say that data warehouses often have to deal with data sources that are structured in complex ways. Often, that structure involves a proprietary or unusual format. Generally, it requires processing to interpret. In other words, unstructured data is hard to read!

As an example, from those 90 seconds of audio captured by the IoT device, here is a hex dump showing roughly the 1/100th of a second between 1:07.30 and 1:07.31.

f9 7d b5 20 73 d8 2f 5d d1 35 cb de 5f f4 9f ac cb 1f cd 51 41 44 f0 be c7 7e aa 29 06 d8 c4 69 e1 db 1b 3e 0b 5e ab da 4b 7a ce 80 9f 62 61 ed 42 b4 d5 51 96 d8 e9 38 c5 75 4e 51 ca 98 4e bf ed 19 7d 2e 2c 3d 9b 92 d5 64 6c 2e f5 99 29 75 ab ec b7 5a f4 39 39 30 ce 9d 91 61 4a d9 8e 57 2d 75 06 83 55 3e 3e 93 f1 c2 9d b3 48 ed f7 79 53 d9 7a 45 a8 8b 92 ef 10 57 df 95 70 44 f8 23 e5 c8 86 f5 0a 9b 94 e3 ed de 73 9c e7 44 8a 01 02 70 18 8c 8a 85 61 a1 18 1a cc 28 cd 4b 97 49 a6 5c 80 c1 48 87 8b ea 80 4d 46 1a 4a ad d5 2e 06 ba ac 42 96 b3 4d 1d 0c 61 9c 62 f5 33 62 fc 30 b1 1f af fa 68 7d 8a 58 43 a0 73 1d 07 e3 c1 b5 07 dc 3c 9f bc e9 74 e8 fe 63 bf 6f 24 9c 3b b1 db 53 24 6a 0c 2b 84 ff a2 a8 15 f5 f9 00 5b a5 0e 0f db d5 0b ff 26 1a 14 0c ad bb 9a a2 ba 52 73 37 20 d5 54 0c 31 92 9d d7 a1 91 54 6f 62 a7 35 43 a1 84 c9 99 cd 87 15 dd a1 cc f4 d6 81 d7 36 9e 07 a1 fe ed 65 7b 97 5f f7 f4 32 59 99 fe 15 31 64 ee 55 cf 15 3b f0 ff f1 4c 80 59 3f fc 21 1b 94 05 fc d4 97 ae f3 bf 8b bf 13 df eb f4 fa f8 ef a9 9a ae 72 b5 72 4b 2e 5c cb b4 a1 42 ca fc cd 6e f8 f7 f7 c6 83 c5 9c 54 7d 35 1a d3 c3 b4 9e 8e 69 da db 3b 45 b7 83 a7 ae 5d e9 88 66 4c e4 74 51 fa 5e e5 cc 73 78 bb 1a ae dc 71 7f 85 a7 6a fc 57 b0 de ef 37 f6 9b 70 73 3e

The above is only around 0.01% of the whole. The full recording is 3 MB – more than seven thousand times larger than this fragment.

The information must be in there somewhere, because the periodic peaks showed up clearly in the sound editor. But just looking at the bytes it is really difficult to establish that relationship. This is the first challenge for a computer dealing with “unstructured” data.

To the human eye, the bytes don't reveal any detectable pattern. But in reality the data is organized in a logical, predictable, and fully documented way. It does have structure. In this case, you would need to know that it's a stereo, 48 kHz audio stream, exported in the AAC format. It was evidently convenient for the transducer to produce the data in that way.

There are code libraries for parsing that kind of audio data, which would be the first step in processing it. In fact, the sound editor used one of those libraries to generate the waveform image I showed at the top.
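To illustrate what "interpreting the structure" buys you: once a decoder library (or a tool such as ffmpeg) has unpacked the AAC stream, the result is raw PCM, typically a sequence of interleaved per-channel samples. The sketch below assumes 16-bit little-endian samples (the article doesn't specify the decoded sample format) and fakes four frames of decoder output:

```python
import struct

# Hypothetical decoder output: 16-bit little-endian stereo PCM, one
# (left, right) pair per frame. These eight sample values are made up.
pcm = struct.pack("<8h", 100, -100, 200, -200, 300, -300, 400, -400)

# Each frame is 4 bytes: two signed 16-bit integers, left then right.
frames = struct.iter_unpack("<hh", pcm)
left, right = zip(*frames)

print("left channel: ", left)
print("right channel:", right)
```

The point is that "unstructured" audio becomes trivially parseable the moment the format is known; the hard part is knowing (or discovering) what the bytes mean.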

Too Much Information

In the next screenshot I have zoomed right in to highlight the same 1/100th of a second between 1:07.30 and 1:07.31. There’s no special feature at this point: it’s simply typical of the majority of the data and shows the ambient sound generated by an air fan spinning.

Note that, even at this large magnification, there's still a lot of information. It could be used to check the fan speed every 100th of a second throughout the sample. This part of the signal is completely irrelevant to the problem of peak detection: it's just some rather annoying noise. However, it would be music to the ears of the maintenance engineers, because a change in the fan speed probably means the fan is about to fail and should be replaced.
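A sketch of how that fan-speed check might work on one such 1/100 s chunk. The 48 kHz sample rate comes from the article, but the 200 Hz fan frequency and the noise level are invented:

```python
import numpy as np

# One 1/100 s chunk of synthetic "fan hum": a 200 Hz tone plus noise,
# at the article's 48 kHz sample rate. The tone frequency is invented.
rate = 48_000
window = int(rate * 0.01)                    # 480 samples
t = np.arange(window) / rate
rng = np.random.default_rng(1)
chunk = np.sin(2 * np.pi * 200 * t) + 0.3 * rng.standard_normal(window)

# Estimate the dominant frequency from the magnitude spectrum,
# skipping the DC bin.
spectrum = np.abs(np.fft.rfft(chunk * np.hanning(window)))
freqs = np.fft.rfftfreq(window, 1 / rate)
dominant = freqs[spectrum[1:].argmax() + 1]

print(f"dominant frequency ~ {dominant:.0f} Hz")
```

With only 480 samples the frequency resolution is 100 Hz, so a real monitor would likely average over longer windows, or track phase between chunks, to resolve small speed changes.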

Next, here's another zoomed-in view, this time around the 1:33 mark. You can just about tell by eye that the highlighted section is not a "bubble-shaped" peak. It's also out of sequence, arriving only two seconds before the next peak instead of around five.

In fact, this signal has nothing to do with the chemical reaction, either. It is the audio transducer picking up the click of a door closing in the distance. Perhaps of interest to the security team? But again, irrelevant to the problem of peak detection. Distinguishing between two fairly similar-looking events is a good example of a typical difficulty that a machine learning algorithm would need to overcome.
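One simple non-ML baseline for telling the two events apart is a shape feature such as the pulse's width at half maximum. The pulse shapes below are invented Gaussians, not the real waveforms; the point is only that a broad bubble and a sharp click separate cleanly on this one number:

```python
import numpy as np

# Toy version of the problem: a broad "bubble" pulse vs. a sharp door
# "click". Both shapes are invented Gaussians, not the real waveforms.
rate = 1000
t = np.arange(-0.5, 0.5, 1 / rate)

def width_at_half_max(pulse, rate):
    """Duration (s) for which the pulse exceeds half its peak value."""
    return np.count_nonzero(pulse > pulse.max() / 2) / rate

bubble = np.exp(-t**2 / (2 * 0.05**2))    # ~0.12 s wide at half max
click = np.exp(-t**2 / (2 * 0.002**2))    # ~5 ms wide at half max

print("bubble width:", width_at_half_max(bubble, rate))
print("click width: ", width_at_half_max(click, rate))
```

A fixed width threshold would of course be fragile on real, noisy audio with overlapping sounds, which is why the article points toward machine learning for this step.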

Clearly there’s a huge amount of information contained in the data, although most of it is not useful for monitoring the chemical reaction.

Unstructured data summary

Sound is just one example of “unstructured” digital data originating from analog sources. The example I have been using is a sound recording taken from a factory floor. Human conversation is another very common source of recorded sound. Two other big categories of “unstructured” analog media are images (such as photographs, medical scans, and handwritten documents) and video (such as security or traffic monitoring).

Many “unstructured” data formats also exist that originate from digital sources. Examples include:

Some general features are common to all “unstructured” data: