Visualizing IoT Data at Scale With Hopara and TimescaleDB
May 16, 2023
I have been involved in building DBMSs for 50 years. During that time, I have been the main architect behind PostgreSQL (the foundation of Timescale), Vertica, VoltDB, and Paradigm4. Recently, I have been studying user-facing problems, most notably, “How do users derive information from the massive amount of data we are collecting?”
Sometimes users know exactly what they are interested in, and a conventional dashboard (such as Tableau or Spotfire) can help them find it. Often, however, the real question behind a query a user wants to run is, “Tell me something interesting?” I.e., “Show me an actionable insight.” To provide these meaningful data insights, a sophisticated visualization system is needed to complement more traditional analytics systems.
I have long been a fan of Google Maps, which allows you to go from a picture of the Earth to the plot map on your street in 21 clicks. This is an exemplar of a “detail on demand” system, often called a “pan-zoom” interface. A big advantage of such systems is that no user manual is required since the interface is so intuitive. Unfortunately, Google Maps only works for geographic data. So, what to do if you have floor plans, 3D models, scatter plots, or the myriad of other representations that users want to see?
Hopara can be thought of as “Google Maps on steroids.” It will produce pan-zoom displays for any kind of data. It is especially applicable for real-time monitoring applications, often from IoT data collection or asset tracking of sensor-tagged devices.
Figure 1: Example of a Hopara app in the lab space (powered by Timescale)
Walking through the Hopara monitoring application powered by Timescale, you can see the benefit of Hopara visualization aided by a traditional analytics dashboard. The application reports vibration issues in sensor-tagged machines, and the real-time vibration data is stored in a Timescale database for effective real-time querying.
The Problem: Monitoring (Lots of) Sensor Data in Real Time
Let’s consider a French company that operates 58 factories in Brazil, manufacturing construction materials (think pipes, glue, and nails). This company is referred to as FC in the rest of this post.
FC is in the process of installing about 50K sensors from a Brazilian vendor, IBBX. These sensors primarily report vibration, typically at 100 msec intervals. Excessive vibration is often an early warning of machine failure or the need for urgent service. Hence, real-time monitoring is required for 50K time series of vibration data.
Unlike some applications that can aggregate data in the network from sensor to server, FC requires that details be available on all sensors so abnormal events can be reported on individual machines.
These abnormal events can include the following:
- Latest reading over a threshold
- Five sequential readings over five minutes above a second threshold
- One reading per minute over a third threshold for 10 minutes
- A reading over a fourth threshold more than once a day for a month
In addition, FC wants to filter machines by location (e.g., only those plants in São Paulo) and by time (e.g., report only weekend vibrations).
A time-series database is the optimal solution for storing the 50K time-series sensor data for a long period of time. Although this database is likely measured in GB, it is easy to imagine much larger applications.
In the rest of this blog post, I first talk about the FC reporting architecture and their future requirements. Then, I discuss the requirements for the database, and why Hopara ended up using Timescale.
From Data to Insights: Building Actionable Visualizations
FC runs a sophisticated analytics application provided by IBBX. It monitors vibrations and employs machine learning-based predictive analytics to forecast needed repair events. The IBBX dashboard for FC is shown in Figure 2. It shows “the details” using typical dashboard technology. This is very useful for initiating corrective action when required.
Figure 2: FC/IBBX dashboard using time-series sensor data in real tim
However, this dashboard does not show “the big picture” desired by the FC personnel, which is why IBBX recently partnered with Hopara to produce a “detail-on-demand” system for the same data. Figure 3 shows FC factories on a map of Brazil color-coded with their facility health.
Figure 3: The Big Picture, FC factories on a map of Brazil color-coded with their facility health
Notice that there are two factories with a yellow status. If an FC user wants to “drill into” one of them, the user will see Figure 4. Figure 4 shows greater detail about one yellow sensor in Figure 3, while Figure 5 shows the time series of sensor readings for the sensor in question over five days. Notice that, in a few clicks, a user can move from an overview to the actual time-series data.
FC users appreciate both the analytics dashboard and the Hopara drill-down system. As a result, IBBX and Hopara combined their software into a single system. The takeaway from this example is that there is a need for predictive analytics and insightful visualization. IoT customers should have both kinds of tools in their arsenal.
✨ Author’s Note: If you want to learn more about time-series forecasting, its applications, and popular techniques, check out this blog post.
Figure 4: The factory in question
We now turn our attention to response time. It is accepted pragma that the response time for a command to a visualization system must be less than 500 msec. Since it takes some time to render the screen, the actual time to fetch the data from a storage engine must be less than this number. The next section discusses DBMS performance considerations in more detail.
Behind The Scenes: Powering Real-Time Visualizations Using Timescale
Figure 5: More real-time details
As mentioned previously, these real-time views are powered by Timescale. Timescale is built on top of PostgreSQL, and it extends it with a series of capabilities that are extremely useful for this use case, such as automatic partitioning by time, boosted performance for frequently-run queries, and continuous aggregates for real-time aggregations.
To guarantee a real-time display, Hopara fetches live data from the database for every user command. Otherwise, stale data will be rendered on the screen. The screenshots above come from a real Hopara application interacting with a database. We note in Figure 2 that the alerting condition is the first one mentioned in a previous section (the latest reading over a threshold).
In other words, the display is showing a computation based on the most recent values from the various sensors. Specifically, Hopara uses a database with the schema shown in Figure 7, including a Readings table with all the raw data. This is connected to a Sensor table with the characteristics of each individual sensor.
Then, the display in Figure 2 requires fetching the most recent reading for each sensor. This can be produced by running the following two-step PostgreSQL query:
– get the latest reading timestamp + sensor_id
SELECT max(timestamp) as timestamp, sensor_id
GROUP BY sensor_id
— get the reading value
SELECT l.timestamp, l.sensor_id, r.value FROM latest l
INNER JOIN readings r ON r.sensor_id = l.sensor_id AND r.timestamp = l.timestamp
This query, while easy to understand, is not efficient. The following representation leverages PostgreSQL distinct on and Timescale skip scan to perform the query faster, so it is a preferred alternative.
SELECT DISTINCT ON (sensor_id) *
WHERE timestamp > now() - INTERVAL '24 hours'
ORDER BY sensor_id, timestamp DESC;
Note that condition one can be triggered inadvertently, for example, by a worker brushing against the machine. Hence, some combination of conditions 2-4 is a more robust alerting criterion. Unfortunately, these require complex real-time aggregation, which is not present in PostgreSQL. As a result, Hopara switched to TimescaleDB, which extends PostgreSQL with these capabilities (through its continuous aggregate capabilities).
Figure 6: The actual data
Figure 6 displays five days of data for three sensors. Since each sensor is reporting every 100 msec, there are around 4.5M observations in a five-day window. Obviously, this level of granularity is inappropriate for the display presented. In addition, there is no way to produce the display within the required response time. Hence, Hopara aggregates the raw data into two-minute averages using Timescale’s continuous aggregates (Figure 5 displays these averages).
Timescale’s hypertables automatically partition big tables, so it’s easier for the query planner to localize time-series data. This greatly accelerates queries such as the one required to produce Figure 5, another reason for the switch from PostgreSQL to Timescale.
Figure 7: The database schema
Hopara and Timescale are powerful tools for IoT applications like the types discussed in this blog post.