Dr. Dobb's Journal - January 2008 - (Page 22)

Core Technology
TAG CLOUDS: USABILITY AND MATH
continued from page 20

• As with other graphs, there is a limit to the density of information in a tag cloud. According to Wikipedia, tag clouds generally contain between 30 and 150 tags. Usability clearly sets an upper limit on the number of tags. Moreover, the page layout can restrict the space available for the tag cloud. It is therefore necessary to allow for an imposed maximum length for the dataset.
• Some texts may not be interesting to users and should be omitted from the tag cloud. This is the case for articles and other small words that search algorithms treat as “noise.” If your data contains tags such as these, you might want to filter the results.
• Many tag clouds present information calculated over a period of time, such as the number of times that search terms have been used in the last 24 hours. Depending on the data, your function may take extra parameters with which you restrict the aggregation of data to a (progressive) subset.

source data from the remaining functional layers of the tag cloud, then you already have a better design than the average tag cloud example found on the Internet.

Linearization

For the purposes of illustration, I created a dataset of well-known authors in our field, with the number of hits these names score in a Google search. When I use the raw data to create a tag cloud, I get the result in Figure 2(a). The tag cloud presents most of the names in approximately the same size. Only some names jump out, and some are nearly illegible. The reason is that the weights are not distributed evenly over the range of the source data. Most of the authors on my bookshelf have (roughly) the same number of Google hits. Only some authors have either very many or very few hits.
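The article does not show how the raw hit counts were turned into font sizes. One plausible reading, assumed here purely for illustration, is a direct proportional scaling between the smallest and largest weight. The sketch below uses a hypothetical helper named FromRawWeights (not one of the article's listings), with the same parameter shape as Listing Two, to show why a few extreme weights squeeze everything else together.

```vbnet
Imports System
Imports System.Collections.Generic

Public Module RawScaling
    ' Hypothetical helper, not from the article: maps each weight to a font
    ' size in direct proportion to its position between the smallest and
    ' the largest weight. No linearization is applied.
    Public Function FromRawWeights( _
            ByVal weights As ICollection(Of Decimal), _
            ByVal minSize As Decimal, ByVal maxSize As Decimal) _
            As ICollection(Of Decimal)
        Dim output As New List(Of Decimal)
        If weights.Count = 0 Then Return output

        ' Find the smallest and the largest weight.
        Dim lowest As Decimal = Decimal.MaxValue
        Dim highest As Decimal = Decimal.MinValue
        For Each w As Decimal In weights
            If w < lowest Then lowest = w
            If w > highest Then highest = w
        Next

        Dim spread As Decimal = highest - lowest
        For Each w As Decimal In weights
            If spread = 0 Then
                ' All weights are equal: use the middle size.
                output.Add((minSize + maxSize) / 2)
            Else
                ' Scale the weight's position in the range to a font size.
                output.Add(minSize + (w - lowest) / spread * (maxSize - minSize))
            End If
        Next
        Return output
    End Function
End Module
```

With the weights 1, 2, 3, 4, and 100 and sizes from 10 to 30, the first four tags all come out below 11 and only the last one reaches 30. That is the same kind of bunching the text describes for the raw dataset.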
The dataset appears to follow a normal (or Gaussian) distribution, of which you can see examples in Figure 3. To get a more evenly distributed range of font sizes in the tag cloud, it is necessary to “linearize” the original values. You get a better result when you use a linearized representation, as in Figure 2(b). Technically, linearization means that the weights become less accurate. But because the tags have differing word lengths, there is already no such thing as an accurate reflection of the weights. Here, we are interested in usability, not accuracy.

The Pareto distribution, or “80–20 rule” (see Figure 4), is also frequently encountered. In this distribution, 80 percent of the weights are in the lowest 20 percent of the range, while the other 20 percent fill the remaining 80 percent of the range, or the other way around. Well-known examples of this distribution include wealth among people, popularity of websites, and the frequency of words in the English language.

You need to select the right linearization algorithm for your dataset. In Figure 2(c), my dataset (which contains a normal distribution) is linearized as if it contained a Pareto distribution. The result can be weird when you select the wrong distribution model. Strangely enough, I’ve noticed several authors doing exactly the opposite: they linearized datasets that contained Pareto distributions assuming (unknowingly, I suppose) that they

Eventually, you will create one or more functions that resemble Listing One. Your architecture for data access is hopefully more sophisticated than this simple example.
But if you separate the construction of

Listing One

    ' Requires: Imports System.Data and Imports System.Data.SqlClient
    Public Function GetWriters(ByVal maxCount As Integer, _
            ByVal ignoreNoise As Boolean, ByVal fromDate As DateTime, _
            ByVal toDate As DateTime) As DataTable
        Dim query As String = String.Format( _
            "SELECT * FROM (SELECT TOP {0} ID, Text, " & _
            "Count FROM Writers ORDER BY Count DESC) sub " & _
            "ORDER BY Text ASC", maxCount)
        'TODO: also filter on ignoreNoise, fromDate and toDate
        Dim adapter As New SqlDataAdapter(query, _ConnectionString)
        Dim table As New DataTable
        adapter.Fill(table)
        Return table
    End Function

Listing Two

    ' Requires: Imports System.Collections.Generic
    Public Shared Function FromBellCurve( _
            ByVal weights As ICollection(Of Decimal), _
            ByVal minSize As Decimal, ByVal maxSize As Decimal) _
            As ICollection(Of Decimal)
        'First, calculate the mean weight.
        Dim meansum As Decimal = 0
        For Each w As Decimal In weights
            meansum += w
        Next
        Dim mean As Double = meansum / weights.Count
        'Second, calculate the standard deviation of the weights.
        Dim sdsum As Double = 0
        For Each w As Decimal In weights
            sdsum += (w - mean) ^ 2
        Next
        Dim sd As Double = ((1 / weights.Count) * sdsum) ^ 0.5
        'Now calculate the slope of a straight line from -2*sd to +2*sd.
        Dim slope As Double
        If sd > 0 Then
            slope = (maxSize - minSize) / (4 * sd)
        End If
        'Get the value in the middle between minSize and maxSize.
        Dim middle As Double = (minSize + maxSize) / 2
        'Calculate the result for the given deviation from mean.
        Dim output As New List(Of Decimal)
        For Each w As Decimal In weights
            If (sd = 0) Then
                'With sd=0 all tags have the same weight.
                output.Add(CDec(middle))
            Else
                'Calculate the distance from mean for this weight.
                Dim distance As Double = w - mean
                'Calculate the position on the slope for this distance.
                Dim result As Double = slope * distance + middle
                'If the tag turned out too small, set minSize.
                If result < minSize Then result = minSize
                'If the tag turned out too big, set maxSize.
                If result > maxSize Then result = maxSize
                output.Add(CDec(result))
            End If
        Next
        Return output
    End Function
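Read as a formula, Listing Two amounts to the following. The symbols are notation introduced here, not the article's: w_i are the weights, n is their count, and s_min and s_max stand for minSize and maxSize.

```latex
\begin{aligned}
\mu &= \frac{1}{n}\sum_{i=1}^{n} w_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\,(w_i-\mu)^2},\\[4pt]
s_i &= \min\!\Bigl(s_{\max},\ \max\!\Bigl(s_{\min},\
      \frac{s_{\max}-s_{\min}}{4\sigma}\,(w_i-\mu)
      + \frac{s_{\min}+s_{\max}}{2}\Bigr)\Bigr)
      \quad \text{if } \sigma > 0,\\[4pt]
s_i &= \frac{s_{\min}+s_{\max}}{2} \quad \text{if } \sigma = 0.
\end{aligned}
```

As a small worked example with made-up numbers, take the weights 2, 4, 4, 4, 5, 5, 7, 9 with s_min = 10 and s_max = 30. The mean is 5 and the standard deviation is 2, so the slope is 20 / 8 = 2.5 and the middle size is 20. The resulting sizes are 12.5, 17.5, 17.5, 17.5, 20, 20, 25, and 30. Any weight more than two standard deviations from the mean would be clamped to s_min or s_max.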