To: squeak-dev at lists squeakfoundation org From: Richard A. O'Keefe <ok at cs otago ac nz> Subject: Design Patterns and Collection Classes Date: Thu Aug 29 02:50:01 CEST 2002 "David Griswold" <David.Griswold at acm org> wrote: So if there is any overall message here, it is: be very careful with code in classes like this, and show great restraint by making changes as minimal as possible. The code in Collection is simple, but not simple-minded. Don't underestimate the subtlety of why these simple-seeming methods are the way they are. In another context, Brian Foote and Joseph Yoder wrote [http://www.laputan.org/mud/mud.html capitals are theirs and were links]: ... when you are prototyping a system, you are not usually concerned with how elegant or efficient your code is. You know that you will only use it to prove a concept. Once the prototype is done, the code will be thrown away and written properly. As the time nears to demonstrate the prototype, the temptation to load it with impressive but utterly inefficient realizations of the system's expected eventual functionality can be hard to resist. Sometimes, this strategy can be a bit too successful. ... Does this sound a bit like the history of Smalltalk? It should. A whole new language every two years. A research goal that was not "how do we build ideal collection classes" but "how do we take the idea of object orientation and build a language and system around it that can be understood ("Personal Mastery") and used in powerful ways, how can we develop a whole new approach to programming?" ... Time, or a lack thereof, is frequently the decisive force that drives programmers to write THROWAWAY CODE. Taking the time to write a proper, well thought out, well documented program might take more time than is available to solve a problem, or more time than the problem merits. Often, the programmer will make a frantic dash to construct a minimally functional program, all the while promising him- or herself that a better factored, more elegant version will follow thereafter. They may know full well that building a reusable system will make it easier to solve similar problems in the future, and that a more polished architecture would result in a system that was easier to maintain and extended. Quick-and-dirty coding is often rationalized as being a stopgap measure. All too often, time is never found for this follow-up work. The code languishes, while the program flourishes. THEREFORE [and this is their recommendation], produce, by any means available, simple, expedient, disposable code that adequately addresses just the problem at hand. ... while this approach [that is, the approach used in a particular example, not throwaway code as such] is, in general, a lousy way to address this [particular example's] problem, it is perfectly satisfactory within the confines of the particular purpose for which [it] has ever actually been used. Such practical constraints are typical of THROWAWAY CODE, and are more often than not undocumented. For that matter, everything about THROWAWAY CODE is more often than not undocumented. When documentation exists, it is frequently not current, and often not accurate. Does this sound like #removeAll: and #addAll:? It should! ... [Two example] systems might be good candidates for RECONSTRUCTION, were the resources, interest, and audience present to justify such an undertaking. In the mean time, these [particular] systems, which are still sufficiently well suited to the particular tasks for which they were built, remain in service. Keeping them on the air takes far less energy than rewriting them. They continue to evolve, in a PIECEMEAL fashion, a little at a time. Doesn't this sound like everything about Squeak? From the core classes to Morphic? Isn't piecemeal (community) improvement part of the point of Squeak? So David Griswold claims that the Collection classes are subtle, simple but not simple-minded, and that the methods (including the ones I claim are broken) are the way they are for excellent reason and should not be altered without great [indeed, to my thinking, extreme] restraint, while I claim that the Collection classes are a "Ball of Mud" where most methods are the simplest thing that could possibly work, except they don't. Obviously, eyewitness testimony from the original developers could settle this. Except that it couldn't; it could only tell us how they remember what they thought they were doing. We need some more objective criterion to help us decide what kind of code we are dealing with here. If we are dealing with entire and perfect chrysolite, what we need is documentation. If we are dealing with mud, we need documentation and repairs. So let's look away from #removeAll: and #addAll:. Let's even look away from #union:, #intersection:, and #difference:, enchanting though their perversity may be. ---------------------------------------------------------------- A Design Principle. ---------------------------------------------------------------- Every class should have a class invariant. It does not need to be expressed as code, but it needs to be thought about and written down. Initialisation methods have the obligation of setting up an object's state so that its class invariant is true. Other (externally accessible) methods get to assume that the invariant is true and have the obligation to ensure that it is still true when they complete. If you cannot rely on sufficient knowledge about the object's state at the beginning of a method, the only thing you can do to restore the invariant is reinitialise. If the invariant of a class depends on the state of a mutable object, and that object is accessible outside this class, then you DON'T have an invariant and you cannot expect your methods to work. So the design principle is: a class's invariant should depend only on the values of its instance variables and the states of such objects as are immutable or are not shared with other objects. ---------------------------------------------------------------- The Collection Classes ---------------------------------------------------------------- Do the collection classes heed this warning? Many of them do. OrderedCollection, for example, never hands out its array, and does not provide any non-private methods for changing its instance variables. (At least, by putting methods in the 'private' category it is warning "don't call these lest the heavens fall".) Set, for another example, depends on the exact state of its array, but never hands out access to that array. but it also depends on the state of the elements; it requires that the hash code of an element not change. Oddly enough, it does provide a repair procedure that can be called. You could mutate the elements of a Set, taking care that the Set itself was not used while this was happening, and then invoke the #rehash method. Except that #rehash is private. With the exception of numbers and symbols, there are no immutable values that one might use as keys. There is no equivalent of OPAL's InvariantArray, for example. (I'm trying to write one and it's a nightmare. Array has so many methods that think it's OK to smash, even when they are making copies.) But what about Dictionary? Dictionary cannot make up its mind whether it's about Sets of Associations or keyed collections of values, and tries to be both. Considered as a keyed collection of values, a Dictionary's Associations are part of its private internal structure. Dictionary's invariant concerns the structure of its hash table, which depends on the state of these Associations. Yet it is perfectly happy to hand those Associations out to anyone who asks. Consider d := Dictionary new. d at: #red put: Color red. d associationsDo: [:each | each key: #green]. d inspect ==> (shows #green -> nil) d at: #green ==> (Error: key not found) d at: #red ==> (Error: key not found) Had Dictionary been designed with class invariants in mind, it could still have looked much the way it does now, but the Associations it hands out would be InvariantKeyAssociations, where Association subclass: #InvariantKeyAssociation instanceVariableNames: '' classVariableNames: '' poolDictionaries: '' category: 'Collections-Support' "private methods" pvtKey: k value: v key := k. value := v. "accessing methods" key: k self shouldNotImplement "class instance creation methods" key: k value: v ^self new pvtKey: k value: v Except that this won't work; practically the only time I ever really need 'pvt' methods is so that a class can initialise a new instance without the initialisation method being exposed to another class, but the 'pvt' machinery will not let you do this: "private messages may only be sent to self". If only there were a category of private messages that a class could send to 'self new'. So the method really has to be called #privateKey:value: and we just have to hope that no outsider will use it. So much for private methods. It's all very well to say that competent programmers wouldn't even try to change the key of an association that came from a dictionary, but if there are two things that the last fifty years of programming and the recent history of the Internet have taught us, they are - everyone makes mistakes - some people deliberately try to break other people's programs. Oh yes, what if no Association is changed, but a key is? Just like Set, Dictionary could let you make a series of changes to keys (as long as no previously unequal keys became equal) and then repair the state of the dictionary using #rehash, except that #rehash is a private method. (Which means you can do this, but you shouldn't.) #rehash depends on a rather weaker invariant: it requires that keys still be non-nil and unequal, but it doesn't require that their values or hashes be unchanged. Is there another Collection class that exposes its internal structure? Yes, MappedCollection. MappedCollection collection: c map: m in effect constructs the composition of two tabulated functions: (MappedCollection collection: c map: m) at: index = c at: (m at: index) The base collection c and the map m are not copied when you do this. Indeed, that's the point. But it leads to some very odd consequences. For example, i ~= j and: [ (mc at: j) ~= 0 and: [ mc at: i put: 0. mc at: j = 0]] can come out true. For every other keyed collection, setting the element at one key doesn't alter any elements at other keys. So don't try to sort a MappedCollection! It's also the only Collection class I can find in Squeak 3.2 where (coll contents) is not the receiver but a copy with different properties. One of the nasty things about the fact that the collection and mapping are shared with unknown numbers of other objects is that foo: aCollection bar: aBlock |index value| index := aCollection size. value := aBlock value: (aCollection at: index). aCollection at: index put: value may fail, even though aBlock value: ... doesn't send any messages whatsoever to aCollection. How? aBlock could make the mapping smaller, or alter its last element so that it is no longer a valid key for the other collection, or could alter the other collection so that the key disappears, &c. Oh yes, imagine the fun when you do a := #(3 1 4 1) copy. m := MappedCollection collection: a map: a. m ==> a MappedCollection(4 3 1 3) m at: 2 put: 2. m ==> a MappedCollection(1 2 1 2) m at: 3 put: 3 m ==> a MappedCollection(1 2 3 4) a ==> #(2 1 4 3) MappedCollection tacitly assumes that the collection and mapping are different objects or that neither will be changed, but it does not say this anywhere. At the very least the class comment for MappedCollection should be changed from I represent an access mechanism for a sequenceable collection re-ordering or filtering its elements. (which is quite untrue because it doesn't in the least depend on the collections being sequenceable) to I represent the composition of two collection. (MappedCollection collection: c map: m) at: index is always the same as c at: (map at: index). The collection and map could be any combination of sequenceable collections and dictionaries as long as the elements of m all are (and remain) acceptable keys for c. My instances are sequenceable precisely when their maps are sequenceable. If you plan to use #at:put: with one of my instances, the collection c and map m should be different objects. Note that the collection c and map m are not copied, but are shared. If you change them, the mapped collection you made from them will also change. Changes to the elements of the collection are normally safe; changes to the key space of the collection or the elements of the map may make the mapped collection invalid. Do not change the map while you are iterating over the map or the mapped collection. Hey, there's another oversight, found while writing that new class comment. There's a missing method: "in category 'testing'" isSequenceable ^map isSequenceable Alternatively, if the (otherwise pointless and unnecessary) restriction to sequenceable collections is to be preserved, then #setCollection:map: should be setCollection: aCollection map: aDictionary self assert: [aCollection isCollection]. self assert: [aDictionary isSequenceable]. domain := aCollection. map := aDictionary except that the names of the method parameters (which I have not changed) strongly suggest that the map should be allowed to be a dictionary, which is not sequenceable. But the instance creation method calls the very same parameter aSequenceableCollection. Whatever this code may be, it is not 'subtle'. It's confused and confusing. Are there any other collection classes that violate this design principle? Yes, there is one that doesn't so much violate it as cut its breast off and eat half of one of its kidneys: LinkedList. Most container libraries regard a linked list as a container for general objects, with the link nodes being part of the private structure of the list. From the outside, it should not matter whether the nodes are allocated one at a time, or in chunks (a good way to combine the flexibility of lists with the space efficiency of arrays is to have a linked collection of arrays; the well known "piece table" representation used in some editors is related to this). However, Smalltalk LinkedLists are very different: all their elements must be Link nodes. This makes them utterly unlike arrays; any number of arrays may refer to the same object as an element, and an array may include an object as element any number of times. Not so LinkedLists: each Link may be an element of at most one LinkedList, and then at most once. In other words, LinkedList does not encapsulate. Serious weirdness can result. Consider c1 add: c2 first. where c1 and c2 are separate collections sharing no objects at all before this message is sent. If c1 is any other kind of collection that can support #add:, this is fine. Now c1 and c2 have a common element. But if c1 and c2 are both LinkedLists, now a node is shared between the two lists. While #add: is defined to be #addLast:, it need not actually true that c2 first is now the last element of c1! (Because c2 might have had more than one element.) I could multiply examples of how easy it is to break LinkedLists without exploiting (initial) aliasing, but the main consequence is that Smalltalk programmers should never ever use LinkedList, not even after making their own subclasses of Link. Arguably, LinkedList should be outside the Collection hierarchy. (As noted in an earlier message, LinkedList is amongst the SequenceableCollections, but does not support #at:, surely a defining characteristic of such collections.) Make no mistake: - I am not saying Smalltalk is unusable. - I am not saying the Collection classes are unusable. - I am not saying that the Smalltalk designers are culpable. What I am saying is that the Smalltalk designers weren't trying to produce highly polished code but were engaged in research of a completely different kind, so that they produced 'good enough for the immediate task' code which after 20-odd years outside the lab is overdue for an overhaul. The Collection classes are a means to an end, not Holy Writ, and mortals may dare to criticise and revise them.