Internship Report of Jerome Bou Aziz (June 1999): I was assigned to work on EiffelBench (known as ebench). Thus I joined a team composed essentially of Emmanuel Stapf here in Santa Barbara, but also Michael in Germany, Eric in France, David Schwartz (for time to time on VMS) and from time to time, depending on the good will of the INS officer in charge, Emmanuel Texier (for the run-time). The first challenge was to get to know ebench deep enough to be able to work on it: it is a 2500 classes application !!! Then my work was composed of various little tasks to improve the compiler, its speed or some bug fixes. To make short a long story, I choose to present in the rest of this document one of those tasks: the rewriting of the dead code removal. ============================================================================== Table of contents Introduction The structure of an Eiffel program seen from the compiler The feature id The body id The routine id A few word about dependencies What are the dependencies ? Structure of a server Structure of a FEATURE_DEPEDENDANCE The dead code removal Introduction The algorithm itself The procedure record The marking process Marking the feature Propagation to the clients Propagation to the heirs Problem due to the incrementality of the compiler Upgrade of the DEPEND_UNIT Upgrade of the CLASS_DEPENDANCE Conclusion ============================================================================== Introduction The main problem with the former dead code removal degree was not a lack of efficiency in removing dead code but how slowly it was doing it. The challenge for the new dead code removal was to keep doing exactly the same thing but much faster. In fact the rewriting of the dead code removal took place in a larger process of improvement of the compilation speed. A big customer had a huge system of more then 3000 classes that took more than 110 hours to finalize. He wanted to be able to compile it overnight. At the end of the process, he was able to compile his system in a bit more than 13 hours. The new dead code removal itself took nearly 2 weeks to come to what it is today. A few words about the structure of an Eiffel system viewed from inside the compiler...Got your id ? An Eiffel program is a set of clusters, each of one is a set of classes, each of one is a set of feature. Each feature can be either a function or an attribute. Inside the compiler the notion of cluster is an artificial notion that we forget very early. In fact this notion is just taken into account in the very first degrees of compilation (6 and 5: the parsing degrees) in case of an override cluster (if two features have the same name, one in the override cluster, one somewhere else, then it is not an error: the feature in the override cluster in the right one, the over is to be forgotten). During the interesting degrees (the compiling ones), the system is viewed as a set of classes each being a set of features. Internally, this means that a system is a set of CLASS_C characterized by a unique identifier called CLASS_ID. A CLASS_C contains a FEATURE_TABLE and an ORIGIN_TABLE which list all the features in the class. The feature id In the FEATURE_TABLE, each feature is identified by a FEATURE_ID The FEATURE_TABLE is a structure whose interest is local to the class: the FEATURE_ID is an identifier for a feature in a class, the same FEATURE_ID in different classes designates different features. Even an inherited feature does not have the same FEATURE_ID in the parent class and in the heir class. The body id In the ORIGIN_TABLE, each feature is identified by a BODY_ID. On the contrary of the FEATURE_ID, the BODY_ID is a global identifier: two features with the same BODY_ID, even in two different classes, are always the same. A BODY_ID is set to identify a feature body (as suggested in the name). It means that when you inherit a feature, even if you rename it, it will keep the same BODY_ID in the heir class. But, obviously, if you redefine it the body of the feature changes, so does the BODY_ID: an other one will be created for the new feature. The same thing happens when you modify a feature: it will be incrementally recompiled and as you changed its body, it will be assigned a new BODY_ID. The routine id So we have a local id, a global id attached to the body of a feature, but for the moment we have no way to follow the inheritance tree, there is no common id to the features of a same tree unless we do not redefine anything. An id common to all the feature of a same line, redefined or not exists: it is called the ROUTINE_ID because it characterizes a routine line: all the heirs of a same routine have the same ROUTINE_ID. We do not have to worry about the heir of an attribute because an attribute cannot be redefined. But what happens in case of an inheritance diamond? Two features with the same name will become one in the heir. Concerning the ROUTINE_ID, we have to keep a trace of both lines. We do that by assigning a list of ROUTINE_ID to the heir feature: in this case the first id in the list should be the one corresponding to the selected feature. Note that the table indexed by the ROUTINE_IS is the POLY_TABLE which is filled at degree minus 2. A few word about dependencies What are the dependencies? By dependencies, we mean clientelism dependencies. A class A depends on a class B if and only if a feature of A uses a feature of B. In other words, A depends on B if and only if A is a client of B. The dependencies are stocked in what we call a server. In ebench, a server is a structure you can use nearly as a hash-table in the sense that it is some data indexed by a key you can choose freely. Contrary to the hash-table, the server is designed to be able to manage huge amounts of data using a limited amount of memory. To accomplish that, it acts like a database: it stocks all the data on disk using a cache in memory to make the consultation faster. Structure of the server The server we use for the dependencies is called DEPEND_SERVER. It stores CLASS_DEPENDANCE indexed by their CLASS_ID. A CLASS_DEPENDANCE is a class that stores the dependencies of each and every of the features of the class it is linked to. Technically, a CLASS_DEPENDANCE is a hash-table (precisely an EXTEND_TABLE which is a specialized HASH_TABLE inheriting form nothing and modified to give a maximum of performance in the server case) storing FEATURE_DEPENDANCE indexed by the BODY_ID of the studied feature. To each feature in the class, corresponds a FEATURE_DEPENDANCE in the CLASS_DEPENDANCE. Note that it is possible to search for a FEATURE_DEPENDANCE in a CLASS_DEPENDANCE by BODY_ID of course but also by its name. Structure of a FEATURE_DEPENDANCE A FEATURE_DEPENDANCE is a set of DEPEND_UNIT; each DEPEND_UNIT gives all the information you need to identify in every table the feature it represents. Each DEPEND_UNIT in the FEATURE_DEPENDANCE stands for a feature on which depends the owner of the FEATURE_DEPENDANCE. Note about the DEPEND_UNIT: as you can see, there is some redundancy in the DEPEND_UNIT structure: we have several identifiers for a same feature whereas each of them stands for being globally unique. This was not the case before; only one unique identifier was kept which let us retrieve the description of the feature and so the rest of the identifiers. But this process turned out to be really slow because of disk access and garbage collector overuse. The dead code removal The dead code removal is mainly accomplished by two classes FEAT_ITERATOR and REMOVER. To eliminate the dead code, we use the DEPENDANCE information. To work, the algorithm needs 3 structures: * the control list: it is the list of the DEPEND_UNIT still to be treated. * the marked_table: it is an array of ROUT_ID_SET (list of ROUTINE_ID) indexed by BODY_ID. As we have seen, a feature can have more than one ROUTINE_ID (case of an inheritance diamond). This structure allows us to know if a feature of a certain BODY_ID has already been treated, under the view of what ROUTINE_ID it was considered. * the used_table: it is an array of BOOLEAN indexed with BODY_ID. At the end of the degree, this table will tell us which feature is alive. The algorithm itself The algorithm consists on going through the program seen as a tree and to mark all the feature we can reach from the root (creation feature in the root class of the system). At the end, all the features reached will be marked in the used_table. The features not marked in the used_table are dead and should not be generated. The procedure record The entry point in the algorithm is the procedure record: it takes the next element in the control list and if this element is not already alive, it launches the marking process on it and goes on and on until the control list is empty. Note that as all the attributes have to be generated anyway, we do not consider them in the dead code removal. record (feat: FEATURE_I; in_class: CLASS_C) is -- Record Eiffel routines reachable by feature `feat' from -- static type `in_class'. local next: FEATURE_I; traversal_unit: TRAVERSAL_UNIT; a_feature: FEATURE_I; dep: DEPEND_UNIT; static_class: CLASS_C; body_id: BODY_ID do from !! dep.make (in_class.id, feat); control.extend (dep) until control.empty loop dep := control.item; body_id := body_index_table.item (dep.body_index); if not (dep.rout_id.is_attribute or else is_alive (body_id.id)) then mark (dep.feature_id, dep.body_index, body_id, dep.id, dep.written_in, dep.rout_id) end; control.remove end ensure control_empty: control.empty end; The marking process The marking process works in 3 phases: 1) marking the feature alive 2) propagation to the clients 3) propagation to the heirs 1) Marking the feature If the feature is not already alive, then we simply mark it alive. Note: As far as the dead code removal is concerned, a feature that arrives to this point is never alive. But the function mark is also used in the array optimization context and there we do not know. mark_and_record ( feature_id: INTEGER; body_index: BODY_INDEX; body_id: BODY_ID; actual_class_id: CLASS_ID; written_class_id: CLASS_ID) is -- Mark feature `feat' alive. -- (export status {NONE}) local depend_list: FEATURE_DEPENDANCE; original_dependances: CLASS_DEPENDANCE; just_born: BOOLEAN; like_feature: LIKE_FEATURE; like_feat: FEATURE_I; feature_name: STRING; static_class, written_class: CLASS_C; depend_feature, original_feature: FEATURE_I; a_class: CLASS_C do just_born := not is_alive (body_id.id); debug ("dead_code") io.putstring ("MARKING: "); a_class := actual_class_id.associated_class; io.putstring (a_class.feature_table.feature_of_body_id (body_id).feature_name); io.putstring (" (bid: "); io.putint (body_id.id); io.putstring ("; fid: "); io.putint (feature_id); io.putstring (") of "); io.putstring (a_class.lace_class.name); io.putstring (" ("); io.putint (a_class.id.id); io.putstring (") originally in "); a_class := written_class_id.associated_class; io.putstring (a_class.lace_class.name); io.putstring (" ("); io.putint (a_class.id.id); io.putstring (")%N") end; if just_born then mark_alive (body_id.id); original_dependances := depend_server.item (written_class_id); depend_list := original_dependances.item (body_id); if depend_list /= void then propagate_feature (written_class_id, body_index, body_id, depend_list) end; debug ("dead_code") io.putstring ("La depend_list contient "); if depend_list /= void then io.putint (depend_list.count) else io.putstring ("aucun") end; io.putstring (" elements.%N"); io.putstring ("-----------------------------------------------------%N%N%N") end end; debug ("dead_code") if not just_born then io.putstring ("already alive%N") end end end 2) Propagation to the clients First, we search for the dependencies of the feature in the DEPEND_SERVER. To do that, we ask the CLASS_DEPENDANCE corresponding to the class in which the feature was originally written. Then, in this CLASS_DEPENDANCE, we look for the FEATURE_DEPENDANCE corresponding to the BODY_ID of the feature. If the feature has dependencies, for each client we check if it was previously treated, that is to say the exact feature, exact inheritance branch (same BODY_ID, same ROUTINE_ID): * if so, we forget it, it is already alive. * otherwise, we mark it treated, we mark it dead and we add it to the control list. We mark it dead because it may already be alive, treated as belonging to an other branch in a multiple inheritance scheme. This way, the new branch will be also treated completely (it is important for the phase 3: the propagation to the heirs). propagate_feature ( written_class_id: CLASS_ID; original_body_index: BODY_INDEX; original_body_id: BODY_ID; depend_list: FEATURE_DEPENDANCE) is -- (export status {NONE}) local depend_unit: DEPEND_UNIT; body_id: BODY_ID; rout_id: ROUTINE_ID do from depend_list.start until depend_list.after loop depend_unit := depend_list.item; debug ("dead_code") if depend_unit.is_special then io.putstring ("special%N") end end; if not depend_unit.is_special then body_id := body_index_table.item (depend_unit.body_index); debug ("dead_code") print_dep (depend_unit); if is_treated (body_id.id, depend_unit.rout_id) then io.putstring ("previously treated%N") end end; rout_id := depend_unit.rout_id; if not (is_treated (body_id.id, rout_id) or else rout_id.is_attribute) then mark_treated (body_id.id, rout_id); mark_dead (body_id.id); debug ("dead_code") print ("Since it was not treated, we marked bid: "); print (body_id.id); print (" dead%N") end; control.extend (depend_unit) end end; depend_list.forth end; array_optimizer.process (written_class_id.associated_class, original_body_index, original_body_id, depend_list) end; 3) The propagation to the heirs We retrieve the POLY_TABLE corresponding to the ROUTINE_ID of the feature. In this table are listed the routines of a same branch: they are the redefinitions of a routine inherited from higher in the hierarchy. Then we go through the POLY_TABLE and we treat every feature whose original class conforms to the original class of the feature we are currently considering. mark ( f: INTEGER; body_index: BODY_INDEX; body_id: BODY_ID; static_class_id: CLASS_ID; original_class_id: CLASS_ID; rout_id_val: ROUTINE_ID) is -- Mark feature and its redefinitions -- (from FEAT_ITERATOR) -- (export status {NONE}) local table: ROUT_TABLE; old_position: INTEGER; unit: ROUT_ENTRY; body_id_unit: BODY_ID do mark_and_record (f, body_index, body_id, static_class_id, original_class_id); if tmp_poly_server.has (rout_id_val) then debug ("dead_code") print ("%NMarking Poly_table `"); print (rout_id_val.id); print ("%'; for feature written in "); print (original_class_id.associated_class.name); print ("%N") end; table ?= tmp_poly_server.item (rout_id_val); check table_exists: table /= void end; from table.start until table.after loop unit := table.item; if unit.id.associated_class.simple_conform_to (original_class_id.associated_class) then old_position := table.position; if not is_alive (unit.body_id.id) then debug ("dead_code") io.putstring ("marking for rout_id: "); io.putint (rout_id_val.id); io.putstring ("%N") end; body_id_unit := body_index_table.item (unit.body_index); mark (unit.feature_id, unit.body_index, body_id_unit, unit.id, unit.written_in, rout_id_val) end; table.go_to (old_position) end; table.forth end end end; Problems due to the incrementality of the compiler Ebench is an incremental compiler. This means that when you make a modification to your system, it will not recompile the whole system but just the part you modified and its dependencies. But when you modify the a feature, by definition its body changes. And so must do its BODY_ID. But as we have seen the BODY_ID is one of the identifiers that we save in the DEPEND_UNIT. So each time we change a BODY_ID we must also upgrade all the DEPEND_UNIT and the CLASS_DEPENDANCE that contains it. Upgrade of the DEPEND_UNIT We do not really upgrade all the DEPEND_UNIT that contains it. In fact, we have in the system a table, the BODY_INDEX_TABLE, that indexes all the BODY_Ids. In the DEPEND_UNIT, we store a key in this table, the BODY_INDEX and we read the BODY_ID corresponding in this table each time we need it. This way, when the BODY_ID changes, we just have to upgrade an entry in this table. Upgrade of the CLASS_DEPENDANCE This is far more complicated. the CLASS_DEPENDANCE linked to the class in which the modified feature was written is itself a table indexed by BODY_ID. So the FEATURE_DEPENDANCE of the modified feature is the entry in this table of index the old BODY_ID and should now become the entry of index the new BODY_ID. The problem is that at the moment the BODY_ID changes, the only information we have is the old BODY_ID and the new one. With that, we must search in the DEPEND_SERVER, retrieve the CLASS_DEPENDANCE concerned and upgrade it. To do so, in the DEPEND_SERVER, we have defined a table, the bid_cid_table to link a BODY_ID to the CLASS_ID of the class in which it is written. This way, we are able to retrieve the right CLASS_DEPENDANCE and upgrade it: we delete the entry with the old BODY_ID and introduce a new one with new BODY_ID containing the same thing. Note: If a bug appear concerning a wrong upgrade of BODY_ID, please have a look at the function change_body_ids from FEATURE_I. Conclusion On the technical point of view, although a few bugs at the beginning, I can say that the new debugger is a real amelioration compared to what was before. It is not a revolution since the algorithm itself is pretty much the same. But the new implementation allowed a huge acceleration of the process: on a test application of 500 classes, the former dead code removal took 5 minutes to work whereas the new one need only a bit more then 20 seconds !!! In this document is presented the last version of the debugger. In reality, I cannot remember exactly how many version were tried before the good one, but they were a lot. And this is very representative of my internship at ISE: I did a lot of little things in fields I knew nothing about. This obliged me to go each time through a process of try/failure before finding an appropriate solution. Thanks to this internship, I learnt better how a OOL is working and also how important are its various features. Before coming here, I have got to admit that I did not understand well the importance of having or not multiple inheritance, or the importance of genericity in the design of a program. But now I realize how important those concepts are and how they can make life easier to the poor system designer. On a more human aspect, I would like to thanks Bertrand Meyer for this internship, Annie Meyer for her help in the everyday life and I would like to thanks all the team here at ISE for creating such a great atmosphere. I will add a special thanks to Emmanuel Stapf, with whom I have worked more closely, for his support and his help despite the bugs his baby suffered from time to time because of me.