PREDICTION OF CARCINOGENICITY


INTRODUCTION
	The problem is to predict the carcinogenicity of a set of 330
diverse organic compounds.  This dataset is based on years of work by
the US National Toxicology Program (NTP) at the National Institute of
Environmental Health Sciences.  Anyone who gets good predictive results
on the data is urged to publish their results and to take part in the
next round of NTP predictions.  The data were obtained by testing the
chemicals on rodents; each trial takes several years and hundreds of
animals.  The data were therefore very expensive to obtain, both in
terms of money and of pain to the animals.  It is very important to be
able to predict the carcinogenicity of chemicals because there are
millions in the environment for which no tests have been done or are
likely to be done.

	An understanding of the molecular mechanisms of chemical
carcinogenesis is central to the prevention of many environmentally
induced cancers.  One approach is to form Structure Activity
Relationships (SARs) that empirically relate molecular structure with
ability to cause cancer.  This work has been greatly advanced by the
long term carcinogenicity tests of compounds in rodents by the
National Toxicology Program (NTP) of the US National Institute of
Environmental Health Sciences [1].  These tests have resulted in a
database of >300 compounds that have been shown to be carcinogens or
non-carcinogens.  The database of compounds can be used to form
general SARs relating molecular structure to formation of cancer.
	The compounds in the NTP database present a problem for many
conventional SAR techniques because they are structurally very
diverse, and many different molecular mechanisms
are involved.  Most conventional SAR methods are designed to deal with
compounds having a common molecular template and presumed similar
molecular mechanisms of action - congeneric compounds.  Numerous
approaches have been taken to forming SARs for carcinogenesis.  Ashby
and co-workers [2-4] developed a successful semi-objective method of
predicting carcinogenesis based on the identification of chemical
substructures (alerts) that are associated with carcinogenesis.  A
similar but more objective approach was taken by Sanderson and
Earnshaw [5], who developed an expert system based on rules obtained
from expert chemists.  An inductive approach, not directly based on
expert chemical knowledge, is the CASE system [6,7].  This system
empirically identifies structural alerts that are statistically
related to a particular activity.  A number of other approaches have
been applied based on a variety of sources of information and SAR
learning methods [8-13].  The effectiveness of these different SAR
methods was evaluated on a test set of compounds for which predictions
were made before the trials were completed (round 1 of the NTP's
tests for carcinogenesis prediction) [8,14,15].  There is currently a
second round of tests.


DATA

foreground: 
	Files with .f are positive examples (carcinogens)
	Files with .n are negative examples (non-carcinogens)

	all/ all compounds
	train_test/ split into training and test (see below)
	5fold/ split for 5-fold cross-validation (see below)

background:
	ames.pl		ames(Drug)
	information on mutagenicity of compound

	atoms.pl	atm(Drug, Atom_id, Element, Type, Charge)
	information on atoms in compound, see [21]

	bonds.pl	bond(Drug, Atom_id1, Atom_id2, Bond_type)
	information on bonds in compound, see [21]

	conc_pos.pl	Structure_predicate(Drug,[Atom_list])
	higher level chemical information, the different predicates are generic 
	for chemistry and were taken from [21].

	conc_no.pl	ring_no(Drug,Structure_predicate,no)
	the number of different structural features in compound

	ind_pos.pl	Structure_predicate(Drug,[Atom_list])
	higher level chemical information, the different predicates 
	were taken from [4] some are generic and some particular to cancer.

	conc_no.pl	ring_no(Drug,Structure_predicate,no)
	the number of different structural features in compound

misc:
	c(Drug,Cas_no,Ames,Cancer)
	Cas_no is a standard id no. for the compound,
	Ames is the result of the test for mutagenicity (sp positive,
	se equivocal, sn negative), Cancer is the result of the
	carcinogenicity test (cp positive, ce equivocal, cn negative).
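
	As a minimal sketch of how the c/4 codes described above might be
decoded (Python, not part of the original distribution): the function
name decode_c_fact and the example fact are hypothetical, and the exact
quoting of arguments in the real files is an assumption.  Equivocal
results are mapped to non-carcinogen, following the Data section below.

```python
import re

# Codes documented above for the c/4 relation.
AMES_CODES = {"sp": "positive", "se": "equivocal", "sn": "negative"}
CANCER_CODES = {"cp": "positive", "ce": "equivocal", "cn": "negative"}

# Assumed layout of one fact: c(Drug,Cas_no,Ames,Cancer).
FACT_RE = re.compile(
    r"^c\(\s*([^,]+?)\s*,\s*([^,]+?)\s*,\s*(\w+)\s*,\s*(\w+)\s*\)\s*\.\s*$"
)


def decode_c_fact(line):
    """Turn one c/4 fact into a dict with readable labels."""
    match = FACT_RE.match(line.strip())
    if match is None:
        raise ValueError("not a c/4 fact: %r" % line)
    drug, cas_no, ames, cancer = match.groups()
    return {
        "drug": drug.strip("'"),
        "cas_no": cas_no.strip("'"),
        "ames": AMES_CODES[ames],
        "cancer": CANCER_CODES[cancer],
        # Equivocal (ce) is treated as non-carcinogenic.
        "carcinogen": cancer == "cp",
    }


# Hypothetical example facts, written only to exercise the decoder.
example = decode_c_fact("c(d1,'117-79-3',sp,cp).")
equivocal_example = decode_c_fact("c(d2,'0-00-0',sn,ce).")
```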
	

**********************************************************************

Application of Progol:

The dataset was collected for the following paper:
Prediction of Rodent Carcinogenicity Bioassays from Molecular Structure 
Using Inductive Logic Programming
by
Ross D. King and Ashwin Srinivasan
Environmental Health Perspectives
(in press)

	Abstract
	The machine learning program Progol was applied to the problem of 
forming the Structure Activity Relationship (SAR) for a set of compounds tested 
for carcinogenicity in rodent bioassays by the US National Toxicology Program.  
Progol is an Inductive Logic Programming (ILP) algorithm which is the first to 
use a fully relational method for describing chemical structure in SARs - based on 
using atoms and their bond connectivities.  Progol is well suited to forming SARs 
for carcinogenicity as it is designed to produce easily understandable rules 
(structural alerts) for sets of non-congeneric compounds.  The Progol  SAR 
method was tested by prediction of a set of compounds that have been widely 
predicted by other SAR methods (the compounds used in the NTP's first round 
of carcinogenesis predictions).  For these compounds no method (human or 
machine) was significantly more accurate than Progol.  Progol was the most 
accurate method that did not use data from biological tests on rodents (however 
the difference in accuracy is not significant).  The Progol  predictions were based 
solely on chemical structure and the results of tests for Salmonella mutagenicity.  
Using the full NTP database the prediction accuracy of Progol  was estimated to 
be 63% (+/- 3%) using five-fold cross-validation.  A set of structural alerts for 
carcinogenesis was automatically generated and the chemical rationale for them 
investigated - these structural alerts are statistically independent of the Salmonella 
mutagenicity.  Carcinogenicity is predicted for the compounds used in the NTP's 
second round of carcinogenesis predictions.  The results for prediction of 
carcinogenesis, taken together with the previous successful applications of 
predicting mutagenicity in nitro-aromatic compounds, and inhibition of 
angiogenesis by suramin analogues, show that Progol has a role to play in 
understanding the SARs of cancer related compounds.  


Data
	The compilation of 330 chemicals used in this study was taken from  the 
references [2,3,8]  as well as directly from the Collective data base of the National 
Cancer Institute and National Toxicology Program [1].  The compounds used 
were all the organic compounds that had completed NTP reports at the time of 
this work.  A listing of the compounds and their activities is given in Table 1.  
Inorganic compounds were not included as it was considered that there are too 
few of them to allow meaningful generalisations.  Of the 330 compounds, 182 
(55%) are classified carcinogenic, and the remaining 148 non-carcinogenic.  
Carcinogenicity is determined by analysis of long term rodent bioassays.  
Compounds classified by the NTP as equivocal are considered non-carcinogenic; 
this allows direct comparison with other predictive methods.  No analysis was 
made of differences in incidence between rat and mouse cancer, or the role of sex, 
or particular organ sites.  
	The Progol  SAR method was first tested using the test data considered in 
the first round of the NTP trial [3].  This allowed direct comparison with the 
results of many other SAR techniques [8] .  The training set consisted of 291 
compounds, 161 (55%) carcinogens and 130 non-carcinogens.  In addition to this 
train/test split, a five fold cross-validation split of the 330 compounds was tested 
for a more accurate estimate of the accuracy of Progol.  The compounds were 
randomly split into 5 sets, and Progol was successively trained on 4 of the splits 
and tested on the remaining split.  
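
	The five-fold procedure described above can be sketched as follows.
This is only an illustration in Python: the seed, the round-robin
assignment to folds and the name five_fold_splits are assumptions, and
the 5fold/ directory of the dataset contains the split that was
actually used.

```python
import random


def five_fold_splits(compound_ids, n_folds=5, seed=0):
    """Yield (train, test) lists; each compound is tested exactly once."""
    ids = list(compound_ids)
    random.Random(seed).shuffle(ids)  # random assignment of compounds
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [c for j, fold in enumerate(folds) if j != k for c in fold]
        yield train, test


# 330 placeholder compound ids (the real ids are in the data files).
splits = list(five_fold_splits(range(1, 331)))
```

	With 330 compounds each test fold holds 66 compounds and each
training set holds the remaining 264.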

Compound representation for Progol
	The generic atom/bond representation that we previously
applied to mutagenesis was used [21].  Two basic relations were
utilized to represent structure: atom and bond.  For example, for
compound 1 (CAS no. 117-79-3), atom(127, 127_1, carbon,
aromatic_carbon_6_ring, -0.133).  states that in compound 127, atom
no. 1 is of element carbon, and of type aromatic carbon in a 6
membered ring, and has a partial charge of -0.133.  The type of the
atom and its partial charge were taken from the molecular modelling
package QUANTA; any similar modelling package would also have
been suitable.  Similarly, bond(127, 127_1, 127_2, aromatic).
states that in compound 127, atom no. 1 and atom no. 2 are connected
by an aromatic bond.  In QUANTA, partial charge assignment is based
on a specific molecular neighbourhood; this has the effect that a
specific molecular substructure can be identified by an atom type
and partial charge.  This relational representation is completely
and partial charge.  This relational representation is completely
general for chemical compounds and no special attributes need to be
invented.  The structural information of these compounds was
represented by ~18,300 facts of background knowledge.
	Information was also given about the results of Salmonella mutagenicity 
tests for each compound.  The mutagenic compounds were represented by the 
relation Ames, e.g. ames(127)  states that compound 127 is mutagenic.  
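
	To make the representation concrete, the example facts quoted
above can be read into simple Python structures as sketched below.
The parser is an illustration only: it handles flat ground facts, not
facts with list arguments such as those in conc_pos.pl, and the names
parse_fact, atoms, bonds and mutagenic are hypothetical.  Note that the
data files listed in the DATA section name the atom relation atm.

```python
import re

FACT_RE = re.compile(r"^\s*(\w+)\((.*)\)\s*\.\s*$")


def parse_fact(line):
    """Split a flat ground fact 'name(a, b, c).' into (name, [a, b, c])."""
    match = FACT_RE.match(line)
    if match is None:
        raise ValueError("not a ground fact: %r" % line)
    args = [arg.strip() for arg in match.group(2).split(",")]
    return match.group(1), args


# The three example facts quoted in the text.
facts = [
    "atom(127, 127_1, carbon, aromatic_carbon_6_ring, -0.133).",
    "bond(127, 127_1, 127_2, aromatic).",
    "ames(127).",
]

atoms = {}         # atom id -> description of the atom
bonds = []         # (drug, atom id 1, atom id 2, bond type)
mutagenic = set()  # drugs with a positive Ames test

for line in facts:
    name, args = parse_fact(line)
    if name == "atom":
        drug, atom_id, element, atom_type, charge = args
        atoms[atom_id] = {
            "drug": drug,
            "element": element,
            "type": atom_type,
            "charge": float(charge),
        }
    elif name == "bond":
        bonds.append(tuple(args))
    elif name == "ames":
        mutagenic.add(args[0])
```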
	The Progol algorithm allows for the inclusion of complex
background knowledge, either in the form of facts, or in the form of
computer programs.  This allows the addition, in a unified way, of any
information that is considered relevant to learning the SAR.  In
general the more that is known about a problem the easier it is to
solve.  The ability to use a variety of background knowledge is
perhaps the most powerful feature of Progol.  In this study we
included the background knowledge of chemical groups from our work on
predicting mutagenesis [21], and the structural alerts identified by
Ashby et al.  [4] were also encoded and tested.  It is important to
appreciate that encoding PROLOG programs to define these concepts is
not the same as including them as simple indicator variables.  This is
because Progol can learn SARs that use structural combinations of
these groups, e.g. Progol could in theory learn that a structural
indicator of activity is diphenylmethane (as a benzene single-bonded
to a carbon atom single-bonded to another benzene).  In contrast, a
normal SAR method would only be able to use the absence or presence of
the different groups, not a bonded combination of them.  To represent
compounds to the equivalent level of detail using a CASE type
representation [6] would require several orders of magnitude more
descriptors than needed for only the simple atom/bond representation
[21] .  In the future the background knowledge used could be extended
to include more information, e.g.: 3D structure, knowledge about
metabolism, subchronic in vivo toxicity, route of administration, MTD
dose levels, etc.
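
	The difference between an indicator variable and a bonded
combination of groups can be illustrated with the diphenylmethane
example above.  The following Python sketch is not how Progol works
internally; it only shows the kind of relational test that a flat
"benzene present / absent" attribute cannot express.  The bond label
"single" and all names used here are assumptions.

```python
def has_diphenylmethane(benzene_rings, bonds, elements):
    """True if a non-ring carbon is single-bonded to two benzene rings.

    benzene_rings: list of sets of atom ids, one set per benzene ring
    bonds:         iterable of (atom_id1, atom_id2, bond_type)
    elements:      dict mapping atom id -> element name
    """
    single = set()
    for a, b, bond_type in bonds:
        if bond_type == "single":
            single.add((a, b))
            single.add((b, a))
    for centre, element in elements.items():
        if element != "carbon":
            continue
        if any(centre in ring for ring in benzene_rings):
            continue  # the linking carbon must lie outside the rings
        attached = [
            ring for ring in benzene_rings
            if any((centre, atom) in single for atom in ring)
        ]
        if len(attached) >= 2:
            return True
    return False


# Toy molecule: two six-membered rings joined through carbon "c".
ring_a = {"a%d" % i for i in range(1, 7)}
ring_b = {"b%d" % i for i in range(1, 7)}
toy_elements = {atom: "carbon" for atom in ring_a | ring_b | {"c"}}
toy_bonds = [("c", "a1", "single"), ("c", "b1", "single")]
```

	An indicator-variable encoding would record only that the toy
molecule contains benzene rings, not that the two rings share a linking
carbon.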

Other SAR algorithms Compared with Progol
	The train/test dataset has previously been studied using a number of 
different SAR methods.  We use the predictions from these methods and the 
predictions from two default methods to compare their results with those of 
Progol.  The two default methods that we implemented were:

1)	The largest class prediction method is to predict all compounds to be 
carcinogenic (this is the largest class).  

2)	The Ames prediction method is to predict a compound to be carcinogenic if 
it has any form of positive Ames test.  
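
	As a small worked example of the largest class method, the class
counts quoted in the Data section can be used directly.  The sketch
below (Python, with hypothetical names) picks the majority class from
the training counts and computes the accuracy that always predicting
that class would give on all 330 compounds; this is an approximation to
a cross-validated estimate, not a reproduction of it.

```python
def majority_class(class_counts):
    """Return the label with the largest count."""
    return max(class_counts, key=class_counts.get)


# Counts quoted in the Data section.
train_counts = {"carcinogenic": 161, "non-carcinogenic": 130}
full_counts = {"carcinogenic": 182, "non-carcinogenic": 148}

default_label = majority_class(train_counts)
default_accuracy = full_counts[default_label] / sum(full_counts.values())
```

	The resulting figure of about 55% agrees with the accuracy quoted
for the default rule in the cross-validation results.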


	The previously applied prediction methods that were compared with 
Progol can be grouped into two groups.  In the first group are the prediction 
methods that do not directly use data from experiments on rodents.  The Progol  
SAR method belongs to this group and can be directly compared with such 
methods.  These methods are: 

3)	The Bakale & McCreary method [9] used experimentally measured 
electrophilic reactivity (Ke)  values to discriminate between carcinogenic and 
non-carcinogenic compounds.  

4)	The DEREK method (Deductive Estimation of Risk from Existing Knowledge) 
[5] is an expert system that predicts carcinogenesis based on a set of rules 
derived from experienced chemists.  

5)	The COMPACT method [10] predicts carcinogenesis based on the predicted 
interaction of the compound with cytochrome P450 and the Ah receptor.  

6)	The CASE method [25]  is based on a statistical method of selecting chemical 
substructures associated with carcinogenesis.  

7)	The TOPKAT system [11] uses structural attributes to describe the 
compounds and applies statistical discrimination and regression to estimate 
the probability of carcinogenesis; interestingly it uses a number of 
non-carcinogenic pharmaceuticals and food additives to increase the number 
of negative examples.  

8)	The Benigni method [12] forms a Hansch type QSAR using estimated 
electrophilic reactivity (Ke) and Ashby's structural alerts (see below).  


	The second group of prediction methods that have been previously 
applied use information from biological tests on rodents.  It is unfair to directly 
compare these procedures with methods based only on chemical structure and 
Salmonella mutagenicity as they use more information.  Rodent biological tests 
are very expensive both in money and animal welfare terms.  The prediction 
methods that use rodent biological tests are:

9) 	The Ashby prediction method [28] is based on the expert judgement
of chemists to evaluate evidence from: a set of chemical structural
indicators, Salmonella mutagenicity, subchronic in vivo toxicity, the
route of administration of the compound, and MTD dose levels [4] .
When experimental carcinogenicity results from previous studies were
available in the literature, this evidence was also taken into
consideration [15] .

10)	The TIPT method [8] uses the machine learning algorithm C4.5 (a 
propositional tree-learning method) to combine the same evidence used by 
the Ashby prediction method.  It cannot identify new structural alerts.  

11)	The RASH method [13] uses relative potency analysis and dose levels to 
modulate the Ashby method.  


Train/Test results
	The accuracy of Progol on round 1 of the NTP carcinogenicity trial is 
given in Table 1 (291 training compounds and 39 test compounds) [8].  The small 
size of test dataset makes it difficult to show statistically significant differences 
between algorithms; this difficulty is compounded because some algorithms 
cannot predict all the examples.  Comparing the predictions, using a binomial 
McNemar test for changes [26], shows that no algorithm is significantly more 
accurate than Progol (P < 0.05).  The McNemar test exploits the fact that the 
different prediction methods are applied to the same data and is based on 
counting the examples where the methods disagree about predictions.  
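
	A minimal sketch of an exact binomial McNemar test is given below
(Python).  It is one standard form of the test and is not taken from
the paper; the function name is hypothetical.  The per-compound
predictions of the other methods are not included in this file, so no
real counts are shown.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact binomial McNemar test.

    b: number of compounds method A predicts correctly and B wrongly
    c: number of compounds method B predicts correctly and A wrongly
    Compounds on which both methods are right, or both wrong, are ignored.
    Returns the p-value under the null hypothesis that a disagreement is
    equally likely to favour either method.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

	For example, if two methods disagreed on six compounds and one of
them was right every time, mcnemar_exact(0, 6) gives 2 * (1/64) =
0.03125; an even 5 to 5 division of disagreements gives a p-value of 1.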
	Progol is marginally the most accurate prediction method that does not use 
rodent tests (although this is not statistically significant).  The more accurate 
prediction methods of Ashby, TIPT, and RASH are based on use of short term 
rodent in vivo tests.  This information is much more difficult and expensive to 
obtain than chemical structural and Salmonella mutagenicity data.  The Ashby 
and RASH methods are also based on the subjective application of a set of 
structural alerts formed by Ashby et al.  [4], the TIPT method uses an objective 
application of these expert defined alerts.  
	A number of the errors in prediction made by Progol were repeated by 
most other methods, suggesting some anomaly with these compounds [14], namely 
methylphenidate hydrochloride and methyl bromide.  Interestingly Progol 
correctly identified naphthalene as a carcinogen while it was missed by all other 
methods.

Cross-validation results
	Progol has an accuracy of 63% (standard error 3%) for all compounds, 
estimated by five-fold cross-validation.  This compares with estimated accuracies 
of 55% using the default rule, and 63% using the Ames rule.  There is a 
significant difference at P < 0.05 between the accuracy of Progol and the default 
rule.  Although there is no significant difference in accuracy between Progol and 
the Ames rule, there is a large difference in the number of carcinogens identified.  
Progol makes fewer errors of omission than the Ames rule and more errors of 
commission, i.e. Progol identifies more carcinogens than the Ames rule at the cost 
of classifying more non-carcinogens as carcinogens.  
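
	The text does not say how the standard error was calculated.  One
common choice, assumed here purely for illustration, is the binomial
standard error sqrt(p(1-p)/n); the Python sketch below shows that this
gives a figure of the quoted size for p = 0.63 and n = 330.

```python
from math import sqrt


def binomial_standard_error(accuracy, n):
    """Standard error of a proportion estimated from n independent cases."""
    return sqrt(accuracy * (1.0 - accuracy) / n)


# Accuracy and number of compounds quoted in the text.
se = binomial_standard_error(0.63, 330)
```

	Here se is roughly 0.027, which rounds to the 3% quoted above.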

Rules
	The Progol  SAR method produces prediction rules in the form of easily 
understood chemical patterns.  The prediction rules are given in Figures 2 and 3.  
There is a direct translation from the rules generated by Progol  into chemical 
structure.  For example rule 3 in PROLOG notation is:
active(Drug) if  
	atm(Drug, Atom_1, Element_1, ester_carbon, Charge1) and
	atm(Drug, Atom_2, Element_2, aromatic_hydrogen, Charge2) and
	less_than_or_equal(Charge2, 0.041).
(names with capital letters are variables).  
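
	For readers less familiar with PROLOG notation, rule 3 as written
above can be paraphrased in Python as below.  This is an illustrative
translation only; the tuple layout and the toy compounds are
assumptions, not data from the files.

```python
def rule_3(atoms):
    """Rule 3 as quoted above, for the atoms of a single compound.

    atoms: list of (atom_id, element, atom_type, charge) tuples
    """
    has_ester_carbon = any(
        atom_type == "ester_carbon" for _, _, atom_type, _ in atoms
    )
    has_aromatic_hydrogen = any(
        atom_type == "aromatic_hydrogen" and charge <= 0.041
        for _, _, atom_type, charge in atoms
    )
    return has_ester_carbon and has_aromatic_hydrogen


# Hypothetical toy compounds, written only to exercise the rule.
toy_active = [
    ("x_1", "carbon", "ester_carbon", 0.300),
    ("x_2", "hydrogen", "aromatic_hydrogen", 0.030),
]
toy_inactive = [
    ("y_1", "carbon", "ester_carbon", 0.300),
    ("y_2", "hydrogen", "aromatic_hydrogen", 0.150),
]
```

	The two atm conditions are independent existence tests on the same
compound, which is why each is written as a separate any(...) call.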
	The particular use of partial charges requires some explanation.  They are 
given to three decimal places because of a peculiarity in the method of 
assigning partial charges in QUANTA (see above), not because it is considered 
that these exact values are important to the accuracy.  
	It is important that rules produced by any automatic SAR procedure are 
screened to ensure that they make chemical sense.  More confidence can be put 
on a rule if a mechanism of action can be identified [27-29].  This is a general 
application of the principle of using prior knowledge to guide decision making.  
All the rules found by Progol were analysed to try to identify their chemical 
rationale.  
	It was found that use of the Ames test for Salmonella mutagenicity (rule 1) 
was the most effective rule for predicting carcinogenicity.  While learning rule 1, 
Progol automatically searched for structural features that improved rule 1 and no 
such rule was found that had higher compression than rule 1 (recall that 
compression is an objective way of balancing sensitivity/specificity of a rule).  
This does not conflict with the results of Ashby and Tennant 
[4]  who showed that the Ames test was 
correlated with a set of structural alerts.  
	The remaining rules found by Progol are new and automatically generated 
structural alerts for carcinogenesis.  As Progol removes examples covered by 
previous rules when searching for a new rule, rules found after rule 1 was 
covered are indicators for carcinogenic compounds not recognised by the Ames 
test.  This means that they could be either structural alerts for non-genotoxic 
carcinogenesis [4], (i.e. not based on induction of DNA damage by the test agent 
or its metabolites), or structural alerts for genotoxic carcinogens that are missed 
by the Ames test.  Most of the structural features identified by Progol appear to be 
for highly reactive structures, suggesting that they mainly act by genotoxic 
carcinogenesis.  Chemical interpretations of the rules are given below (arranged by 
chemical group): 
 
	Rules 2 and 3 identify ester groups as indicators for carcinogenesis.  The 
meaning of the modifying groups is unclear but they are essential, as ester 
groups on their own have no discriminatory power.  Rules 2, 6, and 11 use the 
generic background knowledge that was first used in applying Progol to 
predicting mutagenesis [21].  

	Rule 4 is concerned with ether oxygens with high partial charges.  All such 
groups are bonded to aromatic rings, suggesting the involvement of 
electrophilic substitution in activity.  

	Rule 5 identifies an ether group in a 6 membered ring.  These cyclic ethers 
may also be involved in electrophilic reactions.  

	Rules 6 and 7 identify reactive halides as indicators of carcinogenesis; such 
compounds have been widely recognised as potential carcinogens.  

	Rule 8 identifies an aldehyde group as an indicator of carcinogenicity.  
Aldehyde groups are potentially very reactive.  

	In rule 9 the aromatic amine group indicates high reactivity, as does the low 
partial charge on the unsaturated carbon (it is associated with a double bond 
to an oxygen group).  

	The high partial charge on the unsaturated carbon in rule 10 occurs in reactive 
alkenes.  

	Rule 11 occurs in substituted cyclohexenes, note the similarity with rule 5. 

	Rule 12 occurs when a 6 membered aromatic ring is bonded to a non-aromatic 
ring.

	Rule 13 occurs when a carbon atom in a single 6 membered aromatic ring is 
bonded to an amine or carbon substituted amine group.  

	Rule 15 occurs in chlorinated alkane groups, see rule 6.

	Rule 16 occurs when an hydroxyl group is attached to an aliphatic carbon.  

	In rule 17 the indicator is halide atoms attached to tetrahedral carbons.  This is 
the only rule that uses the structural alerts from Ashby et al. [4].  It is possible 
that this rule is an artefact as there appears to be no chemical reason why 4 halide 
atoms should be chosen instead of say 3 or 5.  

	Rules 14 and 18 may also be artefacts as there appears to be no chemical 
rationale for them.  


Comparison with other SAR results
	The estimated accuracy of 63% for predicting carcinogenesis by Progol  is 
higher, but not statistically significantly higher than the results obtained using 
other SAR methods that do not incorporate results from rodent biological tests.  
This confirms the results of Benigni [15] who showed that all the SAR approaches 
to carcinogenicity had similar prediction profiles.  The relatively low prediction 
accuracy of ~ 60% is probably due  to the diversity of mechanisms  of action and 
the complexity of interactions in vivo.   

Comparison of the Ashby et al. structural alerts and those generated by Progol
	The Ashby et al. structural alerts [2-4,28] and those generated by Progol 
differ fundamentally in their formation and application.  The Ashby alerts were 
generated by a human expert and applied subjectively.  The Progol alerts were 
generated automatically by machine and are applied objectively.  The Ashby 
structural alerts are based on electrophilic attack on DNA.  This means that they 
are not statistically independent of the Ames test [4], and there is some 
redundancy between the Ames test and the structural alerts.  The Progol 
structural alerts were selected so that they covered compounds not covered by 
the Ames test.  This makes them much more independent of each other than 
those of Ashby.  
	Many of the structural alerts found by Progol are similar to those identified 
by Ashby, e.g. Ashby recognises forms of esters (see Rules 2 and 3), ethers (see 
Rules 4 and 5), halogenated compounds (see Rules 6, 7, and 15), and aldehydes 
(see Rule 8) as structural alerts.  The exact forms of the alerts differ significantly 
between Ashby and Progol.  This strongly suggests that it may be possible to 
develop a system for predicting chemical carcinogenesis that combines the best 
features of human based prediction with the objectivity and speed of the Progol 
rules to develop a superior SAR system.  
	The results for prediction of carcinogenesis taken together with the 
previous successful applications of predicting mutagenicity in nitro-aromatic 
compounds, and inhibition of angiogenesis by suramin analogues, show that 
Progol has a role to play in understanding the SARs of cancer related compounds.  


Program availability
	The ILP program Progol (implemented in Prolog) can be obtained
by request from: Ashwin Srinivasan, Oxford Laboratory, Wolfson
Building, Parks Road, Oxford, OX1 3QD, U.K.,
Ashwin.Srinivasan@comlab.oxford.ac.uk.  It is freely available to
academics.  A version of Progol is also available that is implemented
in C, available by anonymous ftp from ftp.comlab.ox.ac.uk in directory
pub/Packages/ILP/progol4.1.  Additional information about Progol and
ILP can be found at the www address
http://www.comlab.ox.ac.uk/oucl/groups/machlearn.

Acknowledgements
	We thank Michael J.E. Sternberg and Stephen H. Muggleton.  This work 
was supported by ESPRIT (6020), the SERC project Experimental application and 
development of ILP, and the Imperial Cancer Research Fund.




References

1.	Huff, J. and Haseman, J.  Long-term chemical carcinogenesis experiments 
for identifying potential human cancer hazards. Environmental Health 
Perspectives. 96: 23-31 (1991).
2.	Ashby, J., Tennant, R.W., Zeiger, E., and Stasiewicz, S.  Classification 
according to chemical structure, mutagenicity to Salmonella and level of 
carcinogenicity of a further 42 chemicals tested for carcinogenicity by the 
U.S. National Toxicology Program. Mutation Research. 223: 73-103 (1989).
3.	Tennant, R.W., Spalding, J., Stasiewicz, S., and Ashby, J.  Prediction of the 
outcome of rodent carcinogenicity bioassays currently being conducted on 
44 chemicals by the National Toxicology Program. Mutagenesis. 5: 3-14 
(1990).
4.	Ashby, J. and Tennant, R.W.  Definitive relationships among chemical 
structure, carcinogenicity and mutagenicity for 301 chemicals tested by the 
U.S. NTP. Mutation Research. 257: 229-306 (1991).
5.	Sanderson, D.M. and Earnshaw, C.G.  Computer prediction of possible 
toxic action from chemical structure. Human and Experimental 
Toxicology. 10: 261-273 (1991).
6.	Klopman, G.  Artificial Intelligence Approach to Structure-Activity 
Studies. Computer Automated Structure Evaluation of Biological Activity 
of Organic Molecules. J. Am. Chem. Soc. 106: 7315-7321 (1984).
7.	Klopman, G.  MULTICASE: 1. A hierarchical computer automated 
structure evaluation program. Quant. Struct.-Act. Relat. 11: 176-184 (1992).
8.	Bahler, D. and Bristol, D.W.  The induction of rules for predicting chemical 
carcinogenesis in rodents. In: Intelligent Systems for Molecular Biology-93. 
(Hunter, L., Searls, D., Shavlik, J. eds.) AAAI/MIT Press (1993): 29-37.
9.	Bakale, G. and  McCreary, R.D.  Prospective Ke screening of potential 
carcinogens being tested in rodent bioassays by the U.S. National 
Toxicology Program. Mutagenesis. 7: 91-94 (1992).
10.	Lewis, D.F.V., Ioannides, C., and Parke, D.V.  A prospective toxicity 
evaluation (COMPACT) on 40 chemicals currently being tested by the 
National Toxicology Program. Mutagenesis. 5: 433-436 (1990).
11.	Enslein, K., Blake, B.W., and Borgstedt, H.H.  Prediction of probability of 
carcinogenicity for a set of ongoing NTP bioassays. Mutagenesis. 5: 305-
306 (1990).
12.	Benigni, R.  QSAR prediction of rodent carcinogenicity for a set of 
chemicals currently bioassayed by the US National Toxicology Program. 
Mutagenesis. 6: 423-425 (1991).
13.	Jones, T.D. and Easterly, C.E.  On the rodent bioassays currently being 
conducted on 44 chemicals: a RASH analysis to predict test results from 
the National Toxicology Program. Mutagenesis. 6: 507-514 (1991).
14.	Ashby, J. and Tennant, R.W.  Prediction of rodent carcinogenicity for 44 
chemicals: results. Mutagenesis. 9: 7-15 (1994).
15.	Benigni, R.  Predicting chemical carcinogenesis in rodents: The state of the 
art in the light of a comparative exercise. Mutat. Res. 334: 103-113 (1995).
16.	King, R.D., Muggleton, S., Lewis, R.A., and Sternberg, M.J.E.  Drug design 
by machine learning: the use of inductive logic programming to model the 
structure-activity relationships of trimethoprim analogues binding to 
dihydrofolate reductase. Proc. Natl. Acad. Sci. USA. 89: 11322-11326 
(1992).
17.	Hirst, J.D., King, R.D. and Sternberg, M.J.E.  Quantitative structure-activity 
relationships by neural networks and inductive logic programming. I. The 
Inhibition of dihydrofolate reductase by pyrimidines. J. Comput.-Aided 
Mol. Des. 8: 405-420 (1994).
18.	Hirst, J.D., King, R.D. and Sternberg, M.J.E.   Quantitative structure-
activity relationships by neural networks and inductive logic 
programming. II. The inhibition of dihydrofolate Reductase by triazines. J. 
Comput.-Aided Mol. Des. 8: 421-432 (1994).
19.	Muggleton, S.H.  Inverse Entailment and Progol. New Generation 
Computing. 13: 245-286 (1995).
20.	King, R.D., Srinivasan, A., and Sternberg, M.J.E.  Relating chemical activity 
to structure: an examination of ILP success. New Gen. Computing. 13: 411-
433 (1995).
21.	King, R.D., Muggleton, S.H., Srinivasan, A., and Sternberg, M.J.E.  
Structure activity relationships derived by machine learning: The use of 
atoms and their bond connectivities to predict mutagenicity using 
inductive logic programming. Proc. Natl. Acad. Sci. USA.  93: 438-442 
(1996).
22.	Lee, Y., Buchanan, B.G., Mattison, D.M., Klopman, G., and Rosenkranz, 
H.S.  Learning rules to predict rodent carcinogenicity of non-genotoxic 
chemicals. Mutat. Res. 328: 127-149 (1995). 
23.	DeLong, H.  A Profile of Mathematical Logic. Reading, MA: Addison-
Wesley. 1970.
24.	Wallace, C.S. and Freeman, P.R.  Estimation and Inference by compact 
coding. J. R. Statist. Soc. B. 49: 195-209 (1987)
25.	Rosenkranz, H.S. and  Klopman, G.  Prediction of the carcinogenicity in 
rodents of chemicals currently being tested by the US National Toxicology 
Program.  Mutagenesis. 5: 425-432 (1990).
26.	McNemar, Q.  Note on the sampling error of the difference between 
correlated proportions or percentages. Psychometrika. 12: 153-157 (1947).
27.	Ashby, J.  Two million rodent carcinogens? The role of SAR and QSAR in 
their detection. Mutation Research. 305: 3-12 (1994).
28.	Ashby, J. and Paton, D.  The influence of chemical structure on the extent 
and sites of carcinogenesis for 522 rodent carcinogens and 55 different 
human carcinogen exposures. Mutation Research. 286: 3-74 (1993).
29.	Richard, A. M.  Application of SAR methods to non-congeneric data bases 
associated with carcinogenicity and mutagenicity: Issues and approaches. 
Mutat. Res. 305: 73-97 (1994).



Tables

Table 1	The accuracy of different prediction methods on 39 compounds 
previously tested by the NTP.  Default methods are those that use simplistic 
prediction strategies.  The basic methods are those that use information solely 
from chemical structure and Salmonella mutagenicity tests.  The complex methods 
use information from rodent biological tests; the Ashby and RASH methods also 
exploit expert chemical knowledge and are therefore not automatic.  Accuracy = 
no. correctly predicted / total no. predicted.  Cover is the number of compounds 
predicted (PP = predicted to be carcinogenic and is carcinogenic, PN = predicted 
to be carcinogenic and is not carcinogenic, NP = predicted to be not carcinogenic 
and is carcinogenic, NN = predicted to be not carcinogenic and is not carcinogenic).   

Table 2	The Progol predictions for the test set.  Actual is the result of the 
NTP rodent bioassays (pos a carcinogen for any species and in any organ, equ an 
equivocal classification treated as a non-carcinogen, neg a non-carcinogen).  Pred is 
the Progol prediction (pos is predicted to be a carcinogen, neg a non-carcinogen).  

Table 3 The rules found by Progol using all the compounds.


 
Table 1


Information	Prediction method	Accuracy	Cover (PP, PN, NP, NN)

Default		Ames			59%		39 (10, 5, 11, 13)
		Largest Class		54%		39 (21, 18, 0, 0)

No rodent tests	PROGOL			64%		39 (18, 11, 3, 7)
		Bakale & McCreary	63%		30 (11, 5, 6, 8)
		Benigni			62%		37 (16, 9, 5, 7)
		DEREK			57%		37 (12, 8, 8, 9)
		COMPACT			54%		37 (14, 10, 7, 6)
		TOPKAT			54%		26 (6, 3, 9, 8)
		CASE			49%		37 (11, 9, 10, 7)

Rodent tests	Ashby			77%		39 (19, 7, 2, 11)
		RASH			72%		29 (8, 0, 8, 13)
		TIPT			67%		39 (19, 11, 2, 7)
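
	The Accuracy column of Table 1 can be recomputed from the Cover
counts using the definition in the table caption (correct predictions
are PP + NN).  The Python sketch below simply restates the table; the
names TABLE_1 and accuracy are hypothetical.

```python
# (PP, PN, NP, NN) counts copied from Table 1.
TABLE_1 = {
    "Ames": (10, 5, 11, 13),
    "Largest Class": (21, 18, 0, 0),
    "PROGOL": (18, 11, 3, 7),
    "Bakale & McCreary": (11, 5, 6, 8),
    "Benigni": (16, 9, 5, 7),
    "DEREK": (12, 8, 8, 9),
    "COMPACT": (14, 10, 7, 6),
    "TOPKAT": (6, 3, 9, 8),
    "CASE": (11, 9, 10, 7),
    "Ashby": (19, 7, 2, 11),
    "RASH": (8, 0, 8, 13),
    "TIPT": (19, 11, 2, 7),
}


def accuracy(pp, pn, np_, nn):
    """No. correctly predicted / total no. predicted."""
    return (pp + nn) / (pp + pn + np_ + nn)


percentages = {
    method: round(100 * accuracy(*counts))
    for method, counts in TABLE_1.items()
}
```

	For PROGOL this gives (18 + 7) / 39, i.e. 64%, and the other rows
reproduce the tabulated percentages in the same way.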



Table 2

CAS No.		Name				Actual		Pred
6459-94		CI Acid red 114			pos		pos
96-18-4		1,2,3-Trichloropropane		pos		pos
96-13-9		2,3-Dibromo-1-propanol		pos		pos
119-93-7	3,3-Dimethylbenzidine 2HCL	pos		pos
1825-21-4	Pentachloroanisole		pos		pos
2425-85-6	CI Pigment red 3		pos		pos
91-23-6		o-Nitroanisole			pos		pos
28407-37-6	CI Direct blue 218		pos		pos
91-64-5		Coumarin			pos		pos
2429-74-5	CI Direct blue 15		pos		pos
137-09-7	2,4 Diaminophenol 2HCl		pos		pos
115-96-8	Tris(2-chloroethyl)phosphate	pos		pos
396-01-0	Triamterene			pos		pos
120-32-1	o-Benzyl-p-Chlorophenol		pos		pos
57-41-0		Diphenylhydantoin		pos		pos
91-20-3		Naphthalene			pos		pos
62-23-7		p-Nitrobenzoic acid		pos		pos
3296-90-0	2-2-Bis(bromomethyl)-
		1,3-propanediol			pos(draft)	pos
119-84-6	3,4-Dihydrocoumarin		pos		neg
1330-78-5	Tricresyl phosphate		pos		neg
298-59-9	Methylphenidate HCL		pos		neg
75-65-0		t-Butyl alcohol			pos		neg
10599-90-3	Chloramine			equ		neg
96-48-0		g-Butyrolactone			equ		neg
9005-56-6	Polysorbate 80			equ		neg
103-90-2	4-Hydroxyacetanilide		equ		pos
127-19-8	Titanocene dichloride		equ		pos
52551-67-4	HC yellow 4			equ		pos
100-01-6	p-Nitroaniline			equ		pos
6471-49-4	CI Pigment red 23		equ		pos
60-13-9		Amphetamine sulfate		neg		neg
107-21-1	Ethylene glycol			neg		neg
108-46		Resorcinol			neg		neg
100-02-7	p-Nitrophenol			neg		neg
79-11-8		Monochloroacetic acid		neg		pos
81-11-8		4,4-Diamino-2,2	
		stilbenedisulfonic acid		neg		pos
74-83-9		Methyl Bromide			neg		pos
58-33-3		Promethazine HCl		neg		pos
96-69-5		4-4 Thiobis(6-t-butyl-m-cresol)	neg		pos



Table 3

A compound is carcinogenic if
(1)	It has a positive Ames test.						OR
(2)	It has an ester oxygen and =< 1 methyl group.				OR
(3)	It has an ester oxygen and an aromatic hydrogen with a 
	partial charge =< 0.041.						OR
(4)	It has an ether oxygen with a partial charge >= -0.182.			OR
(5)	It has an ether oxygen with a partial charge of -0.358.			OR
(6)	It has a chlorine, an hydroxyl group and a benzyl ring. 		OR
(7)	It has a bromine with a partial charge =< -0.086.			OR
(8)	It has an aldehyde oxygen.						OR
(9)	It has an aromatic amine group and an unsaturated carbon 
	with a partial charge =< -0.181 					OR
(10)	It has an unsaturated carbon with a partial charge of >= 0.4.		OR
(11)	It has an unsaturated carbon and a 6-membered carbon ring. 		OR
(12)	It has a carbon atom in a 6 membered aromatic ring 
	with a partial charge of 0.005.						OR
(13)	It has a carbon atom in a 6 membered aromatic ring 
	with a partial charge of 0.211.						OR
(14)	It has a carbon atom in a 6 membered aromatic ring 
	with a partial charge of -0.135.					OR
(15)	It has an aliphatic carbon with a partial charge >= 0.507.		OR
(16)	It has an aliphatic carbon with a partial charge of -0.085. 		OR
(17)	It has 4 halide atoms attached to tetrahedral carbons.			OR
(18)	It has a carbon in a 5 membered aromatic ring with the 
	same charge as a carbon in a 6 membered aromatic ring.	


 
