Learning Curve

dK00, Member, Posts: 5, Learner I
Hello RapidMiner community,

I would like to compare the learning curves of three models, but I don't know how to do this in RapidMiner. Can anyone explain how I can plot the learning curve for each model?

Much appreciated!

Answers

  • BalazsBarany, Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Posts: 950, Unicorn
    Hi!

    I would use Optimize Parameters (Grid) for this.
    Connect the incoming data to Optimize Parameters. Inside the Optimize Parameters subprocess, put a Sample operator and configure Optimize Parameters to try different settings of Sample, for example sample ratios of 0.05, 0.1, 0.15 and so on of the original data set. After Sample, add a Multiply operator and feed its copies into three cross validations, one per model. Then use Log to record the sampling parameter together with the performance of each cross validation. The Log output appears in the Results view, where you can visualize it, or you can add Log to Data after Optimize Parameters to turn it into a regular data table that you can export.
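
For readers who want to see the logic outside RapidMiner, here is a minimal sketch in plain Python of the loop described above: sample a fraction of the data, cross-validate three models on the same sample, and log one row per combination. The synthetic data, the three toy models (`fit_mean`, `fit_linear`, `fit_1nn`) and all function names are assumptions made for illustration; they are not RapidMiner operators.

```python
import math
import random

def make_data(n, rng):
    """Synthetic data (assumption): y = 2x + Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        data.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))
    return data

# Three toy "models". Each fit function returns a predict function.
def fit_mean(train):
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_linear(train):
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx if sxx > 0 else 0.0
    return lambda x: my + slope * (x - mx)

def fit_1nn(train):
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

MODELS = {"mean": fit_mean, "linear": fit_linear, "1-NN": fit_1nn}

def rmse(predict, rows):
    return math.sqrt(sum((predict(x) - y) ** 2 for x, y in rows) / len(rows))

def cross_validate(fit, rows, folds=5):
    """Average RMSE over the folds; each fold is scored on rows not used for fitting."""
    scores = []
    for i in range(folds):
        test = rows[i::folds]
        train = [r for j, r in enumerate(rows) if j % folds != i]
        scores.append(rmse(fit(train), test))
    return sum(scores) / folds

def learning_curve_log(data, fractions, rng):
    """One log row per (sample fraction, model), like Log inside Optimize Parameters."""
    log = []
    for frac in fractions:
        # The same sample is handed to all three models (the role of Multiply).
        sample = rng.sample(data, max(10, round(frac * len(data))))
        for name, fit in MODELS.items():
            log.append({"sample_fraction": frac, "model": name,
                        "cv_rmse": cross_validate(fit, sample)})
    return log

rng = random.Random(42)
data = make_data(400, rng)
log = learning_curve_log(data, [0.05, 0.1, 0.15, 0.25, 0.5, 1.0], rng)
for row in log:
    print(f"{row['sample_fraction']:.2f}  {row['model']:<6}  {row['cv_rmse']:.3f}")
```

Plotting `cv_rmse` against `sample_fraction`, one line per model, gives the comparison of learning curves asked for in the question.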

    Regards,
    Balázs

  • dK00, Member, Posts: 5, Learner I
    Hello @BalazsBarany,

    Thank you for your response.

    Would the suggested way generate a curve of the training and testing as illustrated in the attached picture?
  • BalazsBarany, Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Posts: 950, Unicorn
    Hi @dK00,

    The performance of a cross validation is the performance on the test set that wasn't used for building the model. This is the correct way to calculate the performance.
    If you want to calculate the training performance, you can apply the model to its own input and get the performance from that result. But in data science we consider that cheating. Models should be tested on a test set, not the training set.
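
To make the point about training performance concrete, here is a small sketch in plain Python. It assumes toy data whose labels are coin flips and a hypothetical model, `fit_memorizer`, that simply memorizes its training rows. Applying such a model to its own input yields a perfect score even though there is nothing to learn.

```python
import random
from collections import Counter

def fit_memorizer(train):
    """Memorize every (id -> label) pair; use the majority label for unseen ids."""
    table = dict(train)
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda example_id: table.get(example_id, majority)

def accuracy(predict, rows):
    return sum(predict(i) == label for i, label in rows) / len(rows)

rng = random.Random(0)
# Labels are coin flips, so the id carries no real information (assumed toy data).
rows = [(i, rng.choice(["yes", "no"])) for i in range(2000)]
train, test = rows[:1600], rows[1600:]

model = fit_memorizer(train)
train_acc = accuracy(model, train)  # model applied to its own input
test_acc = accuracy(model, test)    # model applied to rows it has never seen

print(f"training accuracy: {train_acc:.3f}")
print(f"test accuracy:     {test_acc:.3f}")
```

The training accuracy is 1.0 by construction, while the test accuracy should land near chance level, which is why only the test figure says anything about the model.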

    I would actually expect the validation curve to also get better with more data. Where is this illustration coming from? It's strange.

    You can generate these curves with varying training samples, but I doubt you will get similar curves.

    Another important aspect of model performance, especially on the training set, is model complexity. That is what the X axis shows in most similar illustrations: they describe how the training performance keeps improving while the test performance gets worse once the point of overfitting has been reached.
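
A complexity curve of this kind can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the data is synthetic (a noisy sine), and a k-nearest-neighbour regressor stands in for "the model", with smaller `k` meaning a more flexible, more complex model. None of the names come from RapidMiner.

```python
import math
import random

rng = random.Random(1)

def make_rows(n):
    """Toy data (assumption): y = sin(x) plus Gaussian noise."""
    rows = []
    for _ in range(n):
        x = rng.uniform(0.0, 6.0)
        rows.append((x, math.sin(x) + rng.gauss(0.0, 0.3)))
    return rows

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def rmse(train, rows, k):
    """RMSE of a k-NN model built on `train` and evaluated on `rows`."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in rows]
    return math.sqrt(sum(errors) / len(rows))

train, test = make_rows(200), make_rows(200)

# Sweep from a simple model (large k) to a complex one (k = 1).
curve = []
for k in [50, 20, 10, 5, 2, 1]:
    curve.append({"k": k,
                  "train_rmse": rmse(train, train, k),
                  "test_rmse": rmse(train, test, k)})

for row in curve:
    print(f"k={row['k']:>2}  train={row['train_rmse']:.3f}  test={row['test_rmse']:.3f}")
```

At `k = 1` every training point is its own nearest neighbour, so the training error drops to zero, while the test error stays well above it. That widening gap is the overfitting pattern described above.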

    Regards,
    Balázs



  • earmijo, Member, Posts: 270, Unicorn
    edited June 22
    dk00:
    This is what I would do. I had to do it in two steps; someone here more knowledgeable than me can probably do it in one. In Process 1 (not shown below) I split the famous diamonds dataset (ggplot2) into diamonds1 (80%) and diamonds2 (20%). These are the datasets used in the process below.

    Balázs: the learning curve is a tool to diagnose overfitting (Andrew Ng made it famous). It requires computing both the training error and the test error. When TestError >> TrainingError, this is taken as a sign of overfitting. You can then do two things to fix it: simplify your model or get more data. There used to be an operator in RM to graph learning curves.
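
As a rough sketch of this kind of learning curve in plain Python: hold out 20% of the data, fit on growing slices of the remaining 80%, and record both the training error and the test error at each size. The synthetic data, the one-feature least-squares model and the list of training sizes are all assumptions for illustration; this is not the RapidMiner process from this post.

```python
import math
import random

def fit_line(rows):
    """Ordinary least squares with one feature; returns a predict function."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def rmse(predict, rows):
    return math.sqrt(sum((predict(x) - y) ** 2 for x, y in rows) / len(rows))

rng = random.Random(7)
rows = []
for _ in range(2500):
    x = rng.uniform(0.0, 10.0)
    rows.append((x, 3.0 * x + rng.gauss(0.0, 1.0)))  # assumed toy data

# 80% training pool, 20% fixed test set.
train_pool, test = rows[:2000], rows[2000:]

curve = []
for n in [2, 5, 10, 50, 200, 1000, 2000]:  # arbitrary training sizes
    train = train_pool[:n]
    model = fit_line(train)
    curve.append({"n": n,
                  "train_rmse": rmse(model, train),
                  "test_rmse": rmse(model, test)})

for row in curve:
    gap = row["test_rmse"] - row["train_rmse"]
    print(f"n={row['n']:>4}  train={row['train_rmse']:.3f}  "
          f"test={row['test_rmse']:.3f}  gap={gap:.3f}")
```

With only two training rows the line passes through both points, so the training error is essentially zero and says nothing about the test error. As more rows are added, the two errors are expected to approach each other; a large remaining gap would be the overfitting signal.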

    Hope this helps.

    Ernesto

    P.S. The graph I get for the learning curve is:

    [Figure: learning curve]

    <operator activated="true" class="retrieve" compatibility="10.1.001" expanded="true" height="68" name="Retrieve diamonds1" width="90" x="112" y="340">

    <parameter key="Filter Example Range.last_example" value="[5000;40000;7;linear]"/>

    <operator activated="true" class="retrieve" compatibility="10.1.001" expanded="true" height="68" name="Retrieve diamonds2" width="90" x="313" y="391">

  • earmijo, Member, Posts: 270, Unicorn
    Ok. I got it in one step using the Remember/Recall operators.

    <operator activated="true" class="retrieve" compatibility="10.1.001" expanded="true" height="68" name="Retrieve diamonds" width="90" x="112" y="391">

    </enumeration>

    <parameter key="Filter Example Range.last_example" value="[5000;40000;7;linear]"/>
