With the release of Josh Hermsmeyer's injury database, lots of people are commenting on how predicting injuries (and also predicting injury-related performance changes) might be the new breakthrough in sabermetrics. Perhaps so. I've had access to descriptive DL data for the 2000s for some time now, so I've given it a shot. Let me tell you: predicting injuries is really hard; using injuries for performance prediction is even harder.
I haven't touched the injury data while I've been busy with pitch data, but here is where I left off...
I used descriptive injury data to come up with features like "did the pitcher have surgery?", "how many days did he spend on the DL?", and "did he have elbow-related injuries?" I spent a week coming up with what I thought might be interesting features for pitcher injuries. Then I ran feature selection in WEKA with that data, to see which of my injury features were most useful for predicting innings pitched, VORP, and the incidence of future injuries. I still have those notes somewhere: a full notebook on which injury features help to predict future DL time, future elbow problems, future shoulder problems, and future surgeries. Good times.
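To give a flavor of what this feature extraction looks like, here is a minimal sketch. The record format (player, DL days, injury description) and the string-matching rules are my assumptions for illustration, not the actual schema of the DL database:

```python
from collections import defaultdict

# Hypothetical DL stint records: (player, dl_days, description).
dl_stints = [
    ("A", 60, "Tommy John surgery"),
    ("A", 15, "elbow soreness"),
    ("B", 10, "hamstring strain"),
]

# One feature row per pitcher, built from keyword matches on the
# free-text injury descriptions.
features = defaultdict(lambda: {"total_dl_days": 0,
                                "had_surgery": False,
                                "elbow_related": False})
for player, days, desc in dl_stints:
    f = features[player]
    f["total_dl_days"] += days
    f["had_surgery"] |= "surgery" in desc.lower()
    f["elbow_related"] |= "elbow" in desc.lower()
```

A table like this (one row per pitcher, one column per feature) is what a feature-selection tool such as WEKA would then rank against the target variables.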
Of course, the correlations between my predictions (for DL days, probability of elbow-related DL stints, etc.) and reality weren't very high. We all know that injuries are hard to predict, and that there is a lot of variance involved. Still, my predictions weren't bad. The average pitcher projected for 15 (!) DL days, with some pitchers projected for considerably more time than that. All the extreme cases made some sense. However, the DL projections were not of much use for predicting VORP, or even IP. Most strangely, the "projected_DL_X" features often carried a positive weight when projecting IP.
How can a high injury risk player project to have more value than another player with the same stats and less injury risk?
The key is "same stats." Suppose you have a pitcher with seasonal WAR (total value, in wins) of 5.0, 5.0, and 2.0. What is his established level of performance? It matters whether the 2.0 drop-off was injury related. If so, we might say that he's a 5.0 WAR player, but perhaps project him a bit lower because of injury risk. If he wasn't injured, we might say he is now a 2.0 WAR player.
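A tiny numeric sketch of the two readings. The discount and the weights are invented for illustration, not any particular system's parameters:

```python
seasons = [5.0, 5.0, 2.0]   # seasonal WAR, oldest to newest

# Injury-free baseline: a recency-weighted average (3/2/1 weights are
# an arbitrary illustrative choice) lands between the two readings.
weights = [1, 2, 3]
baseline = sum(w * s for w, s in zip(weights, seasons)) / sum(weights)

# Reading 1: the 2.0 was injury-driven, so true talent is ~5.0 WAR,
# shaved for elevated injury risk (15% haircut -- an assumption).
injury_reading = 5.0 * 0.85

# Reading 2: the 2.0 reflects a real decline in talent.
decline_reading = 2.0
```

The same three stat lines support projections anywhere from 2.0 to roughly 4.25 WAR, depending entirely on what you believe about the injury.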
It's like applying to Yale Medical School with a 3.4 GPA. You'd better have some 4.0 GPA semesters, and an explanation for the 3.0 GPA semesters. If you got a 3.4 GPA every semester, that's your established level of performance.
Of course, determining to what extent a performance change is due to injuries is very difficult, especially for a computer.
A projection system, in either case, will look at the 5.0, 5.0 and 2.0 seasons and probably give a guess between 2.0 and 5.0 for next season. Now you tell the computer that the player has above-average injury risk. Well, that means that his upside (an injury-free season) is also higher than expected, so the computer might upgrade his projection. Then again, if we know that the player will most likely miss half of the next season, then the computer should lower his projection instead. It's not as simple as downgrading a player's projected value because he has high injury risk.
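The tension above can be written as a toy expected-value calculation. Every probability and WAR level here is invented for illustration:

```python
# If the 2.0 season was injury-driven, his healthy upside is ~5.0 WAR.
healthy_war = 5.0

# Assumed injury scenario: 60% chance of a mostly full season,
# 40% chance injuries cost him half of it.
p_full_season = 0.6
p_half_season = 0.4

projection = (p_full_season * healthy_war
              + p_half_season * (healthy_war / 2))
# 0.6 * 5.0 + 0.4 * 2.5 = 4.0
```

Note that this injury-aware projection (4.0) can land *above* a naive reading of the raw stat line, which is exactly why high injury risk does not automatically mean a lower projection.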
As you can see, the projection systems, built to function without injury information, will not necessarily benefit from injury projections. A lot of the information about a player's injury risk is already embedded in the previous years' stats.
Modeling injury risk separately from usage and value projection is useful in itself. But if you see a pitcher projected at X with high injury risk, that does not necessarily mean X should be revised downward. The only thing that you can be sure of is that variance in performance increases with injury risk. But you don't need a computer model to tell you that.
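The variance point can be shown with a toy simulation: two pitchers with roughly the same expected value, one with much higher injury risk. All parameters are made up:

```python
import random
import statistics

random.seed(0)

def season_war(p_injury, healthy_war):
    # With probability p_injury the pitcher loses half the season
    # (and, in this toy model, half his value).
    return healthy_war / 2 if random.random() < p_injury else healthy_war

# Durable pitcher: lower upside, 10% injury risk.
durable = [season_war(0.1, 4.2) for _ in range(10000)]
# Fragile pitcher: higher upside, 40% injury risk.
fragile = [season_war(0.4, 5.0) for _ in range(10000)]

# The means come out nearly identical (~4.0 WAR), but the fragile
# pitcher's standard deviation is roughly double.
```

Same projection, very different spread of outcomes: that spread is the main thing an injury model reliably adds.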
I'll be getting back to pitcher injuries soon. Good luck to everyone else looking at these fascinating problems!