How FiveThirtyEight’s House, Senate And Governor Models Work

FiveThirtyEight’s 2020 Senate and House models are mostly unchanged from 2018, so the large majority of the methodological detail below will still apply. (Note that we are not planning to run gubernatorial forecasts in 2020.) The handful of changes we do have, however, reflect either takeaways from our review of the model’s performance in the 2018 midterms or efforts to make our congressional forecasts more consistent with the current best practices in our presidential model. The changes include the following:

The principles behind our House, Senate and gubernatorial forecasts should be familiar to our longtime readers. They take lots of polls, perform various types of adjustments to them, and then blend them with other kinds of empirically useful indicators (what we sometimes call “the fundamentals”) to forecast each race. Then they account for the uncertainty in the forecast and simulate the election thousands of times. Our models are probabilistic in nature; we do a lot of thinking about these probabilities, and the goal is to develop probabilistic estimates that hold up well under real-world conditions. For instance, when we launched the 2018 House forecast, Democrats’ chances of winning the House were about 7 in 10 — right about what Hillary Clinton’s chances were on election night in 2016! So ignore those probabilities at your peril.
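
To make that “simulate the election thousands of times” step concrete, here is a minimal sketch in Python with made-up win probabilities. Note that it treats every race as independent, which the real model deliberately does not; see the section on simulating the election below.

```python
import numpy as np

rng = np.random.default_rng(538)

# Hypothetical per-race Democratic win probabilities for 435 House
# districts -- placeholders, not actual model output.
p_dem = rng.uniform(0.05, 0.95, size=435)

n_sims = 10_000
# Draw a winner in every race in every simulation, then count seats.
dem_seats = (rng.random((n_sims, 435)) < p_dem).sum(axis=1)

# The headline probability is just the share of simulations in which
# Democrats reach a House majority (218 of 435 seats).
print("P(Dem majority) ~", (dem_seats >= 218).mean())
```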

The methods behind our House and Senate forecasts are nearly identical. In fact, they’re literally generated by the same program, with data from House races informing the Senate forecasts and vice versa. The governor forecasts are also largely similar, 4 but with some important differences: For instance, House and Senate data is used to inform the gubernatorial forecasts, but not the other way around. Where there are further differences between how House, Senate and gubernatorial races are handled, we’ll describe them as they arise below.

Overall, the congressional and gubernatorial models differ in flavor from our presidential forecasts in two important respects:

Three versions of the models: Lite, Classic, Deluxe

In 2016, we published what we described as two different election models: “polls-only” and “polls-plus.” 6 But now we’re running what we think of as three different versions of the same model, which we call Lite, Classic and Deluxe. I realize that’s a subtle distinction — different models versus different versions of the same model.

But the Lite, Classic and Deluxe versions of our models somewhat literally build on top of one another, like different layers of toppings on an increasingly fancy burger. I’ll describe these methods in more detail in the sections below. First, a high-level overview of what the different versions account for.

The layers in FiveThirtyEight’s House forecast

Layer 1a, Polling (used by Lite, Classic and Deluxe): District-by-district polling, adjusted for house effects and other factors.
Layer 1b, CANTOR (used by Lite, Classic and Deluxe): A system that infers results for districts with little or no polling from comparable districts that do have polling.
Layer 2, Fundamentals (used by Classic and Deluxe): Non-polling factors, such as fundraising and past election results, that historically help in predicting congressional races.
Layer 3, Expert forecasts (used by Deluxe only): Ratings of each race published by the Cook Political Report, Inside Elections and Sabato’s Crystal Ball.

Lite is as close as you get to a “polls-only” version of our forecast, except that a lot of congressional districts have little or no polling. So we use a system we created called CANTOR 7 to fill in the blanks. It uses polls from states and districts that do have polling, as well as national generic congressional ballot polls, to infer what the polls would say in places with little or no polling of their own. The Lite forecast phases out CANTOR and becomes truly “polls-only” in districts that have a sufficient amount of polling.

The Classic version also uses local polls 8 but layers a bunch of non-polling factors on top of them, the most important of which are incumbency, past voting history in the state or district, fundraising and the generic ballot. These are the “fundamentals.” The more polling in a race, the more heavily Classic relies on the polls as opposed to the fundamentals. Although Lite isn’t quite as simple as it sounds, the Classic model is definitely toward the complex side of the spectrum. That complexity should theoretically buy accuracy, though: In the training data, 9 Classic miscalled 3.3 percent of House races, compared with 3.8 percent for Lite. 10 You should think of Classic as the preferred or default version of FiveThirtyEight’s forecast unless we otherwise specify.
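
As a rough illustration of how a sliding weight between polls and fundamentals can work, here is a sketch; the functional form and the constant k are our assumptions for this example, not the model’s actual parameters.

```python
def classic_estimate(poll_avg, fundamentals, n_effective_polls, k=3.0):
    """Blend a district's poll average with its fundamentals
    projection. The weight on polls rises with the amount of
    polling; k is an illustrative constant, not FiveThirtyEight's
    actual parameter."""
    w_polls = n_effective_polls / (n_effective_polls + k)
    return w_polls * poll_avg + (1.0 - w_polls) * fundamentals

# With little polling, the fundamentals dominate ...
print(classic_estimate(poll_avg=8.0, fundamentals=2.0, n_effective_polls=1))
# ... with lots of polling, the polls do.
print(classic_estimate(poll_avg=8.0, fundamentals=2.0, n_effective_polls=20))
```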

Finally, there’s the Deluxe flavor of the model, which takes everything in Classic and sprinkles in one more key ingredient: expert ratings. Specifically, Deluxe uses the race ratings from the Cook Political Report, Nathan Gonzales’s Inside Elections and Sabato’s Crystal Ball, all of which have published forecasts for many years and have an impressive track record of accuracy. 11

Within-sample accuracy of forecasting methods

Share of races called correctly based on House elections from 1998 to 2016

Forecast 100 Days Before Election Election Day
Lite model (poll-driven) 94.2% 96.2%
Fundamentals alone 95.4 95.7
Classic model (Lite model + fundamentals) 95.4 96.7
Expert ratings alone* 94.8 96.6
Deluxe model (Classic model + expert ratings) 95.7 96.9

* Based on the average ratings from Cook Political Report, Inside Elections/The Rothenberg Political Report, Sabato’s Crystal Ball and CQ Politics. Where the expert rating averages out to an exact toss-up, the experts are given credit for half a win.

So if we expect the Deluxe forecast to be (slightly) more accurate, why do we consider Classic to be our preferred version, as I described above? Basically, because we think it’s kind of cheating to borrow other people’s forecasts and make them part of our own. Some of the fun of doing this is in seeing how our rigid but rigorous algorithm stacks up against more open-ended but subjective ways of forecasting the races. If our lives depended on calling the maximum number of races correctly, however, we’d go with Deluxe.

Collecting, weighting and adjusting polls

Our House, Senate and governor forecasts use almost all the polls we can find, including partisan polls put out by campaigns or other interested parties. We had not traditionally used partisan polls in our forecasts, but they are a necessary evil for the House, where much of the polling is partisan. Having developed a system we like for handling partisan polls in our House forecasts, we’re also using them for our Senate and our governor forecasts.

However, as polling has gotten more complex, including attempts to create fake polls, there are a handful of circumstances under which we won’t use a poll:

Polls are weighted based on their sample size, their recency and their pollster rating (which in turn is based on the past accuracy of the pollster, as well as its methodology). These weights are determined by algorithm; we aren’t sticking our fingers in the wind and rating polls on a case-by-case basis. Also, the algorithm emphasizes the diversity of polls more than it has in the past; in any particular race, it will insist on constructing an average of polls from at least two or three distinct polling firms even if some of the polls are less recent.
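
Here is a hedged sketch of what a weighting function along those lines might look like; every constant in it (the reference sample size, the half-life, the rating multiplier scale) is an illustrative assumption, not FiveThirtyEight’s actual formula.

```python
import math

def poll_weight(sample_size, days_old, pollster_rating, half_life=14.0):
    """Illustrative poll weight; none of these constants are
    FiveThirtyEight's actual parameters.

    - Sample size enters on a square-root scale (a poll's precision
      grows with sqrt(n)), so it has diminishing returns.
    - Recency decays exponentially with a half-life in days.
    - pollster_rating is a multiplier near 1.0 derived from the
      firm's past accuracy and methodology.
    """
    size_term = math.sqrt(sample_size / 600.0)   # 600 = reference n
    recency_term = 0.5 ** (days_old / half_life)
    return size_term * recency_term * pollster_rating

print(poll_weight(sample_size=900, days_old=3, pollster_rating=1.1))
print(poll_weight(sample_size=400, days_old=21, pollster_rating=0.8))
```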

There are also three types of adjustments to each poll:

Our models use partisan and campaign polls, which typically make up something like half of the overall sample of U.S. House district polling. 15 Partisanship is determined by who sponsors the poll, rather than who conducts it. Polls are considered partisan if they’re conducted on behalf of a candidate, party, campaign committee, or PAC, super PAC, 501(c)(4), 501(c)(5) or 501(c)(6) organization that conducts a large majority of its political activity on behalf of one political party.

Partisan polls are subject to somewhat different treatment than nonpartisan polls in the model. They receive a lower weight, as partisan-sponsored polls are historically less accurate. And the house effects adjustment starts out with a prior that assumes these polls are biased by about 4 percentage points toward their preferred candidate or party. If a pollster publishing ostensibly partisan polls consistently has results that are similar to nonpartisan polls of the same districts, the prior will eventually be overridden.
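
One standard way to implement a prior that the data can override is a normal-normal Bayesian update. The sketch below uses the roughly 4-point prior from the paragraph above; the variance parameters are assumptions.

```python
import numpy as np

def house_effect(observed_diffs, prior_mean=4.0, prior_sd=2.0, obs_sd=5.0):
    """Estimate a partisan pollster's house effect (points toward its
    sponsor's side) by shrinking toward a prior. The ~4-point prior
    mean matches the article; the standard deviations are
    illustrative assumptions.

    observed_diffs: how far each of the firm's polls ran from the
    adjusted average of nonpartisan polls of the same races.
    """
    diffs = np.asarray(observed_diffs, dtype=float)
    n = len(diffs)
    # Conjugate normal-normal update: with enough polls, the data
    # overrides the prior, as the article describes.
    precision = 1.0 / prior_sd**2 + n / obs_sd**2
    return (prior_mean / prior_sd**2 + diffs.sum() / obs_sd**2) / precision

# A partisan firm whose polls keep matching nonpartisan ones sees its
# assumed 4-point bias shrink toward zero as evidence piles up.
print(house_effect([0.5, -1.0, 0.0]))          # still pulled by the prior
print(house_effect([0.5, -1.0, 0.0] * 10))     # mostly data-driven
```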

CANTOR: Analysis of polls in similar states and districts

CANTOR is essentially PECOTA or CARMELO (the baseball and basketball player forecasting systems we designed) for congressional districts. It uses a k-nearest neighbors algorithm to identify similar congressional districts and states based on a variety of demographic, 16 geographic 17 and political 18 factors. For instance, the district where I was born, Michigan 8, is most comparable to other upper-middle-income Midwestern districts such as Ohio 12, Indiana 5 and Minnesota 2 that similarly contain a sprawling mix of suburbs, exurbs and small towns. Districts can be compared to states, 19 so data from House races informs the CANTOR forecasts for Senate races, and vice versa. Gubernatorial races are not used in the CANTOR calculation for House and Senate races, but House and Senate races are used in CANTOR for gubernatorial races.

The goal of CANTOR is to impute what polling would say in unpolled or lightly polled states and districts, given what it says in similar states and districts. It attempts to accomplish this goal in two stages. First, it comes up with an initial guesstimate of what the polls would say based solely on FiveThirtyEight’s partisan lean metric (FiveThirtyEight’s version of a partisan voting index, which is compiled based on voting for president and state legislature) and incumbency. For instance, if Republican incumbents are polling poorly in the districts where we have polling, it will assume that Republican incumbents in unpolled districts are vulnerable as well. Then, it adjusts the initial estimate based on the district-by-district similarity scores.
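
Here is a stripped-down sketch of that two-stage idea: stage one produces a baseline margin for every district, and stage two spreads the polling “surprises” from polled districts to their nearest neighbors. The distance metric, weighting and k are assumptions; the real CANTOR is considerably more elaborate.

```python
import numpy as np

def cantor_impute(features, baseline_margins, polled, k=5):
    """Stage two of a CANTOR-like imputation (sketch).

    features:         (n_districts, n_features) standardized array
    baseline_margins: stage-one estimate for every district, based
                      on partisan lean and incumbency
    polled:           dict {district_index: polling average}
    """
    adjusted = baseline_margins.astype(float).copy()
    polled_idx = np.array(sorted(polled))
    # How far each polled district's polls run from its stage-one
    # baseline; these residuals get spread to similar districts.
    residuals = np.array([polled[i] - baseline_margins[i]
                          for i in polled_idx])
    for i in range(len(baseline_margins)):
        if i in polled:
            adjusted[i] = polled[i]  # trust the polls where they exist
            continue
        # k nearest polled neighbors by Euclidean distance.
        dists = np.linalg.norm(features[polled_idx] - features[i], axis=1)
        nearest = np.argsort(dists)[:k]
        weights = 1.0 / (dists[nearest] + 1e-6)
        adjusted[i] += np.average(residuals[nearest], weights=weights)
    return adjusted
```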

All of this sounds pretty cool, but there’s one big drawback. Namely, there’s a lot of selection bias in which races are polled. A House district usually gets surveyed only if one of the campaigns or a media organization has reason to think the race is close — so unpolled districts are less competitive than you’d infer from demographically similar districts that do have polls. CANTOR projections are adjusted to account for this.

Overall, CANTOR is an interesting method that heavily leans into district polling and gets as close as possible to a “polls-only” view of the race. However, in terms of accuracy, it is generally inferior to using …

The fundamentals

The data-rich environment in gubernatorial and congressional elections — 435 individual House races every other year, compared with just one race every four years for the presidency — is most beneficial when it comes to identifying reliable non-polling factors for forecasting races. There’s enough data, in fact, that rather than using all districts and states to determine which factors were most predictive, I instead focused the analysis on competitive races (using a fairly broad definition of “competitive”).

In competitive House districts with incumbents, the following factors have historically best predicted election results, in roughly declining order of importance:

In addition, in Pennsylvania, which underwent redistricting in 2018, the model accounts for the degree of population overlap between the incumbent’s old and new district. And in California and Washington state, it accounts for the results of those states’ top-two primaries.

The Senate model uses almost all the same factors for incumbents, but there are some subtle differences given that senators face election once every six years instead of once every other year. For instance, previous victory margin is less reliable in Senate races because more time has passed since the previous election. In addition, the Senate model uses more sophisticated data in calculating the effects of incumbency. Candidates in smaller, 23 more demographically distinct states 24 tend to have larger incumbency advantages. The Senate model also accounts for changes in the state’s partisan orientation since the seat was last contested. Finally, the Senate model uses a more advanced method of calculating candidate experience. 25

For gubernatorial races, we use the same factors as for Senate races, with two exceptions:

Note, however, that while the variables used in the gubernatorial model are largely the same as in the congressional ones, their weights can be a lot different. In particular, the generic ballot is somewhat less predictive in gubernatorial races than in congressional races (somewhere around two-thirds as much), and a state’s partisan lean is much less predictive than in congressional races (somewhere around one-third as much). A state that leans toward the Democrats by 12 points in a race for Congress would be predicted to do so by only about 4 points in a race for governor, for example.

In open-seat races, the model uses the factors from the list above that aren’t dependent on incumbency, namely the generic ballot, fundraising, FiveThirtyEight partisan lean, scandals, experience and (where applicable) top-two primary results. It also uses the results of the previous congressional election in the state or district for congressional elections, but this is a much less reliable indicator than when an incumbent is running for re-election. (Previous election results aren’t used at all in gubernatorial races without incumbents.)

But wait — there’s more! In addition to combining polls and fundamentals, the Classic and Deluxe models compare their current estimate of the national political climate against a prior based on the results of congressional elections since 1946, accounting for historical swings in midterm years and for presidential approval ratings. The prior is designed so that it phases out completely by Election Day.
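
A minimal sketch of a phase-out schedule with that property (the linear schedule itself is an assumption; the article only specifies that the prior’s weight reaches zero by Election Day):

```python
def blended_national_climate(current_estimate, historical_prior,
                             days_until_election, days_at_launch=100):
    """Blend the model's current read of the national climate with a
    prior from post-1946 election results. The linear schedule here
    is an assumption; the key property from the article is that the
    prior's weight hits zero on Election Day."""
    w_prior = min(max(days_until_election / days_at_launch, 0.0), 1.0)
    return w_prior * historical_prior + (1.0 - w_prior) * current_estimate

print(blended_national_climate(8.0, 5.0, days_until_election=60))  # a mix
print(blended_national_climate(8.0, 5.0, days_until_election=0))   # prior gone
```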

Incorporating expert ratings

Compared with the other steps, incorporating expert ratings and creating the Deluxe version of the model is fairly straightforward. We have a comprehensive database of ratings from Cook and other groups in House races and gubernatorial races since 1998 and in Senate races since 1990, so we can look up how a given rating corresponded, on average, with a margin of victory. For instance, going into the 2018 midterms, candidates who were deemed to be “likely” winners in their House races won by an average of about 12 points:

What do ratings like “lean Republican” really mean?
Expert Rating Average margin of victory
Toss-up 0 points
“Tilts” toward candidate 4 points
“Leans” toward candidate 7 points
“Likely” for candidate 12 points
“Solid” or “safe” for candidate 34 points

Based on House races since 1998.
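
In code, that historical translation is essentially a lookup table. This sketch uses the rounded averages from the table above; the sign convention and the function itself are our own.

```python
# Approximate historical margins from the table above (House races
# since 1998), in percentage points for the favored candidate.
RATING_TO_MARGIN = {
    "toss-up": 0.0,
    "tilt": 4.0,
    "lean": 7.0,
    "likely": 12.0,
    "solid": 34.0,   # "solid" and "safe" lumped together
}

def expert_implied_margin(rating, favored_party):
    """Translate a verbal rating into an implied margin.
    favored_party: +1 if the rating favors the Democrat, -1 if it
    favors the Republican (a convention assumed for this sketch)."""
    return favored_party * RATING_TO_MARGIN[rating]

print(expert_implied_margin("lean", -1))   # "Lean R" -> R +7
```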

But, of course, there are complications. One is that there’s an awful lot of territory covered by the “solid” and “safe” categories: everything from races that could almost be considered competitive to others where the incumbent wins every year by a 70-point margin. Therefore, the Deluxe forecast doesn’t adjust its projections much when it encounters “solid” or “safe” ratings from the experts, except in cases where the rating comes as a surprise because other factors indicate that the race should be competitive.

Also, although the expert raters are really quite outstanding at identifying “micro” conditions on the ground, including factors that might otherwise be hard to measure, they tend to be lagging indicators of the macro political environment. Several of the expert raters shifted their projections sharply toward the Democrats in early 2018, for instance, even though the generic ballot was fairly steady over that period. Thus, the Deluxe forecast tries to blend the relative order of races implied by the expert ratings with the Classic model’s data-driven estimate of national political conditions. Deluxe and Classic will usually produce relatively similar forecasts of the overall number of seats gained or lost by a party, therefore, even though they may have sharp disagreements on individual races.
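
One simple way to implement that blend, sketched below under assumed details: shift the expert-implied margins by a constant so that they agree with Classic about the national level, which preserves the experts’ relative ordering of races, then average race by race. The uniform shift and the 50/50 weight are assumptions.

```python
import numpy as np

def blend_expert_with_classic(expert_margins, classic_margins,
                              w_expert=0.5):
    """Keep the experts' relative ordering of races but anchor the
    overall level to the Classic model's estimate of the national
    climate (sketch; the shift and blend weight are assumptions)."""
    expert_margins = np.asarray(expert_margins, dtype=float)
    classic_margins = np.asarray(classic_margins, dtype=float)
    # Shift every expert-implied margin by a constant so the two sets
    # agree on the average national margin ...
    recentered = expert_margins + (classic_margins.mean()
                                   - expert_margins.mean())
    # ... then blend race by race.
    return w_expert * recentered + (1.0 - w_expert) * classic_margins
```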

Simulating the election and accounting for uncertainty

Sometimes what seem like incredibly pedantic questions can turn out to be important. For years, we’ve tried to design models that account for the complicated, correlated structure of error and uncertainty in election forecasting: If a candidate or a party overperforms the polls in one swing state, they’re likely to do so in other states as well, especially states that are demographically similar. Understanding this principle was key to understanding why Clinton’s lead wasn’t nearly as safe as it seemed in 2016.

Fortunately, this is less of a problem in constructing a congressional or a gubernatorial forecast; there are different candidates on the ballot in every state and district, instead of just one presidential race, and the model relies on a variety of inputs, instead of depending so heavily on polls. Nonetheless, the model accounts for four potential types of error in an attempt to self-diagnose the various ways in which it could go off the rails:

Error becomes smaller as Election Day approaches. In particular, there’s less possibility of a sharp national swing as you get nearer to the election because there’s less time for news events to intervene.
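
Putting the last two ideas together, here is a hedged sketch of a correlated simulation in which a shared national error shrinks as Election Day approaches while race-level error does not. All magnitudes are illustrative assumptions; the real model distinguishes more error components, including regional and demographic correlations.

```python
import numpy as np

rng = np.random.default_rng(2018)

def simulate_seats(mean_margins, days_until_election, n_sims=20_000,
                   national_sd_at_launch=3.0, district_sd=6.0):
    """Correlated election simulation (sketch; all error sizes are
    illustrative assumptions).

    mean_margins: forecast Democratic margin in each House race.
    """
    mean_margins = np.asarray(mean_margins, dtype=float)
    # National swing: one draw per simulation, shared by every race.
    # Its size shrinks toward a small floor as the election nears,
    # since there is less time for news events to intervene.
    national_sd = max(
        national_sd_at_launch * np.sqrt(days_until_election / 100.0), 0.5)
    national = rng.normal(0.0, national_sd, size=(n_sims, 1))
    # Race-by-race error: independent across districts.
    local = rng.normal(0.0, district_sd, size=(n_sims, len(mean_margins)))
    margins = mean_margins + national + local
    return (margins > 0).sum(axis=1)  # Democratic seats per simulation

seats = simulate_seats(rng.normal(0.0, 10.0, size=435),
                       days_until_election=60)
print("P(Dem majority) ~", (seats >= 218).mean())
```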

Nonetheless, you shouldn’t expect pinpoint precision in a House forecast, and models that purport to provide it are either fooling you or fooling themselves. Even if you knew exactly what national conditions were, there would still be a lot of uncertainty based on how individual races play out. And individual Senate and governor races are, at best, only slightly more predictable, as they can be highly candidate-driven.

Odds and ends

OK, that’s almost everything. Just a few final notes:

Editor’s note: This article is adapted from a previous article about how our election forecasts work.