November 28, 2007
High-Frequency Automated Trading (HFAT); Part 1
Backfilled vs Real-Time Data issues
If you are not sure what HFAT (High Frequency Automated Trading) is all about, please Google the topic. This post highlights some of the problems you may encounter when venturing into HFAT of stocks. The views expressed here are based on personal experiences and/or may be anecdotal; not everything that happens in real-time trading is easy to explain. If you have technical insight and see inaccuracies, please comment for the benefit of future readers.
Designing and implementing high frequency trading systems is, from a trader’s viewpoint, probably the ultimate experience. To see and hear trades executed every few seconds and see the profits rolling in should give any trader an unprecedented high.
The catch is that to design an HFAT system that works with real money is very different from designing one using local data. The smaller the time frame the greater the impact of small data discrepancies. In sub-minute time frames, your HFAT system may perform very differently with local data than with real-time, raw market data.
A typical problem is that real-time market data are delayed by up to several hundred milliseconds and that quotes may arrive out of sequence. What you see on your charts may be several quotes behind the trade that actually took place. This flawed data is what your trading system will be trading, and the system must be designed to work with it. The charts you see in AmiBroker are mostly backfilled and/or updated after trading hours. At the end of a trading day, your database will contain a mixture of backfilled data (in which time and data errors have been corrected) and raw, flawed data collected during the current day’s trading session. You may also have several lengthy data gaps, introduced when you shut down the system and/or lost your data feed.
While the procedure may vary for different data providers, quotes that are received in real time will lag in time. Since bar periods during live data collection are based on your computer clock, quotes may end up in the next bar due to their delayed arrival. The data used to backfill your database come from a different data server and will be time stamped. This allows AmiBroker to correct the position of quotes that were received out of sequence. This process removes the real time delays that were present when the data was received.
Since there are no delays with backfilled data, your backfilled data look ahead by several hundred milliseconds with respect to the data you will eventually be trading.
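The bar-shifting effect described above can be illustrated with a minimal Python sketch. It is not how AmiBroker or any data plug-in builds bars internally; the quote values, the fixed 0.3-second delay, and the function names are assumptions chosen only to show how the same quotes can produce different bars depending on whether they are keyed to the local arrival time or to the original time stamp.

```python
from collections import defaultdict

BAR_SECONDS = 5  # bar length used throughout the post


def bar_start(t, bar_seconds=BAR_SECONDS):
    """Return the start time (in seconds) of the bar containing time t."""
    return int(t // bar_seconds) * bar_seconds


def build_bars(quotes, use_arrival_time):
    """Group quotes into bars and return {bar_start: closing_price}.

    quotes: list of (trade_time, arrival_time, price) tuples, in seconds.
    use_arrival_time=True mimics live collection keyed to the local clock;
    use_arrival_time=False mimics a backfill keyed to the time stamp.
    """
    bars = defaultdict(list)
    for trade_time, arrival_time, price in quotes:
        key_time = arrival_time if use_arrival_time else trade_time
        bars[bar_start(key_time)].append(price)
    # The close of each bar is the last price that was placed in it.
    return {start: prices[-1] for start, prices in sorted(bars.items())}


# Hypothetical quotes: each one arrives 0.3 s after the trade took place.
quotes = [
    (3.0, 3.3, 100.00),
    (4.8, 5.1, 100.05),  # traded in the first bar, arrives in the second
    (7.0, 7.3, 100.02),
]

live = build_bars(quotes, use_arrival_time=True)
backfilled = build_bars(quotes, use_arrival_time=False)
print("live:      ", live)
print("backfilled:", backfilled)
```

In this made-up example the quote traded at 4.8 seconds lands in the second bar when collected live, so the first bar closes at a different price in the two versions of the data.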
It is not unusual to develop a system with 5-second backfilled data (where all bad ticks and time-stamp errors have been corrected by the data provider) and obtain Holy Grail performance, only to find out that when traded with real-time streaming data (where the data is delayed and contains bad ticks and time-stamp errors), the system is a total failure. The following charts illustrate this problem. The data to the left of the red line is backfilled and the data to the right of the red line is data collected in real time. White is the equity.

You will not be able to visually judge whether data are backfilled or raw. The differences will only show up by running a trading system on the data; your trading system may be the only way to distinguish between backfilled and raw data. The chart below shows a close up of the data change.

Backfilling the above database and performing another Backtest over the same period produces a very different equity:

There is no guarantee that a system developed on one type of data will perform equally well with the other. When you first encounter a major equity drawdown, you may assume that this was just “a bad day”; after all, all trading systems have them. You may have developed and backtested your system over thousands of trades, covering a period of six months or more. You have been a good student and have used all the recommended methods to validate your trading system. You have tested in- and out-of-sample, applied intelligent optimizations, used Walk-Forward testing, performed Monte-Carlo analysis, and the list goes on. After being so thorough, how could you go wrong? You are ready to trade real money tomorrow and make your first 50% in one day!
The point is that all this effort is wasted time if the data used during development aren’t 100% identical to what you will be trading with.
The best way to develop an HFAT system is to use real live market data. The earlier you change from local or edemo data to real data, the more time you will save, and the more disappointments you will be spared. An HFAT system can never be completed offline, with a local database, or with simulated edemo data. Its design must always include a significant paper-trading and real-money phase.
Another problem when trading your IB paper-trading (simulated) account is that you do not know the rules Interactive Brokers uses to decide whether an order should be executed or not. These execution criteria may change without warning. This imposes an artificial order on your executions; the simulated market conditions will be different from those encountered in real trading. You may well develop a trading system that exploits IB’s way of processing simulated orders to give you unreal performance, but such a system would fail in real trading.
Also, your paper-trades are not seen by, and cannot influence, the market. When trading real money your orders could be setting a new High or a Low, or if you are trading large sums, you could draw the price up or down. This means that even if your system performs extremely well in simulated trading, this is no guarantee that your system will perform well trading real money.
Using your simulated account to validate your system should never be your final validation before trading for profits; you should always include a real-money evaluation phase in your development plan. Your first real trades should never be to make money; they should be planned to validate your system under varied conditions.
Market behavior is very complex; be prepared for the unexpected and never skip a development step because something works extremely well. For example you might be testing your system using your IB simulated paper-trading account and see your profits skyrocket too fast to follow, perhaps having 90% winners and RARs that are out of this world. When this happens, it is extremely exciting and fun to watch; it is a rare experience that must be appreciated. It suggests that Holy-Grails are possible. But are they? Such favorable trading conditions may last for a few hundred trades, a few hours, or perhaps a few days. This can happen when technical conditions and market behavior are all just perfect for your system. Some unknown factor just made everything work perfectly. When you experience this, you’ll be analyzing your charts, trading log, execution report, etc. for weeks to follow. The fact is that it may never happen again, and you may never know what really happened.
Order and Position Status
IB Position size reporting may be erratic, is always delayed, and may include transient information. If you are trading fast and you use the IB Position Size to determine your next action, this will be a problem. This is especially the case with reversal systems, where Covers may be processed before the Buys and there may be many partial fills. For example, if you are reversing 100 shares, going alternately Long and Short, you might read position sizes of 0, 100, 200, and even 300 shares. Do not base your system’s action solely on a single reading of the position size; if you do, your protective mechanisms will shut down your system many times a day. If a position is not what it is supposed to be on 5 consecutive queries (at quote interval), you may want to close all positions, suspend operation and continue later, or shut down the system and retry later.
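The "5 consecutive queries" rule can be sketched as a small counter that resets whenever a reading matches the expected position. This is only an illustration in Python, not AmiBroker or IB code; the class name, method name, and the sample readings are hypothetical.

```python
class PositionWatchdog:
    """Flag a position mismatch only after several consecutive bad readings."""

    def __init__(self, max_mismatches=5):
        self.max_mismatches = max_mismatches
        self.mismatch_count = 0

    def check(self, expected, reported):
        """Return True once the mismatch has persisted long enough to act on."""
        if reported == expected:
            # A single good reading clears any transient mismatches.
            self.mismatch_count = 0
            return False
        self.mismatch_count += 1
        return self.mismatch_count >= self.max_mismatches


# Hypothetical readings taken at quote interval; the expected position is 100.
watchdog = PositionWatchdog(max_mismatches=5)
readings = [0, 200, 100, 300, 0, 200, 0, 300]
alarms = [watchdog.check(100, reported) for reported in readings]
print(alarms)
```

With these readings the early transient values are ignored because a correct reading arrives in between; only the final run of five mismatches in a row raises the alarm, at which point the system could close positions or suspend operation as described above.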
Reporting of order status appears more reliable and stable. Usually it seems unnecessary to repeat Order Status queries.
IB Snapshots
The matter of Snapshots is not addressed in this post; however, it is extremely important for real-time traders to understand how IB compresses and transmits its data. This topic has been discussed on several forums. For more information on IB data in general, please read the following threads:
AmiBroker user group: Interactive Brokers Plug-in dropping volume data
IB’s Discussion Board: Globex Ticks snapshot or reality?
AmiBroker User Group: AB Tick Bar Analysis
The IB Maximum Message Rate
IB limits the rate at which you can transmit order-related messages; the rate of queries is not limited. The current limit is 50 messages per second. If you exceed this rate, IB will produce an error code, and if you continue to exceed it, IB will suspend your connection. This, of course, should be prevented at all costs. The message rates are documented here. How to introduce real-time delays measured in milliseconds is documented in the post on High Precision Interval and Delay Timing.
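One way to stay under such a limit is a sliding-window throttle that tracks the send times of recent messages. The following is a generic Python sketch, not IB's API or the method from the timing post mentioned above; the class, the `send` callback, and the default of 50 messages per one-second window (taken from the figure quoted above) are assumptions for illustration.

```python
import time
from collections import deque


class MessageThrottle:
    """Sliding-window limiter: at most max_messages per `window` seconds."""

    def __init__(self, max_messages=50, window=1.0, clock=time.monotonic):
        self.max_messages = max_messages
        self.window = window
        self.clock = clock
        self.sent = deque()  # send times that are still inside the window

    def wait_time(self):
        """Seconds to wait before the next message may be sent (0.0 if none)."""
        now = self.clock()
        # Forget messages that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.max_messages:
            return 0.0
        return self.window - (now - self.sent[0])

    def record_send(self):
        """Call right after a message has been transmitted."""
        self.sent.append(self.clock())


def send_throttled(throttle, send, message):
    """Delay if necessary, then pass the message to a caller-supplied sender."""
    delay = throttle.wait_time()
    if delay > 0:
        time.sleep(delay)
    send(message)
    throttle.record_send()
```

In practice it may be prudent to configure the throttle slightly below the published limit, for example `MessageThrottle(max_messages=45)`, to leave a margin for timing jitter.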
Internet Delays
Order and Position status is subject to an Internet delay of 50-400 milliseconds. This delay will vary with your location and type of Internet connection to IB. You can test this delay by pinging the IB server. To do this on Windows XP, open Start->Run, type ping gw1.ibllc.com in the command box, and run it. A window, as shown below, will appear and show you the delays for three consecutive queries (pings) to IB:

If you encounter excessive delays or cannot connect at all, you can get more details about how your connection is routed by running tracert gw1.ibllc.com in the same manner. You may want to browse the Technical FAQ at IB for related items.
Edited by Al Venosa.
Filed by Herman at 8:55 am under Real-Time System Design


It seems to me that this:
It is not unusual to develop a system with 5-second local data and obtain Holy Grail performance only to find out that, after a backfill of data, the system is a total failure.
should read like this instead:
It is not unusual to develop a system with 5-second local data (backfilled, timestamped, analyzed with zero delay) and obtain Holy Grail performance only to find out that when traded in realtime (streaming data, non-zero delay), the system is a total failure.
Is the first paragraph above really meant to say the same thing as the second? Or is it saying something else?
BTW, I think that this is a great discussion/illustration of a critical topic for HFAT!
Thank you Progster, you are correct and I revised the sentence.
I really like this article. It repays repeated reading!
“Since bar periods during live data collection are based on your computer clock, quotes may end up in the next bar due to their delayed arrival.”
This is a general and eye-catching statement. Is it a property of the data feed being discussed, a property of AB realtime bar building, or a combination of the two? Would this be true if using ESignal instead of IB? (etc.)
IOW, are there not timestamps on the streaming data as it arrives? Are those timestamps not used by AB when building bars?
Is there a way to save a day of streaming data, then set it aside, get the backfill, save that, and then compare the two (either visually on a chart, or in Excel)?
Can a screen movie be used to demonstrate that bar building is based on local clock rather than timestamp of arriving data?
Hello Progster,
As I explained at the start of the article: “The views expressed here are based on personal experiences and/or may be anecdotal”. I am also indebted to my trading partners who helped me put this topic into context with practical observations. Personally, I am by no means an expert on how data are processed between the time the actual trade takes place at the market and the time we see it on our chart. My experience tells me that one could write a book on this. There are many absolutely critical aspects of real-time stock data collection and trading. Most are not covered anywhere.
AFAIK, live IB data doesn’t have a timestamp but IB Backfill data does (the data comes from different servers). eSignal timestamps its data but that doesn’t help because eSignal is subject to delayed market data just like anyone else. Live data should only contain as-it-becomes-available data, live data should never be corrected on the fly else our charts and trading signals would change retroactively. The pleasure of changing signals/prices is reserved for after backfill :-) With regards to the deceptive HG performance with 5-second backfilled data, this is true for IB as well as eSignal, and undoubtedly for all other data providers.
Yes, you can collect Raw data and compare it to backfilled data (that is how I produced the charts). In AmiBroker, Backfill cannot be turned OFF during live trading, so you have to be very careful that you do not inadvertently corrupt your raw data with backfill data. To facilitate real-time system development I create a new 5-second database each day and set it to Local before shutting down. Such personal raw databases are vital for real-time system development and cannot be purchased or downloaded from anywhere.
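For readers who want to do such a comparison outside the charts, here is a minimal Python sketch. It assumes both databases have already been exported and loaded into dictionaries keyed by bar time (the export step is not shown); the function name and the sample closes are made up for illustration.

```python
def compare_bars(raw, backfilled, tolerance=0.0):
    """Compare two {bar_time: close} dictionaries.

    Returns (missing_in_raw, missing_in_backfill, changed), where `changed`
    maps bar_time -> (raw_close, backfilled_close) for bars that differ
    by more than `tolerance`.
    """
    missing_in_raw = sorted(set(backfilled) - set(raw))
    missing_in_backfill = sorted(set(raw) - set(backfilled))
    changed = {
        t: (raw[t], backfilled[t])
        for t in sorted(set(raw) & set(backfilled))
        if abs(raw[t] - backfilled[t]) > tolerance
    }
    return missing_in_raw, missing_in_backfill, changed


# Hypothetical 5-second closes keyed by "HH:MM:SS".
raw = {"09:30:00": 100.00, "09:30:05": 100.02, "09:30:15": 100.10}
backfilled = {
    "09:30:00": 100.05,
    "09:30:05": 100.02,
    "09:30:10": 100.07,
    "09:30:15": 100.10,
}

missing_in_raw, missing_in_backfill, changed = compare_bars(raw, backfilled)
print("bars only in backfill:", missing_in_raw)
print("bars only in raw data:", missing_in_backfill)
print("bars with a different close:", changed)
```

The same idea extends to Open, High, Low, and Volume by storing a tuple per bar instead of a single close.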
Herman
Creating a new 5-sec database each day …
That’s serious business! Do you define it prior to the session start so that the first tick in the DB is the first tick of the session? Otherwise there would be backfill present as you mention, yes?
So if you are successful, you are accumulating dozens and ultimately hundreds of days of data, each in its own database. Is there any way to knit these together? If not, do you test your strategies on each day individually and then somehow aggregate the results?
It seems the issues you are raising might possibly make a good argument for the ability to optionally disable all backfill for a database.
Databases can be merged; however, if you make hundreds of trades a day you may only need a few days of data to develop with. Even so, it is a good idea to back up old data because things will happen that you want to analyze later. To prevent backfill during a new day, set the DB to 24 hours and start the system before the market opens, but develop with RTH data. Serious development should be done with live data.
RTH = regular trading hours (i.e. the day session)
(I had to look it up … :^)
I can add my personal experience to this thread. I got handed a rude awakening on my backtesting recently. I had been only saving RT data and not requesting any backfills –which would have caused nightly corrected data to overwrite my RT database. Recently, I realized that the AB + IQ Feed has been “backfilling” my previous day’s data every time I log in each day. On investigation, the corrected data always made my backtesting look a lot better than the RT data.
I use a 5 second database and 1 point range bars on ER futures. Rangebars close at one end of the bar or the other. However, with 5 second data, the price could easily move above or below the ideal close by a small amount. My realtime data closed close to the ends of the bars. However, the backfilled data was all over the bar, which is not reasonable. The backfilled data seems to have introduced a great deal more noise into the system. Noise that drives the closing price towards the middle of the bar will always improve the performance of a system based on range bars.
Anyway, I wrote compensating code today that restores integrity to my backtesting data. I used what I know about how range bars running on 5 second data behave to know precisely at what price the bar will close and execute a trade. I now only use H and L from the bars and ignore the close altogether. I kept a close watch on the data for a few days to make sure that there were no differences between RT data and corrected data backtest results.
I really think we should have a mode where backfilling previous data bars is not allowed. We really need to have a good contiguous database of the data as it arrived in real time for backtesting. It takes me months to collect enough data for a reasonable backtest of my system.
BTW my system started as a higher speed trading system until I discovered this problem. I had to reduce the frequency of trades by a factor of 10 to achieve good results with the RT data vs the backfilled data.
Thanks for sharing, Dennis. Yes, in intraday trading the Open and Close really have little significance. Filtered High and Low mean more. The problem, too, is that the market and the way data is delayed and modified before we see it don’t seem to be constant. One time I was running a fast (~200 trades/hour) trading system on my simulated account, and I made over 50% in less than two hours, with mostly winning trades. I tested the system on several tickers, and even the next day one of my partners had a similar, but not so extreme, experience with the same system. So it would seem we had a good system. However, system performance deteriorated over a period of about 24 hours until it failed completely. We tried but never got it to work the same way again. The question is what is the chance of 400-500 trades just being lucky? The merits of most systems are based on far fewer trades. One also wonders if some market anomaly occurred that was quickly detected by automated trading systems which, within 24 hours, overtraded the system and killed it. High speed trading is full of surprises :-)
Re: The question is what is the chance of 400-500 trades just being lucky?
Trading in 1 sec bars we have 6*60*60 == 21600 bars/day which is equivalent to 21600/252 == 85 years of EOD trading. If we only expect to see one black swan every 85 EOD years we can expect to see one once a day in 1 sec bars.
500 consecutive wins with a fair coin has an extremely small probability (more like an albino swan), but we don’t need that many consecutive wins to achieve 50% growth. A small bias on the coin is sufficient. For 500 samples the sample error is approx +- 4% so 56/44 W/L due to sample error is enough, even if ave%win == ave%loss.
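The sampling-error arithmetic can be checked with a short Python sketch. It treats each trade as an independent coin flip with equal-sized wins and losses, which is an assumption for illustration and not a description of any real system; the function names are made up.

```python
import math


def win_rate_standard_error(n_trades, p=0.5):
    """Standard error of the observed win rate for n independent trades."""
    return math.sqrt(p * (1.0 - p) / n_trades)


def prob_at_least(wins, n_trades, p=0.5):
    """Exact binomial probability of at least `wins` winners in n_trades."""
    return sum(
        math.comb(n_trades, k) * p**k * (1.0 - p) ** (n_trades - k)
        for k in range(wins, n_trades + 1)
    )


n = 500
se = win_rate_standard_error(n)
p_lucky = prob_at_least(280, n)  # 280 of 500 is a 56/44 win/loss split

print(f"standard error of win rate over {n} trades: {se:.4f}")
print(f"two standard errors: {2 * se:.4f}")
print(f"P(at least 56% winners with a fair coin): {p_lucky:.5f}")
```

For 500 trades one standard error is roughly 2.2%, so a figure of about ±4% corresponds to approximately two standard errors. The last line gives the chance of a fair coin producing a 56/44 split or better in a single run of 500 trades.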
I can’t find any antidote to (statistical) uncertainty in trading.
The best (quick) solutions I have come up with so far are:
a) portfolio diversification (by trading more than one system at a time and by using non-correlated systems) disperses the risk even further (if we lose from there we are the unluckiest person on the planet)
b) variance of the means (standard error of the mean) for > 1 OOS gives us more confidence than we can achieve with a single OOS test. Since we have ‘unlimited’ data available for intraday testing, this is a luxury we can afford. The logic there is that if we observe extreme results in 5 OOS samples, we either have 5 black swans OR 5 white swans (a good system); Occam’s razor tells us that the latter is more likely to be the case (still no certainty, however).
So - microtrading requires different evaluation techniques than macrotrading?
AND low variance between tests is the best measure of confidence?
I’m in the process of building my own ATS program. One of the features is that it will automatically adjust the size of the bars in ticks rather than by timeframe, together with the detection of triangle ends, whichever way they come. HINT: use some range filter to reduce the noise (like Heikin) and use its parameters to calculate your formula. Maximum loss on a self-adjustable range of 1 to 9 ticks is 4, using a trailing stop that is also adjustable to the size of the bars.
good trading.