Standard Data Feed Requirements for Datafeed-based indexing

The Hawksearch service gives online retailers and publishers the ability to drive a rich, compelling user experience that guides visitors to the products and information they are seeking. One of the main integration steps is providing the standard set of data feeds that the Hawksearch solution expects.

Data Feed Requirements

The Hawksearch service powers the search, navigation, and merchandising aspects of a web-based application. The Hawksearch solution does not index your entire database; it requires only the data needed to drive an exceptional search experience. For example, a typical online retailer would need their product catalog indexed. Some data values that would be included in a data feed are product title, description, price, and manufacturer name.

Typically, the following files are included in a set of feeds sent to Hawksearch:

  • items.txt
  • attributes.txt
  • content.txt
  • hierarchy.txt
  • timestamp.txt

Occasionally, additional feeds are required. If you need a feed other than those described in this document, please discuss it with your Hawksearch representative. Some of the common additional feeds are:

  • 3rd party rating feed
  • Custom sorting feed (for facet values)
  • Custom pricing feed

Sample Data Feeds

Full Data Feed

  • items.txt
unique_id	name	url_detail	image	price_retail	price_sale	price_sort	group_id	description_short	description_long	sku	sort_default
ABC123	Leather Sandals	http://onlinestore.com/products/detail.html?pid=ABC123	http://onlinestore.com/images/abc123.jpg	89.95	59.95	57.95		These leather sandals are great for the racetrack.  Stylish and sleek!	These leather sandals are great for the racetrack.  Stylish and sleek!	TEVA9825717	817
ABC12345	Leather Sandals Black	http://onlinestore.com/products/detail.html?pid=ABC123	http://onlinestore.com/images/abc123.jpg	89.95	59.95	57.95	ABC123	These leather sandals are great for the racetrack.  Stylish and sleek!	These leather sandals are great for the racetrack.  Stylish and sleek!	TEVA9825717	817
XYZ789	Panini Maker	http://onlinestore.com/products/detail.html?pid=xyz789	http://onlinestore.com/images/xyz789.jpg	49.95	39.95	32.95		Bring Rome to your kitchen!	This Panini Maker is sure to bring tasty and unique sandwiches to any lunchtime meal.  It is quick and easy to clean up as well.	KRUPS1251-125	571

  • attributes.txt
unique_id	key	value
ABC123	Color	Brown
ABC123	Size	12
ABC123	hierarchy_id	1200
XYZ789	Color	Black
XYZ789	hierarchy_id	1500
XYZ789	hierarchy_id	120
  • content.txt
unique_id	name	url_detail	description_short
C101	The key to happiness?  Never stop learning!	http://blog.onlinestore.com/2010/06/04/key-to-happiness-2/	Waaaay back when my husband David and I began dating, after years of friendship, he told me that most of the people he was working with were very, very beige. Not that they dressed in tan tones, but rather that they lived in tan tones; not a colorful character in the bunch. Of course I just had to ask if I was beige, too.
C102	What do readers say about our pizza crust? "AMAZING!"	http://blog.onlinestore.com/2010/05/22/pizza-crust/	"Wow!! I've been gluten free for 4 months and I thought my other recipes were as good as it was going to get...  This tastes like the REAL thing! My picky non-celiac husband said it tasted like real pizza!" - Ingrid, Maine
  • hierarchy.txt
hierarchy_id	hierarchy_name	parent_hierarchy_id	sort_order
1	Category	0	
100	Shoes	1	2
110	Home	1	1
1100	Athletic	100	3
1200	Sandals	100	4
1300	Bedroom	110	10
1400	Bathroom	110	11
1500	Kitchen	110	5
120	Wedding	1	6
2	Region	0	
200	United States	2	1
210	Canada	2	2
220	Mexico	2	3
2100	Midwest	200	6
2110	Pacific Northwest	200	5
2120	Southwest	200	2
2130	West Coast	200	1
2140	East Coast	200	3
2150	Northeast	200	4

  • timestamp.txt
2016-05-29T08:15:30-05:00
dataset	full
attributes.txt	15000
items.txt	2000
hierarchy.txt	500
content.txt	50

Partial Data Feed

  • items.txt
unique_id	name	url_detail	image	price_retail	price_sale	price_sort	group_id	description_short	description_long	sku	sort_default	item_operation
ABC123	Leather Sandals	http://onlinestore.com/products/detail.html?pid=ABC123	http://onlinestore.com/images/abc123.jpg	89.95	59.95	57.95		These leather sandals are great for the racetrack.  Stylish and sleek!	These leather sandals are great for the racetrack.  Stylish and sleek!	TEVA9825717	817	A
ABC12345	Leather Sandals Black	http://onlinestore.com/products/detail.html?pid=ABC123	http://onlinestore.com/images/abc123.jpg	89.95	59.95	57.95	ABC123	These leather sandals are great for the racetrack.  Stylish and sleek!	These leather sandals are great for the racetrack.  Stylish and sleek!	TEVA9825717	817	U
XYZ789	Panini Maker	http://onlinestore.com/products/detail.html?pid=xyz789	http://onlinestore.com/images/xyz789.jpg	49.95	39.95	32.95		Bring Rome to your kitchen!	This Panini Maker is sure to bring tasty and unique sandwiches to any lunchtime meal.  It is quick and easy to clean up as well.	KRUPS1251-125	571	D

  • attributes.txt
unique_id	key	value
ABC123	Color	Green
ABC123	Size	9
ABC123	hierarchy_id	1200
ABC12345	Color	Black
ABC12345	Size	12
ABC12345	hierarchy_id	1200

  • content.txt
unique_id	name	url_detail	description_short	content_operation
C101	The key to happiness?  Never stop learning!	http://blog.onlinestore.com/2010/06/04/key-to-happiness-2/	Waaaay back when my husband David and I began dating, after years of friendship, he told me that most of the people he was working with were very, very beige. Not that they dressed in tan tones, but rather that they lived in tan tones; not a colorful character in the bunch. Of course I just had to ask if I was beige, too.	A
C102	What do readers say about our pizza crust? "AMAZING!"	http://blog.onlinestore.com/2010/05/22/pizza-crust/	"Wow!! I've been gluten free for 4 months and I thought my other recipes were as good as it was going to get...  This tastes like the REAL thing! My picky non-celiac husband said it tasted like real pizza!" - Ingrid, Maine	D
    
  • hierarchy.txt
hierarchy_id	hierarchy_name	parent_hierarchy_id	sort_order
1	Category	0	
100	Shoes	1	2
110	Home	1	1
1100	Athletic	100	3
1200	Sandals	100	4
1300	Bedroom	110	10
1400	Bathroom	110	11
1500	Kitchen	110	5
120	Wedding	1	6
2	Region	0	
200	United States	2	1
210	Canada	2	2
220	Mexico	2	3
2100	Midwest	200	6
2110	Pacific Northwest	200	5
2120	Southwest	200	2
2130	West Coast	200	1
2140	East Coast	200	3
2150	Northeast	200	4

  • timestamp.txt
2016-05-29T08:15:30-05:00
dataset	partial
attributes.txt	6
items.txt	3
hierarchy.txt	500
content.txt	3

Hawksearch Standard Data Feed Format

Hawksearch has a Standard Data Feed Format. This format is meant to be comprehensive, flexible, and easy for all clients to create. The standard format is designed so that any number of attributes can be added for the items without increasing the file size too much. If you require additional columns or would like to remove a required column, please consult with your Hawksearch representative before making these changes.

Flat-File Format Properties

Name	Description
Column Delimiter	Tab, Semicolon, Comma
Column Headers	Required; must be lower-cased
Row Delimiter	Unix format (\n as the row delimiter)
File Name	Lowercase and named as: items.txt, content.txt, attributes.txt, hierarchy.txt, timestamp.txt
Data Quality	The data in the file should follow strict CSV standards. For the standard CSV format, please reference the link below:

http://www.ietf.org/rfc/rfc4180.txt

Double Quote Requirements

In any case where a field value contains a double quote, the entire field value must be enclosed in double quotes AND each double quote within the value must be escaped by a preceding double quote. This indicates that the interior double quote is not the end of the data value.

  1. Example 1
    The value for a field is: Special rate "1.79"
    The value of the field would be: "Special rate ""1.79"""
  2. Example 2
    The value for a field is: Blue Rug – 36" x 48"
    The value of the field would be: "Blue Rug – 36"" x 48"""

Line/Carriage Return Requirements

In any case where a field value contains a line return or a carriage return, the entire field value needs to be enclosed in double quotes. Without this, the import process will interpret the carriage return as the beginning of a new record.
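Python's built-in csv module applies exactly these quoting rules when writing tab-delimited data. The sketch below (field names borrowed from the sample feeds; the row data is invented for illustration) shows how a value containing double quotes or a line break comes out escaped:

```python
import csv
import io

def write_feed_rows(rows, fieldnames):
    """Write header + rows as a tab-delimited feed, quoting per RFC 4180."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", quotechar='"',
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(fieldnames)
    writer.writerows(rows)
    return buf.getvalue()

# A value containing double quotes or a line break is wrapped in double
# quotes, and each interior double quote is doubled.
out = write_feed_rows(
    [["ABC123", 'Special rate "1.79"', "Line one\nLine two"]],
    ["unique_id", "name", "description_short"],
)
print(out)
```

The second field is emitted as `"Special rate ""1.79"""` and the third is wrapped in quotes so its embedded line break is not read as a new record.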

Item Data Feed

File Name: items.txt

The Items Feed is applicable to e-commerce sites. The file consists of records that describe each product. Each record is represented by a unique ID and other values that support this unique record. The additional columns contained in this data feed file have a one-to-one relationship with the unique ID.

The unique IDs cannot be duplicated in the file. A unique ID will never have two product titles or two retail prices.

If your site contains configurable products, with a single parent product and multiple child products associated with it, you can specify these as separate line items in the item data feed. Any information that is common to both the parent and the child (for example, the description) can be repeated in the columns for both the parent and child items. To specify the relationship between the parent and child items, specify the unique_id of the parent item in the group_id column of each child's line item.

Please reference the items.txt sample file provided with the data feed guidelines for an example. In that sample, item ABC123 is a parent, and item ABC12345 is a child that references the id of the parent in the group_id column to specify the relationship. The group_id will be used to roll up items.

The creation of this data feed file may involve table joins in the client's data layer, but Hawksearch expects one file. For each row, values that don't exist (e.g. sale price) can be left blank. If additional data values are required, they should be added to the attributes.txt file. items.txt is the filename to use for the Item Data Feed. Also, the column headers must match the column names listed below. The column names must be lower-cased.
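As a sketch of this file-creation step (the column list mirrors the table below; the item data is invented for illustration), a csv.DictWriter can emit items.txt and leave blank values for columns a row does not have:

```python
import csv

COLUMNS = ["unique_id", "name", "url_detail", "image", "price_retail",
           "price_sale", "price_sort", "group_id", "description_short",
           "description_long", "sku", "sort_default"]

# Hypothetical rows joined from the client's data layer; any missing
# column (e.g. price_sort, group_id) is written as an empty string.
items = [
    {"unique_id": "ABC123", "name": "Leather Sandals",
     "url_detail": "http://onlinestore.com/products/detail.html?pid=ABC123",
     "image": "http://onlinestore.com/images/abc123.jpg",
     "price_retail": "89.95", "price_sale": "59.95"},
]

with open("items.txt", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n", restval="")
    writer.writeheader()
    writer.writerows(items)
```

`restval=""` is what leaves non-existent values blank rather than failing.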

Pre-Configured Data Columns

Name	Data Description	Required
unique_id	Unique alphanumeric ID. Must not be duplicated.	Yes
name	Title of item	Yes
url_detail	URL of Record Detail Page	Yes
image	URL of Thumbnail Image	Yes
price_retail	Floating Point Value – Retail Price	Yes
price_sale	Floating Point Value – Sale Price	Yes
price_sort	Floating Point Value – Price to be used for sorting purposes	No
group_id	Rollup key. If used, this field must be filled in for all items.	No
description_short	Searchable Short Description	No
description_long	Long Description	No
sku	Alphanumeric Manufacturer SKU / Model number. Only needed if different than the unique_id value.	No
sort_default	Default sort, if available on the site, for items based on an integer rank calculated on the site. NOTE: For search requests this is used as a secondary sort for all items that have the same score. The score is calculated based on the keyword that the user used and the searchable data associated with the item.	No
item_operation	When using partial updates, this column will need to be filled out on partial files. "D" indicates that the item will need to be deleted. "A" indicates an item to be added. "U" indicates an update to an item. For full files, you can leave this column empty since it will be ignored.	No

Please note that:

  • The columns not marked as required do not need to exist in items.txt
  • Make sure all required column names are present
  • If you wish to add columns to the file please discuss with the Hawksearch Professional Services Team

Content Data Feed

File Name: content.txt

How-to articles and non-product content can be indexed as well. Similar to the Item Data Feed, each record is represented by a unique ID and other values that support this unique record. The additional columns contained in this data feed file have a one-to-one relationship with the unique ID. Across both the items.txt and content.txt files, the unique_id needs to be unique. The unique_id should also not change, as this would affect tracking.

The creation of this data feed file may consist of table joins on the client’s data layer, but Hawksearch expects one file. For each row, values that don’t exist (e.g. image) can be left blank.

Also, the column headers must match the column names listed below. The column names are case-sensitive.

Pre-Configured Data Columns

Name	Data Description	Required
unique_id	Unique alphanumeric ID. Must not be duplicated.	Yes
name	Title	Yes
url_detail	URL of Record Detail Page	Yes
image	URL of Thumbnail Image	No
description_short	Description – Since this is non-product content, include the full text of the content. Strip out all tab, carriage return, and line feed characters.	No
content_operation	When using partial updates, this column will need to be filled out on partial files. "D" indicates that the content will need to be deleted. "A" indicates content to be added. "U" indicates an update to a piece of content. For full files, you can leave this column empty since it will be ignored.	Yes for Partial Updates

Please note that:

  • Attributes can be associated with content if the unique_id values match.
  • Attributes can be associated using the Attributes File (as long as unique_id values aren’t duplicated), or by creating a new Content Attributes File. Please consult with your Hawksearch representative to determine the best option for your implementation.
  • When adding a category hierarchy that is specific to content, it must be added to the Hierarchy feed with a new root. (That root will have a parent_hierarchy_id of 0.)

This is a screenshot example of the content.txt file:

Attributes Data Feed

File Name: attributes.txt

The Attributes Data Feed file consists of records that relate to unique IDs. There may be multiple records related to a unique ID. Each record consists of a unique ID, an attribute key name, and an attribute value.

For example, ten rows can exist in the Attributes Data Feed that relate to one unique ID. These ten rows describe that the unique ID is in five different product categories, has three different colors, is for a woman, and is a clearance item.

The creation of this data feed file may involve table joins in the client's data layer. Hawksearch expects one file, attributes.txt, to include all attributes related to the unique ID. To add additional attributes in the future, additional records would be added to attributes.txt.
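A minimal sketch of generating attributes.txt, assuming the attributes have already been collected per unique ID (the data shown is illustrative). Note how an item with several values for the same key simply gets several rows:

```python
import csv

# Hypothetical attribute data keyed by unique_id; each (key, value) pair
# becomes one row in attributes.txt.
attributes = {
    "ABC123": [("Color", "Brown"), ("Color", "Navy"),
               ("Size", "12"), ("hierarchy_id", "1200")],
}

with open("attributes.txt", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerow(["unique_id", "key", "value"])
    for unique_id, pairs in attributes.items():
        for key, value in pairs:
            writer.writerow([unique_id, key, value])
```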

Pre-Configured Data Columns

Name	Data Description	Required
unique_id	Unique alphanumeric ID of the item the attribute belongs to. May appear in multiple rows (one per attribute).	Yes
key	The name of the attribute	Yes
value	The value for the named attribute	Yes

Please Note:

  • Attribute keys are mapped in the Hawksearch database by lower-casing all values, replacing spaces with an underscore, and removing special characters (except underscore). This applies to the stored field name; the display name is retained for facet display.
  • Attribute values should be standardized across each attribute key (e.g. Blue vs. Navy or Indigo) and cased exactly as they should appear in the filters. For example, if the color value is blue, use consistent casing for all items with that color value, matching what you want to appear in the facet values (e.g. Blue).
  • Hierarchy ID values should consist of the lowest-level mapping of the hierarchy (category or other). Hawksearch can relate higher-level taxonomy items to the item by moving up the parent_hierarchy_id column, so there is no need to reference the parents.
  • If an item has more than one value for an attribute, insert multiple rows with the same key.
  • If an item does not have a value for an attribute, no row is needed for that unique_id and key combination. For example, if unique_id ABC123 does not have a value for the Color attribute, do not create a row in the attributes file for ABC123 Color with a null or blank value column. This makes the file unnecessarily large.
  • For configurable products, the data used for display and sorting will be pulled from the parent product.
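The key-mapping rule in the first bullet can be sketched as a small normalization function. This is an illustrative reconstruction of the described rule, not Hawksearch's actual implementation:

```python
import re

def normalize_key(display_name: str) -> str:
    """Lower-case, replace spaces with underscores, and strip special
    characters (except underscore), per the attribute-key mapping rule."""
    key = display_name.lower().replace(" ", "_")
    return re.sub(r"[^a-z0-9_]", "", key)
```

For example, a display name like "Heel Height (in)" would be stored as the field name `heel_height_in`, while "Heel Height (in)" remains the display name for the facet.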

This is a screenshot example of attributes.txt:

Hierarchy Data Feed

File Name: hierarchy.txt

Hawksearch supports multi-level hierarchies. For rapid deployment, we require the hierarchy.txt file to represent any hierarchical attributes that your data supports. Usually this is a category hierarchy, but other hierarchical attributes can also be defined. It is a straightforward way to represent the hierarchy in a flat-file format and can support multi-level hierarchies. Unique IDs map to these hierarchies through the Attributes Data Feed (attributes.txt).

As with all data feeds, any customization to this feed will involve the Hawksearch Professional Services Team. Multi-dimensional hierarchies can be accommodated with customizations. An example of this is a category that has two parent_hierarchy_id values to map. If your data requires this, please discuss with your Hawksearch representative. It may require additional scoping for your project.
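The parent_hierarchy_id walk-up that lets higher-level taxonomy items be related to an item can be sketched as follows, using a few rows from the sample hierarchy.txt:

```python
# (hierarchy_id, hierarchy_name, parent_hierarchy_id) rows from the
# sample hierarchy.txt; an item only references the lowest level (1200).
rows = [
    ("1", "Category", "0"),
    ("100", "Shoes", "1"),
    ("1200", "Sandals", "100"),
]
parents = {h_id: parent for h_id, _name, parent in rows}
names = {h_id: name for h_id, name, _parent in rows}

def full_path(hierarchy_id):
    """Walk parent_hierarchy_id upward until the root (parent 0)."""
    path = []
    while hierarchy_id != "0":
        path.append(names[hierarchy_id])
        hierarchy_id = parents[hierarchy_id]
    return " > ".join(reversed(path))

print(full_path("1200"))  # Category > Shoes > Sandals
```

This is why an item's attributes.txt row only needs the lowest-level hierarchy_id: the ancestors are recoverable from the hierarchy feed itself.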

Pre-Configured Data Columns

Name	Data Description	Required
hierarchy_id	Unique alphanumeric ID. Must not be duplicated.	Yes
hierarchy_name	The display name of the hierarchy level	Yes
parent_hierarchy_id	The hierarchy ID of the parent hierarchy. For top-level properties, use 0 (zero) as the parent_hierarchy_id. Example: Category or Region would always have a parent_hierarchy_id equal to 0. See the example file below.	Yes
sort_order	The sort order value that should be used when displaying this in the filter, if available	No

This is a screenshot example of hierarchy.txt. This example shows two properties that have a hierarchy structure; both would be used to create two nested facets. If you have only one hierarchy property to define, the attribute at the top (i.e. Category) will always have a parent_hierarchy_id of 0:

Example: What is Parent Hierarchy Id?

Timestamp / Control File

File Name: timestamp.txt

To ensure your data feeds are ready for processing, we recommend adding a timestamp/control file to the set of files that is generated by your site process. This ensures that proper checks will be in place to validate the files before they are indexed.

The timestamp file contains the following details:

  • The time that your process finished generating the data feeds in UTC
  • Whether the dataset is a full or partial feed
  • The name of the file followed by the count of records for each of the files
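A minimal sketch of writing timestamp.txt with those three pieces of information (the record counts here are illustrative and, as required, exclude column headers):

```python
from datetime import datetime, timezone

def write_timestamp(counts: dict, dataset: str = "full"):
    """Write timestamp.txt: generation time, dataset type, then one
    filename/record-count row per feed file (headers not counted)."""
    with open("timestamp.txt", "w", newline="", encoding="utf-8") as f:
        f.write(datetime.now(timezone.utc).isoformat() + "\n")
        f.write(f"dataset\t{dataset}\n")
        for filename, count in counts.items():
            f.write(f"{filename}\t{count}\n")

write_timestamp({"items.txt": 2000, "attributes.txt": 15000})
```

Writing this file last, after all feed files are complete, is what makes it a reliable "ready for processing" signal.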

Every time the Hawksearch process runs to index the new data, we use the data provided in the timestamp file to run a series of checks.

Download Timestamp File

The timestamp file is downloaded from the web accessible URL or the FTP server. This will enable the Hawksearch process to verify that the site is generating files at an agreed upon frequency.

Timestamp check

The first check consists of checking the date and time of the files provided. For example, if the Hawksearch process expects the site to generate a file every 3 hours but the timestamp file is from 6 hours prior, Hawksearch will not download or reprocess the files. The process will throw an error to notify the team that the files are out of date.

Partial or Full Feed

The next check consists of establishing if the data files are full or partial feeds. Based on this information, we run the appropriate process.

File Name Check

A check will be done on the names of the files. The system will react differently based on the scenarios outlined below:

  1. If one of the expected files is not provided, the system will throw an error.
  2. If the name of the file is different than originally provided, then the system will throw an error.
  3. If an additional file is provided with a name that was not previously established, the file will be ignored.

Line Count

The last check is on the record counts for each of the files. The system confirms that the row counts provided in the timestamp file match the counts in the actual files. If the counts match, the system proceeds as normal. If the counts do not match, the system will throw an error. This eliminates indexing files that were partially downloaded or corrupted during download. Do not include column headings in the count of rows.
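A sketch of this line-count check from the consumer's side (an illustrative reconstruction of the validation logic, not Hawksearch's actual code):

```python
def data_row_count(path):
    """Count non-blank data rows in a feed file, excluding the header."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip()) - 1

def verify_counts(timestamp_path):
    """Compare the counts declared in timestamp.txt with the actual files.
    Raises ValueError on mismatch, mirroring the 'throw an error' behavior."""
    with open(timestamp_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    for line in lines[2:]:  # skip the timestamp row and the dataset row
        filename, expected = line.split("\t")
        actual = data_row_count(filename)
        if actual != int(expected):
            raise ValueError(
                f"{filename}: expected {expected} rows, found {actual}")
```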

This is a screenshot example of timestamp.txt:

Partial Updates

Requirements

When sending feeds for a partial update, it is required to include a full version of:

  • Hierarchy File
  • Timestamp File
  • Any custom feed files

Items and / or Content Files

When sending feeds for a partial update, all required columns must be completed, unless the operation is to delete the record. The action for each item should be indicated in the item_operation column (or content_operation).

Adding – When an A is in the operation column, the record will be added to the database.

Updating – When a U is in the operation column, the existing record with the unique_id will be removed and replaced with the record being sent.

Deleting – When a D is in the operation column, the record with the unique_id will be removed from the database.
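The three operations can be sketched as a simple merge into an index keyed by unique_id (an illustrative model of the behavior, not Hawksearch's implementation):

```python
def apply_partial(index: dict, partial_rows):
    """Apply item_operation codes from a partial feed to an index
    keyed by unique_id."""
    for row in partial_rows:
        op = row.pop("item_operation").upper()
        uid = row["unique_id"]
        if op == "D":
            index.pop(uid, None)   # delete the record
        else:                      # "A" (add) or "U" (update):
            index[uid] = row       # the whole record is replaced
    return index
```

Note that "A" and "U" behave the same way here: an update removes the existing record and replaces it with the record being sent, which is why all required columns (and all attribute rows) must accompany an update.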

Attributes File

When adding or updating a record, ALL entries in the attributes file belonging to the item must be included, even attributes that have not changed. (An update removes the existing record and replaces it with the record sent in the partial feed.)

When deleting a record, it is not necessary to include it in the attributes file.

Hierarchy File

The entire hierarchy needs to be included with all partial or full data feeds.

Timestamp File

When building files for a partial update, the value of the dataset row in the timestamp.txt file should be "partial". The other rows should have accurate record counts for each feed included in the partial update.

Rebuilding Index

Hawksearch's search index can be rebuilt manually or through the API. To rebuild from the dashboard, log in to your Hawksearch dashboard engine and click the Rebuild Index button under the Workbench header.

The REST API can be used to automate or trigger an on-demand index rebuild process in Hawksearch. You will need an API key to use for authenticating with the API. Please contact your Hawksearch Representative to request an API Key.

The API is the preferred method for triggering the indexing process; however, if you need this to be a scheduled process on the Hawksearch end, please discuss with your Hawksearch Representative.

URL for the REST API to rebuild index: Using Dashboard API to Rebuild Indexes

Feed Delivery Method

SFTP

Files can be hosted at an SFTP location. You can host the files on your own SFTP server and provide credentials to the Hawksearch team to access the account. If you do not have your own SFTP server and need an account, ask your Hawksearch Representative to provide one.

Web Accessible URL

Files can also be hosted on a directory on your server. Please provide the Hawksearch Team the path to the files so these can be appropriately downloaded from the server. If you have the path protected, please provide the Hawksearch Team credentials for the same.

Zip File

Feeds should be delivered in a zip format. This reduces file size and download time. The file should be a standard Windows zip file with no folders within – the files should all be at the "root" when the zip file is opened.
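A sketch of packaging the feeds with Python's zipfile module, keeping every file at the root of the archive (the archive name is an assumption for illustration):

```python
import zipfile

def zip_feeds(filenames, archive="feeds.zip"):
    """Zip the feed files with no internal folders: arcname strips any
    directory component so every file sits at the archive root."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in filenames:
            zf.write(path, arcname=path.rsplit("/", 1)[-1])
```

Passing `arcname` explicitly is the important part; without it, a path like `output/items.txt` would create an `output/` folder inside the zip.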

Custom Sort Feed (Optional)

This section is to support the Custom sort option for facet values that can be selected in the Workbench.

Facet Value Sorting

When none of the built-in sorting options meet a client's requirements, a custom sort order can be defined. To implement a custom sort, a Custom Sort Feed must be used.

Some example facets that might use this:

  • Size
    • S, M, L, XL
    • 0.5, 1⁄2, 2/3, 0.75
    • 1”, 6”, 1’
    • 28 Regular, 28 Long, 30 Regular, 30 Long
    • 14.5x32/33, 14.5x34/35, 15x34/35
  • Days of the week
    • Sun, Mon, Tue, Wed, Thu, Fri, Sat
  • Seasons/Holidays
    • Spring, Summer, Fall, Winter
    • New Year, Mardi Gras, Easter, Mother’s Day, Graduation, Father’s Day
  • Anniversaries/Birthdays
    • First, Second, Third, Fourth
    • Tenth, Twentieth, Thirtieth
  • Color/Color Families
    • Red, Orange, Yellow, Green, Blue, Purple, Brown, Black
    • Light, Medium, Dark
    • Silver, Bronze, Black, Gold, Crystal

Hawk Search Custom Sort Feed Format

This format is designed to make it easy to add sort values. If certain columns do not apply to your data, or you need to add additional columns to existing files, those changes can be incorporated. However, please consult with your Hawksearch representative before making these changes.

File Format Requirements

To support a successful data import, we require the following format requirements:

Name	Description
Encoding	UTF-8
Column Delimiter	Tab
Column Headers	Required; Case Sensitive
Row Delimiter	Unix format (\n as the row delimiter)
File Name	Lowercase (e.g. custom_facet_sort.txt)
Data Quality	The data in the file should follow strict CSV standards. For the standard CSV format, please reference the link below:

http://www.ietf.org/rfc/rfc4180.txt

Custom Facet Value Sort Feed

The Custom Sort Feed file consists of records that have a facet_name, a facet_value, and a sort order value. The facet_name is the field name, as defined in the Hawksearch field section of the Workbench, used to build a facet whose values need to be sorted in a way that cannot be accomplished using the built-in Hawk functionality. There will be multiple records related to a facet_name.

For example, if a facet can have 15 possible values in the data, there will be 15 rows in the Custom Sort Feed that relate to the one field name that supplies the data to the facet. All 15 rows will have the same facet_name value, but different values in the facet_value and sort columns.

Custom Facet Sort Feed Columns

Name	Data Description	Required
facet_name	Name of the field that populates the facet (as defined in the Field listing page in the Hawksearch Workbench), all in lowercase with no spaces or symbols.	Yes
facet_value	Facet Value	Yes
sort	Sort order of the value (e.g. 5, 10, 15, 20, etc.)	Yes

Please note that:

  • Make sure all possible field values are included for any field that is in the file. If a field value is not included, it will appear at the top of the list of displayed values in the facet.
  • Make sure all sort values are unique within each field.
  • It is recommended that you space your sort values by 5 or more to allow for future additions without rework.
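A sketch of generating custom_facet_sort.txt with sort values spaced by 5, per the recommendation above (the facet name and values are illustrative):

```python
import csv

def write_custom_sort(facet_name, ordered_values,
                      path="custom_facet_sort.txt"):
    """Write the custom sort feed, spacing sort values by 5 so new
    values can be slotted in later without renumbering."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["facet_name", "facet_value", "sort"])
        for i, value in enumerate(ordered_values):
            writer.writerow([facet_name, value, (i + 1) * 5])

write_custom_sort("size", ["S", "M", "L", "XL"])
```

If a new value ("XS", say) needs to come first later, it can take sort value 1 without touching the existing rows.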

This is a screenshot example of custom_facet_sort.txt:

Add Record to Timestamp/Control File

FILE NAME: timestamp.txt

To ensure your data feeds are ready for processing, we recommend adding a timestamp/control file to the set of files that is generated by your site process. If this is part of your feeds, a Line Count should be added for the Custom Sort Feed. The system will be confirming that the row counts provided in the timestamp file match the counts on the actual files. If the counts match, the system will proceed as normal. If the counts do not match, the system will throw an error. This is to eliminate indexing files that were partially downloaded or were corrupted during downloads.

Other Questions?

If you have any 3rd party feeds that you would like integrated into Hawksearch, please contact your Hawksearch Representative.

For questions about the data feeds that Hawksearch can accept, please contact your Hawksearch Representative.