Databases and Free-Text Searches: Some Advantages and Disadvantages

“Life is much richer than the structure of a database,” Jacek Leociak.

In my capacity as deputy director of Yad Vashem’s reference and information service, I observe that many genealogists do not realize that limiting their research to databases, to the exclusion of electronically searching the text on which those databases are based, may yield less rich results than using both research mechanisms. Moreover, today’s sophisticated search engines enable efficient information searches in documents that either do not lend themselves to abstracting into a database or have not yet been abstracted.

Work on Yad Vashem’s data processing efforts has sensitized me to various advantages and disadvantages of databases. Using modern technology, anyone can type or scan a 1,000-page document into a word processor and use the search function to find a particular word or phrase practically instantaneously. In this era of instant information, why bother with databases? Why not just type in any data as it appears and use the search function that exists in almost every computer program to find a specific item in a document? This approach is called free-text searching. Databases and free-text searches each have distinct advantages and disadvantages. Several are described below.

Defined Fields

A free-text search for a family with the surname Ezekiel in a document that describes hundreds of people may yield many irrelevant results, such as people with the first name Ezekiel. In fact, in addition to appearing as a surname, Ezekiel may appear as a given name, a father’s name, a maiden name, or even part of an address. On the other hand, the ability to search for Ezekiel in a database’s surname field allows a researcher to find only the desired appearances of Ezekiel as a surname. The larger the database, the more important searching by field becomes.
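The contrast can be illustrated with a minimal Python sketch. The records and field names below are invented for illustration and do not reflect any actual database. The free-text search matches Ezekiel wherever it occurs, while the field search looks only at the surname.

```python
# Hypothetical records; the field names are assumptions made for this sketch.
records = [
    {"surname": "Ezekiel", "given_name": "Miriam",
     "father_name": "David", "address": "12 Market Street"},
    {"surname": "Cohen", "given_name": "Ezekiel",
     "father_name": "Samuel", "address": "3 Mill Lane"},
    {"surname": "Levy", "given_name": "Sarah",
     "father_name": "Ezekiel", "address": "8 Ezekiel Road"},
]


def free_text_search(records, term):
    """Return every record in which the term appears in any field."""
    term = term.lower()
    return [r for r in records
            if any(term in value.lower() for value in r.values())]


def field_search(records, field, term):
    """Return only the records whose named field equals the term."""
    term = term.lower()
    return [r for r in records if r[field].lower() == term]


free_hits = free_text_search(records, "Ezekiel")
surname_hits = field_search(records, "surname", "Ezekiel")
print(len(free_hits), "free-text hits;", len(surname_hits), "surname hit(s)")
```

In this toy data the free-text search returns all three records, although only one person actually bears the surname Ezekiel.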

Database designs inevitably are imperfect. Consider the following brief oral history: “His name was Jake, and he worked with his father, Abe, in his business and eventually took over the business. Everybody called him Little Abe, and, after his dad’s death, just Abe.” How might a software engineer incorporate the information in this paragraph into a database? Should he or she define “Abe” as one of Jake’s names? Doing so may cause considerable confusion, especially among Ashkenazic Jews, who typically do not name children after living parents. A user might assume that because his name is Abe, he could not be Abe’s son. On the other hand, everybody called him Abe. What to do?

Databases use a variety of solutions for such a problem. In most cases, the database would reflect the fact that this person’s name was both Jake and Abe (or Abe might be placed in a nickname field). Many users, however, seeing that Abe is both his name and his father’s name, may assume that the record was mistyped. Of course, even if a user understands that Abe was a nickname, the database does not indicate the reason for it—but a view of the original record clarifies everything instantly.

Controlled Language

A researcher may write that someone is “single,” a “bachelor,” or “unmarried”—and mean the same thing. Someone looking for a bachelor in a free-text document must search for all three terms in order to avoid missing important data. Most databases use one standard term consistently for a given concept. Thus, if the word “bachelor” always is used, a researcher may search using only that term and does not need to repeat the search for other synonymous terms. Moreover, in a more sophisticated database, the designers will have built the reference into the system to enable all three terms to yield the same result.
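A controlled vocabulary with built-in synonyms might look like the following Python sketch. The synonym table, record layout, and function names are assumptions made for illustration. Each term is mapped to one standard form both when a record is entered and when a query is run.

```python
# Assumed synonym table: every variant maps to one controlled term.
CONTROLLED_TERMS = {
    "single": "bachelor",
    "unmarried": "bachelor",
    "bachelor": "bachelor",
}


def normalize(term):
    """Map a term to its controlled form; unknown terms pass through."""
    key = term.strip().lower()
    return CONTROLLED_TERMS.get(key, key)


# Hypothetical records, normalized at entry time.
records = [
    {"name": "A. Stein", "marital_status": normalize("single")},
    {"name": "B. Katz", "marital_status": normalize("Unmarried")},
    {"name": "C. Roth", "marital_status": normalize("married")},
]


def search_status(records, term):
    """Normalize the query so that any synonym retrieves the same records."""
    wanted = normalize(term)
    return [r["name"] for r in records if r["marital_status"] == wanted]


# All three synonyms give the same answer.
for query in ("single", "bachelor", "unmarried"):
    print(query, "->", search_status(records, query))
```

Note that normalizing at entry time discards the writer's original wording, which is one reason to keep the source text available alongside the database.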

Controlled language frequently expresses a definition determined by the database compilers and may not reflect the intent or nuances of the writer of the original text. For example, in a database where unmarried and bachelor are equivalent and a document reads “unmarried, but lived with Eugenia for 50 years,” would it be accurate to define this individual as a bachelor?

Here is another example: A database based on text about Jewish-American life has a field for denomination. The document reads: “Every Shabbes (Sabbath) we drove to the Orthodox shul (synagogue) where we were members.” How should this person’s denomination be recorded? The compiler could consider the family Orthodox, based on synagogue affiliation, or non-Orthodox, because the Orthodox world view opposes driving on the Sabbath. Others might take a historical approach and examine the norm of behavior at the particular time and place. In some places at some times, Orthodox synagogues were filled with congregants who had driven there on Saturday morning, while at other times and places, driving on the Sabbath, even to synagogue, would mark an individual as non-Orthodox. Thus, at certain historical moments, the people described would be considered part of the Orthodox community, and at other times not. Whatever solution the database compilers chose would reflect one world view or another and typically would not convey the full flavor of the original document.

If the database lacks a synonym system, another problem arises with controlled language: the user may need to guess the term the compilers used. If the compilers chose “bachelor” as the standard term, a search for “single” will yield no results.

Relevance

Database fields are designed to allow users to search for the most relevant data related to a specific endeavor. The engineers of a database rely on their historical and general knowledge to tailor the fields and data according to the purpose of the database and the interests and needs of its target users. In a database built to assist in canvassing for a political candidate, fields for party affiliation and the candidates people voted for in previous elections are essential. In a genealogical database, these would be superfluous and possibly confusing.

Relevance is governed by value judgments. A database, to be usable, cannot have endless fields, so only the fields that are most relevant in the compilers’ view are used. Sometimes, however, what the compilers see as irrelevant may be highly relevant to a researcher. An example: A person met her grandfather once on a trip to California when she was a child, and her strongest memory of him is of visiting the humane society (where he was very active) with him. Afterwards the family lost contact with the grandfather. Because his name was Samuel Cohen, it would be difficult to determine which of many records of people with that name refer to him. Assume now that someone is compiling a database of Jews in California. The document on which the compiler bases the grandfather’s entry includes the fact that he was a member of Tifereth Israel Synagogue, a member of B’nai B’rith, and a member of the humane society. The compiler might choose to note the information about his synagogue membership (in fact, there might be a special field for the synagogue he belonged to), and the information about his membership in B’nai B’rith might also make it into the database. The compiler might decide that the grandfather’s humane society activity is irrelevant and not include that information in the entry for him. But the omitted detail is the very information that would allow identification of that individual as the Samuel Cohen who was the person’s grandfather. Although the indexing in the genealogical database will help identify all the Samuel Cohens in California, only a look at the original documents will reveal mention of membership in the humane society.

Yad Vashem’s Approach

In developing its Central Database of Names, Yad Vashem has spent considerable time and money seeking technical solutions to the above-mentioned problems.

Defined Fields. On the one hand, for the reasons described above, Yad Vashem prefers to have many well-defined fields—such as separate fields, for example, for place of birth, place of residence, place during the war, and place of death. On the other hand, because the difference between these fields often is not clear or not known to the user, Yad Vashem also allows a united search in multiple fields. In fact, searching in all the place fields at once is the default in the basic search in the online database.
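A united search of this kind can be sketched in a few lines of Python. The field names and records below are hypothetical and are not Yad Vashem's actual schema. The search checks every place field by default and can be narrowed to one field when the user knows which one is meant.

```python
# Assumed place fields, modeled loosely on the four named in the text.
PLACE_FIELDS = (
    "place_of_birth",
    "place_of_residence",
    "place_during_war",
    "place_of_death",
)

# Hypothetical records; an empty string means the detail is unknown.
records = [
    {"id": "R1", "place_of_birth": "Lodz", "place_of_residence": "Warsaw",
     "place_during_war": "Warsaw", "place_of_death": ""},
    {"id": "R2", "place_of_birth": "Warsaw", "place_of_residence": "Lublin",
     "place_during_war": "Lublin", "place_of_death": ""},
    {"id": "R3", "place_of_birth": "Krakow", "place_of_residence": "Krakow",
     "place_during_war": "Krakow", "place_of_death": ""},
]


def search_places(records, place, fields=PLACE_FIELDS):
    """Return ids of records where any of the given fields matches the place."""
    place = place.lower()
    return [r["id"] for r in records
            if any(r.get(f, "").lower() == place for f in fields)]


# Default: search all place fields at once.
print(search_places(records, "Warsaw"))
# Narrowed: search only the place of birth.
print(search_places(records, "Warsaw", fields=("place_of_birth",)))
```

Making the broad search the default spares users from having to know in advance which field holds the place they are looking for.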

Controlled Language. Even though Yad Vashem uses controlled language in other fields in the database, when dealing with names of places and people, rather than choosing one standard way to write the name, it tries to use whatever form of the name is written on the original document, unless the form is an obvious error. A synonym system is employed so that, for example, if the original record shows “Itzik,” Yad Vashem writes “Itzik”; if it says “Isaac,” that is the recorded spelling. Since Itzik and Isaac are two forms of the same name, a search for either variant will retrieve all instances of both, as well as the remaining 1,100 variants (!) thereof. Similarly, a search for Pressburg will also retrieve Bratislava, an alternate name for the same locality. The synonym system is the default in the online search.
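The idea of keeping the original spelling while searching across variants can be sketched as follows in Python. The synonym groups and records are invented for illustration and say nothing about how Yad Vashem's system is actually implemented. The stored text is never altered; only the query is expanded.

```python
# Assumed synonym groups; a real system would hold many more variants.
SYNONYM_GROUPS = [
    {"itzik", "isaac"},
    {"pressburg", "bratislava"},
]


def variants(term):
    """Return all known variants of a term, including the term itself."""
    key = term.lower()
    for group in SYNONYM_GROUPS:
        if key in group:
            return group
    return {key}


# Hypothetical records, each keeping the spelling found on its source document.
records = [
    {"given_name": "Itzik", "place": "Pressburg"},
    {"given_name": "Isaac", "place": "Bratislava"},
    {"given_name": "Jakob", "place": "Pressburg"},
]


def search(records, field, term):
    """Match a field against every variant of the term; stored text is untouched."""
    wanted = variants(term)
    return [r for r in records if r[field].lower() in wanted]


# A search for "Isaac" also finds the record written as "Itzik".
print([r["given_name"] for r in search(records, "given_name", "Isaac")])
```

Because the expansion happens at search time, each record still shows the name exactly as it was written on the source.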

Relevance. Because almost any and/or all the information cited in a given source (for example, the fact that Albert-Abraham Blotner was a piano tuner) may be relevant to a particular researcher, Yad Vashem offers an enormous set of fields to choose from to enter almost any data that appears on an original record. This approach also stems from Yad Vashem’s conviction that people who perished in the Holocaust should be commemorated as individuals who lived multifaceted lives and not simply as victims who were killed in a given place.

Most searchers are interested only in basic data, however, and a search or display based on dozens of fields may be unwieldy. Thus, the first view of a record shows only the most important fields. Subsequent screens reveal further details, and users then have the option to see the entire record with a click of a mouse.

The most important feature in the Yad Vashem system, though, is that, whenever feasible, the database entry is linked to a scanned copy of the original document. Our personnel know that a database can never fully reflect the content of a document. All serious researchers understand that, while a database is a wonderful tool to help locate the source being sought, to understand the results fully it is necessary to peruse the original document.

Yad Vashem’s goal is to attach the original document to every record—and more than half the records in its database do, in fact, display the original document. For historical, technical, and other reasons, however, some original documents cannot be displayed. In almost all cases, the records based on Pages of Testimony display the original document, the Page of Testimony itself. Many of the archival sources from the Yad Vashem archives that have been entered into the database, including records of the Special Soviet Commission, also display the original source document. Many additional records in the system from Yad Vashem archival resources currently do not display the original source document, but will do so in the future.

Records from yizkor (memorial) books in the database do not display the original source, but do include a full reference to the source. Although Yad Vashem aims eventually to add the original source, this is not a high priority at this time, since scans of most of the yizkor books are freely available on the web.

Some records have been donated by other institutions that have compiled these records from their own sources—frequently, from multiple sources. For example, records from the German Gedenkbuch (memorial book) have sources which include a reference to the organization that compiled the records.

Zvi Bernhardt is head of data processing in the Hall of Names and deputy director of Yad Vashem Reference and Information Services.
